Statistical Significance in F2P Games
A product manager's guide to not fooling yourself. How to run valid A/B tests, avoid common traps, and make data-driven decisions that actually hold up.
The Core Problem
"ROAS dropped from 50% to 44%. Is this a crisis or just noise?" Without proper statistical rigor, you're guessing. This guide covers the traps that catch even experienced PMs and how to avoid them.
The Hierarchy: Product Enables Marketing
Before worrying about sample sizes and p-values, understand the hierarchy. You cannot A/B test your way to good ROAS if retention is broken.
How Do You Know If Each Level "Works"?
The hierarchy shows what to test in order. But how do you know if retention or monetization is actually working? These are the top 25% benchmarks by genre — if you're below these, that's where to focus your experiments:
| Genre | D1 | D7 | D30 | D90 | Payer Conv | CPI (US) |
|---|---|---|---|---|---|---|
| Hypercasual | 45% | 15% | 6% | 2% | 3% | $1.20 |
| Hybrid-Casual | 50% | 20% | 8% | 3% | 5% | $1.80 |
| Casual Puzzle | 55% | 25% | 12% | 6% | 8% | $2.50 |
| Midcore RPG | 40% | 18% | 10% | 5% | 15% | $4.00 |
| Strategy/4X | 45% | 22% | 12% | 7% | 12% | $3.50 |
| Casino/Social | 60% | 30% | 15% | 8% | 20% | $5.00 |
Sources: Sensor Tower 2024, data.ai 2025, Liftoff Q3 2024
CPI Warning: CPIs vary 5-10x by Geo/OS. US iOS Midcore can hit $15+; Android ROW may be $2. Use these only for relative comparison between genres.
The "Scale Anyway" Trap
"We'll improve retention while scaling" — this is burning money. Every cohort acquired on bad retention has lower LTV, pollutes your data, and makes it harder to see if product changes work.
Rule: Don't spend more than $500/day until D7 retention meets category benchmark.
The Funnel Law: Deeper = Harder to Test
Sample size requirements increase as you go deeper into the funnel. Detecting a 10% lift on a 50% baseline (5pp absolute) is much easier than detecting a 10% lift on a 5% baseline (0.5pp absolute).
Reading this table: To detect a 10% relative improvement in each metric (with 80% statistical power, p<0.05), you need the indicated sample sizes. "Per Variant" means each of your A/B groups needs this many users.
| Funnel Stage | Baseline | Per Variant | Total Installs | Min Duration |
|---|---|---|---|---|
| Tutorial Completion | 65% | 806 | 1,612 | 7 days |
| D1 Retention | 50% | 1,566 | 3,132 | 7 days |
| D7 Retention | 20% | 6,511 | 13,022 | 14 days |
| D30 Retention | 8% | 18,873 | 37,746 | 37 days |
| Payer Conversion | 2.5% | 64,200 | 128,400 | 14 days |
The math is unforgiving: testing payer conversion requires roughly 80x more users than testing tutorial completion.
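These figures follow from the standard two-proportion power calculation (two-sided z-test, 80% power, alpha = 0.05). A minimal Python sketch of that calculation, if you want to reproduce or adapt the tables; the function name and defaults here are illustrative, and the results land within a couple of users of the table values:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Users needed per variant to detect a relative lift in a conversion-style
    metric with a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# D7 retention at a 20% baseline, 10% relative lift
print(sample_size_per_variant(0.20, 0.10))    # close to the 6,511 per variant shown above
# Payer conversion at 2.5%, 10% relative lift
print(sample_size_per_variant(0.025, 0.10))   # close to the 64,200 shown for payer conversion
```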
Genre-Specific Sample Sizes
Higher-retention genres (Casino, Puzzle) need fewer installs to detect the same relative lift. Lower-retention genres (Hypercasual) need more. Use your genre's baseline when planning tests:
D7 Retention Test (10% Relative Lift)
| Genre | D7 Baseline | Per Variant | Total Installs |
|---|---|---|---|
| Hypercasual | 15% | 9,258 | 18,516 |
| Hybrid-Casual | 20% | 6,511 | 13,022 |
| Midcore RPG | 18% | 7,427 | 14,854 |
| Strategy/4X | 22% | 5,762 | 11,524 |
| Casual Puzzle | 25% | 4,863 | 9,726 |
| Casino/Social | 30% | 3,764 | 7,528 |
Payer Conversion Test (20% Relative Lift)
| Genre | Payer Conv | Per Variant | Total Installs |
|---|---|---|---|
| Hypercasual | 3% | 13,915 | 27,830 |
| Hybrid-Casual | 5% | 8,159 | 16,318 |
| Casual Puzzle | 8% | 4,922 | 9,844 |
| Strategy/4X | 12% | 3,123 | 6,246 |
| Midcore RPG | 15% | 2,404 | 4,808 |
| Casino/Social | 20% | 1,684 | 3,368 |
Key insight: Higher payer conversion = easier to test. Casino and Midcore can test monetization with ~3-5K total installs. Hypercasual needs ~28K.
The Full Week Rule (Critical)
User behavior varies by day of week. Weekend users behave differently than weekday users. IAP conversion spikes on paydays.
Rule: Tests must run for complete 7-day cycles (7, 14, 21, 28 days) regardless of sample size.
The Four Silent Killers of A/B Tests
Silent Killer #1: Sample Ratio Mismatch (SRM)
What it is: You target a 50/50 split but end up with 52/48 or worse.
The "missing" users aren't random. If 200 users disappear from variant B, they're probably the impatient ones who churned. You're now comparing "all users" vs "patient users."
Silent Killer #2: Peeking
What it is: Checking your dashboard daily and stopping when you see "significant."
Check once per day for 14 days? Your false positive rate explodes from 5% to 30%+. Day 3 shows p=0.04. You stop. Celebrate. Ship. But it was random noise.
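You can see the inflation yourself by simulating A/A tests (two identical variants) and peeking daily. A rough sketch; the daily install volume, number of peeks, and simulation count are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def peeking_false_positive_rate(p=0.20, daily_installs=500, days=14, sims=2000):
    """Share of A/A tests (identical variants) that look 'significant'
    at p < 0.05 on at least one daily peek."""
    z_crit = norm.ppf(0.975)
    false_positives = 0
    for _ in range(sims):
        a = rng.binomial(1, p, size=daily_installs * days)
        b = rng.binomial(1, p, size=daily_installs * days)
        for day in range(1, days + 1):
            n = daily_installs * day
            pa, pb = a[:n].mean(), b[:n].mean()
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            if se > 0 and abs(pa - pb) / se > z_crit:
                false_positives += 1
                break
    return false_positives / sims

# Typically well above the nominal 5% (often in the 15-30% range with 14 peeks)
print(peeking_false_positive_rate())
```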
Silent Killer #3: Novelty Effect
What it is: Users engage more with any change simply because it's new.
A UI change might spike metrics for 3-5 days, then drop below baseline. If you measure during the novelty window, you ship a worse feature.
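One pragmatic check, echoed in the checklist at the end, is to compare the lift in the first week of the test against the lift in the last week: a real improvement holds up, a novelty bump fades. A minimal sketch assuming per-user rows with a test-day, variant, and binary outcome column (the column names are hypothetical):

```python
import pandas as pd

def lift_by_window(df, first_days=7, last_days=7):
    """Compare the treatment lift in the first vs last N days of a test.

    Expects columns: 'test_day' (1-based day of the test the user entered),
    'variant' ('A' or 'B'), and 'converted' (0/1). Column names are examples."""
    max_day = df["test_day"].max()
    windows = {
        "first_week": df[df["test_day"] <= first_days],
        "last_week": df[df["test_day"] > max_day - last_days],
    }
    out = {}
    for name, window in windows.items():
        rates = window.groupby("variant")["converted"].mean()
        out[name] = rates["B"] / rates["A"] - 1   # relative lift of B over A
    return out

# Toy data: B spikes early, then settles back to baseline by the last week
toy = pd.DataFrame({
    "test_day":  [1, 1, 1, 1, 14, 14, 14, 14],
    "variant":   ["A", "B", "A", "B", "A", "B", "A", "B"],
    "converted": [0, 1, 1, 1, 1, 0, 0, 1],
})
print(lift_by_window(toy))
```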
Silent Killer #4: Simpson's Paradox
What it is: A change that looks flat in aggregate actually helps one segment and hurts another.
iOS: +8% retention (60% of users). Android: -15% retention (40% of users). Aggregate: +0.2% (not significant). You kill the feature, but should have shipped iOS-only.
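The defence is mechanical: compute the lift per segment before looking at the aggregate. The sketch below uses hypothetical baselines (iOS at 52% D1, Android at 40%) chosen so the blended number roughly reproduces the +0.2% above:

```python
import pandas as pd

# Hypothetical per-segment results; the 52% / 40% baselines are assumptions
# picked so the aggregate roughly matches the +0.2% example above.
segments = pd.DataFrame({
    "platform":   ["iOS", "Android"],
    "user_share": [0.60, 0.40],
    "control_d1": [0.52, 0.40],
    "treat_d1":   [0.52 * 1.08, 0.40 * 0.85],   # +8% iOS, -15% Android
})

# Per-segment relative lift: the only view that shows what is really happening
segments["lift"] = segments["treat_d1"] / segments["control_d1"] - 1

# Aggregate lift: weighted by user share, the two effects almost cancel out
agg_control = (segments["user_share"] * segments["control_d1"]).sum()
agg_treat = (segments["user_share"] * segments["treat_d1"]).sum()
print(segments[["platform", "lift"]])
print(f"aggregate lift: {agg_treat / agg_control - 1:+.1%}")
```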
Monetization Tests Are Different (The Whale Problem)
Retention is a binary per-user outcome, so its sample averages are well behaved (approximately normal at these sample sizes). Monetization data is Pareto-distributed: 95-98% of users spend $0, 2-5% are payers, and 0.1-0.5% are whales who account for 50%+ of revenue. One whale in Variant A spending $500 can flip your entire result.
| Metric | Distribution | Sample Size | Recommendation |
|---|---|---|---|
| Payer Conversion | Binary | 15K-65K | Use this |
| ARPPU | Pareto | 50K+ payers | Very hard |
| ARPDAU | Extreme Pareto | 100K+ users | Don't A/B test |
| LTV | Extreme Pareto | N/A | Don't A/B test |
Recommendation: Test payer conversion rate (binary: paid or not paid). For ARPDAU/LTV impact, use a long-running Holdout Group (10% of users held back for 60-90 days) rather than short-term A/B tests.
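A small simulation makes the point: with heavy-tailed spend, per-user revenue can swing noticeably between two identical variants, while payer conversion stays stable. The spend distribution below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_variant(n_users, payer_rate=0.03, whale_rate=0.002):
    """Simulate per-user spend: most users pay nothing, a few pay a little,
    a tiny fraction are whales. All parameters are illustrative."""
    spend = np.zeros(n_users)
    payers = rng.random(n_users) < payer_rate
    spend[payers] = rng.lognormal(mean=1.5, sigma=1.0, size=payers.sum())  # small purchases, a few dollars
    whales = rng.random(n_users) < whale_rate
    spend[whales] = rng.uniform(200, 800, size=whales.sum())               # a handful of big spenders
    return spend

a = simulate_variant(10_000)
b = simulate_variant(10_000)   # identical generating process: any gap is pure noise

# Per-user revenue can differ by double-digit percentages purely by chance...
print(f"ARPU A: ${a.mean():.3f}   ARPU B: ${b.mean():.3f}")
# ...while the binary payer rate stays far more stable
print(f"payer rate A: {(a > 0).mean():.3%}   payer rate B: {(b > 0).mean():.3%}")
```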
Ad Revenue vs IAP Revenue
Good news: ad monetization tests don't suffer the same whale problem as IAP tests.
| Revenue Type | Variance | Test Difficulty |
|---|---|---|
| Ad eCPM | Low | Easy (like retention) |
| Interstitial frequency | Low | Easy |
| Rewarded ad conversion | Medium | Moderate |
| IAP bundle test | Very High | Hard |
| IAP price point test | Extreme | Very Hard |
Rule: You can A/B test ad frequency easily. Testing a new $99.99 bundle? Accept high uncertainty or use qualitative methods.
Who Enters Your Test Matters
If you run a test on "All Active Users," you mix old users (established habits, harder to change) with new users (fresh, malleable). Old users dilute your signal. A tutorial change has zero effect on D30+ users—they already completed the tutorial.
| Test Type | Population | Why |
|---|---|---|
| Onboarding/Tutorial | New installs only | Old users already passed tutorial |
| Core loop changes | New installs only | Habit formation happens early |
| Monetization (new offer) | Active users OK | But segment by tenure |
| UI/UX polish | Active users OK | But check for novelty effect |
| LiveOps events | Active users OK | Time-limited by design |
Rule for Retention Tests
Only include users who installed AFTER test start. Never mix pre-existing users into retention or onboarding experiments.
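In practice this is just a filter on install date applied at assignment time. A minimal pandas sketch (the column name and dates are hypothetical):

```python
import pandas as pd

def eligible_for_retention_test(users: pd.DataFrame, test_start: str) -> pd.DataFrame:
    """Keep only users who installed on or after the test start date.

    Expects an 'install_date' column; the column name is an example."""
    cutoff = pd.Timestamp(test_start)
    installs = pd.to_datetime(users["install_date"])
    return users[installs >= cutoff]

# Only users installing on or after June 1 enter the onboarding test
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "install_date": ["2025-04-12", "2025-06-01", "2025-06-03"],
})
print(eligible_for_retention_test(users, "2025-06-01"))
```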
Seasonality Will Fool You
"We launched the feature on Monday. Revenue is up 15% vs last week!" — Last week was post-holiday. This week is normal. You measured seasonality, not your feature.
Valid Methods
- A/B test with control group
- Same-period comparison (Tuesday vs Tuesday)
- Day-of-week fixed effects (see the sketch below)
- Week 23 vs Week 23 last year
Invalid Methods
- "This week vs last week"
- Pre/post without control
- Ignoring holiday proximity
- Weekend vs weekday comparisons
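Of the valid methods, day-of-week fixed effects are the least familiar to most PMs: regress the daily metric on a treatment indicator plus one dummy per weekday, so the weekly pattern is absorbed before the treatment effect is read off. A minimal statsmodels sketch on hypothetical daily data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical daily panel: one row per day per group, with a weekend retention bump
df = pd.DataFrame({
    "date": pd.date_range("2025-06-02", periods=28).tolist() * 2,
    "group": ["control"] * 28 + ["treatment"] * 28,
    "d1_retention": ([0.48, 0.47, 0.46, 0.47, 0.49, 0.53, 0.54] * 4
                     + [0.50, 0.49, 0.48, 0.49, 0.51, 0.55, 0.56] * 4),
})
df["treated"] = (df["group"] == "treatment").astype(int)
df["dow"] = df["date"].dt.day_name()

# C(dow) adds one dummy per weekday, absorbing the weekly seasonality
model = smf.ols("d1_retention ~ treated + C(dow)", data=df).fit()
print(model.params["treated"])   # estimated lift net of day-of-week effects (0.02 in this toy data)
```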
External Validity: Soft Launch ≠ Global
You soft-launch in the Philippines. D1 retention is 45%. You launch in the US. D1 retention: 32%. What happened?
| Factor | Philippines/SEA | US/Western Europe |
|---|---|---|
| Device quality | Lower spec, more tolerant | Higher spec, less tolerant |
| Network speed | Variable, patient with loading | Fast, rage quit on lag |
| Competition | Less saturated | Hyper-saturated |
| Ad tolerance | Higher | Lower |
Before Scaling UA
Rule: Validate with at least 2,000 installs in your target geo (US/UK) before scaling. Test markets like Poland, Brazil, or Canada are a closer US proxy than SEA.
Budgeting for Significance: The Spend-to-Learn Ratio
Your team ships a tutorial update every 2 weeks. Your UA is spending $3,000/day. Are you wasting money or wasting learnings?
The Formula
Optimal Daily Spend = (Required Installs / Test Duration Days) × CPI
Constraint: Test Duration ≥ 7 days (Full Week Rule)
Example: Hybrid-Casual D7 Retention Test
| Parameter | Value |
|---|---|
| Total installs needed | 13,000 (6,500 per variant) |
| Minimum test duration | 14 days |
| CPI | $1.80 |
| Optimal daily spend | (13,000 / 14) × $1.80 = $1,671/day |
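The same arithmetic as a small helper, if you want to wire it into a planning sheet or script; a sketch of the formula above, with the Full Week Rule enforced:

```python
def optimal_daily_spend(total_installs, test_duration_days, cpi):
    """Daily UA budget so the test hits its sample size exactly at the end
    of its full-week duration."""
    if test_duration_days < 7 or test_duration_days % 7 != 0:
        raise ValueError("Test duration must be a whole number of weeks (Full Week Rule)")
    installs_per_day = total_installs / test_duration_days
    return installs_per_day * cpi

# Hybrid-casual D7 retention test: 13,000 installs over 14 days at $1.80 CPI
print(f"${optimal_daily_spend(13_000, 14, 1.80):,.0f}/day")   # $1,671/day
```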
Aligning Spend with Ship Velocity
| Scenario | Problem |
|---|---|
| Ship every 3 days, spend $500/day | Can't measure anything—insufficient sample |
| Ship every 3 weeks, spend $5K/day | Buying users on unvalidated product |
| Ship randomly, spend $3K/day | No coordination, no learnings |
| Ship every 14 days, spend matches test needs | Every dollar teaches something |
Checklists
Before Launch
- What metric am I measuring?
- What's the baseline from OUR data (not industry average)?
- What's the minimum improvement worth detecting?
- How many users do I need per variant?
- Is test duration at least 7 days (full week rule)?
- Am I testing on NEW USERS only (for retention)?
- Have I committed to NOT peeking until test ends?
After the Test Ends (Validity)
- SRM Check: Is actual split within 1% of target?
- Did I hit target sample size?
- Did test run for complete 7-day cycle(s)?
- Were there technical issues (crashes, hotfixes)?
- Did external factors change (featuring, holiday)?
Before Acting on the Result
- Is p-value < 0.05?
- Is effect size practically meaningful (not just significant)?
- Novelty check: Is effect consistent first vs last 7 days?
- Segment check: Does result hold for iOS AND Android?
- Segment check: Does result hold in top 3 geos?
The Eight Commandments
- Product enables marketing. Fix retention before scaling UA. Know your category benchmarks.
- Deeper funnel = harder to test. Tutorial needs 1,600 users. Payer conversion needs 128,000.
- Sample size is users, not days. But tests need full 7-day cycles.
- Check for SRM before analyzing. If your 50/50 split is 52/48, your test is invalid.
- Never peek. Decide duration upfront. Daily checking inflates false positives to 30%+.
- Watch for novelty effects. Don't trust first 3 days. Compare early vs late periods.
- Monetization tests are different. Test conversion rate, not revenue. Whales break statistics.
- Always segment. Check iOS vs Android, top geos. Simpson's Paradox is real.
Tools & Resources
- Sample Size Calculator — evanmiller.org/ab-testing/sample-size
- SRM Checker — thumbtack.github.io/abba/demo/srm
- Bayesian Calculator — evanmiller.org/bayesian-ab-testing
- Sequential Testing — evanmiller.org/sequential-ab-testing (if you must peek)
Excel/Google Sheets Formula
Calculate sample size per variant directly in your spreadsheet:
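One way to write it, assuming 80% power and p<0.05 as used throughout this guide:
=(NORM.S.INV(0.975)+NORM.S.INV(0.8))^2 * 2*B2*(1-B2) / (B2*C2)^2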
Where B2 = baseline rate (e.g., 0.20 for 20% retention) and C2 = relative lift to detect (e.g., 0.10 for 10% improvement). Note: Uses pooled variance approximation—estimates are ~5% lower than exact calculations.
The time you invest in statistical rigor pays off massively. You'll catch real problems, avoid false positives, and make decisions that actually move the needle instead of chasing noise.
