

Statistical Significance in F2P Games

A product manager's guide to not fooling yourself. How to run valid A/B tests, avoid common traps, and make data-driven decisions that actually hold up.


The Core Problem

"ROAS dropped from 50% to 44%. Is this a crisis or just noise?" Without proper statistical rigor, you're guessing. This guide covers the traps that catch even experienced PMs and how to avoid them.

The Hierarchy: Product Enables Marketing

Before worrying about sample sizes and p-values, understand the hierarchy. You cannot A/B test your way to good ROAS if retention is broken.

The KPI Hierarchy (top to bottom; each level is enabled by the ones below it)
  • Marketing Can Scale: ROAS > 100%, positive unit economics
  • Monetization Works: ARPDAU, conversion, LTV
  • Core Loop Works: session length, sessions/day
  • Retention Works: D1, D7, D30
  • Systems Work: technically stable, performant, no crashes
  • Acquisition Funnel Works: store page converts, tutorial completes

How Do You Know If Each Level "Works"?

The hierarchy shows what to test in order. But how do you know if retention or monetization is actually working? These are the top 25% benchmarks by genre — if you're below these, that's where to focus your experiments:

Genre | D1 | D7 | D30 | D90 | Payer Conv | CPI (US)
Hypercasual | 45% | 15% | 6% | 2% | 3% | $1.20
Hybrid-Casual | 50% | 20% | 8% | 3% | 5% | $1.80
Casual Puzzle | 55% | 25% | 12% | 6% | 8% | $2.50
Midcore RPG | 40% | 18% | 10% | 5% | 15% | $4.00
Strategy/4X | 45% | 22% | 12% | 7% | 12% | $3.50
Casino/Social | 60% | 30% | 15% | 8% | 20% | $5.00

Sources: Sensor Tower 2024, data.ai 2025, Liftoff Q3 2024

CPI Warning: CPIs vary 5-10x by Geo/OS. US iOS Midcore can hit $15+; Android ROW may be $2. Use these only for relative comparison between genres.

The "Scale Anyway" Trap

"We'll improve retention while scaling" — this is burning money. Every cohort acquired on bad retention has lower LTV, pollutes your data, and makes it harder to see if product changes work.

Rule: Don't spend more than $500/day until D7 retention meets category benchmark.

The Funnel Law: Deeper = Harder to Test

Sample size requirements increase as you go deeper into the funnel, because baseline rates shrink. Detecting a 10% relative lift on a 50% baseline (5pp absolute) is far easier than detecting the same 10% lift on a 5% baseline (0.5pp absolute).

Reading this table: To detect a 10% relative improvement in each metric (with 80% statistical power, p<0.05), you need the indicated sample sizes. "Per Variant" means each of your A/B groups needs this many users.

Funnel Stage | Baseline | Per Variant | Total Installs | Min Duration
Tutorial Completion | 65% | 806 | 1,612 | 7 days
D1 Retention | 50% | 1,566 | 3,132 | 7 days
D7 Retention | 20% | 6,511 | 13,022 | 14 days
D30 Retention | 8% | 18,873 | 37,746 | 37 days
Payer Conversion | 2.5% | 64,200 | 128,400 | 14 days

The math is unforgiving: testing payer conversion requires roughly 80x more users than testing tutorial completion.
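
If you want to sanity-check these numbers or plug in your own baselines, the standard two-proportion approximation reproduces the table above to within a few users. A minimal Python sketch using scipy's normal quantiles (exact calculators such as Evan Miller's may differ slightly):

```python
# Minimal sketch: per-variant sample size for detecting a relative lift in a
# conversion-style metric, using the normal approximation for two proportions
# (alpha = 0.05 two-sided, 80% power).
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# D7 retention test: 20% baseline, 10% relative lift -> roughly 6,500 per variant
print(round(sample_size_per_variant(0.20, 0.10)))
```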

Genre-Specific Sample Sizes

Higher-retention genres (Casino, Puzzle) need fewer installs to detect the same relative lift. Lower-retention genres (Hypercasual) need more. Use your genre's baseline when planning tests:

D7 Retention Test (10% Relative Lift)

Genre | D7 Baseline | Per Variant | Total Installs
Hypercasual | 15% | 9,258 | 18,516
Hybrid-Casual | 20% | 6,511 | 13,022
Midcore RPG | 18% | 7,427 | 14,854
Strategy/4X | 22% | 5,762 | 11,524
Casual Puzzle | 25% | 4,863 | 9,726
Casino/Social | 30% | 3,764 | 7,528

Payer Conversion Test (20% Relative Lift)

Genre | Payer Conv | Per Variant | Total Installs
Hypercasual | 3% | 13,915 | 27,830
Hybrid-Casual | 5% | 8,159 | 16,318
Casual Puzzle | 8% | 4,922 | 9,844
Strategy/4X | 12% | 3,123 | 6,246
Midcore RPG | 15% | 2,404 | 4,808
Casino/Social | 20% | 1,684 | 3,368

Key insight: Higher payer conversion = easier to test. Casino/Midcore can test monetization with ~5K users. Hypercasual needs ~28K.

The Full Week Rule (Critical)

User behavior varies by day of week. Weekend users behave differently than weekday users. IAP conversion spikes on paydays.

Rule: Tests must run for complete 7-day cycles (7, 14, 21, 28 days) regardless of sample size.

The Four Silent Killers of A/B Tests

Silent Killer #1: Sample Ratio Mismatch (SRM)

What it is: You target a 50/50 split but end up with 52/48 or worse.

The "missing" users aren't random. If 200 users disappear from variant B, they're probably the impatient ones who churned. You're now comparing "all users" vs "patient users."

Rule: If actual split differs from expected by more than 1%, investigate before trusting results.
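
The usual way to operationalize this check is a chi-square goodness-of-fit test on the assignment counts; a common convention is to alarm at p < 0.001 rather than eyeballing the percentage. A minimal sketch with hypothetical counts:

```python
# Minimal sketch: chi-square test for Sample Ratio Mismatch on hypothetical counts.
from scipy.stats import chisquare

observed = [50_400, 48_100]              # users actually assigned to A and B
target_ratio = [0.5, 0.5]                # the split you configured
expected = [r * sum(observed) for r in target_ratio]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:                      # common SRM alarm threshold
    print(f"SRM likely (p = {p_value:.2e}): check assignment and logging first")
else:
    print(f"No strong SRM evidence (p = {p_value:.3f})")
```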

Silent Killer #2: Peeking

What it is: Checking your dashboard daily and stopping when you see "significant."

Check once per day for 14 days? Your false positive rate explodes from 5% to 30%+. Day 3 shows p=0.04. You stop. Celebrate. Ship. But it was random noise.

Rule: Decide the duration BEFORE you start. Do not check until it's over.
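
If you need to convince yourself (or a stakeholder), an A/A simulation makes the inflation concrete: both variants are identical, yet stopping at the first "significant" daily check fires far more often than 5%. A rough sketch with made-up traffic numbers:

```python
# Minimal sketch: simulate daily peeking on an A/A test (no real difference).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
days, users_per_day, true_rate = 14, 1_000, 0.20
z_crit = norm.ppf(0.975)                      # two-sided alpha = 0.05
false_positives, runs = 0, 2_000

for _ in range(runs):
    a = rng.binomial(1, true_rate, size=days * users_per_day)
    b = rng.binomial(1, true_rate, size=days * users_per_day)
    for day in range(1, days + 1):
        n = day * users_per_day
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        z = (a[:n].mean() - b[:n].mean()) / se
        if abs(z) > z_crit:                   # "significant" at this peek: stop and ship
            false_positives += 1
            break

print(f"False-positive rate with daily peeking: {false_positives / runs:.1%}")
```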

Silent Killer #3: Novelty Effect

What it is: Users engage more with any change simply because it's new.

A UI change might spike metrics for 3-5 days, then drop below baseline. If you measure during the novelty window, you ship a worse feature.

Rule: Never trust the first 3 days of data. Run for at least 7 days, ideally 14.
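
A simple novelty check is to compute the lift separately for the first and second halves of the test window; a week-one lift that evaporates in week two is the classic signature. A sketch with hypothetical daily values:

```python
# Minimal sketch: compare early vs late lift to spot a novelty effect.
import numpy as np

# Hypothetical daily D1 retention per variant over a 14-day test
ret_a = np.array([0.50, 0.49, 0.51, 0.50, 0.50, 0.49, 0.50,
                  0.50, 0.51, 0.50, 0.49, 0.50, 0.50, 0.51])
ret_b = np.array([0.56, 0.55, 0.54, 0.53, 0.52, 0.51, 0.50,
                  0.50, 0.49, 0.50, 0.50, 0.49, 0.50, 0.50])

week1_lift = ret_b[:7].mean() / ret_a[:7].mean() - 1
week2_lift = ret_b[7:].mean() / ret_a[7:].mean() - 1
print(f"Week 1 lift: {week1_lift:+.1%}, Week 2 lift: {week2_lift:+.1%}")
# A big week-1 lift that vanishes in week 2 is a classic novelty signature.
```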

Silent Killer #4: Simpson's Paradox

What it is: A change that looks flat in aggregate actually helps one segment and hurts another.

iOS: +8% retention (60% of users). Android: -15% retention (40% of users). Aggregate: +0.2% (not significant). You kill the feature, but should have shipped iOS-only.

Rule: Always segment results by Platform (iOS/Android) and Top 3 Geos before making decisions.
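
The segmented readout does not need anything fancy; grouping by platform before computing the rate is enough to expose the split. A sketch with hypothetical counts loosely mirroring the example above:

```python
# Minimal sketch: segment results by platform before deciding.
import pandas as pd

df = pd.DataFrame({
    "platform": ["iOS", "iOS", "Android", "Android"],
    "variant":  ["A",   "B",   "A",       "B"],
    "users":    [30_000, 30_000, 20_000, 20_000],
    "retained": [12_000, 12_960, 8_000,  6_800],
})

# Aggregate view: looks roughly flat
agg = df.groupby("variant")[["retained", "users"]].sum()
print((agg["retained"] / agg["users"]).rename("d1_retention"))

# Segmented view: iOS up ~8%, Android down ~15%
seg = df.groupby(["platform", "variant"])[["retained", "users"]].sum()
print((seg["retained"] / seg["users"]).rename("d1_retention"))
```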

Monetization Tests Are Different (The Whale Problem)

Retention is a binary per-user metric, so its averages behave well (roughly normal). Monetization is Pareto-distributed: 95-98% of users spend $0, 2-5% are payers, and 0.1-0.5% are whales who drive 50%+ of revenue. One whale in Variant A spending $500 can flip your entire result.

Metric | Distribution | Sample Size | Recommendation
Payer Conversion | Binary | 15K-65K | Use this
ARPPU | Pareto | 50K+ payers | Very hard
ARPDAU | Extreme Pareto | 100K+ users | Don't A/B test
LTV | Extreme Pareto | N/A | Don't A/B test

Recommendation: Test payer conversion rate (binary: paid or not paid). For ARPDAU/LTV impact, use a long-running Holdout Group (10% of users held back for 60-90 days) rather than short-term A/B tests.
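
In code terms, that means comparing payer counts with a two-proportion test instead of comparing revenue means that a single whale can swing. A minimal sketch with hypothetical counts (statsmodels assumed available):

```python
# Minimal sketch: compare payer conversion as a binary outcome (paid / did not pay).
from statsmodels.stats.proportion import proportions_ztest

payers = [1_610, 1_780]        # paying users in variants A and B (hypothetical)
users  = [64_200, 64_200]      # users per variant

z_stat, p_value = proportions_ztest(count=payers, nobs=users)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# For ARPDAU/LTV impact, read a 60-90 day holdout group instead of a short A/B test.
```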

Ad Revenue vs IAP Revenue

Good news: ad monetization tests don't suffer the same whale problem as IAP tests.

Revenue Type | Variance | Test Difficulty
Ad eCPM | Low | Easy (like retention)
Interstitial frequency | Low | Easy
Rewarded ad conversion | Medium | Moderate
IAP bundle test | Very High | Hard
IAP price point test | Extreme | Very Hard

Rule: You can A/B test ad frequency easily. Testing a new $99.99 bundle? Accept high uncertainty or use qualitative methods.

Who Enters Your Test Matters

If you run a test on "All Active Users," you mix old users (established habits, harder to change) with new users (fresh, malleable). Old users dilute your signal. A tutorial change has zero effect on D30+ users—they already completed the tutorial.

Test Type | Population | Why
Onboarding/Tutorial | New installs only | Old users already passed tutorial
Core loop changes | New installs only | Habit formation happens early
Monetization (new offer) | Active users OK | But segment by tenure
UI/UX polish | Active users OK | But check for novelty effect
LiveOps events | Active users OK | Time-limited by design

Rule for Retention Tests

Only include users who installed AFTER test start. Never mix pre-existing users into retention or onboarding experiments.

Seasonality Will Fool You

"We launched the feature on Monday. Revenue is up 15% vs last week!" — Last week was post-holiday. This week is normal. You measured seasonality, not your feature.

Valid Methods

  • A/B test with control group
  • Same-period comparison (Tuesday vs Tuesday)
  • Day-of-week fixed effects (see the sketch after these lists)
  • Week 23 vs Week 23 last year

Invalid Methods

  • "This week vs last week"
  • Pre/post without control
  • Ignoring holiday proximity
  • Weekend vs weekday comparisons
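
For the "day-of-week fixed effects" approach, a plain regression with day-of-week dummies and a post-launch indicator is usually enough to separate seasonality from the feature effect. A minimal sketch on synthetic data (all numbers and dates are illustrative):

```python
# Minimal sketch: estimate a launch effect on daily revenue while controlling
# for day-of-week seasonality. All data below is synthetic.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
dates = pd.date_range("2024-05-01", periods=28, freq="D")
weekend_bump = pd.Series(dates.day_name()).map(
    {"Saturday": 300, "Sunday": 250}).fillna(0).to_numpy()
launched = (dates >= "2024-05-15").astype(int)   # feature ships on day 15

daily = pd.DataFrame({
    "dow": dates.day_name(),
    "post_launch": launched,
    "revenue": 2_000 + weekend_bump + 150 * launched + rng.normal(0, 80, 28),
})

model = smf.ols("revenue ~ post_launch + C(dow)", data=daily).fit()
print(f"Estimated launch effect: {model.params['post_launch']:.0f} "
      f"(p = {model.pvalues['post_launch']:.3f})")
```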

External Validity: Soft Launch ≠ Global

You soft-launch in the Philippines. D1 retention is 45%. You launch in the US. D1 retention: 32%. What happened?

Factor | Philippines/SEA | US/Western Europe
Device quality | Lower spec, more tolerant | Higher spec, less tolerant
Network speed | Variable, patient with loading | Fast, rage quit on lag
Competition | Less saturated | Hyper-saturated
Ad tolerance | Higher | Lower

Before Scaling UA

Rule: Validate with at least 2,000 installs in your target geo (US/UK) before scaling. Tier 2 proxy markets (Poland, Brazil, Canada) predict US performance better than SEA does.

Budgeting for Significance: The Spend-to-Learn Ratio

Your team ships a tutorial update every 2 weeks. Your UA is spending $3,000/day. Are you wasting money or wasting learnings?

The Formula

Optimal Daily Spend = (Required Installs / Test Duration Days) × CPI

Constraint: Test Duration ≥ 7 days (Full Week Rule)

Example: Hybrid-Casual D7 Retention Test

Parameter | Value
Total installs needed | 13,000 (6,500 per variant)
Minimum test duration | 14 days
CPI | $1.80
Optimal daily spend | (13,000 / 14) × $1.80 = $1,671/day
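
The same arithmetic in code, so you can plug in your own genre's sample size and CPI (the function name is just illustrative):

```python
# Minimal sketch of the spend-to-learn formula, using the hybrid-casual example.
def optimal_daily_spend(total_installs, test_duration_days, cpi):
    if test_duration_days < 7:
        raise ValueError("Full Week Rule: run tests for at least 7 full days")
    return total_installs / test_duration_days * cpi

print(f"${optimal_daily_spend(13_000, 14, 1.80):,.0f}/day")   # ~ $1,671/day
```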

Aligning Spend with Ship Velocity

Scenario | Outcome
Ship every 3 days, spend $500/day | Can't measure anything: insufficient sample
Ship every 3 weeks, spend $5K/day | Buying users on an unvalidated product
Ship randomly, spend $3K/day | No coordination, no learnings
Ship every 14 days, spend matches test needs | Every dollar teaches something

Pre-Flight Checklists

Before Starting Test
  • What metric am I measuring?
  • What's the baseline from OUR data (not industry average)?
  • What's the minimum improvement worth detecting?
  • How many users do I need per variant?
  • Is test duration at least 7 days (full week rule)?
  • Am I testing on NEW USERS only (for retention)?
  • Have I committed to NOT peeking until test ends?
Before Looking at Results
  • SRM Check: Is actual split within 1% of target?
  • Did I hit target sample size?
  • Did test run for complete 7-day cycle(s)?
  • Were there technical issues (crashes, hotfixes)?
  • Did external factors change (featuring, holiday)?
After Getting Results
  • Is p-value < 0.05?
  • Is effect size practically meaningful (not just significant)?
  • Novelty check: Is effect consistent first vs last 7 days?
  • Segment check: Does result hold for iOS AND Android?
  • Segment check: Does result hold in top 3 geos?

The Eight Commandments

  1. Product enables marketing. Fix retention before scaling UA. Know your category benchmarks.
  2. Deeper funnel = harder to test. Tutorial needs 1,600 users. Payer conversion needs 128,000.
  3. Sample size is users, not days. But tests need full 7-day cycles.
  4. Check for SRM before analyzing. If your 50/50 split is 52/48, your test is invalid.
  5. Never peek. Decide duration upfront. Daily checking inflates false positives to 30%+.
  6. Watch for novelty effects. Don't trust first 3 days. Compare early vs late periods.
  7. Monetization tests are different. Test conversion rate, not revenue. Whales break statistics.
  8. Always segment. Check iOS vs Android, top geos. Simpson's Paradox is real.

Tools & Resources

  1. Sample Size Calculator: evanmiller.org/ab-testing/sample-size
  2. SRM Checker: thumbtack.github.io/abba/demo/srm
  3. Bayesian Calculator: evanmiller.org/bayesian-ab-testing
  4. Sequential Testing: evanmiller.org/sequential-ab-testing (if you must peek)

Excel/Google Sheets Formula

Calculate sample size per variant directly in your spreadsheet:

=POWER(NORM.S.INV(1-0.05/2)+NORM.S.INV(0.8),2)*2*B2*(1-B2)/POWER(B2*C2,2)

Where B2 = baseline rate (e.g., 0.20 for 20% retention) and C2 = relative lift to detect (e.g., 0.10 for 10% improvement). Note: this uses the baseline variance for both variants, so estimates differ from exact calculators by a few percent.

The time you invest in statistical rigor pays off massively. You'll catch real problems, avoid false positives, and make decisions that actually move the needle instead of chasing noise.
