Statistical Significance in F2P Games
A product manager's guide to not fooling yourself. How to run valid A/B tests, avoid common traps, and make data-driven decisions that actually hold up.
The Core Problem
"ROAS dropped from 50% to 44%. Is this a crisis or just noise?" Without proper statistical rigor, you're guessing. This guide covers the traps that catch even experienced PMs and how to avoid them.
The Hierarchy: Product Enables Marketing
Before worrying about sample sizes and p-values, understand the hierarchy. You cannot A/B test your way to good ROAS if retention is broken.
How Do You Know If Each Level "Works"?
The hierarchy shows what to test in order. But how do you know if retention or monetization is actually working? These are the top 25% benchmarks by genre — if you're below these, that's where to focus your experiments:
| Genre | D1 | D7 | D30 | D90 | Payer Conv | CPI (US) |
|---|---|---|---|---|---|---|
| Hypercasual | 45% | 15% | 6% | 2% | 3% | $1.20 |
| Hybrid-Casual | 50% | 20% | 8% | 3% | 5% | $1.80 |
| Casual Puzzle | 55% | 25% | 12% | 6% | 8% | $2.50 |
| Midcore RPG | 40% | 18% | 10% | 5% | 15% | $4.00 |
| Strategy/4X | 45% | 22% | 12% | 7% | 12% | $3.50 |
| Casino/Social | 60% | 30% | 15% | 8% | 20% | $5.00 |
Sources: Sensor Tower 2024, data.ai 2025, Liftoff Q3 2024
CPI Warning: CPIs vary 5-10x by Geo/OS. US iOS Midcore can hit $15+; Android ROW may be $2. Use these only for relative comparison between genres.
The "Scale Anyway" Trap
"We'll improve retention while scaling" — this is burning money. Every cohort acquired on bad retention has lower LTV, pollutes your data, and makes it harder to see if product changes work.
Rule: Don't spend more than $500/day until D7 retention meets category benchmark.
The Funnel Law: Deeper = Harder to Test
Sample size requirements increase as you go deeper into the funnel. Detecting a 10% lift on a 50% baseline (5pp absolute) is much easier than detecting a 10% lift on a 5% baseline (0.5pp absolute).
Reading this table: To detect a 10% relative improvement in each metric (with 80% statistical power, p<0.05), you need the indicated sample sizes. "Per Variant" means each of your A/B groups needs this many users.
| Funnel Stage | Baseline | Per Variant | Total Installs | Min Duration |
|---|---|---|---|---|
| Tutorial Completion | 65% | 806 | 1,612 | 7 days |
| D1 Retention | 50% | 1,566 | 3,132 | 7 days |
| D7 Retention | 20% | 6,511 | 13,022 | 14 days |
| D30 Retention | 8% | 18,873 | 37,746 | 37 days |
| Payer Conversion | 2.5% | 64,200 | 128,400 | 14 days |
The math is unforgiving: testing payer conversion requires roughly 80x more users than testing tutorial completion.
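These figures follow from the standard two-proportion power calculation (two-sided z-test, 80% power, alpha = 0.05). A minimal Python sketch of that calculation, if you want to reproduce or adapt the tables; the function name and defaults here are illustrative, and the results land within a couple of users of the table values:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Users needed per variant to detect a relative lift in a conversion-style
    metric with a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# D7 retention at a 20% baseline, 10% relative lift
print(sample_size_per_variant(0.20, 0.10))    # close to the 6,511 per variant shown above
# Payer conversion at 2.5%, 10% relative lift
print(sample_size_per_variant(0.025, 0.10))   # close to the 64,200 shown for payer conversion
```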
Genre-Specific Sample Sizes
Higher-retention genres (Casino, Puzzle) need fewer installs to detect the same relative lift. Lower-retention genres (Hypercasual) need more. Use your genre's baseline when planning tests:
D7 Retention Test (10% Relative Lift)
| Genre | D7 Baseline | Per Variant | Total Installs |
|---|---|---|---|
| Hypercasual | 15% | 9,258 | 18,516 |
| Hybrid-Casual | 20% | 6,511 | 13,022 |
| Midcore RPG | 18% | 7,427 | 14,854 |
| Strategy/4X | 22% | 5,762 | 11,524 |
| Casual Puzzle | 25% | 4,863 | 9,726 |
| Casino/Social | 30% | 3,764 | 7,528 |
Payer Conversion Test (20% Relative Lift)
| Genre | Payer Conv | Per Variant | Total Installs |
|---|---|---|---|
| Hypercasual | 3% | 13,915 | 27,830 |
| Hybrid-Casual | 5% | 8,159 | 16,318 |
| Casual Puzzle | 8% | 4,922 | 9,844 |
| Strategy/4X | 12% | 3,123 | 6,246 |
| Midcore RPG | 15% | 2,404 | 4,808 |
| Casino/Social | 20% | 1,684 | 3,368 |
Key insight: Higher payer conversion = easier to test. Casino and Midcore can test monetization with ~3-5K total installs. Hypercasual needs ~28K.
The Full Week Rule (Critical)
User behavior varies by day of week. Weekend users behave differently than weekday users. IAP conversion spikes on paydays.
Rule: Tests must run for complete 7-day cycles (7, 14, 21, 28 days) regardless of sample size.
The Four Silent Killers of A/B Tests
Silent Killer #1: Sample Ratio Mismatch (SRM)
What it is: You target a 50/50 split but end up with 52/48 or worse.
The "missing" users aren't random. If 200 users disappear from variant B, they're probably the impatient ones who churned. You're now comparing "all users" vs "patient users."
Silent Killer #2: Peeking
What it is: Checking your dashboard daily and stopping when you see "significant."
Check once per day for 14 days? Your false positive rate explodes from 5% to 30%+. Day 3 shows p=0.04. You stop. Celebrate. Ship. But it was random noise.
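You can see the inflation yourself by simulating A/A tests (two identical variants) and peeking daily. A rough sketch; the daily install volume, number of peeks, and simulation count are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def peeking_false_positive_rate(p=0.20, daily_installs=500, days=14, sims=2000):
    """Share of A/A tests (identical variants) that look 'significant'
    at p < 0.05 on at least one daily peek."""
    z_crit = norm.ppf(0.975)
    false_positives = 0
    for _ in range(sims):
        a = rng.binomial(1, p, size=daily_installs * days)
        b = rng.binomial(1, p, size=daily_installs * days)
        for day in range(1, days + 1):
            n = daily_installs * day
            pa, pb = a[:n].mean(), b[:n].mean()
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            if se > 0 and abs(pa - pb) / se > z_crit:
                false_positives += 1
                break
    return false_positives / sims

# Typically well above the nominal 5% (often in the 15-30% range with 14 peeks)
print(peeking_false_positive_rate())
```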
Silent Killer #3: Novelty Effect
What it is: Users engage more with any change simply because it's new.
A UI change might spike metrics for 3-5 days, then drop below baseline. If you measure during the novelty window, you ship a worse feature.
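One pragmatic check, echoed in the checklist at the end, is to compare the lift in the first week of the test against the lift in the last week: a real improvement holds up, a novelty bump fades. A minimal sketch assuming per-user rows with a test-day, variant, and binary outcome column (the column names are hypothetical):

```python
import pandas as pd

def lift_by_window(df, first_days=7, last_days=7):
    """Compare the treatment lift in the first vs last N days of a test.

    Expects columns: 'test_day' (1-based day of the test the user entered),
    'variant' ('A' or 'B'), and 'converted' (0/1). Column names are examples."""
    max_day = df["test_day"].max()
    windows = {
        "first_week": df[df["test_day"] <= first_days],
        "last_week": df[df["test_day"] > max_day - last_days],
    }
    out = {}
    for name, window in windows.items():
        rates = window.groupby("variant")["converted"].mean()
        out[name] = rates["B"] / rates["A"] - 1   # relative lift of B over A
    return out

# Toy data: B spikes early, then settles back to baseline by the last week
toy = pd.DataFrame({
    "test_day":  [1, 1, 1, 1, 14, 14, 14, 14],
    "variant":   ["A", "B", "A", "B", "A", "B", "A", "B"],
    "converted": [0, 1, 1, 1, 1, 0, 0, 1],
})
print(lift_by_window(toy))
```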
Silent Killer #4: Simpson's Paradox
What it is: A change that looks flat in aggregate actually helps one segment and hurts another.
iOS: +8% retention (60% of users). Android: -15% retention (40% of users). Aggregate: +0.2% (not significant). You kill the feature, but should have shipped iOS-only.
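The defence is mechanical: compute the lift per segment before looking at the aggregate. The sketch below uses hypothetical baselines (iOS at 52% D1, Android at 40%) chosen so the blended number roughly reproduces the +0.2% above:

```python
import pandas as pd

# Hypothetical per-segment results; the 52% / 40% baselines are assumptions
# picked so the aggregate roughly matches the +0.2% example above.
segments = pd.DataFrame({
    "platform":   ["iOS", "Android"],
    "user_share": [0.60, 0.40],
    "control_d1": [0.52, 0.40],
    "treat_d1":   [0.52 * 1.08, 0.40 * 0.85],   # +8% iOS, -15% Android
})

# Per-segment relative lift: the only view that shows what is really happening
segments["lift"] = segments["treat_d1"] / segments["control_d1"] - 1

# Aggregate lift: weighted by user share, the two effects almost cancel out
agg_control = (segments["user_share"] * segments["control_d1"]).sum()
agg_treat = (segments["user_share"] * segments["treat_d1"]).sum()
print(segments[["platform", "lift"]])
print(f"aggregate lift: {agg_treat / agg_control - 1:+.1%}")
```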
Monetization Tests Are Different (The Whale Problem)
Retention is a binary per-user outcome, so its sample averages are well behaved (approximately normal at these sample sizes). Monetization data is Pareto-distributed: 95-98% of users spend $0, 2-5% are payers, and 0.1-0.5% are whales who account for 50%+ of revenue. One whale in Variant A spending $500 can flip your entire result.
| Metric | Distribution | Sample Size | Recommendation |
|---|---|---|---|
| Payer Conversion | Binary | 15K-65K | Use this |
| ARPPU | Pareto | 50K+ payers | Very hard |
| ARPDAU | Extreme Pareto | 100K+ users | Don't A/B test |
| LTV | Extreme Pareto | N/A | Don't A/B test |
Recommendation: Test payer conversion rate (binary: paid or not paid). For ARPDAU/LTV impact, use a long-running Holdout Group (10% of users held back for 60-90 days) rather than short-term A/B tests.
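A small simulation makes the point: with heavy-tailed spend, per-user revenue can swing noticeably between two identical variants, while payer conversion stays stable. The spend distribution below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_variant(n_users, payer_rate=0.03, whale_rate=0.002):
    """Simulate per-user spend: most users pay nothing, a few pay a little,
    a tiny fraction are whales. All parameters are illustrative."""
    spend = np.zeros(n_users)
    payers = rng.random(n_users) < payer_rate
    spend[payers] = rng.lognormal(mean=1.5, sigma=1.0, size=payers.sum())  # small purchases, a few dollars
    whales = rng.random(n_users) < whale_rate
    spend[whales] = rng.uniform(200, 800, size=whales.sum())               # a handful of big spenders
    return spend

a = simulate_variant(10_000)
b = simulate_variant(10_000)   # identical generating process: any gap is pure noise

# Per-user revenue can differ by double-digit percentages purely by chance...
print(f"ARPU A: ${a.mean():.3f}   ARPU B: ${b.mean():.3f}")
# ...while the binary payer rate stays far more stable
print(f"payer rate A: {(a > 0).mean():.3%}   payer rate B: {(b > 0).mean():.3%}")
```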
Ad Revenue vs IAP Revenue
Good news: ad monetization tests don't suffer the same whale problem as IAP tests.
| Revenue Type | Variance | Test Difficulty |
|---|---|---|
| Ad eCPM | Low | Easy (like retention) |
| Interstitial frequency | Low | Easy |
| Rewarded ad conversion | Medium | Moderate |
| IAP bundle test | Very High | Hard |
| IAP price point test | Extreme | Very Hard |
Rule: You can A/B test ad frequency easily. Testing a new $99.99 bundle? Accept high uncertainty or use qualitative methods.
Who Enters Your Test Matters
If you run a test on "All Active Users," you mix old users (established habits, harder to change) with new users (fresh, malleable). Old users dilute your signal. A tutorial change has zero effect on D30+ users—they already completed the tutorial.
| Test Type | Population | Why |
|---|---|---|
| Onboarding/Tutorial | New installs only | Old users already passed tutorial |
| Core loop changes | New installs only | Habit formation happens early |
| Monetization (new offer) | Active users OK | But segment by tenure |
| UI/UX polish | Active users OK | But check for novelty effect |
| LiveOps events | Active users OK | Time-limited by design |
Rule for Retention Tests
Only include users who installed AFTER test start. Never mix pre-existing users into retention or onboarding experiments.
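In practice this is just a filter on install date applied at assignment time. A minimal pandas sketch (the column name and dates are hypothetical):

```python
import pandas as pd

def eligible_for_retention_test(users: pd.DataFrame, test_start: str) -> pd.DataFrame:
    """Keep only users who installed on or after the test start date.

    Expects an 'install_date' column; the column name is an example."""
    cutoff = pd.Timestamp(test_start)
    installs = pd.to_datetime(users["install_date"])
    return users[installs >= cutoff]

# Only users installing on or after June 1 enter the onboarding test
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "install_date": ["2025-04-12", "2025-06-01", "2025-06-03"],
})
print(eligible_for_retention_test(users, "2025-06-01"))
```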
Seasonality Will Fool You
"We launched the feature on Monday. Revenue is up 15% vs last week!" — Last week was post-holiday. This week is normal. You measured seasonality, not your feature.
Valid Methods
- A/B test with control group
- Same-period comparison (Tuesday vs Tuesday)
- Day-of-week fixed effects (see the sketch below)
- Week 23 vs Week 23 last year
Invalid Methods
- "This week vs last week"
- Pre/post without control
- Ignoring holiday proximity
- Weekend vs weekday comparisons
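Of the valid methods, day-of-week fixed effects are the least familiar to most PMs: regress the daily metric on a treatment indicator plus one dummy per weekday, so the weekly pattern is absorbed before the treatment effect is read off. A minimal statsmodels sketch on hypothetical daily data:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical daily panel: one row per day per group, with a weekend retention bump
df = pd.DataFrame({
    "date": pd.date_range("2025-06-02", periods=28).tolist() * 2,
    "group": ["control"] * 28 + ["treatment"] * 28,
    "d1_retention": ([0.48, 0.47, 0.46, 0.47, 0.49, 0.53, 0.54] * 4
                     + [0.50, 0.49, 0.48, 0.49, 0.51, 0.55, 0.56] * 4),
})
df["treated"] = (df["group"] == "treatment").astype(int)
df["dow"] = df["date"].dt.day_name()

# C(dow) adds one dummy per weekday, absorbing the weekly seasonality
model = smf.ols("d1_retention ~ treated + C(dow)", data=df).fit()
print(model.params["treated"])   # estimated lift net of day-of-week effects (0.02 in this toy data)
```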
External Validity: Soft Launch ≠ Global
You soft-launch in the Philippines. D1 retention is 45%. You launch in the US. D1 retention: 32%. What happened?
| Factor | Philippines/SEA | US/Western Europe |
|---|---|---|
| Device quality | Lower spec, more tolerant | Higher spec, less tolerant |
| Network speed | Variable, patient with loading | Fast, rage quit on lag |
| Competition | Less saturated | Hyper-saturated |
| Ad tolerance | Higher | Lower |
Before Scaling UA
Rule: Validate with at least 2,000 installs in your target geo (US/UK) before scaling. Test markets like Poland, Brazil, or Canada are a closer US proxy than SEA.
Budgeting for Significance: The Spend-to-Learn Ratio
Your team ships a tutorial update every 2 weeks. Your UA is spending $3,000/day. Are you wasting money or wasting learnings?
The Formula
Optimal Daily Spend = (Required Installs / Test Duration Days) × CPI
Constraint: Test Duration ≥ 7 days (Full Week Rule)
Example: Hybrid-Casual D7 Retention Test
| Parameter | Value |
|---|---|
| Total installs needed | 13,000 (6,500 per variant) |
| Minimum test duration | 14 days |
| CPI | $1.80 |
| Optimal daily spend | (13,000 / 14) × $1.80 = $1,671/day |
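The same arithmetic as a small helper, if you want to wire it into a planning sheet or script; a sketch of the formula above, with the Full Week Rule enforced:

```python
def optimal_daily_spend(total_installs, test_duration_days, cpi):
    """Daily UA budget so the test hits its sample size exactly at the end
    of its full-week duration."""
    if test_duration_days < 7 or test_duration_days % 7 != 0:
        raise ValueError("Test duration must be a whole number of weeks (Full Week Rule)")
    installs_per_day = total_installs / test_duration_days
    return installs_per_day * cpi

# Hybrid-casual D7 retention test: 13,000 installs over 14 days at $1.80 CPI
print(f"${optimal_daily_spend(13_000, 14, 1.80):,.0f}/day")   # $1,671/day
```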
Aligning Spend with Ship Velocity
| Scenario | Problem |
|---|---|
| Ship every 3 days, spend $500/day | Can't measure anything—insufficient sample |
| Ship every 3 weeks, spend $5K/day | Buying users on unvalidated product |
| Ship randomly, spend $3K/day | No coordination, no learnings |
| Ship every 14 days, spend matches test needs | Every dollar teaches something |
Checklists
Before Launch
- What metric am I measuring?
- What's the baseline from OUR data (not industry average)?
- What's the minimum improvement worth detecting?
- How many users do I need per variant?
- Is test duration at least 7 days (full week rule)?
- Am I testing on NEW USERS only (for retention)?
- Have I committed to NOT peeking until test ends?
After the Test Ends (Validity)
- SRM Check: Is actual split within 1% of target?
- Did I hit target sample size?
- Did test run for complete 7-day cycle(s)?
- Were there technical issues (crashes, hotfixes)?
- Did external factors change (featuring, holiday)?
Before Acting on the Result
- Is p-value < 0.05?
- Is effect size practically meaningful (not just significant)?
- Novelty check: Is effect consistent first vs last 7 days?
- Segment check: Does result hold for iOS AND Android?
- Segment check: Does result hold in top 3 geos?
The Eight Commandments
- Product enables marketing. Fix retention before scaling UA. Know your category benchmarks.
- Deeper funnel = harder to test. Tutorial needs 1,600 users. Payer conversion needs 128,000.
- Sample size is users, not days. But tests need full 7-day cycles.
- Check for SRM before analyzing. If your 50/50 split is 52/48, your test is invalid.
- Never peek. Decide duration upfront. Daily checking inflates false positives to 30%+.
- Watch for novelty effects. Don't trust first 3 days. Compare early vs late periods.
- Monetization tests are different. Test conversion rate, not revenue. Whales break statistics.
- Always segment. Check iOS vs Android, top geos. Simpson's Paradox is real.
Tools & Resources
- Sample Size Calculator — evanmiller.org/ab-testing/sample-size
- SRM Checker — thumbtack.github.io/abba/demo/srm
- Bayesian Calculator — evanmiller.org/bayesian-ab-testing
- Sequential Testing — evanmiller.org/sequential-ab-testing (if you must peek)
Excel/Google Sheets Formula
Calculate sample size per variant directly in your spreadsheet:
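One way to write it, assuming 80% power and p<0.05 as used throughout this guide:
=(NORM.S.INV(0.975)+NORM.S.INV(0.8))^2 * 2*B2*(1-B2) / (B2*C2)^2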
Where B2 = baseline rate (e.g., 0.20 for 20% retention) and C2 = relative lift to detect (e.g., 0.10 for 10% improvement). Note: Uses pooled variance approximation—estimates are ~5% lower than exact calculations.
The time you invest in statistical rigor pays off massively. You'll catch real problems, avoid false positives, and make decisions that actually move the needle instead of chasing noise.
