KPI Measurement Guide
The Version-Based Approach to Statistical Significance in Mobile Gaming
Everyone measures statistical significance by "installs per day." This is fundamentally wrong. The correct frame: measure by installs per app version to answer "Is this build different from the prior build?"
The Problem with "Installs Per Day"
What the Industry Gets Wrong
"You need 1,000+ installs per day to reach statistical significance"
Why It's Wrong:
- Conflates time and version - Days pass regardless of code changes
- Ignores cohort contamination - Users from different versions mixed together
- Assumes stable external factors - Ad networks, seasonality, competition all change daily
- No causal attribution - Can't answer "did the change work?" only "did numbers go up?"
The Version-Based Mental Model
Each app version is a discrete experiment. The independent variable is the VERSION, not the calendar date.
Framework Applicability
| Stage | Recommended Approach | Why |
|---|---|---|
| Pre-Series A | Version-Based (this framework) | Simple to implement, no backend infra required |
| Series B+ | Parallel A/B Testing (Feature Flags) | Eliminates time bias, requires Remote Config infra |
Version-based measurement is "Sequential Cohort Analysis"—better than day-based, but still vulnerable to external factors. For mature games, parallel A/B testing via feature flags is the gold standard. This framework is optimized for early-stage teams shipping whole binaries.
The Math Actually Works Differently by KPI
KPI Categories and Statistical Properties
| KPI | Type | Test | Key Consideration |
|---|---|---|---|
| App Store CVR | Proportion | Z-test | Very low base rate (2-8%) |
| FTUE Completion | Proportion | Z-test | Medium base rate (30-70%) |
| D1 Retention | Proportion | Z-test | Medium base rate (25-45%) |
| D1 ROAS | Continuous | Mann-Whitney U + Bootstrap | High variance, whales distort mean |
| ARPDAU | Continuous | Mann-Whitney U + Bootstrap | Non-normal, need robust methods |
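For the proportion KPIs above (CVR, FTUE completion, D1 retention), the z-test in the table is a pooled two-proportion test. A minimal sketch, assuming you already have converter and total counts per version; the counts below are illustrative:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pooled two-proportion z-test: did the proportion change between versions?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided
    return p_a, p_b, z, p_value

# Hypothetical D1 retention counts for version 1.4.0 vs 1.5.0
p_a, p_b, z, p = two_proportion_ztest(conv_a=1750, n_a=5000, conv_b=1900, n_b=5000)
print(f"D1 retention {p_a:.1%} -> {p_b:.1%}, z={z:.2f}, p={p:.4f}")
```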
Sample Size Requirements Vary Dramatically
| KPI | Baseline | Detect 10% Lift | Detect 20% Lift |
|---|---|---|---|
| App Store CVR | 5% | ~30,000 page views/version | ~7,500 page views/version |
| FTUE Completion | 50% | ~4,000 installs/version | ~1,000 installs/version |
| D1 Retention | 35% | ~5,000 installs/version | ~1,250 installs/version |
App Store Conversion requires 6-8x more volume than D1 Retention to detect the same relative lift because of the lower base rate.
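If you need requirements for your own baselines, the standard two-proportion sample size formula is a reasonable sketch (80% power, two-sided 5% alpha, as in the quick reference at the end; the figures in the tables here are rounded, so expect ballpark rather than exact agreement):

```python
from math import ceil
from scipy.stats import norm

def n_per_version(baseline: float, relative_lift: float,
                  power: float = 0.80, alpha: float = 0.05) -> int:
    """Minimum sample per version to detect a relative lift in a proportion KPI
    (two-sided test, equal allocation across versions)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(n_per_version(0.05, 0.10))  # App Store CVR at a 5% baseline, 10% relative lift
print(n_per_version(0.35, 0.10))  # D1 retention at a 35% baseline, 10% relative lift
```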
The ROAS Problem: Why Means Lie
ROAS is not normally distributed. A single whale can shift the mean dramatically.
Don't reach for a t-test comparing mean ROAS between versions. Use two complementary tests instead:
- Mann-Whitney U: tests if one version consistently outperforms (rank-based, robust to outliers)
- Bootstrap Mean CI: tests if total revenue differs (captures whale impact)
Mann-Whitney U tests rank, not magnitude. Version A could have better ranks (more consistent minnow conversion) while Version B has higher mean revenue (better whale monetization). You need both tests!
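A sketch of running both tests on per-install D1 revenue; the arrays below are synthetic stand-ins for one revenue value per install per version:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)

# Illustrative per-install D1 revenue for two versions:
# mostly zeros and small payers, plus a handful of whales.
rev_a = np.concatenate([np.zeros(4600), rng.exponential(2.0, 390), rng.uniform(50, 300, 10)])
rev_b = np.concatenate([np.zeros(4550), rng.exponential(2.2, 442), rng.uniform(50, 300, 8)])

# 1) Mann-Whitney U: does one version consistently rank higher? (robust to whales)
u_stat, p_rank = mannwhitneyu(rev_a, rev_b, alternative="two-sided")

# 2) Bootstrap 95% CI on the difference in mean revenue per install (captures whale impact)
boot = [rng.choice(rev_b, rev_b.size, replace=True).mean()
        - rng.choice(rev_a, rev_a.size, replace=True).mean()
        for _ in range(5000)]
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"Mann-Whitney p={p_rank:.3f}; mean diff 95% CI [{lo:.3f}, {hi:.3f}]")
```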
The Version-Based Measurement Framework
Core Principles
- Version is the Unit of Analysis - Not days, not weeks, not campaigns
- Hypothesis-First Development - Every release has a target KPI and expected lift
- Minimum Viable Data - Don't spend money on data you don't need
- Clean Cohort Separation - Never mix users from different versions in analysis
The Release → Measure → Learn Cycle
1. Release Planning
- Define target KPI (one primary, max 2 secondary)
- State hypothesis: "Feature X will improve KPI Y by Z%"
- Calculate required sample size
- Estimate days to reach sample size
2. Release Phase
- Deploy with clear version number
- Force update: ensure 95%+ of active users are on the target version
- Tag all events with client version
- Wait 5-7 days for ad network re-learning
3. Measurement Phase
- Wait for sample size (NOT arbitrary time)
- Run for FULL WEEK CYCLES (7, 14, 21 days)
- Pull cohort data by version, not by date (see the sketch after this cycle)
- Run appropriate statistical test
4. Learn Phase
- Hypothesis confirmed, rejected, or inconclusive?
- Update internal benchmarks
- Archive: version → hypothesis → result → learning
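A sketch of the measurement pull referenced above, assuming an installs table with a client_version column; the DataFrame, column, and version names are illustrative, not a prescribed schema:

```python
import pandas as pd

# Illustrative per-install rows; in practice this comes from your analytics export,
# with every row tagged with the client version at install time.
installs = pd.DataFrame({
    "user_id":        ["u1", "u2", "u3", "u4", "u5", "u6"],
    "client_version": ["1.4.0", "1.4.0", "1.4.0", "1.5.0", "1.5.0", "1.5.0"],
    "retained_d1":    [1, 0, 1, 1, 1, 0],
})

# Clean cohort separation: group by version, never by calendar date.
cohorts = installs[installs["client_version"].isin(["1.4.0", "1.5.0"])]
summary = (cohorts.groupby("client_version")
                  .agg(installs=("user_id", "nunique"),
                       d1_retained=("retained_d1", "sum")))
summary["d1_retention"] = summary["d1_retained"] / summary["installs"]
print(summary)
# Feed the per-version counts into the two-proportion z-test sketch shown earlier.
```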
Spend Optimization: Minimum Viable Data
Every dollar spent before reaching sample size is buying data you can't yet draw conclusions from. Every dollar spent AFTER reaching sample size on a failing build is waste.
| Sample Size Status | Build Performance | Action |
|---|---|---|
| Below target | Unknown | Continue spending, maintain level |
| Below target | Clearly worse | STOP immediately, iterate |
| At target | Significantly better | Scale spend, ship next version |
| At target | Not different | Hold spend, ship next version |
| Above target | Any | You're paying for data you already have |
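The decision table is simple enough to encode directly; a sketch (the labels follow the table, the function itself is illustrative):

```python
def spend_action(sample_reached: bool, over_target: bool, performance: str) -> str:
    """Map (sample-size status, observed build performance) to a spend decision,
    following the table above. `performance` is one of:
    'unknown', 'clearly_worse', 'significantly_better', 'not_different'."""
    if over_target:
        return "Cut back: you're paying for data you already have"
    if not sample_reached:
        if performance == "clearly_worse":
            return "Stop spend immediately and iterate"
        return "Continue spending at the current level"
    if performance == "significantly_better":
        return "Scale spend and ship the next version"
    return "Hold spend and ship the next version"

print(spend_action(sample_reached=False, over_target=False, performance="unknown"))
```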
Ad Network Learning Reset
Modern UA platforms (Google UAC, Facebook AEO, TikTok) require 5-7 days to "learn" optimal targeting. When you release a new binary, some platforms reset optimization history.
| Days Post-Release | Network State | CPI | User Quality |
|---|---|---|---|
| Days 1-3 | Learning | +30-50% | Variable |
| Days 4-7 | Stabilizing | Normalizing | Improving |
| Days 8+ | Optimized | Baseline | Consistent |
True Measurement Window = Stated Window + 7-day Learning Buffer
The first 7 days post-release should either be excluded from the analysis or budgeted as a sunk cost.
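A sketch of turning a required sample size into a planned measurement window, assuming a steady daily install rate; it rounds the collection period up to full week cycles and prepends the 7-day learning buffer:

```python
from math import ceil

def measurement_window_days(required_installs: int, installs_per_day: int,
                            learning_buffer_days: int = 7) -> int:
    """Days to keep a version live: learning buffer + sample collection,
    rounded up to full week cycles so the day-of-week mix stays balanced."""
    collection_days = ceil(required_installs / installs_per_day)
    full_weeks = ceil(collection_days / 7) * 7
    return learning_buffer_days + full_weeks

# e.g. ~5,000 installs needed for D1 retention at ~400 installs/day
print(measurement_window_days(5000, 400))  # 7 + 14 = 21 days
```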
Prerequisites Checklist
| Prerequisite | Priority |
|---|---|
| Version Tracking | P0 (Critical) |
| Forced Update Mechanism | P0 (Critical) |
| Analytics Event Quality | P0 (Critical) |
| UA Spend Discipline | P1 |
| Analysis Capabilities | P2 |
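Version Tracking and Analytics Event Quality come down to stamping every event with the client version at send time; a minimal sketch of such a payload (field names are illustrative, not any specific SDK's schema):

```python
import time

def build_event(name: str, user_id: str, client_version: str, **params) -> dict:
    """Every analytics event carries the client version so cohorts can be
    split by version, not by date."""
    return {
        "event": name,
        "user_id": user_id,
        "client_version": client_version,  # e.g. read from the app bundle at runtime
        "ts": int(time.time()),
        "params": params,
    }

print(build_event("ftue_completed", user_id="u_123", client_version="1.5.0", step=12))
```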
Common Mistakes and How to Avoid Them
The "It's Been Two Weeks" Fallacy
The "Looks Better" Trap
The "Whale Distortion" Problem
The "Mixed Cohort" Error
The "Day-of-Week" Blindspot
Red Flags to Watch For
All Calculators
These calculators implement the version-based measurement framework. Use them in sequence: Sample Size → Duration → Spend → Test Results.
Quick Reference: Sample Sizes (80% power, 95% confidence, 10% relative lift)
| KPI | Baseline | Required Sample per Version |
|---|---|---|
| App Store CVR | 5% | ~30,000 page views |
| FTUE Completion | 50% | ~4,000 installs |
| D1 Retention | 35% | ~5,000 installs |
| D7 Retention | 15% | ~11,000 installs |
| D1 ROAS (median) | varies | ~2,000 installs |
