
KPI Measurement Guide

The Version-Based Approach to Statistical Significance in Mobile Gaming

The Industry Problem

Everyone measures statistical significance by "installs per day." This is fundamentally wrong. The correct frame: measure by installs per app version to answer "Is this build different from the prior build?"

1. The Problem with "Installs Per Day"

What the Industry Gets Wrong

Common Belief

"You need 1,000+ installs per day to reach statistical significance"

Why It's Wrong:

  • Conflates time and version - Days pass regardless of code changes
  • Ignores cohort contamination - Users from different versions mixed together
  • Assumes stable external factors - Ad networks, seasonality, competition all change daily
  • No causal attribution - Can't answer "did the change work?" only "did numbers go up?"

The Version-Based Mental Model

// WRONG
Time → Installs → Statistical Significance

// RIGHT
Version → Installs per Version → Build-over-Build Comparison
Key Insight

Each app version is a discrete experiment. The independent variable is the VERSION, not the calendar date.

Framework Applicability

| Stage | Recommended Approach | Why |
| --- | --- | --- |
| Pre-Series A | Version-Based (this framework) | Simple to implement, no backend infra required |
| Series B+ | Parallel A/B Testing (Feature Flags) | Eliminates time bias, requires Remote Config infra |
Important Caveat

Version-based measurement is "Sequential Cohort Analysis"—better than day-based, but still vulnerable to external factors. For mature games, parallel A/B testing via feature flags is the gold standard. This framework is optimized for early-stage teams shipping whole binaries.

2. The Math Actually Works Differently by KPI

KPI Categories and Statistical Properties

| KPI | Type | Test | Key Consideration |
| --- | --- | --- | --- |
| App Store CVR | Proportion | Z-test | Very low base rate (2-8%) |
| FTUE Completion | Proportion | Z-test | Medium base rate (30-70%) |
| D1 Retention | Proportion | Z-test | Medium base rate (25-45%) |
| D1 ROAS | Continuous | Mann-Whitney U + Bootstrap | High variance, whales distort mean |
| ARPDAU | Continuous | Mann-Whitney U + Bootstrap | Non-normal, need robust methods |
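
As a sketch, the table can be encoded as a simple test-selection map (the KPI keys and structure here are illustrative, not from any particular analytics stack):

```python
# Illustrative test plan mirroring the table above; keys are hypothetical KPI names.
KPI_TEST_PLAN = {
    "app_store_cvr":   {"type": "proportion", "test": "two-proportion z-test"},
    "ftue_completion": {"type": "proportion", "test": "two-proportion z-test"},
    "d1_retention":    {"type": "proportion", "test": "two-proportion z-test"},
    "d1_roas":         {"type": "continuous", "test": "Mann-Whitney U + bootstrap mean CI"},
    "arpdau":          {"type": "continuous", "test": "Mann-Whitney U + bootstrap mean CI"},
}

def pick_test(kpi: str) -> str:
    """Return the test family to run for a given KPI."""
    return KPI_TEST_PLAN[kpi]["test"]
```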

Sample Size Requirements Vary Dramatically

| KPI | Baseline | Detect 10% Lift | Detect 20% Lift |
| --- | --- | --- | --- |
| App Store CVR | 5% | ~30,000 page views/version | ~7,500 page views/version |
| FTUE Completion | 50% | ~4,000 installs/version | ~1,000 installs/version |
| D1 Retention | 35% | ~5,000 installs/version | ~1,250 installs/version |
Critical Insight

App Store Conversion requires 6-8x more volume than D1 Retention to detect the same relative lift because of the lower base rate.

Calculator 1: Sample Size Calculator

Calculate the required installs per version (or page views per version, for App Store CVR) to detect your target lift.

KPI definitions: CVR = store visitors who install; FTUE = onboarding completion; D1/D7 = users returning after 1/7 days. Lift is relative: a 10% lift takes a 35% baseline to 38.5%.
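
As a rough sketch of what such a calculator computes, assuming a two-sided two-proportion z-test at 80% power and 95% confidence (the tables above may use slightly different assumptions or buffers, so exact numbers can differ):

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_version(baseline: float, relative_lift: float,
                            power: float = 0.80, alpha: float = 0.05) -> int:
    """Installs (or page views) needed per version to detect a relative lift
    in a proportion KPI with a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)      # ~1.96 for 95% confidence
    z_beta = norm.ppf(power)               # ~0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

# Lower base rates demand far more volume for the same relative lift:
sample_size_per_version(0.05, 0.10)   # App Store CVR at a 5% baseline
sample_size_per_version(0.35, 0.10)   # D1 retention at a 35% baseline
```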


The ROAS Problem: Why Means Lie

ROAS is not normally distributed. A single whale can shift the mean dramatically.

Wrong Approach

t-test comparing mean ROAS between versions

Right Approach: Two-Test Protocol
  1. Mann-Whitney U: Tests if one version consistently outperforms (rank-based, robust to outliers)
  2. Bootstrap Mean CI: Tests if total revenue differs (captures whale impact)
Critical Warning

Mann-Whitney U tests rank, not magnitude. Version A could have better ranks (more consistent minnow conversion) while Version B has higher mean revenue (better whale monetization). You need both tests!
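
A minimal sketch of the Two-Test Protocol in Python, assuming you can pull per-user D1 revenue (or ROAS) arrays for each version; the 10,000 resamples and 95% interval are conventional choices, not requirements:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def two_test_protocol(rev_a: np.ndarray, rev_b: np.ndarray,
                      n_boot: int = 10_000, seed: int = 0) -> dict:
    """Compare per-user revenue (e.g., D1 ROAS) between two versions."""
    # Test 1: Mann-Whitney U -- does one version consistently rank higher?
    # Rank-based, so a single whale cannot dominate the result.
    _, p_rank = mannwhitneyu(rev_a, rev_b, alternative="two-sided")

    # Test 2: bootstrap 95% CI for the difference in means -- does
    # whale-inclusive revenue per user actually differ?
    rng = np.random.default_rng(seed)
    diffs = [rng.choice(rev_b, rev_b.size).mean() -
             rng.choice(rev_a, rev_a.size).mean() for _ in range(n_boot)]
    ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

    return {"mann_whitney_p": p_rank, "mean_diff_ci95": (ci_low, ci_high)}
```

If the rank test is significant but the mean-difference interval straddles zero (or the reverse), report both results: consistency and whale-driven magnitude are separate questions.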

3. The Version-Based Measurement Framework

Core Principles

  1. Version is the Unit of Analysis - Not days, not weeks, not campaigns
  2. Hypothesis-First Development - Every release has a target KPI and expected lift
  3. Minimum Viable Data - Don't spend money on data you don't need
  4. Clean Cohort Separation - Never mix users from different versions in analysis

The Release → Measure → Learn Cycle

1. Release Planning

  • Define target KPI (one primary, max 2 secondary)
  • State hypothesis: "Feature X will improve KPI Y by Z%"
  • Calculate required sample size
  • Estimate days to reach sample size

2. Release Phase

  • Deploy with clear version number
  • Force update: ensure 95%+ on target version
  • Tag all events with client version
  • Wait 5-7 days for ad network re-learning

3. Measurement Phase

  • Wait for sample size (NOT arbitrary time)
  • Run for FULL WEEK CYCLES (7, 14, 21 days)
  • Pull cohort data by version, not by date (see the sketch after this cycle)
  • Run appropriate statistical test

4. Learn Phase

  • Hypothesis confirmed, rejected, or inconclusive?
  • Update internal benchmarks
  • Archive: version → hypothesis → result → learning
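
A minimal pandas sketch of step 3's version-based cohort pull; the table and column names (app_version, d1_retained) are hypothetical stand-ins for whatever your analytics export provides:

```python
import pandas as pd

# Hypothetical install-level export; column names are illustrative.
installs = pd.DataFrame({
    "user_id":     [1, 2, 3, 4, 5, 6],
    "app_version": ["1.4.0", "1.4.0", "1.4.0", "1.5.0", "1.5.0", "1.5.0"],
    "d1_retained": [1, 0, 1, 1, 1, 0],
})

# The cohort key is the version the user installed on -- never the install date.
by_version = installs.groupby("app_version")["d1_retained"].agg(["sum", "count"])
by_version["d1_retention"] = by_version["sum"] / by_version["count"]
print(by_version)
```
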
Calculator 2: Test Duration Estimator

Estimate the days needed to reach significance, rounded up to full week cycles, given the required installs per version from the Sample Size Calculator.
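
A minimal sketch of the estimate, assuming a steady daily install rate as the second input and treating the 7-day ad network learning buffer (discussed below) as part of the window:

```python
from math import ceil

def days_to_significance(required_installs: int, installs_per_day: int,
                         learning_buffer_days: int = 7) -> int:
    """Days until the per-version sample size is reached, rounded up to
    full week cycles (7, 14, 21, ...)."""
    raw_days = learning_buffer_days + ceil(required_installs / installs_per_day)
    return ceil(raw_days / 7) * 7

days_to_significance(required_installs=5000, installs_per_day=400)   # -> 21
```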


Spend Optimization: Minimum Viable Data

Every dollar spent before reaching sample size is buying noise. Every dollar spent AFTER reaching sample size on a failing build is waste.

| Sample Size Status | Build Performance | Action |
| --- | --- | --- |
| Below target | Unknown | Continue spending, maintain level |
| Below target | Clearly worse | STOP immediately, iterate |
| At target | Significantly better | Scale spend, ship next version |
| At target | Not different | Hold spend, ship next version |
| Above target | Any | You're wasting money on data you have |
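
The same decision logic as a small sketch (the status and verdict labels simply mirror the table rows):

```python
def spend_action(status: str, verdict: str) -> str:
    """Spend decision per the table above.
    status:  'below_target', 'at_target', 'above_target'
    verdict: 'unknown', 'clearly_worse', 'significantly_better', 'not_different'"""
    if status == "above_target":
        return "Stop: you're wasting money on data you already have"
    if status == "below_target":
        return ("STOP immediately, iterate" if verdict == "clearly_worse"
                else "Continue spending, maintain level")
    if verdict == "significantly_better":
        return "Scale spend, ship next version"
    return "Hold spend, ship next version"
```
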
Calculator 3: Spend Optimizer

Calculate optimal spend including 7-day ad network learning buffer
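
A minimal sketch of the budget arithmetic, assuming the calculator's inputs are CPI, daily install volume, and the required installs per version; the 7-day learning window is treated as sunk cost, per the learning-reset discussion below:

```python
def optimal_spend(cpi: float, installs_per_day: int, required_installs: int,
                  learning_buffer_days: int = 7) -> dict:
    """Budget to reach the per-version sample size, learning buffer included."""
    # Note: CPI typically runs 30-50% higher during the learning window (see below);
    # a flat CPI keeps this sketch simple.
    buffer_installs = learning_buffer_days * installs_per_day
    return {
        "learning_buffer_spend": round(cpi * buffer_installs, 2),
        "measurement_spend":     round(cpi * required_installs, 2),
        "total_spend":           round(cpi * (buffer_installs + required_installs), 2),
    }

optimal_spend(cpi=2.50, installs_per_day=400, required_installs=5000)
```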


Ad Network Learning Reset

Modern UA platforms (Google UAC, Facebook AEO, TikTok) require 5-7 days to "learn" optimal targeting. When you release a new binary, some platforms reset optimization history.

| Days Post-Release | Network State | CPI | User Quality |
| --- | --- | --- | --- |
| Days 1-3 | Learning | +30-50% | Variable |
| Days 4-7 | Stabilizing | Normalizing | Improving |
| Days 8+ | Optimized | Baseline | Consistent |
The Hidden Cost

True Measurement Window = Stated Window + 7-day Learning Buffer

First 7 days post-release should be excluded from analysis or budgeted as sunk cost.

4. Prerequisites Checklist

  • Version Tracking (P0, Critical)
  • Forced Update Mechanism (P0, Critical)
  • Analytics Event Quality (P0, Critical)
  • UA Spend Discipline (P1)
  • Analysis Capabilities (P2)

5. Common Mistakes and How to Avoid Them

The "It's Been Two Weeks" Fallacy

Mistake: "We've been running for two weeks, time to analyze"
Reality: Time elapsed ≠ sample size achieved
Fix: Wait for sample size, not arbitrary time periods

The "Looks Better" Trap

Mistake: "D1 retention went from 32% to 35%, we improved!"
Reality: Without statistical test, this could be noise
Fix: Always run the appropriate test, report p-value AND confidence interval

The "Whale Distortion" Problem

Mistake: Using mean ROAS to compare versions
Reality: One whale can shift mean by 50%+ in small samples
Fix: Use Two-Test Protocol: Mann-Whitney U (consistency) + Bootstrap Mean (magnitude)

The "Mixed Cohort" Error

Mistake: Analyzing all users from "last 30 days"
Reality: Multiple versions released in 30 days, cohorts contaminated
Fix: Always segment by install version, never by date alone

The "Day-of-Week" Blindspot

Mistake: Version A runs Mon-Wed, Version B runs Thu-Sat
Reality: Weekend retention/engagement is typically 10-20% higher
Fix: Always measure in full week cycles (7, 14, 21 days)

Red Flags to Watch For

"Installs per day" as significance metric
Analyzing by date instead of by version
Using mean ONLY for revenue comparisons
Stopping test before sample size reached
Measurement windows not in full week cycles
Judging performance in first 7 days post-release
Mixed cohorts (40%+ users still on old version)

6. All Calculators

These calculators implement the version-based measurement framework. Use them in sequence: Sample Size → Duration → Spend → Test Results.

Calculator 4: Statistical Test Runner

Run the appropriate test for your data, entering values as comma-separated lists. For ROAS/revenue, the runner uses the Two-Test Protocol.
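
For the proportion KPIs, a minimal two-proportion z-test sketch that reports both the p-value and a 95% confidence interval (the continuous path would reuse the Two-Test Protocol shown earlier):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for proportion KPIs (CVR, FTUE completion, D1 retention)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled
    p_value = 2 * norm.sf(abs(z))                                   # two-sided p-value
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)   # unpooled SE for the CI
    ci_95 = (p_b - p_a - 1.96 * se_diff, p_b - p_a + 1.96 * se_diff)
    return z, p_value, ci_95

# Example: D1 retention 35.0% (old build) vs 37.0% (new build), 5,000 installs each
two_proportion_ztest(1750, 5000, 1850, 5000)
```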


Calculator 5: Seasonality Normalizer

Adjust for market-wide shifts using a control metric (e.g., organic retention)
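
One possible normalization, assuming a simple ratio adjustment against a control metric the release should not affect (e.g., organic D1 retention); this is a sketch of the idea, not necessarily the exact formula the calculator applies:

```python
def seasonality_adjusted_lift(kpi_prev: float, kpi_new: float,
                              control_prev: float, control_new: float) -> float:
    """Residual version-over-version KPI lift after removing the market-wide
    shift observed in a control metric the release should not have touched."""
    market_shift = control_new / control_prev        # e.g., 1.05 = market up 5%
    expected_kpi = kpi_prev * market_shift           # what "no real change" would look like
    return (kpi_new - expected_kpi) / expected_kpi   # lift attributable to the build

# Paid D1 retention moved 32% -> 35%, but organic D1 also moved 30% -> 31.5%
seasonality_adjusted_lift(0.32, 0.35, 0.30, 0.315)   # ~0.042, i.e. ~4.2% residual lift
```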


Quick Reference: Sample Sizes (80% power, 95% CI, 10% lift)

| KPI | Baseline | Required Sample per Version |
| --- | --- | --- |
| App Store CVR | 5% | ~30,000 page views |
| FTUE Completion | 50% | ~4,000 installs |
| D1 Retention | 35% | ~5,000 installs |
| D7 Retention | 15% | ~11,000 installs |
| D1 ROAS (median) | varies | ~2,000 installs |