KPI Measurement Guide
The Version-Based Approach to Statistical Significance in Mobile Gaming
Everyone measures statistical significance by "installs per day." This is fundamentally wrong. The correct frame: measure by installs per app version to answer "Is this build different from the prior build?"
The Problem with "Installs Per Day"
What the Industry Gets Wrong
"You need 1,000+ installs per day to reach statistical significance"
Why It's Wrong:
- Conflates time and version - Days pass regardless of code changes
- Ignores cohort contamination - Users from different versions mixed together
- Assumes stable external factors - Ad networks, seasonality, competition all change daily
- No causal attribution - Can't answer "did the change work?" only "did numbers go up?"
The Version-Based Mental Model
Each app version is a discrete experiment. The independent variable is the VERSION, not the calendar date.
Framework Applicability
| Stage | Recommended Approach | Why |
|---|---|---|
| Pre-Series A | Version-Based (this framework) | Simple to implement, no backend infra required |
| Series B+ | Parallel A/B Testing (Feature Flags) | Eliminates time bias, requires Remote Config infra |
Version-based measurement is "Sequential Cohort Analysis"—better than day-based, but still vulnerable to external factors. For mature games, parallel A/B testing via feature flags is the gold standard. This framework is optimized for early-stage teams shipping whole binaries.
The Math Actually Works Differently by KPI
KPI Categories and Statistical Properties
| KPI | Type | Test | Key Consideration |
|---|---|---|---|
| App Store CVR | Proportion | Z-test | Very low base rate (2-8%) |
| FTUE Completion | Proportion | Z-test | Medium base rate (30-70%) |
| D1 Retention | Proportion | Z-test | Medium base rate (25-45%) |
| D1 ROAS | Continuous | Mann-Whitney U + Bootstrap | High variance, whales distort mean |
| ARPDAU | Continuous | Mann-Whitney U + Bootstrap | Non-normal, need robust methods |
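For the proportion KPIs above (CVR, FTUE completion, D1 retention), the z-test in the table is a pooled two-proportion test. A minimal sketch, assuming you already have converter and total counts per version; the counts below are illustrative:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pooled two-proportion z-test: did the proportion change between versions?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided
    return p_a, p_b, z, p_value

# Hypothetical D1 retention counts for version 1.4.0 vs 1.5.0
p_a, p_b, z, p = two_proportion_ztest(conv_a=1750, n_a=5000, conv_b=1900, n_b=5000)
print(f"D1 retention {p_a:.1%} -> {p_b:.1%}, z={z:.2f}, p={p:.4f}")
```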
Sample Size Requirements Vary Dramatically
| KPI | Baseline | Detect 10% Lift | Detect 20% Lift |
|---|---|---|---|
| App Store CVR | 5% | ~30,000 page views/version | ~7,500 page views/version |
| FTUE Completion | 50% | ~4,000 installs/version | ~1,000 installs/version |
| D1 Retention | 35% | ~5,000 installs/version | ~1,250 installs/version |
App Store Conversion requires 6-8x more volume than D1 Retention to detect the same relative lift because of the lower base rate.
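If you need requirements for your own baselines, the standard two-proportion sample size formula is a reasonable sketch (80% power, two-sided 5% alpha, as in the quick reference at the end; the figures in the tables here are rounded, so expect ballpark rather than exact agreement):

```python
from math import ceil
from scipy.stats import norm

def n_per_version(baseline: float, relative_lift: float,
                  power: float = 0.80, alpha: float = 0.05) -> int:
    """Minimum sample per version to detect a relative lift in a proportion KPI
    (two-sided test, equal allocation across versions)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(n_per_version(0.05, 0.10))  # App Store CVR at a 5% baseline, 10% relative lift
print(n_per_version(0.35, 0.10))  # D1 retention at a 35% baseline, 10% relative lift
```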
The ROAS Problem: Why Means Lie
ROAS is not normally distributed. A single whale can shift the mean dramatically.
Don't reach for a t-test comparing mean ROAS between versions. Use two complementary tests instead:
- Mann-Whitney U: tests if one version consistently outperforms (rank-based, robust to outliers)
- Bootstrap Mean CI: tests if total revenue differs (captures whale impact)
Mann-Whitney U tests rank, not magnitude. Version A could have better ranks (more consistent minnow conversion) while Version B has higher mean revenue (better whale monetization). You need both tests!
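A sketch of running both tests on per-install D1 revenue; the arrays below are synthetic stand-ins for one revenue value per install per version:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)

# Illustrative per-install D1 revenue for two versions:
# mostly zeros and small payers, plus a handful of whales.
rev_a = np.concatenate([np.zeros(4600), rng.exponential(2.0, 390), rng.uniform(50, 300, 10)])
rev_b = np.concatenate([np.zeros(4550), rng.exponential(2.2, 442), rng.uniform(50, 300, 8)])

# 1) Mann-Whitney U: does one version consistently rank higher? (robust to whales)
u_stat, p_rank = mannwhitneyu(rev_a, rev_b, alternative="two-sided")

# 2) Bootstrap 95% CI on the difference in mean revenue per install (captures whale impact)
boot = [rng.choice(rev_b, rev_b.size, replace=True).mean()
        - rng.choice(rev_a, rev_a.size, replace=True).mean()
        for _ in range(5000)]
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"Mann-Whitney p={p_rank:.3f}; mean diff 95% CI [{lo:.3f}, {hi:.3f}]")
```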
The Version-Based Measurement Framework
Core Principles
- Version is the Unit of Analysis - Not days, not weeks, not campaigns
- Hypothesis-First Development - Every release has a target KPI and expected lift
- Minimum Viable Data - Don't spend money on data you don't need
- Clean Cohort Separation - Never mix users from different versions in analysis
The Release → Measure → Learn Cycle
1. Release Planning
- Define target KPI (one primary, max 2 secondary)
- State hypothesis: "Feature X will improve KPI Y by Z%"
- Calculate required sample size
- Estimate days to reach sample size
2. Release Phase
- Deploy with clear version number
- Force update: ensure 95%+ of active users are on the target version
- Tag all events with client version
- Wait 5-7 days for ad network re-learning
3. Measurement Phase
- Wait for sample size (NOT arbitrary time)
- Run for FULL WEEK CYCLES (7, 14, 21 days)
- Pull cohort data by version, not by date (see the sketch after this cycle)
- Run appropriate statistical test
4. Learn Phase
- Hypothesis confirmed, rejected, or inconclusive?
- Update internal benchmarks
- Archive: version → hypothesis → result → learning
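A sketch of the measurement pull referenced above, assuming an installs table with a client_version column; the DataFrame, column, and version names are illustrative, not a prescribed schema:

```python
import pandas as pd

# Illustrative per-install rows; in practice this comes from your analytics export,
# with every row tagged with the client version at install time.
installs = pd.DataFrame({
    "user_id":        ["u1", "u2", "u3", "u4", "u5", "u6"],
    "client_version": ["1.4.0", "1.4.0", "1.4.0", "1.5.0", "1.5.0", "1.5.0"],
    "retained_d1":    [1, 0, 1, 1, 1, 0],
})

# Clean cohort separation: group by version, never by calendar date.
cohorts = installs[installs["client_version"].isin(["1.4.0", "1.5.0"])]
summary = (cohorts.groupby("client_version")
                  .agg(installs=("user_id", "nunique"),
                       d1_retained=("retained_d1", "sum")))
summary["d1_retention"] = summary["d1_retained"] / summary["installs"]
print(summary)
# Feed the per-version counts into the two-proportion z-test sketch shown earlier.
```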
Spend Optimization: Minimum Viable Data
Every dollar spent before reaching sample size is buying data you can't yet draw conclusions from. Every dollar spent AFTER reaching sample size on a failing build is waste.
| Sample Size Status | Build Performance | Action |
|---|---|---|
| Below target | Unknown | Continue spending, maintain level |
| Below target | Clearly worse | STOP immediately, iterate |
| At target | Significantly better | Scale spend, ship next version |
| At target | Not different | Hold spend, ship next version |
| Above target | Any | You're paying for data you already have |
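The decision table is simple enough to encode directly; a sketch (the labels follow the table, the function itself is illustrative):

```python
def spend_action(sample_reached: bool, over_target: bool, performance: str) -> str:
    """Map (sample-size status, observed build performance) to a spend decision,
    following the table above. `performance` is one of:
    'unknown', 'clearly_worse', 'significantly_better', 'not_different'."""
    if over_target:
        return "Cut back: you're paying for data you already have"
    if not sample_reached:
        if performance == "clearly_worse":
            return "Stop spend immediately and iterate"
        return "Continue spending at the current level"
    if performance == "significantly_better":
        return "Scale spend and ship the next version"
    return "Hold spend and ship the next version"

print(spend_action(sample_reached=False, over_target=False, performance="unknown"))
```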
Ad Network Learning Reset
Modern UA platforms (Google UAC, Facebook AEO, TikTok) require 5-7 days to "learn" optimal targeting. When you release a new binary, some platforms reset optimization history.
| Days Post-Release | Network State | CPI | User Quality |
|---|---|---|---|
| Days 1-3 | Learning | +30-50% | Variable |
| Days 4-7 | Stabilizing | Normalizing | Improving |
| Days 8+ | Optimized | Baseline | Consistent |
True Measurement Window = Stated Window + 7-day Learning Buffer
The first 7 days post-release should either be excluded from the analysis or budgeted as a sunk cost.
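A sketch of turning a required sample size into a planned measurement window, assuming a steady daily install rate; it rounds the collection period up to full week cycles and prepends the 7-day learning buffer:

```python
from math import ceil

def measurement_window_days(required_installs: int, installs_per_day: int,
                            learning_buffer_days: int = 7) -> int:
    """Days to keep a version live: learning buffer + sample collection,
    rounded up to full week cycles so the day-of-week mix stays balanced."""
    collection_days = ceil(required_installs / installs_per_day)
    full_weeks = ceil(collection_days / 7) * 7
    return learning_buffer_days + full_weeks

# e.g. ~5,000 installs needed for D1 retention at ~400 installs/day
print(measurement_window_days(5000, 400))  # 7 + 14 = 21 days
```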
Prerequisites Checklist
| Prerequisite | Priority |
|---|---|
| Version Tracking | P0 (Critical) |
| Forced Update Mechanism | P0 (Critical) |
| Analytics Event Quality | P0 (Critical) |
| UA Spend Discipline | P1 |
| Analysis Capabilities | P2 |
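Version Tracking and Analytics Event Quality come down to stamping every event with the client version at send time; a minimal sketch of such a payload (field names are illustrative, not any specific SDK's schema):

```python
import time

def build_event(name: str, user_id: str, client_version: str, **params) -> dict:
    """Every analytics event carries the client version so cohorts can be
    split by version, not by date."""
    return {
        "event": name,
        "user_id": user_id,
        "client_version": client_version,  # e.g. read from the app bundle at runtime
        "ts": int(time.time()),
        "params": params,
    }

print(build_event("ftue_completed", user_id="u_123", client_version="1.5.0", step=12))
```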
Common Mistakes and How to Avoid Them
The "It's Been Two Weeks" Fallacy
The "Looks Better" Trap
The "Whale Distortion" Problem
The "Mixed Cohort" Error
The "Day-of-Week" Blindspot
Red Flags to Watch For
All Calculators
These calculators implement the version-based measurement framework. Use them in sequence: Sample Size → Duration → Spend → Test Results.
Quick Reference: Sample Sizes (80% power, 95% confidence, 10% relative lift)
| KPI | Baseline | Required Sample per Version |
|---|---|---|
| App Store CVR | 5% | ~30,000 page views |
| FTUE Completion | 50% | ~4,000 installs |
| D1 Retention | 35% | ~5,000 installs |
| D7 Retention | 15% | ~11,000 installs |
| D1 ROAS (median) | varies | ~2,000 installs |
