F2P Marketing Experimentation System
A rigorous, operator-level guide for designing, executing, and learning from structured marketing experiments across F2P mobile game portfolios.
1. Philosophy & Principles
Why Structured Experimentation Matters for F2P UA
In F2P mobile gaming, the difference between a portfolio company achieving sustainable scale and burning through a Series A is often found in the rigor of their experimentation. Many studios "test" by throwing creatives at a wall and seeing what sticks. This is not experimentation — it is gambling.
F2P games rely on high-volume acquisition to fuel monetization through IAP, ads, and subscriptions. Without a disciplined experimentation system, teams waste budgets on unproven assumptions, scale losers prematurely, and repeat the same mistakes quarter after quarter.
Why This Matters
In F2P UA, where cohort maturity takes weeks to reveal true LTV, ad platforms reward precision. Poor testing leads to creative fatigue within days, inflating CPI by 20-50%. Studios that run 4+ high-quality experiments per month will eventually find the local maximum for their game. Studios that rely on "creative genius" will eventually run out of ideas — and cash.
The "Testing" vs. "Analyzing" Distinction
This guide focuses exclusively on Testing — the prospective, controlled application of variables to measure causal impact on future performance.
| Dimension | Testing (This Guide) | Analyzing (Guide 2) |
|---|---|---|
| Orientation | Prospective | Retrospective |
| Method | Controlled experiments | Observational analysis |
| Output | Causal inference | Correlative insight |
| Example | "If we change the first 3 seconds of this video, how does D1 ROAS change?" | "Why did our ROAS drop last Tuesday?" |
Core Principles
- Isolate One Variable: If you change the creative AND the targeting simultaneously, you have learned nothing. Multi-variable tests require factorial designs that demand 4x the sample size.
- Statistical Rigor Before Decisions: We do not call winners based on "gut feel" or "early trends." We wait for the minimum sample size and requisite confidence level.
- Document Failures: Negative results are as valuable as positive ones. An undocumented failure guarantees someone will repeat it.
- Platform Awareness: Treat iOS and Android as two different games. Test platforms separately to avoid confounding.
F2P-Specific Considerations
- Cohort Maturity: Early signals (CPI, CTR) are often decoupled from long-term value. Wait for full attribution windows — 7-day click minimum, 28-day for LTV signals. Immature cohorts overestimate uplift by 10-20%.
- Attribution Windows: iOS attribution is transitioning from SKAdNetwork (SKAN) to AdAttributionKit (AAK). Both create elongated feedback loops. Android offers fuller attribution but carries 10-15% invalid traffic risk.
- The Power Law of Creatives: 95% of creatives fail. The top creative typically drives 30-40% of all conversions. Your system must be designed to fail fast and scale the 5% that work.
- Platform Behavior Differences: iOS users skew premium (higher ARPU, 5-10% better retention). Android volumes are roughly 2x but CPI is 20% lower. Never combine platforms in a single test arm.
| Principle | F2P Application | Risk of Violation |
|---|---|---|
| One Variable | Test hook text only, not entire creative | Confounded results; 2x test duration wasted |
| Statistical Rigor | Minimum 300 conversions per variant | False positives; wasted scale budget |
| Cohort Maturity | 28-day window for LTV metrics | Overestimate ROAS by 15-25% |
| Platform Split | Separate iOS/Android test arms | Skewed metrics; Android fraud inflates installs |
| Document Failures | Log every negative result | Same paywall flop tested twice, 6 months apart |
2. Test Taxonomy (MECE)
Every experiment must fall into exactly one of the following ten mutually exclusive categories. For each, we define the hypothesis template, key metrics, minimum sample sizes (80% power, 5% significance), typical duration, and success criteria.
2.1 Creative Testing
Definition: Testing visual and auditory ad assets — static vs. video vs. UGC, hook variants (first 3 seconds), thumbnail A/B variants, format variations.
Hypothesis Template: "If we replace the 'cinematic reveal' hook with a 'fail gameplay' hook, then CTR will increase by 20% because it triggers a 'fix-it' psychological impulse in casual puzzle players."
Key Metrics: Primary: CTR, IPM, Install Rate. Guardrails: CPI, D1 Retention, D1 ROAS.
Sample Sizes: CTR baseline 1-5% (median ~2.5%). For 10% relative MDE: ~78,000 impressions/variant. For 20%: ~20,000.
Duration: 3-7 days. Monitor for minimum 50 conversions before any directional read.
Success Criteria: CTR uplift >15% AND no degradation in D1 retention >5%. Scale if CPI improves or holds. Kill if CPI spikes >20%.
Creatives account for 30-40% of conversion variance in F2P UA. The top 20% of creatives drive 80% of conversions. Weekly testing of 10-20 variants is essential to combat fatigue and find the next winner.
2.2 Audience & Targeting Testing
Definition: Testing who sees the ads — LAL seed quality, interest stacking, broad vs. narrow targeting, geographic expansion.
Hypothesis Template: "If we move from a 1% LAL based on 'Installs' to a 5% LAL based on 'Highest IAP Spenders,' then D7 ROAS will increase by 15%."
Key Metrics: Primary: Install Rate, Trial Completion, D7 ROAS. Guardrails: CPI, D1 Retention, ARPU.
Sample Sizes: Install-to-trial baseline 5-15% (median ~10%). For 10% MDE: ~15,500 users/arm. For 20%: ~3,875.
Duration: 7-14 days. Geo expansions need longer for fraud detection.
Success Criteria: 15%+ trial uplift AND CPI < baseline + 10%. Guard: No D30 retention drop >3%.
2.3 Bid Strategy Testing
Definition: Testing the algorithm's optimization goal and price ceiling — tCPA, tROAS, MAI, bid caps, budget allocation.
Hypothesis Template: "If we switch from tCPA at $3.00 to tROAS at 120%, then D30 ROAS will improve by 25% because the algorithm will prioritize high-value conversions."
Key Metrics: Primary: ROAS (D7/D30), Effective CPI. Guardrails: Install Volume, Paywall CVR, Budget Utilization.
Sample Sizes: ROAS variance is high; require 500 conversions per arm minimum for 15% relative MDE.
Duration: 14-28 days. LTV signals need time to stabilize past the platform learning phase.
Success Criteria: ROAS improvement >1.2x baseline AND volume stable within +/-10%. The algorithm must be able to spend the full daily budget — a "better ROAS" at 30% of prior volume is not a win.
2.4 Campaign Structure Testing
Definition: Testing architectural setup within the ad manager — consolidation vs. segmentation, AEO vs. VO, CBO vs. ABO.
Hypothesis Template: "If we consolidate 5 separate ad sets into one broad CBO structure, the algorithm will find a lower blended CPI through increased liquidity."
Key Metrics: Primary: CPI, Scale Efficiency, CPM Stability. Guardrails: ROAS, Retention, Learning Phase duration.
Sample Sizes: 1,000 installs per structure for 10% relative MDE on CPI.
Duration: 7-21 days. Account for platform learning phase (3-5 days) before measurement.
Success Criteria: 10% CPI reduction AND ROAS stable. Guard: D1 retention stays above 25%.
Poor campaign structure halves efficiency. Over-segmentation starves algorithms of data; over-consolidation dilutes signal. Below $500/day, consolidation almost always wins.
2.5 Paywall & Pricing Testing
Definition: Testing the monetization gate and price elasticity — price points, trial lengths, discount depth, paywall timing/placement, copy/design.
Hypothesis Template: "If we offer a $4.99 weekly subscription with a 3-day trial instead of $9.99 monthly, trial-to-paid conversion will increase by 25%."
Key Metrics: Primary: Paywall CVR, Trial Start Rate, Trial-to-Paid. Guardrails: D7 Retention, ARPU, Churn, Total Net Revenue.
Sample Sizes: Paywall CVR baseline 2-5% (median ~3%). For 20% MDE at 5% base: ~7,850 users/variant. One of the most sample-hungry test types.
Duration: 14-28 days. Trial length directly extends measurement.
Success Criteria: CVR improvement that translates to higher total net revenue (accounting for platform fees and churn). Guard: D7 retention drop <5%.
2.6 Onboarding Flow Testing
Definition: Testing FTUE up to first meaningful action — step count, personalization depth, value proposition messaging, commitment devices.
Hypothesis Template: "If we reduce onboarding from 5 to 3 steps by deferring account creation, D1 retention will increase by 10%."
Key Metrics: Primary: Completion Rate, Time to First Play, D1 Retention. Guardrails: Trial Start Rate, D7 Retention.
Sample Sizes: Completion baseline 45-55%. For 10% MDE: ~600 users/flow.
Duration: 3-7 days. D1 retention results arrive within 48 hours.
Success Criteria: Completion >50% AND D1 retention >28%. Guard: no drop in downstream conversion.
2.7 Monetization Surface Testing
Definition: Testing placement and frequency of revenue touchpoints — energy/lives cap, ad load, IAP placement, subscription vs. IAP mix, rewarded video.
Hypothesis Template: "If we lower the energy cap from 10 to 5 but add a rewarded video option, ARPDAU will increase by 15%."
Key Metrics: Primary: ARPU, ARPDAU, Ad Revenue/User, IAP CVR. Guardrails: Session Length, D7 Retention, Churn.
Sample Sizes: ARPU has high variance; 1,000 users/arm minimum for 15% MDE.
Duration: 21-42 days. Revenue cycles are long; IAP shows delayed effects.
Success Criteria: ARPU increase >10% AND churn increase <5%. Guard: session length and D7 retention stable.
Energy Systems Are Dual-Purpose
Energy/lives systems are simultaneously monetization mechanics AND retention mechanics. A lower cap increases monetization pressure but also gates content, which can improve session pacing. If fewer than 5% of users ever recharge energy, the cap is too high — it is not functioning as a monetization surface.
2.8 Retention Mechanic Testing
Definition: Testing engagement loops — push notification timing/content, streak systems, daily reward calibration, re-engagement campaigns.
Hypothesis Template: "If we send a push at the user's historical peak play time instead of fixed 9:00 AM, D7 retention will improve by 5%."
Key Metrics: Primary: D7/D14/D30 Retention, Re-engagement Rate, Push Open Rate. Guardrails: Uninstall Rate, Push Opt-out Rate.
Sample Sizes: D1 retention baseline 26-28%. For 10% MDE: ~2,000 users/arm.
Duration: 7-30 days. D30 retention obviously requires 30 days of measurement.
Success Criteria: Retention improvement >10% AND opt-in >40% iOS. Guard: no increase in uninstall rate.
2.9 Landing Page / App Store (ASO) Testing
Definition: Testing App/Play Store presence — screenshot order, preview video, description copy, icon variants, custom product pages.
Hypothesis Template: "If we use screenshots featuring multiplayer elements instead of solo play, store CVR will increase by 12%."
Key Metrics: Primary: Page View to Install CVR, Impressions-to-Install Rate. Guardrails: Keyword Rankings, Paid Cannibalization Rate.
Sample Sizes: Organic CVR baseline 25-30%. For 10% MDE: ~2,000 page views/variant.
Duration: 14-30 days. ASO changes ramp slowly as store algorithms adjust.
Success Criteria: CVR improvement >10% AND rankings stable. Guard: no cannibalization of paid installs.
2.10 LiveOps & Seasonal Event Testing
Definition: Testing time-limited content parameters — event reward structures, durations, Battle Pass pricing/tiers, LTOs, seasonal cadence.
Hypothesis Template: "If we shorten the seasonal event from 14 to 7 days with the same reward pool, participation will increase by 20% due to urgency."
Key Metrics: Primary: Event Participation, Completion Rate, Event Revenue/Participant, Battle Pass Attach Rate. Guardrails: Post-event D7 Retention, Non-Event Session Length, Post-event Churn.
Sample Sizes: Participation baseline 15-30% (median ~20% DAU). For 15% MDE: ~3,500 DAU/variant. Battle Pass attach (5-8% baseline): ~6,000 DAU/variant.
Duration: One full event cycle + 7-day post-event observation. Weekly events: 14 days. Monthly events: 37 days.
Success Criteria: Event revenue/participant increase >15% AND post-event D7 retention drop <3%. Guard: non-event session length stable.
LiveOps Key Considerations
Reward Calibration: Over-generous rewards devalue the core economy; under-generous produce low participation. Test reward tiers independently from magnitude.
Duration vs. Intensity: Shorter events increase urgency but exclude casual players. Test duration separately from reward structure.
Battle Pass Pricing: At $4.99, attach rates of 8-12% are common; at $9.99, 3-5%. Always test — total revenue depends on game-specific price sensitivity.
LTO Frequency: More than 2 LTOs/week creates "deal fatigue." Test cadence with full-price IAP revenue as a guardrail.
Test Taxonomy Summary
| Category | ID Code | Primary Metric | Duration | Min. Conversions |
|---|---|---|---|---|
| Creative | CREATIVE | CTR, IPM | 3-7 days | 300+/variant |
| Audience/Targeting | AUD | Install Rate, D7 ROAS | 7-14 days | 500+/variant |
| Bid Strategy | BID | ROAS, CPI | 14-28 days | 500+/variant |
| Campaign Structure | CAMP | CPI, Scale Efficiency | 7-21 days | 1,000+ installs |
| Paywall/Pricing | PAYWALL | Paywall CVR, Trial-to-Paid | 14-28 days | 800+/variant |
| Onboarding Flow | ONBOARD | Completion Rate, D1 Ret. | 3-7 days | 600+/variant |
| Monetization Surface | MONET | ARPU, ARPDAU | 21-42 days | 1,000+/variant |
| Retention Mechanic | RETAIN | D7/D30 Retention | 7-30 days | 2,000+/variant |
| Landing Page / ASO | ASO | Store CVR | 14-30 days | 2,000+ views |
| LiveOps / Seasonal | LIVEOPS | Event Revenue, Attach Rate | 14-37 days | 3,500+ DAU |
3. Statistical Framework (F2P-Specific)
Why This Section Exists
We have seen numerous portfolio companies scale "winning" creatives based on 50 conversions, only to see ROAS collapse at scale. The "winner" was a statistical fluke — regression to the mean is inevitable with small samples.
The Base-Rate-Aware Sample Size Formula
To calculate the number of users required per variant (n):
n = 2 × (Zα/2 + Zβ)² × p(1 − p) / d²
Where: Zα/2 = 1.96 (95% confidence), Zβ = 0.84 (80% power), p = baseline rate, d = absolute MDE. For the small base rates typical of UA metrics, p(1 − p) ≈ p.
Base Rate Matters Enormously
A 10% relative lift on a 2% base rate (d = 0.002) requires 16x more samples than a 10% relative lift on a 30% base rate (d = 0.03). Low-base-rate metrics like CTR and paywall CVR require massive sample sizes.
Worked example: A puzzle RPG has paywall CVR of 5%. The team wants to detect a 20% relative improvement (5% → 6%). Absolute MDE d = 0.01. Plugging in: n ≈ 2 × (2.8)² × 0.05 × 0.95 / 0.01² ≈ 7,450 users per variant (the lookup table below lists ~7,850, a slightly more conservative figure).
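A minimal sketch of this calculation in Python, using the standard two-proportion approximation (the `required_n` helper is illustrative, not part of any tooling described here); results land in the same range as the lookup table below, with small differences from rounding choices:

```python
from math import ceil

def required_n(p: float, relative_mde: float,
               z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Users (or impressions) needed per variant to detect a relative lift on a
    baseline rate p at 95% confidence / 80% power (two-proportion approximation)."""
    d = p * relative_mde                                   # absolute MDE
    return ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / d ** 2)

# Worked example above: 5% paywall CVR, 20% relative MDE (5% -> 6%)
print(required_n(0.05, 0.20))   # ~7,450 per variant
# Low base rates explode the requirement: 2% CTR at the same relative MDE
print(required_n(0.02, 0.20))   # ~19,200 per variant
```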
Sample Size Lookup Table
All values per variant. 95% confidence, 80% power.
| Metric | Base Rate (p) | MDE 10% Rel. | MDE 20% Rel. | MDE 50% Rel. |
|---|---|---|---|---|
| CTR | 2% | 78,400 | 19,600 | 3,140 |
| Paywall CVR | 5% | 31,400 | 7,850 | 1,250 |
| Install-to-Trial | 10% | 15,500 | 3,875 | 620 |
| D1 Retention | 27% | 12,900 | 3,225 | 515 |
| Trial-to-Paid | 30% | 11,650 | 2,910 | 465 |
| Onboarding Completion | 50% | 8,400 | 2,100 | 335 |
iOS Attribution: SKAN 4 and AdAttributionKit (AAK)
iOS attribution is in transition. SKAN 4 remains operational but AdAttributionKit (AAK) is the strategic successor.
| Framework | Multiplier on Base n | Rationale |
|---|---|---|
| Android (full attribution) | 1.0x | Baseline — full deterministic attribution |
| SKAN 4 (high volume) | 1.3x | Crowd anonymity met; fine-grained values but delayed |
| SKAN 4 (low volume) | 1.5-2.0x | Coarse values only; significant measurement noise |
| AAK (early adoption) | 1.3-1.5x | Similar privacy tiers; re-engagement data available but network support varies |
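A small, hypothetical helper for applying these multipliers to Android-baseline sample sizes (ranges from the table are represented by their midpoints):

```python
from math import ceil

ATTRIBUTION_MULTIPLIER = {
    "android": 1.0,            # full deterministic attribution (baseline)
    "skan4_high_volume": 1.3,
    "skan4_low_volume": 1.75,  # midpoint of the 1.5-2.0x range
    "aak_early": 1.4,          # midpoint of the 1.3-1.5x range
}

def ios_adjusted_n(android_n: int, framework: str) -> int:
    """Inflate a sample size computed for full attribution to cover iOS measurement noise."""
    return ceil(android_n * ATTRIBUTION_MULTIPLIER[framework])

print(ios_adjusted_n(7_850, "skan4_low_volume"))   # ~13,738
```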
✓ DO
- Run creative tests on Android first, port winners to iOS
- Use 1.3-1.5x multiplier for iOS ROAS tests
- Ensure MMP supports AAK postbacks
- Configure developer-mode postbacks for test campaigns
✗ DON'T
- Trust iOS creative-to-purchase attribution for small tests
- Use Android sample sizes for iOS tests
- Ignore AAK transition in test infrastructure
- Combine platforms in a single test arm
Sequential Testing and Early Stopping
F2P data accumulates slowly. We employ Bayesian sequential testing with the following stopping rules (a code sketch follows the list):
- Early Stop (Negative): Posterior probability drops below 5% AND ≥30% of planned n reached → stop. Clear loser.
- Early Stop (Positive): Posterior exceeds 95% for three consecutive periods AND ≥50% of planned n → may call early winner.
- Mandatory Cap: Never save more than 20% of planned runtime through early stopping.
- Never Stop Between: If probability is 5-95%, continue to planned n. The "gray zone" is where most false conclusions live.
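A minimal sketch of the approach, assuming a Beta-Binomial model for a conversion-rate test (the function names and the uninformative Beta(1,1) prior are illustrative choices, not a prescribed implementation):

```python
import numpy as np

def prob_treatment_beats_control(conv_c: int, n_c: int, conv_t: int, n_t: int,
                                 draws: int = 100_000, seed: int = 0) -> float:
    """Posterior P(treatment rate > control rate) under Beta(1,1) priors."""
    rng = np.random.default_rng(seed)
    control = rng.beta(1 + conv_c, 1 + n_c - conv_c, draws)
    treatment = rng.beta(1 + conv_t, 1 + n_t - conv_t, draws)
    return float((treatment > control).mean())

def sequential_decision(prob: float, fraction_of_planned_n: float,
                        consecutive_high_periods: int) -> str:
    """Apply the stopping rules above; anything in the gray zone keeps running."""
    if prob < 0.05 and fraction_of_planned_n >= 0.30:
        return "STOP_NEGATIVE"
    if prob > 0.95 and consecutive_high_periods >= 3 and fraction_of_planned_n >= 0.50:
        return "STOP_POSITIVE"
    return "CONTINUE"

p = prob_treatment_beats_control(conv_c=180, n_c=4_000, conv_t=240, n_t=4_000)
print(round(p, 3), sequential_decision(p, fraction_of_planned_n=0.55,
                                       consecutive_high_periods=3))
```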
Multi-Arm Bandit Considerations
| Approach | When to Use | F2P Application |
|---|---|---|
| Fixed-Horizon A/B | Low volume, high stakes | Paywall, monetization changes |
| Multi-Arm Bandit | High volume, many variants | Creative rotation with 10+ variants |
| Explore/Exploit | Need to minimize regret during test | Scale winning creative while testing new hooks |
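For the bandit row, a minimal Thompson-sampling sketch for creative rotation; the allocation gives each variant next-period traffic in proportion to how often its posterior draw wins (names and priors are illustrative):

```python
import numpy as np

def thompson_allocation(conversions: list[int], impressions: list[int],
                        draws: int = 10_000, seed: int = 0) -> np.ndarray:
    """Share of next-period traffic each creative should receive, based on how
    often its Beta posterior produces the highest conversion rate."""
    rng = np.random.default_rng(seed)
    samples = np.column_stack([
        rng.beta(1 + c, 1 + n - c, draws)
        for c, n in zip(conversions, impressions)
    ])
    wins = np.bincount(samples.argmax(axis=1), minlength=len(conversions))
    return wins / draws

# Three creative variants: (conversions, impressions)
print(thompson_allocation([40, 55, 31], [2_000, 2_100, 1_900]))
```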
CUPED: Variance Reduction for Product Tests
How CUPED Works
For product experiments (onboarding, monetization, retention), adjust each user's outcome by their pre-experiment behavior (session count, device type, channel). This removes "noise," reducing required sample sizes by 20-40%. Not applicable to creative tests where users have no pre-experiment relationship.
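A minimal sketch of the adjustment on simulated data (the covariate choice and the simulated relationship are illustrative):

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED adjustment: remove the variance in outcome y that is explained by a
    pre-experiment covariate x (same mean, lower variance, smaller required n)."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(1)
pre_sessions = rng.poisson(5, 5_000).astype(float)        # pre-experiment behavior
revenue = 0.4 * pre_sessions + rng.normal(0, 2, 5_000)    # outcome correlated with it
adjusted = cuped_adjust(revenue, pre_sessions)
print(round(revenue.var(), 2), round(adjusted.var(), 2))  # variance drops after CUPED
```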
Interaction Effects: Avoiding Test Pollution
- Preferred: Audience Isolation. Hash UserIDs to non-overlapping buckets (UserID % 4); see the sketch after this list.
- Acceptable: Sequential Testing. Run Test A to completion, then Test B.
- Last Resort: Document and Model. Run both, document overlap, use ANOVA post-hoc. Accept wider CIs.
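A minimal sketch of the preferred isolation approach; hashing a stable digest of the ID rather than the raw numeric ID is one way to avoid correlation with signup order (the function name is illustrative):

```python
import hashlib

def isolation_bucket(user_id: str, n_buckets: int = 4) -> int:
    """Deterministic, non-overlapping bucket assignment: bucket 0 feeds Test A,
    bucket 1 feeds Test B, and so on. The same user always lands in one bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

print(isolation_bucket("user_12345"), isolation_bucket("user_12345"))  # stable
```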
4. Test Documentation Standard
Every experiment must be documented in a central registry before launch. Undocumented tests repeat failures and waste review time.
Test ID Format
[YYMMDD]-[TYPE]-[NAME]-V[X] (e.g., 260115-CREATIVE-FailGameplay-V1), using the taxonomy ID codes from Section 2.
Required Document Fields
| Section | Contents | Example |
|---|---|---|
| Header | ID, Type, Status, Start/Review Date, Owner | 260115-CREATIVE-FailGameplay-V1, LIVE, Jan 15 |
| Hypothesis | "If [change], then [metric] will [direction] by [magnitude] because [mechanism]" | "If 'fail gameplay' hook, CTR +20% via fix-it impulse" |
| Baseline | Metric values with 95% CI, source, date range | CTR: 2.5% (2.3-2.7%), Google Ads, Dec 1-Jan 14 |
| Control | What stays the same | Standard cinematic ad, US targeting, tCPA $3.00 |
| Treatment | What changes (ONE variable) | Hook replaced with fail-gameplay footage |
| Success Gates | Primary threshold + guardrail constraints | CTR >15% AND CPI ≤+10% AND D1 ret. holds |
| Sample Size | Show the calculation | 2.5% CTR, 20% MDE: n = 19,600/group |
| Max Runtime | Kill date regardless of significance | 14 days OR $10K budget cap |
| Results | Observed values, p-value/posterior, CIs | CTR: 3.1% (2.8-3.4%), p=0.02, Pr=97% |
| Decision | One of five outcomes | SCALE |
| Learnings | What we now know (use Appendix C template) | "Fail-gameplay hooks outperform cinematic for casual puzzle" |
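If the registry is kept in code, an entry might look like the following sketch (field names mirror the table above but are not a prescribed schema):

```python
from dataclasses import dataclass, field

@dataclass
class TestRecord:
    """Minimal registry entry mirroring the required fields above."""
    test_id: str              # [YYMMDD]-[TYPE]-[NAME]-V[X]
    hypothesis: str
    baseline: dict            # metric -> (value, ci_low, ci_high)
    control: str
    treatment: str            # exactly ONE variable changed
    success_gates: dict       # primary threshold + guardrail constraints
    planned_n: int
    max_runtime_days: int
    status: str = "PLANNED"
    results: dict = field(default_factory=dict)
    decision: str = ""
    learnings: str = ""

record = TestRecord(
    test_id="260115-CREATIVE-FailGameplay-V1",
    hypothesis="If 'fail gameplay' hook, CTR +20% via fix-it impulse",
    baseline={"ctr": (0.025, 0.023, 0.027)},
    control="Standard cinematic ad, US targeting, tCPA $3.00",
    treatment="Hook replaced with fail-gameplay footage",
    success_gates={"ctr_uplift_min": 0.15, "cpi_increase_max": 0.10},
    planned_n=19_600,
    max_runtime_days=14,
)
```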
Decision Definitions
| Decision | Meaning | Action |
|---|---|---|
| SCALE | Treatment wins decisively | Roll to 100%, increase budget |
| ADOPT | Treatment is better | Treatment becomes new default |
| EXTEND | Promising but not yet significant | Continue to full n |
| REVERT | Treatment loses or harms guardrails | Return to control immediately |
| INCONCLUSIVE | No detectable difference | Document, archive, redesign test |
5. Test Lifecycle
The lifecycle ensures no test is forgotten, no budget wasted on zombie experiments, and every test produces a documented outcome.
State Workflow
PLANNED
Document complete, peer reviewed, budget allocated, audience isolation confirmed.
LIVE
Spend confirmed for >48 hours, no technical issues, tracking verified.
MEASURING
Wait for target n (sequential monitoring active). Zombie rule: kill if >21 days without reaching n.
SIGNAL
Sample size reached. Compute p-value / Bayesian posterior. Positive (>80%), Negative (<20%), or Neutral (20-80%).
DECISION
Statistical output reviewed, guardrails checked. Outcome: SCALE / ADOPT / REVERT / EXTEND / INCONCLUSIVE.
ARCHIVED
Learnings documented (Appendix C template), stakeholders notified, registry updated.
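A minimal sketch of the workflow as a state machine (kill-switch pauses and reverts are handled by the review process and are not modeled here):

```python
LIFECYCLE_TRANSITIONS = {
    "PLANNED":   {"LIVE"},
    "LIVE":      {"MEASURING"},
    "MEASURING": {"SIGNAL", "DECISION"},   # DECISION directly via the zombie rule (INCONCLUSIVE)
    "SIGNAL":    {"DECISION"},
    "DECISION":  {"ARCHIVED"},
    "ARCHIVED":  set(),
}

def advance(current: str, target: str) -> str:
    """Move a test to the next stage, rejecting transitions the workflow forbids."""
    if target not in LIFECYCLE_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current} -> {target}")
    return target

print(advance("MEASURING", "SIGNAL"))   # OK once target n is reached
```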
Escalation Rules
Kill Switch Triggers
CPI Spike: CPI >40% above baseline in first 48 hours → auto-pause for emergency review.
Budget Breach: Spend exceeds 150% of plan without reaching n → pause and evaluate.
Zombie Rule: Any test in MEASURING for >21 days without target n → kill as INCONCLUSIVE.
Guardrail Breach: Any guardrail degrades beyond threshold → immediate review, likely REVERT.
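A minimal evaluation sketch of these triggers (the metric keys and function name are illustrative, not a production schema):

```python
def kill_switch(metrics: dict, baseline_cpi: float,
                planned_spend: float, planned_n: int) -> list[str]:
    """Evaluate the kill-switch triggers above against a test's current metrics."""
    alerts = []
    if metrics["hours_live"] <= 48 and metrics["cpi"] > 1.40 * baseline_cpi:
        alerts.append("CPI_SPIKE: auto-pause for emergency review")
    if metrics["spend"] > 1.50 * planned_spend and metrics["conversions"] < planned_n:
        alerts.append("BUDGET_BREACH: pause and evaluate")
    if metrics["days_in_measuring"] > 21 and metrics["conversions"] < planned_n:
        alerts.append("ZOMBIE: kill as INCONCLUSIVE")
    if metrics.get("guardrail_breached"):
        alerts.append("GUARDRAIL_BREACH: immediate review, likely REVERT")
    return alerts
```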
Review Cadence
| Cadence | Activity | Participants |
|---|---|---|
| Daily | Automated scanner report (spend, conversions, probability) | UA Analyst (async) |
| Weekly | Status scan of all active tests, flag approaching n or timeline | UA Lead + Analyst |
| Bi-weekly | Deep review: full stats, guardrails, decisions on tests at SIGNAL | Review Committee |
6. Portfolio-Level Test Orchestration
Managing experimentation across 42 portfolio companies requires a prioritization framework that accounts for lifecycle stage, available resources, and test category urgency.
Test Priority Matrix by Company Lifecycle
| Test Category | Pre-PMF (to D90) | Scaling ($50K-$500K/mo) | Mature ($500K+/mo) |
|---|---|---|---|
| ONBOARD | Required | If below median | Only if regressed |
| CREATIVE | Required | Required | Required |
| PAYWALL | If monetization live | Required | Required |
| AUD | Not recommended | Required | Required |
| BID | Not recommended | Required | Required |
| CAMP | Not recommended | If >$200K/mo | Required |
| MONET | Not recommended | If D30 data available | Required |
| RETAIN | If D7 data available | Required | Required |
| ASO | If launched | If organic >20% | Required |
| LIVEOPS | Not recommended | If cadence established | Required |
Resource Conflict Resolution (Priority Stack)
Active Guardrail Breach
Any company with a live test showing guardrail degradation gets immediate analyst attention.
Tests at SIGNAL Stage
Tests that reached target n need statistical review. Delayed decisions waste the budget already spent.
Higher Lifecycle Stage
Mature companies generate more absolute value per test improvement than pre-PMF companies.
Test ROI Score
Use Appendix B calculator to compare expected value when lifecycle stages are equal.
First-Come, First-Served
Tiebreaker when all other factors are equal.
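A sketch of the stack as a sort key (field names and lifecycle labels are hypothetical; adapt to the registry schema):

```python
LIFECYCLE_RANK = {"mature": 2, "scaling": 1, "pre_pmf": 0}

def priority_key(test: dict) -> tuple:
    """Sort key implementing the priority stack above; lower tuples sort first."""
    return (
        not test["guardrail_breach"],          # 1. active guardrail breaches first
        test["status"] != "SIGNAL",            # 2. tests awaiting a decision next
        -LIFECYCLE_RANK[test["lifecycle"]],    # 3. higher lifecycle stage
        -test["roi_score"],                    # 4. higher Test ROI Score (Appendix B)
        test["submitted_order"],               # 5. first-come, first-served tiebreak
    )

tests = [
    {"id": "A", "guardrail_breach": False, "status": "SIGNAL",
     "lifecycle": "scaling", "roi_score": 9_400, "submitted_order": 2},
    {"id": "B", "guardrail_breach": True, "status": "MEASURING",
     "lifecycle": "pre_pmf", "roi_score": 1_000, "submitted_order": 1},
]
print([t["id"] for t in sorted(tests, key=priority_key)])   # ['B', 'A']
```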
7. Automation & Tooling
The experimentation system is supported by a Claude Code automation layer that keeps testing manageable at portfolio scale.
Scanner Data Sources
| Data Source | What It Queries | Test Types Served |
|---|---|---|
| Google Ads | Impressions, clicks, CTR, CPI, conversions | CREATIVE, AUD, BID, CAMP |
| Meta Ads | ROAS by placement, breakdowns by creative | CREATIVE, AUD, CAMP |
| AppsFlyer | Attribution cohorts, D1/D7/D30 retention | All types |
| Stripe / RevenueCat | Subscription events, trial-to-paid, churn | PAYWALL, MONET, LIVEOPS |
| GA4 / BigQuery | Onboarding funnels, session data | ONBOARD, RETAIN, MONET, LIVEOPS |
Daily Scanner Output Example
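A hypothetical record illustrating the fields the daily scanner tracks (spend, conversions, posterior probability); the field names and values are assumptions, not the production report format:

```python
scanner_row = {
    "test_id": "260115-CREATIVE-FailGameplay-V1",
    "status": "MEASURING",
    "spend_to_date": 4_200.00,
    "impressions": 8_450,
    "installs": 212,
    "planned_n": 19_600,
    "fraction_of_planned_n": 0.43,
    "posterior_prob_treatment_wins": 0.81,
    "guardrails_ok": True,
    "days_live": 6,
    "flag": None,   # set when a kill-switch trigger or the zombie rule fires
}
```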
External AI Review Integration
| AI Model | Function | Application |
|---|---|---|
| Gemini | Creative pattern analysis | Visual patterns in winning/losing video files |
| Grok | Market research | Competitor ad libraries and emerging trends |
| Claude | Statistical review | Sample size validation and interaction effect flagging |
8. Anti-Patterns (Learned from Real Portfolio Work)
These patterns have been observed across portfolio companies and have cost real money. Every one is preventable with this system.
8.1 The "Early Winner" Fallacy
The mistake: Calling a creative a winner because it had $0.50 CPI on day 1 with 15 installs.
❌ What Goes Wrong
Regression to the mean is inevitable. A creative that looks 40% better with 50 conversions has a coin-flip chance of being worse at 500.
✔ The Rule
Never make a scaling decision with fewer than 300 conversion events per variant.
8.2 Selection Bias Masquerading as Causation
The mistake: "Users who complete 6 lessons convert at 97%. Let's force all users through 6 lessons."
❌ What Goes Wrong
Users who completed 6 lessons arrived with intent. The lessons did not create that intent. Forcing unmotivated users will cause churn, not conversion.
✔ The Rule
Use randomized A/B tests. Compare "offered 6 lessons" vs. "offered 3" with random assignment.
8.3 The Push Notification Illusion
The mistake: "Users with push enabled have 3x LTV. Force push opt-in."
❌ What Goes Wrong
Push permission is an intent signal, not a causal lever. Forcing opt-in increases uninstall rates by 8-12%.
✔ The Rule
Test push opt-in timing as a value-add, not a gate. Measure uninstall rate as guardrail.
8.4 Creative Testing on iOS Without ATT Accounting
The mistake: Relying on Meta/Google reported "Purchases" for iOS creative tests, then scaling the "winner."
❌ What Goes Wrong
Due to ATT, iOS creative-to-purchase attribution is heavily modeled and wrong for small-scale tests below ~500 conversions.
✔ The Rule
Use Android as the "lab." Identify winners with full attribution, then port to iOS for scaling.
8.5 Volume Over Value in Creative Format Selection
The mistake: Scaling UGC video because it drives the most impressions and installs at lowest CPM.
❌ What Goes Wrong
When measured on ROAS rather than volume, static ads sometimes convert at 3.5x the rate of video. A high-volume, low-ROAS creative is actively losing money at scale.
✔ The Rule
Always evaluate creatives on ROAS or LTV, not volume. Use sample size table to compare ROAS, not just installs.
8.6 Over-Engineering the Solved Funnel
The mistake: Running elaborate A/B tests on onboarding when completion is 92-94% (top decile).
❌ What Goes Wrong
Chasing 1% gain in a solved funnel while D7 retention sits at 5%. Massive diminishing returns.
✔ The Rule
Use benchmarks (Section 9) to prioritize. Only test funnels below median. If onboarding is excellent, the problem is downstream.
Additional Anti-Patterns (8.7-8.11)
8.7 Energy Systems as Pure Monetization: Energy caps serve dual monetization+retention roles. Optimizing ARPU alone produces short-term spike then 20-30% D30 retention drop.
8.8 The Undocumented Failure: Negative results not logged = same test repeated 6 months later. Portfolio-wide cost: $50K-$100K/year in redundant testing.
8.9 Overlapping Tests Without Isolation: Two overlapping tests = four conditions, two measured. Both results contaminated. Cost: $30K-$150K per incident.
8.10 Organic vs. Paid Cohort Comparison: Organic users self-select with higher intent. Never use organic as control for paid UA tests.
8.11 ARPU/ROAS as Product Quality Signals: Both contaminated by bid strategy, channel mix, geo mix, user quality. Use Causal Attribution Analyzer to decompose.
Portfolio-Wide Impact
Each anti-pattern costs $15K-$150K per incident. Across 42 companies, anti-pattern violations are conservatively estimated at $300K-$500K per year in aggregate waste.
9. F2P Benchmarks Reference Table
Use these ranges to contextualize test results and prioritize what to test. If a metric is in the "Excellent" range, move testing effort to a category with larger upside.
Top-of-Funnel (UA Metrics)
| Metric | Median (Good) | Excellent (Top Decile) | Test If Below |
|---|---|---|---|
| CTR (Video) | 1.2-1.8% | 3.5%+ | 1.0% |
| CTR (Static) | 0.8-1.2% | 2.5%+ | 0.7% |
| IPM | 8-12 | 25+ | 5 |
| Store CVR (Organic) | 25-30% | 45%+ | 20% |
| Store CVR (Paid) | 15-20% | 35%+ | 12% |
Product & Engagement
| Metric | Median (Good) | Excellent (Top Decile) | Test If Below |
|---|---|---|---|
| Onboarding Completion | 50-60% | 85%+ | 45% |
| D1 Retention | 25-32% | 45%+ | 25% |
| D7 Retention | 8-12% | 18%+ | 7% |
| D30 Retention | 3-5% | 8%+ | 2% |
| Push Opt-in (iOS) | 40-56% | 60%+ | 35% |
| Push Opt-in (Android) | 70-80% | 90%+ | 60% |
Monetization
| Metric | Median (Good) | Excellent (Top Decile) | Test If Below |
|---|---|---|---|
| Paywall CVR (Soft) | 2-4% | 10%+ | 2% |
| Trial-to-Paid | 25-35% | 50%+ | 20% |
| ARPDAU (Casual) | $0.05-$0.12 | $0.25+ | $0.04 |
| ARPDAU (Mid-Core) | $0.10-$0.25 | $0.50+ | $0.08 |
| Energy Recharge Rate | 5-10% ever | 15%+ | 5% (cap too high) |
LiveOps & Events
| Metric | Median (Good) | Excellent (Top Decile) | Test If Below |
|---|---|---|---|
| Event Participation (% DAU) | 15-25% | 40%+ | 12% |
| Event Completion Rate | 20-35% | 50%+ | 15% |
| BP Attach ($4.99) | 5-8% | 12%+ | 4% |
| BP Attach ($9.99) | 3-5% | 8%+ | 2% |
| Post-Event D7 Ret. Delta | -1% to +2% | +3%+ | -3% (draining) |
Platform Fee Context
< $1M Annual Revenue
Platform Fee: 15% (Small Business Program)
Breakeven ROAS: 118%
> $1M Annual Revenue
Platform Fee: 30%
Breakeven ROAS: 143%
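Both figures follow from Breakeven ROAS = 1 / (1 - platform fee): 1 / 0.85 ≈ 118% and 1 / 0.70 ≈ 143%.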
A game with 120% D30 ROAS is marginally profitable at <$1M revenue but falls below breakeven at >$1M. Always check the company's current revenue tier before interpreting ROAS results.
10. Cross-Reference to Other Guides
| Guide | Relationship | When to Use |
|---|---|---|
| Portfolio Company Analysis Playbook | Post-hoc analysis of test results | After a test completes, for broader context |
| ROAS Analysis Pipeline | LTV projection methodology | When defining ROAS-based success criteria |
| Statistical Significance Framework | Detailed sample size calculations | When Section 3 lookup table is insufficient |
| F2P Marketing Analysis Framework | Spend allocation decisions | When results indicate winning channel/geo |
| Causal Attribution Analyzer | Decomposing product vs. marketing effects | When confounded by simultaneous product changes |
Appendix A: Quick-Start Checklist for UA Managers
Pre-Launch Verification
- Test ID formatted correctly: [YYMMDD]-[TYPE]-[NAME]-V[X]
- Exactly ONE variable changed between control and treatment
- Sample size n calculated using base rate formula (not guessed)
- Both a primary metric and at least one guardrail metric defined
- Maximum runtime and kill criteria documented
- Running on Android if creative attribution is required
- Audience isolation confirmed (no overlap with other active tests)
- Hypothesis documented in the test registry
- Automated scanner configured to track the test
- Metric being tested is below "Excellent" benchmark (otherwise test something else)
- Test category appropriate for company's lifecycle stage (Section 6 matrix)
- Test ROI estimated using Appendix B (for resource prioritization)
The VC Perspective
We do not invest in "magic." We invest in systems. A studio that can consistently run 4 high-quality experiments per month will eventually find the local maximum for their game. A studio that relies on creative genius will eventually run out of ideas — and cash.
Appendix B: Test ROI Calculator
Formula
Test Value = (Potential Uplift % × Affected Monthly Revenue × Probability of Success × 12) - Test Cost
Potential Uplift % = MDE you're testing for. Affected Monthly Revenue = revenue stream impacted. Probability of Success = historical win rate (default 15% if unknown). 12 = annualization factor. Test Cost = direct + opportunity cost.
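A minimal sketch of the calculation; the inputs below are taken from the worked example later in this appendix:

```python
def test_value(uplift: float, affected_monthly_revenue: float,
               p_success: float, test_cost: float) -> float:
    """Annualized expected value of running a test, per the formula above."""
    return uplift * affected_monthly_revenue * p_success * 12 - test_cost

print(test_value(0.20, 50_000, 0.12, 8_000))    # Option A (paywall): 6,400
print(test_value(0.10, 150_000, 0.08, 5_000))   # Option B (creative hook): 9,400
```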
Historical Win Rates by Category
| Category | Win Rate | Typical Uplift | Notes |
|---|---|---|---|
| CREATIVE | 5-10% | 15-30% CTR | High volume compensates low win rate |
| AUD | 15-20% | 10-20% ROAS | Fewer tests, higher hit rate |
| BID | 20-25% | 10-15% ROAS | Incremental algorithmic improvements |
| PAYWALL | 10-15% | 15-25% CVR | High impact but pricing is sensitive |
| ONBOARD | 20-30% | 10-20% completion | Product-side tests have higher hit rates |
| RETAIN | 15-25% | 5-15% retention | Compounding makes small wins very valuable |
| LIVEOPS | 15-25% | 10-25% event revenue | High variance; seasonal dependencies |
Worked Example
A scaling company ($150K/month spend, $80K/month revenue) choosing between two tests:
Option A: Paywall Pricing Test
Uplift: 20% on $50K/mo subscription
P(success): 12% | Cost: $8K
Value = $14,400 - $8,000 = $6,400
Option B: Creative Hook Test
Uplift: 25% CTR → ~10% CPI reduction on $150K
P(success): 8% | Cost: $5K
Value = $14,400 - $5,000 = $9,400
Decision: Run Option B first (higher expected value: $9,400 vs $6,400)
Self-Triage: Test Value < $0 = not worth running. Test Value < $2,000 = marginal, run only if no better alternatives.
Appendix C: Knowledge Management Template
Use this template for the "Learnings" field in every test document. Standardizing across all 42 portfolio companies ensures institutional memory is searchable and actionable.
Usage Rules
- Every test gets a Learnings entry — including INCONCLUSIVE and REVERT results.
- The "Transferability" section is mandatory. This makes the learning useful beyond the company that ran the test.
- Tags enable portfolio-wide search. "Has anyone tested Battle Pass pricing?" becomes answerable in seconds.
- "What To Do Next" prevents orphaned learnings. Every finding should point to the next action.
