
Private & Confidential — Transcend

F2P Marketing Experimentation System

A rigorous, operator-level guide for designing, executing, and learning from structured marketing experiments across F2P mobile game portfolios.

Marketing Experimentation | F2P | Version 2.0 | Last Updated: January 2026

1. Philosophy & Principles

Why Structured Experimentation Matters for F2P UA

In F2P mobile gaming, the difference between a portfolio company achieving sustainable scale and burning through a Series A is often found in the rigor of their experimentation. Many studios "test" by throwing creatives at a wall and seeing what sticks. This is not experimentation — it is gambling.

F2P games rely on high-volume acquisition to fuel monetization through IAP, ads, and subscriptions. Without a disciplined experimentation system, teams waste budgets on unproven assumptions, scale losers prematurely, and repeat the same mistakes quarter after quarter.

Why This Matters

In F2P UA, where cohort maturity takes weeks to reveal true LTV, ad platforms reward precision. Poor testing leads to creative fatigue within days, inflating CPI by 20-50%. Studios that run 4+ high-quality experiments per month will eventually find the local maxima for their game. Studios that rely on "creative genius" will eventually run out of ideas — and cash.

The "Testing" vs. "Analyzing" Distinction

This guide focuses exclusively on Testing — the prospective, controlled application of variables to measure causal impact on future performance.

Dimension | Testing (This Guide) | Analyzing (Guide 2)
Orientation | Prospective | Retrospective
Method | Controlled experiments | Observational analysis
Output | Causal inference | Correlative insight
Example | "If we change the first 3 seconds of this video, how does D1 ROAS change?" | "Why did our ROAS drop last Tuesday?"

Core Principles

  1. Isolate One Variable: If you change the creative AND the targeting simultaneously, you have learned nothing. Multi-variable tests require factorial designs that demand 4x the sample size.
  2. Statistical Rigor Before Decisions: We do not call winners based on "gut feel" or "early trends." We wait for the minimum sample size and requisite confidence level.
  3. Document Failures: Negative results are as valuable as positive ones. An undocumented failure guarantees someone will repeat it.
  4. Platform Awareness: Treat iOS and Android as two different games. Test platforms separately to avoid confounding.

F2P-Specific Considerations

  • Cohort Maturity: Early signals (CPI, CTR) are often decoupled from long-term value. Wait for full attribution windows — 7-day click minimum, 28-day for LTV signals. Immature cohorts overestimate uplift by 10-20%.
  • Attribution Windows: iOS attribution is transitioning from SKAdNetwork (SKAN) to AdAttributionKit (AAK). Both create elongated feedback loops. Android offers fuller attribution but carries 10-15% invalid traffic risk.
  • The Power Law of Creatives: 95% of creatives fail. The top creative typically drives 30-40% of all conversions. Your system must be designed to fail fast and scale the 5% that work.
  • Platform Behavior Differences: iOS users skew premium (higher ARPU, 5-10% better retention). Android volumes are roughly 2x but CPI is 20% lower. Never combine platforms in a single test arm.
Principle | F2P Application | Risk of Violation
One Variable | Test hook text only, not entire creative | Confounded results; 2x test duration wasted
Statistical Rigor | Minimum 300 conversions per variant | False positives; wasted scale budget
Cohort Maturity | 28-day window for LTV metrics | Overestimate ROAS by 15-25%
Platform Split | Separate iOS/Android test arms | Skewed metrics; Android fraud inflates installs
Document Failures | Log every negative result | Same paywall flop tested twice, 6 months apart

2. Test Taxonomy (MECE)

Every experiment must fall into exactly one of the following ten mutually exclusive categories. For each, we define the hypothesis template, key metrics, minimum sample sizes (80% power, 5% significance), typical duration, and success criteria.

2.1 Creative Testing

Definition: Testing visual and auditory ad assets — static vs. video vs. UGC, hook variants (first 3 seconds), thumbnail A/B variants, format variations.

Hypothesis Template: "If we replace the 'cinematic reveal' hook with a 'fail gameplay' hook, then CTR will increase by 20% because it triggers a 'fix-it' psychological impulse in casual puzzle players."

Key Metrics: Primary: CTR, IPM, Install Rate. Guardrails: CPI, D1 Retention, D1 ROAS.

Sample Sizes: CTR baseline 1-5% (median ~2.5%). For 10% relative MDE: ~78,000 impressions/variant. For 20%: ~20,000.

Duration: 3-7 days. Monitor for minimum 50 conversions before any directional read.

Success Criteria: CTR uplift >15% AND no degradation in D1 retention >5%. Scale if CPI improves or holds. Kill if CPI spikes >20%.

Creatives account for 30-40% of conversion variance in F2P UA. The top 20% of creatives drive 80% of conversions. Weekly testing of 10-20 variants is essential to combat fatigue and find the next winner.

2.2 Audience & Targeting Testing

Definition: Testing who sees the ads — LAL seed quality, interest stacking, broad vs. narrow targeting, geographic expansion.

Hypothesis Template: "If we move from a 1% LAL based on 'Installs' to a 5% LAL based on 'Highest IAP Spenders,' then D7 ROAS will increase by 15%."

Key Metrics: Primary: Install Rate, Trial Completion, D7 ROAS. Guardrails: CPI, D1 Retention, ARPU.

Sample Sizes: Install-to-trial baseline 5-15% (median ~10%). For 10% MDE: ~15,500 users/arm. For 20%: ~3,875.

Duration: 7-14 days. Geo expansions need longer for fraud detection.

Success Criteria: 15%+ trial uplift AND CPI < baseline + 10%. Guard: No D30 retention drop >3%.

2.3 Bid Strategy Testing

Definition: Testing the algorithm's optimization goal and price ceiling — tCPA, tROAS, MAI, bid caps, budget allocation.

Hypothesis Template: "If we switch from tCPA at $3.00 to tROAS at 120%, then D30 ROAS will improve by 25% because the algorithm will prioritize high-value conversions."

Key Metrics: Primary: ROAS (D7/D30), Effective CPI. Guardrails: Install Volume, Paywall CVR, Budget Utilization.

Sample Sizes: ROAS variance is high; require 500 conversions per arm minimum for 15% relative MDE.

Duration: 14-28 days. LTV signals need time to stabilize past the platform learning phase.

Success Criteria: ROAS improvement >1.2x baseline AND volume stable within +/-10%. The algorithm must be able to spend the full daily budget — a "better ROAS" at 30% of prior volume is not a win.

2.4 Campaign Structure Testing

Definition: Testing architectural setup within the ad manager — consolidation vs. segmentation, AEO vs. VO, CBO vs. ABO.

Hypothesis Template: "If we consolidate 5 separate ad sets into one broad CBO structure, the algorithm will find a lower blended CPI through increased liquidity."

Key Metrics: Primary: CPI, Scale Efficiency, CPM Stability. Guardrails: ROAS, Retention, Learning Phase duration.

Sample Sizes: 1,000 installs per structure for 10% relative MDE on CPI.

Duration: 7-21 days. Account for platform learning phase (3-5 days) before measurement.

Success Criteria: 10% CPI reduction AND ROAS stable. Guard: D1 retention stays above 25%.

Poor campaign structure halves efficiency. Over-segmentation starves algorithms of data; over-consolidation dilutes signal. Below $500/day, consolidation almost always wins.

2.5 Paywall & Pricing Testing

Definition: Testing the monetization gate and price elasticity — price points, trial lengths, discount depth, paywall timing/placement, copy/design.

Hypothesis Template: "If we offer a $4.99 weekly subscription with a 3-day trial instead of $9.99 monthly, trial-to-paid conversion will increase by 25%."

Key Metrics: Primary: Paywall CVR, Trial Start Rate, Trial-to-Paid. Guardrails: D7 Retention, ARPU, Churn, Total Net Revenue.

Sample Sizes: Paywall CVR baseline 2-5% (median ~3%). For 20% MDE at 5% base: ~7,850 users/variant. One of the most sample-hungry test types.

Duration: 14-28 days. Trial length directly extends measurement.

Success Criteria: CVR improvement that translates to higher total net revenue (accounting for platform fees and churn). Guard: D7 retention drop <5%.

2.6 Onboarding Flow Testing

Definition: Testing FTUE up to first meaningful action — step count, personalization depth, value proposition messaging, commitment devices.

Hypothesis Template: "If we reduce onboarding from 5 to 3 steps by deferring account creation, D1 retention will increase by 10%."

Key Metrics: Primary: Completion Rate, Time to First Play, D1 Retention. Guardrails: Trial Start Rate, D7 Retention.

Sample Sizes: Completion baseline 45-55%. For 10% MDE at a 50% base rate: ~1,600 users/flow.

Duration: 3-7 days. D1 retention results arrive within 48 hours.

Success Criteria: Completion >50% AND D1 retention >28%. Guard: no drop in downstream conversion.

2.7 Monetization Surface Testing

Definition: Testing placement and frequency of revenue touchpoints — energy/lives cap, ad load, IAP placement, subscription vs. IAP mix, rewarded video.

Hypothesis Template: "If we lower the energy cap from 10 to 5 but add a rewarded video option, ARPDAU will increase by 15%."

Key Metrics: Primary: ARPU, ARPDAU, Ad Revenue/User, IAP CVR. Guardrails: Session Length, D7 Retention, Churn.

Sample Sizes: ARPU has high variance; 1,000 users/arm minimum for 15% MDE.

Duration: 21-42 days. Revenue cycles are long; IAP shows delayed effects.

Success Criteria: ARPU increase >10% AND churn increase <5%. Guard: session length and D7 retention stable.

Energy Systems Are Dual-Purpose

Energy/lives systems are simultaneously monetization mechanics AND retention mechanics. A lower cap increases monetization pressure but also gates content, which can improve session pacing. If fewer than 5% of users ever recharge energy, the cap is too high — it is not functioning as a monetization surface.

2.8 Retention Mechanic Testing

Definition: Testing engagement loops — push notification timing/content, streak systems, daily reward calibration, re-engagement campaigns.

Hypothesis Template: "If we send a push at the user's historical peak play time instead of fixed 9:00 AM, D7 retention will improve by 5%."

Key Metrics: Primary: D7/D14/D30 Retention, Re-engagement Rate, Push Open Rate. Guardrails: Uninstall Rate, Push Opt-out Rate.

Sample Sizes: D1 retention baseline 26-28%. For 10% MDE at a 27% base rate: ~4,200 users/arm.

Duration: 7-30 days. D30 retention obviously requires 30 days of measurement.

Success Criteria: Retention improvement >10% AND opt-in >40% iOS. Guard: no increase in uninstall rate.

2.9 Landing Page / App Store (ASO) Testing

Definition: Testing App/Play Store presence — screenshot order, preview video, description copy, icon variants, custom product pages.

Hypothesis Template: "If we use screenshots featuring multiplayer elements instead of solo play, store CVR will increase by 12%."

Key Metrics: Primary: Page View to Install CVR, Impressions-to-Install Rate. Guardrails: Keyword Rankings, Paid Cannibalization Rate.

Sample Sizes: Organic CVR baseline 25-30%. For 10% MDE: ~2,000 page views/variant.

Duration: 14-30 days. ASO changes ramp slowly as store algorithms adjust.

Success Criteria: CVR improvement >10% AND rankings stable. Guard: no cannibalization of paid installs.

2.10 LiveOps & Seasonal Event Testing

Definition: Testing time-limited content parameters — event reward structures, durations, Battle Pass pricing/tiers, LTOs, seasonal cadence.

Hypothesis Template: "If we shorten the seasonal event from 14 to 7 days with the same reward pool, participation will increase by 20% due to urgency."

Key Metrics: Primary: Event Participation, Completion Rate, Event Revenue/Participant, Battle Pass Attach Rate. Guardrails: Post-event D7 Retention, Non-Event Session Length, Post-event Churn.

Sample Sizes: Participation baseline 15-30% (median ~20% DAU). For 15% MDE: ~3,500 DAU/variant. Battle Pass attach (5-8% baseline): ~6,000 DAU/variant.

Duration: One full event cycle + 7-day post-event observation. Weekly events: 14 days. Monthly events: 37 days.

Success Criteria: Event revenue/participant increase >15% AND post-event D7 retention drop <3%. Guard: non-event session length stable.

LiveOps Key Considerations

Reward Calibration: Over-generous rewards devalue the core economy; under-generous produce low participation. Test reward tiers independently from magnitude.

Duration vs. Intensity: Shorter events increase urgency but exclude casual players. Test duration separately from reward structure.

Battle Pass Pricing: At $4.99, attach rates of 8-12% are common; at $9.99, 3-5%. Always test — total revenue depends on game-specific price sensitivity.

LTO Frequency: More than 2 LTOs/week creates "deal fatigue." Test cadence with full-price IAP revenue as a guardrail.

Test Taxonomy Summary

Category | ID Code | Primary Metric | Duration | Min. Conversions
Creative | CREATIVE | CTR, IPM | 3-7 days | 300+/variant
Audience/Targeting | AUD | Install Rate, D7 ROAS | 7-14 days | 500+/variant
Bid Strategy | BID | ROAS, CPI | 14-28 days | 500+/variant
Campaign Structure | CAMP | CPI, Scale Efficiency | 7-21 days | 1,000+ installs
Paywall/Pricing | PAYWALL | Paywall CVR, Trial-to-Paid | 14-28 days | 800+/variant
Onboarding Flow | ONBOARD | Completion Rate, D1 Ret. | 3-7 days | 600+/variant
Monetization Surface | MONET | ARPU, ARPDAU | 21-42 days | 1,000+/variant
Retention Mechanic | RETAIN | D7/D30 Retention | 7-30 days | 2,000+/variant
Landing Page / ASO | ASO | Store CVR | 14-30 days | 2,000+ views
LiveOps / Seasonal | LIVEOPS | Event Revenue, Attach Rate | 14-37 days | 3,500+ DAU

3. Statistical Framework (F2P-Specific)

Why This Section Exists

We have seen numerous portfolio companies scale "winning" creatives based on 50 conversions, only to see ROAS collapse at scale. The "winner" was a statistical fluke — regression to the mean is inevitable with small samples.

The Base-Rate-Aware Sample Size Formula

To calculate the number of users required per variant (n):

n = 2 × (Zα/2 + Zβ)² × p × (1 − p) / d²

Where: Zα/2 = 1.96 (95% confidence), Zβ = 0.84 (80% power), p = baseline rate, d = absolute MDE. The leading factor of 2 applies because a test compares two arms of equal size (control and treatment), not a single sample against a fixed benchmark.

Base Rate Matters Enormously

A 10% relative lift on a 2% base rate (d = 0.002) requires roughly 20x more samples than a 10% relative lift on a 30% base rate (d = 0.03). Low-base-rate metrics like CTR and paywall CVR require massive sample sizes.

Worked example: A puzzle RPG has paywall CVR of 5%. The team wants to detect a 20% relative improvement (5% → 6%). Absolute MDE d = 0.01.

n = 2 × (1.96 + 0.84)² × 0.05 × 0.95 / (0.01)²
n = 2 × 7.84 × 0.0475 / 0.0001
n = 7,448 per variant (round to ~7,450)
Total experiment: ~15,000 users (control + treatment)
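The calculation can be sketched as a small helper. This is a minimal sketch, not internal tooling; the function name and rounding are illustrative. Note the factor of 2 in the standard two-proportion form, which roughly doubles the figure a single-sample formula would give.

```python
from math import ceil

def sample_size_per_variant(p: float, relative_mde: float,
                            z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Users required per variant to detect a relative lift on baseline rate p.

    Two-proportion comparison: n = 2 * (z_alpha + z_beta)^2 * p(1-p) / d^2,
    where d = p * relative_mde is the absolute MDE.
    Defaults give 95% confidence and 80% power.
    """
    d = p * relative_mde  # absolute minimum detectable effect
    return ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / d ** 2)

# Paywall example: 5% CVR, 20% relative MDE (5% -> 6%)
n = sample_size_per_variant(0.05, 0.20)
```

Low-base-rate metrics blow up quickly: the same call with `p=0.02` (CTR) and a 10% relative MDE returns roughly 77,000 impressions per variant.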

Sample Size Lookup Table

All values per variant, for a two-arm test (control + treatment of equal size). 95% confidence, 80% power, values rounded.

Metric | Base Rate (p) | MDE 10% Rel. | MDE 20% Rel. | MDE 50% Rel.
CTR | 2% | 78,400 | 19,600 | 3,140
Paywall CVR | 5% | 31,400 | 7,850 | 1,250
Install-to-Trial | 10% | 15,500 | 3,875 | 620
D1 Retention | 27% | 4,250 | 1,060 | 170
Trial-to-Paid | 30% | 3,660 | 915 | 146
Onboarding Completion | 50% | 1,570 | 392 | 63

iOS Attribution: SKAN 4 and AdAttributionKit (AAK)

iOS attribution is in transition. SKAN 4 remains operational but AdAttributionKit (AAK) is the strategic successor.

Framework | Multiplier on Base n | Rationale
Android (full attribution) | 1.0x | Baseline — full deterministic attribution
SKAN 4 (high volume) | 1.3x | Crowd anonymity met; fine-grained values but delayed
SKAN 4 (low volume) | 1.5-2.0x | Coarse values only; significant measurement noise
AAK (early adoption) | 1.3-1.5x | Similar privacy tiers; re-engagement data available but network support varies

✓ DO

  • Run creative tests on Android first, port winners to iOS
  • Use 1.3-1.5x multiplier for iOS ROAS tests
  • Ensure MMP supports AAK postbacks
  • Configure developer-mode postbacks for test campaigns

✗ DON'T

  • Trust iOS creative-to-purchase attribution for small tests
  • Use Android sample sizes for iOS tests
  • Ignore AAK transition in test infrastructure
  • Combine platforms in a single test arm

Sequential Testing and Early Stopping

F2P data accumulates slowly. We employ Bayesian sequential testing:

  • Early Stop (Negative): Posterior probability drops below 5% AND ≥30% of planned n reached → stop. Clear loser.
  • Early Stop (Positive): Posterior exceeds 95% for three consecutive periods AND ≥50% of planned n → may call early winner.
  • Mandatory Cap: Never save more than 20% of planned runtime through early stopping.
  • Never Stop Between: If probability is 5-95%, continue to planned n. The "gray zone" is where most false conclusions live.
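The stopping rules above can be sketched as follows. The posterior is estimated by Monte Carlo under uniform Beta(1, 1) priors, which is an assumption; the actual prior and monitoring cadence are team choices, and the three-consecutive-periods check for positive stops is left to the monitoring loop. Function names are illustrative.

```python
import random

def prob_beat_control(conv_ctrl: int, n_ctrl: int,
                      conv_treat: int, n_treat: int,
                      draws: int = 50_000, seed: int = 7) -> float:
    """Monte Carlo estimate of P(treatment rate > control rate), assuming
    independent Beta(1, 1) priors on each variant's conversion rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_c = rng.betavariate(1 + conv_ctrl, 1 + n_ctrl - conv_ctrl)
        rate_t = rng.betavariate(1 + conv_treat, 1 + n_treat - conv_treat)
        wins += rate_t > rate_c
    return wins / draws

def sequential_decision(posterior: float, frac_of_planned_n: float) -> str:
    """Apply the early-stopping thresholds from the rules above."""
    if posterior < 0.05 and frac_of_planned_n >= 0.30:
        return "STOP_NEGATIVE"     # clear loser
    if posterior > 0.95 and frac_of_planned_n >= 0.50:
        return "CANDIDATE_WINNER"  # still needs 3 consecutive periods
    return "CONTINUE"              # the 5-95% gray zone: run to planned n
```

For example, 50/1,000 control conversions against 80/1,000 treatment conversions yields a posterior well above 95%, so a test past half its planned n becomes a candidate winner.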

Multi-Arm Bandit Considerations

Approach | When to Use | F2P Application
Fixed-Horizon A/B | Low volume, high stakes | Paywall, monetization changes
Multi-Arm Bandit | High volume, many variants | Creative rotation with 10+ variants
Explore/Exploit | Need to minimize regret during test | Scale winning creative while testing new hooks

CUPED: Variance Reduction for Product Tests

How CUPED Works

For product experiments (onboarding, monetization, retention), adjust each user's outcome by their pre-experiment behavior (session count, device type, channel). This removes "noise," reducing required sample sizes by 20-40%. Not applicable to creative tests where users have no pre-experiment relationship.

Interaction Effects: Avoiding Test Pollution

  1. Preferred: Audience Isolation. Hash UserIDs to non-overlapping buckets (UserID % 4).
  2. Acceptable: Sequential Testing. Run Test A to completion, then Test B.
  3. Last Resort: Document and Model. Run both, document overlap, use ANOVA post-hoc. Accept wider CIs.

4. Test Documentation Standard

Every experiment must be documented in a central registry before launch. Undocumented tests repeat failures and waste review time.

Test ID Format

[YYMMDD]-[TYPE]-[NAME]-V[X]

Types: CREATIVE, AUD, BID, CAMP, PAYWALL, ONBOARD, MONET, RETAIN, ASO, LIVEOPS

Examples:
260115-CREATIVE-FailGameplay-V1
260203-PAYWALL-PriceElasticity-V2
260318-LIVEOPS-BattlePassPricing-V1
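A registry validator for this ID format might look like the following sketch. The regex and type set mirror the convention above; nothing here is an existing internal tool.

```python
import re
from datetime import datetime

TEST_TYPES = {"CREATIVE", "AUD", "BID", "CAMP", "PAYWALL",
              "ONBOARD", "MONET", "RETAIN", "ASO", "LIVEOPS"}
ID_PATTERN = re.compile(r"^(\d{6})-([A-Z]+)-([A-Za-z0-9]+)-V(\d+)$")

def validate_test_id(test_id: str) -> bool:
    """True if the ID matches [YYMMDD]-[TYPE]-[NAME]-V[X], uses a known
    type code, and starts with a parseable YYMMDD date."""
    match = ID_PATTERN.match(test_id)
    if not match or match.group(2) not in TEST_TYPES:
        return False
    try:
        datetime.strptime(match.group(1), "%y%m%d")  # rejects e.g. month 13
    except ValueError:
        return False
    return True
```

Rejecting malformed IDs at registry-write time is what keeps the portfolio-wide search in Appendix C reliable.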

Required Document Fields

Section | Contents | Example
Header | ID, Type, Status, Start/Review Date, Owner | 260115-CREATIVE-FailGameplay-V1, LIVE, Jan 15
Hypothesis | "If [change], then [metric] will [direction] by [magnitude] because [mechanism]" | "If 'fail gameplay' hook, CTR +20% via fix-it impulse"
Baseline | Metric values with 95% CI, source, date range | CTR: 2.5% (2.3-2.7%), Google Ads, Dec 1-Jan 14
Control | What stays the same | Standard cinematic ad, US targeting, tCPA $3.00
Treatment | What changes (ONE variable) | Hook replaced with fail-gameplay footage
Success Gates | Primary threshold + guardrail constraints | CTR >15% AND CPI ≤+10% AND D1 ret. holds
Sample Size | Show the calculation | 2.5% CTR, 20% MDE: n = 19,600/group
Max Runtime | Kill date regardless of significance | 14 days OR $10K budget cap
Results | Observed values, p-value/posterior, CIs | CTR: 3.1% (2.8-3.4%), p=0.02, Pr=97%
Decision | One of five outcomes | SCALE
Learnings | What we now know (use Appendix C template) | "Fail-gameplay hooks outperform cinematic for casual puzzle"

Decision Definitions

Decision | Meaning | Action
SCALE | Treatment wins decisively | Roll to 100%, increase budget
ADOPT | Treatment is better | Treatment becomes new default
EXTEND | Promising but not yet significant | Continue to full n
REVERT | Treatment loses or harms guardrails | Return to control immediately
INCONCLUSIVE | No detectable difference | Document, archive, redesign test

5. Test Lifecycle

The lifecycle ensures no test is forgotten, no budget wasted on zombie experiments, and every test produces a documented outcome.

State Workflow

  1. PLANNED: Document complete, peer reviewed, budget allocated, audience isolation confirmed.
  2. LIVE: Spend confirmed for >48 hours, no technical issues, tracking verified.
  3. MEASURING: Wait for target n (sequential monitoring active). Zombie rule: kill if >21 days without reaching n.
  4. SIGNAL: Sample size reached. Compute p-value / Bayesian posterior. Positive (>80%), Negative (<20%), or Neutral (20-80%).
  5. DECISION: Statistical output reviewed, guardrails checked. Outcome: SCALE / ADOPT / REVERT / EXTEND / INCONCLUSIVE.
  6. ARCHIVED: Learnings documented (Appendix C template), stakeholders notified, registry updated.

Escalation Rules

Kill Switch Triggers

CPI Spike: CPI >40% above baseline in first 48 hours → auto-pause for emergency review.

Budget Breach: Spend exceeds 150% of plan without reaching n → pause and evaluate.

Zombie Rule: Any test in MEASURING for >21 days without target n → kill as INCONCLUSIVE.

Guardrail Breach: Any guardrail degrades beyond threshold → immediate review, likely REVERT.
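The four triggers can be expressed as a single evaluation function. This is a sketch with thresholds taken from the rules above; the signature, parameter names, and trigger labels are assumptions for illustration, not an existing scanner API.

```python
def kill_switch_triggers(cpi: float, baseline_cpi: float, hours_live: float,
                         spend: float, planned_spend: float,
                         n_reached: int, target_n: int,
                         days_measuring: int,
                         guardrail_breached: bool) -> list[str]:
    """Evaluate the four escalation triggers; returns every trigger that fired."""
    fired = []
    if hours_live <= 48 and cpi > 1.40 * baseline_cpi:
        fired.append("CPI_SPIKE")       # >40% above baseline in first 48h
    if spend > 1.50 * planned_spend and n_reached < target_n:
        fired.append("BUDGET_BREACH")   # 150% of plan without reaching n
    if days_measuring > 21 and n_reached < target_n:
        fired.append("ZOMBIE")          # kill as INCONCLUSIVE
    if guardrail_breached:
        fired.append("GUARDRAIL")       # immediate review, likely REVERT
    return fired
```

Running this check in the daily scanner means no trigger depends on a human remembering to look.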

Review Cadence

Cadence | Activity | Participants
Daily | Automated scanner report (spend, conversions, probability) | UA Analyst (async)
Weekly | Status scan of all active tests, flag approaching n or timeline | UA Lead + Analyst
Bi-weekly | Deep review: full stats, guardrails, decisions on tests at SIGNAL | Review Committee

6. Portfolio-Level Test Orchestration

Managing experimentation across 42 portfolio companies requires a prioritization framework that accounts for lifecycle stage, available resources, and test category urgency.

Test Priority Matrix by Company Lifecycle

Test Category | Pre-PMF (to D90) | Scaling ($50K-$500K/mo) | Mature ($500K+/mo)
ONBOARD | Required | If below median | Only if regressed
CREATIVE | Required | Required | Required
PAYWALL | If monetization live | Required | Required
AUD | Not recommended | Required | Required
BID | Not recommended | Required | Required
CAMP | Not recommended | If >$200K/mo | Required
MONET | Not recommended | If D30 data available | Required
RETAIN | If D7 data available | Required | Required
ASO | If launched | If organic >20% | Required
LIVEOPS | Not recommended | If cadence established | Required

Resource Conflict Resolution (Priority Stack)

  1. Active Guardrail Breach: Any company with a live test showing guardrail degradation gets immediate analyst attention.
  2. Tests at SIGNAL Stage: Tests that reached target n need statistical review. Delayed decisions waste the budget already spent.
  3. Higher Lifecycle Stage: Mature companies generate more absolute value per test improvement than pre-PMF companies.
  4. Test ROI Score: Use Appendix B calculator to compare expected value when lifecycle stages are equal.
  5. First-Come, First-Served: Tiebreaker when all other factors are equal.

7. Automation & Tooling

The experimentation system is supported by a Claude Code automation layer that manages scale across the portfolio.

Scanner Data Sources

Data Source | What It Queries | Test Types Served
Google Ads | Impressions, clicks, CTR, CPI, conversions | CREATIVE, AUD, BID, CAMP
Meta Ads | ROAS by placement, breakdowns by creative | CREATIVE, AUD, CAMP
AppsFlyer | Attribution cohorts, D1/D7/D30 retention | All types
Stripe / RevenueCat | Subscription events, trial-to-paid, churn | PAYWALL, MONET, LIVEOPS
GA4 / BigQuery | Onboarding funnels, session data | ONBOARD, RETAIN, MONET, LIVEOPS

Daily Scanner Output Example

Test 260115-CREATIVE-FailGameplay-V1: 64% of target n reached. Current lift: +8.2%. Probability of beating control: 88%. Guardrails: CPI +3% (within threshold). D1 retention: stable. Status: MEASURING. Projected completion: Jan 22.
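A formatter producing that status line might look like this sketch. The field names and the 10% CPI guardrail default are assumptions for illustration; the real scanner's interface is not documented here.

```python
def scanner_report(test_id: str, n_reached: int, target_n: int,
                   lift: float, posterior: float, cpi_delta: float,
                   cpi_threshold: float = 0.10) -> str:
    """Render one daily scanner status line for a test in MEASURING."""
    pct = 100 * n_reached / target_n
    guard = "within threshold" if abs(cpi_delta) <= cpi_threshold else "BREACH"
    return (f"Test {test_id}: {pct:.0f}% of target n reached. "
            f"Current lift: {lift:+.1%}. "
            f"Probability of beating control: {posterior:.0%}. "
            f"Guardrails: CPI {cpi_delta:+.0%} ({guard}).")

line = scanner_report("260115-CREATIVE-FailGameplay-V1",
                      640, 1000, 0.082, 0.88, 0.03)
```

Keeping the line machine-parseable (fixed field order, signed percentages) lets the weekly status scan aggregate without re-querying the ad platforms.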

External AI Review Integration

AI Model | Function | Application
Gemini | Creative pattern analysis | Visual patterns in winning/losing video files
Grok | Market research | Competitor ad libraries and emerging trends
Claude | Statistical review | Sample size validation and interaction effect flagging

8. Anti-Patterns (Learned from Real Portfolio Work)

These patterns have been observed across portfolio companies and have cost real money. Every one is preventable with this system.

8.1 The "Early Winner" Fallacy

The mistake: Calling a creative a winner because it had $0.50 CPI on day 1 with 15 installs.

❌ What Goes Wrong

Regression to the mean is inevitable. A creative that looks 40% better with 50 conversions has a coin-flip chance of being worse at 500.

✔ The Rule

Never make a scaling decision with fewer than 300 conversion events per variant.

Est. Impact: $15K-$75K per incident

8.2 Selection Bias Masquerading as Causation

The mistake: "Users who complete 6 lessons convert at 97%. Let's force all users through 6 lessons."

❌ What Goes Wrong

Users who completed 6 lessons arrived with intent. The lessons did not create that intent. Forcing unmotivated users will cause churn, not conversion.

✔ The Rule

Use randomized A/B tests. Compare "offered 6 lessons" vs. "offered 3" with random assignment.

Est. Impact: $20K-$50K in wasted UA spend

8.3 The Push Notification Illusion

The mistake: "Users with push enabled have 3x LTV. Force push opt-in."

❌ What Goes Wrong

Push permission is an intent signal, not a causal lever. Forcing opt-in increases uninstall rates by 8-12%.

✔ The Rule

Test push opt-in timing as a value-add, not a gate. Measure uninstall rate as guardrail.

Est. Impact: $5K-$8K/month in lost users

8.4 Creative Testing on iOS Without ATT Accounting

The mistake: Relying on Meta/Google reported "Purchases" for iOS creative tests, then scaling the "winner."

❌ What Goes Wrong

Due to ATT, iOS creative-to-purchase attribution is heavily modeled and wrong for small-scale tests below ~500 conversions.

✔ The Rule

Use Android as the "lab." Identify winners with full attribution, then port to iOS for scaling.

Est. Impact: $28K-$140K per misidentified winner

8.5 Volume Over Value in Creative Format Selection

The mistake: Scaling UGC video because it drives the most impressions and installs at lowest CPM.

❌ What Goes Wrong

When measured on ROAS rather than volume, static ads sometimes convert at 3.5x the rate of video. A high-volume, low-ROAS creative is actively losing money at scale.

✔ The Rule

Always evaluate creatives on ROAS or LTV, not volume. Use sample size table to compare ROAS, not just installs.

Est. Impact: $30K-$50K/month in value destruction

8.6 Over-Engineering the Solved Funnel

The mistake: Running elaborate A/B tests on onboarding when completion is 92-94% (top decile).

❌ What Goes Wrong

Chasing 1% gain in a solved funnel while D7 retention sits at 5%. Massive diminishing returns.

✔ The Rule

Use benchmarks (Section 9) to prioritize. Only test funnels below median. If onboarding is excellent, the problem is downstream.

Est. Impact: 5-10x opportunity cost of misallocation

Additional Anti-Patterns (8.7-8.11)

8.7 Energy Systems as Pure Monetization: Energy caps serve dual monetization+retention roles. Optimizing ARPU alone produces short-term spike then 20-30% D30 retention drop.

8.8 The Undocumented Failure: Negative results not logged = same test repeated 6 months later. Portfolio-wide cost: $50K-$100K/year in redundant testing.

8.9 Overlapping Tests Without Isolation: Two overlapping tests = four conditions, two measured. Both results contaminated. Cost: $30K-$150K per incident.

8.10 Organic vs. Paid Cohort Comparison: Organic users self-select with higher intent. Never use organic as control for paid UA tests.

8.11 ARPU/ROAS as Product Quality Signals: Both contaminated by bid strategy, channel mix, geo mix, user quality. Use Causal Attribution Analyzer to decompose.

Portfolio-Wide Impact

Each anti-pattern costs $15K-$150K per incident. Across 42 companies, anti-pattern violations are conservatively estimated at $300K-$500K per year in aggregate waste.

9. F2P Benchmarks Reference Table

Use these ranges to contextualize test results and prioritize what to test. If a metric is in the "Excellent" range, move testing effort to a category with larger upside.

Top-of-Funnel (UA Metrics)

Metric | Median (Good) | Excellent (Top Decile) | Test If Below
CTR (Video) | 1.2-1.8% | 3.5%+ | 1.0%
CTR (Static) | 0.8-1.2% | 2.5%+ | 0.7%
IPM | 8-12 | 25+ | 5
Store CVR (Organic) | 25-30% | 45%+ | 20%
Store CVR (Paid) | 15-20% | 35%+ | 12%

Product & Engagement

Metric | Median (Good) | Excellent (Top Decile) | Test If Below
Onboarding Completion | 50-60% | 85%+ | 45%
D1 Retention | 25-32% | 45%+ | 25%
D7 Retention | 8-12% | 18%+ | 7%
D30 Retention | 3-5% | 8%+ | 2%
Push Opt-in (iOS) | 40-56% | 60%+ | 35%
Push Opt-in (Android) | 70-80% | 90%+ | 60%

Monetization

Metric | Median (Good) | Excellent (Top Decile) | Test If Below
Paywall CVR (Soft) | 2-4% | 10%+ | 2%
Trial-to-Paid | 25-35% | 50%+ | 20%
ARPDAU (Casual) | $0.05-$0.12 | $0.25+ | $0.04
ARPDAU (Mid-Core) | $0.10-$0.25 | $0.50+ | $0.08
Energy Recharge Rate | 5-10% ever | 15%+ | 5% (cap too high)

LiveOps & Events

Metric | Median (Good) | Excellent (Top Decile) | Test If Below
Event Participation (% DAU) | 15-25% | 40%+ | 12%
Event Completion Rate | 20-35% | 50%+ | 15%
BP Attach ($4.99) | 5-8% | 12%+ | 4%
BP Attach ($9.99) | 3-5% | 8%+ | 2%
Post-Event D7 Ret. Delta | -1% to +2% | +3%+ | -3% (draining)

Platform Fee Context

< $1M Annual Revenue

Platform Fee: 15% (Small Business Program)

Breakeven ROAS: 118%

> $1M Annual Revenue

Platform Fee: 30%

Breakeven ROAS: 143%

A game with 120% D30 ROAS looks profitable at <$1M revenue but is break-even at >$1M. Always check the company's current revenue tier before interpreting ROAS results.
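The breakeven arithmetic is simply 1 / (1 − fee). A two-line sketch, with the $1M tier split following the Small Business Program threshold noted above (the function name is illustrative):

```python
def breakeven_roas(annual_revenue: float) -> float:
    """Breakeven ROAS = 1 / (1 - platform fee): 15% fee under the Small
    Business Program (under $1M/yr), 30% above it."""
    fee = 0.15 if annual_revenue < 1_000_000 else 0.30
    return 1 / (1 - fee)
```

So the 120% D30 ROAS game clears the ~118% bar below $1M but sits under the ~143% bar above it.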

10. Cross-Reference to Other Guides

Guide | Relationship | When to Use
Portfolio Company Analysis Playbook | Post-hoc analysis of test results | After a test completes, for broader context
ROAS Analysis Pipeline | LTV projection methodology | When defining ROAS-based success criteria
Statistical Significance Framework | Detailed sample size calculations | When Section 3 lookup table is insufficient
F2P Marketing Analysis Framework | Spend allocation decisions | When results indicate winning channel/geo
Causal Attribution Analyzer | Decomposing product vs. marketing effects | When confounded by simultaneous product changes

Appendix A: Quick-Start Checklist for UA Managers

Pre-Launch Verification

  1. Test ID formatted correctly: [YYMMDD]-[TYPE]-[NAME]-V[X]
  2. Exactly ONE variable changed between control and treatment
  3. Sample size n calculated using base rate formula (not guessed)
  4. Both a primary metric and at least one guardrail metric defined
  5. Maximum runtime and kill criteria documented
  6. Running on Android if creative attribution is required
  7. Audience isolation confirmed (no overlap with other active tests)
  8. Hypothesis documented in the test registry
  9. Automated scanner configured to track the test
  10. Metric being tested is below "Excellent" benchmark (otherwise test something else)
  11. Test category appropriate for company's lifecycle stage (Section 6 matrix)
  12. Test ROI estimated using Appendix B (for resource prioritization)

The VC Perspective

We do not invest in "magic." We invest in systems. A studio that can consistently run 4 high-quality experiments per month will eventually find the local maxima for their game. A studio that relies on creative genius will eventually run out of ideas — and cash.

Appendix B: Test ROI Calculator

Formula

Test Value = (Potential Uplift % × Affected Monthly Revenue × Probability of Success × 12) - Test Cost

Potential Uplift % = MDE you're testing for. Affected Monthly Revenue = revenue stream impacted. Probability of Success = historical win rate (default 15% if unknown). 12 = annualization factor. Test Cost = direct + opportunity cost.

Historical Win Rates by Category

Category | Win Rate | Typical Uplift | Notes
CREATIVE | 5-10% | 15-30% CTR | High volume compensates low win rate
AUD | 15-20% | 10-20% ROAS | Fewer tests, higher hit rate
BID | 20-25% | 10-15% ROAS | Incremental algorithmic improvements
PAYWALL | 10-15% | 15-25% CVR | High impact but pricing is sensitive
ONBOARD | 20-30% | 10-20% completion | Product-side tests have higher hit rates
RETAIN | 15-25% | 5-15% retention | Compounding makes small wins very valuable
LIVEOPS | 15-25% | 10-25% event revenue | High variance; seasonal dependencies

Worked Example

A scaling company ($150K/month spend, $80K/month revenue) choosing between two tests:

Option A: Paywall Pricing Test

Uplift: 20% on $50K/mo subscription

P(success): 12% | Cost: $8K

Value = $14,400 - $8,000 = $6,400

Option B: Creative Hook Test

Uplift: 25% CTR → ~10% CPI reduction on $150K

P(success): 8% | Cost: $5K

Value = $14,400 - $5,000 = $9,400

Decision: Run Option B first (higher expected value: $9,400 vs $6,400)

Self-Triage: Test Value < $0 = not worth running. Test Value < $2,000 = marginal, run only if no better alternatives.

Appendix C: Knowledge Management Template

Use this template for the "Learnings" field in every test document. Standardizing across all 42 portfolio companies ensures institutional memory is searchable and actionable.

## Learnings: [Test ID]

### What We Tested
- **Variable:** [The single variable changed]
- **Context:** [Game genre, monetization model, daily spend, platform]

### What We Found
- **Result:** [SCALE / ADOPT / EXTEND / REVERT / INCONCLUSIVE]
- **Primary metric delta:** [Observed change with CI]
- **Guardrail impact:** [Any guardrail movement]

### Why It Happened (Hypothesis Validation)
- **Original hypothesis confirmed/rejected:** [Did the mechanism hold?]
- **Actual mechanism (if different):** [What actually drove the result]

### Transferability
- **Likely applies to:** [Genres, models, spend levels where this transfers]
- **Likely does NOT apply to:** [Contexts where it should not be assumed]
- **Conditions for transfer:** [What must be true for similar results]

### What To Do Next
- **If positive:** [Next test to run, or scaling action]
- **If negative:** [Alternative to test, or hypothesis to explore]
- **Open questions:** [What this test did NOT answer]

### Tags
- **Category:** [CREATIVE/AUD/BID/CAMP/PAYWALL/ONBOARD/MONET/RETAIN/ASO/LIVEOPS]
- **Genre:** [Casual Puzzle / Mid-Core RPG / Hypercasual / etc.]
- **Platform:** [Android / iOS / Both]
- **Spend Level:** [Pre-PMF / Scaling / Mature]

Usage Rules

  1. Every test gets a Learnings entry — including INCONCLUSIVE and REVERT results.
  2. The "Transferability" section is mandatory. This makes the learning useful beyond the company that ran the test.
  3. Tags enable portfolio-wide search. "Has anyone tested Battle Pass pricing?" becomes answerable in seconds.
  4. "What To Do Next" prevents orphaned learnings. Every finding should point to the next action.