F2P Marketing Experimentation System
A rigorous, operator-level guide for designing, executing, and learning from structured marketing experiments across F2P mobile game portfolios.
1. Philosophy & Principles
Why Structured Experimentation Matters for F2P UA
In F2P mobile gaming, the difference between a portfolio company achieving sustainable scale and burning through a Series A is often found in the rigor of their experimentation. Many studios "test" by throwing creatives at a wall and seeing what sticks. This is not experimentation — it is gambling.
F2P games rely on high-volume acquisition to fuel monetization through IAP, ads, and subscriptions. Without a disciplined experimentation system, teams waste budgets on unproven assumptions, scale losers prematurely, and repeat the same mistakes quarter after quarter.
Why This Matters
In F2P UA, where cohort maturity takes weeks to reveal true LTV, ad platforms reward precision. Poor testing leads to creative fatigue within days, inflating CPI by 20-50%. Studios that run 4+ high-quality experiments per month will eventually find the local maximum for their game. Studios that rely on "creative genius" will eventually run out of ideas — and cash.
The "Testing" vs. "Analyzing" Distinction
This guide focuses exclusively on Testing — the prospective, controlled application of variables to measure causal impact on future performance.
| Dimension | Testing (This Guide) | Analyzing (Guide 2) |
|---|---|---|
| Orientation | Prospective | Retrospective |
| Method | Controlled experiments | Observational analysis |
| Output | Causal inference | Correlative insight |
| Example | "If we change the first 3 seconds of this video, how does D1 ROAS change?" | "Why did our ROAS drop last Tuesday?" |
Core Principles
- Isolate One Variable: If you change the creative AND the targeting simultaneously, you have learned nothing. Multi-variable tests require factorial designs that demand 4x the sample size.
- Statistical Rigor Before Decisions: We do not call winners based on "gut feel" or "early trends." We wait for the minimum sample size and requisite confidence level.
- Document Failures: Negative results are as valuable as positive ones. An undocumented failure guarantees someone will repeat it.
- Platform Awareness: Treat iOS and Android as two different games. Test platforms separately to avoid confounding.
F2P-Specific Considerations
- Cohort Maturity: Early signals (CPI, CTR) are often decoupled from long-term value. Wait for full attribution windows — 7-day click minimum, 28-day for LTV signals. Immature cohorts overestimate uplift by 10-20%.
- Attribution Windows: iOS attribution is transitioning from SKAdNetwork (SKAN) to AdAttributionKit (AAK). Both create elongated feedback loops. Android offers fuller attribution but carries 10-15% invalid traffic risk.
- The Power Law of Creatives: 95% of creatives fail. The top creative typically drives 30-40% of all conversions. Your system must be designed to fail fast and scale the 5% that work.
- Platform Behavior Differences: iOS users skew premium (higher ARPU, 5-10% better retention). Android volumes are roughly 2x but CPI is 20% lower. Never combine platforms in a single test arm.
| Principle | F2P Application | Risk of Violation |
|---|---|---|
| One Variable | Test hook text only, not entire creative | Confounded results; 2x test duration wasted |
| Statistical Rigor | Minimum 300 conversions per variant | False positives; wasted scale budget |
| Cohort Maturity | 28-day window for LTV metrics | Overestimate ROAS by 15-25% |
| Platform Split | Separate iOS/Android test arms | Skewed metrics; Android fraud inflates installs |
| Document Failures | Log every negative result | Same paywall flop tested twice, 6 months apart |
2. Test Taxonomy (MECE)
Every experiment must fall into exactly one of the following ten mutually exclusive categories. For each, we define the hypothesis template, key metrics, minimum sample sizes (80% power, 5% significance), typical duration, and success criteria.
2.1 Creative Testing
Definition: Testing visual and auditory ad assets — static vs. video vs. UGC, hook variants (first 3 seconds), thumbnail A/B variants, format variations.
Hypothesis Template: "If we replace the 'cinematic reveal' hook with a 'fail gameplay' hook, then CTR will increase by 20% because it triggers a 'fix-it' psychological impulse in casual puzzle players."
Key Metrics: Primary: CTR, IPM, Install Rate. Guardrails: CPI, D1 Retention, D1 ROAS.
Sample Sizes: CTR baseline 1-5% (median ~2.5%). For 10% relative MDE: ~78,000 impressions/variant. For 20%: ~20,000.
Duration: 3-7 days. Monitor for minimum 50 conversions before any directional read.
Success Criteria: CTR uplift >15% AND no degradation in D1 retention >5%. Scale if CPI improves or holds. Kill if CPI spikes >20%.
Creatives account for 30-40% of conversion variance in F2P UA. The top 20% of creatives drive 80% of conversions. Weekly testing of 10-20 variants is essential to combat fatigue and find the next winner.
2.2 Audience & Targeting Testing
Definition: Testing who sees the ads — LAL seed quality, interest stacking, broad vs. narrow targeting, geographic expansion.
Hypothesis Template: "If we move from a 1% LAL based on 'Installs' to a 5% LAL based on 'Highest IAP Spenders,' then D7 ROAS will increase by 15%."
Key Metrics: Primary: Install Rate, Trial Completion, D7 ROAS. Guardrails: CPI, D1 Retention, ARPU.
Sample Sizes: Install-to-trial baseline 5-15% (median ~10%). For 10% MDE: ~15,500 users/arm. For 20%: ~3,875.
Duration: 7-14 days. Geo expansions need longer for fraud detection.
Success Criteria: 15%+ trial uplift AND CPI < baseline + 10%. Guard: No D30 retention drop >3%.
2.3 Bid Strategy Testing
Definition: Testing the algorithm's optimization goal and price ceiling — tCPA, tROAS, MAI, bid caps, budget allocation.
Hypothesis Template: "If we switch from tCPA at $3.00 to tROAS at 120%, then D30 ROAS will improve by 25% because the algorithm will prioritize high-value conversions."
Key Metrics: Primary: ROAS (D7/D30), Effective CPI. Guardrails: Install Volume, Paywall CVR, Budget Utilization.
Sample Sizes: ROAS variance is high; require 500 conversions per arm minimum for 15% relative MDE.
Duration: 14-28 days. LTV signals need time to stabilize past the platform learning phase.
Success Criteria: ROAS improvement >1.2x baseline AND volume stable within +/-10%. The algorithm must be able to spend the full daily budget — a "better ROAS" at 30% of prior volume is not a win.
2.4 Campaign Structure Testing
Definition: Testing architectural setup within the ad manager — consolidation vs. segmentation, AEO vs. VO, CBO vs. ABO.
Hypothesis Template: "If we consolidate 5 separate ad sets into one broad CBO structure, the algorithm will find a lower blended CPI through increased liquidity."
Key Metrics: Primary: CPI, Scale Efficiency, CPM Stability. Guardrails: ROAS, Retention, Learning Phase duration.
Sample Sizes: 1,000 installs per structure for 10% relative MDE on CPI.
Duration: 7-21 days. Account for platform learning phase (3-5 days) before measurement.
Success Criteria: 10% CPI reduction AND ROAS stable. Guard: D1 retention stays above 25%.
Poor campaign structure halves efficiency. Over-segmentation starves algorithms of data; over-consolidation dilutes signal. Below $500/day, consolidation almost always wins.
2.5 Paywall & Pricing Testing
Definition: Testing the monetization gate and price elasticity — price points, trial lengths, discount depth, paywall timing/placement, copy/design.
Hypothesis Template: "If we offer a $4.99 weekly subscription with a 3-day trial instead of $9.99 monthly, trial-to-paid conversion will increase by 25%."
Key Metrics: Primary: Paywall CVR, Trial Start Rate, Trial-to-Paid. Guardrails: D7 Retention, ARPU, Churn, Total Net Revenue.
Sample Sizes: Paywall CVR baseline 2-5% (median ~3%). For 20% MDE at 5% base: ~7,850 users/variant. One of the most sample-hungry test types.
Duration: 14-28 days. Trial length directly extends measurement.
Success Criteria: CVR improvement that translates to higher total net revenue (accounting for platform fees and churn). Guard: D7 retention drop <5%.
2.6 Onboarding Flow Testing
Definition: Testing FTUE up to first meaningful action — step count, personalization depth, value proposition messaging, commitment devices.
Hypothesis Template: "If we reduce onboarding from 5 to 3 steps by deferring account creation, D1 retention will increase by 10%."
Key Metrics: Primary: Completion Rate, Time to First Play, D1 Retention. Guardrails: Trial Start Rate, D7 Retention.
Sample Sizes: Completion baseline 45-55%. For 10% MDE: ~600 users/flow.
Duration: 3-7 days. D1 retention results arrive within 48 hours.
Success Criteria: Completion >50% AND D1 retention >28%. Guard: no drop in downstream conversion.
2.7 Monetization Surface Testing
Definition: Testing placement and frequency of revenue touchpoints — energy/lives cap, ad load, IAP placement, subscription vs. IAP mix, rewarded video.
Hypothesis Template: "If we lower the energy cap from 10 to 5 but add a rewarded video option, ARPDAU will increase by 15%."
Key Metrics: Primary: ARPU, ARPDAU, Ad Revenue/User, IAP CVR. Guardrails: Session Length, D7 Retention, Churn.
Sample Sizes: ARPU has high variance; 1,000 users/arm minimum for 15% MDE.
Duration: 21-42 days. Revenue cycles are long; IAP shows delayed effects.
Success Criteria: ARPU increase >10% AND churn increase <5%. Guard: session length and D7 retention stable.
Energy Systems Are Dual-Purpose
Energy/lives systems are simultaneously monetization mechanics AND retention mechanics. A lower cap increases monetization pressure but also gates content, which can improve session pacing. If fewer than 5% of users ever recharge energy, the cap is too high — it is not functioning as a monetization surface.
2.8 Retention Mechanic Testing
Definition: Testing engagement loops — push notification timing/content, streak systems, daily reward calibration, re-engagement campaigns.
Hypothesis Template: "If we send a push at the user's historical peak play time instead of fixed 9:00 AM, D7 retention will improve by 5%."
Key Metrics: Primary: D7/D14/D30 Retention, Re-engagement Rate, Push Open Rate. Guardrails: Uninstall Rate, Push Opt-out Rate.
Sample Sizes: D1 retention baseline 26-28%. For 10% MDE: ~2,000 users/arm.
Duration: 7-30 days. D30 retention obviously requires 30 days of measurement.
Success Criteria: Retention improvement >10% AND opt-in >40% iOS. Guard: no increase in uninstall rate.
2.9 Landing Page / App Store (ASO) Testing
Definition: Testing App/Play Store presence — screenshot order, preview video, description copy, icon variants, custom product pages.
Hypothesis Template: "If we use screenshots featuring multiplayer elements instead of solo play, store CVR will increase by 12%."
Key Metrics: Primary: Page View to Install CVR, Impressions-to-Install Rate. Guardrails: Keyword Rankings, Paid Cannibalization Rate.
Sample Sizes: Organic CVR baseline 25-30%. For 10% MDE: ~2,000 page views/variant.
Duration: 14-30 days. ASO changes ramp slowly as store algorithms adjust.
Success Criteria: CVR improvement >10% AND rankings stable. Guard: no cannibalization of paid installs.
2.10 LiveOps & Seasonal Event Testing
Definition: Testing time-limited content parameters — event reward structures, durations, Battle Pass pricing/tiers, LTOs, seasonal cadence.
Hypothesis Template: "If we shorten the seasonal event from 14 to 7 days with the same reward pool, participation will increase by 20% due to urgency."
Key Metrics: Primary: Event Participation, Completion Rate, Event Revenue/Participant, Battle Pass Attach Rate. Guardrails: Post-event D7 Retention, Non-Event Session Length, Post-event Churn.
Sample Sizes: Participation baseline 15-30% (median ~20% DAU). For 15% MDE: ~3,500 DAU/variant. Battle Pass attach (5-8% baseline): ~6,000 DAU/variant.
Duration: One full event cycle + 7-day post-event observation. Weekly events: 14 days. Monthly events: 37 days.
Success Criteria: Event revenue/participant increase >15% AND post-event D7 retention drop <3%. Guard: non-event session length stable.
LiveOps Key Considerations
Reward Calibration: Over-generous rewards devalue the core economy; under-generous produce low participation. Test reward tiers independently from magnitude.
Duration vs. Intensity: Shorter events increase urgency but exclude casual players. Test duration separately from reward structure.
Battle Pass Pricing: At $4.99, attach rates of 8-12% are common; at $9.99, 3-5%. Always test — total revenue depends on game-specific price sensitivity.
LTO Frequency: More than 2 LTOs/week creates "deal fatigue." Test cadence with full-price IAP revenue as a guardrail.
Test Taxonomy Summary
| Category | ID Code | Primary Metric | Duration | Min. Conversions |
|---|---|---|---|---|
| Creative | CREATIVE | CTR, IPM | 3-7 days | 300+/variant |
| Audience/Targeting | AUD | Install Rate, D7 ROAS | 7-14 days | 500+/variant |
| Bid Strategy | BID | ROAS, CPI | 14-28 days | 500+/variant |
| Campaign Structure | CAMP | CPI, Scale Efficiency | 7-21 days | 1,000+ installs |
| Paywall/Pricing | PAYWALL | Paywall CVR, Trial-to-Paid | 14-28 days | 800+/variant |
| Onboarding Flow | ONBOARD | Completion Rate, D1 Ret. | 3-7 days | 600+/variant |
| Monetization Surface | MONET | ARPU, ARPDAU | 21-42 days | 1,000+/variant |
| Retention Mechanic | RETAIN | D7/D30 Retention | 7-30 days | 2,000+/variant |
| Landing Page / ASO | ASO | Store CVR | 14-30 days | 2,000+ views |
| LiveOps / Seasonal | LIVEOPS | Event Revenue, Attach Rate | 14-37 days | 3,500+ DAU |
3. Statistical Framework (F2P-Specific)
Why This Section Exists
We have seen numerous portfolio companies scale "winning" creatives based on 50 conversions, only to see ROAS collapse at scale. The "winner" was a statistical fluke — regression to the mean is inevitable with small samples.
The Base-Rate-Aware Sample Size Formula
To calculate the number of users required per variant (n):
n = 2 × (Zα/2 + Zβ)² × p(1 − p) / d²
Where: Zα/2 = 1.96 (95% confidence), Zβ = 0.84 (80% power), p = baseline rate, d = absolute MDE. For the small base rates typical of UA metrics, p(1 − p) ≈ p.
Base Rate Matters Enormously
A 10% relative lift on a 2% base rate (d = 0.002) requires 16x more samples than a 10% relative lift on a 30% base rate (d = 0.03). Low-base-rate metrics like CTR and paywall CVR require massive sample sizes.
Worked example: A puzzle RPG has paywall CVR of 5%. The team wants to detect a 20% relative improvement (5% → 6%). Absolute MDE d = 0.01. Plugging in: n ≈ 2 × (2.8)² × 0.05 × 0.95 / 0.01² ≈ 7,450 users per variant (the lookup table below lists ~7,850, a slightly more conservative figure).
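A minimal sketch of this calculation in Python, using the standard two-proportion approximation (the `required_n` helper is illustrative, not part of any tooling described here); results land in the same range as the lookup table below, with small differences from rounding choices:

```python
from math import ceil

def required_n(p: float, relative_mde: float,
               z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Users (or impressions) needed per variant to detect a relative lift on a
    baseline rate p at 95% confidence / 80% power (two-proportion approximation)."""
    d = p * relative_mde                                   # absolute MDE
    return ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / d ** 2)

# Worked example above: 5% paywall CVR, 20% relative MDE (5% -> 6%)
print(required_n(0.05, 0.20))   # ~7,450 per variant
# Low base rates explode the requirement: 2% CTR at the same relative MDE
print(required_n(0.02, 0.20))   # ~19,200 per variant
```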
Sample Size Lookup Table
All values per variant. 95% confidence, 80% power.
| Metric | Base Rate (p) | MDE 10% Rel. | MDE 20% Rel. | MDE 50% Rel. |
|---|---|---|---|---|
| CTR | 2% | 78,400 | 19,600 | 3,140 |
| Paywall CVR | 5% | 31,400 | 7,850 | 1,250 |
| Install-to-Trial | 10% | 15,500 | 3,875 | 620 |
| D1 Retention | 27% | 12,900 | 3,225 | 515 |
| Trial-to-Paid | 30% | 11,650 | 2,910 | 465 |
| Onboarding Completion | 50% | 8,400 | 2,100 | 335 |
iOS Attribution: SKAN 4 and AdAttributionKit (AAK)
iOS attribution is in transition. SKAN 4 remains operational but AdAttributionKit (AAK) is the strategic successor.
| Framework | Multiplier on Base n | Rationale |
|---|---|---|
| Android (full attribution) | 1.0x | Baseline — full deterministic attribution |
| SKAN 4 (high volume) | 1.3x | Crowd anonymity met; fine-grained values but delayed |
| SKAN 4 (low volume) | 1.5-2.0x | Coarse values only; significant measurement noise |
| AAK (early adoption) | 1.3-1.5x | Similar privacy tiers; re-engagement data available but network support varies |
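A small, hypothetical helper for applying these multipliers to Android-baseline sample sizes (ranges from the table are represented by their midpoints):

```python
from math import ceil

ATTRIBUTION_MULTIPLIER = {
    "android": 1.0,            # full deterministic attribution (baseline)
    "skan4_high_volume": 1.3,
    "skan4_low_volume": 1.75,  # midpoint of the 1.5-2.0x range
    "aak_early": 1.4,          # midpoint of the 1.3-1.5x range
}

def ios_adjusted_n(android_n: int, framework: str) -> int:
    """Inflate a sample size computed for full attribution to cover iOS measurement noise."""
    return ceil(android_n * ATTRIBUTION_MULTIPLIER[framework])

print(ios_adjusted_n(7_850, "skan4_low_volume"))   # ~13,738
```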
✓ DO
- Run creative tests on Android first, port winners to iOS
- Use 1.3-1.5x multiplier for iOS ROAS tests
- Ensure MMP supports AAK postbacks
- Configure developer-mode postbacks for test campaigns
✗ DON'T
- Trust iOS creative-to-purchase attribution for small tests
- Use Android sample sizes for iOS tests
- Ignore AAK transition in test infrastructure
- Combine platforms in a single test arm
Sequential Testing and Early Stopping
F2P data accumulates slowly. We employ Bayesian sequential testing with the following stopping rules (a code sketch follows the list):
- Early Stop (Negative): Posterior probability drops below 5% AND ≥30% of planned n reached → stop. Clear loser.
- Early Stop (Positive): Posterior exceeds 95% for three consecutive periods AND ≥50% of planned n → may call early winner.
- Mandatory Cap: Never save more than 20% of planned runtime through early stopping.
- Never Stop Between: If probability is 5-95%, continue to planned n. The "gray zone" is where most false conclusions live.
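A minimal sketch of the approach, assuming a Beta-Binomial model for a conversion-rate test (the function names and the uninformative Beta(1,1) prior are illustrative choices, not a prescribed implementation):

```python
import numpy as np

def prob_treatment_beats_control(conv_c: int, n_c: int, conv_t: int, n_t: int,
                                 draws: int = 100_000, seed: int = 0) -> float:
    """Posterior P(treatment rate > control rate) under Beta(1,1) priors."""
    rng = np.random.default_rng(seed)
    control = rng.beta(1 + conv_c, 1 + n_c - conv_c, draws)
    treatment = rng.beta(1 + conv_t, 1 + n_t - conv_t, draws)
    return float((treatment > control).mean())

def sequential_decision(prob: float, fraction_of_planned_n: float,
                        consecutive_high_periods: int) -> str:
    """Apply the stopping rules above; anything in the gray zone keeps running."""
    if prob < 0.05 and fraction_of_planned_n >= 0.30:
        return "STOP_NEGATIVE"
    if prob > 0.95 and consecutive_high_periods >= 3 and fraction_of_planned_n >= 0.50:
        return "STOP_POSITIVE"
    return "CONTINUE"

p = prob_treatment_beats_control(conv_c=180, n_c=4_000, conv_t=240, n_t=4_000)
print(round(p, 3), sequential_decision(p, fraction_of_planned_n=0.55,
                                       consecutive_high_periods=3))
```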
Multi-Arm Bandit Considerations
| Approach | When to Use | F2P Application |
|---|---|---|
| Fixed-Horizon A/B | Low volume, high stakes | Paywall, monetization changes |
| Multi-Arm Bandit | High volume, many variants | Creative rotation with 10+ variants |
| Explore/Exploit | Need to minimize regret during test | Scale winning creative while testing new hooks |
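For the bandit row, a minimal Thompson-sampling sketch for creative rotation; the allocation gives each variant next-period traffic in proportion to how often its posterior draw wins (names and priors are illustrative):

```python
import numpy as np

def thompson_allocation(conversions: list[int], impressions: list[int],
                        draws: int = 10_000, seed: int = 0) -> np.ndarray:
    """Share of next-period traffic each creative should receive, based on how
    often its Beta posterior produces the highest conversion rate."""
    rng = np.random.default_rng(seed)
    samples = np.column_stack([
        rng.beta(1 + c, 1 + n - c, draws)
        for c, n in zip(conversions, impressions)
    ])
    wins = np.bincount(samples.argmax(axis=1), minlength=len(conversions))
    return wins / draws

# Three creative variants: (conversions, impressions)
print(thompson_allocation([40, 55, 31], [2_000, 2_100, 1_900]))
```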
CUPED: Variance Reduction for Product Tests
How CUPED Works
For product experiments (onboarding, monetization, retention), adjust each user's outcome by their pre-experiment behavior (session count, device type, channel). This removes "noise," reducing required sample sizes by 20-40%. Not applicable to creative tests where users have no pre-experiment relationship.
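A minimal sketch of the adjustment on simulated data (the covariate choice and the simulated relationship are illustrative):

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED adjustment: remove the variance in outcome y that is explained by a
    pre-experiment covariate x (same mean, lower variance, smaller required n)."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(1)
pre_sessions = rng.poisson(5, 5_000).astype(float)        # pre-experiment behavior
revenue = 0.4 * pre_sessions + rng.normal(0, 2, 5_000)    # outcome correlated with it
adjusted = cuped_adjust(revenue, pre_sessions)
print(round(revenue.var(), 2), round(adjusted.var(), 2))  # variance drops after CUPED
```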
Interaction Effects: Avoiding Test Pollution
- Preferred: Audience Isolation. Hash UserIDs to non-overlapping buckets (UserID % 4); see the sketch after this list.
- Acceptable: Sequential Testing. Run Test A to completion, then Test B.
- Last Resort: Document and Model. Run both, document overlap, use ANOVA post-hoc. Accept wider CIs.
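A minimal sketch of the preferred isolation approach; hashing a stable digest of the ID rather than the raw numeric ID is one way to avoid correlation with signup order (the function name is illustrative):

```python
import hashlib

def isolation_bucket(user_id: str, n_buckets: int = 4) -> int:
    """Deterministic, non-overlapping bucket assignment: bucket 0 feeds Test A,
    bucket 1 feeds Test B, and so on. The same user always lands in one bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

print(isolation_bucket("user_12345"), isolation_bucket("user_12345"))  # stable
```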
4. Test Documentation Standard
Every experiment must be documented in a central registry before launch. Undocumented tests repeat failures and waste review time.
Test ID Format
[YYMMDD]-[TYPE]-[NAME]-V[X] (e.g., 260115-CREATIVE-FailGameplay-V1), using the taxonomy ID codes from Section 2.
Required Document Fields
| Section | Contents | Example |
|---|---|---|
| Header | ID, Type, Status, Start/Review Date, Owner | 260115-CREATIVE-FailGameplay-V1, LIVE, Jan 15 |
| Hypothesis | "If [change], then [metric] will [direction] by [magnitude] because [mechanism]" | "If 'fail gameplay' hook, CTR +20% via fix-it impulse" |
| Baseline | Metric values with 95% CI, source, date range | CTR: 2.5% (2.3-2.7%), Google Ads, Dec 1-Jan 14 |
| Control | What stays the same | Standard cinematic ad, US targeting, tCPA $3.00 |
| Treatment | What changes (ONE variable) | Hook replaced with fail-gameplay footage |
| Success Gates | Primary threshold + guardrail constraints | CTR >15% AND CPI ≤+10% AND D1 ret. holds |
| Sample Size | Show the calculation | 2.5% CTR, 20% MDE: n = 19,600/group |
| Max Runtime | Kill date regardless of significance | 14 days OR $10K budget cap |
| Results | Observed values, p-value/posterior, CIs | CTR: 3.1% (2.8-3.4%), p=0.02, Pr=97% |
| Decision | One of five outcomes | SCALE |
| Learnings | What we now know (use Appendix C template) | "Fail-gameplay hooks outperform cinematic for casual puzzle" |
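If the registry is kept in code, an entry might look like the following sketch (field names mirror the table above but are not a prescribed schema):

```python
from dataclasses import dataclass, field

@dataclass
class TestRecord:
    """Minimal registry entry mirroring the required fields above."""
    test_id: str              # [YYMMDD]-[TYPE]-[NAME]-V[X]
    hypothesis: str
    baseline: dict            # metric -> (value, ci_low, ci_high)
    control: str
    treatment: str            # exactly ONE variable changed
    success_gates: dict       # primary threshold + guardrail constraints
    planned_n: int
    max_runtime_days: int
    status: str = "PLANNED"
    results: dict = field(default_factory=dict)
    decision: str = ""
    learnings: str = ""

record = TestRecord(
    test_id="260115-CREATIVE-FailGameplay-V1",
    hypothesis="If 'fail gameplay' hook, CTR +20% via fix-it impulse",
    baseline={"ctr": (0.025, 0.023, 0.027)},
    control="Standard cinematic ad, US targeting, tCPA $3.00",
    treatment="Hook replaced with fail-gameplay footage",
    success_gates={"ctr_uplift_min": 0.15, "cpi_increase_max": 0.10},
    planned_n=19_600,
    max_runtime_days=14,
)
```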
Decision Definitions
| Decision | Meaning | Action |
|---|---|---|
| SCALE | Treatment wins decisively | Roll to 100%, increase budget |
| ADOPT | Treatment is better | Treatment becomes new default |
| EXTEND | Promising but not yet significant | Continue to full n |
| REVERT | Treatment loses or harms guardrails | Return to control immediately |
| INCONCLUSIVE | No detectable difference | Document, archive, redesign test |
5. Test Lifecycle
The lifecycle ensures no test is forgotten, no budget wasted on zombie experiments, and every test produces a documented outcome.
State Workflow
PLANNED
Document complete, peer reviewed, budget allocated, audience isolation confirmed.
LIVE
Spend confirmed for >48 hours, no technical issues, tracking verified.
MEASURING
Wait for target n (sequential monitoring active). Zombie rule: kill if >21 days without reaching n.
SIGNAL
Sample size reached. Compute p-value / Bayesian posterior. Positive (>80%), Negative (<20%), or Neutral (20-80%).
DECISION
Statistical output reviewed, guardrails checked. Outcome: SCALE / ADOPT / REVERT / EXTEND / INCONCLUSIVE.
ARCHIVED
Learnings documented (Appendix C template), stakeholders notified, registry updated.
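A minimal sketch of the workflow as a state machine (kill-switch pauses and reverts are handled by the review process and are not modeled here):

```python
LIFECYCLE_TRANSITIONS = {
    "PLANNED":   {"LIVE"},
    "LIVE":      {"MEASURING"},
    "MEASURING": {"SIGNAL", "DECISION"},   # DECISION directly via the zombie rule (INCONCLUSIVE)
    "SIGNAL":    {"DECISION"},
    "DECISION":  {"ARCHIVED"},
    "ARCHIVED":  set(),
}

def advance(current: str, target: str) -> str:
    """Move a test to the next stage, rejecting transitions the workflow forbids."""
    if target not in LIFECYCLE_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current} -> {target}")
    return target

print(advance("MEASURING", "SIGNAL"))   # OK once target n is reached
```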
Escalation Rules
Kill Switch Triggers
CPI Spike: CPI >40% above baseline in first 48 hours → auto-pause for emergency review.
Budget Breach: Spend exceeds 150% of plan without reaching n → pause and evaluate.
Zombie Rule: Any test in MEASURING for >21 days without target n → kill as INCONCLUSIVE.
Guardrail Breach: Any guardrail degrades beyond threshold → immediate review, likely REVERT.
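A minimal evaluation sketch of these triggers (the metric keys and function name are illustrative, not a production schema):

```python
def kill_switch(metrics: dict, baseline_cpi: float,
                planned_spend: float, planned_n: int) -> list[str]:
    """Evaluate the kill-switch triggers above against a test's current metrics."""
    alerts = []
    if metrics["hours_live"] <= 48 and metrics["cpi"] > 1.40 * baseline_cpi:
        alerts.append("CPI_SPIKE: auto-pause for emergency review")
    if metrics["spend"] > 1.50 * planned_spend and metrics["conversions"] < planned_n:
        alerts.append("BUDGET_BREACH: pause and evaluate")
    if metrics["days_in_measuring"] > 21 and metrics["conversions"] < planned_n:
        alerts.append("ZOMBIE: kill as INCONCLUSIVE")
    if metrics.get("guardrail_breached"):
        alerts.append("GUARDRAIL_BREACH: immediate review, likely REVERT")
    return alerts
```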
Review Cadence
| Cadence | Activity | Participants |
|---|---|---|
| Daily | Automated scanner report (spend, conversions, probability) | UA Analyst (async) |
| Weekly | Status scan of all active tests, flag approaching n or timeline | UA Lead + Analyst |
| Bi-weekly | Deep review: full stats, guardrails, decisions on tests at SIGNAL | Review Committee |
6. Portfolio-Level Test Orchestration
Managing experimentation across 42 portfolio companies requires a prioritization framework that accounts for lifecycle stage, available resources, and test category urgency.
Test Priority Matrix by Company Lifecycle
| Test Category | Pre-PMF (to D90) | Scaling ($50K-$500K/mo) | Mature ($500K+/mo) |
|---|---|---|---|
| ONBOARD | Required | If below median | Only if regressed |
| CREATIVE | Required | Required | Required |
| PAYWALL | If monetization live | Required | Required |
| AUD | Not recommended | Required | Required |
| BID | Not recommended | Required | Required |
| CAMP | Not recommended | If >$200K/mo | Required |
| MONET | Not recommended | If D30 data available | Required |
| RETAIN | If D7 data available | Required | Required |
| ASO | If launched | If organic >20% | Required |
| LIVEOPS | Not recommended | If cadence established | Required |
Resource Conflict Resolution (Priority Stack)
Active Guardrail Breach
Any company with a live test showing guardrail degradation gets immediate analyst attention.
Tests at SIGNAL Stage
Tests that reached target n need statistical review. Delayed decisions waste the budget already spent.
Higher Lifecycle Stage
Mature companies generate more absolute value per test improvement than pre-PMF companies.
Test ROI Score
Use Appendix B calculator to compare expected value when lifecycle stages are equal.
First-Come, First-Served
Tiebreaker when all other factors are equal.
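A sketch of the stack as a sort key (field names and lifecycle labels are hypothetical; adapt to the registry schema):

```python
LIFECYCLE_RANK = {"mature": 2, "scaling": 1, "pre_pmf": 0}

def priority_key(test: dict) -> tuple:
    """Sort key implementing the priority stack above; lower tuples sort first."""
    return (
        not test["guardrail_breach"],          # 1. active guardrail breaches first
        test["status"] != "SIGNAL",            # 2. tests awaiting a decision next
        -LIFECYCLE_RANK[test["lifecycle"]],    # 3. higher lifecycle stage
        -test["roi_score"],                    # 4. higher Test ROI Score (Appendix B)
        test["submitted_order"],               # 5. first-come, first-served tiebreak
    )

tests = [
    {"id": "A", "guardrail_breach": False, "status": "SIGNAL",
     "lifecycle": "scaling", "roi_score": 9_400, "submitted_order": 2},
    {"id": "B", "guardrail_breach": True, "status": "MEASURING",
     "lifecycle": "pre_pmf", "roi_score": 1_000, "submitted_order": 1},
]
print([t["id"] for t in sorted(tests, key=priority_key)])   # ['B', 'A']
```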
7. Automation & Tooling
The experimentation system is supported by a Claude Code automation layer that keeps testing manageable at portfolio scale.
Scanner Data Sources
| Data Source | What It Queries | Test Types Served |
|---|---|---|
| Google Ads | Impressions, clicks, CTR, CPI, conversions | CREATIVE, AUD, BID, CAMP |
| Meta Ads | ROAS by placement, breakdowns by creative | CREATIVE, AUD, CAMP |
| AppsFlyer | Attribution cohorts, D1/D7/D30 retention | All types |
| Stripe / RevenueCat | Subscription events, trial-to-paid, churn | PAYWALL, MONET, LIVEOPS |
| GA4 / BigQuery | Onboarding funnels, session data | ONBOARD, RETAIN, MONET, LIVEOPS |
Daily Scanner Output Example
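A hypothetical record illustrating the fields the daily scanner tracks (spend, conversions, posterior probability); the field names and values are assumptions, not the production report format:

```python
scanner_row = {
    "test_id": "260115-CREATIVE-FailGameplay-V1",
    "status": "MEASURING",
    "spend_to_date": 4_200.00,
    "impressions": 8_450,
    "installs": 212,
    "planned_n": 19_600,
    "fraction_of_planned_n": 0.43,
    "posterior_prob_treatment_wins": 0.81,
    "guardrails_ok": True,
    "days_live": 6,
    "flag": None,   # set when a kill-switch trigger or the zombie rule fires
}
```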
External AI Review Integration
| AI Model | Function | Application |
|---|---|---|
| Gemini | Creative pattern analysis | Visual patterns in winning/losing video files |
| Grok | Market research | Competitor ad libraries and emerging trends |
| Claude | Statistical review | Sample size validation and interaction effect flagging |
8. Anti-Patterns (Learned from Real Portfolio Work)
These patterns have been observed across portfolio companies and have cost real money. Every one is preventable with this system.
8.1 The "Early Winner" Fallacy
The mistake: Calling a creative a winner because it had $0.50 CPI on day 1 with 15 installs.
❌ What Goes Wrong
Regression to the mean is inevitable. A creative that looks 40% better with 50 conversions has a coin-flip chance of being worse at 500.
✔ The Rule
Never make a scaling decision with fewer than 300 conversion events per variant.
8.2 Selection Bias Masquerading as Causation
The mistake: "Users who complete 6 lessons convert at 97%. Let's force all users through 6 lessons."
❌ What Goes Wrong
Users who completed 6 lessons arrived with intent. The lessons did not create that intent. Forcing unmotivated users will cause churn, not conversion.
✔ The Rule
Use randomized A/B tests. Compare "offered 6 lessons" vs. "offered 3" with random assignment.
8.3 The Push Notification Illusion
The mistake: "Users with push enabled have 3x LTV. Force push opt-in."
❌ What Goes Wrong
Push permission is an intent signal, not a causal lever. Forcing opt-in increases uninstall rates by 8-12%.
✔ The Rule
Test push opt-in timing as a value-add, not a gate. Measure uninstall rate as guardrail.
8.4 Creative Testing on iOS Without ATT Accounting
The mistake: Relying on Meta/Google reported "Purchases" for iOS creative tests, then scaling the "winner."
❌ What Goes Wrong
Due to ATT, iOS creative-to-purchase attribution is heavily modeled and wrong for small-scale tests below ~500 conversions.
✔ The Rule
Use Android as the "lab." Identify winners with full attribution, then port to iOS for scaling.
8.5 Volume Over Value in Creative Format Selection
The mistake: Scaling UGC video because it drives the most impressions and installs at lowest CPM.
❌ What Goes Wrong
When measured on ROAS rather than volume, static ads sometimes convert at 3.5x the rate of video. A high-volume, low-ROAS creative is actively losing money at scale.
✔ The Rule
Always evaluate creatives on ROAS or LTV, not volume. Use sample size table to compare ROAS, not just installs.
8.6 Over-Engineering the Solved Funnel
The mistake: Running elaborate A/B tests on onboarding when completion is 92-94% (top decile).
❌ What Goes Wrong
Chasing 1% gain in a solved funnel while D7 retention sits at 5%. Massive diminishing returns.
✔ The Rule
Use benchmarks (Section 9) to prioritize. Only test funnels below median. If onboarding is excellent, the problem is downstream.
Additional Anti-Patterns (8.7-8.11)
8.7 Energy Systems as Pure Monetization: Energy caps serve dual monetization+retention roles. Optimizing ARPU alone produces short-term spike then 20-30% D30 retention drop.
8.8 The Undocumented Failure: Negative results not logged = same test repeated 6 months later. Portfolio-wide cost: $50K-$100K/year in redundant testing.
8.9 Overlapping Tests Without Isolation: Two overlapping tests = four conditions, two measured. Both results contaminated. Cost: $30K-$150K per incident.
8.10 Organic vs. Paid Cohort Comparison: Organic users self-select with higher intent. Never use organic as control for paid UA tests.
8.11 ARPU/ROAS as Product Quality Signals: Both contaminated by bid strategy, channel mix, geo mix, user quality. Use Causal Attribution Analyzer to decompose.
Portfolio-Wide Impact
Each anti-pattern costs $15K-$150K per incident. Across 42 companies, anti-pattern violations are conservatively estimated at $300K-$500K per year in aggregate waste.
9. F2P Benchmarks Reference Table
Use these ranges to contextualize test results and prioritize what to test. If a metric is in the "Excellent" range, move testing effort to a category with larger upside.
Top-of-Funnel (UA Metrics)
| Metric | Median (Good) | Excellent (Top Decile) | Test If Below |
|---|---|---|---|
| CTR (Video) | 1.2-1.8% | 3.5%+ | 1.0% |
| CTR (Static) | 0.8-1.2% | 2.5%+ | 0.7% |
| IPM | 8-12 | 25+ | 5 |
| Store CVR (Organic) | 25-30% | 45%+ | 20% |
| Store CVR (Paid) | 15-20% | 35%+ | 12% |
Product & Engagement
| Metric | Median (Good) | Excellent (Top Decile) | Test If Below |
|---|---|---|---|
| Onboarding Completion | 50-60% | 85%+ | 45% |
| D1 Retention | 25-32% | 45%+ | 25% |
| D7 Retention | 8-12% | 18%+ | 7% |
| D30 Retention | 3-5% | 8%+ | 2% |
| Push Opt-in (iOS) | 40-56% | 60%+ | 35% |
| Push Opt-in (Android) | 70-80% | 90%+ | 60% |
Monetization
| Metric | Median (Good) | Excellent (Top Decile) | Test If Below |
|---|---|---|---|
| Paywall CVR (Soft) | 2-4% | 10%+ | 2% |
| Trial-to-Paid | 25-35% | 50%+ | 20% |
| ARPDAU (Casual) | $0.05-$0.12 | $0.25+ | $0.04 |
| ARPDAU (Mid-Core) | $0.10-$0.25 | $0.50+ | $0.08 |
| Energy Recharge Rate | 5-10% ever | 15%+ | 5% (cap too high) |
LiveOps & Events
| Metric | Median (Good) | Excellent (Top Decile) | Test If Below |
|---|---|---|---|
| Event Participation (% DAU) | 15-25% | 40%+ | 12% |
| Event Completion Rate | 20-35% | 50%+ | 15% |
| BP Attach ($4.99) | 5-8% | 12%+ | 4% |
| BP Attach ($9.99) | 3-5% | 8%+ | 2% |
| Post-Event D7 Ret. Delta | -1% to +2% | +3%+ | -3% (draining) |
Platform Fee Context
< $1M Annual Revenue
Platform Fee: 15% (Small Business Program)
Breakeven ROAS: 118%
> $1M Annual Revenue
Platform Fee: 30%
Breakeven ROAS: 143%
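Both figures follow from Breakeven ROAS = 1 / (1 - platform fee): 1 / 0.85 ≈ 118% and 1 / 0.70 ≈ 143%.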
A game with 120% D30 ROAS is marginally profitable at <$1M revenue but falls below breakeven at >$1M. Always check the company's current revenue tier before interpreting ROAS results.
10. Cross-Reference to Other Guides
| Guide | Relationship | When to Use |
|---|---|---|
| Portfolio Company Analysis Playbook | Post-hoc analysis of test results | After a test completes, for broader context |
| ROAS Analysis Pipeline | LTV projection methodology | When defining ROAS-based success criteria |
| Statistical Significance Framework | Detailed sample size calculations | When Section 3 lookup table is insufficient |
| F2P Marketing Analysis Framework | Spend allocation decisions | When results indicate winning channel/geo |
| Causal Attribution Analyzer | Decomposing product vs. marketing effects | When confounded by simultaneous product changes |
Appendix A: Quick-Start Checklist for UA Managers
Pre-Launch Verification
- Test ID formatted correctly: [YYMMDD]-[TYPE]-[NAME]-V[X]
- Exactly ONE variable changed between control and treatment
- Sample size n calculated using base rate formula (not guessed)
- Both a primary metric and at least one guardrail metric defined
- Maximum runtime and kill criteria documented
- Running on Android if creative attribution is required
- Audience isolation confirmed (no overlap with other active tests)
- Hypothesis documented in the test registry
- Automated scanner configured to track the test
- Metric being tested is below "Excellent" benchmark (otherwise test something else)
- Test category appropriate for company's lifecycle stage (Section 6 matrix)
- Test ROI estimated using Appendix B (for resource prioritization)
The VC Perspective
We do not invest in "magic." We invest in systems. A studio that can consistently run 4 high-quality experiments per month will eventually find the local maximum for their game. A studio that relies on creative genius will eventually run out of ideas — and cash.
Appendix B: Test ROI Calculator
Formula
Test Value = (Potential Uplift % × Affected Monthly Revenue × Probability of Success × 12) - Test Cost
Potential Uplift % = MDE you're testing for. Affected Monthly Revenue = revenue stream impacted. Probability of Success = historical win rate (default 15% if unknown). 12 = annualization factor. Test Cost = direct + opportunity cost.
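A minimal sketch of the calculation; the inputs below are taken from the worked example later in this appendix:

```python
def test_value(uplift: float, affected_monthly_revenue: float,
               p_success: float, test_cost: float) -> float:
    """Annualized expected value of running a test, per the formula above."""
    return uplift * affected_monthly_revenue * p_success * 12 - test_cost

print(test_value(0.20, 50_000, 0.12, 8_000))    # Option A (paywall): 6,400
print(test_value(0.10, 150_000, 0.08, 5_000))   # Option B (creative hook): 9,400
```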
Historical Win Rates by Category
| Category | Win Rate | Typical Uplift | Notes |
|---|---|---|---|
| CREATIVE | 5-10% | 15-30% CTR | High volume compensates low win rate |
| AUD | 15-20% | 10-20% ROAS | Fewer tests, higher hit rate |
| BID | 20-25% | 10-15% ROAS | Incremental algorithmic improvements |
| PAYWALL | 10-15% | 15-25% CVR | High impact but pricing is sensitive |
| ONBOARD | 20-30% | 10-20% completion | Product-side tests have higher hit rates |
| RETAIN | 15-25% | 5-15% retention | Compounding makes small wins very valuable |
| LIVEOPS | 15-25% | 10-25% event revenue | High variance; seasonal dependencies |
Worked Example
A scaling company ($150K/month spend, $80K/month revenue) choosing between two tests:
Option A: Paywall Pricing Test
Uplift: 20% on $50K/mo subscription
P(success): 12% | Cost: $8K
Value = $14,400 - $8,000 = $6,400
Option B: Creative Hook Test
Uplift: 25% CTR → ~10% CPI reduction on $150K
P(success): 8% | Cost: $5K
Value = $14,400 - $5,000 = $9,400
Decision: Run Option B first (higher expected value: $9,400 vs $6,400)
Self-Triage: Test Value < $0 = not worth running. Test Value < $2,000 = marginal, run only if no better alternatives.
Appendix C: Knowledge Management Template
Use this template for the "Learnings" field in every test document. Standardizing across all 42 portfolio companies ensures institutional memory is searchable and actionable.
Usage Rules
- Every test gets a Learnings entry — including INCONCLUSIVE and REVERT results.
- The "Transferability" section is mandatory. This makes the learning useful beyond the company that ran the test.
- Tags enable portfolio-wide search. "Has anyone tested Battle Pass pricing?" becomes answerable in seconds.
- "What To Do Next" prevents orphaned learnings. Every finding should point to the next action.
