Ronny Kohavi on A/B Testing and Experimentation

Speaker: Ronny Kohavi
Source: Lenny’s Podcast
Date: c. 2022

Ronny Kohavi — regarded as the world expert on A/B testing and experimentation — shares the rigorous framework he developed across Amazon, Microsoft (Bing), and Airbnb. The conversation covers when to start, what to measure, how not to be fooled by your own results, and why trust is the foundational property of any experimentation platform.

Key ideas

Most ideas fail. Failure rates run roughly 66% at Microsoft overall, 85% at Bing, and 92% in Airbnb search. Google Ads, Booking, and others report 80–90%. Every team that starts running experiments is humbled. Expecting success is the universal beginner’s mistake.
The Bing ad title story. Promoting the second line of an ad result to the first line, a backlog item dismissed as a “meh idea”, generated ~$100M in incremental annual revenue. The biggest revenue improvement in Bing’s history came from a few hours of engineering work. The lesson: you are bad at predicting which ideas will win.
The OEC must be causally predictive of lifetime value (LTV). The overall evaluation criterion (OEC) must operationalise long-term value, not just short-term revenue. Maximising ad revenue without guardrail metrics is trivially easy and bad strategy.
Trust is paramount. Optimizely’s early real-time p-value monitoring let users stop a test the moment it looked significant, which inflated false-positive rates from the nominal 5% to ~30% and destroyed trust in results platform-wide. When an experimentation platform loses trust, the entire culture collapses.
Sample ratio mismatch. ~8% of experiments at Microsoft suffered from split imbalances (caused by bots, data pipeline issues, mid-funnel randomisation errors). Until this check is added, many “positive” results are invalid.
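
The peeking problem behind the trust point above can be illustrated with a small simulation. This is a generic sketch of optional stopping, not a reconstruction of Optimizely’s actual procedure: it runs A/A tests (no true effect) on a hypothetical metric with unit variance, checks a z-test after every batch of users, and compares "significant at any look" with "significant at the final look only". The number of looks, batch size, and seed are arbitrary assumptions.

```python
import math
import random
from statistics import NormalDist


def simulate_aa_tests(n_experiments=2000, looks=20, batch=500, alpha=0.05, seed=0):
    """Return (false-positive rate with peeking, false-positive rate at fixed horizon)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        diff_sum = 0.0  # running sum of (treatment - control) over users seen so far
        n = 0           # users per arm seen so far
        crossed = False
        z = 0.0
        for _look in range(looks):
            # Both arms are N(0, 1), so the summed difference over a batch
            # is N(0, 2 * batch). There is no true effect.
            diff_sum += rng.gauss(0.0, math.sqrt(2 * batch))
            n += batch
            z = diff_sum / math.sqrt(2 * n)
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped and declared a win here
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit
    return peeking_hits / n_experiments, fixed_hits / n_experiments


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"any look significant: {peeking_rate:.1%}, final look only: {fixed_rate:.1%}")
```

Under these assumptions the fixed-horizon rate stays near the nominal 5%, while the peeking rate comes out several times higher. The exact inflation depends on how many looks are taken.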

Detailed notes

When to start experimenting

You need “tens of thousands” of users for the statistics to work on most metrics. For a retail-style conversion metric, experiment results become reliable at roughly 200,000 users. Below that threshold, build the culture and platform; above it, test everything.
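
A rough way to see where a threshold of this size comes from is the standard two-sample power calculation. The sketch below is an illustration, not the speaker’s own calculation: the 5% baseline conversion rate and the 5% relative lift are assumed values chosen for the example, and the function name is hypothetical. With a two-sided alpha of 0.05 and 80% power, the multiplier 2 * (z_alpha + z_beta)^2 is about 15.7, often rounded to 16 as a rule of thumb.

```python
import math
from statistics import NormalDist


def users_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect a relative lift in a conversion rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    delta = baseline_rate * relative_lift           # absolute difference to detect
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2)


# Assumed example: 5% conversion, detect a 5% relative change (5.00% -> 5.25%).
n = users_per_variant(baseline_rate=0.05, relative_lift=0.05)
print(f"{n:,} users per variant, {2 * n:,} in total")
```

With these assumed inputs the total lands in the low hundreds of thousands, the same order of magnitude as the ~200,000 figure above. Smaller lifts or noisier metrics push the requirement up quickly, because the sample size scales with 1 / delta^2.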

The OEC

The OEC is not revenue. Revenue is easy to inflate (more ads, worse experience). The OEC must include countervailing metrics — session success rates, time to successful click, churn signals, satisfaction — that are causally linked to lifetime value. If the room cannot agree on whether higher values of the metric are good or bad (the microsoft.com case: “is more time on a support site good or bad?”), the OEC is wrong.

Twyman’s law

Any figure that looks interesting or different is usually wrong. If an experiment shows a 10% gain when typical movements are under 1%, hold the celebration and investigate first. Nine out of ten Twyman’s-law triggers turn out to have a flaw in the experiment. The Bing ad title result survived multiple replications; most apparent home runs do not.

P-values and false positive risk

A p-value of 0.05 does not mean 95% confidence that the treatment is better. It is the probability of seeing a result at least this extreme if the treatment had no effect. Getting from there to the probability that a significant result is a false positive requires Bayes’ rule, with the historical success rate as the prior. At Airbnb (8% success rate), a p < 0.05 result carries a 26% false positive risk, not 5%. Teams in high-failure-rate domains should require p < 0.01 and replication before accepting a result.
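
The Bayes’ rule step can be sketched as a short calculation. Two inputs are assumptions not stated in these notes: 80% statistical power, and counting only significant results in the positive direction as wins (so alpha / 2 is used for the false-positive mass). With those assumptions the 8% success rate reproduces the ~26% figure. The function name is hypothetical.

```python
def false_positive_risk(success_rate, alpha=0.05, power=0.8):
    """P(no real improvement | statistically significant positive result)."""
    # Assumption: only significant movements in the positive direction are
    # treated as wins, so half of the two-sided alpha applies.
    one_sided_alpha = alpha / 2
    false_positives = one_sided_alpha * (1 - success_rate)  # null ideas that look like wins
    true_positives = power * success_rate                    # real wins that are detected
    return false_positives / (false_positives + true_positives)


for label, rate in [("Microsoft overall", 0.34), ("Bing", 0.15), ("Airbnb search", 0.08)]:
    print(f"{label}: {false_positive_risk(rate):.1%}")

# Tightening the threshold to p < 0.01 at an 8% success rate:
print(f"Airbnb search, alpha=0.01: {false_positive_risk(0.08, alpha=0.01):.1%}")
```

The success rates passed in are the complements of the failure rates listed under Key ideas. The lower the prior success rate, the larger the share of "significant" results that are false, which is why a stricter threshold plus replication matters most in high-failure domains.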

Sample ratio mismatch (SRM)

If a 50/50 experiment shows a 50.2/49.8 split in users, that looks harmless, but at the sample sizes these platforms run (well over a million users) it should happen by chance in fewer than one in half a million experiments. An imbalance of this magnitude indicates a broken experiment. Adding an SRM check revealed that ~8% of experiments at Microsoft were invalid. Bots are the most common cause; data pipeline filtering and mid-site randomisation are also common.
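
An SRM check is usually a chi-square goodness-of-fit test on the user counts per variant. The sketch below is a minimal version for two variants, using only the standard library. The user counts are hypothetical, chosen to give a 50.2/49.8 split, and the alert threshold of p < 0.001 is an assumed convention rather than something stated in these notes.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-variant split."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert threshold

# Hypothetical counts: the same 50.2/49.8 split at two very different scales.
for control, treatment in [(803_200, 796_800), (5_020, 4_980)]:
    p = srm_p_value(control, treatment)
    verdict = "SRM, do not trust results" if p < SRM_THRESHOLD else "ok"
    print(f"{control:,} vs {treatment:,}: p = {p:.2e} ({verdict})")
```

The comparison shows why the split percentage alone is not enough: with 1.6 million users a 50.2/49.8 split is wildly improbable, while with 10,000 users it is unremarkable. The check belongs on user counts, run before anyone looks at the metric results.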

Structured narrative over PowerPoint

Borrowed from Amazon: write a structured narrative (six-pager) instead of a slide deck. Commenters annotate the document. The discipline of prose forces clearer thinking, the annotations persist after the meeting, and it serves as institutional memory.

Big bets vs. incremental bets

Experiment portfolios need both. Incremental bets yield the steady 2% annual improvement. Big bets (the 100-person-year Bing social integration; Airbnb Online Experiences) mostly fail but occasionally produce breakthroughs. Taking big bets is not the mistake; the mistake is failing to recognise when a big bet has failed and continuing to invest. Bing’s social integration ran for ~18 months of negative-to-flat experiment results before the abort decision.

Ramesh Johari on Marketplaces and Data Science — Ramesh Johari at Stanford consulted to Optimizely to fix their statistical methodology after Ronny raised concerns
Elena Verna 3.0 on Growth Tactics That Never Work — contrasting view: Elena argues against over-testing everything