Bing Controlled Experimentation Panel - The Hive
1. Why Didn’t My Feature Improve the Metric?
Ya Xu
Based on two papers (KDD’2012 and WSDM’2013) with
Ronny Kohavi, Alex Deng, Toby Walker,
Brian Frasca and Roger Longbotham
Experimentation Panel 3/20/2013
2. What Metric?
• Overall Evaluation Criterion (OEC): metric(s) used
to decide whether A or B is better.
• Long-term goal for Bing: query share & revenue
• Puzzling outcome:
– A ranking bug in an experiment resulted in very poor
search results
– Queries went up +10% and revenue went up +30%
– What should a search engine use as its OEC?
• We use Sessions-Per-User.
3. REASON #1
The feature just wasn’t as good as you thought…
We are poor at assessing the value of ideas.
Jim Manzi: “Google ran approximately 12,000 randomized
experiments in 2009, with [only] about 10% of these
leading to business changes.”
5. Background
• Puzzling outcome:
– Several experiments showed surprising results
– Reran and effects disappeared
– Why?
• Bucket system (Bing/Google/Yahoo)
– Assign users into buckets, then assign buckets to
experiments (see the hashing sketch below).
– Buckets are reused from one experiment to the next.
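A minimal sketch of how such a bucket system can work, assuming a single hash layer; the bucket count, hash function, and experiment ranges below are illustrative, not Bing’s actual implementation:

```python
import hashlib

NUM_BUCKETS = 1000  # illustrative; real systems choose their own granularity

def user_bucket(user_id: str) -> int:
    """Deterministically hash a user id into one of NUM_BUCKETS buckets."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# Experiments claim disjoint bucket ranges. When an experiment ends, its
# buckets are handed to the next one -- the recycling that makes
# carryover effects possible.
EXPERIMENT_RANGES = {
    "exp1_treatment": range(0, 100),
    "exp1_control": range(100, 200),
}

def assignment(user_id: str) -> str | None:
    bucket = user_bucket(user_id)
    for name, buckets in EXPERIMENT_RANGES.items():
        if bucket in buckets:
            return name
    return None  # user is not in any experiment

print(assignment("user-42"))
```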
6. Carryover Effect
• Explanation:
– The bucket system recycles users; the prior experiment
had carryover effects
– Effects can last for months
• Solution:
– Run an A/A test before starting the experiment; a failed
test flags leftover carryover (see the sketch below)
– Local re-randomization
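A hedged sketch of the A/A check: both groups receive the identical experience, so a statistically significant difference in the metric suggests the buckets carry bias from a previous experiment and should be re-randomized. Data and thresholds are illustrative:

```python
import numpy as np
from scipy import stats

def aa_test_flags_bias(group_a: np.ndarray, group_a2: np.ndarray,
                       alpha: float = 0.05) -> bool:
    """Two-sample t-test between two identically treated groups.

    In an A/A test there is no real treatment difference, so a small
    p-value points at non-comparable buckets (e.g., carryover), not
    at a feature effect.
    """
    _, p_value = stats.ttest_ind(group_a, group_a2)
    return p_value < alpha  # True -> re-randomize before the real A/B test

rng = np.random.default_rng(0)
clean = rng.poisson(3.0, 20_000)     # sessions-per-user, unbiased bucket
tainted = rng.poisson(3.05, 20_000)  # bucket with lingering carryover lift
print("re-randomize:", aa_test_flags_bias(clean, tainted))
```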
8. Background
• Performance matters
– Bing: +100msec = -0.6% revenue
– Amazon: +100msec = -1% revenue
– Google: +100msec = -0.2% query
• But not for Etsy.com?
“faster results better? Meh”
Insensitive experimentation can lead to the wrong
conclusion that a feature has no impact.
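To make “insensitive experimentation” concrete, here is a back-of-the-envelope power calculation. The coefficient of variation for revenue-per-user is an assumption (revenue metrics are typically heavy-tailed), not a number from the talk:

```python
from statsmodels.stats.power import TTestIndPower

relative_delta = 0.006  # the 0.6% revenue effect we want to detect
cv = 3.0                # assumed std/mean of revenue-per-user (illustrative)

# Standardized effect size (Cohen's d) = delta * mean / std
effect_size = relative_delta / cv

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided")
print(f"users needed per group: {n_per_group:,.0f}")  # roughly 4 million
```

Under these assumptions a site without millions of users per arm simply cannot detect a 0.6% revenue move, and will report “no impact” for a feature that does matter.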
9. How to Achieve Better Sensitivity?
1. Get more users
2. Run longer experiments:
– We recruit users continuously.
– Longer experiment = more users = more power?
– Wrong! This doesn’t always get us more power
3. CUPED
Controlled Experiments Using Pre-Experiment Data
(Chart: the confidence interval for Sessions-per-User doesn’t
shrink over a month; see the simulation below.)
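A hedged simulation of why the chart looks flat: users enter the experiment continuously, so running longer adds users but also lets early users accumulate more sessions, inflating the per-user variance about as fast as the sample grows. The arrival and session rates are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
DAILY_NEW_USERS = 5_000  # illustrative arrival rate
SESSION_RATE = 0.4       # illustrative sessions per user per day

def ci_half_width(days: int) -> float:
    """95% CI half-width of mean Sessions-per-User after `days` days."""
    arrival_day = rng.integers(0, days, size=DAILY_NEW_USERS * days)
    exposure = days - arrival_day              # days each user was exposed
    sessions = rng.poisson(SESSION_RATE * exposure)
    return 1.96 * sessions.std(ddof=1) / np.sqrt(sessions.size)

for days in (7, 14, 28):
    print(f"{days:2d} days: CI half-width = {ci_half_width(days):.4f}")
```

Under these assumptions the half-width actually grows slightly with duration, matching the chart: more users, but no more power.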
10. CUPED
• Currently live in Bing’s experimentation system
• Allows for running experiments with
– Half the users, or
– Half the duration
• Leverages pre-experiment data to improve sensitivity (see the sketch below)
• Intuition: mixture model
– total variance = between-group variance + within-group variance
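A minimal sketch of the CUPED adjustment in its simplest form, using the pre-experiment value of the same metric as the covariate; the data below are simulated for illustration:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Remove the part of metric y that pre-experiment data x predicts.

    The adjusted metric keeps the same mean as y, but its variance
    shrinks by a factor of (1 - corr(x, y)**2).
    """
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(2)
propensity = rng.gamma(2.0, 1.5, size=50_000)  # per-user session propensity
x = rng.poisson(propensity)          # sessions in the pre-experiment period
y = rng.poisson(propensity * 1.02)   # sessions during the experiment (+2%)

print("variance before:", round(float(np.var(y, ddof=1)), 3))
print("variance after: ", round(float(np.var(cuped_adjust(y, x), ddof=1)), 3))
```

Because the variance drops while the treatment delta is untouched, the same power is reached with fewer users; at corr(x, y) ≈ 0.7 the variance is roughly halved, which is where the “half the users, or half the duration” claim comes from.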
11. • One top reason not discussed:
Instrumentation bugs
• For more insights, check out our papers
(KDD’2012 and WSDM’2013) or find me at the
networking session