This document discusses A/B testing at large internet companies. It describes how companies like Amazon, Microsoft, Google, and LinkedIn use A/B testing to evaluate new ideas, measure their impact, and gather real customer feedback. It outlines best practices for A/B testing, such as measuring one change at a time, choosing appropriate success metrics, checking statistical significance, properly powering experiments, and addressing issues like multiple testing. The document also describes the key components of a scalable A/B testing system, including experiment management, online infrastructure for traffic routing and data logging, and automated offline analysis.
5. Amazon Shopping Cart Recommendation
• At Amazon, Greg Linden had the idea of showing recommendations based on cart items
• Trade-offs
  • Pro: cross-sell more items (increase average basket size)
  • Con: distract people from checking out (reduce conversion)
• HiPPO (Highest Paid Person's Opinion): stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
6. MSN Real Estate
§ “Find a house” widget variations
§ Revenue is generated for MSN every time a user clicks the search/find button
[Screenshots: widget variants A and B]
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
7. Take-away
Experiments are the only way to prove causality.
Use A/B testing to:
§ Guide product development
§ Measure impact (assess ROI)
§ Gain “real” customer feedback
11. What to A/B Test
§ Evaluating new ideas:
– Visual changes
– Complete redesign of web page
– Relevance algorithms
– …
§ Platform changes
§ Code refactoring
§ Bug fixes
Test Everything!
12. Startups vs. Big Websites
§ Do startups have enough users to A/B test?
– Startups typically look for larger effects
– Detecting a 0.5% difference instead of a 5% difference takes ~100 times more users (required sample size scales as 1/δ²)
§ Startups should establish an A/B testing culture early
16. 1. Experiment Management
§ Define experiments
– Whom to target?
– How to split traffic?
§ Start/stop an experiment
§ Important addition:
– Define success criteria
– Power analysis
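A hypothetical sketch of what such an experiment definition might contain; the field names and values below are illustrative, not the schema of any actual platform:

```python
# Hypothetical experiment definition; illustrative field names only.
experiment = {
    "name": "contacts-page-new-layout",
    "targeting": {"country": "US", "locale": "en_US"},     # whom to target
    "traffic_split": {"control": 0.5, "treatment": 0.5},   # how to split traffic
    "state": "running",                                    # start/stop
    # Important additions: success criteria and power analysis
    "success_metrics": ["pageviews_per_member", "weekly_unique_visitors"],
    "minimum_detectable_effect": 0.01,   # 1% relative lift we care about
    "required_sample_size": 2_000_000,   # per variant, from a power analysis at 80% power
}
```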
17. 2. Online Infrastructure
1) Hash & partition: random across users, consistent for the same user
2) Deploy server-side, as a change to
– The default configuration (Bing)
– The default code path (LinkedIn)
3) Data logging
[Diagram: Hash(ID) maps each user to a point on the 0–100% traffic range, e.g. 20% to Treatment1, 20% to Treatment2, and the rest to Control]
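A minimal sketch of the hash & partition step, assuming member IDs are hashed to a point in [0, 100) and mapped to variants by range; the hash function and the 20/20/60 split are illustrative:

```python
import hashlib

def assign_variant(member_id: str) -> str:
    """Hash the member ID to a point in [0, 100) and map it to a variant.

    Hashing (rather than random sampling) makes the assignment both random
    across members and consistent for the same member on every request.
    """
    digest = hashlib.md5(member_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10000 / 100.0   # point in [0, 100)
    if bucket < 20:
        return "treatment1"    # 20% slice
    elif bucket < 40:
        return "treatment2"    # 20% slice
    return "control"           # remaining 60%

print(assign_variant("member-12345"))  # the same member always gets the same variant
```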
18. Hash & Partition @ Scale (I)
§ Pure bucket system (Google/Bing before 200X)
[Diagram: the 0–100% traffic range is carved into disjoint slices, e.g. Exp. 1 (20%), Exp. 2 (20%), Exp. 3 (60%); within a slice, the experiment's variants (red/green/yellow) split it further, e.g. 15%/15%/30%]
• Does not scale: experiments compete for disjoint slices of traffic
• Requires manual traffic management
19. Hash & Partition @ Scale (II)
§ Fully overlapping system
[Diagram: Exp. 1 and Exp. 2 each span the full 0–100% range with their own independent splits (A1/B1/control and A2/B2/control), overlapping each other]
• Each experiment gets 100% of traffic
• A user is in "all" experiments simultaneously
• Randomization between experiments is independent (each experiment hashes with its own unique hashID; see the sketch below)
• Cannot avoid interactions between experiments
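A sketch of that independent randomization, assuming each experiment salts the hash with its own unique hashID (the names and salts below are illustrative):

```python
import hashlib

def bucket(member_id: str, experiment_hash_id: str, num_buckets: int = 100) -> int:
    """Hash (experiment-specific salt, member ID) into one of num_buckets buckets.

    Because every experiment salts with a unique hashID, a member's bucket in one
    experiment is statistically independent of their bucket in any other, so each
    experiment effectively randomizes over 100% of traffic.
    """
    key = f"{experiment_hash_id}:{member_id}".encode("utf-8")
    return int(hashlib.sha1(key).hexdigest(), 16) % num_buckets

member = "member-12345"
print(bucket(member, "exp1-salt"))  # assignment bucket for Exp. 1
print(bucket(member, "exp2-salt"))  # independent bucket for Exp. 2
```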
20. Hash & Partition @ Scale (III)
§ Hybrid: Layer + Domain
• Centralized management (Bing)
  • A central experimentation team creates/manages layers & domains
• De-centralized management (LinkedIn)
  • Each experiment is one "layer" by default
  • The experimenter controls the hashID to create a "domain"
21. Data Logging
§ Trigger-based logging
– Log whether a request is actually affected by the experiment
– Log for both factual & counterfactual (what the member saw, and what the other variant would have shown)
[Diagram: of all 300MM+ LinkedIn members, only the triggered subset (members visiting the contacts page) enters the analysis]
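A sketch of what a trigger event might record, under the simplifying assumption that the event captures both the factual variant served and the counterfactual variant name; the function and field names are hypothetical, not the actual LinkedIn logging schema:

```python
import json
import time

def log_trigger(member_id: str, variant: str, emit=print):
    """Emit a trigger event only for requests actually affected by the experiment.

    Recording both the factual variant (what the member saw) and the counterfactual
    (the arm they would have been in otherwise) lets the analysis compare exactly
    the same triggered population across treatment and control.
    """
    counterfactual = "control" if variant != "control" else "treatment"
    emit(json.dumps({
        "event": "experiment_trigger",
        "member_id": member_id,
        "ts": time.time(),
        "factual_variant": variant,
        "counterfactual_variant": counterfactual,
    }))

# Called only on the contacts page: members who never visit it are never triggered.
log_trigger("member-12345", "treatment")
```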
22. 3. Automated Offline Analysis
§ Large-scale data processing, e.g. daily @LinkedIn
– 200+ experiments
– 700+ metrics
– Billions of experiment trigger events
§ Statistical analysis
– Metrics design
– Statistical significance test (p-value, confidence interval)
– Deep-dive: slicing & dicing capability
§ Monitoring & alerting
– Data quality
– Early termination
25. What to Experiment?
Measure one change at a time.
[Diagram: En-US traffic split 50/50 between "Unified Search" (experiments 1+2+…+N bundled together) and "Pre-unified search"]
26. What to Measure?
§ Success metrics: summarize whether treatment is better
§ Puzzling example:
– Key metrics for Bing: number of searches & revenue
– A ranking bug in an experiment resulted in poor search results
– Yet the number of searches went up +10% and revenue up +30%
– (Users had to issue more queries, and clicked more ads, to find what they needed)
Success metrics should reflect long-term impact
27. Scientific Experiment Design
§ How long to run the experiment?
§ How much traffic to allocate to treatment?
Story:
§ Site speed matters
– Bing: +100msec = -0.6% revenue
– Amazon: +100msec = -1.0% revenue
– Google: +100msec = -0.2% queries
§ But not for Etsy.com?
– "Faster results better? … meh"
– A null result like this may also just mean the experiment wasn't powered to detect a small effect (next slide)
28. Power
§ Power: the chance of detecting a difference when there really is one
§ Two reasons your feature doesn't move metrics:
1. No "real" impact
2. Not enough power
Properly power up your experiment!
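A sketch of a pre-experiment sizing calculation for a proportion metric, using the standard two-sample formula at 95% confidence and 80% power (the baseline rate and lifts are made up). It also shows why detecting a 0.5% lift takes roughly 100 times the users needed for a 5% lift:

```python
from statistics import NormalDist

def users_per_variant(baseline_rate: float, relative_lift: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per variant for a two-sample proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    delta = p2 - p1
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(((z_alpha + z_beta) ** 2) * variance / delta ** 2)

# Made-up 10% baseline conversion rate:
print(users_per_variant(0.10, 0.05))    # 5% lift: tens of thousands of users per variant
print(users_per_variant(0.10, 0.005))   # 0.5% lift: roughly 100x more users
```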
31. Statistical Significance
§ Must consider statistical significance
– A 12.9% delta can still be noise!
– Identify signal from noise; focus on the “real” movers
– Ensure results are reproducible
              Experiment 1                Experiment 2
Pageviews     +1.5%                       +12.9%
Revenue       +0.8% (stat. significant)   +2.4%
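A sketch of the per-metric significance test, here a two-sample z-test on a proportion metric with a 95% confidence interval on the delta (the counts are made up):

```python
from math import sqrt
from statistics import NormalDist

def proportion_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sample z-test for a difference in proportions, plus a 95% CI on the delta."""
    p_a, p_b = success_a / n_a, success_b / n_b
    delta = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = delta / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    ci = (delta - 1.96 * se, delta + 1.96 * se)
    return delta, p_value, ci

# Made-up counts: control vs. treatment click-through.
delta, p_value, ci = proportion_test(10_000, 100_000, 10_350, 100_000)
print(f"delta={delta:+.4f}, p-value={p_value:.4f}, 95% CI=({ci[0]:+.4f}, {ci[1]:+.4f})")
# A large observed delta with a wide confidence interval (small sample) can still be noise.
```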
33. Multiple Testing Concerns
§ Multiple ramps
– Pre-decide the ramp to base the decision on (e.g. 50/50)
§ Multiple "peeks"
– Rely on full-week results rather than peeking early
§ Multiple variants
– Choose the best, then rerun to see if it replicates
§ Multiple metrics
34. An irrelevant metric is statistically significant. What to do?
§ Which metric?
§ How “significant”? (p-value)
[Diagram: metrics are tiered into 1st order metrics (directly impacted by the exp.), 2nd order metrics (maybe impacted), and all other metrics, with a different p-value threshold (0.05, 0.01, 0.001) applied per tier]
Watch out for multiple testing: with 100 metrics, how many would you expect to be stat. significant even if your experiment does NOTHING? About 5 (at p < 0.05).
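A sketch of the multiple-testing arithmetic behind that answer, plus a simple Bonferroni-style correction (one common remedy; tiered p-value thresholds, as in the diagram above, are another):

```python
# With 100 independent metrics and a per-metric threshold of p < 0.05,
# the expected number of false positives under a do-nothing experiment is:
num_metrics = 100
alpha = 0.05
print(num_metrics * alpha)               # -> 5.0 "significant" metrics by chance

# Probability of at least one false positive somewhere among the 100 metrics:
print(1 - (1 - alpha) ** num_metrics)    # -> ~0.994

# Bonferroni correction: divide alpha by the number of metrics tested,
# keeping the family-wise false-positive rate near the original alpha.
bonferroni_alpha = alpha / num_metrics   # 0.0005 per metric
print(bonferroni_alpha)
```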
35. References
§ Tang, Diane, et al. "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation." Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2010.
§ Kohavi, Ron, et al. "Online Controlled Experiments at Large Scale." Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2013). 2013.
§ LinkedIn blog post: http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin
Additional resources: RecSys'14 A/B testing workshop