SlideShare une entreprise Scribd logo
1  sur  35
Télécharger pour lire hors ligne
A/B Testing @ Internet Scale
Ya Xu
8/12/2014 @ Coursera
A/B Testing in One Slide
20%80%
Collect results to determine which one is better
Join now
Control Treatment
Outline
§ Culture Challenge
–  Why A/B testing
–  What to A/B test
§ Building a scalable experimentation system
§ Best practices
3
Why A/B Testing
Amazon Shopping Cart Recommendation
5
•  At Amazon, Greg Linden had this idea of showing
recommendations based on cart items
•  Trade-offs
•  Pro: cross-sell more items (increase average basket size)
•  Con: distract people from checking out (reduce conversion)
•  HiPPO (Highest Paid Person’s Opinion) : stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
MSN Real Estate
§ “Find a house” widget variations
§ Revenue to MSN generated every time a user
clicks search/find button
6
A B
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
Take-away
Experiments
are the only way to prove causality.
7
Use A/B testing to:
§ Guide product development
§ Measure impact (assess ROI)
§ Gain “real” customer feedback
What to A/B Test
8
Ads CTR Drop
9
Sudden drop
on 11/11/2013
Profile top ads
Root-Cause
10
5 Pixels!!
Navigation bar
Profile top ads
What to A/B Test
§ Evaluating new ideas:
–  Visual changes
–  Complete redesign of web page
–  Relevance algorithms
–  …
§ Platform changes
§ Code refactoring
§ Bug fixes
11
Test Everything!
Startups vs. Big Websites
§ Do startups have enough users to A/B test?
–  Startups typically look for larger effects
–  5% vs. 0.5% difference è 100 times more users!
§ Startups should establish A/B testing culture
early
12
A Scalable Experimentation
System
13
A/B Testing 3 Steps
14
Design
•  What/Whom to experiment on
Deploy
•  Code deployment
Analyze
•  Impact on metrics
A/B Testing Platform Architecture
1.  Experiment Management
2.  Online Infrastructure
3.  Offline Analysis
15
Example: Bing A/B
1. Experiment Management
§ Define experiments
–  Whom to target?
–  How to split traffic?
§ Start/stop an experiment
§ Important addition:
–  Define success criteria
–  Power analysis
16
2. Online Infrastructure
1)  Hash & partition: random & consistent
2)  Deploy: server-side, as a change to
–  The default configuration (Bing)
–  The default code path (LinkedIn)
3)  Data logging
17
0% 100%
Treatment1
D20% D20%
Hash (ID)
Treatment2 Control
Hash & Partition @ Scale (I)
§ Pure bucket system (Google/Bing before 200X)
18
0% 100%
Exp. 1
D20% D20%
Exp. 2 Exp. 3
60%
red green yellow
15% 15%30%
•  Does not scale
•  Traffic management
Hash & Partition @ Scale (II)
§ Fully overlapping system
0% 100%
D
Exp. 2
A2 B2 control
Exp.1
controlA1
D
B1
D
•  Each experiment gets 100% traffic
•  A user is in “all” experiments simultaneously
•  Randomization btw experiments are independent
(unique hashID)
•  Cannot avoid interaction
Hash & Partition @ Scale (III)
§ Hybrid: Layer + Domain
20
•  Centralized management (Bing)
•  Central exp. team creates/manages layers/domains
•  De-centralized management (LinkedIn)
•  Each experiment is one “layer” by default
•  Experimenter controls hashID to create a “domain”
Data Logging
§  Trigger
§  Trigger-based logging
–  Log whether a request is actually affected by the
experiment
–  Log for both factual & counter-factual
21
All LinkedIn members
300MM +
Triggered:
Members visiting
contacts page
3. Automated Offline Analysis
§  Large-scale data processing, e.g. daily @LinkedIn
–  200+ experiments
–  700+ metrics
–  Billions of experiment trigger events
§  Statistical analysis
–  Metrics design
–  Statistical significance test (p-value, confidence interval)
–  Deep-dive: slicing & dicing capability
§  Monitoring & alerting
–  Data quality
–  Early termination
22
Best Practices
23
Example: Unified Search
What to Experiment?
Measure one change at a time.
Unified Search Experiments 1+2+…N50%
En-US
Pre-unified search
50%
En-US
What to Measure?
§ Success metrics: summarize whether
treatment is better
§ Puzzling example:
–  Key metrics for Bing: number of searches &
revenue
–  Ranking bug in experiment resulted in poor search
results
–  Number of searches up +10% and revenue up
+30%
Success metrics should reflect long
term impact
Scientific Experiment Design
§ How long to run the experiment?
§ How much traffic to allocate to treatment?
Story:
§  Site speed matters
–  Bing: +100msec = -0.6% revenue
–  Amazon: +100msec = -1.0% revenue
–  Google: +100msec = -0.2% queries
§  But not for Etsy.com?
“Faster results better? … meh”
27
Power
§ Power: the chance of detecting a
difference when there really is one.
§ Two reasons your feature doesn’t move
metrics
1.  No “real” impact
2.  Not enough power
28
Properly power up your experiment!
Statistical Significance
§ Which experiment has a bigger impact?
29
Experiment 1 Experiment 2
Pageviews 1.5% 12.9%
Revenue 0.8% 2.4%
Statistical Significance
§ Which experiment has a bigger impact?
30
Experiment 1 Experiment 2
Pageviews 1.5% 12.9%
Revenue 0.8% Stat. significant 2.4%
Statistical Significance
31
§ Must consider statistical significance
–  A 12.9% delta can still be noise!
–  Identify signal from noise; focus on the “real” movers
–  Ensure results are reproducible
Experiment 1 Experiment 2
Pageviews 1.5% 12.9%
Revenue 0.8% Stat. significant 2.4%
Multiple Testing
§ Famous xkcd comic on Jelly Beans
32
Multiple Testing Concerns
§ Multiple ramps
–  Pre-decide a ramp to base decision on (e.g. 50/50)
§ Multiple “peeks”
–  Rely on “full”-week results
§ Multiple variants
–  Choose the best, then rerun to see if replicate
§ Multiple metrics
An irrelevant metric is statistically
significant. What to do?
§  Which metric?
§  How “significant”? (p-value)
34
34
All
metrics
2nd order
metrics
1st order
metrics
p-value < 0.05
p-value < 0.01
p-value < 0.001
Directly impacted by exp.
Maybe impacted by exp.
Watch out for multiple testing
With 100 metrics, how many would you see stat. significant
even if your experiment does NOTHING? 5
References
§  Tang, Diane, et al. Overlapping Experiment Infrastructure: More, Better,
Faster Experimentation. Proceedings 16th Conference on Knowledge
Discovery and Data Mining. 2010.
§  Kohavi, Ron, et al. Online Controlled Experiments at Large Scale. KDD
2013: Proceedings of the 19th ACM SIGKDD international conference on
Knowledge discovery and data mining. 2013.
§  LinkedIn blog post:
http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin
Additional Resources: RecSys’14 A/B testing workshop
35

Contenu connexe

Tendances

Experimentation Platform at Netflix
Experimentation Platform at NetflixExperimentation Platform at Netflix
Experimentation Platform at NetflixSteve Urban
 
Practical Introduction to A/B Testing
Practical Introduction to A/B TestingPractical Introduction to A/B Testing
Practical Introduction to A/B TestingAlex Alwan
 
SAMPLE SIZE – The indispensable A/B test calculation that you’re not making
SAMPLE SIZE – The indispensable A/B test calculation that you’re not makingSAMPLE SIZE – The indispensable A/B test calculation that you’re not making
SAMPLE SIZE – The indispensable A/B test calculation that you’re not makingZack Notes
 
10 Guidelines for A/B Testing
10 Guidelines for A/B Testing10 Guidelines for A/B Testing
10 Guidelines for A/B TestingEmily Robinson
 
Controlled Experimentation aka A/B Testing for PMs by Tinder Sr PM
Controlled Experimentation aka A/B Testing for PMs by Tinder Sr PMControlled Experimentation aka A/B Testing for PMs by Tinder Sr PM
Controlled Experimentation aka A/B Testing for PMs by Tinder Sr PMProduct School
 
A/B testing at Spotify
A/B testing at SpotifyA/B testing at Spotify
A/B testing at SpotifyAli Sarrafi
 
SXSW 2016 - Everything you think about A/B testing is wrong
SXSW 2016 - Everything you think about A/B testing is wrongSXSW 2016 - Everything you think about A/B testing is wrong
SXSW 2016 - Everything you think about A/B testing is wrongDan Chuparkoff
 
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.jsNetflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.jsChris Saint-Amant
 
Lean Analytics Cycle
Lean Analytics CycleLean Analytics Cycle
Lean Analytics CycleHiten Shah
 
A/B Testing Pitfalls and Lessons Learned at Spotify
A/B Testing Pitfalls and Lessons Learned at SpotifyA/B Testing Pitfalls and Lessons Learned at Spotify
A/B Testing Pitfalls and Lessons Learned at SpotifyDanielle Jabin
 
A/B Testing for Game Design
A/B Testing for Game DesignA/B Testing for Game Design
A/B Testing for Game DesignTrieu Nguyen
 
Numbers not Napkins: Simple Startup Metrics
Numbers not Napkins: Simple Startup MetricsNumbers not Napkins: Simple Startup Metrics
Numbers not Napkins: Simple Startup MetricsDave McClure
 
Finding the one growth metric that matters
Finding the one growth metric that mattersFinding the one growth metric that matters
Finding the one growth metric that mattersSean Ellis
 
Optimizely Product Vision: The Future of Experimentation
Optimizely Product Vision: The Future of ExperimentationOptimizely Product Vision: The Future of Experimentation
Optimizely Product Vision: The Future of ExperimentationOptimizely
 
Exploratory Testing Explained and Experienced
Exploratory Testing Explained and ExperiencedExploratory Testing Explained and Experienced
Exploratory Testing Explained and ExperiencedMaaret Pyhäjärvi
 
The SaaS Founder’s Journey: What Matters at Each Stage
The SaaS Founder’s Journey: What Matters at Each StageThe SaaS Founder’s Journey: What Matters at Each Stage
The SaaS Founder’s Journey: What Matters at Each StageDavid Skok
 
Product Discovery At Google
Product Discovery At GoogleProduct Discovery At Google
Product Discovery At GoogleJohn Gibbon
 
A/B testing in Firebase. Intermediate and advanced approach
A/B testing in Firebase. Intermediate and advanced approachA/B testing in Firebase. Intermediate and advanced approach
A/B testing in Firebase. Intermediate and advanced approachGameCamp
 

Tendances (20)

Experimentation Platform at Netflix
Experimentation Platform at NetflixExperimentation Platform at Netflix
Experimentation Platform at Netflix
 
Practical Introduction to A/B Testing
Practical Introduction to A/B TestingPractical Introduction to A/B Testing
Practical Introduction to A/B Testing
 
The Power of A/B Testing
The Power of A/B TestingThe Power of A/B Testing
The Power of A/B Testing
 
Ab testing
Ab testingAb testing
Ab testing
 
SAMPLE SIZE – The indispensable A/B test calculation that you’re not making
SAMPLE SIZE – The indispensable A/B test calculation that you’re not makingSAMPLE SIZE – The indispensable A/B test calculation that you’re not making
SAMPLE SIZE – The indispensable A/B test calculation that you’re not making
 
10 Guidelines for A/B Testing
10 Guidelines for A/B Testing10 Guidelines for A/B Testing
10 Guidelines for A/B Testing
 
Controlled Experimentation aka A/B Testing for PMs by Tinder Sr PM
Controlled Experimentation aka A/B Testing for PMs by Tinder Sr PMControlled Experimentation aka A/B Testing for PMs by Tinder Sr PM
Controlled Experimentation aka A/B Testing for PMs by Tinder Sr PM
 
A/B testing at Spotify
A/B testing at SpotifyA/B testing at Spotify
A/B testing at Spotify
 
SXSW 2016 - Everything you think about A/B testing is wrong
SXSW 2016 - Everything you think about A/B testing is wrongSXSW 2016 - Everything you think about A/B testing is wrong
SXSW 2016 - Everything you think about A/B testing is wrong
 
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.jsNetflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
 
Lean Analytics Cycle
Lean Analytics CycleLean Analytics Cycle
Lean Analytics Cycle
 
A/B Testing Pitfalls and Lessons Learned at Spotify
A/B Testing Pitfalls and Lessons Learned at SpotifyA/B Testing Pitfalls and Lessons Learned at Spotify
A/B Testing Pitfalls and Lessons Learned at Spotify
 
A/B Testing for Game Design
A/B Testing for Game DesignA/B Testing for Game Design
A/B Testing for Game Design
 
Numbers not Napkins: Simple Startup Metrics
Numbers not Napkins: Simple Startup MetricsNumbers not Napkins: Simple Startup Metrics
Numbers not Napkins: Simple Startup Metrics
 
Finding the one growth metric that matters
Finding the one growth metric that mattersFinding the one growth metric that matters
Finding the one growth metric that matters
 
Optimizely Product Vision: The Future of Experimentation
Optimizely Product Vision: The Future of ExperimentationOptimizely Product Vision: The Future of Experimentation
Optimizely Product Vision: The Future of Experimentation
 
Exploratory Testing Explained and Experienced
Exploratory Testing Explained and ExperiencedExploratory Testing Explained and Experienced
Exploratory Testing Explained and Experienced
 
The SaaS Founder’s Journey: What Matters at Each Stage
The SaaS Founder’s Journey: What Matters at Each StageThe SaaS Founder’s Journey: What Matters at Each Stage
The SaaS Founder’s Journey: What Matters at Each Stage
 
Product Discovery At Google
Product Discovery At GoogleProduct Discovery At Google
Product Discovery At Google
 
A/B testing in Firebase. Intermediate and advanced approach
A/B testing in Firebase. Intermediate and advanced approachA/B testing in Firebase. Intermediate and advanced approach
A/B testing in Firebase. Intermediate and advanced approach
 

Similaire à Talks@Coursera - A/B Testing @ Internet Scale

DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...Hakka Labs
 
From Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender systemFrom Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender systemPierre Gutierrez
 
Surviving the AB Testing Hype Cycle - Reaktor Breakpoint 2015
Surviving the AB Testing Hype Cycle - Reaktor Breakpoint 2015Surviving the AB Testing Hype Cycle - Reaktor Breakpoint 2015
Surviving the AB Testing Hype Cycle - Reaktor Breakpoint 2015Craig Sullivan
 
Agile 2014 Software Moneyball (Troy Magennis)
Agile 2014   Software Moneyball (Troy Magennis)Agile 2014   Software Moneyball (Troy Magennis)
Agile 2014 Software Moneyball (Troy Magennis)Troy Magennis
 
Making Strategic Decisions by fmr Capital One Dir. Digital PM
Making Strategic Decisions by fmr Capital One Dir. Digital PMMaking Strategic Decisions by fmr Capital One Dir. Digital PM
Making Strategic Decisions by fmr Capital One Dir. Digital PMProduct School
 
Optimizely Partner Ecosystem
Optimizely Partner EcosystemOptimizely Partner Ecosystem
Optimizely Partner EcosystemOptimizely
 
Drippler's A/B test library
Drippler's A/B test libraryDrippler's A/B test library
Drippler's A/B test libraryNir Hartmann
 
Digital analytics: Optimization (Lecture 10)
Digital analytics: Optimization (Lecture 10)Digital analytics: Optimization (Lecture 10)
Digital analytics: Optimization (Lecture 10)Joni Salminen
 
Ria Sankar on Building AI Products
Ria Sankar on Building AI ProductsRia Sankar on Building AI Products
Ria Sankar on Building AI ProductsRia Sankar
 
Test Case Design
Test Case DesignTest Case Design
Test Case Designacatalin
 
7 Step Data Cleanse: Salesforce Hygiene
7 Step Data Cleanse: Salesforce Hygiene7 Step Data Cleanse: Salesforce Hygiene
7 Step Data Cleanse: Salesforce HygieneCloudFixer
 
Data-Driven Marketing
Data-Driven MarketingData-Driven Marketing
Data-Driven MarketingPerformable
 
Surviving the hype cycle Shortcuts to split testing success
Surviving the hype cycle   Shortcuts to split testing successSurviving the hype cycle   Shortcuts to split testing success
Surviving the hype cycle Shortcuts to split testing successCraig Sullivan
 
Advanced Google Analytics #SearchFest
Advanced Google Analytics #SearchFestAdvanced Google Analytics #SearchFest
Advanced Google Analytics #SearchFestMike P.
 
Tips & Tricks for Getting Things Done Using Analytics Data
Tips & Tricks for Getting Things Done Using Analytics DataTips & Tricks for Getting Things Done Using Analytics Data
Tips & Tricks for Getting Things Done Using Analytics DataCharles Meaden
 
Designing speed with progressive enhancement
Designing speed with progressive enhancementDesigning speed with progressive enhancement
Designing speed with progressive enhancementSergeyChernyshev
 
CRO analytics - How to Continually Optimise
CRO analytics - How to Continually OptimiseCRO analytics - How to Continually Optimise
CRO analytics - How to Continually OptimisePhil Pearce
 
Google Analytics Powerups and Smartcuts
Google Analytics Powerups and Smartcuts Google Analytics Powerups and Smartcuts
Google Analytics Powerups and Smartcuts Charles Meaden
 

Similaire à Talks@Coursera - A/B Testing @ Internet Scale (20)

DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
 
From Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender systemFrom Labelling Open data images to building a private recommender system
From Labelling Open data images to building a private recommender system
 
Surviving the AB Testing Hype Cycle - Reaktor Breakpoint 2015
Surviving the AB Testing Hype Cycle - Reaktor Breakpoint 2015Surviving the AB Testing Hype Cycle - Reaktor Breakpoint 2015
Surviving the AB Testing Hype Cycle - Reaktor Breakpoint 2015
 
Agile 2014 Software Moneyball (Troy Magennis)
Agile 2014   Software Moneyball (Troy Magennis)Agile 2014   Software Moneyball (Troy Magennis)
Agile 2014 Software Moneyball (Troy Magennis)
 
Making Strategic Decisions by fmr Capital One Dir. Digital PM
Making Strategic Decisions by fmr Capital One Dir. Digital PMMaking Strategic Decisions by fmr Capital One Dir. Digital PM
Making Strategic Decisions by fmr Capital One Dir. Digital PM
 
It Worked for Ustream
It Worked for UstreamIt Worked for Ustream
It Worked for Ustream
 
Optimizely Partner Ecosystem
Optimizely Partner EcosystemOptimizely Partner Ecosystem
Optimizely Partner Ecosystem
 
Drippler's A/B test library
Drippler's A/B test libraryDrippler's A/B test library
Drippler's A/B test library
 
Digital analytics: Optimization (Lecture 10)
Digital analytics: Optimization (Lecture 10)Digital analytics: Optimization (Lecture 10)
Digital analytics: Optimization (Lecture 10)
 
Ria Sankar on Building AI Products
Ria Sankar on Building AI ProductsRia Sankar on Building AI Products
Ria Sankar on Building AI Products
 
Test Case Design
Test Case DesignTest Case Design
Test Case Design
 
7 Step Data Cleanse: Salesforce Hygiene
7 Step Data Cleanse: Salesforce Hygiene7 Step Data Cleanse: Salesforce Hygiene
7 Step Data Cleanse: Salesforce Hygiene
 
Data-Driven Marketing
Data-Driven MarketingData-Driven Marketing
Data-Driven Marketing
 
Petri for kyiv.pptx
Petri for kyiv.pptxPetri for kyiv.pptx
Petri for kyiv.pptx
 
Surviving the hype cycle Shortcuts to split testing success
Surviving the hype cycle   Shortcuts to split testing successSurviving the hype cycle   Shortcuts to split testing success
Surviving the hype cycle Shortcuts to split testing success
 
Advanced Google Analytics #SearchFest
Advanced Google Analytics #SearchFestAdvanced Google Analytics #SearchFest
Advanced Google Analytics #SearchFest
 
Tips & Tricks for Getting Things Done Using Analytics Data
Tips & Tricks for Getting Things Done Using Analytics DataTips & Tricks for Getting Things Done Using Analytics Data
Tips & Tricks for Getting Things Done Using Analytics Data
 
Designing speed with progressive enhancement
Designing speed with progressive enhancementDesigning speed with progressive enhancement
Designing speed with progressive enhancement
 
CRO analytics - How to Continually Optimise
CRO analytics - How to Continually OptimiseCRO analytics - How to Continually Optimise
CRO analytics - How to Continually Optimise
 
Google Analytics Powerups and Smartcuts
Google Analytics Powerups and Smartcuts Google Analytics Powerups and Smartcuts
Google Analytics Powerups and Smartcuts
 

Dernier

Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
Comparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization TechniquesComparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization Techniquesugginaramesh
 

Dernier (20)

Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Comparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization TechniquesComparative Analysis of Text Summarization Techniques
Comparative Analysis of Text Summarization Techniques
 

Talks@Coursera - A/B Testing @ Internet Scale

  • 1. A/B Testing @ Internet Scale Ya Xu 8/12/2014 @ Coursera
  • 2. A/B Testing in One Slide 20%80% Collect results to determine which one is better Join now Control Treatment
  • 3. Outline § Culture Challenge –  Why A/B testing –  What to A/B test § Building a scalable experimentation system § Best practices 3
  • 5. Amazon Shopping Cart Recommendation 5 •  At Amazon, Greg Linden had this idea of showing recommendations based on cart items •  Trade-offs •  Pro: cross-sell more items (increase average basket size) •  Con: distract people from checking out (reduce conversion) •  HiPPO (Highest Paid Person’s Opinion) : stop the project From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
  • 6. MSN Real Estate § “Find a house” widget variations § Revenue to MSN generated every time a user clicks search/find button 6 A B http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
  • 7. Take-away Experiments are the only way to prove causality. 7 Use A/B testing to: § Guide product development § Measure impact (assess ROI) § Gain “real” customer feedback
  • 8. What to A/B Test 8
  • 9. Ads CTR Drop 9 Sudden drop on 11/11/2013 Profile top ads
  • 11. What to A/B Test § Evaluating new ideas: –  Visual changes –  Complete redesign of web page –  Relevance algorithms –  … § Platform changes § Code refactoring § Bug fixes 11 Test Everything!
  • 12. Startups vs. Big Websites § Do startups have enough users to A/B test? –  Startups typically look for larger effects –  5% vs. 0.5% difference è 100 times more users! § Startups should establish A/B testing culture early 12
  • 14. A/B Testing 3 Steps 14 Design •  What/Whom to experiment on Deploy •  Code deployment Analyze •  Impact on metrics
  • 15. A/B Testing Platform Architecture 1.  Experiment Management 2.  Online Infrastructure 3.  Offline Analysis 15 Example: Bing A/B
  • 16. 1. Experiment Management § Define experiments –  Whom to target? –  How to split traffic? § Start/stop an experiment § Important addition: –  Define success criteria –  Power analysis 16
  • 17. 2. Online Infrastructure 1)  Hash & partition: random & consistent 2)  Deploy: server-side, as a change to –  The default configuration (Bing) –  The default code path (LinkedIn) 3)  Data logging 17 0% 100% Treatment1 D20% D20% Hash (ID) Treatment2 Control
  • 18. Hash & Partition @ Scale (I) § Pure bucket system (Google/Bing before 200X) 18 0% 100% Exp. 1 D20% D20% Exp. 2 Exp. 3 60% red green yellow 15% 15%30% •  Does not scale •  Traffic management
  • 19. Hash & Partition @ Scale (II) § Fully overlapping system 0% 100% D Exp. 2 A2 B2 control Exp.1 controlA1 D B1 D •  Each experiment gets 100% traffic •  A user is in “all” experiments simultaneously •  Randomization btw experiments are independent (unique hashID) •  Cannot avoid interaction
  • 20. Hash & Partition @ Scale (III) § Hybrid: Layer + Domain 20 •  Centralized management (Bing) •  Central exp. team creates/manages layers/domains •  De-centralized management (LinkedIn) •  Each experiment is one “layer” by default •  Experimenter controls hashID to create a “domain”
  • 21. Data Logging §  Trigger §  Trigger-based logging –  Log whether a request is actually affected by the experiment –  Log for both factual & counter-factual 21 All LinkedIn members 300MM + Triggered: Members visiting contacts page
  • 22. 3. Automated Offline Analysis §  Large-scale data processing, e.g. daily @LinkedIn –  200+ experiments –  700+ metrics –  Billions of experiment trigger events §  Statistical analysis –  Metrics design –  Statistical significance test (p-value, confidence interval) –  Deep-dive: slicing & dicing capability §  Monitoring & alerting –  Data quality –  Early termination 22
  • 25. What to Experiment? Measure one change at a time. Unified Search Experiments 1+2+…N50% En-US Pre-unified search 50% En-US
  • 26. What to Measure? § Success metrics: summarize whether treatment is better § Puzzling example: –  Key metrics for Bing: number of searches & revenue –  Ranking bug in experiment resulted in poor search results –  Number of searches up +10% and revenue up +30% Success metrics should reflect long term impact
  • 27. Scientific Experiment Design § How long to run the experiment? § How much traffic to allocate to treatment? Story: §  Site speed matters –  Bing: +100msec = -0.6% revenue –  Amazon: +100msec = -1.0% revenue –  Google: +100msec = -0.2% queries §  But not for Etsy.com? “Faster results better? … meh” 27
  • 28. Power § Power: the chance of detecting a difference when there really is one. § Two reasons your feature doesn’t move metrics 1.  No “real” impact 2.  Not enough power 28 Properly power up your experiment!
  • 29. Statistical Significance § Which experiment has a bigger impact? 29 Experiment 1 Experiment 2 Pageviews 1.5% 12.9% Revenue 0.8% 2.4%
  • 30. Statistical Significance § Which experiment has a bigger impact? 30 Experiment 1 Experiment 2 Pageviews 1.5% 12.9% Revenue 0.8% Stat. significant 2.4%
  • 31. Statistical Significance 31 § Must consider statistical significance –  A 12.9% delta can still be noise! –  Identify signal from noise; focus on the “real” movers –  Ensure results are reproducible Experiment 1 Experiment 2 Pageviews 1.5% 12.9% Revenue 0.8% Stat. significant 2.4%
  • 32. Multiple Testing § Famous xkcd comic on Jelly Beans 32
  • 33. Multiple Testing Concerns § Multiple ramps –  Pre-decide a ramp to base decision on (e.g. 50/50) § Multiple “peeks” –  Rely on “full”-week results § Multiple variants –  Choose the best, then rerun to see if replicate § Multiple metrics
  • 34. An irrelevant metric is statistically significant. What to do? §  Which metric? §  How “significant”? (p-value) 34 34 All metrics 2nd order metrics 1st order metrics p-value < 0.05 p-value < 0.01 p-value < 0.001 Directly impacted by exp. Maybe impacted by exp. Watch out for multiple testing With 100 metrics, how many would you see stat. significant even if your experiment does NOTHING? 5
  • 35. References §  Tang, Diane, et al. Overlapping Experiment Infrastructure: More, Better, Faster Experimentation. Proceedings 16th Conference on Knowledge Discovery and Data Mining. 2010. §  Kohavi, Ron, et al. Online Controlled Experiments at Large Scale. KDD 2013: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 2013. §  LinkedIn blog post: http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin Additional Resources: RecSys’14 A/B testing workshop 35