SlideShare une entreprise Scribd logo
1  sur  50
A/B Testing: Avoiding
Common Pitfalls
Danielle Jabin

March 5, 2014
1

Make all the world’s music
available instantly to everyone,
wherever and whenever they
want it
2

As of March 5, 2014
3

Over 24 million active users

As of March 5, 2014
4

Access to more than 20 million
songs

As of March 5, 2014
5
6

But can we make it even
easier?
7

We can try…
…with A/B testing!
8

So…what’s an A/B test?
9

Control

A
Pitfall #1: Not limiting
your error rate
11

Source: assets.20bits.com/20081027/normal-curve-small.png
12

What if I flip a coin 100 times and get 51 heads?
13

What if I flip a coin 100 times and get 5 heads?
14
15

The likelihood of obtaining a
certain value under a given
distribution is measured by its
p-value
16

If there is a low likelihood that a
change is due to chance alone,
we call our results statistically
significant
17

What if I flip a coin 100 times and get 5 heads?
18

Statistical significance is measured by alpha
● alpha levels of 5% and 1% are most commonly used
– Alternatively: P(significant) = .05 or .01
19

Each alpha has a corresponding Z-score
alpha

Z-score (two-sided test)

.10

1.65

.05

1.96

.01

2.58
20

The Z-score tells us how far a
particular value is from the
mean (and what the corresponding
likelihood is)
21

Source: assets.20bits.com/20081027/normal-curve-small.png
22

Compute the Z-score at the end of the test
23

Standard deviation (σ) tells us
how spread out the numbers
are
24
25

To lock in error rates before you
start, fix your sample size
26

What should my sample size be?
● To lock in error rates before you start a test, fix your sample size
Represents the
desired power
(typically .84 for 80%
power).

Sample size in each
group (assumes equal
sized groups)

2s (Z b + Za /2 )
n=
2
difference Represents the desired
2

Standard deviation of
the outcome variable

Source: www.stanford.edu/~kcobb/hrp259/lecture11.ppt

2

Effect Size (the
difference in
means)

level of statistical
significance (typically
1.96).
27

Recap: running an A/B test
● Compute your sample size
– Using alpha, beta, standard deviation of your metric, and effect size
● Run your test! But stop once you’ve reached the fixed sample size stopping point
● Compute your z-score and compare it with the z-score for the chosen alpha level
28

Control

A
29

Resulting Z-score?
30

33.3
Pitfall #2: Stopping your
test before the fixed
sample size stopping
point
32

Sample size for varying alpha levels
● With σ = 10, difference in means = 1

Two-sided test
alpha = .10, beta = .80

1230

alpha = .05, beta = .80

1568

alpha = .01, beta = .80

2339
33

Let’s see some numbers
● 1,000 experiments with 200,000 fake participants divided randomly into two groups both
receiving the exact same version, A, with a 3% conversion rate

90% significance
reached
95% significance
reached
99% significance
reached
Source: destack.home.xs4all.nl/projects/significance/

Stop at first point of
significance
654 of 1,000

Ended as significant

427 of 1,000

49 of 1,000

146 of 1,000

14 of 1,000

100 of 1,000
34

Remedies
● Don’t peek
● Okay, maybe you can peek, but don’t stop or make a decision before you reach the fixed
sample size stopping point
● Sequential sampling
35

Control

A

B
Pitfall #3: Making
multiple comparisons in
one test
37

A test can be one of two things: significant or not significant
● P(significant) + P(not significant) = 1
● Let’s take an alpha of .05
– P(significant) = .05
– P(not significant) = 1 – P(significant) = 1 - .05 = .95
38

What about for two comparisons?
● P(at least 1 significant) = 1 - P(none of the 2 are significant)
● P(none of the 2 are significant) = P(not significant)*P(not significant) = .95*.95 = .9025
● P(at least 1 significant) = 1 - .9025 = .0975
39

What about for two comparisons?

●That’s almost 2x (1.95x, to be precise) your .05
significance rate!
40

And it just gets worse…
P(at least 1 signifcant)

An increase of…

5 variations

1 – (1-.05)^5 = .23

4.6x

10 variations

1 – (1-.05)^10 = .40

8x

20 variations

1 – (1-.05)^20 = .64

12.8x
41

How can we remedy this?
●Bonferroni correction
– Divide P(significant), your alpha, by the number of variations you are testing, n
– alpha/n becomes the new level of statistical significance
42

So what about two comparisons now?
● Our new P(significant) = .05/2 = .025
● Our new P(not significant) = 1 - .025 = .975
● P(at least 1 significant) = 1 - P(none of the 2 are significant)
● P(none of the 2 are significant) = P(not significant)*P(not significant) = .975*.975 = .951
● P(at least 1 significant) = 1 - .951 = .0499
43

P(significant) stays under .05 
Corrected alpha

P(at least 1 signifcant)

5 variations

.05/5 = .01

1 – (1-.01)^5 = .049

10 variations

.05/10 = .005

1 – (1-.005)^10 = .049

20 variations

.05/20 = .0025

1 – (1-.0025)^20 = .049
Questions?
Appendix
46

A/B test steps:
1. Decide what to test
2. Determine a metric to test
3. Formulate your hypothesis
1. Select an effect size threshold: what change of the metric would make a rollout
worthwhile?
4. Calculate sample size (your stopping point)
1. Decide your Type I (alpha) and Type 2 (beta) error levels and the corresponding zscores
2. Determine the standard deviation of your metric
5. Run your test! But stop once you’ve reached the fixed sample size stopping point
6. Compute your z-score and compare it with the z-score for your chosen alpha level
47

Type I and Type II error
● Type I error: incorrectly reject a true null hypothesis
– alpha
● Type II error: incorrectly accept a false null hypothesis
– beta
– Power: 1 - beta
48

Z-score reference table
alpha

One-sided test

Two-sided test

.10

1.28

1.65

.05

1.65

1.96

.01

2.33

2.58
49

Z-score for proportions (e.g. conversion)

Contenu connexe

Tendances

Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectiveXavier Amatriain
 
A/B Testing Best Practices - Do's and Don'ts
A/B Testing Best Practices - Do's and Don'tsA/B Testing Best Practices - Do's and Don'ts
A/B Testing Best Practices - Do's and Don'tsRamkumar Ravichandran
 
10 Guidelines for A/B Testing
10 Guidelines for A/B Testing10 Guidelines for A/B Testing
10 Guidelines for A/B TestingEmily Robinson
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyVidhya Murali
 
Why everything is an A/B Test at Pinterest
Why everything is an A/B Test at PinterestWhy everything is an A/B Test at Pinterest
Why everything is an A/B Test at PinterestKrishna Gade
 
Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Mounia Lalmas-Roelleke
 
Counterfactual Learning for Recommendation
Counterfactual Learning for RecommendationCounterfactual Learning for Recommendation
Counterfactual Learning for RecommendationOlivier Jeunen
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyChris Johnson
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At SpotifyVidhya Murali
 
4 Steps Toward Scientific A/B Testing
4 Steps Toward Scientific A/B Testing4 Steps Toward Scientific A/B Testing
4 Steps Toward Scientific A/B TestingJanessa Lantz
 
Past, Present & Future of Recommender Systems: An Industry Perspective
Past, Present & Future of Recommender Systems: An Industry PerspectivePast, Present & Future of Recommender Systems: An Industry Perspective
Past, Present & Future of Recommender Systems: An Industry PerspectiveJustin Basilico
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix ScaleJustin Basilico
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with SparkChris Johnson
 
CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyVidhya Murali
 
Intro to A/B Testing by Ever's Senior Product Manager
Intro to A/B Testing by Ever's Senior Product ManagerIntro to A/B Testing by Ever's Senior Product Manager
Intro to A/B Testing by Ever's Senior Product ManagerProduct School
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyChing-Wei Chen
 
Spotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSpotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSophia Ciocca
 
Sequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsSequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsJaya Kawale
 
Ab testing 101
Ab testing 101Ab testing 101
Ab testing 101Ashish Dua
 

Tendances (20)

Past, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspectivePast, present, and future of Recommender Systems: an industry perspective
Past, present, and future of Recommender Systems: an industry perspective
 
A/B Testing Best Practices - Do's and Don'ts
A/B Testing Best Practices - Do's and Don'tsA/B Testing Best Practices - Do's and Don'ts
A/B Testing Best Practices - Do's and Don'ts
 
10 Guidelines for A/B Testing
10 Guidelines for A/B Testing10 Guidelines for A/B Testing
10 Guidelines for A/B Testing
 
Recommending and searching @ Spotify
Recommending and searching @ SpotifyRecommending and searching @ Spotify
Recommending and searching @ Spotify
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at Spotify
 
Why everything is an A/B Test at Pinterest
Why everything is an A/B Test at PinterestWhy everything is an A/B Test at Pinterest
Why everything is an A/B Test at Pinterest
 
Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)Recommending and Searching (Research @ Spotify)
Recommending and Searching (Research @ Spotify)
 
Counterfactual Learning for Recommendation
Counterfactual Learning for RecommendationCounterfactual Learning for Recommendation
Counterfactual Learning for Recommendation
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at Spotify
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
 
4 Steps Toward Scientific A/B Testing
4 Steps Toward Scientific A/B Testing4 Steps Toward Scientific A/B Testing
4 Steps Toward Scientific A/B Testing
 
Past, Present & Future of Recommender Systems: An Industry Perspective
Past, Present & Future of Recommender Systems: An Industry PerspectivePast, Present & Future of Recommender Systems: An Industry Perspective
Past, Present & Future of Recommender Systems: An Industry Perspective
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix Scale
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with Spark
 
CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At Spotify
 
Intro to A/B Testing by Ever's Senior Product Manager
Intro to A/B Testing by Ever's Senior Product ManagerIntro to A/B Testing by Ever's Senior Product Manager
Intro to A/B Testing by Ever's Senior Product Manager
 
Machine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at SpotifyMachine Learning and Big Data for Music Discovery at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
 
Spotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendationsSpotify Discover Weekly: The machine learning behind your music recommendations
Spotify Discover Weekly: The machine learning behind your music recommendations
 
Sequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsSequential Decision Making in Recommendations
Sequential Decision Making in Recommendations
 
Ab testing 101
Ab testing 101Ab testing 101
Ab testing 101
 

En vedette

Making Better Mistakes Tomorrow
Making Better Mistakes TomorrowMaking Better Mistakes Tomorrow
Making Better Mistakes TomorrowDanielle Jabin
 
Measuring team performance at spotify slideshare
Measuring team performance at spotify slideshareMeasuring team performance at spotify slideshare
Measuring team performance at spotify slideshareDanielle Jabin
 
Africa DevOps Day 2015
Africa DevOps Day 2015Africa DevOps Day 2015
Africa DevOps Day 2015Danielle Jabin
 
The math-behind-ab-testing
The math-behind-ab-testingThe math-behind-ab-testing
The math-behind-ab-testingAmit Sawhney
 
Managing Experiment at Spotify
Managing Experiment at SpotifyManaging Experiment at Spotify
Managing Experiment at SpotifyAli Sarrafi
 
Big Data At Spotify
Big Data At SpotifyBig Data At Spotify
Big Data At SpotifyAdam Kawa
 
Explore and Evaluate - A different intro to research and data
Explore and Evaluate - A different intro to research and dataExplore and Evaluate - A different intro to research and data
Explore and Evaluate - A different intro to research and dataBen Dressler
 
Being a Data Driven Business
Being a Data Driven Business Being a Data Driven Business
Being a Data Driven Business Ali Sarrafi
 
How spotify makes product
How spotify makes productHow spotify makes product
How spotify makes productAli Sarrafi
 
Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved Peter Antman
 
An Intro to Learning Organization
An Intro to Learning OrganizationAn Intro to Learning Organization
An Intro to Learning OrganizationHakan Cuzdan
 
How data drives spotify
How data drives spotifyHow data drives spotify
How data drives spotifyAli Sarrafi
 
A/B Testing for Lean Startups
A/B Testing for Lean StartupsA/B Testing for Lean Startups
A/B Testing for Lean StartupsPete Mauro
 
Fluent at agile - agile sverige 2014
Fluent at agile - agile sverige 2014Fluent at agile - agile sverige 2014
Fluent at agile - agile sverige 2014Peter Antman
 
Experimentation Platform at Netflix
Experimentation Platform at NetflixExperimentation Platform at Netflix
Experimentation Platform at NetflixSteve Urban
 
A/B Testing: You Might be Driving in the Wrong Direction
A/B Testing: You Might be Driving in the Wrong DirectionA/B Testing: You Might be Driving in the Wrong Direction
A/B Testing: You Might be Driving in the Wrong DirectionKissmetrics on SlideShare
 
온라인 서비스 개선을 데이터 활용법 - 김진영 (How We Use Data)
온라인 서비스 개선을 데이터 활용법  - 김진영 (How We Use Data)온라인 서비스 개선을 데이터 활용법  - 김진영 (How We Use Data)
온라인 서비스 개선을 데이터 활용법 - 김진영 (How We Use Data)Jin Young Kim
 
Product Owner presentation for Spotify
Product Owner presentation for SpotifyProduct Owner presentation for Spotify
Product Owner presentation for Spotifypdicorpo
 
eMetrics London - The AB Testing Hype Cycle
eMetrics London - The AB Testing Hype CycleeMetrics London - The AB Testing Hype Cycle
eMetrics London - The AB Testing Hype CycleCraig Sullivan
 

En vedette (20)

Making Better Mistakes Tomorrow
Making Better Mistakes TomorrowMaking Better Mistakes Tomorrow
Making Better Mistakes Tomorrow
 
Measuring team performance at spotify slideshare
Measuring team performance at spotify slideshareMeasuring team performance at spotify slideshare
Measuring team performance at spotify slideshare
 
Data at Spotify
Data at SpotifyData at Spotify
Data at Spotify
 
Africa DevOps Day 2015
Africa DevOps Day 2015Africa DevOps Day 2015
Africa DevOps Day 2015
 
The math-behind-ab-testing
The math-behind-ab-testingThe math-behind-ab-testing
The math-behind-ab-testing
 
Managing Experiment at Spotify
Managing Experiment at SpotifyManaging Experiment at Spotify
Managing Experiment at Spotify
 
Big Data At Spotify
Big Data At SpotifyBig Data At Spotify
Big Data At Spotify
 
Explore and Evaluate - A different intro to research and data
Explore and Evaluate - A different intro to research and dataExplore and Evaluate - A different intro to research and data
Explore and Evaluate - A different intro to research and data
 
Being a Data Driven Business
Being a Data Driven Business Being a Data Driven Business
Being a Data Driven Business
 
How spotify makes product
How spotify makes productHow spotify makes product
How spotify makes product
 
Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved
 
An Intro to Learning Organization
An Intro to Learning OrganizationAn Intro to Learning Organization
An Intro to Learning Organization
 
How data drives spotify
How data drives spotifyHow data drives spotify
How data drives spotify
 
A/B Testing for Lean Startups
A/B Testing for Lean StartupsA/B Testing for Lean Startups
A/B Testing for Lean Startups
 
Fluent at agile - agile sverige 2014
Fluent at agile - agile sverige 2014Fluent at agile - agile sverige 2014
Fluent at agile - agile sverige 2014
 
Experimentation Platform at Netflix
Experimentation Platform at NetflixExperimentation Platform at Netflix
Experimentation Platform at Netflix
 
A/B Testing: You Might be Driving in the Wrong Direction
A/B Testing: You Might be Driving in the Wrong DirectionA/B Testing: You Might be Driving in the Wrong Direction
A/B Testing: You Might be Driving in the Wrong Direction
 
온라인 서비스 개선을 데이터 활용법 - 김진영 (How We Use Data)
온라인 서비스 개선을 데이터 활용법  - 김진영 (How We Use Data)온라인 서비스 개선을 데이터 활용법  - 김진영 (How We Use Data)
온라인 서비스 개선을 데이터 활용법 - 김진영 (How We Use Data)
 
Product Owner presentation for Spotify
Product Owner presentation for SpotifyProduct Owner presentation for Spotify
Product Owner presentation for Spotify
 
eMetrics London - The AB Testing Hype Cycle
eMetrics London - The AB Testing Hype CycleeMetrics London - The AB Testing Hype Cycle
eMetrics London - The AB Testing Hype Cycle
 

Similaire à A/B Testing Pitfalls and Lessons Learned at Spotify

Quantitative Aptitude Test (QAT)- Tips & Tricks
Quantitative Aptitude Test (QAT)- Tips & TricksQuantitative Aptitude Test (QAT)- Tips & Tricks
Quantitative Aptitude Test (QAT)- Tips & Tricksshwetavashishtha
 
Quantitative Aptitude Test (QAT)-Tips & Tricks
Quantitative Aptitude Test (QAT)-Tips & TricksQuantitative Aptitude Test (QAT)-Tips & Tricks
Quantitative Aptitude Test (QAT)-Tips & Trickscocubes_learningcalendar
 
Uji liliefors
Uji lilieforsUji liliefors
Uji lilieforsjojun
 
WEEK 6 – HOMEWORK 6 LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
WEEK 6 – HOMEWORK 6  LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docxWEEK 6 – HOMEWORK 6  LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
WEEK 6 – HOMEWORK 6 LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docxcockekeshia
 
Design of Experiments
Design of Experiments Design of Experiments
Design of Experiments Furk Kruf
 
Logistic regression
Logistic regressionLogistic regression
Logistic regressionRupak Roy
 
Tips & tricks for Quantitative Aptitude
Tips & tricks for Quantitative AptitudeTips & tricks for Quantitative Aptitude
Tips & tricks for Quantitative AptitudeAmber Bhaumik
 
CrashCourse_0622
CrashCourse_0622CrashCourse_0622
CrashCourse_0622Dexen Xi
 
Statistics for CRO - Conversion Conference London
Statistics for CRO - Conversion Conference LondonStatistics for CRO - Conversion Conference London
Statistics for CRO - Conversion Conference LondonTom Capper
 
Conversion Conference Berlin
Conversion Conference BerlinConversion Conference Berlin
Conversion Conference BerlinTom Capper
 
6.7 Percent Equations
6.7 Percent Equations6.7 Percent Equations
6.7 Percent EquationsJessca Lundin
 
Lecture_Wk08.pdf
Lecture_Wk08.pdfLecture_Wk08.pdf
Lecture_Wk08.pdfNiel89
 
Memorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsMemorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsPrincessNorberte
 
Week8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamWeek8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamBrent Heard
 

Similaire à A/B Testing Pitfalls and Lessons Learned at Spotify (20)

Quantitative Aptitude Test (QAT)- Tips & Tricks
Quantitative Aptitude Test (QAT)- Tips & TricksQuantitative Aptitude Test (QAT)- Tips & Tricks
Quantitative Aptitude Test (QAT)- Tips & Tricks
 
Quantitative Aptitude Test (QAT)-Tips & Tricks
Quantitative Aptitude Test (QAT)-Tips & TricksQuantitative Aptitude Test (QAT)-Tips & Tricks
Quantitative Aptitude Test (QAT)-Tips & Tricks
 
Uji liliefors
Uji lilieforsUji liliefors
Uji liliefors
 
WEEK 6 – HOMEWORK 6 LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
WEEK 6 – HOMEWORK 6  LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docxWEEK 6 – HOMEWORK 6  LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
WEEK 6 – HOMEWORK 6 LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
 
Quant tips edited
Quant tips editedQuant tips edited
Quant tips edited
 
7. the t distribution
7. the t distribution7. the t distribution
7. the t distribution
 
Design of Experiments
Design of Experiments Design of Experiments
Design of Experiments
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
MATHS
MATHSMATHS
MATHS
 
Tips & tricks for Quantitative Aptitude
Tips & tricks for Quantitative AptitudeTips & tricks for Quantitative Aptitude
Tips & tricks for Quantitative Aptitude
 
CrashCourse_0622
CrashCourse_0622CrashCourse_0622
CrashCourse_0622
 
Statistics for CRO - Conversion Conference London
Statistics for CRO - Conversion Conference LondonStatistics for CRO - Conversion Conference London
Statistics for CRO - Conversion Conference London
 
01_SLR_final (1).pptx
01_SLR_final (1).pptx01_SLR_final (1).pptx
01_SLR_final (1).pptx
 
Conversion Conference Berlin
Conversion Conference BerlinConversion Conference Berlin
Conversion Conference Berlin
 
6.7 Percent Equations
6.7 Percent Equations6.7 Percent Equations
6.7 Percent Equations
 
Lecture_Wk08.pdf
Lecture_Wk08.pdfLecture_Wk08.pdf
Lecture_Wk08.pdf
 
Memorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsMemorization of Various Calculator shortcuts
Memorization of Various Calculator shortcuts
 
Machine learning mathematicals.pdf
Machine learning mathematicals.pdfMachine learning mathematicals.pdf
Machine learning mathematicals.pdf
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
 
Week8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamWeek8 Live Lecture for Final Exam
Week8 Live Lecture for Final Exam
 

Dernier

Structuring and Writing DRL Mckinsey (1).pdf
Structuring and Writing DRL Mckinsey (1).pdfStructuring and Writing DRL Mckinsey (1).pdf
Structuring and Writing DRL Mckinsey (1).pdflaloo_007
 
Marel Q1 2024 Investor Presentation from May 8, 2024
Marel Q1 2024 Investor Presentation from May 8, 2024Marel Q1 2024 Investor Presentation from May 8, 2024
Marel Q1 2024 Investor Presentation from May 8, 2024Marel
 
Falcon Invoice Discounting: Empowering Your Business Growth
Falcon Invoice Discounting: Empowering Your Business GrowthFalcon Invoice Discounting: Empowering Your Business Growth
Falcon Invoice Discounting: Empowering Your Business GrowthFalcon investment
 
Unveiling Falcon Invoice Discounting: Leading the Way as India's Premier Bill...
Unveiling Falcon Invoice Discounting: Leading the Way as India's Premier Bill...Unveiling Falcon Invoice Discounting: Leading the Way as India's Premier Bill...
Unveiling Falcon Invoice Discounting: Leading the Way as India's Premier Bill...Falcon Invoice Discounting
 
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai KuwaitThe Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwaitdaisycvs
 
Falcon Invoice Discounting: Aviate Your Cash Flow Challenges
Falcon Invoice Discounting: Aviate Your Cash Flow ChallengesFalcon Invoice Discounting: Aviate Your Cash Flow Challenges
Falcon Invoice Discounting: Aviate Your Cash Flow Challengeshemanthkumar470700
 
Buy Verified TransferWise Accounts From Seosmmearth
Buy Verified TransferWise Accounts From SeosmmearthBuy Verified TransferWise Accounts From Seosmmearth
Buy Verified TransferWise Accounts From SeosmmearthBuy Verified Binance Account
 
Pre Engineered Building Manufacturers Hyderabad.pptx
Pre Engineered  Building Manufacturers Hyderabad.pptxPre Engineered  Building Manufacturers Hyderabad.pptx
Pre Engineered Building Manufacturers Hyderabad.pptxRoofing Contractor
 
Jual Obat Aborsi ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cytotec
Jual Obat Aborsi ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan CytotecJual Obat Aborsi ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cytotec
Jual Obat Aborsi ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan CytotecZurliaSoop
 
Famous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st CenturyFamous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st Centuryrwgiffor
 
PHX May 2024 Corporate Presentation Final
PHX May 2024 Corporate Presentation FinalPHX May 2024 Corporate Presentation Final
PHX May 2024 Corporate Presentation FinalPanhandleOilandGas
 
Lucknow Housewife Escorts by Sexy Bhabhi Service 8250092165
Lucknow Housewife Escorts  by Sexy Bhabhi Service 8250092165Lucknow Housewife Escorts  by Sexy Bhabhi Service 8250092165
Lucknow Housewife Escorts by Sexy Bhabhi Service 8250092165meghakumariji156
 
Katrina Personal Brand Project and portfolio 1
Katrina Personal Brand Project and portfolio 1Katrina Personal Brand Project and portfolio 1
Katrina Personal Brand Project and portfolio 1kcpayne
 
Cannabis Legalization World Map: 2024 Updated
Cannabis Legalization World Map: 2024 UpdatedCannabis Legalization World Map: 2024 Updated
Cannabis Legalization World Map: 2024 UpdatedCannaBusinessPlans
 
Organizational Transformation Lead with Culture
Organizational Transformation Lead with CultureOrganizational Transformation Lead with Culture
Organizational Transformation Lead with CultureSeta Wicaksana
 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...daisycvs
 
New 2024 Cannabis Edibles Investor Pitch Deck Template
New 2024 Cannabis Edibles Investor Pitch Deck TemplateNew 2024 Cannabis Edibles Investor Pitch Deck Template
New 2024 Cannabis Edibles Investor Pitch Deck TemplateCannaBusinessPlans
 
TVB_The Vietnam Believer Newsletter_May 6th, 2024_ENVol. 006.pdf
TVB_The Vietnam Believer Newsletter_May 6th, 2024_ENVol. 006.pdfTVB_The Vietnam Believer Newsletter_May 6th, 2024_ENVol. 006.pdf
TVB_The Vietnam Believer Newsletter_May 6th, 2024_ENVol. 006.pdfbelieveminhh
 

Dernier (20)

Structuring and Writing DRL Mckinsey (1).pdf
Structuring and Writing DRL Mckinsey (1).pdfStructuring and Writing DRL Mckinsey (1).pdf
Structuring and Writing DRL Mckinsey (1).pdf
 
Marel Q1 2024 Investor Presentation from May 8, 2024
Marel Q1 2024 Investor Presentation from May 8, 2024Marel Q1 2024 Investor Presentation from May 8, 2024
Marel Q1 2024 Investor Presentation from May 8, 2024
 
Falcon Invoice Discounting: Empowering Your Business Growth
Falcon Invoice Discounting: Empowering Your Business GrowthFalcon Invoice Discounting: Empowering Your Business Growth
Falcon Invoice Discounting: Empowering Your Business Growth
 
Unveiling Falcon Invoice Discounting: Leading the Way as India's Premier Bill...
Unveiling Falcon Invoice Discounting: Leading the Way as India's Premier Bill...Unveiling Falcon Invoice Discounting: Leading the Way as India's Premier Bill...
Unveiling Falcon Invoice Discounting: Leading the Way as India's Premier Bill...
 
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai KuwaitThe Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
 
Falcon Invoice Discounting: Aviate Your Cash Flow Challenges
Falcon Invoice Discounting: Aviate Your Cash Flow ChallengesFalcon Invoice Discounting: Aviate Your Cash Flow Challenges
Falcon Invoice Discounting: Aviate Your Cash Flow Challenges
 
Buy Verified TransferWise Accounts From Seosmmearth
Buy Verified TransferWise Accounts From SeosmmearthBuy Verified TransferWise Accounts From Seosmmearth
Buy Verified TransferWise Accounts From Seosmmearth
 
Pre Engineered Building Manufacturers Hyderabad.pptx
Pre Engineered  Building Manufacturers Hyderabad.pptxPre Engineered  Building Manufacturers Hyderabad.pptx
Pre Engineered Building Manufacturers Hyderabad.pptx
 
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabiunwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
 
Jual Obat Aborsi ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cytotec
Jual Obat Aborsi ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan CytotecJual Obat Aborsi ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cytotec
Jual Obat Aborsi ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan Cytotec
 
Famous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st CenturyFamous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st Century
 
PHX May 2024 Corporate Presentation Final
PHX May 2024 Corporate Presentation FinalPHX May 2024 Corporate Presentation Final
PHX May 2024 Corporate Presentation Final
 
Lucknow Housewife Escorts by Sexy Bhabhi Service 8250092165
Lucknow Housewife Escorts  by Sexy Bhabhi Service 8250092165Lucknow Housewife Escorts  by Sexy Bhabhi Service 8250092165
Lucknow Housewife Escorts by Sexy Bhabhi Service 8250092165
 
Katrina Personal Brand Project and portfolio 1
Katrina Personal Brand Project and portfolio 1Katrina Personal Brand Project and portfolio 1
Katrina Personal Brand Project and portfolio 1
 
Cannabis Legalization World Map: 2024 Updated
Cannabis Legalization World Map: 2024 UpdatedCannabis Legalization World Map: 2024 Updated
Cannabis Legalization World Map: 2024 Updated
 
Organizational Transformation Lead with Culture
Organizational Transformation Lead with CultureOrganizational Transformation Lead with Culture
Organizational Transformation Lead with Culture
 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
 
New 2024 Cannabis Edibles Investor Pitch Deck Template
New 2024 Cannabis Edibles Investor Pitch Deck TemplateNew 2024 Cannabis Edibles Investor Pitch Deck Template
New 2024 Cannabis Edibles Investor Pitch Deck Template
 
!~+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUD...
!~+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUD...!~+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUD...
!~+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUD...
 
TVB_The Vietnam Believer Newsletter_May 6th, 2024_ENVol. 006.pdf
TVB_The Vietnam Believer Newsletter_May 6th, 2024_ENVol. 006.pdfTVB_The Vietnam Believer Newsletter_May 6th, 2024_ENVol. 006.pdf
TVB_The Vietnam Believer Newsletter_May 6th, 2024_ENVol. 006.pdf
 

A/B Testing Pitfalls and Lessons Learned at Spotify

  • 1. A/B Testing: Avoiding Common Pitfalls Danielle Jabin March 5, 2014
  • 2. 1 Make all the world’s music available instantly to everyone, wherever and whenever they want it
  • 3. 2 As of March 5, 2014
  • 4. 3 Over 24 million active users As of March 5, 2014
  • 5. 4 Access to more than 20 million songs As of March 5, 2014
  • 6. 5
  • 7. 6 But can we make it even easier?
  • 8. 7 We can try… …with A/B testing!
  • 11. Pitfall #1: Not limiting your error rate
  • 13. 12 What if I flip a coin 100 times and get 51 heads?
  • 14. 13 What if I flip a coin 100 times and get 5 heads?
  • 15. 14
  • 16. 15 The likelihood of obtaining a certain value under a given distribution is measured by its p-value
  • 17. 16 If there is a low likelihood that a change is due to chance alone, we call our results statistically significant
  • 18. 17 What if I flip a coin 100 times and get 5 heads?
  • 19. 18 Statistical significance is measured by alpha ● alpha levels of 5% and 1% are most commonly used – Alternatively: P(significant) = .05 or .01
  • 20. 19 Each alpha has a corresponding Z-score alpha Z-score (two-sided test) .10 1.65 .05 1.96 .01 2.58
  • 21. 20 The Z-score tells us how far a particular value is from the mean (and what the corresponding likelihood is)
  • 23. 22 Compute the Z-score at the end of the test
  • 24. 23 Standard deviation (σ) tells us how spread out the numbers are
  • 25. 24
  • 26. 25 To lock in error rates before you start, fix your sample size
  • 27. 26 What should my sample size be? ● To lock in error rates before you start a test, fix your sample size Represents the desired power (typically .84 for 80% power). Sample size in each group (assumes equal sized groups) 2s (Z b + Za /2 ) n= 2 difference Represents the desired 2 Standard deviation of the outcome variable Source: www.stanford.edu/~kcobb/hrp259/lecture11.ppt 2 Effect Size (the difference in means) level of statistical significance (typically 1.96).
  • 28. 27 Recap: running an A/B test ● Compute your sample size – Using alpha, beta, standard deviation of your metric, and effect size ● Run your test! But stop once you’ve reached the fixed sample size stopping point ● Compute your z-score and compare it with the z-score for the chosen alpha level
  • 32. Pitfall #2: Stopping your test before the fixed sample size stopping point
  • 33. 32 Sample size for varying alpha levels ● With σ = 10, difference in means = 1 Two-sided test alpha = .10, beta = .80 1230 alpha = .05, beta = .80 1568 alpha = .01, beta = .80 2339
  • 34. 33 Let’s see some numbers ● 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion rate 90% significance reached 95% significance reached 99% significance reached Source: destack.home.xs4all.nl/projects/significance/ Stop at first point of significance 654 of 1,000 Ended as significant 427 of 1,000 49 of 1,000 146 of 1,000 14 of 1,000 100 of 1,000
  • 35. 34 Remedies ● Don’t peek ● Okay, maybe you can peek, but don’t stop or make a decision before you reach the fixed sample size stopping point ● Sequential sampling
  • 37. Pitfall #3: Making multiple comparisons in one test
  • 38. 37 A test can be one of two things: significant or not significant ● P(significant) + P(not significant) = 1 ● Let’s take an alpha of .05 – P(significant) = .05 – P(not significant) = 1 – P(significant) = 1 - .05 = .95
  • 39. 38 What about for two comparisons? ● P(at least 1 significant) = 1 - P(none of the 2 are significant) ● P(none of the 2 are significant) = P(not significant)*P(not significant) = .95*.95 = .9025 ● P(at least 1 significant) = 1 - .9025 = .0975
  • 40. 39 What about for two comparisons? ●That’s almost 2x (1.95x, to be precise) your .05 significance rate!
  • 41. 40 And it just gets worse… P(at least 1 signifcant) An increase of… 5 variations 1 – (1-.05)^5 = .23 4.6x 10 variations 1 – (1-.05)^10 = .40 8x 20 variations 1 – (1-.05)^20 = .64 12.8x
  • 42. 41 How can we remedy this? ●Bonferroni correction – Divide P(significant), your alpha, by the number of variations you are testing, n – alpha/n becomes the new level of statistical significance
  • 43. 42 So what about two comparisons now? ● Our new P(significant) = .05/2 = .025 ● Our new P(not significant) = 1 - .025 = .975 ● P(at least 1 significant) = 1 - P(none of the 2 are significant) ● P(none of the 2 are significant) = P(not significant)*P(not significant) = .975*.975 = .951 ● P(at least 1 significant) = 1 - .951 = .0499
  • 44. 43 P(significant) stays under .05  Corrected alpha P(at least 1 signifcant) 5 variations .05/5 = .01 1 – (1-.01)^5 = .049 10 variations .05/10 = .005 1 – (1-.005)^10 = .049 20 variations .05/20 = .0025 1 – (1-.0025)^20 = .049
  • 47. 46 A/B test steps: 1. Decide what to test 2. Determine a metric to test 3. Formulate your hypothesis 1. Select an effect size threshold: what change of the metric would make a rollout worthwhile? 4. Calculate sample size (your stopping point) 1. Decide your Type I (alpha) and Type 2 (beta) error levels and the corresponding zscores 2. Determine the standard deviation of your metric 5. Run your test! But stop once you’ve reached the fixed sample size stopping point 6. Compute your z-score and compare it with the z-score for your chosen alpha level
  • 48. 47 Type I and Type II error ● Type I error: incorrectly reject a true null hypothesis – alpha ● Type II error: incorrectly accept a false null hypothesis – beta – Power: 1 - beta
  • 49. 48 Z-score reference table alpha One-sided test Two-sided test .10 1.28 1.65 .05 1.65 1.96 .01 2.33 2.58
  • 50. 49 Z-score for proportions (e.g. conversion)