[PositConf 2023] How Data Scientists Broke A/B Testing (and How We Can Fix It)

Carl Vogel, Data Scientist
How Data Scientists Broke A/B Testing
(and how we can fix it)
Questions?
pos.it/slido-A
A Completely
True Story
Launch on
Neutral
(But thanks anyway)
Existential Dread
(Get used to it)
A real PM
“If it’s something we really believe
in, I’ll launch on a flat result … if
it’s part of a broader strategy.”
“My features are hard as shit to build,
but easy to tweak, so I’m not always
worried about statistical significance.”
Another real PM
Not just NHST
Features aren’t IID
Path dependencies in
feature roadmaps
We develop experiences by
building up features over
time and it’s helpful to
launch them incrementally
MDE is basically zero
Feature costs are nearly all
sunk before the test
Any lift pays off
Not just NHST
Risk is mismeasured
Decision makers don’t
think about Type I and II
error rates, per se
They just want to make
more money than they lose
Can I make good decisions about small to moderate effects quickly?
You can’t make reliable inferences about small to moderate effects quickly.
Did they misuse the tool?
Or did we hand them the wrong one?
Non-Inferiority
Designs
Non-inferiority designs
Let’s try not to wreck the place
[Figure: superiority vs. non-inferiority designs]
• Inferiority margins (Δ) prompt us to ask:
• How much do we believe in this feature?
• How quickly will we improve on it?
• Stakeholders can give meaningful answers to these questions
• Compare to MDE/minimal lift, which is often made up
• Avoid meaningless minimum effect estimates
• Can power against a “no effect” alternative
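As a rough illustration of the non-inferiority idea above, the sketch below tests whether variant B's conversion rate is no worse than A's by more than the margin Δ, and sizes the test against a "no effect" alternative. The function names, the unpooled normal-approximation z-test, and all the numbers are assumptions for illustration, not something taken from the talk.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def non_inferiority_p_value(conv_a, n_a, conv_b, n_b, margin):
    """One-sided p-value for H0: p_B - p_A <= -margin (B is inferior).

    Uses an unpooled normal approximation (an illustrative choice).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    std_err = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_b - p_a + margin) / std_err
    return 1 - STD_NORMAL.cdf(z)


def n_per_arm_no_effect(base_rate, margin, alpha=0.05, power=0.80):
    """Users per arm needed to declare non-inferiority when the true lift is zero."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_power = STD_NORMAL.inv_cdf(power)
    variance = 2 * base_rate * (1 - base_rate)
    return ceil((z_alpha + z_power) ** 2 * variance / margin ** 2)


# Hypothetical data: B converts slightly worse than A; the margin is 1 point.
p_non_inferior = non_inferiority_p_value(1000, 10_000, 990, 10_000, margin=0.01)
p_superior = non_inferiority_p_value(1000, 10_000, 990, 10_000, margin=0.0)
needed = n_per_arm_no_effect(base_rate=0.10, margin=0.01)

print(f"non-inferiority p-value: {p_non_inferior:.3f}")
print(f"superiority p-value:     {p_superior:.3f}")
print(f"users per arm needed:    {needed}")
```

With `margin=0.0` the same function reduces to a one-sided superiority test, so the two p-values show how a flat result can fail superiority yet pass non-inferiority at the chosen margin.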
What’s
the rush?
The costs of long experiments
Time is money, folks
• Opportunity cost of time:
• Experimental features live on a roadmap; waiting for launch decisions
delays development of subsequent features
• Opportunity cost of sampling:
• As long as the experiment runs, many users aren’t getting the best
variant
• Maintenance costs:
• More experiments running means more complexity in the codebase,
more effort, etc.
Value of
Information
Designs
When is data worth it?
Good things are worth waiting for
• Waiting is costly, but data is valuable.
• We should keep going as long as the value
of more data exceeds the cost of more time
• Quantify our impatience as part of test
design
Expected Value vs. Cost of Data
[Figure: expected value, cost, and net expected value in dollars ($0 to $80,000) plotted against test length (0 to 60)]
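The trade-off between the expected value of data and the cost of waiting can be sketched by putting a normal prior on the lift and computing the expected value of sample information in closed form for each candidate test length. The sketch below assumes test length is measured in days; the normal prior, the launch-or-hold decision rule, the helper names, and every dollar figure are illustrative assumptions rather than the speaker's numbers.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def evsi(prior_mean, prior_sd, noise_sd, n_per_arm):
    """Expected value of sample information, in units of lift.

    Assumes a normal prior on the lift, normal sampling noise, and a decision
    rule of "launch B only if its expected lift is positive".
    """
    if n_per_arm <= 0:
        return 0.0
    sampling_var = 2 * noise_sd ** 2 / n_per_arm  # variance of the A/B difference
    # Standard deviation of the posterior mean, as seen before collecting data.
    tau = prior_sd ** 2 / sqrt(prior_sd ** 2 + sampling_var)
    z = prior_mean / tau
    value_with_data = prior_mean * STD_NORMAL.cdf(z) + tau * STD_NORMAL.pdf(z)
    return value_with_data - max(prior_mean, 0.0)


# Hypothetical inputs.
PRIOR_MEAN, PRIOR_SD = 0.0, 0.01   # prior belief about the lift in conversion rate
NOISE_SD = 0.5                     # per-user outcome standard deviation
USERS_PER_ARM_PER_DAY = 1_000
VALUE_PER_UNIT_LIFT = 5_000_000    # dollars if the lift were 1.0 (1 point = $50K)
COST_PER_DAY = 500                 # dollars of delay, sampling, and upkeep per day


def net_value(days):
    """Dollar value of the information from `days` of testing, net of its cost."""
    gain = VALUE_PER_UNIT_LIFT * evsi(PRIOR_MEAN, PRIOR_SD, NOISE_SD,
                                      days * USERS_PER_ARM_PER_DAY)
    return gain - COST_PER_DAY * days


best_days = max(range(0, 61), key=net_value)
print(f"best test length: {best_days} days, net value ${net_value(best_days):,.0f}")
```

Under these assumptions the value of information rises quickly and then flattens while cost grows linearly, so the net value peaks at an interior test length.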
Why is data valuable?
How dumb am I, in dollars?
• Before we have data, our range of potential lifts is wide
• Our best guess could be way off; we could make a big
mistake
• Observing data narrows the range: even if our new guess is
wrong, it won’t be wrong by as much.
• If the value of being less wrong (in expectation) exceeds the
cost of waiting for the data, LFG!
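The "less wrong in expectation" argument can be checked with a small simulation: draw a true lift from the prior, then compare the payoff from deciding on the prior alone with the payoff from deciding after seeing noisy test data. This is a sketch under assumed numbers (a normal prior, normal noise, a launch-if-positive rule), none of which come from the slides.

```python
import random
from math import sqrt

rng = random.Random(2023)  # fixed seed so the run is repeatable

PRIOR_MEAN, PRIOR_SD = 0.002, 0.01        # hypothetical belief about the lift
SAMPLING_SD = sqrt(2 * 0.5 ** 2 / 5_000)  # noise in the observed A/B difference
N_DRAWS = 20_000

payoff_prior_only = 0.0
payoff_with_data = 0.0
for _ in range(N_DRAWS):
    true_lift = rng.gauss(PRIOR_MEAN, PRIOR_SD)
    observed = rng.gauss(true_lift, SAMPLING_SD)

    # Without data we launch whenever the prior mean is positive.
    if PRIOR_MEAN > 0:
        payoff_prior_only += true_lift

    # With data we launch whenever the posterior mean is positive.
    precision = 1 / PRIOR_SD ** 2 + 1 / SAMPLING_SD ** 2
    posterior_mean = (PRIOR_MEAN / PRIOR_SD ** 2
                      + observed / SAMPLING_SD ** 2) / precision
    if posterior_mean > 0:
        payoff_with_data += true_lift

value_of_data = (payoff_with_data - payoff_prior_only) / N_DRAWS
print(f"average payoff, prior only: {payoff_prior_only / N_DRAWS:.5f}")
print(f"average payoff, with data:  {payoff_with_data / N_DRAWS:.5f}")
print(f"simulated value of the data (lift units): {value_of_data:.5f}")
```

Multiplying the simulated value by the dollar worth of a unit of lift, and comparing it with the cost of waiting for that much data, gives the go or no-go comparison the bullets describe.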
Expected Value of Sample Information
[Figure: expected value of sample information; dollar axis labels $0, $10K, $200K]
Sequential testing decisions
Don’t stop ’til you get enough
• We can do this again after collecting some data
• This changes the core decision from: “is B > A?” to “should I stop or
continue testing?”
• Good fit for A/B tests, where we collect data passively just by
waiting
• Once more data isn’t worth it, launch the best observed variant;
the inference problem is irrelevant (Claxton ’96)
• This is our best information, and it’s not worth getting more
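A minimal sketch of the stop-or-continue loop described above: after each batch, update the belief about the lift, estimate what one more batch is worth, and stop once that value no longer covers the cost. The one-batch-ahead (myopic) rule, the normal model, the function names, and the numbers are all assumptions made for illustration. A one-step look-ahead can stop too early when a single batch is uninformative but several together would not be.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def evsi_next_batch(mean, sd, batch_sampling_sd):
    """Value (in lift units) of one more batch, given the current normal belief."""
    tau = sd ** 2 / sqrt(sd ** 2 + batch_sampling_sd ** 2)
    z = mean / tau
    return mean * STD_NORMAL.cdf(z) + tau * STD_NORMAL.pdf(z) - max(mean, 0.0)


def run_sequential_test(true_lift, prior_mean, prior_sd, batch_sampling_sd,
                        value_per_unit_lift, cost_per_batch, max_batches, rng):
    """Collect batches until another one is not worth its cost, then pick a variant."""
    mean, sd = prior_mean, prior_sd
    batches = 0
    while batches < max_batches:
        gain = value_per_unit_lift * evsi_next_batch(mean, sd, batch_sampling_sd)
        if gain <= cost_per_batch:
            break  # more data is no longer worth waiting for
        # Simulated batch result; in practice this is the observed A/B difference.
        observed = rng.gauss(true_lift, batch_sampling_sd)
        # Conjugate normal update of the belief about the lift.
        precision = 1 / sd ** 2 + 1 / batch_sampling_sd ** 2
        mean = (mean / sd ** 2 + observed / batch_sampling_sd ** 2) / precision
        sd = sqrt(1 / precision)
        batches += 1
    decision = "B" if mean > 0 else "A"  # launch the best variant on current belief
    return decision, batches, mean


# Hypothetical run: one batch per day, 1,000 users per arm per day.
decision, batches, belief = run_sequential_test(
    true_lift=0.01, prior_mean=0.0, prior_sd=0.01,
    batch_sampling_sd=sqrt(2 * 0.5 ** 2 / 1_000),
    value_per_unit_lift=5_000_000, cost_per_batch=500,
    max_batches=60, rng=random.Random(7))
print(f"launch {decision} after {batches} batches (believed lift {belief:.4f})")
```

The loop never computes a p-value: the only question at each step is whether the next batch is expected to pay for itself.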
Lessons
What’s the Problem?
Going back to basics
There’s no silver bullet
You may have other problems; you’ll need
other solutions
Misuse of tools should prompt us to
rethink the problem
What are we actually trying to solve?
What are the costs, benefits, and risks?
What’s the Problem?
Going back to basics
Are we solving the problem, or treating
symptoms?
Launch-on-neutral, run-til-significant, peeking,
etc. are symptoms, not the root problem
Lots of advanced techniques speed up tests, but
don’t actually address reasons for impatience
Here, there, and everywhere
You’re soaking in it
This isn’t just about A/B testing
But it’s a domain where we have very
familiar tools close at hand
What are we here for?
People who solve problems for people are the luckiest people in the world
This is the fun stuff
This is where we add value as data
scientists
These problems aren’t solved
Try new stuff!
Carl Vogel
Principal Data Scientist
carl.vogel@babylist.com
Thanks!