If you’re a marketer, it’s very likely that you’ve run an A/B test. It’s also likely that you’ve never calculated the sample size for your tests, and instead, you run tests until they reach statistical significance. If this is the case, your strategy is statistically flawed. Honoring sample size requirements means marketers wait longer for test results, but choosing to ignore them produces false positives and leads to bad decisions.
This deck was created for an email audience, but there are valuable lessons here for anyone who runs A/B tests.
15. Assuming you check results after every impression and
stop once you reach significance…
your false positive rate climbs to 26.1%.
So you just went from 95% confidence to 74%.
This is a worst-case scenario. BUT, some test platforms do this automatically!
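The claim above is easy to check with a simulation: run many A/A tests (both arms identical, so any "winner" is a false positive), peek at a two-proportion z-test at regular intervals, and stop the first time it looks significant. The parameters below (500 trials, 10% conversion rate, peeking every 50 impressions) are illustrative choices, not figures from the deck.

```python
import random

def peeking_false_positive_rate(n_trials=500, n_per_arm=2000,
                                p=0.10, check_every=50,
                                z_crit=1.96, seed=7):
    """Fraction of A/A tests declared significant when we peek repeatedly.
    Since both arms convert at the same rate p, every 'win' is noise."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_trials):
        a = b = 0
        for n in range(1, n_per_arm + 1):
            a += rng.random() < p   # conversions in arm A
            b += rng.random() < p   # conversions in arm B
            if n % check_every == 0:
                pooled = (a + b) / (2 * n)
                se = (2 * pooled * (1 - pooled) / n) ** 0.5
                if se > 0 and abs(a - b) / n / se > z_crit:
                    false_positives += 1   # stopped early on noise
                    break
    return false_positives / n_trials

rate = peeking_false_positive_rate()
print(f"false positive rate with peeking: {rate:.1%}")  # far above the nominal 5%
```

Setting `check_every=n_per_arm` (one look at the end, i.e. the sample-size discipline this deck advocates) brings the rate back down to roughly the nominal 5%.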
20. Agenda
1. How we put this into practice on a website test
2. How we applied these learnings to email testing:
• Open rates
• Click to Open Rates
• Conversion Rates
21. A/B Testing on your website
Here’s your new test process:
1. Determine your baseline conversion rate (or click rate, or
download rate, etc.)
2. Decide how long you are willing to wait for a result. Convert your
unique traffic metric to a sample size.
3. Adjust the MDE (Minimum Detectable Effect) until your sample size is
just under the target you determined in #2 above.
4. Re-adjust the MDE until you’re comfortable with the trade-off between
effect size and wait time.
5. Start the test, and don’t stop until you hit the sample size.
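Steps 3–4 lean on a sample size calculator. A standard two-proportion formula can stand in for one; this sketch assumes a two-sided test at 95% significance and 80% power, so its numbers will differ from any calculator that uses other settings (which is likely why it doesn't reproduce the deck's figures exactly).

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variation(baseline, mde_relative, alpha=0.05, power=0.8):
    """Approximate subjects needed per variation to detect a relative
    lift of `mde_relative` over `baseline` (two-sided z-test)."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_beta = NormalDist().inv_cdf(power)            # power threshold
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
                 z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Lower MDE -> larger required sample; this is the trade-off steps 3-4 tune.
print(sample_size_per_variation(0.03, 0.20))  # ~14,000 per variation
print(sample_size_per_variation(0.03, 0.30))  # noticeably smaller
```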
23. Case Study: Item Urgency
TEST (VERSION A):
INVENTORY NOTIFICATION
CONTROL (VERSION B):
NO INVENTORY NOTIFICATION
24. STEP 1 – We determined our baseline conversion rate
25. STEP 2 – Calculate Target Sample Size
We initially decided we wanted a result in 2 weeks.
So we took the last 2 weeks of unique product page views:
26. STEP 2 – Calculate Target Sample Size
We then divided that number by two (since we’ll have two test
segments)
Divided by two again to account for desktop traffic only
Then multiplied by 5% (since the message only displays on 5% of
product pages)
Sample Size -> 12,351
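The funnel arithmetic above fits in a small helper. The two-week pageview total is not stated in the deck; the 988,080 used below is simply what the stated 12,351 implies when you work the divisions backwards, so treat it as a reconstruction.

```python
def target_sample_size(two_week_unique_views,
                       variants=2,          # split between test and control
                       desktop_share=0.5,   # keep desktop traffic only
                       display_rate=0.05):  # message shows on 5% of pages
    """Walk the deck's funnel: split traffic across variants, keep
    desktop only, then keep the share of pages showing the message."""
    per_variant = two_week_unique_views / variants
    desktop_only = per_variant * desktop_share
    return int(desktop_only * display_rate)

# 988,080 is inferred from the deck's 12,351 target, not stated in it.
print(target_sample_size(988_080))  # -> 12351
```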
27. This gave us a 30% MDE (conversion lift), which is unrealistic.
42. After learning about Sample Size,
we reconsidered our email testing strategy
• Open Rate (Subject line testing)
• Click-to-Open (CTO) Rate
• Conversion Rate
43. OPEN RATE
We used sample size to gut check the
size of our subject line test segments
44. OPEN RATE
Remember: the sample size calculator takes your
baseline conversion rate and your sample size,
and returns the MDE.
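That inversion can be sketched by binary-searching the lift until the required sample size matches what you have. This assumes the standard two-sided two-proportion formula at 80% power; a calculator with different settings (one-sided tests, different power) will give different MDEs, which may explain gaps versus the deck's figures.

```python
from math import sqrt
from statistics import NormalDist

def required_n(p1, rel_lift, alpha=0.05, power=0.8):
    """Per-variation sample size to detect a relative lift (two-sided)."""
    p2 = p1 * (1 + rel_lift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar)) +
           z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p2 - p1) ** 2

def mde_for(p1, n, alpha=0.05, power=0.8):
    """Smallest relative lift detectable with n subjects per variation."""
    lo, hi = 1e-4, min(1.0, (1 - p1) / p1)  # keep the lifted rate <= 100%
    for _ in range(80):
        mid = (lo + hi) / 2
        if required_n(p1, mid, alpha, power) > n:
            lo = mid   # lift too small to detect with only n subjects
        else:
            hi = mid
    return hi

# 17% baseline open rate, 10,000 recipients per subject line:
print(f"MDE ~ {mde_for(0.17, 10_000):.1%}")
```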
48. OPEN RATE
We always test 4 different subject lines.
We had been sending each subject line to 10,000
customers.
So, sample size ~ 10,000
49. OPEN RATE
Plugging these numbers in, we could only detect an open-rate lift of 13% or higher
50. OPEN RATE
A 13% lift on a 17% open rate means a winner would need to hit 19.2%.
We rarely see subject lines perform that well.
We needed a lower MDE to make sure we could detect
more winners…
51. OPEN RATE
We ended up doubling each subject line segment to
20,000 recipients (80,000 total), giving us an MDE ~ 9.2%
66. Conversion Rate
To get meaningful results for conversion rate,
consider repeating the same email test across multiple
sends, pooling the results until you reach the
necessary sample size.
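A sketch of that "run it many times" plan, using hypothetical figures (the 40,000 requirement and 10,000 recipients per send below are made up for illustration): plan the number of identical sends, and pool conversions across them before testing.

```python
from math import ceil

def sends_needed(required_per_variant, recipients_per_variant_per_send):
    """Identical sends needed before the pooled per-variant sample
    reaches the required size."""
    return ceil(required_per_variant / recipients_per_variant_per_send)

def pooled_rate(send_results):
    """Pool (conversions, recipients) pairs from repeated sends into one
    cumulative conversion rate; test only once the pooled n is enough."""
    conversions = sum(c for c, _ in send_results)
    recipients = sum(r for _, r in send_results)
    return conversions / recipients

print(sends_needed(40_000, 10_000))                # -> 4 sends
print(pooled_rate([(120, 10_000), (95, 10_000)]))  # cumulative rate so far
```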
67. Takeaways
This is the MDE curve again. Remember what it looks like.
The longer you run a test, the lower the MDE will be.
The more traffic volume you have, the faster the MDE drops
68. Takeaways
For Web Testing
• If you stop your A/B tests once you reach statistical significance, you are increasing your chances of finding
false positives
• Calculating sample size will give you a clear stop date and an MDE
• MDE and sample size are inversely related – The lower the MDE, the larger the sample size
• Most likely, your A/B tests need to run much longer than you realize
For Email Testing
• Use sample size to determine the size of your subject line test segments
• Your CTO tests are probably reaching the necessary sample size
• Your Conversion tests are probably not hitting sample size
69. Sources
Kyle Rush – Mozcon 2014 Presentation
https://seomoz.box.com/shared/static/2fw6yevkkmmdumz431j4.pdf
Evan Miller – How Not to Run an A/B Test
http://www.evanmiller.org/how-not-to-run-an-ab-test.html
70. Zack Notes
Digital Marketing
Manager
zack@uncommongoods.com
@zacknotes
slideshare.net/zacknotes1/presentations
74. What do you do if a test reaches sample size
and your lift < MDE?
75. You can either extend the test and accept a
lower MDE, or move on.
Editor's Notes
I have a thought experiment for you…
All of the scenarios have the same end result, except for scenario #3
Note the baseline, minimum detectable effect and sample size
Baseline = preexisting conversion rate, click rate, open rate, etc.
Minimum Detectable Effect or MDE = the minimum lift you will be able to detect once you’ve reached the sample size.
So here’s the gist of this presentation: suppose you run a test on a page with a baseline conversion rate of 3%, and you run it until your test segment reaches 10,316 impressions. If your observed conversion lift is below 20%, you can’t declare a winner, even if you’ve reached statistical significance. You either need to keep your test running or move on.
This is your step by step guide to sample size testing. I’m going to go over it very briefly. This slide is more of a resource for you to come back to when you set up your test.
We’re currently running this test on our product pages
This is a message that fires when fewer than 5 items remain in inventory. You can see the test on top and the control below.
We enter 1.16% as the baseline conversion rate, then turn the MDE up until our sample size is just under 12,351
Nobody wants to wait 17 weeks for a test result, but if you make the call too early, you could be shooting yourself in the foot: acting on a false positive and deploying a new design that is actually making your site worse
BACK TO ITEM URGENCY
This is a chart of conversion rate.
Test in Blue. Control in Red
Note how the MDE contains the lift with an upper bound
Also note how lift is approaching MDE towards the end
Red line is significance
Green line is the 95% mark
Note that the significance crossed 95% in the beginning and then came back down, and it’s now rising above 95% again.
17% after a week but 7% after 2 hours when we make the call
Here are a few examples of CTO tests we ran; at the end, in the appendix, I’ve included all the data for you to look at afterwards
I went back through all of our tests. In the 3 years since we’ve started A/B testing in emails, the only time we’ve hit sample size for conversion rate is when we’ve tested putting prices in an email (vs. leaving them out).
Unfortunately, this lift in conversion was countered by an equal drop in CTO
The lift from the vast majority of your tests will never reach MDE. Be more comfortable reporting “no statistical difference”