25. Change → Effect on productivity
Brighter light → Up
Dimmer light → Up
Warmer → Up
Cooler → Up
Shorter breaks → Up
Longer breaks → Up
26. Change → Effect on productivity
Brighter light → Up (temporarily)
Dimmer light → Up (temporarily)
Warmer → Up (temporarily)
Cooler → Up (temporarily)
Shorter breaks → Up (temporarily)
Longer breaks → Up (temporarily)
91. Job applications: Up
Job clicks: Down
Recommended Jobs traffic: Up
Job views: Sideways
New resumes: Up
Return visits: Down
Logins: Up
Revenue: Down
(and it goes on…)
129. Clicks received: 1,260
Out of budget time: 20:00
% of day w/o budget: 0.1667
Potential clicks: 1260 / (1 - 0.1667) = 1512
Missed clicks: 1512 * 0.1667 ≈ 252
Missed Clicks Report
Dear Customer,
You got 1,260 clicks yesterday.
Your daily budget ran out at 8:00pm.
If you funded your budget through the whole day, you’d get another 252 clicks - a +20% improvement!
Get More Clicks
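For reference, a minimal sketch of the arithmetic behind this report (the function name and numbers are illustrative, and it assumes clicks arrive roughly uniformly across the day):

```python
# Estimate the clicks missed because the daily budget ran out early.
# Simplifying assumption: click volume is roughly uniform across the day.

def missed_clicks(clicks_received: int, out_of_budget_hour: float) -> dict:
    fraction_without_budget = (24 - out_of_budget_hour) / 24
    potential_clicks = clicks_received / (1 - fraction_without_budget)
    missed = potential_clicks - clicks_received
    return {
        "fraction_without_budget": round(fraction_without_budget, 4),
        "potential_clicks": round(potential_clicks),
        "missed_clicks": round(missed),
        "lift_pct": round(100 * missed / clicks_received, 1),
    }

print(missed_clicks(1260, 20.0))
# {'fraction_without_budget': 0.1667, 'potential_clicks': 1512, 'missed_clicks': 252, 'lift_pct': 20.0}
```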
192. Hawthorne Revisited
“… the variance in productivity could be fully accounted for by the fact that the lighting changes were made on Sundays and therefore followed by Mondays when workers’ productivity was refreshed by a day off.”
https://en.wikipedia.org/wiki/Hawthorne_effect
220. Lesson 01: Be patient
Lesson 02: Sampling is hard
Lesson 03: Focus on a few, carefully chosen metrics
Lesson 04: Be rigorous with your analysis
Lesson 05: Watch out for side effects
Lesson 06: Use metrics and stories
Lesson 07: Plan for fallibility
Good evening, thanks for coming to our @IndeedEng Tech Talk tonight.
This is “Data-driven off a cliff, anti-patterns in evidence-based decision making”. I’m Tom Wilbur, and I’m a product manager at Indeed, and...
I help people get jobs.
Indeed is the #1 job site worldwide. We serve over 200M monthly unique users, across more than 60 countries and in 29 languages.
The primary place that jobseekers start on Indeed is here - the search experience. It’s simple -- you type in some keywords and a location and you get a ranked list of jobs that are relevant to you.
Indeed is headquartered here in Austin, Texas, the capital of the Lone Star State. Austin is also the location of our largest engineering office, and we have engineering offices around the world in Tokyo, Seattle, San Francisco and Hyderabad. So we have tons of smart engineers and product teams working around the clock to make a better Indeed.
https://en.wikipedia.org/wiki/Flag_of_Texas#/media/File:Flag_of_Texas.svg
We have tons of ideas.... BUT
We have tons of bad ideas, too.
Now occasionally we do have good ideas, but
It’s hard to tell the difference. What we really want to know, is --
What helps people get jobs? We believe...
The only reliable way to know is to just try stuff and see what works. (NEXT TO JOKE)
(pause) So at Indeed,
We set up experiments. We run A/B tests on our site where users are randomly assigned to different experiences.
We collect results. We observe the users’ behavior. Our LogRepo system adds about 6TB of new data every day.
And we use that data to decide what to do. To see which features and capabilities do help people get jobs, and which don’t.
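(As an aside, for anyone unfamiliar with how that random assignment typically works, here is a generic sketch of deterministic bucketing -- purely illustrative, not Indeed’s actual implementation.)

```python
# Generic sketch of deterministic A/B bucketing (not Indeed's actual system):
# hashing a stable user id together with the test name gives every user a
# consistent variant, and the assignment can be reproduced later from logs.
import hashlib

def assign_variant(user_id: str, test_name: str,
                   variants=("control", "treatment")) -> str:
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("user-12345", "hearts_vs_stars"))  # always the same answer for this user
```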
We’ve used data to make good decisions,
But having a ton of data is not a silver bullet.
We’ve also used data to make bad decisions. Because the truth is,
Science is hard. (NEXT TO JOKE)
(pause) For example, one serious problem is that the very act of just
Running an experiment, can ruin the experiment itself. Let me tell you a quick story.
There was a famous experiment conducted in the late 1920s at an electrical factory outside of Chicago, Illinois. Called the Hawthorne Works. The factory managers wanted to improve worker productivity, so they decided to try some changes to the worker environment.
They changed the lighting conditions, sometimes brighter, sometimes dimmer. They changed the temperature in the factory, and length of breaks. Initially they were excited, as their early experiments resulted in improvements in worker productivity.
Brighter lights? Productivity goes up! Dimmer lights? Productivity goes up! Warmer? Up! Cooler? Up. Shorter breaks, longer breaks, it seemed that everything they tried improved worker productivity. And on top of that, none of these improvements stuck.
It all quickly faded. Ultimately the conclusion of the researchers was that the very fact of changing the conditions, of running the test, of observing the results, affected the workers’ behavior. This effect is now known as -- the Hawthorne Effect. Those of us that run experiments to optimize websites all over the world know this well. When we see a change in user behavior, we often ask the question, “but will it last? Is that change real, or is it just the Hawthorne Effect?” So science is hard. And if that wasn’t enough,
Statistics are hard. There are plenty of ways an analysis can produce surprising, if not contradictory, results.
For example, consider “Anscombe’s quartet”. In 1973, statistician Francis Anscombe described four very different sets of 11 points that all have the same basic statistical properties -- mean, variance, correlation, and as the blue line shows, regression. This demonstrates that looking at a statistical calculation isn’t at all sufficient to understand your data, especially when there are outliers.
https://en.wikipedia.org/wiki/Anscombe%27s_quartet
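A quick way to see this for yourself, using seaborn’s bundled copy of the quartet (seaborn downloads this sample dataset on first use; the code is a sketch, not anything from the talk):

```python
# The four Anscombe datasets share nearly identical summary statistics,
# even though they look completely different when plotted.
import seaborn as sns

df = sns.load_dataset("anscombe")  # columns: dataset, x, y
for name, group in df.groupby("dataset"):
    print(name,
          f"mean_x={group.x.mean():.2f}",
          f"mean_y={group.y.mean():.2f}",
          f"var_x={group.x.var():.2f}",
          f"corr={group.x.corr(group.y):.3f}")
# All four rows print essentially the same numbers; only a plot reveals the differences.
```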
Another example is Simpson’s Paradox. This is where a statistician goes back in time with his toaster and starts accidentally changing the future and the more he tries to fix it, the worse it gets. There are no donuts, and people have lizard-tongues, and that’s just no way to make data-driven decisions. Wait no, that’s Homer Simpson’s Paradox from Treehouse of Horror V. Sorry.
Edward Simpson’s Paradox is something else. This result describes the situation where individual groups of data tell a different story than when the data are combined. On this chart for example, the four blue dots and four red dots each show a positive trend, but when combined, you get the black dotted line that shows a negative trend overall. Imagine if you saw that revenue per user on mobile was increasing, and revenue per user on desktop was increasing, but overall revenue per user appeared to be decreasing. Now what do you do? Usually this situation means you don’t understand underlying causal relationships in your data. Because statistics are hard.
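A toy numeric version of that mobile/desktop scenario (the figures are made up for illustration, not Indeed data):

```python
# Simpson's paradox with made-up numbers: revenue per user rises within each
# platform, but the user mix shifts toward the lower-revenue platform, so the
# blended revenue per user falls.
quarters = ["Q1", "Q2", "Q3"]
desktop = {"users": [1000, 600, 300], "rev_per_user": [10.0, 11.0, 12.0]}  # rising
mobile  = {"users": [200, 800, 1500], "rev_per_user": [2.0, 2.5, 3.0]}     # rising

for i, q in enumerate(quarters):
    total_rev = (desktop["users"][i] * desktop["rev_per_user"][i]
                 + mobile["users"][i] * mobile["rev_per_user"][i])
    total_users = desktop["users"][i] + mobile["users"][i]
    print(q, f"blended rev/user = {total_rev / total_users:.2f}")
# Q1 = 8.67, Q2 = 6.14, Q3 = 4.50 -- falling overall, even though each
# platform's revenue per user is rising.
```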
But using data correctly is more than just statistics. If you apply good math to a bad idea...
Just because it’s mathematically correct, doesn’t mean you won’t seriously regret the outcome of that test.
http://www.glamour.com/images/health-fitness/2011/06/0606-tequila_at.jpg
So bad practices can undermine good math.
You don’t need me to teach you how to be bad at math.
But tonight, I’ll teach you to be bad at everything else. On top of the inherent challenges of science, statistics and bad ideas, we’ll share with you our powerful techniques of how to make data-driven decisions… the wrong way.
So, we’ll start with Anti-Lesson number 1. Be Impatient. One of the best ways to be bad at evidence-based decision making is to be impatient.
A p-value is the standard measure of statistical significance. It represents the probability of seeing a result at least as extreme as the one you observed if the null hypothesis were true -- or, informally, the chance that what you’ve measured is just random noise. For a successful A/B test, we want to see positive results with a p-value below some threshold, typically 5% or .05.
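As a concrete example, here is roughly how a p-value for a simple A/B conversion test might be computed (the counts are invented, and the two-proportion z-test is just one common choice):

```python
# Two-proportion z-test on made-up conversion counts for control vs. variant.
from statsmodels.stats.proportion import proportions_ztest

conversions = [530, 580]     # control, variant
visitors = [10000, 10000]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# p < 0.05 would usually be called "statistically significant" -- but only if
# the sample size was fixed in advance rather than checked repeatedly.
```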
But a p-value is calculated per measurement, not for the whole experiment. It only tells you how confident to be in your results given the circumstances of the test thus far.
If you check results on Tuesday, that’s another measurement. Now your boss is asking if it’s significant yet. So you keep checking and checking,
And your data scientist is muttering, saying you should just wait to get to the necessary sample size she estimated. It’s really frustrating. (pause) There’s a better way.
Got the result you want? On that test you knew was a good idea? Are the results already positive after only two days? And when you checked the p-value on your phone while in line at Starbucks, was it less than 0.05?
Declare victory! Turn off the test and roll it 100%. Don’t waste your valuable time with that statistical wah wah wah about regression to the mean and probability of null hypothesis something.
http://www.qubit.com/sites/default/files/pdf/mostwinningabtestresultsareillusory_0.pdf (Martin Goodson, Research Lead at Qubit, a UK web consultancy)
In fact, Martin Goodson shows that if you were to do a check for significance every day, and stop positive tests as soon as they show significance, 80% of those “winning” A/B tests are likely false-positives. Are bogus results. And that’s why being impatient is a great way to make bad decisions.
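A rough simulation makes the mechanism clear. Assuming an A/A test (no real difference between variants), steady daily traffic, and a significance check after every day of data -- all simplifications on my part -- repeated peeking inflates the false-positive rate well beyond the nominal 5%:

```python
# Rough simulation of "peeking": run a two-proportion test daily and stop the
# first time p < 0.05. With no true difference between the variants, the
# false-positive rate climbs far above 5%.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
DAYS, DAILY_VISITORS, BASE_RATE, RUNS = 30, 1000, 0.05, 2000

false_positives = 0
for _ in range(RUNS):
    a_conv = b_conv = a_n = b_n = 0
    for _ in range(DAYS):
        a_conv += rng.binomial(DAILY_VISITORS, BASE_RATE)
        b_conv += rng.binomial(DAILY_VISITORS, BASE_RATE)
        a_n += DAILY_VISITORS
        b_n += DAILY_VISITORS
        _, p = proportions_ztest([a_conv, b_conv], [a_n, b_n])
        if p < 0.05:          # "declare victory" the moment it looks significant
            false_positives += 1
            break

print(f"false-positive rate with daily peeking: {false_positives / RUNS:.1%}")
# Far higher than the 5% you'd expect from a single, pre-planned check.
```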
Another great way to do data-driven product development wrong is to believe that sampling is easy. I mean, it’s hard and time-consuming to make sure that you’ve got representative users in your A/B tests.
Let me illustrate this with a story I call, “Beware the IEdes of March.” And you’ll see how well this anti-pattern worked for me.
At a previous company where I worked, we were building Used Car search experiences for major media brands, and we were doing A/B tests to try to increase the probability that we successfully connect a car shopper to a dealer with matching inventory.
One of the things we had observed when we analyzed successful user behavior, was that shoppers specifying price, mileage or year in their search do better. They’re more successful at finding cars they are interested in. So we had a hypothesis --
Could we encourage shoppers to specify price, mileage or year, and improve conversion?
We tried a couple ideas, including moving the price, mileage and year facets up in the search UI to make it easier to find, and we also tried a tooltip nudge, directly encouraging users to add these terms to their search.
Of all the variants, the tooltip nudge wins, we saw a 3% lift in unique conversion (with a p-value of .04). So we decided to roll it out.
It turns out, we’d taken a shortcut in our test assignment code. This was the summer of 2009, when IE had 60%+ of the US browser market, and my company, like many others, was sick and tired of supporting IE6 (the browser that PC World called “the least secure software on the planet”). So to work around a problem in our code that assigned users to test variants, we just didn’t handle IE6.
So the users on the oldest browsers got ignored. This turned out to be 20%+ of users. And even worse, we learned those 20% didn’t behave the same as the remaining 80%. From later analysis and user research, we came to believe that users on the oldest browsers also shopped differently, for different cars. They were on average more price sensitive and benefitted more from that nudge.
We’d depended on a distorted sample of the population. We went through all the effort to run a test, and a technical shortcut we took meant that we didn’t measure the results accurately. And we made an ill-informed decision. Because we thought sampling was easy.
Which brings us to the third way I’ll teach you how to do data-driven decision making wrong. Look only at one metric. If there’s one thing we know in life, it’s that
If a little bit is good, a lot is great. Anything worth doing is worth overdoing. I mean, there’s never a downside to that, is there?
The first story I want to share about looking only at one metric is called “Indeed has a heart,” and it’s about a test we ran in our mobile app.
As jobseekers explore available jobs, they have the option to Save a job so they can easily come back to it later. We decided to test changing the icon associated with a Save from a star to a heart. We did this on the job details page,
And on the search results page.
So, were hearts better than stars?
They were! We observed a 16% increase in Saves on the search results page.
Now, everyone loves hearts! We rolled our test out 100%. But why stop there? The obvious thing to do is
To have hearts everywhere!
Stars on your Amazon reviews?
Nope! Hearts now.
We sent our test results to Google, and in the next version of Gmail the Starred folder will be replaced with Hearted!
And we’ve got a bill in front of the new state legislature. We’re all gonna live and work in the Lone Heart State!
[sigh] Not so fast. Changing the stars to hearts improved the one metric we were looking at - usage of the “Save this job” feature, but
Did Hearts help people get jobs?
Sadly, no. There was no discernible impact on job seeker success. When we analyzed longer-term behavior of jobseekers, there was no evidence of an improvement in the primary metrics -- clicks, applies, hires. Which is unfortunate, because that’s our goal, not
To help people heart jobs. What we had done was to focus only on one metric.
If you really want to do evidence-based decision making wrong, you should make sure you look only at one metric in situations beyond your A/B tests. This anti-lesson can do damage all across your company.
For example, at Indeed, we have a talented client services team that works with our customers to keep them engaged and highlight the value they’re receiving. Growing revenue from existing customers is clearly important, and we had a hypothesis that if we had a team focused only on that, we could be more successful.
So, we formed a dedicated “upsell team” and measured their results on a dashboard.
What we looked for was an upsell contact with a customer followed by an increase in that customer’s spend; when that happened, we credited the rep for the increase on the dashboard. This was also tied to a bonus program. So we started off, and
the dashboard told us it was working! Reported upsells on the dashboard showed lots of wins, tens of thousands of dollars.
But when we stepped back, revenue for the total pool of accounts wasn’t increasing.
As it turned out, not every contact between a rep and a customer results in an increase in spend. Our naive dashboard looked only at one metric - the positive outcomes.
But in reality, some are neutral and some are negative. And so it didn’t measure the right result. In fact, when you’re showing people a metric about their performance,
What you measure is what you motivate. In talking to the reps, because our dashboard only looked at the positive outcomes, they were less interested in contacting customers who were planning to lower their spend. The incentives were only about getting to an increase, nothing else mattered. So we made a change.
We redefined success to include all the outcomes, updated the dashboard and continued the experiment of the upsell team. After that one change, we saw more diverse interactions, and better results!
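To make the difference concrete, here is a toy sketch of the two metric definitions (the spend changes are made-up numbers, not real account data):

```python
# Made-up spend changes observed after upsell contacts with six customers.
spend_change_after_contact = [+5000, -3000, 0, +8000, -6000, +2000]

# Naive dashboard: credit only the positive outcomes.
naive_dashboard = sum(x for x in spend_change_after_contact if x > 0)

# Corrected metric: count every outcome, good, neutral, or bad.
actual_impact = sum(spend_change_after_contact)

print(f"dashboard says: ${naive_dashboard:,}")          # $15,000 in "wins"
print(f"pool actually changed by: ${actual_impact:,}")  # only $6,000
```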
The Upsell Team’s revenue increased by 200%, and we decided to continue the experiment and grow the team.
So we saw two examples there about how looking only at one metric, especially when it’s an easily-computed feature metric or maybe the first metric you thought of, is a great way to do evidence-based decision making wrong. Now, that anti-lesson has a flip-side, too --
Caveats: not an A/B test, lots of confounding factors, small sample size, team got better at their job, grain of salt, etc. But we can also directly observe the actors in this story, so we focus on how the metric affected behavior.
Because another secret to making bad data-driven decisions is to look at all the metrics. For this anti-lesson, we’ll return to Indeed’s mobile app.
We were comparing our mobile app to other companies’ apps and noticed a growing adoption of a particular way to indicate a menu. They were using what’s now popularly known as the “hamburger menu”. One of our product managers stole the idea...
(pause) And we decided to test a hamburger menu to improve Indeed’s mobile app.
It’s better for them, is it better for us? Let’s look at the results.
[read through list, growing more confused]
<click> at Logins
(pause) What we realized was...
We didn’t really know what we wanted. We didn’t start our test with a goal in mind for what the hamburger menu was supposed to do. So when the metrics came back with conflicting answers, we couldn’t know if the change was any good.
There was too much noise from too many metrics. We ended up leaving this test running for a looong time hoping the right decision would become clear. It didn’t. We had lots of discussions and email threads and meetings where “seriously we need to make a decision about the hamburger test.” In the end, we turned it off, so there’s no hamburger in the Indeed mobile app.
In this case, by not starting with a clear goal, and by looking at all the metrics, we spent a lot of time and energy and failed at making a good evidence-based decision.
Tom: Now I’d like to introduce my colleague Ketan who will teach us about even more exciting ways to make bad decisions. Ketan?
Who’s got time for rigorous analysis? Just give me an Excel spreadsheet.