When it comes to big data insights, how do you know you’re asking the right questions? Hiring data scientists is a good start – we’re seeing their growth both on LinkedIn and at LinkedIn. But even data scientists are not immune from the myriad of hidden pitfalls that keep your key insights out of sight.
Drawing from a deceptively simple exercise that I’ve used to haze dozens of data scientists on their first day, I will discuss the good, the bad and the ugly lessons we’ve learned about asking the right questions, denominators and being a data skeptic.
The Question: hottest industries. Hotness(X) = part-year-over-part-year growth of normalized net job starters, minus noise, in a big enough industry X on LinkedIn.
Lies, Damned Lies and the Data Scientist
By: Monica Rogati – data scientist at LinkedIn.
Data lies – but it lies because we let it. So let’s not let it. Let’s ask the right questions.
I’m going to talk about how to ask the right question by showing you a deceptively simple exercise that LinkedIn data scientists go through. The question is, what are the hottest industries this year, according to the LinkedIn data? There’s one small detail I’m not specifying – what’s the definition of hot? That definition plays a major part in asking the right questions.
So let’s take a look at the data. On LinkedIn, we have over 120M people, their industry, and the year they joined.
… so the first attempt at defining “hot” might be – let’s look at the YOY growth of an industry & look at the top 3. That idea is not so hot – at best, it’s only an indicator of LinkedIn’s penetration in an industry; at worst, it’s actually a contrarian indicator because it shows people who might want to transition OUT of that industry.
The next piece of data we can look at is the individual positions people list on their profiles – they have a start date and an industry, so you can see what industry people are flowing into in a given year. Much better.
You run the numbers… Wait a second!! Is consulting really the hottest industry? Hmm.. I think the data is trying to lie to us. We need to take into account churn & promotions – and we do that by looking at the NET inflow of people into an industry: people coming IN minus people coming out.
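As a rough illustration of net inflow, here is a minimal sketch. The record layout (member, industry left, industry joined, year) and the sample rows are invented for the example; they are not LinkedIn’s actual schema or data.

```python
from collections import Counter

# Hypothetical records: (member_id, industry_left, industry_joined, year).
# None means the person had no previous (or no next) industry.
moves = [
    ("a", None, "Consulting", 2011),
    ("b", "Consulting", "Internet", 2011),
    ("c", "Consulting", "Consulting", 2011),  # promotion or job change inside the industry
    ("d", "Banking", "Internet", 2011),
]

def net_inflow(moves, year):
    """People coming IN minus people going OUT, per industry, for one year."""
    net = Counter()
    for _member, left, joined, move_year in moves:
        if move_year != year or left == joined:
            continue  # moves within the same industry add nothing to the net flow
        if joined is not None:
            net[joined] += 1
        if left is not None:
            net[left] -= 1
    return dict(net)

result = net_inflow(moves, 2011)
print(result)  # {'Consulting': 0, 'Internet': 2, 'Banking': -1}
```

Counting gross starts alone would give Consulting two starters here (members a and c), tied with Internet; the net count shows Consulting flat.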
There, that should be much better. The next external factor that might come into play is seasonality. If we’re doing this analysis in the summer, it looks like there are a lot fewer teachers and accountants, and a lot more summer interns, compared to last year! So ideally, we want to compare the same time period in both years to take out seasonal effects.
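One way to compare like with like is to count job starts only up to the same calendar day in both years. The sketch below assumes a plain list of start dates for one industry; the function names and sample dates are made up for illustration.

```python
from datetime import date

def starts_in_window(start_dates, year, end_month, end_day):
    """Count job starts from Jan 1 of `year` through the cutoff day, inclusive."""
    lo, hi = date(year, 1, 1), date(year, end_month, end_day)
    return sum(lo <= d <= hi for d in start_dates)

def part_year_growth(start_dates, this_year, end_month, end_day):
    """Growth of this year's partial period over the same period last year."""
    current = starts_in_window(start_dates, this_year, end_month, end_day)
    previous = starts_in_window(start_dates, this_year - 1, end_month, end_day)
    if previous == 0:
        return None  # growth is undefined off a zero base
    return (current - previous) / previous

# Invented start dates for a single industry.
starts = [
    date(2010, 2, 1), date(2010, 3, 15), date(2010, 9, 1), date(2010, 10, 1),
    date(2011, 2, 1), date(2011, 3, 15), date(2011, 5, 20),
]

growth = part_year_growth(starts, 2011, 6, 30)
print(growth)  # 0.5
```

Comparing the three starts seen so far in 2011 against all four starts in 2010 would suggest a decline; restricting both years to January through June shows growth instead.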
OK … done, let’s take another look. Are the Mining & Metals and Dairy industries really the hottest industries this year? Or are they just very small industries on LinkedIn, where it’s much easier to grow off of a small base? You can get around this by making separate categories for industries of different sizes, ignoring industries below a certain size, or somehow accounting for that effect.
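The simplest of those options, ignoring industries below a size cutoff, might look like the sketch below. The growth rates, member counts and the cutoff of 10,000 are all invented numbers, chosen only to show the effect.

```python
def hottest(growth_by_industry, size_by_industry, min_size, top_n=3):
    """Rank industries by growth, skipping those too small to trust."""
    eligible = {
        industry: growth
        for industry, growth in growth_by_industry.items()
        if size_by_industry.get(industry, 0) >= min_size
    }
    return sorted(eligible, key=eligible.get, reverse=True)[:top_n]

# Invented figures for illustration.
growth = {"Dairy": 0.9, "Mining & Metals": 0.8, "Internet": 0.3, "Real Estate": 0.1}
size = {"Dairy": 200, "Mining & Metals": 500, "Internet": 100_000, "Real Estate": 50_000}

unfiltered = hottest(growth, size, min_size=0)
filtered = hottest(growth, size, min_size=10_000)
print(unfiltered)  # ['Dairy', 'Mining & Metals', 'Internet']
print(filtered)    # ['Internet', 'Real Estate']
```

The cutoff itself is a judgment call; a hard threshold is easy to explain, while size buckets keep small industries visible in their own category.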
Now, we’re done: got seasonality, thresholding, net inflow – this has to be the right question. Well, almost. We assumed the data is clean. And it’s not.
For example, there are a lot of fake accounts that we’ve immediately closed, but they’re still there in the database. If you don’t check for that flag, you have this army of Darth Vaders boosting up the defense and space industry.
Including the tail of a distribution might not make sense – do we want people who have 200 positions listed on their profile? They might throw off your data.
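Both checks, the closed-account flag and the long tail of positions, can be applied as a filter before any counting. In this sketch the field names (`closed`, `positions`) and the cap of 50 positions are assumptions made for the example, not the real profile schema or a recommended cutoff.

```python
def clean_profiles(profiles, max_positions=50):
    """Drop closed accounts and profiles with an implausible number of positions."""
    return [
        p for p in profiles
        if not p["closed"] and len(p["positions"]) <= max_positions
    ]

# Invented sample profiles.
profiles = [
    {"id": "alice", "closed": False, "positions": ["Analyst", "Manager", "Director"]},
    {"id": "vader", "closed": True, "positions": ["Sith Lord"]},
    {"id": "collector", "closed": False, "positions": ["Job"] * 200},
]

kept = clean_profiles(profiles)
print([p["id"] for p in kept])  # ['alice']
```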
We need to put the data under a microscope and understand what each flag, category and date means. OK, now we’ve accounted for external factors and taken out the noise. Are we ready to see some industry growth charts?!
Hm, ok, we plot the YOY growth and we get something that looks like this: a spaghetti chart that mostly shows industries moving in unison – an effect of the broader economic conditions (see that dip in 2001 and 2009). If we want to actually focus on differences between industries instead of what they have in common, we need to scale or normalize those numbers – for example, by dividing the net number of people coming into an industry by the TOTAL number of people who started jobs that year. This also has the nice property that it accounts for website growth.
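The normalization described here is a single division per industry per year. A minimal sketch, with invented counts and a hypothetical function name:

```python
def normalized_net_inflow(net_by_industry, total_starters):
    """Divide each industry's net inflow by all job starters in that year."""
    if total_starters <= 0:
        raise ValueError("total_starters must be positive")
    return {
        industry: net / total_starters
        for industry, net in net_by_industry.items()
    }

# Invented counts: the site triples in size between the two years.
early = normalized_net_inflow({"Internet": 100}, total_starters=1_000)
late = normalized_net_inflow({"Internet": 150}, total_starters=3_000)
print(early["Internet"], late["Internet"])  # 0.1 0.05
```

In raw counts the industry grew from 100 to 150 net starters, but as a share of everyone who started a job it halved. That is why the normalized number is not fooled by growth of the website itself.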
OK, this MUST be it, right? The data stopped lying and we can actually see some real trends. Wild swings around 2000 for Internet and telecommunications, and there’s definitely something going on w/ real estate there. It still looks like spaghetti, it’s hard to understand and explain, and it’s not exactly telling a story. To tell the story, we need to make some hard decisions and pick only a couple of those lines, clean things up, and let that story shine.
Nice! I’ve picked 3 industries – when the line is above zero, that industry is growing; below zero, it’s shrinking. So the Internet is taking off in ’94, booming in ’99, then there’s a huge dip in 2001. Real Estate is growing steadily, it’s picking up in 2002, and it’s sinking in 2008 – and so are financial services. This is all coming from aggregating data on people’s public LinkedIn profiles! This is the kind of story that gets people excited about the insights in the LinkedIn data – but it wouldn’t have been possible if we hadn’t asked the right questions.
So let’s have some fun with the method I’ve just described – let’s take a look at the growth of analytics and data science jobs over the past few years. Whoa! That rapid growth in the past 3 years is even more impressive when we realize that this is all properly normalized, not just the count of people with those titles on LinkedIn.
So next time you look at your data, don’t let it lie to you – account for external factors, take out the noise, and ask the right questions.