19. Year in Review
Steps to make the email
– Collect job changers
– Figure out who is connected to them
– Rank job changes
STRATA NY 2012
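The first two steps above can be sketched in plain Python (a minimal in-memory sketch; the position records, connection pairs, and date window are hypothetical stand-ins for the real datasets):

```python
from datetime import date

# Hypothetical stand-ins for the real position and connection datasets.
positions = [
    {"member_id": 1, "start_date": date(2011, 3, 1)},    # changed jobs in 2011
    {"member_id": 2, "start_date": date(2009, 6, 1)},    # did not
    {"member_id": 3, "start_date": date(2011, 11, 15)},  # changed jobs in 2011
]
# (source, dest) pairs; bidirectional, so both directions appear.
connections = [(10, 1), (1, 10), (10, 3), (3, 10), (11, 1), (1, 11)]

low, high = date(2011, 1, 1), date(2011, 12, 31)

# Step 1: collect job changers (a position started inside the window).
changers = {p["member_id"] for p in positions if low <= p["start_date"] <= high}

# Step 2: figure out who is connected to them -> (recipient, changer) pairs.
pairs = sorted({(src, dst) for (src, dst) in connections if dst in changers})

print(changers)  # {1, 3}
print(pairs)     # [(10, 1), (10, 3), (11, 1)]
```

The real pipeline does the same filter-then-join, only as Pig relations over the full member and connection data.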
20. Example: Year in Review
memberPosition = LOAD '$latest_positions' USING BinaryJSON;

memberWithPositionsChangedLastYear = FOREACH (
    FILTER memberPosition BY ((start_date >= $start_date_low) AND
                              (start_date <= $start_date_high))
) GENERATE member_id, start_date, end_date;

allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;

allConnectionsWithChange_nondistinct = FOREACH (
    JOIN memberWithPositionsChangedLastYear BY member_id,
         allConnections BY dest
) GENERATE allConnections::source AS source,
           allConnections::dest AS dest;

allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;

pictures = FOREACH (
    FILTER memberinfowpics BY
        ((cropped_picture_id IS NOT NULL) AND
         ((member_picture_privacy == 'N') OR
          (member_picture_privacy == 'E')))
) GENERATE member_id, cropped_picture_id,
           first_name AS dest_first_name, last_name AS dest_last_name;

resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;

connectionsWithChangeWithPic = FOREACH resultPic GENERATE
    allConnectionsWithChange::source AS source_id,
    allConnectionsWithChange::dest AS member_id,
    pictures::cropped_picture_id AS pic_id,
    pictures::dest_first_name AS dest_first_name,
    pictures::dest_last_name AS dest_last_name;

joinResult = JOIN connectionsWithChangeWithPic BY source_id,
                  memberinfowpics BY member_id;

withName = FOREACH joinResult GENERATE
    connectionsWithChangeWithPic::source_id AS source_id,
    connectionsWithChangeWithPic::member_id AS member_id,
    connectionsWithChangeWithPic::dest_first_name AS first_name,
    connectionsWithChangeWithPic::dest_last_name AS last_name,
    connectionsWithChangeWithPic::pic_id AS pic_id,
    memberinfowpics::first_name AS firstName,
    memberinfowpics::last_name AS lastName,
    memberinfowpics::gmt_offset AS gmt_offset,
    memberinfowpics::email_locale AS email_locale,
    memberinfowpics::email_address AS email_address;

resultGroup0 = GROUP withName BY (source_id, firstName, lastName,
                                  email_address, email_locale, gmt_offset);

-- get the count of results per recipient
resultGroupCount = FOREACH resultGroup0 GENERATE group,
    withName AS toomany, COUNT_STAR(withName) AS num_results;

resultGroupPre = FILTER resultGroupCount BY num_results > 2;

resultGroup = FOREACH resultGroupPre {
    withName = LIMIT toomany 64;
    GENERATE group, withName, num_results;
};

x_in_review_pre_out = FOREACH resultGroup GENERATE
    FLATTEN(group) AS (source_id, firstName, lastName,
                       email_address, email_locale, gmt_offset),
    withName.(member_id, pic_id, first_name, last_name) AS jobChanger,
    '2011' AS changeYear:chararray,
    num_results AS num_results;

x_in_review = FOREACH x_in_review_pre_out GENERATE
    source_id AS recipientID, gmt_offset AS gmtOffset,
    firstName AS first_name, lastName AS last_name,
    email_address, email_locale,
    TOTUPLE(changeYear, source_id, firstName, lastName,
            num_results, jobChanger) AS body;

rmf $xir;
STORE x_in_review INTO '$xir' USING BinaryJSON('recipientID');
Today, Sam and I are going to talk about how we use Hadoop to build products with data. Sam and I are both engineers at LinkedIn. My title is trendier than Sam’s, but don’t hold that against me. Or him. We both know how to build products with data. Both of us have talked about a lot of the products in this presentation before, but we haven’t focused as much on infrastructure.
We’d like to start by telling you a little bit about LinkedIn (and LinkedIn’s data). LinkedIn is the leading web site for professional networking. We currently have over 175 million members, but we’re still growing. That means that our data is growing too.
Each member has a profile. We know a lot about our members (start scrolling animation)… We know their current position, past positions, schools they attended, skills they have, skills that other people have endorsed them for, people and companies they follow, and companies they work for. We think this data is very interesting. We can use this data to help members connect to each other, and make them more productive. That’s actually LinkedIn’s mission statement… I can’t believe I recited our mission statement in a public presentation. Anyway, let’s take a look at how we use this data. <Flip to next slide to show how we use this data>
When a user logs into LinkedIn, they see a page like this. Almost every part of our home page has been touched by data science. The home page is purely driven by data: news articles, the news stream, PYMK, display ads, JYMBII, WVMP, GYML, etc. And by the way, we also learn what our members like and don’t like. We have over 130 million visitors to our site every quarter, and deliver over 9.3 billion web pages. (That’s even more data.)
So, here’s the point of today’s talk. At LinkedIn, we have a lot of data.
We store our data in Hadoop, and we want to build products using that data on Hadoop.
So here’s the big challenge: how do we make it easy for our engineers, product managers, data scientists, analysts, web devs, receptionists, whoever, to build products from our data? That’s what we’re going to talk about today. We’ll tell you about some of the products that we’ve built from data, how we built these products, and why we built infrastructure to support these products.
Let’s start by telling you a little more about some of the products that we have built with Hadoop, then we’ll tell you more about two of those products and the challenges that we faced productionizing them.
More examples: Groups You Might Like, the network updates digest email, “People who viewed this profile also viewed”, etc.
Let’s start with a project that I worked on at LinkedIn that I think illustrates the power of building products with data. Ask audience “who got this email?” We sent this to every LinkedIn member who had a lot of job changes in their network. [now read the numbers] Later in this presentation, we’ll tell you how we built this email from our data. I’ll even show you the code.
Here is another example of a product that I’ve worked on. In the network stream on our home page, we’ve started sharing trends and patterns in data. We also tell you things that you might not know about your network. For example, it turns out that 21 of my former coworkers are now working at Google.
One of the most famous examples of data products at LinkedIn is People You May Know. PYMK was invented at LinkedIn. The idea of PYMK was to help you discover current coworkers, former coworkers, and friends on LinkedIn to help make your experience better. (This is not an actual screenshot from my account; I’m already connected to Sam.) We used Hadoop to build and scale PYMK. (We’ll also tell you more about how we built PYMK later in this presentation.)
Has anyone in the room seen a screen like this on LinkedIn? Has anyone endorsed someone else? Has anyone found it hard to stop endorsing people? We also used Hadoop to build our suggested endorsements.
We love using Hadoop for building data products. There are so many things that are great about Hadoop. (Our user quotas are in TB.) Hundreds of nodes. Great tools for working with data, like Pig, Hive, and Crunch. Shared infrastructure: hundreds of employees have accounts on Hadoop and run jobs (engineers, data scientists, product managers, even designers and finance people).
One of the greatest advantages of Hadoop is that it empowers small teams to build great things. Here are a few examples. Most of the items on this list are big, important features: lots of page views, lots of new connections, lots of great content. The marginal cost of building more products is low.
Let’s talk a little more about the Year in Review email. This is actually a pretty straightforward message in theory. Here’s how we do it. (Read slide) There isn’t any machine learning, or fancy algorithms. It’s just grouping and ranking. And in practice, it’s not that hard.
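The grouping-and-ranking step can be sketched in plain Python, mirroring the thresholds the script uses (more than 2 results per recipient, capped at 64); the (recipient, changer) pairs are hypothetical stand-ins for the joined data:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (recipient_id, job_changer_id) pairs after the join step.
pairs = [(10, c) for c in range(5)] + [(11, 1)] + [(12, c) for c in range(100)]

# Group the changed connections by recipient (groupby needs sorted input).
pairs.sort(key=itemgetter(0))
digests = {}
for recipient, grp in groupby(pairs, key=itemgetter(0)):
    changers = [changer for _, changer in grp]
    if len(changers) > 2:  # same "num_results > 2" threshold as the script
        digests[recipient] = {
            "num_results": len(changers),
            "job_changers": changers[:64],  # LIMIT 64, as in the script
        }

print(sorted(digests))  # [10, 12] -- recipient 11 had too few changes
```

The Pig version does the same thing with GROUP, COUNT_STAR, FILTER, and a nested LIMIT, just distributed over the full dataset.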
This is the code to compose this message. It’s about 60 lines of code, and most of that code involved renaming things. This is why we love Hadoop: we can do something simple without much code… Great! We’re done. We write this code and the message is done.
Well, not so fast… here’s the challenge. We know how to do the computation to make this message. But every message requires a lot of data: we potentially look at hundreds of MB of data before generating each message, and in the end the messages are up to 1 MB in size. How do we get all the raw data that we need to make this message? How do we keep it up to date? How do we run this job frequently so the results stay current? How do we get these results out of Hadoop, turn them into email messages, and send them out? Let’s consider another problem.
One of the most famous examples of data products is People You May Know. PYMK was invented at LinkedIn. The idea of PYMK was to help you discover current coworkers, former coworkers, and friends on LinkedIn to help make your experience better. (This is not an actual screenshot from my account; I’m already connected to Sam.) We used Hadoop to build and scale PYMK.
- PYMK started simpler, grew more complicated
- Complicated workflow, required tools and infrastructure to do this --> we needed it in place
- Thrown over the wall from data science to productionization
- No one dedicated to productionization
- Provided “as a service” to do so
- Don’t want to beg for data
- Others: Scribe, Flume