The "Big Data" Ecosystem at LinkedIn
SIGMOD 2013
Roshan Sumbaly, Jay Kreps, & Sam Shah
June 2013
LinkedIn: the professional profile of record
©2012 LinkedIn Corporation. All Rights Reserved. 2
225M members
225M member profiles
Applications
Application examples
• People You May Know (2 people)
• Year In Review Email (1 person, 1 month)
• Skills and Endorsements (2 people)
• Network Updates Digest (1 person, 3 months)
• Who's Viewed My Profile (2 people)
• Collaborative Filtering (1 person)
• Related Searches (1 person, 3 months)
• and more…
Skill sets
Rich Hadoop-based ecosystem
©2013 LinkedIn Corporation. All Rights Reserved. 6
“Last mile” problems
• Ingress
– Moving data from online to offline systems
• Workflow management
– Managing offline processes
• Egress
– Moving results from offline to online systems
  • Key/Value
  • Streams
  • OLAP
People You May Know
People You May Know – Workflow
• Inputs: connection stream, impression stream
• Perform triangle closing for all members
– Example (figure): of Ethan, Jacob, and William, two pairs are already connected; triangle closing recommends the third pair
• Rank by discounting previously shown recommendations
• Push recommendations to online service
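The first two workflow steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the production job: the function names `triangle_closing` and `rank`, the in-memory inputs, and the geometric `discount` applied per impression are all hypothetical choices made for the example.

```python
from collections import defaultdict
from itertools import combinations


def triangle_closing(connections):
    """Count common connections for every pair that is not yet connected.

    connections: iterable of (member_a, member_b) undirected edges.
    Returns {member: {candidate: number_of_common_connections}}.
    """
    neighbors = defaultdict(set)
    for a, b in connections:
        neighbors[a].add(b)
        neighbors[b].add(a)

    scores = defaultdict(lambda: defaultdict(int))
    # Every pair of neighbors of the same member forms an open triangle
    # unless the two are already connected to each other.
    for mutual in list(neighbors.values()):
        for a, b in combinations(sorted(mutual), 2):
            if b not in neighbors[a]:
                scores[a][b] += 1
                scores[b][a] += 1
    return scores


def rank(scores, impressions, discount=0.5):
    """Order candidates by score, discounted once per earlier impression.

    impressions: {(member, candidate): times_already_shown} (assumed shape).
    """
    ranked = {}
    for member, candidates in scores.items():
        adjusted = {
            cand: score * discount ** impressions.get((member, cand), 0)
            for cand, score in candidates.items()
        }
        # Highest adjusted score first; ties broken by candidate id.
        ranked[member] = sorted(adjusted, key=lambda c: (-adjusted[c], c))
    return ranked
```

The third step, pushing the ranked lists to the online service, is the egress problem covered later in the deck.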
Ingress - O(n²) data integration complexity
• Point to point
• Fragile, delayed and potentially lossy
• Non-standardized
Ingress - O(n) data integration
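The difference between the two slides is a matter of counting pipelines. The sketch below assumes, purely for illustration, that every system may need to send data to every other system; the function names are hypothetical.

```python
def point_to_point_pipelines(n_systems):
    """Each ordered (source, destination) pair needs its own pipeline: O(n^2)."""
    return n_systems * (n_systems - 1)


def central_log_pipelines(n_systems):
    """Each system has at most one producer link and one consumer link to a
    shared central log: O(n)."""
    return 2 * n_systems
```

Under these assumptions, ten systems need 90 point-to-point pipelines but at most 20 connections to a central log.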
Ingress – Kafka
• Distributed and elastic
– Multi-broker system
• Categorized topics
– “PeopleYouMayKnowTopic”
– “ConnectionUpdateTopic”
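To make the idea of categorized topics concrete, here is a toy, single-process stand-in for a topic-based append-only log. It is not the Kafka client API: the class `TopicLog`, its methods, and the message fields are hypothetical and exist only to show producers appending to a named topic while consumers read from an offset of their choosing.

```python
class TopicLog:
    """In-memory sketch of a topic-partitioned, append-only message log."""

    def __init__(self):
        self._topics = {}

    def publish(self, topic, message):
        """Append a message to a topic and return its offset."""
        log = self._topics.setdefault(topic, [])
        log.append(message)
        return len(log) - 1

    def read(self, topic, offset=0):
        """Return all messages in a topic starting at the given offset."""
        return list(self._topics.get(topic, [])[offset:])


log = TopicLog()
log.publish("ConnectionUpdateTopic", {"source": 1213, "dest": 1734})
log.publish("ConnectionUpdateTopic", {"source": 1213, "dest": 1523})
new_events = log.read("ConnectionUpdateTopic", offset=1)
```

Because each consumer tracks its own offset, an offline consumer can read the same topic as an online one without coordinating with it.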
Ingress
• Standardized schemas
– Avro
– Central repository
– Programmatic compatibility
• Audited
• ETL to Hadoop
[Figure: People You May Know service, Kafka brokers, Kafka brokers (dev), Hadoop; topic: PeopleYouMayKnowTopic]
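A programmatic compatibility check can be sketched as follows. This is a deliberately simplified version of one Avro schema-resolution rule (a field present in the reader's schema but absent from the writer's must declare a default); it ignores type promotion, aliases, and nested records, and the function name and example schemas are hypothetical.

```python
def can_read(reader_schema, writer_schema):
    """Return True if data written with writer_schema can be read with
    reader_schema, checking only the missing-field/default rule."""
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    for field in reader_schema["fields"]:
        if field["name"] not in writer_fields and "default" not in field:
            return False
    return True


old_schema = {"type": "record", "name": "ConnectionUpdate",
              "fields": [{"name": "source", "type": "long"},
                         {"name": "dest", "type": "long"}]}
new_schema = {"type": "record", "name": "ConnectionUpdate",
              "fields": old_schema["fields"] +
                        [{"name": "origin", "type": "string", "default": ""}]}
compatible = can_read(new_schema, old_schema)
```

Running a check like this when a schema is registered is one way a central repository could reject changes that would break downstream readers.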
People You May Know – Workflow (in reality)
Workflow Management - Azkaban
• Dependency management
– Historical logs
• Diverse job types
– Pig, Hive, Java
• Scheduling
• Monitoring
• Visualization
• Configuration
• Retry/restart on failure
• Resource locking
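Dependency management and retry on failure can be illustrated with the standard library alone. The sketch below is not Azkaban's job format or API: `run_flow`, the job names, and the `max_attempts` parameter are hypothetical, and jobs are plain Python callables.

```python
from graphlib import TopologicalSorter


def run_flow(dependencies, jobs, max_attempts=2):
    """Run jobs in dependency order, retrying each failed job.

    dependencies: {job_name: set of job names it depends on}
    jobs: {job_name: zero-argument callable}
    Returns the list of job names in the order they completed.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                jobs[name]()
                completed.append(name)
                break
            except Exception:
                # Give up and surface the error after the last attempt.
                if attempt == max_attempts:
                    raise
    return completed
```

For the People You May Know flow, the dependencies might be declared as `{"rank": {"triangle_closing"}, "push": {"rank"}}`, so ranking never starts before triangle closing has finished.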
People You May Know – Workflow
Output of the workflow, keyed by member id:
Member Id 1213 =>
[ Recommended member id 1734,
  Recommended member id 1523,
  …
  Recommended member id 6332 ]
Egress – Key/Value
• Voldemort
– Based on Amazon's Dynamo
• Distributed and elastic
• Horizontally scalable
• Bulk load pipeline from Hadoop
• Simple to use
store results into 'url' using KeyValue('member_id')
[Figure: Hadoop batch-loads results into Voldemort; the People You May Know service calls getRecommendations(member id)]
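The batch-load-then-serve pattern can be sketched with an in-memory store. This is not Voldemort's API: the class `ReadOnlyStore`, its methods, and the sample member ids are hypothetical, and the swap of a fully built data set stands in for whatever mechanism the real bulk load pipeline uses.

```python
class ReadOnlyStore:
    """Toy key/value store that is replaced wholesale by each batch load."""

    def __init__(self):
        self._current = {}
        self.version = 0

    def bulk_load(self, records):
        """Build the new data set off to the side, then swap it in."""
        staged = {key: list(values) for key, values in records}
        self._current = staged
        self.version += 1

    def get_recommendations(self, member_id):
        """Online read path: a single key lookup."""
        return self._current.get(member_id, [])


store = ReadOnlyStore()
store.bulk_load([(1213, [1734, 1523, 6332])])
recommendations = store.get_recommendations(1213)
```

Keeping the online path to a single lookup is what lets the offline job change freely without touching the serving code.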
People You May Know - Summary
[Figure: front end, People You May Know service, Kafka brokers, Kafka brokers (mirror), Hadoop, Voldemort; topic: PeopleYouMayKnowTopic]
Year In Review Email
-- Members whose position started inside the review window
memberPosition = LOAD '$latest_positions' USING BinaryJSON;
memberWithPositionsChangedLastYear = FOREACH (
    FILTER memberPosition BY ((start_date >= $start_date_low) AND
                              (start_date <= $start_date_high))
) GENERATE member_id, start_date, end_date;

-- Connections (source, dest) whose dest changed jobs
allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;
allConnectionsWithChange_nondistinct = FOREACH (
    JOIN memberWithPositionsChangedLastYear BY member_id,
         allConnections BY dest
) GENERATE allConnections::source AS source,
           allConnections::dest AS dest;
allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

-- Keep job changers that have a picture with privacy setting 'N' or 'E'
memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;
pictures = FOREACH (
    FILTER memberinfowpics BY
        ((cropped_picture_id is not null) AND
         ((member_picture_privacy == 'N') OR
          (member_picture_privacy == 'E')))
) GENERATE member_id, cropped_picture_id,
           first_name as dest_first_name, last_name as dest_last_name;
resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;
connectionsWithChangeWithPic = FOREACH resultPic GENERATE
    allConnectionsWithChange::source AS source_id,
    allConnectionsWithChange::dest AS member_id,
    pictures::cropped_picture_id AS pic_id,
    pictures::dest_first_name AS dest_first_name,
    pictures::dest_last_name AS dest_last_name;

-- Attach the recipient's name, time zone, locale, and email address
joinResult = JOIN connectionsWithChangeWithPic BY source_id,
                  memberinfowpics BY member_id;
withName = FOREACH joinResult GENERATE
    connectionsWithChangeWithPic::source_id AS source_id,
    connectionsWithChangeWithPic::member_id AS member_id,
    connectionsWithChangeWithPic::dest_first_name as first_name,
    connectionsWithChangeWithPic::dest_last_name as last_name,
    connectionsWithChangeWithPic::pic_id AS pic_id,
    memberinfowpics::first_name AS firstName,
    memberinfowpics::last_name AS lastName,
    memberinfowpics::gmt_offset as gmt_offset,
    memberinfowpics::email_locale as email_locale,
    memberinfowpics::email_address as email_address;
resultGroup = GROUP withName BY (source_id, firstName, lastName,
                                 email_address, email_locale, gmt_offset);

-- Get the count of results per recipient
resultGroupCount = FOREACH resultGroup GENERATE group,
    withName as toomany, COUNT_STAR(withName) as num_results;

-- Keep recipients with more than 2 results; list at most 64 job changers each
resultGroupPre = filter resultGroupCount by num_results > 2;
resultGroup = FOREACH resultGroupPre {
    withName = LIMIT toomany 64;
    GENERATE group, withName, num_results;
}

-- Shape one record per recipient for the email body
x_in_review_pre_out = FOREACH resultGroup GENERATE
    FLATTEN(group) as (source_id, firstName, lastName,
                       email_address, email_locale, gmt_offset),
    withName.(member_id, pic_id, first_name, last_name) as jobChanger,
    '2013' as changeYear:chararray,
    num_results as num_results;
x_in_review = FOREACH x_in_review_pre_out GENERATE
    source_id as recipientID, gmt_offset as gmtOffset,
    firstName as first_name, lastName as last_name,
    email_address, email_locale,
    TOTUPLE(changeYear, source_id, firstName, lastName,
            num_results, jobChanger) as body;

rmf $xir;
STORE x_in_review INTO '$url' USING Kafka();
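For readers less familiar with Pig, the core of the script above can be mirrored on small in-memory data. This is a simplified sketch, not a translation: it omits the picture and privacy filter and the member metadata join, it sorts before truncating so the result is deterministic (Pig's LIMIT makes no ordering promise), and the function name, parameter names, and input shapes are hypothetical.

```python
from collections import defaultdict


def year_in_review(positions, connections, start_low, start_high,
                   min_results=3, max_listed=64):
    """Group job changers by the connection who should be emailed about them.

    positions: iterable of (member_id, start_date)
    connections: iterable of (source, dest) pairs
    Returns {recipient: {"num_results": int, "job_changers": [member_id, ...]}}.
    """
    # Members whose position started inside the review window.
    changed = {member for member, start in positions
               if start_low <= start <= start_high}

    # Join with connections and de-duplicate, grouping by recipient.
    by_recipient = defaultdict(set)
    for source, dest in connections:
        if dest in changed:
            by_recipient[source].add(dest)

    emails = {}
    for recipient, changers in by_recipient.items():
        # The Pig script keeps groups with num_results > 2.
        if len(changers) >= min_results:
            emails[recipient] = {
                "num_results": len(changers),
                "job_changers": sorted(changers)[:max_listed],
            }
    return emails
```

The count is taken before truncation, as in the script, so an email can report the full number of job changers while listing at most 64 of them.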
Year In Review Email – Workflow
• Find users that have changed jobs
• Join with connections and metadata (pictures)
• Group by connections of these users
• Push content to email service
Egress - Streams
• Service acts as consumer
• “EmailContentTopic”
store emails into 'url' using Stream("topic=x")
[Figure: Hadoop, Kafka brokers, Kafka brokers (mirror), email service; topics: EmailContentTopic, EmailSentTopic]
Conclusion
• Hadoop: simple programmatic model, rich developer ecosystem
• Primitives for
– Ingress
  • Structured, complete data available
  • Automatically handles data evolution
– Workflow management
  • Run and operate production processes
– Egress
  • 1-line command for exporting data
  • Horizontally scalable, little need for capacity planning
• Empowers data scientists to focus on new product ideas, not infrastructure
Future work: models of computation
• Alternating Direction Method of Multipliers (ADMM)
• Distributed Conjugate Gradient Descent (DCGD)
• Distributed L-BFGS
• Bayesian Distributed Learning (BDL)
Graphs
Distributed learning
Near-line processing
data.linkedin.com

Dev Dives: Streamline document processing with UiPath Studio Web
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

The "Big Data" Ecosystem at LinkedIn

  • 1. The "Big Data" Ecosystem at LinkedIn SIGMOD 2013 Roshan Sumbaly, Jay Kreps, & Sam Shah June 2013
  • 2. LinkedIn: the professional profile of record. 225M members, 225M member profiles. ©2012 LinkedIn Corporation. All Rights Reserved.
  • 4. Application examples
      • People You May Know (2 people)
      • Year In Review Email (1 person, 1 month)
      • Skills and Endorsements (2 people)
      • Network Updates Digest (1 person, 3 months)
      • Who's Viewed My Profile (2 people)
      • Collaborative Filtering (1 person)
      • Related Searches (1 person, 3 months)
      • and more…
  • 6. Rich Hadoop-based ecosystem. ©2013 LinkedIn Corporation. All Rights Reserved.
  • 7. “Last mile” problems
      • Ingress
          – Moving data from online to offline system
      • Workflow management
          – Managing offline processes
      • Egress
          – Moving results from offline to online systems
          – Key/Value
          – Streams
          – OLAP
  • 8. Application examples (same list as slide 4)
  • 10. People You May Know – Workflow
      • Inputs: connection stream, impression stream
      • Perform triangle closing for all members (figure: Ethan, Jacob, William; two "connected" edges, with triangle closing supplying the third)
      • Rank by discounting previously shown recommendations
      • Push recommendations to online service
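The triangle-closing step can be illustrated with a small sketch. It assumes the figure's two "connected" edges are Ethan–Jacob and Jacob–William (the transcript does not say which pairs are joined), and the function name `triangle_closing` is hypothetical. It counts, for each unconnected pair, how many connections they share; the production job does this on Hadoop, not in memory.

```python
from collections import Counter
from itertools import combinations


def triangle_closing(connections):
    """Return {member: Counter({candidate: number_of_common_connections})}."""
    # Build an undirected adjacency map from the connection pairs.
    neighbors = {}
    for a, b in connections:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    candidates = {member: Counter() for member in neighbors}
    # Any two neighbors of `middle` that are not yet connected form an
    # open triangle; each shared neighbor adds one to the pair's count.
    for middle, adjacent in neighbors.items():
        for a, b in combinations(sorted(adjacent), 2):
            if b not in neighbors[a]:
                candidates[a][b] += 1
                candidates[b][a] += 1
    return candidates


# Assumed edges for the slide's three members.
edges = [("Ethan", "Jacob"), ("Jacob", "William")]
suggestions = triangle_closing(edges)
print(suggestions["Ethan"])  # Counter({'William': 1})
```

The count of common connections is only one plausible raw score; the ranking step that follows in the workflow adjusts it using impressions.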
  • 11. “Last mile” problems (same agenda as slide 7)
  • 12. Ingress - O(n²) data integration complexity
      • Point to point
      • Fragile, delayed and potentially lossy
      • Non-standardized
  • 13. Ingress - O(n) data integration
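The difference between the two slides comes down to simple counting. As a rough sketch, assume every source system must feed every destination system: point-to-point integration then needs one pipeline per (source, destination) pair, while a shared central log needs one connection per system. The function names are illustrative only.

```python
def point_to_point_pipelines(num_sources: int, num_destinations: int) -> int:
    # Every source needs its own custom pipeline to every destination.
    return num_sources * num_destinations


def central_log_pipelines(num_sources: int, num_destinations: int) -> int:
    # Every system connects exactly once, to the shared log.
    return num_sources + num_destinations


for n in (5, 10, 20):
    print(n, point_to_point_pipelines(n, n), central_log_pipelines(n, n))
# 5 25 10
# 10 100 20
# 20 400 40
```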
  • 14. Ingress – Kafka
      • Distributed and elastic
          – Multi-broker system
      • Categorized topics
          – “PeopleYouMayKnowTopic”
          – “ConnectionUpdateTopic”
  • 15. Ingress
      • Standardized schemas
          – Avro
          – Central repository
          – Programmatic compatibility
      • Audited
      • ETL to Hadoop
      • Diagram labels: People you may know service; Kafka brokers (dev); Kafka brokers; Hadoop; PeopleYouMayKnowTopic
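"Programmatic compatibility" can be pictured as a check that runs before a new schema version is accepted into the central repository. The sketch below is a simplified stand-in, not Avro's full schema-resolution logic: it only verifies that fields added in the new version carry a default (so records written with the old version stay readable) and that existing fields keep their type. It ignores Avro's type promotions, and the record and field names are hypothetical.

```python
def compatibility_problems(old_schema, new_schema):
    """List reasons the new schema could not read data written with the old one."""
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        previous = old_fields.get(field["name"])
        if previous is None:
            # A field unknown to old writers needs a default to fall back on.
            if "default" not in field:
                problems.append(f"new field '{field['name']}' has no default")
        elif previous["type"] != field["type"]:
            problems.append(f"field '{field['name']}' changed type")
    return problems


# Hypothetical event schema, written as Avro-style record definitions.
v1 = {"name": "Recommendation", "fields": [
    {"name": "memberId", "type": "long"},
    {"name": "suggestedMemberId", "type": "long"},
]}
v2 = {"name": "Recommendation", "fields": v1["fields"] + [
    {"name": "score", "type": "double", "default": 0.0},
]}
v3 = {"name": "Recommendation", "fields": v1["fields"] + [
    {"name": "score", "type": "double"},
]}

print(compatibility_problems(v1, v2))  # []
print(compatibility_problems(v1, v3))  # ["new field 'score' has no default"]
```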
  • 16. “Last mile” problems (same agenda as slide 7)
  • 17. People You May Know – Workflow
      • Inputs: connection stream, impression stream
      • Perform triangle closing for all members
      • Rank by discounting previously shown recommendations
      • Push recommendations to online service
  • 18. People You May Know – Workflow (in reality)
  • 19. Workflow Management - Azkaban
      • Dependency management
          – Historical logs
      • Diverse job types
          – Pig, Hive, Java
      • Scheduling
      • Monitoring
      • Visualization
      • Configuration
      • Retry/restart on failure
      • Resource locking
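Two of the features listed, dependency management and retry on failure, can be sketched in a few lines. This is not Azkaban's API or job format: `run_flow`, the job names, and the retry limit are all assumptions made for illustration. It uses the standard-library `graphlib` module (Python 3.9+) to order jobs so that each runs after its upstream dependencies.

```python
from graphlib import TopologicalSorter


def run_flow(jobs, dependencies, max_attempts=3):
    """Run jobs in dependency order, retrying each failed job up to max_attempts.

    jobs: {job_name: zero-argument callable}
    dependencies: {job_name: set of job names that must finish first}
    """
    order = list(TopologicalSorter(dependencies).static_order())
    attempts = {}
    for name in order:
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                jobs[name]()
                break
            except Exception:
                # Give up and surface the error once retries are exhausted.
                if attempt == max_attempts:
                    raise
    return order, attempts


calls = {"rank": 0}


def flaky_rank():
    # Fails on the first call only, to exercise the retry path.
    calls["rank"] += 1
    if calls["rank"] == 1:
        raise RuntimeError("transient failure")


jobs = {
    "triangle_closing": lambda: None,
    "rank": flaky_rank,
    "push_to_online": lambda: None,
}
dependencies = {
    "triangle_closing": set(),
    "rank": {"triangle_closing"},
    "push_to_online": {"rank"},
}

order, attempts = run_flow(jobs, dependencies)
print(order)     # ['triangle_closing', 'rank', 'push_to_online']
print(attempts)  # {'triangle_closing': 1, 'rank': 2, 'push_to_online': 1}
```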
  • 20. People You May Know – Workflow
      • Inputs: connection stream, impression stream
      • Perform triangle closing for all members
      • Rank by discounting previously shown recommendations
      • Push recommendations to online service
      • Output: Member Id 1213 => [ Recommended member id 1734, Recommended member id 1523 … Recommended member id 6332 ]
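The slides do not give the discounting formula, so the following is only one plausible sketch of "rank by discounting previously shown recommendations". It assumes each candidate has a raw score (for example a common-connection count) that is multiplied by a decay factor once per past impression. The member ids are taken from the slide; the scores, the impression counts, and the `decay` value are made up.

```python
def rank_recommendations(scores, impressions, decay=0.5, limit=10):
    """Order candidate ids by raw score discounted once per past impression."""
    discounted = {
        candidate: score * decay ** impressions.get(candidate, 0)
        for candidate, score in scores.items()
    }
    # Highest discounted score first; ties broken by id for determinism.
    ranked = sorted(discounted, key=lambda c: (-discounted[c], c))
    return ranked[:limit]


# Hypothetical inputs for member 1213.
raw_scores = {1734: 4, 1523: 3, 6332: 1}
times_shown = {1734: 1}  # 1734 was already shown once

print(rank_recommendations(raw_scores, times_shown))  # [1523, 1734, 6332]
```

Without the discount, 1734 would rank first; after one impression its score drops from 4 to 2 and 1523 overtakes it.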
  • 21. “Last mile” problems (same agenda as slide 7)
  • 22. Egress – Key/Value
      • Voldemort
          – Based on Amazon's Dynamo
      • Distributed and elastic
      • Horizontally scalable
      • Bulk load pipeline from Hadoop
      • Simple to use: store results into 'url' using KeyValue('member_id')
      • Diagram labels: People you may know service; Voldemort; Hadoop; Batch load; getRecommendations(member id)
  • 23. People You May Know - Summary
      • Diagram labels: People you may know service; Kafka brokers (mirror); Kafka brokers; Hadoop; PeopleYouMayKnowTopic; Voldemort; Front end
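To make the key/value egress pattern concrete, here is a minimal in-memory sketch: a batch job produces (member id, recommendations) pairs, the whole data set is loaded at once, and the online service reads by key. `PartitionedStore` is hypothetical and is not Voldemort's client API; plain modulo hashing stands in for whatever partitioning the real store uses.

```python
import hashlib


class PartitionedStore:
    """Toy read-mostly key/value store split across a fixed number of partitions."""

    def __init__(self, num_partitions=4):
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition_for(self, key):
        # Stable hash so a key always maps to the same partition.
        digest = hashlib.md5(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def bulk_load(self, records):
        # Build the new data set on the side, then swap it in as a whole,
        # so readers never observe a half-loaded batch.
        fresh = [{} for _ in self.partitions]
        for key, value in records:
            fresh[self._partition_for(key)][key] = value
        self.partitions = fresh

    def get(self, key, default=None):
        return self.partitions[self._partition_for(key)].get(key, default)


store = PartitionedStore()
# Batch output keyed by member id, as in the slide-20 example.
store.bulk_load([(1213, [1734, 1523, 6332])])
print(store.get(1213))  # [1734, 1523, 6332]
```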
  • 24. Application examples (same list as slide 4)
  • 26. Year In Review Email

      memberPosition = LOAD '$latest_positions' USING BinaryJSON;

      memberWithPositionsChangedLastYear = FOREACH (
          FILTER memberPosition BY ((start_date >= $start_date_low) AND (start_date <= $start_date_high))
        ) GENERATE member_id, start_date, end_date;

      allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;

      allConnectionsWithChange_nondistinct = FOREACH (
          JOIN memberWithPositionsChangedLastYear BY member_id, allConnections BY dest
        ) GENERATE allConnections::source AS source, allConnections::dest AS dest;

      allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

      memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;

      pictures = FOREACH (
          FILTER memberinfowpics BY ((cropped_picture_id is not null) AND
            ((member_picture_privacy == 'N') OR (member_picture_privacy == 'E')))
        ) GENERATE member_id, cropped_picture_id,
                   first_name as dest_first_name, last_name as dest_last_name;

      resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;

      connectionsWithChangeWithPic = FOREACH resultPic GENERATE
          allConnectionsWithChange::source AS source_id,
          allConnectionsWithChange::dest AS member_id,
          pictures::cropped_picture_id AS pic_id,
          pictures::dest_first_name AS dest_first_name,
          pictures::dest_last_name AS dest_last_name;

      joinResult = JOIN connectionsWithChangeWithPic BY source_id, memberinfowpics BY member_id;

      withName = FOREACH joinResult GENERATE
          connectionsWithChangeWithPic::source_id AS source_id,
          connectionsWithChangeWithPic::member_id AS member_id,
          connectionsWithChangeWithPic::dest_first_name as first_name,
          connectionsWithChangeWithPic::dest_last_name as last_name,
          connectionsWithChangeWithPic::pic_id AS pic_id,
          memberinfowpics::first_name AS firstName,
          memberinfowpics::last_name AS lastName,
          memberinfowpics::gmt_offset as gmt_offset,
          memberinfowpics::email_locale as email_locale,
          memberinfowpics::email_address as email_address;

      resultGroup = GROUP withName BY (source_id, firstName, lastName, email_address, email_locale, gmt_offset);

      -- Get the count of results per recipient
      resultGroupCount = FOREACH resultGroup GENERATE
          group, withName as toomany, COUNT_STAR(withName) as num_results;

      resultGroupPre = filter resultGroupCount by num_results > 2;

      resultGroup = FOREACH resultGroupPre {
          withName = LIMIT toomany 64;
          GENERATE group, withName, num_results;
      }

      x_in_review_pre_out = FOREACH resultGroup GENERATE
          FLATTEN(group) as (source_id, firstName, lastName, email_address, email_locale, gmt_offset),
          withName.(member_id, pic_id, first_name, last_name) as jobChanger,
          '2013' as changeYear:chararray,
          num_results as num_results;

      x_in_review = FOREACH x_in_review_pre_out GENERATE
          source_id as recipientID,
          gmt_offset as gmtOffset,
          firstName as first_name,
          lastName as last_name,
          email_address,
          email_locale,
          TOTUPLE(changeYear, source_id, firstName, lastName, num_results, jobChanger) as body;

      rmf $xir;
      STORE x_in_review INTO '$url' USING Kafka();
  • 27. Year In Review Email – Workflow
      • Find users that have changed jobs
      • Join with connections and metadata (pictures)
      • Group by connections of these users
      • Push content to email service
  • 28. “Last mile” problems (same agenda as slide 7)
  • 29. Egress - Streams
      • Service acts as consumer
      • “EmailContentTopic”
      • store emails into 'url' using Stream("topic=x")
      • Diagram labels (first figure): Email service; Kafka brokers (mirror); Kafka brokers; Hadoop; EmailSentTopic
      • Diagram labels (second figure): Email service; Kafka brokers (mirror); Kafka brokers; Hadoop; EmailContentTopic
  • 30. Conclusion
      • Hadoop: simple programmatic model, rich developer ecosystem
      • Primitives for
          – Ingress: structured, complete data available; automatically handles data evolution
          – Workflow management: run and operate production processes
          – Egress: 1-line command for exporting data; horizontally scalable, little need for capacity planning
      • Empowers data scientists to focus on new product ideas, not infrastructure
  • 31. Future work: models of computation
      • Areas: Graphs; Distributed learning; Near-line processing
      • Methods listed:
          – Alternating Direction Method of Multipliers (ADMM)
          – Distributed Conjugate Gradient Descent (DCGD)
          – Distributed L-BFGS
          – Bayesian Distributed Learning (BDL)