SlideShare une entreprise Scribd logo
1  sur  17
Télécharger pour lire hors ligne
Data Infrastructure at Linkedin
Jun Rao and Sam Shah

LinkedIn Confidential ©2013 All Rights Reserved
Outline
1.
2.
3.
4.

LinkedIn introduction
Online/nearline infrastructure
Offline infrastructure
Conclusion

LinkedIn Confidential ©2013 All Rights Reserved

2
The World’s Largest Professional Network
Connecting Talent  Opportunity. At scale…

200M+ 2 new
Members Worldwide

Members Per Second

LinkedIn Confidential ©2013 All Rights Reserved

100M+
Monthly Unique Visitors

2M+
Company Pages

3
Two Product Families
For Members

Professionals

For Partners

 People You May Know
 Who’s Viewed My Profile
 Jobs You May Be
Interested In
 News/Sharing
 Today
 Search
 Subscriptions

Hire
Companies

Market
Sell

Science and Analytics
Data Infrastructure
Actions

Profiles
Connections
LinkedIn Confidential ©2013 All Rights Reserved

Data

Content
4
The Big-Data Feedback Loop
Refinement 

Engagement
Value 

Member

Product

Insights 

Virality

Data

Signals

Science
Analytics 

Scale 
Infrastructure
LinkedIn Confidential ©2013 All Rights Reserved

5
LinkedIn Data Infrastructure: Three-Phase Abstraction
Near-Line
Infra

Offline
Data Infra

Application

Users

Infrastructure

Online

Near-Line

Offline

Online Data
Infra

Latency & Freshness Requirements
Activity that should be reflected immediately

•
•
•

Products
• Messages
Member Profiles
• Endorsements
Company Profiles
• Skills
Connections

Activity that should be reflected soon

•
•
•

•
Activity Streams
Profile Standardization •
•
News

Recommendations
Search
Messages

Activity that can be reflected later

•
•
•

People You May Know •
Connection Strength •
News

Recommendations
Next best idea…

LinkedIn Confidential ©2013 All Rights Reserved

6
LinkedIn Data Infrastructure: Sample Stack

Infra challenges in 3-phase
ecosystem are
diverse, complex and specific

Some off-the-shelf.
Significant investment in
home-grown, deep and
interesting platforms
7
LinkedIn Data Infrastructure Solutions

Voldemort: Highly-Available
Distributed KV Store
• Key/value access at scale

8
Voldemort: Architecture

• Pluggable components
• Tunable consistency /
availability
• Key/value model,
server side “views”

•
•
•
•
•

10 clusters, 100+ nodes
Largest cluster – 10K+ qps
Avg latency: 3ms
Hundreds of Stores
Largest store – 2.8TB+
LinkedIn Data Infrastructure Solutions

Espresso: Indexed Timeline-Consistent
Distributed Data Store
• Fill in the gap btw Oracle and KV store

10
Espresso: System Components
• Hierarchical data model
• Timeline consistency
• Rich functionality
• Transactions
• Secondary index
• Text search
• Partitioning/replication
• Change propagation

11
Generic Cluster Manager: Helix
• Generic Distributed State Model
•
•
•
•

ConfigManagement
Automatic Load Balancing
Fault tolerance
Cluster expansion and rebalancing

• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix

12
LinkedIn Data Infrastructure Solutions

Databus : Timeline-Consistent
Change Data Capture
• Deliver data store changes to apps
Databus at LinkedIn
DB

Capture
Changes

Relay
Event Win

On-line
Changes

On-line
Changes

Databus
Client Lib

Client

Snapshot at U

Databus
Client Lib

Consistent

 Transport independent of data
source: Oracle, MySQL, …
 Transactional semantics
 In order, at least once delivery

Consumer n

Client

Bootstrap

DB

Consumer 1

Consumer 1

Consumer n

 Tens of relays
 Hundreds of sources
 Low latency - milliseconds

14
LinkedIn Data Infrastructure Solutions

Kafka: High-Volume Low-Latency
Messaging System
• Log aggregation and queuing

15
Kafka Architecture
Producer

Producer

Broker 1

Broker 2

Broker 3

Broker 4

topic1-part1

topic1-part2

topic2-part1

topic2-part2

topic2-part2

topic1-part1

topic1-part2

topic2-part1

topic2-part1

topic2-part2

topic1-part1

topic1-part2

Key features
• Scale-out architecture
• Automatic load balancing
• High throughput/low latency
• Rewindability
• Intra-cluster replication

Zookeeper

Consumer

Consumer

Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
LinkedIn Data Infrastructure: A few take-aways
1.
2.
3.

Building infrastructure in a hyper-growth
environment is challenging.
Few vs Many: Balance over-specialized (agile)
vs generic efforts (leverage-able) platforms (*)
Balance open-source products with homegrown platforms (**)

LinkedIn Confidential ©2013 All Rights Reserved

17

Contenu connexe

Tendances

Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Precisely
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
TechEvent Building a Data Lake
TechEvent Building a Data LakeTechEvent Building a Data Lake
TechEvent Building a Data LakeTrivadis
 
Data Pipelines With Streamsets
Data Pipelines With Streamsets Data Pipelines With Streamsets
Data Pipelines With Streamsets Jowanza Joseph
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)Sid Anand
 
a Real-time Processing System based on Spark streaming int he field of Teleco...
a Real-time Processing System based on Spark streaming int he field of Teleco...a Real-time Processing System based on Spark streaming int he field of Teleco...
a Real-time Processing System based on Spark streaming int he field of Teleco...DataWorks Summit
 
Building a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksBuilding a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksDatabricks
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHumza Naseer
 
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Cloudera, Inc.
 
Data Integration through Data Virtualization (SQL Server Konferenz 2019)
Data Integration through Data Virtualization (SQL Server Konferenz 2019)Data Integration through Data Virtualization (SQL Server Konferenz 2019)
Data Integration through Data Virtualization (SQL Server Konferenz 2019)Cathrine Wilhelmsen
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Dipti Borkar
 
Red Hat JBoss Data Virtualization
Red Hat JBoss Data VirtualizationRed Hat JBoss Data Virtualization
Red Hat JBoss Data VirtualizationDLT Solutions
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Hortonworks
 
"Who Moved my Data? - Why tracking changes and sources of data is critical to...
"Who Moved my Data? - Why tracking changes and sources of data is critical to..."Who Moved my Data? - Why tracking changes and sources of data is critical to...
"Who Moved my Data? - Why tracking changes and sources of data is critical to...Cask Data
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data MeshLibbySchulze
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataNaveen Korakoppa
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 

Tendances (20)

Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
TechEvent Building a Data Lake
TechEvent Building a Data LakeTechEvent Building a Data Lake
TechEvent Building a Data Lake
 
Data Pipelines With Streamsets
Data Pipelines With Streamsets Data Pipelines With Streamsets
Data Pipelines With Streamsets
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)
 
a Real-time Processing System based on Spark streaming int he field of Teleco...
a Real-time Processing System based on Spark streaming int he field of Teleco...a Real-time Processing System based on Spark streaming int he field of Teleco...
a Real-time Processing System based on Spark streaming int he field of Teleco...
 
Building a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksBuilding a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with Databricks
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
 
Data Integration through Data Virtualization (SQL Server Konferenz 2019)
Data Integration through Data Virtualization (SQL Server Konferenz 2019)Data Integration through Data Virtualization (SQL Server Konferenz 2019)
Data Integration through Data Virtualization (SQL Server Konferenz 2019)
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
 
Red Hat JBoss Data Virtualization
Red Hat JBoss Data VirtualizationRed Hat JBoss Data Virtualization
Red Hat JBoss Data Virtualization
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 
"Who Moved my Data? - Why tracking changes and sources of data is critical to...
"Who Moved my Data? - Why tracking changes and sources of data is critical to..."Who Moved my Data? - Why tracking changes and sources of data is critical to...
"Who Moved my Data? - Why tracking changes and sources of data is critical to...
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
 
About CDAP
About CDAPAbout CDAP
About CDAP
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 

Similaire à LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedInAmy W. Tang
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization Denodo
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationMongoDB
 
Qo Introduction V2
Qo Introduction V2Qo Introduction V2
Qo Introduction V2Joe_F
 
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012Bhaskar Ghosh
 
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...MongoDB
 
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...Denodo
 
Modern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationModern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationDenodo
 
SharePoint 2016 Beta 2 What's new (End users and IT Pros) Microsoft Innovat...
SharePoint 2016   Beta 2 What's new (End users and IT Pros) Microsoft Innovat...SharePoint 2016   Beta 2 What's new (End users and IT Pros) Microsoft Innovat...
SharePoint 2016 Beta 2 What's new (End users and IT Pros) Microsoft Innovat...serge luca
 
Ms net work-sharepoint 2013-applied architecture from the field v4
Ms net work-sharepoint 2013-applied architecture from the field v4Ms net work-sharepoint 2013-applied architecture from the field v4
Ms net work-sharepoint 2013-applied architecture from the field v4Tihomir Ignatov
 
A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)Denodo
 
SharePoint Online vs. On-Premise
SharePoint Online vs. On-PremiseSharePoint Online vs. On-Premise
SharePoint Online vs. On-PremiseEvan Hodges
 
Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...
Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...
Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...Denodo
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)Moacyr Passador
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dataconomy Media
 
Best practices to deliver data analytics to the business with power bi
Best practices to deliver data analytics to the business with power biBest practices to deliver data analytics to the business with power bi
Best practices to deliver data analytics to the business with power biSatya Shyam K Jayanty
 
MongoDB Breakfast Milan - Mainframe Offloading Strategies
MongoDB Breakfast Milan -  Mainframe Offloading StrategiesMongoDB Breakfast Milan -  Mainframe Offloading Strategies
MongoDB Breakfast Milan - Mainframe Offloading StrategiesMongoDB
 
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Zaloni
 
Advanced applications with MongoDB
Advanced applications with MongoDBAdvanced applications with MongoDB
Advanced applications with MongoDBNorberto Leite
 

Similaire à LinkedIn Infrastructure (analytics@webscale, at fb 2013) (20)

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital Transformation
 
Qo Introduction V2
Qo Introduction V2Qo Introduction V2
Qo Introduction V2
 
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
 
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...
Bangalore Executive Seminar 2015: Case Study - Text Analysis on MongoDB for a...
 
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
 
Modern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationModern Data Management for Federal Modernization
Modern Data Management for Federal Modernization
 
SharePoint 2016 Beta 2 What's new (End users and IT Pros) Microsoft Innovat...
SharePoint 2016   Beta 2 What's new (End users and IT Pros) Microsoft Innovat...SharePoint 2016   Beta 2 What's new (End users and IT Pros) Microsoft Innovat...
SharePoint 2016 Beta 2 What's new (End users and IT Pros) Microsoft Innovat...
 
Ms net work-sharepoint 2013-applied architecture from the field v4
Ms net work-sharepoint 2013-applied architecture from the field v4Ms net work-sharepoint 2013-applied architecture from the field v4
Ms net work-sharepoint 2013-applied architecture from the field v4
 
A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)A Key to Real-time Insights in a Post-COVID World (ASEAN)
A Key to Real-time Insights in a Post-COVID World (ASEAN)
 
SharePoint Online vs. On-Premise
SharePoint Online vs. On-PremiseSharePoint Online vs. On-Premise
SharePoint Online vs. On-Premise
 
Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...
Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...
Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data Architecture
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
 
Best practices to deliver data analytics to the business with power bi
Best practices to deliver data analytics to the business with power biBest practices to deliver data analytics to the business with power bi
Best practices to deliver data analytics to the business with power bi
 
MongoDB Breakfast Milan - Mainframe Offloading Strategies
MongoDB Breakfast Milan -  Mainframe Offloading StrategiesMongoDB Breakfast Milan -  Mainframe Offloading Strategies
MongoDB Breakfast Milan - Mainframe Offloading Strategies
 
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
Building a Modern Data Architecture by Ben Sharma at Strata + Hadoop World Sa...
 
Advanced applications with MongoDB
Advanced applications with MongoDBAdvanced applications with MongoDB
Advanced applications with MongoDB
 

Dernier

Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 

Dernier (20)

Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 

LinkedIn Infrastructure (analytics@webscale, at fb 2013)

  • 1. Data Infrastructure at Linkedin Jun Rao and Sam Shah LinkedIn Confidential ©2013 All Rights Reserved
  • 2. Outline 1. 2. 3. 4. LinkedIn introduction Online/nearline infrastructure Offline infrastructure Conclusion LinkedIn Confidential ©2013 All Rights Reserved 2
  • 3. The World’s Largest Professional Network Connecting Talent  Opportunity. At scale… 200M+ 2 new Members Worldwide Members Per Second LinkedIn Confidential ©2013 All Rights Reserved 100M+ Monthly Unique Visitors 2M+ Company Pages 3
  • 4. Two Product Families For Members Professionals For Partners  People You May Know  Who’s Viewed My Profile  Jobs You May Be Interested In  News/Sharing  Today  Search  Subscriptions Hire Companies Market Sell Science and Analytics Data Infrastructure Actions Profiles Connections LinkedIn Confidential ©2013 All Rights Reserved Data Content 4
  • 5. The Big-Data Feedback Loop Refinement  Engagement Value  Member Product Insights  Virality Data Signals Science Analytics  Scale  Infrastructure LinkedIn Confidential ©2013 All Rights Reserved 5
  • 6. LinkedIn Data Infrastructure: Three-Phase Abstraction Near-Line Infra Offline Data Infra Application Users Infrastructure Online Near-Line Offline Online Data Infra Latency & Freshness Requirements Activity that should be reflected immediately • • • Products • Messages Member Profiles • Endorsements Company Profiles • Skills Connections Activity that should be reflected soon • • • • Activity Streams Profile Standardization • • News Recommendations Search Messages Activity that can be reflected later • • • People You May Know • Connection Strength • News Recommendations Next best idea… LinkedIn Confidential ©2013 All Rights Reserved 6
  • 7. LinkedIn Data Infrastructure: Sample Stack Infra challenges in 3-phase ecosystem are diverse, complex and specific Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms 7
  • 8. LinkedIn Data Infrastructure Solutions Voldemort: Highly-Available Distributed KV Store • Key/value access at scale 8
  • 9. Voldemort: Architecture • Pluggable components • Tunable consistency / availability • Key/value model, server side “views” • • • • • 10 clusters, 100+ nodes Largest cluster – 10K+ qps Avg latency: 3ms Hundreds of Stores Largest store – 2.8TB+
  • 10. LinkedIn Data Infrastructure Solutions Espresso: Indexed Timeline-Consistent Distributed Data Store • Fill in the gap btw Oracle and KV store 10
  • 11. Espresso: System Components • Hierarchical data model • Timeline consistency • Rich functionality • Transactions • Secondary index • Text search • Partitioning/replication • Change propagation 11
  • 12. Generic Cluster Manager: Helix • Generic Distributed State Model • • • • ConfigManagement Automatic Load Balancing Fault tolerance Cluster expansion and rebalancing • Espresso, Databus and Search • Open Source Apr 2012 • https://github.com/linkedin/helix 12
  • 13. LinkedIn Data Infrastructure Solutions Databus : Timeline-Consistent Change Data Capture • Deliver data store changes to apps
  • 14. Databus at LinkedIn DB Capture Changes Relay Event Win On-line Changes On-line Changes Databus Client Lib Client Snapshot at U Databus Client Lib Consistent  Transport independent of data source: Oracle, MySQL, …  Transactional semantics  In order, at least once delivery Consumer n Client Bootstrap DB Consumer 1 Consumer 1 Consumer n  Tens of relays  Hundreds of sources  Low latency - milliseconds 14
  • 15. LinkedIn Data Infrastructure Solutions Kafka: High-Volume Low-Latency Messaging System • Log aggregation and queuing 15
  • 16. Kafka Architecture Producer Producer Broker 1 Broker 2 Broker 3 Broker 4 topic1-part1 topic1-part2 topic2-part1 topic2-part2 topic2-part2 topic1-part1 topic1-part2 topic2-part1 topic2-part1 topic2-part2 topic1-part1 topic1-part2 Key features • Scale-out architecture • Automatic load balancing • High throughput/low latency • Rewindability • Intra-cluster replication Zookeeper Consumer Consumer Per day stats • writes: 10+ billion messages • reads: 50+ billion messages
  • 17. LinkedIn Data Infrastructure: A few take-aways 1. 2. 3. Building infrastructure in a hyper-growth environment is challenging. Few vs Many: Balance over-specialized (agile) vs generic efforts (leverage-able) platforms (*) Balance open-source products with homegrown platforms (**) LinkedIn Confidential ©2013 All Rights Reserved 17

Notes de l'éditeur

  1. Enterprise Facing is all about Segmentation and Connections Our base data lead to revenue-generating productsEnterprise Application-building problems with deterministic life-cycles Science is key for targeting and matching (e.g. CAP, Marketing Solutions) Key back-office play for Hiring, Sales and Marketing for 85% of Fortune-500
  2. Transition needs to be goodProducts => data infrastructure requirements in previous slideAll products don’t make the same latency and freshness requirements from our data infrastructureThe way we bucketize this is….News and recommendations show up in both nearline and offline
  3. Data Integration is hard. Having sane and same metadata across systems. Have a schema which works across the 3 phases. Want a rich evolving schemas and make the conforming push as much of data cleaning to source and upstream as much as possible so near-line and off-line helpsSessionization logic is in WH which makes it hard for near-line systems to useExtensible system where changing schema in one phase does not break downstream systemsDon’t build over-specialized systems: e.g. a monitoring system for PYMK – build Azkaban