SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
Building a Real-Time Data Pipeline:
Apache Kafka at Linkedin
Hadoop Summit 2013
Joel Koshy
June 2013
LinkedIn Corporation ©2013 All Rights Reserved
HADOOP SUMMIT 2013
Network update stream
LinkedIn Corporation ©2013 All Rights Reserved
We have a lot of data.
We want to leverage this data to build products.
Data pipeline
HADOOP SUMMIT 2013
People you may know
HADOOP SUMMIT 2013
System and application metrics/logging
LinkedIn Corporation ©2013 All Rights Reserved 5
How do we integrate this variety of data
and make it available to all these systems?
LinkedIn Confidential ©2013 All Rights Reserved
HADOOP SUMMIT 2013
Point-to-point pipelines
HADOOP SUMMIT 2013
LinkedIn’s user activity data pipeline (circa 2010)
HADOOP SUMMIT 2013
Point-to-point pipelines
HADOOP SUMMIT 2013
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
LinkedIn Corporation ©2013 All Rights Reserved 10
HADOOP SUMMIT 2013
Central data pipeline
First attempt: don’t re-invent the wheel
LinkedIn Confidential ©2013 All Rights Reserved
HADOOP SUMMIT 2013
Second attempt: re-invent the wheel!
LinkedIn Confidential ©2013 All Rights Reserved
Use a central commit log
LinkedIn Confidential ©2013 All Rights Reserved
HADOOP SUMMIT 2013
What is a commit log?
HADOOP SUMMIT 2013
The log as a messaging system
LinkedIn Corporation ©2013 All Rights Reserved 17
HADOOP SUMMIT 2013
Apache Kafka
LinkedIn Corporation ©2013 All Rights Reserved 18
HADOOP SUMMIT 2013
Usage at LinkedIn
 16 brokers in each cluster
 28 billion messages/day
 Peak rates
– Writes: 460,000 messages/second
– Reads: 2,300,000 messages/second
 ~ 700 topics
 40-50 live services consuming user-activity data
 Many ad hoc consumers
 Every production service is a producer (for metrics)
 10k connections/colo
LinkedIn Corporation ©2013 All Rights Reserved 19
HADOOP SUMMIT 2013
Usage at LinkedIn
LinkedIn Corporation ©2013 All Rights Reserved 20
HADOOP SUMMIT 2013
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
LinkedIn Corporation ©2013 All Rights Reserved 21
HADOOP SUMMIT 2013
Standardize on Avro in data pipeline
LinkedIn Corporation ©2013 All Rights Reserved 22
{
"type": "record",
"name": "URIValidationRequestEvent",
"namespace": "com.linkedin.event.usv",
"fields": [
{
"name": "header",
"type": {
"type": "record",
"name": ”TrackingEventHeader",
"namespace": "com.linkedin.event",
"fields": [
{
"name": "memberId",
"type": "int",
"doc": "The member id of the user initiating the action"
},
{
"name": ”timeMs",
"type": "long",
"doc": "The time of the event"
},
{
"name": ”host",
"type": "string",
...
...
HADOOP SUMMIT 2013
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
LinkedIn Corporation ©2013 All Rights Reserved 23
HADOOP SUMMIT 2013
Hadoop data load (Camus)
 Open sourced:
– https://github.com/linkedin/camus
 One job loads all events
 ~10 minute ETA on average from producer to HDFS
 Hive registration done automatically
 Schema evolution handled transparently
HADOOP SUMMIT 2013
Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness
LinkedIn Corporation ©2013 All Rights Reserved 25
Does it work?
“All published messages must be delivered to all consumers (quickly)”
LinkedIn Confidential ©2013 All Rights Reserved
HADOOP SUMMIT 2013
Audit Trail
HADOOP SUMMIT 2013
Kafka replication (0.8)
 Intra-cluster replication feature
– Facilitates high availability and durability
 Beta release available
https://dist.apache.org/repos/dist/release/kafka/
 Rolled out in production at LinkedIn last week
LinkedIn Corporation ©2013 All Rights Reserved 28
HADOOP SUMMIT 2013
Join us at our user-group meeting tonight @ LinkedIn!
– Thursday, June 27, 7.30pm to 9.30pm
– 2025 Stierlin Ct., Mountain View, CA
– http://www.meetup.com/http-kafka-apache-org/events/125887332/
– Presentations (replication overview and use-case studies) from:
 RichRelevance
 Netflix
 Square
 LinkedIn
LinkedIn Corporation ©2013 All Rights Reserved 29
HADOOP SUMMIT 2013LinkedIn Corporation ©2013 All Rights Reserved 30

Contenu connexe

Tendances

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...Altinity Ltd
 
Intro to Time Series
Intro to Time Series Intro to Time Series
Intro to Time Series InfluxData
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...NoSQLmatters
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentationIlias Okacha
 
Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...
Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...
Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...Amazon Web Services
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
Streaming sql and druid
Streaming sql and druid Streaming sql and druid
Streaming sql and druid arupmalakar
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural searchDmitry Kan
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...HostedbyConfluent
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop securitybigdatagurus_meetup
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Claudio Martella
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaDatabricks
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 

Tendances (20)

HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Intro to Time Series
Intro to Time Series Intro to Time Series
Intro to Time Series
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barce...
 
Screw DevOps, Let's Talk DataOps
Screw DevOps, Let's Talk DataOpsScrew DevOps, Let's Talk DataOps
Screw DevOps, Let's Talk DataOps
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...
Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...
Migrate from Oracle to Amazon Aurora using AWS Schema Conversion Tool & AWS D...
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Streaming sql and druid
Streaming sql and druid Streaming sql and druid
Streaming sql and druid
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...How Kafka Powers the World's Most Popular Vector Database System with Charles...
How Kafka Powers the World's Most Popular Vector Database System with Charles...
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop security
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 

En vedette

Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructuremattlieber
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedInAmy W. Tang
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Yahoo Developer Network
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationAmy W. Tang
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Amy W. Tang
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Amy W. Tang
 
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInA Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInAmy W. Tang
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
LinkedIn Communication Architecture
LinkedIn Communication ArchitectureLinkedIn Communication Architecture
LinkedIn Communication ArchitectureLinkedIn
 
Introduction to Databus
Introduction to DatabusIntroduction to Databus
Introduction to DatabusAmy W. Tang
 
Building Distributed Systems Using Helix
Building Distributed Systems Using HelixBuilding Distributed Systems Using Helix
Building Distributed Systems Using HelixAmy W. Tang
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 
Rakuten LeoFs - distributed file system
Rakuten LeoFs - distributed file systemRakuten LeoFs - distributed file system
Rakuten LeoFs - distributed file systemRakuten Group, Inc.
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafkaSamuel Kerrien
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIOJozo Kovac
 
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...In-Memory Computing Summit
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData WebinarSnappyData
 

En vedette (20)

Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data Application
 
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedInA Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
LinkedIn Communication Architecture
LinkedIn Communication ArchitectureLinkedIn Communication Architecture
LinkedIn Communication Architecture
 
Introduction to Databus
Introduction to DatabusIntroduction to Databus
Introduction to Databus
 
Building Distributed Systems Using Helix
Building Distributed Systems Using HelixBuilding Distributed Systems Using Helix
Building Distributed Systems Using Helix
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
Rakuten LeoFs - distributed file system
Rakuten LeoFs - distributed file systemRakuten LeoFs - distributed file system
Rakuten LeoFs - distributed file system
 
Introduction to apache kafka
Introduction to apache kafkaIntroduction to apache kafka
Introduction to apache kafka
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIO
 
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
IMCSummit 2015 - Day 2 IT Business Track - Real-time Interactive Big Data Ana...
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData Webinar
 

Similaire à Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn

All data accessible to all my organization - Presentation at OW2con'19, June...
 All data accessible to all my organization - Presentation at OW2con'19, June... All data accessible to all my organization - Presentation at OW2con'19, June...
All data accessible to all my organization - Presentation at OW2con'19, June...OW2
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshIanFurlong4
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sri Ambati
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit/Hadoop Summit
 
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012Bhaskar Ghosh
 
The LOD Gateway: Open Source Infrastructure for Linked Data
The LOD Gateway: Open Source Infrastructure for Linked DataThe LOD Gateway: Open Source Infrastructure for Linked Data
The LOD Gateway: Open Source Infrastructure for Linked DataDavid Newbury
 
The oecd delta project – providing easier access to data through api's
The oecd delta project – providing easier access to data through api'sThe oecd delta project – providing easier access to data through api's
The oecd delta project – providing easier access to data through api'sJonathan Challener
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Tech Community
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Tech Community
 
Big Data, Bigger Brains
Big Data, Bigger BrainsBig Data, Bigger Brains
Big Data, Bigger BrainsDenny Lee
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidDataWorks Summit
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Neo4j
 
Opensocial Haifa Seminar - 2008.04.08
Opensocial Haifa Seminar - 2008.04.08Opensocial Haifa Seminar - 2008.04.08
Opensocial Haifa Seminar - 2008.04.08Ari Leichtberg
 
Better integrations through open interfaces
Better integrations through open interfacesBetter integrations through open interfaces
Better integrations through open interfacesSteve Speicher
 
Test trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely testsTest trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely testsHugh McCamphill
 

Similaire à Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn (20)

All data accessible to all my organization - Presentation at OW2con'19, June...
 All data accessible to all my organization - Presentation at OW2con'19, June... All data accessible to all my organization - Presentation at OW2con'19, June...
All data accessible to all my organization - Presentation at OW2con'19, June...
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014Sparkling Water Webinar October 29th, 2014
Sparkling Water Webinar October 29th, 2014
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
Bg linkedin bigdata_martinschultz_symposium_yale_oct2012
 
Breaking down data silos with OData
Breaking down data silos with ODataBreaking down data silos with OData
Breaking down data silos with OData
 
The LOD Gateway: Open Source Infrastructure for Linked Data
The LOD Gateway: Open Source Infrastructure for Linked DataThe LOD Gateway: Open Source Infrastructure for Linked Data
The LOD Gateway: Open Source Infrastructure for Linked Data
 
The oecd delta project – providing easier access to data through api's
The oecd delta project – providing easier access to data through api'sThe oecd delta project – providing easier access to data through api's
The oecd delta project – providing easier access to data through api's
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needs
 
Microsoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needsMicrosoft Graph: Connect to essential data every app needs
Microsoft Graph: Connect to essential data every app needs
 
Big Data, Bigger Brains
Big Data, Bigger BrainsBig Data, Bigger Brains
Big Data, Bigger Brains
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
Graphs & Big Data - Philip Rathle and Andreas Kollegger @ Big Data Science Me...
 
Opensocial Haifa Seminar - 2008.04.08
Opensocial Haifa Seminar - 2008.04.08Opensocial Haifa Seminar - 2008.04.08
Opensocial Haifa Seminar - 2008.04.08
 
Better integrations through open interfaces
Better integrations through open interfacesBetter integrations through open interfaces
Better integrations through open interfaces
 
Test trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely testsTest trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely tests
 

Plus de Amy W. Tang

Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Amy W. Tang
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph PresentationAmy W. Tang
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedInAmy W. Tang
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State DrivesAmy W. Tang
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with HelixAmy W. Tang
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the DatabusAmy W. Tang
 

Plus de Amy W. Tang (6)

Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph Presentation
 
Data Infrastructure at LinkedIn
Data Infrastructure at LinkedInData Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
Voldemort on Solid State Drives
Voldemort on Solid State DrivesVoldemort on Solid State Drives
Voldemort on Solid State Drives
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with Helix
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the Databus
 

Dernier

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn

  • 1. Building a Real-Time Data Pipeline: Apache Kafka at Linkedin Hadoop Summit 2013 Joel Koshy June 2013 LinkedIn Corporation ©2013 All Rights Reserved
  • 3. LinkedIn Corporation ©2013 All Rights Reserved We have a lot of data. We want to leverage this data to build products. Data pipeline
  • 5. HADOOP SUMMIT 2013 System and application metrics/logging LinkedIn Corporation ©2013 All Rights Reserved 5
  • 6. How do we integrate this variety of data and make it available to all these systems? LinkedIn Confidential ©2013 All Rights Reserved
  • 8. HADOOP SUMMIT 2013 LinkedIn’s user activity data pipeline (circa 2010)
  • 10. HADOOP SUMMIT 2013 Four key ideas 1. Central data pipeline 2. Push data cleanliness upstream 3. O(1) ETL 4. Evidence-based correctness LinkedIn Corporation ©2013 All Rights Reserved 10
  • 11. HADOOP SUMMIT 2013 Central data pipeline
  • 12. First attempt: don’t re-invent the wheel LinkedIn Confidential ©2013 All Rights Reserved
  • 14. Second attempt: re-invent the wheel! LinkedIn Confidential ©2013 All Rights Reserved
  • 15. Use a central commit log LinkedIn Confidential ©2013 All Rights Reserved
  • 16. HADOOP SUMMIT 2013 What is a commit log?
  • 17. HADOOP SUMMIT 2013 The log as a messaging system LinkedIn Corporation ©2013 All Rights Reserved 17
  • 18. HADOOP SUMMIT 2013 Apache Kafka LinkedIn Corporation ©2013 All Rights Reserved 18
  • 19. HADOOP SUMMIT 2013 Usage at LinkedIn  16 brokers in each cluster  28 billion messages/day  Peak rates – Writes: 460,000 messages/second – Reads: 2,300,000 messages/second  ~ 700 topics  40-50 live services consuming user-activity data  Many ad hoc consumers  Every production service is a producer (for metrics)  10k connections/colo LinkedIn Corporation ©2013 All Rights Reserved 19
  • 20. HADOOP SUMMIT 2013 Usage at LinkedIn LinkedIn Corporation ©2013 All Rights Reserved 20
  • 21. HADOOP SUMMIT 2013 Four key ideas 1. Central data pipeline 2. Push data cleanliness upstream 3. O(1) ETL 4. Evidence-based correctness LinkedIn Corporation ©2013 All Rights Reserved 21
  • 22. HADOOP SUMMIT 2013 Standardize on Avro in data pipeline LinkedIn Corporation ©2013 All Rights Reserved 22 { "type": "record", "name": "URIValidationRequestEvent", "namespace": "com.linkedin.event.usv", "fields": [ { "name": "header", "type": { "type": "record", "name": ”TrackingEventHeader", "namespace": "com.linkedin.event", "fields": [ { "name": "memberId", "type": "int", "doc": "The member id of the user initiating the action" }, { "name": ”timeMs", "type": "long", "doc": "The time of the event" }, { "name": ”host", "type": "string", ... ...
  • 23. HADOOP SUMMIT 2013 Four key ideas 1. Central data pipeline 2. Push data cleanliness upstream 3. O(1) ETL 4. Evidence-based correctness LinkedIn Corporation ©2013 All Rights Reserved 23
  • 24. HADOOP SUMMIT 2013 Hadoop data load (Camus)  Open sourced: – https://github.com/linkedin/camus  One job loads all events  ~10 minute ETA on average from producer to HDFS  Hive registration done automatically  Schema evolution handled transparently
  • 25. HADOOP SUMMIT 2013 Four key ideas 1. Central data pipeline 2. Push data cleanliness upstream 3. O(1) ETL 4. Evidence-based correctness LinkedIn Corporation ©2013 All Rights Reserved 25
  • 26. Does it work? “All published messages must be delivered to all consumers (quickly)” LinkedIn Confidential ©2013 All Rights Reserved
  • 28. HADOOP SUMMIT 2013 Kafka replication (0.8)  Intra-cluster replication feature – Facilitates high availability and durability  Beta release available https://dist.apache.org/repos/dist/release/kafka/  Rolled out in production at LinkedIn last week LinkedIn Corporation ©2013 All Rights Reserved 28
  • 29. HADOOP SUMMIT 2013 Join us at our user-group meeting tonight @ LinkedIn! – Thursday, June 27, 7.30pm to 9.30pm – 2025 Stierlin Ct., Mountain View, CA – http://www.meetup.com/http-kafka-apache-org/events/125887332/ – Presentations (replication overview and use-case studies) from:  RichRelevance  Netflix  Square  LinkedIn LinkedIn Corporation ©2013 All Rights Reserved 29
  • 30. HADOOP SUMMIT 2013LinkedIn Corporation ©2013 All Rights Reserved 30