©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure
Apache Kafka at LinkedIn
About Me
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Why Did We Build Kafka?
We Have a Lot of Data
• User activity tracking
  – Page views, ad impressions, etc.
• Server logs and metrics
  – Syslogs, request rates, etc.
• Messaging
  – Emails, news feeds, etc.
• Computation derived
  – Results of Hadoop / data warehousing, etc.
... and We Build Products on Data
Newsfeed
Recommendation
HADOOP SUMMIT 2013
People you may know
Recommendation
Search
Metrics and Monitoring
System and application metrics/logging
... and a LOT of Monitoring
The Problem:
How to integrate this variety of data
and make it available to all products?
Life back in 2010:
Point-to-Point Pipelines
Example: User Activity Data Flow
What We Want
• A centralized data pipeline
Apache Kafka
We tried some off-the-shelf systems, but…
What We REALLY Want
• A centralized data pipeline that is
  – Elastically scalable
  – Durable
  – High-throughput
  – Easy to use
Apache Kafka
• A distributed pub-sub messaging system
• Scale-out from the ground up
• Persisted to disk
• High throughput (tens of MB/sec per server)
Life Since Kafka in Production
Apache Kafka
• Developed and maintained by 5 developers + 2 SREs
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Key Idea #1:
Data-parallelism leads to scale-out
Distribute Clients across Partitions
• Produce/consume requests are randomly balanced among brokers
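A minimal sketch of how requests might be spread across partitions. `PartitionChooser` is a hypothetical class written for illustration, not part of the Kafka client API. Keyless messages go to a random partition, as the slide describes; hashing keyed messages to a fixed partition is an added assumption.

```java
import java.util.Random;

// Hypothetical helper for illustration; not part of the Kafka client API.
class PartitionChooser {
    private final int numPartitions;
    private final Random random;

    PartitionChooser(int numPartitions, long seed) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
        this.random = new Random(seed);
    }

    // Keyless messages: pick a partition uniformly at random to balance load.
    int choose() {
        return random.nextInt(numPartitions);
    }

    // Keyed messages (an assumption beyond the slide): hash the key so that
    // equal keys always land on the same partition.
    int choose(String key) {
        // Mask the sign bit instead of Math.abs, which overflows for Integer.MIN_VALUE.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        PartitionChooser chooser = new PartitionChooser(4, 42L);
        int[] counts = new int[4];
        for (int i = 0; i < 10_000; i++) {
            counts[chooser.choose()]++;
        }
        for (int p = 0; p < counts.length; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " messages");
        }
        System.out.println("member-123 -> partition " + chooser.choose("member-123"));
    }
}
```

Because each partition is served independently, adding partitions (and brokers to host them) is what lets throughput scale out.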
Key Idea #2:
Disks are fast when used sequentially
Store Messages as a Log
• Appends are effectively O(1)
• Reads from a known offset are still fast when cached
[Figure: Partition i of Topic A drawn as a row of numbered offsets from 3 to 12 and onward. The producer writes at the end of the log; Consumer1 reads at offset 7 and Consumer2 reads at offset 7.]
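The log abstraction can be sketched in a few lines. `PartitionLog` below is a hypothetical in-memory model written for illustration; a real broker appends to files on disk. It shows the two operations the slide relies on: appending at the tail, and reading from a known offset without modifying the log.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of one partition; a real broker appends to files on disk.
class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Append at the tail and return the offset assigned to the message (amortized O(1)).
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Return up to maxMessages starting at the given offset; the log itself is not modified.
    List<String> read(long offset, int maxMessages) {
        if (offset < 0 || offset > messages.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        int from = (int) offset;
        int to = Math.min(messages.size(), from + maxMessages);
        return new ArrayList<>(messages.subList(from, to));
    }

    // Offset that the next appended message will receive.
    long endOffset() {
        return messages.size();
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        for (int i = 0; i < 10; i++) {
            log.append("m" + i);
        }
        // Two consumers read independently from the same offset; neither affects the other.
        long consumer1Offset = 7;
        long consumer2Offset = 7;
        System.out.println("consumer1: " + log.read(consumer1Offset, 2));
        System.out.println("consumer2: " + log.read(consumer2Offset, 5));
    }
}
```

Since each consumer only tracks its own offset, the log needs no per-consumer state, and any number of consumers can read the same data.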
Key Idea #3:
Batching makes best use of network/IO
Batch Transfer
• Batched send and receive
• Batched compression
• No message caching in JVM
• Zero-copy from file to socket (Java NIO)
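To illustrate batched send and batched compression, here is a rough sketch. `BatchingSender` is a hypothetical class, not the Kafka producer's implementation, and the JSON-like payload is made up. It buffers messages, hands them off as one unit, and compresses the whole batch together with GZIP from the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical batching buffer for illustration; not the Kafka producer's implementation.
class BatchingSender {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final List<byte[]> sentBatches = new ArrayList<>();

    BatchingSender(int batchSize) {
        this.batchSize = batchSize;
    }

    // Buffer the message; once batchSize messages have accumulated, send them as one unit.
    void send(String message) {
        buffer.add(message);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Compress the whole batch together, then record it as a single outgoing payload.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sentBatches.add(gzip(String.join("\n", buffer)));
        buffer.clear();
    }

    List<byte[]> sentBatches() {
        return sentBatches;
    }

    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        BatchingSender sender = new BatchingSender(100);
        int perMessageBytes = 0;
        for (int i = 0; i < 100; i++) {
            String message = "{\"event\":\"page_view\",\"memberId\":" + i + "}";
            perMessageBytes += gzip(message).length;
            sender.send(message);
        }
        System.out.println("network sends: " + sender.sentBatches().size());
        System.out.println("bytes, compressed as one batch: " + sender.sentBatches().get(0).length);
        System.out.println("bytes, compressed per message:  " + perMessageBytes);
    }
}
```

Similar messages share structure, so compressing them together usually yields a much smaller payload than compressing each one separately, on top of saving per-request overhead.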
The API (0.8)
Producer:
    send(topic, message)
Consumer:
    Iterable stream = createMessageStreams(…).get(topic)
    for (message : stream) {
        // process the message
    }
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
  – Pipeline deployment
  – Schema for data cleanliness
  – O(1) ETL
  – Auditing for correctness
• Roadmap
• Q & A
Kafka Usage at LinkedIn
• Mainly used for tracking user-activity and metrics data
• 16 - 32 brokers in each cluster (615+ total brokers)
• 527 billion messages/day
• 7500+ topics, 270k+ partitions
• Byte rates:
  – Writes: 97 TB/day
  – Reads: 430 TB/day
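For a sense of scale, the daily figures above can be converted into per-second averages. This is back-of-the-envelope arithmetic only. It assumes 1 TB = 10^12 bytes and load spread evenly over the day, neither of which the slide states, and `UsageMath` is a hypothetical helper.

```java
// Back-of-the-envelope arithmetic on the figures quoted on the slide.
// Assumptions: 1 TB = 10^12 bytes and load spread evenly over the day.
class UsageMath {
    static final double MESSAGES_PER_DAY = 527e9;
    static final double WRITE_BYTES_PER_DAY = 97e12;
    static final double READ_BYTES_PER_DAY = 430e12;
    static final double SECONDS_PER_DAY = 24 * 60 * 60;

    // Average number of messages written per second.
    static double messagesPerSecond() {
        return MESSAGES_PER_DAY / SECONDS_PER_DAY;
    }

    // Average size of one message in bytes, derived from the write volume.
    static double averageMessageBytes() {
        return WRITE_BYTES_PER_DAY / MESSAGES_PER_DAY;
    }

    // How many times each written byte is read on average.
    static double readFanOut() {
        return READ_BYTES_PER_DAY / WRITE_BYTES_PER_DAY;
    }

    public static void main(String[] args) {
        System.out.printf("messages/sec:      %.0f%n", messagesPerSecond());
        System.out.printf("avg message bytes: %.1f%n", averageMessageBytes());
        System.out.printf("read fan-out:      %.2f%n", readFanOut());
    }
}
```

Under those assumptions this works out to roughly 6.1 million messages per second, about 184 bytes per message, and each byte being read about 4.4 times.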
Kafka Usage at LinkedIn
Kafka Usage at LinkedIn
Kafka Usage at LinkedIn
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
  – Pipeline deployment
  – Schema for data cleanliness
  – O(1) ETL
  – Auditing for correctness
• Roadmap
• Q & A
Problems
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?
Standardized Schema on Avro
• Schema
  – Message structure contract
  – Performance gain
• Workflow
  – Check in schema
  – Auto compatibility check
  – Code review
  – “Ship it!”
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
  – Pipeline deployment
  – Schema for data cleanliness
  – O(1) ETL
  – Auditing for correctness
• Roadmap
• Q & A
Kafka to Hadoop
Hadoop ETL (Camus)
• Map/Reduce job does data load
• One job loads all events
• ~10 minutes on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
• Open sourced:
– https://github.com/linkedin/camus
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
  – Pipeline deployment
  – Schema for data cleanliness
  – O(1) ETL
  – Auditing for correctness
• Roadmap
• Q & A
Does it really work?
“All published messages must be delivered to all consumers (quickly)”
Audit Trail
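The slides do not describe how the audit trail is implemented. One plausible approach, sketched here purely as an assumption, is for every tier to count the messages it handles per time window, so that the counts can be compared between tiers. `AuditTrail`, the tier names, and the window size are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical audit counter; the slides do not describe the actual implementation.
class AuditTrail {
    private final long windowMs;
    // tier name -> (window start timestamp -> message count)
    private final Map<String, Map<Long, Long>> countsByTier = new HashMap<>();

    AuditTrail(long windowMs) {
        this.windowMs = windowMs;
    }

    // Called by a tier for every message it handles, using the message's creation timestamp
    // so that all tiers place the same message in the same window.
    void record(String tier, long messageTimestampMs) {
        long windowStart = (messageTimestampMs / windowMs) * windowMs;
        countsByTier.computeIfAbsent(tier, t -> new TreeMap<>())
                .merge(windowStart, 1L, Long::sum);
    }

    long count(String tier, long windowStart) {
        Map<Long, Long> windows = countsByTier.get(tier);
        if (windows == null) {
            return 0L;
        }
        Long count = windows.get(windowStart);
        return count == null ? 0L : count;
    }

    // Fraction of messages counted at the source tier that were also counted downstream.
    double completeness(String sourceTier, String destTier, long windowStart) {
        long produced = count(sourceTier, windowStart);
        return produced == 0 ? 1.0 : (double) count(destTier, windowStart) / produced;
    }

    public static void main(String[] args) {
        AuditTrail audit = new AuditTrail(600_000L); // 10-minute windows
        for (int i = 0; i < 1000; i++) {
            audit.record("producer", i);
            if (i % 100 != 0) { // simulate a few messages that never arrive
                audit.record("hadoop", i);
            }
        }
        System.out.println("completeness: " + audit.completeness("producer", "hadoop", 0L));
    }
}
```

A completeness value below 1.0 for a closed window would indicate messages lost or delayed somewhere between the two tiers.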
More Features in Kafka 0.8
• Intra-cluster replication (0.8.0)
  – High availability
  – Reduced latency
• Log compaction (0.8.1)
  – State storage
• Operational tools (0.8.2)
  – Topic management
  – Automated leader rebalance
  – etc.
Check out our page for more: http://kafka.apache.org/
Kafka 0.9
• Clients rewrite
  – Remove ZK dependency
  – Even better throughput
• Security
  – More operability, multi-tenancy ready
• Transactional messaging
  – From at-least-once to exactly-once
Check out our page for more: http://kafka.apache.org/
Kafka Users: Maybe You're Next?
Acknowledgements
Questions?
Guozhang Wang
guwang@linkedin.com
www.linkedin.com/in/guozhangwang
Backup Slides
Real-time Analysis with Kafka
• Analytics from Hadoop can be slow
  – Production -> Kafka: tens of milliseconds
  – Kafka -> Hadoop: < 1 minute
  – ETL in Hadoop: ~45 minutes
  – MapReduce in Hadoop: maybe hours
Real-time Analysis with Kafka
• Solution No. 1: consume directly from Kafka
• Solution No. 2: storage other than HDFS
  – Spark, Shark
  – Pinot, Druid, FastBit
• Solution No. 3: stream processing
  – Apache Samza
  – Storm
How Fast can Kafka Go?
• Bottleneck #1: network bandwidth
  – Producer: 100 Mb/s for 1 Gigabit Ethernet
  – Consumers can be slower because of multiple subscribers
• Bottleneck #2: disk space
  – Data may be deleted before it is consumed at peak time
  – Configurable time/size-based retention policy
• Bottleneck #3: ZooKeeper
  – Mainly due to offset commits; will be lifted in 0.9
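The time/size-based retention policy mentioned under Bottleneck #2 can be sketched as follows. `RetentionPolicy` and its `Segment` type are hypothetical and written only for illustration. In a Kafka broker, the corresponding limits are set through configuration such as `log.retention.hours` and `log.retention.bytes`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of log retention; segment sizes and timestamps are illustrative.
class RetentionPolicy {
    static final class Segment {
        final long lastModifiedMs;
        final long sizeBytes;

        Segment(long lastModifiedMs, long sizeBytes) {
            this.lastModifiedMs = lastModifiedMs;
            this.sizeBytes = sizeBytes;
        }
    }

    private final long retentionMs;
    private final long retentionBytes;

    RetentionPolicy(long retentionMs, long retentionBytes) {
        this.retentionMs = retentionMs;
        this.retentionBytes = retentionBytes;
    }

    // Delete segments from the head (oldest first) while the oldest one is past the time
    // limit or the log as a whole is over the size limit. Returns the number deleted.
    int enforce(Deque<Segment> segments, long nowMs) {
        long totalBytes = 0;
        for (Segment s : segments) {
            totalBytes += s.sizeBytes;
        }
        int deleted = 0;
        while (!segments.isEmpty()) {
            Segment oldest = segments.peekFirst();
            boolean tooOld = nowMs - oldest.lastModifiedMs > retentionMs;
            boolean tooBig = totalBytes > retentionBytes;
            if (!tooOld && !tooBig) {
                break;
            }
            totalBytes -= oldest.sizeBytes;
            segments.removeFirst();
            deleted++;
        }
        return deleted;
    }

    public static void main(String[] args) {
        Deque<Segment> segments = new ArrayDeque<>();
        for (int hour = 0; hour < 10; hour++) {
            segments.add(new Segment(hour * 3_600_000L, 1_000_000L));
        }
        RetentionPolicy policy = new RetentionPolicy(4 * 3_600_000L, 3_000_000L);
        int deleted = policy.enforce(segments, 10 * 3_600_000L);
        System.out.println("deleted " + deleted + " segments, kept " + segments.size());
    }
}
```

Retention is independent of consumption, which is why a consumer that falls far enough behind at peak time can find its data already deleted.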
Intra-cluster Replication
• Pick CA (consistency and availability) within a datacenter (failover < 10 ms)
  – Network partitions are rare
  – Latency is less of an issue
• Separate data replication and consensus
  – Consensus => ZooKeeper
  – Replication => primary-backup (f replicas tolerate f-1 failures)
• Configurable ACK (durability vs. latency)
• More details:
  – http://www.slideshare.net/junrao/kafka-replication-apachecon2013
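The primary-backup idea and the configurable ACK trade-off can be modeled with a toy example. `ReplicatedPartition` and its `Ack` modes are hypothetical names chosen for illustration and do not reproduce Kafka's replication protocol. The sketch shows that data written to f replicas stays readable after f-1 of them fail, and that the ack mode decides how many copies a producer waits for.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical primary-backup model; names and ack modes are illustrative, not Kafka's protocol.
class ReplicatedPartition {
    enum Ack { NONE, LEADER, ALL }

    private final List<List<String>> replicas = new ArrayList<>();
    private final List<Boolean> alive = new ArrayList<>();

    ReplicatedPartition(int replicationFactor) {
        for (int i = 0; i < replicationFactor; i++) {
            replicas.add(new ArrayList<>());
            alive.add(true);
        }
    }

    // Index of the first live replica, which acts as the primary; -1 if none survive.
    int leader() {
        return alive.indexOf(true);
    }

    void fail(int replica) {
        alive.set(replica, false);
    }

    // Write to every live replica and return how many copies the producer waits for
    // before the write is acknowledged: more copies means more durability and more latency.
    int append(String message, Ack ack) {
        if (leader() < 0) {
            throw new IllegalStateException("no live replica");
        }
        int live = 0;
        for (int i = 0; i < replicas.size(); i++) {
            if (alive.get(i)) {
                replicas.get(i).add(message);
                live++;
            }
        }
        switch (ack) {
            case NONE:
                return 0;
            case LEADER:
                return 1;
            default:
                return live;
        }
    }

    List<String> readFromLeader() {
        return new ArrayList<>(replicas.get(leader()));
    }

    public static void main(String[] args) {
        ReplicatedPartition partition = new ReplicatedPartition(3);
        partition.append("m1", Ack.ALL);
        partition.fail(0);
        partition.fail(1);
        // With 3 replicas, 2 failures are tolerated: the last replica still holds the data.
        System.out.println("leader=" + partition.leader() + " data=" + partition.readFromLeader());
    }
}
```

By contrast, a majority-quorum scheme needs 2f+1 replicas to tolerate f failures, which is one reason to keep consensus (ZooKeeper) separate from data replication.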
Replication Architecture
[Figure: two producers, two consumers, four brokers, and ZK]

VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Dernier (20)

MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 

Apache Kafka at LinkedIn

  • 1. ©2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure Apache Kafka at LinkedIn
  • 2. About Me
  • 3. Agenda
      • Overview of Kafka
      • Kafka Design
      • Kafka Usage at LinkedIn
      • Roadmap
      • Q & A
  • 4. Why Did We Build Kafka?
  • 5. We Have a Lot of Data
      • User activity tracking
          • Page views, ad impressions, etc.
      • Server logs and metrics
          • Syslogs, request rates, etc.
      • Messaging
          • Emails, news feeds, etc.
      • Computation derived
          • Results of Hadoop / data warehousing, etc.
  • 6. .. and We Build Products on Data
  • 7. Newsfeed
  • 8. Recommendation: People you may know (HADOOP SUMMIT 2013)
  • 9. Recommendation
  • 10. Search
  • 11. Metrics and Monitoring: System and application metrics/logging
  • 12. .. and a LOT of Monitoring
  • 13. The Problem: How to integrate this variety of data and make it available to all products?
  • 14. Life back in 2010: Point-to-Point Pipelines
  • 15. Example: User Activity Data Flow
  • 16. What We Want
      • A centralized data pipeline
  • 17. Apache Kafka: We tried some off-the-shelf systems, but…
  • 18. What We REALLY Want
      • A centralized data pipeline that is
          • Elastically scalable
          • Durable
          • High-throughput
          • Easy to use
  • 19. Apache Kafka
      • A distributed pub-sub messaging system
      • Scale-out from the ground up
      • Persistent to disk
      • High throughput (tens of MB/sec per server)
  • 20. Life Since Kafka in Production: Apache Kafka
      • Developed and maintained by 5 devs + 2 SREs
  • 21. Agenda (as on slide 3)
  • 22. Key Idea #1: Data-parallelism leads to scale-out
  • 23. Distribute Clients across Partitions
      • Produce/consume requests are randomly balanced among brokers
  • 24. Key Idea #2: Disks are fast when used sequentially
  • 25. Store Messages as a Log
      • Appends are effectively O(1)
      • Reads from a known offset are still fast when cached
      • (Diagram: partition i of topic A drawn as a log of numbered offsets, 3, 4, 5, ..., 12, ...; the producer writes at the tail, while Consumer1 and Consumer2 each read from offset 7)
  • 26. Key Idea #3: Batching makes best use of network/IO
  • 27. Batch Transfer
      • Batched send and receive
      • Batched compression
      • No message caching in JVM
      • Zero-copy from file to socket (Java NIO)
  • 28. The API (0.8)
      • Producer: send(topic, message)
      • Consumer: Iterable stream = createMessageStreams(…).get(topic); for (message : stream) { // process the message }
  • 29. Agenda
      • Overview of Kafka
      • Kafka Design
      • Kafka Usage at LinkedIn
          • Pipeline deployment
          • Schema for data cleanliness
          • O(1) ETL
          • Auditing for correctness
      • Roadmap
      • Q & A
  • 30. Kafka Usage at LinkedIn
      • Mainly used for tracking user-activity and metrics data
      • 16 to 32 brokers in each cluster (615+ total brokers)
      • 527 billion messages/day
      • 7500+ topics, 270k+ partitions
      • Byte rates:
          • Writes: 97 TB/day
          • Reads: 430 TB/day
  • 31. Kafka Usage at LinkedIn
  • 32. Kafka Usage at LinkedIn
  • 33. Kafka Usage at LinkedIn
  • 34. Agenda (as on slide 29)
  • 35. Problems
      • Hundreds of message types
      • Thousands of fields
      • What do they all mean?
      • What happens when they change?
  • 36. Standardized Schema on Avro
      • Schema
          • Message structure contract
          • Performance gain
      • Workflow
          • Check in schema
          • Auto compatibility check
          • Code review
          • “Ship it!”
  • 37. Agenda (as on slide 29)
  • 38. Kafka to Hadoop
  • 39. Hadoop ETL (Camus)
      • Map/Reduce job does data load
      • One job loads all events
      • ~10 minute ETA on average from producer to HDFS
      • Hive registration done automatically
      • Schema evolution handled transparently
      • Open sourced: https://github.com/linkedin/camus
  • 40. Agenda (as on slide 29)
  • 41. Does it really work? “All published messages must be delivered to all consumers (quickly)”
  • 43. More Features in Kafka 0.8
      • Intra-cluster replication (0.8.0)
          • High availability
          • Reduced latency
      • Log compaction (0.8.1)
          • State storage
      • Operational tools (0.8.2)
          • Topic management
          • Automated leader rebalance
          • etc.
      • Check out our page for more: http://kafka.apache.org/
  • 44. Kafka 0.9
      • Clients rewrite
          • Remove ZK dependency
          • Even better throughput
      • Security
          • More operability, multi-tenancy ready
      • Transactional messaging
          • From at-least-once to exactly-once
      • Check out our page for more: http://kafka.apache.org/
  • 45. Kafka Users: Next Maybe You?
  • 46. Acknowledgements
  • 49. Real-time Analysis with Kafka
      • Analytics from Hadoop can be slow
          • Production -> Kafka: tens of milliseconds
          • Kafka -> Hadoop: < 1 minute
          • ETL in Hadoop: ~45 minutes
          • MapReduce in Hadoop: maybe hours
  • 50. Real-time Analysis with Kafka
      • Solution No. 1: directly consuming from Kafka
      • Solution No. 2: storage other than HDFS
          • Spark, Shark
          • Pinot, Druid, FastBit
      • Solution No. 3: stream processing
          • Apache Samza
          • Storm
  • 51. How Fast Can Kafka Go?
      • Bottleneck #1: network bandwidth
          • Producer: 100 MB/s for 1 Gigabit Ethernet
          • Consumers can be slower because of multiple subscribers
      • Bottleneck #2: disk space
          • Data may be deleted before it is consumed at peak time
          • Configurable time/size-based retention policy
      • Bottleneck #3: ZooKeeper
          • Mainly due to offset commits; will be lifted in 0.9
  • 52. Intra-cluster Replication
      • Pick CA within a datacenter (failover < 10 ms)
          • Network partitions are rare
          • Latency is less of an issue
      • Separate data replication and consensus
          • Consensus => ZooKeeper
          • Replication => primary-backup (f replicas tolerate f-1 failures)
      • Configurable ACK (durability vs. latency)
      • More details: http://www.slideshare.net/junrao/kafka-replication-apachecon2013
  • 53. Replication Architecture (diagram: producers and consumers connected to four brokers, coordinated through ZK)
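
Slide 25 ("Store Messages as a Log") can be illustrated with a small sketch. The class below is a hypothetical, in-memory stand-in for one partition's log, not Kafka's actual on-disk implementation: `PartitionLog`, `append`, and `read` are names invented here for illustration. It only shows the idea that writes go to the tail and each consumer reads from an offset it tracks itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for one partition's log (illustration only). */
class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    /** Appends at the tail and returns the offset assigned to the message. */
    long append(String message) {
        messages.add(message);            // amortized O(1): nothing is reordered or re-indexed
        return messages.size() - 1;
    }

    /** Returns up to maxMessages starting at offset; empty if the offset is out of range. */
    List<String> read(long offset, int maxMessages) {
        if (offset < 0 || offset >= messages.size() || maxMessages <= 0) {
            return new ArrayList<>();
        }
        int from = (int) offset;
        // long arithmetic avoids int overflow when maxMessages is very large
        int to = (int) Math.min((long) messages.size(), offset + maxMessages);
        return new ArrayList<>(messages.subList(from, to));
    }

    public static void main(String[] args) {
        PartitionLog log = new PartitionLog();
        for (String m : new String[] {"a", "b", "c"}) {
            log.append(m);
        }
        // Two consumers read independently; the log keeps no per-consumer state.
        System.out.println(log.read(1, 10));
        System.out.println(log.read(0, 2));
    }
}
```

Because the log holds no per-consumer state in this sketch, adding a second consumer costs nothing on the write path, which is the property the diagram on slide 25 shows with two consumers reading from the same offset.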

Editor's notes

  1. As a data-serving website, LinkedIn has a lot of data
  2. Based on relevance
  3. We have this variety of data, and we need to build all these products around it.
  4. We have this variety of data, and we need to build all these products around it.
  5. Messaging: ActiveMQ; User activity: in-house log aggregation; Logging: Splunk; Metrics: JMX => Zenoss; Database data: Databus, custom ETL
  6. ActiveMQ: they do not fly
  7. Now you may be wondering why it works so well. For example, how can it be highly durable, persisting data to disk, while still maintaining high throughput?
  8. Topic = message stream Topic has partitions, partitions are distributed to brokers
  9. Do not be afraid of disks
  10. File system caching
  11. And finally, after all these tricks, the client interface we expose to users is very simple.
  12. Now I will switch gears and talk a little about Kafka usage at LinkedIn.
  13. October 21st.
  14. Multi-colo
  15. 99.99%
  16. 0.8.2: delete topic, automated leader rebalancing, controlled shutdown, offset management, parallel recovery, min.isr and clean leader election
  17. Non-Java / Scala clients: C / C++ / .NET, Go, Clojure, Ruby, Node.js, PHP, Python, Erlang, HTTP REST, command line, etc. See https://cwiki.apache.org/confluence/display/KAFKA/Clients
      • Python: pure Python implementation with full protocol support. Consumer and Producer implementations included; GZIP and Snappy compression supported.
      • C: high-performance C library with full protocol support.
      • C++: native C++ library with protocol support for Metadata, Produce, Fetch, and Offset.
      • Go (aka golang): pure Go implementation with full protocol support. Consumer and Producer implementations included; GZIP and Snappy compression supported.
      • Ruby: pure Ruby. Consumer and Producer implementations included; GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2.
      • Clojure: Clojure DSL for the Kafka API.
      • JavaScript (NodeJS): NodeJS client in a pure JavaScript implementation.
      • stdin & stdout
  18. Non-Java / Scala clients: C / C++ / .NET, Go, Clojure, Ruby, Node.js, PHP, Python, Erlang, HTTP REST, command line, etc.
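
Key Idea #3 in the slides ("Batching makes best use of network/IO") applies to any client, whatever its language. The following is a minimal sketch of client-side batching under stated assumptions: `BatchingSender` and its `transport` callback are hypothetical names, and the callback merely stands in for one network request. It is not the Kafka producer's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical client-side batcher: buffers messages and hands them off in groups. */
class BatchingSender {
    private final int batchSize;
    private final Consumer<List<String>> transport;   // stands in for one network request
    private final List<String> buffer = new ArrayList<>();

    BatchingSender(int batchSize, Consumer<List<String>> transport) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.transport = transport;
    }

    /** Buffers the message and sends the whole buffer once it reaches batchSize. */
    void send(String message) {
        buffer.add(message);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered as one request; does nothing if the buffer is empty. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        transport.accept(new ArrayList<>(buffer));     // copy so the receiver owns the batch
        buffer.clear();
    }

    public static void main(String[] args) {
        List<List<String>> requests = new ArrayList<>();
        BatchingSender sender = new BatchingSender(3, requests::add);
        for (int i = 0; i < 7; i++) {
            sender.send("m" + i);
        }
        sender.flush();                                // push out the final partial batch
        System.out.println(requests.size() + " requests for 7 messages");
    }
}
```

With a batch size of 3, seven messages travel in three requests instead of seven. A real client would likely also flush on a timer so that a slow trickle of messages is not held back indefinitely; that part is omitted here.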