Data processing at LinkedIn with Apache Kafka
Jeff Weiner
Chief Executive Officer
Joel Koshy
Sr. Staff Software Engineer
Kartik Paramasivam
Director, Software Engineering
Outline
Kafka growth at LinkedIn
Canonical use cases
Search, analytics and storage platforms
Data pipelines
Stream processing
Conclusion
Q&A
Kafka at LinkedIn over the years
Canonical use cases
Data movement
• Tracking: who did what, when?
• Metrics/logs: monitoring and alerting
• Queuing: ad hoc messaging
• Data deployment: offline-to-online bridge
Search, analytics and storage platforms
Pinot
Distributed near real-time OLAP datastore with SQL query interface
• 100B documents
• 1B documents ingested per day
• 100M queries per day
• Tens of ms latency
Pinot
SELECT weeksSinceEpochSunday,
distinctCount(viewerId)
FROM profileViewEvents
WHERE vieweeId=myMID
AND daysSinceEpoch >= 16624
AND daysSinceEpoch <= 16714
GROUP BY weeksSinceEpochSunday
TOP 20
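The `daysSinceEpoch` bounds in the query above are plain day counts from 1970-01-01. As a small sketch of how such literals could be derived or read back, the helper below uses `java.time`. The class and method names are illustrative and not part of Pinot.

```java
import java.time.LocalDate;

// Hypothetical helper (not a Pinot API): converts between calendar dates
// and the daysSinceEpoch values used as filter bounds in the query.
final class EpochDays {
    private EpochDays() {}

    // Days elapsed since 1970-01-01 for the given date.
    static long daysSinceEpoch(LocalDate date) {
        return date.toEpochDay();
    }

    // Inverse mapping, handy for reading a literal back as a date.
    static LocalDate fromDaysSinceEpoch(long days) {
        return LocalDate.ofEpochDay(days);
    }
}
```

With this helper, `EpochDays.fromDaysSinceEpoch(16624)` is 2015-07-08 and `EpochDays.fromDaysSinceEpoch(16714)` is 2015-10-06, so the example query spans roughly three months of profile views.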
Galene
Search-as-a-service
• People search
• Job search
• Internal code search
• … and more
Galene
• Base index generated weekly (offline)
• Live updater pulls from Kafka and Brooklin (DB changes)
• Periodically combine incremental snapshot and live update buffer
Distributed replicated NoSQL store
[Figure: Espresso architecture. Labels shown: HTTP client, routers with a routing table, storage nodes (each with an API server and MySQL), Apache Helix, ZooKeeper; separate data and control paths.]
• Member profiles
• InMail
• Ad platforms
• Invites, endorsements, etc.
Espresso replication (before)
• MySQL (per-instance) replication
• Partitions unnecessarily share fate
• Poor resource utilization
• Cluster expansions are hard
[Figure: two groups of three nodes. In the first group, Nodes 1-3 each hold P1 P2 P3; in the second, Nodes 1-3 each hold P4 P5 P6. Legend: Master, Slave, Offline.]
Espresso 1.0 Kafka-based replication
[Figure: Helix with EXTERNALVIEW (e.g. P4: Master: 1, Slave: 3) and LIVEINSTANCES (Node 1, Node 2, Node 3). Partitions P1-P12 are spread across three nodes (Node 1: P1-P6, P9, P10; Node 2: P1, P2, P5-P8, P11, P12; Node 3: P3, P4, P7-P12), with Kafka shown alongside the nodes.]
Espresso replication how-to
Durability
• RF = 3
• min.insync.replicas = 2
• Disable unclean leader election
• Rack awareness
• acks = "all"
• block.on.buffer.full = true
• retries = Integer.MAX_VALUE
Low latency
• Bump up num.replica.fetchers
• max.block.ms = 0
• Reduce linger.ms
• Use LZ4
Ordering
• max.in.flight.requests.per.connection = 1
• close(0) in callback on send failure
Nice-haves
• Large message support
• JBOD (RF3 is costly with RAID-10)
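As a rough sketch, the how-to settings can be collected as plain key/value pairs. The key names below follow Apache Kafka's documented configuration names, and the ordering setting is written with its full producer key, `max.in.flight.requests.per.connection`. The class name, the `linger.ms` and `num.replica.fetchers` values, and the rack id are assumptions, since the slide gives no numbers for them. `block.on.buffer.full` is left out because later producer releases deprecated it in favor of `max.block.ms`.

```java
import java.util.Properties;

// Sketch only: gathers the how-to settings as plain properties.
// Values marked "assumed" are placeholders, not figures from the talk.
final class EspressoReplicationConfig {
    private EspressoReplicationConfig() {}

    // Producer-side settings: durability, low latency, ordering.
    static Properties producerProps() {
        Properties p = new Properties();
        p.setProperty("acks", "all");
        p.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        p.setProperty("max.in.flight.requests.per.connection", "1");
        p.setProperty("compression.type", "lz4");
        p.setProperty("linger.ms", "0");    // assumed: "reduce linger.ms"
        p.setProperty("max.block.ms", "0");
        return p;
    }

    // Broker/topic-side settings: replication and leader election.
    static Properties brokerProps() {
        Properties p = new Properties();
        p.setProperty("default.replication.factor", "3");
        p.setProperty("min.insync.replicas", "2");
        p.setProperty("unclean.leader.election.enable", "false");
        p.setProperty("num.replica.fetchers", "4"); // assumed: "bump up"
        p.setProperty("broker.rack", "rack-1");     // assumed rack id
        return p;
    }
}
```

With `acks=all`, RF = 3 and `min.insync.replicas=2`, a write is acknowledged only once at least two replicas have it, which is why these three settings are listed together under durability.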
Data pipelines
tee'ing change-capture from replication stream
Streaming data pipeline
Brooklin
Continuous data movement between various sources and destinations
Brooklin architecture
Brooklin client options (at LinkedIn)
Stream processing
Stream processing technologies
Yes, it is crowded!
Distributed stream processing framework
Samza
• Top-level Apache project since 2014
• In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely, etc.
• Increase in production usage at LinkedIn – from ~20 to ~350 applications in two years
Stateless processing – message in, message out
• Schema translation
• Data transformation (e.g., ID obfuscation)
Stateless processing – accessing adjunct data
Key issues:
• Accidental DoS of member DB
• Dealing with spikes
• I/O makes performance slow
Stateless processing – locally accessible adjunct data
Pros
• Awesome performance at low cost (100x faster)
• No issues with accidental DoS
• No need to over-provision the remote database
Cons
• Does not work for cases where the adjunct data is large and cannot be co-partitioned with the input stream
• Auto-scaling the processor gets trickier
• Repartitioning the input Kafka topic can mess up local state
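The idea of locally accessible adjunct data can be sketched in plain Java as follows. This is an illustration under assumptions, not Samza code: the class and method names are hypothetical, the adjunct data is a member-to-title map, and the change stream is assumed to be partitioned by the same key as the input events so that each task only needs its own slice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: a per-task, in-memory copy of the adjunct data, kept
// current from a change-capture stream partitioned like the input events.
final class LocalAdjunctJoin {
    private final Map<String, String> titleByMember = new HashMap<>();

    // Apply one change-capture record: upsert, or delete when title is null.
    void onProfileChange(String memberId, String title) {
        if (title == null) {
            titleByMember.remove(memberId);
        } else {
            titleByMember.put(memberId, title);
        }
    }

    // Enrich an input event with a local lookup; no remote database call.
    Optional<String> enrich(String memberId, String eventType) {
        String title = titleByMember.get(memberId);
        return title == null
                ? Optional.empty()
                : Optional.of(eventType + " by " + memberId + " (" + title + ")");
    }
}
```

Because every lookup is served from local state, a traffic spike on the input stream cannot turn into a flood of queries against the member database. The cost is the one listed under the cons: the local copy is only valid while the partitioning of the input and change streams stays aligned.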
Stateless processing – async data access
Synchronous API (existing)
// execute on multiple threads
public interface StreamTask {
  // Process one message; the call blocks until processing is done.
  void process(IncomingMessageEnvelope envelope,
               MessageCollector collector,
               TaskCoordinator coordinator) throws Exception;
}
Asynchronous API
// call-back based
public interface AsyncStreamTask {
  // Process the message with asynchronous calls and
  // fire the callback upon completion.
  void processAsync(IncomingMessageEnvelope envelope,
                    MessageCollector collector,
                    TaskCoordinator coordinator,
                    TaskCallback callback);
}
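To make the callback-based pattern concrete without depending on Samza, here is a minimal sketch using only the JDK. `DoneCallback` and `AsyncLookupTask` are hypothetical stand-ins, not the Samza types, and the remote lookup is assumed to be any non-blocking client that returns a `CompletableFuture`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;
import java.util.function.Function;

// Stand-in for a completion callback; NOT the Samza TaskCallback type.
interface DoneCallback {
    void complete();
    void failure(Throwable t);
}

// Hypothetical task that enriches a message via a non-blocking remote call.
final class AsyncLookupTask {
    private final Function<String, CompletableFuture<String>> remoteLookup;
    private final Consumer<String> emit;

    AsyncLookupTask(Function<String, CompletableFuture<String>> remoteLookup,
                    Consumer<String> emit) {
        this.remoteLookup = remoteLookup;
        this.emit = emit;
    }

    // Returns immediately; the callback fires when the lookup finishes.
    void processAsync(String memberId, DoneCallback callback) {
        remoteLookup.apply(memberId).whenComplete((profile, error) -> {
            if (error != null) {
                callback.failure(error);   // report the failed lookup
            } else {
                emit.accept(memberId + ":" + profile);
                callback.complete();       // message fully processed
            }
        });
    }
}
```

The processing thread is not blocked while the lookup is in flight, so many requests can be outstanding at once. The framework side, which is not shown here, would use the callback to decide when a message may be checkpointed.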
Stateful processing
Aggregations, windowed joins, etc.
Managing state
● Full state checkpointing
  ● Simply does not scale for non-trivial application state
  ● … but makes it easier to achieve “repeatable results” when recovering from failure
● Incremental state checkpointing
  ● Scales to any type of application state
  ● Achieving repeatable results requires additional techniques (e.g. variants of de-dup or transaction support)
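Incremental state checkpointing can be illustrated with a store that records each write as a delta and rebuilds itself by replaying those deltas. The sketch below is an assumption-laden toy: the names are hypothetical, and the changelog is an in-memory list standing in for a durable log such as a Kafka topic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: every write updates local state and is also appended
// to a changelog. The list here stands in for a durable, replayable log.
final class ChangelogBackedStore {
    // One changelog record: the new value for a key.
    static final class Change {
        final String key;
        final long value;
        Change(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Long> local = new HashMap<>();
    private final List<Change> changelog;

    ChangelogBackedStore(List<Change> changelog) {
        this.changelog = changelog;
        // Recovery: rebuild local state by replaying the log in order.
        for (Change c : changelog) {
            local.put(c.key, c.value);
        }
    }

    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new Change(key, value)); // only the delta is persisted
    }

    long get(String key) {
        return local.getOrDefault(key, 0L);
    }
}
```

Only the changed key is written on each update, so the cost does not grow with total state size. After a failure, however, inputs processed since the last input checkpoint may be applied again, which is where the de-dup or transactional techniques mentioned above come in.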
Managing local state
• Durably store “host-to-task” mapping
• Minimize reseeding during failures, adding/removing capacity
Samza processing pipeline
• Natural back-pressure
• Per-stage checkpointing instead of global checkpointing
• Cost considerations – new Kafka feature (KIP-107: deleteDataBefore)
Stream processing vs. batch processing
It is all just data processing
Scenario: title standardization
Dealing with changes in ML models
• Re-evaluate titles for all LinkedIn members with new ML model
Batch processing in Samza
Samza HDFS support
(REPROCESSING, EXPERIMENTATION, LAMBDA ARCH, ETC.)
Samza HDFS benchmark
Profile count, group-by country
500 files, 250 GB
Samza: a common API for data processing
● Application code does not change
  ● Stream processing
  ● Batch data processing
● Configurable input sources and sinks (e.g. Kafka, Kinesis, Event Hubs, HDFS, etc.)
Fluent API (0.13 release)
public class PageViewCounterExample implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    MessageStream<PageViewEvent> pageViewEvents = graph.createInputStream("myinput");
    MessageStream<MyStreamOutput> outputStream = graph.createOutputStream("myoutput");
    // Count page views per member over 10-second tumbling windows.
    pageViewEvents
        .partitionBy(m -> m.getMessage().memberId)
        .window(Windows.<PageViewEvent, String, Integer>keyedTumblingWindow(
            m -> m.getMessage().memberId, Duration.ofSeconds(10), (m, c) -> c + 1))
        .map(MyStreamOutput::new)
        .sendTo(outputStream);
  }

  // Run the application in-process with a local runner.
  public static void main(String[] args) throws Exception {
    CommandLine cmdLine = new CommandLine();
    Config config = cmdLine.loadConfig(cmdLine.parser().parse(args));
    ApplicationRunner localRunner = ApplicationRunner.getLocalRunner(config);
    localRunner.run(new PageViewCounterExample());
  }
}
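For readers unfamiliar with tumbling windows, the following plain-Java sketch shows the kind of per-key, fixed-window count the example expresses. It is not the Samza API: the class is hypothetical, and it assigns events by a non-negative timestamp supplied by the caller, which is an assumption about the time semantics.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a keyed tumbling-window count; not Samza code.
final class TumblingWindowCounter {
    private final long windowMillis;
    // window start (ms) -> key -> count
    private final Map<Long, Map<String, Integer>> counts = new TreeMap<>();

    TumblingWindowCounter(Duration size) {
        this.windowMillis = size.toMillis();
    }

    // Assign the event to the fixed, non-overlapping window that contains
    // its timestamp, then increment that window's count for the key.
    void add(String key, long timestampMillis) {
        long windowStart = timestampMillis - (timestampMillis % windowMillis);
        counts.computeIfAbsent(windowStart, w -> new TreeMap<>())
              .merge(key, 1, Integer::sum);
    }

    // Count for a key in the window starting at windowStart (0 if none).
    int count(long windowStart, String key) {
        return counts.getOrDefault(windowStart, Collections.emptyMap())
                     .getOrDefault(key, 0);
    }
}
```

With a 10-second window, events at 1,000 ms and 9,999 ms fall into the window starting at 0, while an event at 10,000 ms opens the next window. The `merge(key, 1, Integer::sum)` call plays the role of the `(m, c) -> c + 1` fold in the example.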
Deployment options
Standalone
• Full control over application lifecycle
• Can be part of a bigger application
• ZK-based coordination
YARN-based
• Dashboard
• Management service
• Monitoring/alerts
• Long-running service in YARN
Conclusion
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 

Dernier (20)

cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 

Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka

  • 1. Data processing at LinkedIn with Apache Kafka. Jeff Weiner, Chief Executive Officer; Joel Koshy, Sr. Staff Software Engineer; Kartik Paramasivam, Director, Software Engineering
  • 2. Outline: Kafka growth at LinkedIn; canonical use cases; search, analytics and storage platforms; data pipelines; stream processing; conclusion; Q&A
  • 3. Kafka at LinkedIn over the years
  • 5. Data movement • Tracking: who did what, when? • Metrics/logs: monitoring and alerting • Queuing: ad hoc messaging • Data deployment: offline-online bridge
  • 7. Pinot: distributed near real-time OLAP datastore with SQL query interface • 100B documents • 1B documents ingested per day • 100M queries per day • Tens of ms latency
  • 9. Pinot
        SELECT weeksSinceEpochSunday, distinctCount(viewerId)
        FROM profileViewEvents
        WHERE vieweeId = myMID
          AND daysSinceEpoch >= 16624
          AND daysSinceEpoch <= 16714
        GROUP BY weeksSinceEpochSunday
        TOP 20
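The `daysSinceEpoch` bounds in the query above are easier to read as calendar dates. The following is a minimal sketch, assuming the column counts whole days since 1970-01-01 (the slide does not define it); the class and method names are illustrative only.

```java
import java.time.LocalDate;

class DaysSinceEpochSketch {
    // Assumption: daysSinceEpoch is the number of whole days since 1970-01-01.
    static long toDaysSinceEpoch(LocalDate date) {
        return date.toEpochDay();
    }

    static LocalDate fromDaysSinceEpoch(long days) {
        return LocalDate.ofEpochDay(days);
    }

    public static void main(String[] args) {
        // Lower and upper bounds used in the query on the slide.
        System.out.println(fromDaysSinceEpoch(16624));
        System.out.println(fromDaysSinceEpoch(16714));
    }
}
```

Under that assumption the query window runs from 8 July 2015 to 6 October 2015, about 13 weeks of profile views.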
  • 10. Pinot
  • 11. Pinot
  • 12. Galene: search-as-a-service • People search • Job search • Internal code search • … and more
  • 13. Galene • Base index generated weekly (offline) • Live updater pulls from Kafka and Brooklin (DB changes) • Periodically combine incremental snapshot and live update buffer
  • 14. Distributed replicated NoSQL store [architecture diagram: HTTP clients; routers, each with a routing table; storage nodes, each with an API server and MySQL; Apache Helix and ZooKeeper; data and control paths]
  • 15. Distributed replicated NoSQL store • Member profiles • InMail • Ad platforms • Invites, endorsements, etc.
  • 16. Espresso replication (before) • MySQL (per-instance) replication • Partitions unnecessarily share fate • Poor resource utilization • Cluster expansions are hard [diagram: two groups of three nodes; every node in the first group holds P1, P2, P3 and every node in the second holds P4, P5, P6; legend: master, slave, offline]
  • 17. Espresso 1.0 Kafka-based replication [diagram: Helix EXTERNALVIEW (e.g. P4: master on node 1, slave on node 3) and LIVEINSTANCES (nodes 1, 2, 3); partitions P1 to P12 spread across three nodes; replication flows through Kafka]
  • 22. Espresso replication how-to
      Durability: RF = 3; min.insync.replicas = 2; disable unclean leader election; rack awareness; acks = "all"; block.on.buffer.full = true; retries = Integer.MAX_VALUE
      Low latency: bump up num.replica.fetchers; max.block.ms = 0; reduce linger.ms; use LZ4
      Ordering: max.inflight.requests = 1; close(0) in callback on send failure
      Nice-to-haves: large message support; JBOD (RF 3 is costly with RAID-10)
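The settings on this slide can be collected into plain `Properties` objects. This is only a sketch under stated assumptions: the slide's "max.inflight.requests" is taken to mean the producer key `max.in.flight.requests.per.connection`, RF = 3 is expressed through the broker default `default.replication.factor`, and the values for `linger.ms` and `num.replica.fetchers` are placeholders, since the slide gives only a direction ("reduce", "bump up"). `block.on.buffer.full` is left out because it was deprecated in favor of `max.block.ms`.

```java
import java.util.Properties;

class ReplicationConfigSketch {
    // Producer-side settings named on the slide.
    static Properties producerProps() {
        Properties p = new Properties();
        p.setProperty("acks", "all");                                 // durability
        p.setProperty("retries", String.valueOf(Integer.MAX_VALUE));  // durability
        p.setProperty("max.in.flight.requests.per.connection", "1");  // ordering
        p.setProperty("max.block.ms", "0");                           // low latency
        p.setProperty("compression.type", "lz4");                     // low latency
        p.setProperty("linger.ms", "0"); // "reduce linger.ms"; 0 is an assumed value
        return p;
    }

    // Broker-side settings named on the slide.
    static Properties brokerProps() {
        Properties p = new Properties();
        p.setProperty("default.replication.factor", "3");
        p.setProperty("min.insync.replicas", "2");
        p.setProperty("unclean.leader.election.enable", "false");
        p.setProperty("num.replica.fetchers", "4"); // "bump up"; 4 is an assumed value
        return p;
    }

    public static void main(String[] args) {
        producerProps().forEach((k, v) -> System.out.println("producer " + k + "=" + v));
        brokerProps().forEach((k, v) -> System.out.println("broker " + k + "=" + v));
    }
}
```

With RF = 3 and `min.insync.replicas` = 2, an `acks=all` write is acknowledged only after at least two replicas have it, so one broker can be lost without losing acknowledged data.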
  • 24. Tee'ing change capture from the replication stream
  • 27. Brooklin: streaming data pipeline. Continuous data movement between various sources and destinations
  • 29. Brooklin client options (at LinkedIn)
  • 33. Samza: distributed stream processing framework • Top-level Apache project since 2014 • In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely, etc. • Increase in production usage at LinkedIn, from ~20 to ~350 applications in two years
  • 34. Stateless processing – message in, message out • Schema translation • Data transformation (e.g., ID obfuscation)
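A message-in, message-out transformation needs nothing beyond the message itself. The sketch below illustrates ID obfuscation as a salted SHA-256 hash; the slide does not say how obfuscation is done at LinkedIn, so the hashing scheme, the salt, and all names here are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ObfuscateIdSketch {
    // Stateless: the output depends only on the input ID and a fixed salt.
    static String obfuscate(String memberId, String salt) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest((salt + ":" + memberId).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(obfuscate("12345", "example-salt"));
    }
}
```

Because no state is kept between messages, any number of instances can run in parallel and each produces the same output for the same input.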
  • 35. Stateless processing – accessing adjunct data. Key issues: • Accidental DoS of member DB • Dealing with spikes • I/O makes performance slow
  • 36. Stateless processing – locally accessible adjunct data
  • 37. Stateless processing – locally accessible adjunct data. Pros: • Awesome performance at low cost (100x faster) • No issues with accidental DoS • No need to over-provision the remote database. Cons: • Does not work for cases where the adjunct data is large and not co-partitionable with the input stream • Auto-scaling the processor gets trickier • Repartitioning the input Kafka topic can mess up local state
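The idea of locally accessible adjunct data can be sketched as an in-process table that is kept current from a change stream and read during event processing. This is not Samza code: the class, the method names, and the record shapes are hypothetical, and a plain `HashMap` stands in for a local store.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class LocalAdjunctDataSketch {
    // Local copy of the adjunct table for this task's partition, keyed by member ID.
    private final Map<Integer, String> profilesByMember = new HashMap<>();

    // Called for each change record of the member database (hypothetical feed).
    void onProfileChange(int memberId, String profile) {
        profilesByMember.put(memberId, profile);
    }

    // Called for each input event: the lookup is an in-process map read, no remote I/O.
    Optional<String> enrich(int memberId, String event) {
        String profile = profilesByMember.get(memberId);
        return profile == null ? Optional.empty() : Optional.of(event + "|" + profile);
    }

    public static void main(String[] args) {
        LocalAdjunctDataSketch task = new LocalAdjunctDataSketch();
        task.onProfileChange(7, "engineer");
        System.out.println(task.enrich(7, "pageview"));
        System.out.println(task.enrich(8, "pageview"));
    }
}
```

The lookup only works if the event for member 7 arrives at the task that holds member 7's profile, which is why the adjunct data has to be co-partitioned with the input stream.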
  • 38. Stateless processing – async data access
      Synchronous API (existing):
        // execute on multiple threads
        public interface StreamTask {
          // process message
          void process(IncomingMessageEnvelope envelope,
                       MessageCollector collector,
                       TaskCoordinator coordinator);
        }
      Asynchronous API:
        // call-back based
        public interface AsyncStreamTask {
          // process message with asynchronous calls;
          // fire callback upon completion
          void processAsync(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator,
                            TaskCallback callback);
        }
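The callback pattern behind the asynchronous API can be shown without Samza on the classpath. In this sketch `Callback` and `lookupProfile` are hypothetical stand-ins (they are not Samza's `TaskCallback` or any real client), and `CompletableFuture` plays the role of a non-blocking remote call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

class AsyncAccessSketch {
    // Hypothetical completion callback; not Samza's TaskCallback.
    interface Callback {
        void complete();
        void failure(Throwable t);
    }

    // Hypothetical non-blocking remote lookup.
    static CompletableFuture<String> lookupProfile(int memberId) {
        return CompletableFuture.supplyAsync(() -> "profile-" + memberId);
    }

    // Starts the lookup and returns immediately; the callback fires on completion.
    static void processAsync(int memberId, AtomicReference<String> output, Callback callback) {
        lookupProfile(memberId).whenComplete((profile, error) -> {
            if (error != null) {
                callback.failure(error);
            } else {
                output.set(profile);
                callback.complete();
            }
        });
    }

    // Drives one message through processAsync and waits for its callback.
    static String runOnce(int memberId) {
        AtomicReference<String> output = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        processAsync(memberId, output, new Callback() {
            @Override public void complete() { done.countDown(); }
            @Override public void failure(Throwable t) { done.countDown(); }
        });
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return output.get();
    }

    public static void main(String[] args) {
        System.out.println(runOnce(42));
    }
}
```

The processing thread is free as soon as `processAsync` returns, so many lookups can be in flight at once instead of one blocked thread per message.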
  • 40. Managing state • Full state checkpointing: simply does not scale for non-trivial application state, but makes it easier to achieve “repeatable results” when recovering from failure • Incremental state checkpointing: scales to any type of application state; achieving repeatable results requires additional techniques (e.g. variants of de-dup or transaction support)
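Incremental checkpointing can be pictured as writing one log entry per state change and rebuilding state by replay. The sketch below is a simplified illustration, not Samza's implementation: an in-memory `List` stands in for a durable changelog such as a Kafka topic, and the `Change` record shape is made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IncrementalCheckpointSketch {
    // One changelog entry per state update (hypothetical record shape).
    static final class Change {
        final String key;
        final long value;
        Change(String key, long value) { this.key = key; this.value = value; }
    }

    private final Map<String, Long> state = new HashMap<>();
    private final List<Change> changelog; // stands in for a durable log

    IncrementalCheckpointSketch(List<Change> changelog) {
        this.changelog = changelog;
    }

    // Each update appends only the changed entry, never the whole state.
    void increment(String key) {
        long next = state.getOrDefault(key, 0L) + 1;
        state.put(key, next);
        changelog.add(new Change(key, next));
    }

    long get(String key) {
        return state.getOrDefault(key, 0L);
    }

    // Recovery: rebuild state by replaying the log in order; later entries win.
    static IncrementalCheckpointSketch restore(List<Change> changelog) {
        IncrementalCheckpointSketch restored = new IncrementalCheckpointSketch(changelog);
        for (Change c : changelog) {
            restored.state.put(c.key, c.value);
        }
        return restored;
    }

    public static void main(String[] args) {
        List<Change> log = new ArrayList<>();
        IncrementalCheckpointSketch counts = new IncrementalCheckpointSketch(log);
        counts.increment("us");
        counts.increment("us");
        counts.increment("in");
        System.out.println(restore(log).get("us"));
    }
}
```

The cost of each checkpoint is proportional to what changed rather than to the total state size, which is the scaling property the slide contrasts with full state checkpointing.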
  • 41. Managing local state • Durably store “host-to-task” mapping • Minimize reseeding during failures, adding/removing capacity
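One way to read "durably store host-to-task mapping" is as input to a sticky assignment: a task goes back to its previous host when that host is still alive, so its local state does not have to be reseeded. The sketch below illustrates that idea only; the algorithm and all names are assumptions, not Samza's scheduler.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StickyAssignmentSketch {
    // Keeps each task on its previous host when that host is live;
    // remaining tasks go to the least-loaded live host.
    static Map<String, String> assign(List<String> tasks,
                                      Map<String, String> previousHostByTask,
                                      List<String> liveHosts) {
        Map<String, String> assignment = new HashMap<>();
        Map<String, Integer> load = new HashMap<>();
        for (String host : liveHosts) {
            load.put(host, 0);
        }
        // First pass: honor the stored mapping where the host is still live.
        for (String task : tasks) {
            String previous = previousHostByTask.get(task);
            if (previous != null && load.containsKey(previous)) {
                assignment.put(task, previous);
                load.merge(previous, 1, Integer::sum);
            }
        }
        // Second pass: place tasks whose host is gone (these need reseeding).
        for (String task : tasks) {
            if (assignment.containsKey(task)) {
                continue;
            }
            String target = null;
            for (String host : liveHosts) {
                if (target == null || load.get(host) < load.get(target)) {
                    target = host;
                }
            }
            if (target == null) {
                throw new IllegalStateException("no live hosts");
            }
            assignment.put(task, target);
            load.merge(target, 1, Integer::sum);
        }
        return assignment;
    }
}
```

Only tasks placed in the second pass have to rebuild their local state, which keeps reseeding proportional to the capacity that was actually lost or added.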
  • 42. Samza processing pipeline • Natural back-pressure • Per-stage checkpointing instead of global checkpointing • Cost considerations – new Kafka feature (KIP-107: deleteDataBefore)
  • 45. Scenario: title standardization Re-evaluate titles for all LinkedIn members with new ML model
  • 46. Dealing with changes in ML models Re-evaluate titles for all LinkedIn members with new ML model
  • 48. Samza HDFS support (reprocessing, experimentation, lambda architecture, etc.)
  • 49. Samza HDFS benchmark: profile count, group-by country; 500 files, 250 GB
  • 50. Samza: a common API for data processing • Application code does not change between stream processing and batch data processing • Configurable input sources and sinks (e.g. Kafka, Kinesis, Eventhub, HDFS, etc.)
  • 52. Fluent API (0.13 release)
        public class PageViewCounterExample implements StreamApplication {

          @Override
          public void init(StreamGraph graph, Config config) {
            MessageStream<PageViewEvent> pageViewEvents = graph.createInputStream("myinput");
            MessageStream<MyStreamOutput> outputStream = graph.createOutputStream("myoutput");

            // Count page views per member in 10-second tumbling windows.
            pageViewEvents
                .partitionBy(m -> m.getMessage().memberId)
                .window(Windows.<PageViewEvent, String, Integer>keyedTumblingWindow(
                    m -> m.getMessage().memberId, Duration.ofSeconds(10), (m, c) -> c + 1))
                .map(MyStreamOutput::new)
                .sendTo(outputStream);
          }

          public static void main(String[] args) throws Exception {
            CommandLine cmdLine = new CommandLine();
            Config config = cmdLine.loadConfig(cmdLine.parser().parse(args));
            ApplicationRunner localRunner = ApplicationRunner.getLocalRunner(config);
            localRunner.run(new PageViewCounterExample());
          }
        }
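What the keyed tumbling window in the slide computes can be shown in plain Java. This sketch is not the Samza API: it assigns events to fixed, non-overlapping windows by an event timestamp in epoch milliseconds (an assumption; the slide does not say which notion of time the window uses) and keeps a count per member per window.

```java
import java.time.Duration;
import java.util.Map;
import java.util.TreeMap;

class TumblingWindowCountSketch {
    private final long windowMillis;
    // window start (epoch ms) -> member ID -> count
    private final Map<Long, Map<String, Integer>> countsByWindow = new TreeMap<>();

    TumblingWindowCountSketch(Duration windowSize) {
        this.windowMillis = windowSize.toMillis();
    }

    void onPageView(String memberId, long eventTimeMillis) {
        // Round the timestamp down to the start of its window.
        long windowStart = (eventTimeMillis / windowMillis) * windowMillis;
        countsByWindow
            .computeIfAbsent(windowStart, w -> new TreeMap<>())
            .merge(memberId, 1, Integer::sum); // same fold as (m, c) -> c + 1
    }

    int count(long windowStart, String memberId) {
        Map<String, Integer> counts = countsByWindow.get(windowStart);
        return counts == null ? 0 : counts.getOrDefault(memberId, 0);
    }

    public static void main(String[] args) {
        TumblingWindowCountSketch counter = new TumblingWindowCountSketch(Duration.ofSeconds(10));
        counter.onPageView("a", 1_000L);
        counter.onPageView("a", 9_999L);
        counter.onPageView("a", 10_000L);
        System.out.println(counter.count(0L, "a"));
        System.out.println(counter.count(10_000L, "a"));
    }
}
```

Each event lands in exactly one window, so the per-window counts never double count a page view.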
  • 53. Deployment options. Standalone: • Full control on application lifecycle • Can be part of a bigger application • ZK-based coordination. YARN-based: • Dashboard • Management service • Monitoring/alerts • Long running service in YARN