SlideShare a Scribd company logo
1 of 37
Stream Processing @Scale in
LinkedIn
Yi Pan
Data Infrastructure
Samza Team @LinkedIn
Databus
• What is Stream Processing?
• What is Samza?
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features
Overview
• What’s stream processing
– Input: an unbounded sequence of events
• E.g. web server logs, user activity tracking events,
database changelogs, etc.
– Latency: near real-time
• From milliseconds to minutes, instead of hours to days
– Output: an unbounded sequence of changes to
the derived dataset
• The derived dataset is usually the final or partial
analytic results that can either be in another stream, or
a serving data store
Stream Processing
Response latency
Milliseconds to minutes
Synchronous Later. Possibly much later.
0 ms
Stream Processing
• What are the application requirements?
– Scalable, fast, stateful stream processing
– What scale should we operate at?
• Traffic Volume: 1.4 Trillion events/day
• Intermediate State Size: multi TB / colo (*)
– Why is it expensive to run stream processing at
scale?
• Intermediate data set needs to be stored to allow low
latency processing
• Large volume of data needs to be pulled and pushed
via network
Stream Processing
• What is Stream Processing?
• What is Samza?
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features
Overview
• Samza is a distributed Turing machine
– Single Task Samza Job is a stateful Turing
machine
What’s Samza
Samza Task
Input stream Output stream
State
changelog
checkpoint
– Scaling a Samza job: partition the streams
What’s SamzaInputstreamA
partition 0
partition 1
partition 2
partition 3
partition n
Samza Task
State
– Scaling a Samza job: partition the streams
What’s SamzaInputstreamB
partition 0
partition 1
partition 2
partition 3
partition n
Samza Task
State
– Scaling a Samza job: replicating the state
machine
What’s Samza
shared checkpoint
Job
• Samza Execution in Yarn
What’s Samza
Host 1 Host 2 Host 3
Application
Master
Samza container Samza container
Samza container
Deploy Samza job
• Samza Execution in Yarn
What’s Samza
Host 1 Host 2 Host 3
Application
Master
Samza container Samza container
Samza container
• Samza Execution in Yarn
What’s Samza
Host 1 Host 2 Host 3
Application
Master
Samza container
Samza container
Samza container
• States in Samza
– Checkpoints
• Offsets per input stream partitions
– State Stores
• In-memory or on-disk (RocksDB) derived data set
What’s Samza
Samza Task
Output stream partitions
State
changelogpartitions
checkpoint Host 1
• States in Samza
– Checkpoints and local state stores are backed by
distributed logs
What’s Samza
Samza Task
Output stream partitions
State
changelogpartitions
checkpoint Host 1
• States in Samza
– Checkpoints and local state stores are backed
by distributed logs
What’s Samza
Samza Task
Output stream partitions
State
changelogpartitions
checkpoint Host 1
• States in Samza
– Checkpoints and local state stores are backed
by distributed logs
What’s Samza
Samza Task
Output stream partitions
State
changelogpartitions
checkpoint Host 2
• Multiple Jobs in a Dataflow
What’s Samza
Stream A Stream B Stream C
Stream E
Stream F
Job 1 Job 2
Stream D
Job 3
• What is Stream Processing?
• What is Samza?
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features
Overview
Partition 0
class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
GenericRecord record = ((GenericRecord) envelope.getMsg());
String pageKey = record.get("page-key").toString();
int newCount = pageKeyViews.get(pageKey).incrementAndGet();
collector.send(countStream, pageKey, newCount);
}
}
Samza Programming API
Partition 0
class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
GenericRecord record = ((GenericRecord) envelope.getMsg());
String pageKey = record.get("page-key").toString();
int newCount = pageKeyViews.get(pageKey).incrementAndGet();
collector.send(countStream, pageKey, newCount);
}
}
Samza Programming API
Partition 0
class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
GenericRecord record = ((GenericRecord) envelope.getMsg());
String pageKey = record.get("page-key").toString();
int newCount = pageKeyViews.get(pageKey).incrementAndGet();
collector.send(countStream, pageKey, newCount);
}
}
Samza Programming API
Partition 0
class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
GenericRecord record = ((GenericRecord) envelope.getMsg());
String pageKey = record.get("page-key").toString();
int newCount = pageKeyViews.get(pageKey).incrementAndGet();
collector.send(countStream, pageKey, newCount);
}
}
Samza Programming API
Partition 0
class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
GenericRecord record = ((GenericRecord) envelope.getMsg());
String pageKey = record.get("page-key").toString();
int newCount = pageKeyViews.get(pageKey).incrementAndGet();
collector.send(countStream, pageKey, newCount);
}
}
Samza Programming API
Partition 0
class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
GenericRecord record = ((GenericRecord) envelope.getMsg());
String pageKey = record.get("page-key").toString();
int newCount = pageKeyViews.get(pageKey).incrementAndGet();
collector.send(countStream, pageKey, newCount);
}
}
Samza Programming API
Partition 0
class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
GenericRecord record = ((GenericRecord) envelope.getMsg());
String pageKey = record.get("page-key").toString();
int newCount = pageKeyViews.get(pageKey).incrementAndGet();
collector.send(countStream, pageKey, newCount);
}
}
Samza Programming API
Partition 0
class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
GenericRecord record = ((GenericRecord) envelope.getMsg());
String pageKey = record.get("page-key").toString();
int newCount = pageKeyViews.get(pageKey).incrementAndGet();
collector.send(countStream, pageKey, newCount);
}
}
Samza Programming API
Partition 0
class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
GenericRecord record = ((GenericRecord) envelope.getMsg());
String pageKey = record.get("page-key").toString();
int newCount = pageKeyViews.get(pageKey).incrementAndGet();
collector.send(countStream, pageKey, newCount);
}
}
Samza Programming API
Partition 0
class PageKeyViewsCounterTask implements StreamTask {
public void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
GenericRecord record = ((GenericRecord) envelope.getMsg());
String pageKey = record.get("page-key").toString();
int newCount = pageKeyViews.get(pageKey).incrementAndGet();
collector.send(countStream, pageKey, newCount);
}
}
Samza Programming API
• What is Stream Processing?
• What is Samza?
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features
Overview
Stream Processing @
LinkedIn
WebServers
WebServers
WebServers
WebServers
WebServers
WebServers
WebServersMonitor
Servers
Oracle
Espresso
Kafka Databus
Tracking
events
Metrics
changelog
changelog
Samza
Jobs
Samza
Jobs
Samza
Jobs
Samza
Jobs
bootstrap
bootstrap
Voldemort
Derived
Data Derived
Data
Stream Processing @
LinkedIn
• Tracking aggregate/analysis (ACG)
Stream Processing @
LinkedIn
• Content standardization w/ adjunct data
set
Member
Profile DB
Bootstrap
Job
Databus
Kafka
Content
Standardization
Kafka
Kafka
Stream Processing @
LinkedIn
• Kafka Deployment
– 1.1 Trillion messages / day
• Databus Deployment
– 300 Billion messages / day
• Samza Deployment
– multiple colos
– 10+ Yarn clusters
– 200+ nodes
– 100+ Jobs in production
• What is Stream Processing?
• What’s Samza
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features
Overview
• New features
– Local state store improvements
• RocksDB TTL support
• Fast recovery
– Dynamic configuration
– Easier deployment w/ standalone jobs
– High-level query language for faster
development
Upcoming Features
Contact Us / Get Involved
• Open Source
–Documentation: samza.apache.org
–Mailing list: dev@samza.apache.org
–JIRA:
https://issues.apache.org/jira/browse/SA
MZA

More Related Content

What's hot

Ingestion file copy using apex
Ingestion   file copy using apexIngestion   file copy using apex
Ingestion file copy using apexApache Apex
 
Spark streaming
Spark streamingSpark streaming
Spark streamingWhiteklay
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Monal Daxini
 
Apache samza past, present and future
Apache samza  past, present and futureApache samza  past, present and future
Apache samza past, present and futureEd Yakabosky
 
Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...
Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...
Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...Flink Forward
 
Data pipeline with kafka
Data pipeline with kafkaData pipeline with kafka
Data pipeline with kafkaMole Wong
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
Apache Kafka Streams Use Case
Apache Kafka Streams Use CaseApache Kafka Streams Use Case
Apache Kafka Streams Use CaseApache Kafka TLV
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioPiotr Czarnas
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasMonal Daxini
 
Easily Build a Smart Pulsar Stream Processor_Simon Crosby
Easily Build a Smart Pulsar Stream Processor_Simon CrosbyEasily Build a Smart Pulsar Stream Processor_Simon Crosby
Easily Build a Smart Pulsar Stream Processor_Simon CrosbyStreamNative
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Fabian Hueske
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexApache Apex
 
Stream Processing using Samza SQL
Stream Processing using Samza SQLStream Processing using Samza SQL
Stream Processing using Samza SQLSamarth Shetty
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache ApexApache Apex
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniMonal Daxini
 

What's hot (20)

Ingestion file copy using apex
Ingestion   file copy using apexIngestion   file copy using apex
Ingestion file copy using apex
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
Apache samza past, present and future
Apache samza  past, present and futureApache samza  past, present and future
Apache samza past, present and future
 
Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...
Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...
Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...
 
Data pipeline with kafka
Data pipeline with kafkaData pipeline with kafka
Data pipeline with kafka
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Apache Kafka Streams Use Case
Apache Kafka Streams Use CaseApache Kafka Streams Use Case
Apache Kafka Streams Use Case
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paas
 
Easily Build a Smart Pulsar Stream Processor_Simon Crosby
Easily Build a Smart Pulsar Stream Processor_Simon CrosbyEasily Build a Smart Pulsar Stream Processor_Simon Crosby
Easily Build a Smart Pulsar Stream Processor_Simon Crosby
 
Apex as yarn application
Apex as yarn applicationApex as yarn application
Apex as yarn application
 
ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015 ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
 
Stream Processing using Samza SQL
Stream Processing using Samza SQLStream Processing using Samza SQL
Stream Processing using Samza SQL
 
Samza la hug
Samza la hugSamza la hug
Samza la hug
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
 

Viewers also liked

Estrategias comerciales y mkt
Estrategias comerciales y mktEstrategias comerciales y mkt
Estrategias comerciales y mktsuneo1
 
I.II. clasificación de la industria del espectáculo
I.II. clasificación de la industria del espectáculoI.II. clasificación de la industria del espectáculo
I.II. clasificación de la industria del espectáculoCarlos H. Elizalde Ramírez
 
Bienvenida mercadotecnia deportiva internacional
Bienvenida mercadotecnia deportiva internacionalBienvenida mercadotecnia deportiva internacional
Bienvenida mercadotecnia deportiva internacionalCarlos H. Elizalde Ramírez
 
презентация ТРК Жигулевский ковчег - Белогорье
презентация ТРК Жигулевский ковчег - Белогорьепрезентация ТРК Жигулевский ковчег - Белогорье
презентация ТРК Жигулевский ковчег - БелогорьеTaniana Zharikova
 
Caso práctico desarrollo de productos deportivos. evaluación unidad i.
Caso práctico desarrollo de productos deportivos. evaluación unidad i.Caso práctico desarrollo de productos deportivos. evaluación unidad i.
Caso práctico desarrollo de productos deportivos. evaluación unidad i.Carlos H. Elizalde Ramírez
 
Literatura catalana segles XIV i XV
Literatura catalana segles XIV i XVLiteratura catalana segles XIV i XV
Literatura catalana segles XIV i XVLluis Rius
 
DepED K to 12 English Grade 8 Curriculum Guide (CG) --> 1-10-2014
DepED K to 12 English Grade 8 Curriculum Guide (CG)    -->  1-10-2014DepED K to 12 English Grade 8 Curriculum Guide (CG)    -->  1-10-2014
DepED K to 12 English Grade 8 Curriculum Guide (CG) --> 1-10-2014Chuckry Maunes
 
Recursos Naturales I
Recursos Naturales IRecursos Naturales I
Recursos Naturales Iesbaflorida
 

Viewers also liked (11)

Estrategias comerciales y mkt
Estrategias comerciales y mktEstrategias comerciales y mkt
Estrategias comerciales y mkt
 
Presentacion
PresentacionPresentacion
Presentacion
 
HerNameWasKatrina
HerNameWasKatrinaHerNameWasKatrina
HerNameWasKatrina
 
Certificate
CertificateCertificate
Certificate
 
I.II. clasificación de la industria del espectáculo
I.II. clasificación de la industria del espectáculoI.II. clasificación de la industria del espectáculo
I.II. clasificación de la industria del espectáculo
 
Bienvenida mercadotecnia deportiva internacional
Bienvenida mercadotecnia deportiva internacionalBienvenida mercadotecnia deportiva internacional
Bienvenida mercadotecnia deportiva internacional
 
презентация ТРК Жигулевский ковчег - Белогорье
презентация ТРК Жигулевский ковчег - Белогорьепрезентация ТРК Жигулевский ковчег - Белогорье
презентация ТРК Жигулевский ковчег - Белогорье
 
Caso práctico desarrollo de productos deportivos. evaluación unidad i.
Caso práctico desarrollo de productos deportivos. evaluación unidad i.Caso práctico desarrollo de productos deportivos. evaluación unidad i.
Caso práctico desarrollo de productos deportivos. evaluación unidad i.
 
Literatura catalana segles XIV i XV
Literatura catalana segles XIV i XVLiteratura catalana segles XIV i XV
Literatura catalana segles XIV i XV
 
DepED K to 12 English Grade 8 Curriculum Guide (CG) --> 1-10-2014
DepED K to 12 English Grade 8 Curriculum Guide (CG)    -->  1-10-2014DepED K to 12 English Grade 8 Curriculum Guide (CG)    -->  1-10-2014
DepED K to 12 English Grade 8 Curriculum Guide (CG) --> 1-10-2014
 
Recursos Naturales I
Recursos Naturales IRecursos Naturales I
Recursos Naturales I
 

Similar to Samza tech talk_2015 - huawei

Beam me up, Samza!
Beam me up, Samza!Beam me up, Samza!
Beam me up, Samza!Xinyu Liu
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analyticsamesar0
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Monal Daxini
 
Samza at LinkedIn
Samza at LinkedInSamza at LinkedIn
Samza at LinkedInVenu Ryali
 
stream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzastream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzaAbhishek Shivanna
 
Apache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInApache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInChris Riccomini
 
Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaAmazon Web Services
 
Apache Samza Past, Present and Future
Apache Samza  Past, Present and FutureApache Samza  Past, Present and Future
Apache Samza Past, Present and FutureKartik Paramasivam
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
Drinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsDrinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsSamantha Quiñones
 
Nextcon samza preso july - final
Nextcon samza preso   july - finalNextcon samza preso   july - final
Nextcon samza preso july - finalYi Pan
 
Apache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInApache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInChris Riccomini
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsNavina Ramesh
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQXin Wang
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...Flink Forward
 
Data Streaming in Kafka
Data Streaming in KafkaData Streaming in Kafka
Data Streaming in KafkaSilviuMarcu1
 

Similar to Samza tech talk_2015 - huawei (20)

Beam me up, Samza!
Beam me up, Samza!Beam me up, Samza!
Beam me up, Samza!
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Spark cep
Spark cepSpark cep
Spark cep
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
 
Samza at LinkedIn
Samza at LinkedInSamza at LinkedIn
Samza at LinkedIn
 
stream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzastream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samza
 
Apache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInApache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedIn
 
Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS Lambda
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Apache Samza Past, Present and Future
Apache Samza  Past, Present and FutureApache Samza  Past, Present and Future
Apache Samza Past, Present and Future
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
Drinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsDrinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time Metrics
 
Nextcon samza preso july - final
Nextcon samza preso   july - finalNextcon samza preso   july - final
Nextcon samza preso july - final
 
Apache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedInApache Incubator Samza: Stream Processing at LinkedIn
Apache Incubator Samza: Stream Processing at LinkedIn
 
Will it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing ApplicationsWill it Scale? The Secrets behind Scaling Stream Processing Applications
Will it Scale? The Secrets behind Scaling Stream Processing Applications
 
Realtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQRealtime Statistics based on Apache Storm and RocketMQ
Realtime Statistics based on Apache Storm and RocketMQ
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
 
Data Streaming in Kafka
Data Streaming in KafkaData Streaming in Kafka
Data Streaming in Kafka
 

Recently uploaded

OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxRTS corp
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...OnePlan Solutions
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profileakrivarotava
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?Alexandre Beguel
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsJean Silva
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITmanoharjgpsolutions
 

Recently uploaded (20)

OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptxReal-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
Real-time Tracking and Monitoring with Cargo Cloud Solutions.pptx
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?SAM Training Session - How to use EXCEL ?
SAM Training Session - How to use EXCEL ?
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Best Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh ITBest Angular 17 Classroom & Online training - Naresh IT
Best Angular 17 Classroom & Online training - Naresh IT
 

Samza tech talk_2015 - huawei

  • 1. Stream Processing @Scale in LinkedIn Yi Pan Data Infrastructure Samza Team @LinkedIn Databus
  • 2. • What is Stream Processing? • What is Samza? • Samza Programming API • Stream Processing @LinkedIn • Upcoming features Overview
  • 3. • What’s stream processing – Input: an unbounded sequence of events • E.g. web server logs, user activity tracking events, database changelogs, etc. – Latency: near real-time • From milliseconds to minutes, instead of hours to days – Output: an unbounded sequence of changes to the derived dataset • The derived dataset is usually the final or partial analytic results that can either be in another stream, or a serving data store Stream Processing
  • 4. Response latency Milliseconds to minutes Synchronous Later. Possibly much later. 0 ms Stream Processing
  • 5. • What are the application requirements? – Scalable, fast, stateful stream processing – What scale should we operate at? • Traffic Volume: 1.4 Trillion events/day • Intermediate State Size: multi TB / colo (*) – Why is it expensive to run stream processing at scale? • Intermediate data set needs to be stored to allow low latency processing • Large volume of data needs to be pulled and pushed via network Stream Processing
  • 6. • What is Stream Processing? • What is Samza? • Samza Programming API • Stream Processing @LinkedIn • Upcoming features Overview
  • 7. • Samza is a distributed Turing machine – Single Task Samza Job is a stateful Turing machine What’s Samza Samza Task Input stream Output stream State changelog checkpoint
  • 8. – Scaling a Samza job: partition the streams What’s SamzaInputstreamA partition 0 partition 1 partition 2 partition 3 partition n Samza Task State
  • 9. – Scaling a Samza job: partition the streams What’s SamzaInputstreamB partition 0 partition 1 partition 2 partition 3 partition n Samza Task State
  • 10. – Scaling a Samza job: replicating the state machine What’s Samza shared checkpoint Job
  • 11. • Samza Execution in Yarn What’s Samza Host 1 Host 2 Host 3 Application Master Samza container Samza container Samza container Deploy Samza job
  • 12. • Samza Execution in Yarn What’s Samza Host 1 Host 2 Host 3 Application Master Samza container Samza container Samza container
  • 13. • Samza Execution in Yarn What’s Samza Host 1 Host 2 Host 3 Application Master Samza container Samza container Samza container
  • 14. • States in Samza – Checkpoints • Offsets per input stream partitions – State Stores • In-memory or on-disk (RocksDB) derived data set What’s Samza Samza Task Output stream partitions State changelogpartitions checkpoint Host 1
  • 15. • States in Samza – Checkpoints and local state stores are backed by distributed logs What’s Samza Samza Task Output stream partitions State changelogpartitions checkpoint Host 1
  • 16. • States in Samza – Checkpoints and local state stores are backed by distributed logs What’s Samza Samza Task Output stream partitions State changelogpartitions checkpoint Host 1
  • 17. • States in Samza – Checkpoints and local state stores are backed by distributed logs What’s Samza Samza Task Output stream partitions State changelogpartitions checkpoint Host 2
  • 18. • Multiple Jobs in a Dataflow What’s Samza Stream A Stream B Stream C Stream E Stream F Job 1 Job 2 Stream D Job 3
  • 19. • What is Stream Processing? • What is Samza? • Samza Programming API • Stream Processing @LinkedIn • Upcoming features Overview
  • 20. Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } } Samza Programming API
  • 21. Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } } Samza Programming API
  • 22. Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } } Samza Programming API
  • 23. Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } } Samza Programming API
  • 24. Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } } Samza Programming API
  • 25. Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } } Samza Programming API
  • 26. Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } } Samza Programming API
  • 27. Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } } Samza Programming API
  • 28. Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } } Samza Programming API
  • 29. Partition 0 class PageKeyViewsCounterTask implements StreamTask { public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { GenericRecord record = ((GenericRecord) envelope.getMsg()); String pageKey = record.get("page-key").toString(); int newCount = pageKeyViews.get(pageKey).incrementAndGet(); collector.send(countStream, pageKey, newCount); } } Samza Programming API
  • 30. • What is Stream Processing? • What is Samza? • Samza Programming API • Stream Processing @LinkedIn • Upcoming features Overview
  • 31. Stream Processing @ LinkedIn WebServers WebServers WebServers WebServers WebServers WebServers WebServersMonitor Servers Oracle Espresso Kafka Databus Tracking events Metrics changelog changelog Samza Jobs Samza Jobs Samza Jobs Samza Jobs bootstrap bootstrap Voldemort Derived Data Derived Data
  • 32. Stream Processing @ LinkedIn • Tracking aggregate/analysis (ACG)
  • 33. Stream Processing @ LinkedIn • Content standardization w/ adjunct data set Member Profile DB Bootstrap Job Databus Kafka Content Standardization Kafka Kafka
  • 34. Stream Processing @ LinkedIn • Kafka Deployment – 1.1 Trillion messages / day • Databus Deployment – 300 Billion messages / day • Samza Deployment – multiple colos – 10+ Yarn clusters – 200+ nodes – 100+ Jobs in production
  • 35. • What is Stream Processing? • What’s Samza • Samza Programming API • Stream Processing @LinkedIn • Upcoming features Overview
  • 36. • New features – Local state store improvements • RocksDB TTL support • Fast recovery – Dynamic configuration – Easier deployment w/ standalone jobs – High-level query language for faster development Upcoming Features
  • 37. Contact Us / Get Involved • Open Source –Documentation: samza.apache.org –Mailing list: dev@samza.apache.org –JIRA: https://issues.apache.org/jira/browse/SA MZA

Editor's Notes

  1. Rpc – bulk of what we do, we expect immediate response (web page, bunch of requests sent) Other extreme is batch processing, typically done in hadoop (order of hours if not days) Samza fits in middle, async, relatively quickly, Order of ms to minute stream processing for us = anything asynchronous, but not batch computed.
  2. re-playable, ordered, fault tolerant, infinite very heavyweight definition of a stream (vs. s4, storm, etc)
  3. Ordering within partition and not amongst all partitions
  4. partition assignment happens on write (e.g. memberID). Once the partition assignment is done, you cannot modify it. Kind of writing to a file.
  5. Multiple jobs in Samza can connect together in a dataflow
  6. Interface called stream task
  7. One call per upstream
  8. Lifecycle of the task - Shutdown, commit, window
  9. Countestream – outputstream Pagekey – partition key (same partition for the same page and it’s count)