SlideShare a Scribd company logo
1 of 32
Download to read offline
Scalable Stream
Processing
With
Apache Samza
Prateek Maheshwari
Apache Samza PMC
Agenda
● Stream Processing at LinkedIn
○ Scale at LinkedIn
○ Scenarios at LinkedIn
● Apache Samza
○ Processing Model
○ Stateful Processing
○ Processing APIs
○ Deployment Model
Apache Kafka
5 Trillion+ messages ingested
per day
1.5+ PB data per day
100k+ topics, 5M+ partitions
Brooklin
2 Trillion+ messages moved
per day
10k+ topics mirrored
2k+ change capture streams
Apache Samza
1.5 Trillion+ messages
processed per day
3k+ jobs in production
500 TB+ local state
Scale at LinkedIn
Scenarios at LinkedIn
DDoS prevention,
bot detection, access
monitoring
Security
Email and Push
notifications
Notifications
Topic tagging, NER in
news articles, image
classification
Classification
Site speed and
health monitoring
Site Speed
Monitoring
inter-service
dependencies and
SLAs
Call Graphs
Scenarios at LinkedIn
Tracking ad views
and clicks
Ad CTR
Tracking
Pre-aggregated
real-time counts by
dimensions
Business
Metrics
Standardizing titles,
companies,
education
Profile
Standardization
Updating search
indices with new
data
Index
Updates
Tracking member
page views,
dwell-time, sessions
Activity
Tracking
Hardened
at Scale
In production at
LinkedIn, Slack, Intuit,
TripAdvisor, VMWare,
Redfin, etc.
Processing events from
Kafka, Brooklin,
Kinesis, EventHubs,
HDFS, DynamoDB
Streams, Databus, etc.
Apache Samza
Incremental checkpoints
for large local state and
instant recovery.
Local state that works
seamlessly across
upgrades and failures.
APIs for simple and
efficient remote I/O
Best In Class
Stateful Processing
Stream and batch
processing without
changing code.
Convenient High-level
DSLs and a powerful
Low-level API.
Universal
Processing APIs
Write once, run
anywhere.
Run on a multi-tenant
cluster or as an
embedded library.
Flexible
Deployment Model
Brooklin
Hadoop
Task-1
Task-2
Task-3
Container-1
Container-2
Kafka
Heartbeat
Job Coordinator
Samza Application
Processing Model
Kafka
Hadoop
Serving Stores (e.g.
Espresso, Venice, Pinot)
Elasticsearch
● Parallelism across tasks by increasing the number of containers.
○ Up to 1 container per task.
● Parallelism across partitions by increasing the number of tasks.
○ Up to 1 task per partition.
● Parallelism within a partition for out of order processing.
○ Any number of threads.
Scaling a Samza Application
Hardened
at Scale
In production at
LinkedIn, Uber, Slack,
Intuit, TripAdvisor,
VMWare, Redfin, etc.
Processing events from
Kafka, Brooklin, Kinesis,
EventHubs, HDFS,
DynamoDB Streams,
Databus, etc.
Apache Samza
Incremental checkpoints
for large local state and
instant recovery.
Local state that works
seamlessly across
upgrades and failures.
APIs for simple and
efficient remote I/O
Best In Class
Stateful Processing
Stream and batch
processing without
changing code.
Convenient High-level
DSLs and a powerful
Low-level API.
Universal
Processing APIs
Write once, run
anywhere.
Run on a multi-tenant
cluster or as an
embedded library.
Flexible
Deployment Model
• State is used for performing lookups
and joins, caching data,
buffering/batching data, and writing
computed results.
• State can be local (in-memory or on
disk) or remote.
Samza
Local Store I/O
Samza
Why State Matters
and
Remote DB I/O
Why Local State Matters: Throughput
on disk w/ caching comparable with in memory changelog adds minimal overhead
remote state
30-150x worse than
local state
Terminology
Disk Type: SSD
Max-Net: Max network bandwidth
CLog: Kafka changelog
ReadOnly: read only workloads (lookups)
ReadWrite: read - write workloads (counts)
Shadi A. Noghabi et al. Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.
Why Local State Matters: Latency
on disk w/ caching comparable with in memory changelog adds minimal overhead
> 2 orders of magnitude slower compared to
local state
Shadi A. Noghabi et al. Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.
Optimizations for Local State
Task-1
Container-1
Samza Application Master
Durable Container ID – host mapping
1. Log state changes to a Kafka compacted
topic for durability.
2. Catch up on only the delta from the
change log topic on restart.
Task-2
Container-2
Optimizations for Local State
1. Host Affinity
2. Parallel Recovery
3. Bulk Load Mode
4. Standby Containers
5. Log Compaction
Task-1
Container-1
Samza Application Master
Durable Container ID – host mapping
Task-2
Container-2
Why Remote I/O Matters
• Data is only available in the remote store (no change capture).
• Need strong consistency or transactions.
• Data cannot be partitioned but is too large to copy to every container.
• Writing processed results for online serving.
• Calling other services to handle complex business logic.
Optimizations for Remote I/O: Table API
• Async Requests
• Rate Limiting
• Batching
• Caching
• Retries
• Stream Table Joins
Hardened
at Scale
In production at
LinkedIn, Uber, Slack,
Intuit, TripAdvisor,
VMWare, Redfin, etc.
Processing events from
Kafka, Brooklin, Kinesis,
EventHubs, HDFS,
DynamoDB Streams,
Databus, etc.
Apache Samza
Incremental checkpoints
for large local state and
instant recovery.
Local state that works
seamlessly across
upgrades and failures.
APIs for simple and
efficient remote I/O
Best In Class
Stateful Processing
Stream and batch
processing without
changing code.
Convenient High-level
DSLs and a powerful
Low-level API.
Universal
Processing APIs
Write once, run
anywhere.
Run on a multi-tenant
cluster or as an
embedded library.
Flexible
Deployment Model
Example Application
Count number of "Page Views" for each member in a 5 minute window
18
Page View
Page View Per
Member
Repartition
by member id
Window Map SendTo
Intermediate Stream
High Level API
● Complex Processing Pipelines
● Easy Repartitioning
● Stream-Stream and Stream-Table Joins
● Processing Time Windows and Joins
High Level API
public class PageViewCountApplication implements StreamApplication {
@Override
public void describe(StreamApplicationDescriptor appDescriptor) {
KafkaSystemDescriptor ksd = new KafkaSystemDescriptor("tracking");
KafkaInputDescriptor<PageViewEvent> pageViews = ksd.getInputDescriptor("PageView", serde);
KafkaOutputDescriptor<PageViewCount> pageViewCounts = ksd.getOutputDescriptor("PageViewCount", serde);
appDescriptor.getInputStream(pageViews)
.partitionBy(m -> m.memberId, serde)
.window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5),
initialValue, (m, c) -> c + 1))
.map(PageViewCount::new)
.sendTo(appDescriptor.getOutputStream(pageViewCounts));
}
}
Apache Beam
● Event Time Processing
● Multi-lingual APIs (Java, Python, Go*)
● Advanced Windows and Joins
Apache Beam
public class PageViewCount {
public static void main(String[] args) {
...
pipeline
.apply(LiKafkaIO.<PageViewEvent>read()
.withTopic("PageView")
.withTimestampFn(kv -> new Instant(kv.getValue().header.time))
.withWatermarkFn(kv -> new Instant(kv.getValue().header.time - 60000))
.apply(Values.create())
.apply(MapElements
.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
.via((PageViewEvent pv) -> KV.of(String.valueOf(pv.header.memberId), 1)))
.apply(Window.into(TumblingWindows.of(Duration.standardMinutes(5))))
.apply(Count.perKey())
.apply(MapElements
.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(Counter.class)))
.via(newCounter()))
.apply(LiKafkaIO.<Counter>write().withTopic("PageViewCount")
pipeline.run();
}
}
Apache Beam: Python
p = Pipeline(options=pipeline_options)
(p
| 'read' >> ReadFromKafka(cluster="tracking",
topic="PageViewEvent", config=config)
| 'extract' >> beam.Map(lambda record: (record.value['memberId'], 1))
| "windowing" >> beam.WindowInto(window.FixedWindows(60*5))
| "compute" >> beam.CombinePerKey(beam.combiners.CountCombineFn())
| 'write' >> WriteToKafka(cluster = "queuing",
topic = "PageViewCount", config = config)
p.run().waitUntilFinish()
Samza SQL
● Declarative streaming SQL API
● Managed service at LinkedIn
● Create and deploy applications in minutes using SQL Shell
Samza SQL
INSERT INTO kafka.tracking.PageViewCount
SELECT memberId, count(*) FROM kafka.tracking.PageView
GROUP BY memberId, TUMBLE(current_timestamp, INTERVAL '5' MINUTES)
Low Level
High Level
Samza SQL
Apache Beam
Java
Python
Samza APIs
Hardened
at Scale
In production at
LinkedIn, Uber, Slack,
Intuit, TripAdvisor,
VMWare, Redfin, etc.
Processing events from
Kafka, Brooklin, Kinesis,
EventHubs, HDFS,
DynamoDB Streams,
Databus, etc.
Apache Samza
Incremental checkpoints
for large local state and
instant recovery.
Local state that works
seamlessly across
upgrades and failures.
APIs for simple and
efficient remote I/O
Best In Class
Stateful Processing
Stream and batch
processing without
changing code.
Convenient High-level
DSLs and a powerful
Low-level API.
Universal
Processing APIs
Write once, run
anywhere.
Run on a multi-tenant
cluster or as an
embedded library.
Flexible
Deployment Model
Samza on a Multi-Tenant Cluster
• Uses a cluster manager (e.g. YARN) for resource management,
coordination, liveness monitoring, etc.
• Better resource utilization in a multi-tenant environment.
• Works well for large number of applications.
Samza as an Embedded Library
• Embed Samza as a library in an application. No cluster manager dependency.
• Dynamically scale out applications by increasing or decreasing the number of
processors at run-time.
• Supports rolling upgrades and canaries.
● Uses ZooKeeper for leader election and liveness monitoring for processors.
● Leader JobCoordinator performs work assignments among processors.
● Leader redistributes partitions when processors join or leave the group.
Samza as a Library
ZooKeeper Based Coordination
Zookeeper
StreamProcessor
Samza
Container
Job Coordinator
StreamProcessor
Samza
Container
Job
Coordinator
StreamProcessor
Samza
Container
Job
Coordinator…
Leader
Apache Samza
• Mature, versatile, and scalable processing framework
• Best-in-class support for local and remote state
• Powerful and flexible APIs
• Can be operated as a platform or used as an embedded library
Contact Us
http://samza.apache.org
dev@samza.apache.org

More Related Content

What's hot

Kafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka core
Kafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka coreKafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka core
Kafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka core
Guido Schmutz
 

What's hot (20)

Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
How Alibaba Cloud scaled ApsaraDB with MariaDB MaxScale
How Alibaba Cloud scaled ApsaraDB with MariaDB MaxScaleHow Alibaba Cloud scaled ApsaraDB with MariaDB MaxScale
How Alibaba Cloud scaled ApsaraDB with MariaDB MaxScale
 
2017 AWS DB Day | Amazon Redshift 자세히 살펴보기
2017 AWS DB Day | Amazon Redshift 자세히 살펴보기2017 AWS DB Day | Amazon Redshift 자세히 살펴보기
2017 AWS DB Day | Amazon Redshift 자세히 살펴보기
 
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
Kinesis and Spark Streaming - Advanced AWS Meetup - August 2014
 
APAC Kafka Summit - Best Of
APAC Kafka Summit - Best Of APAC Kafka Summit - Best Of
APAC Kafka Summit - Best Of
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Event Driven Architectures with Apache Kafka on Heroku
Event Driven Architectures with Apache Kafka on HerokuEvent Driven Architectures with Apache Kafka on Heroku
Event Driven Architectures with Apache Kafka on Heroku
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
MongoDB 3.4 webinar
MongoDB 3.4 webinarMongoDB 3.4 webinar
MongoDB 3.4 webinar
 
Kafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka core
Kafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka coreKafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka core
Kafka Connect & Kafka Streams/KSQL - powerful ecosystem around Kafka core
 
Near Real-Time Data Analysis With FlyData
Near Real-Time Data Analysis With FlyData Near Real-Time Data Analysis With FlyData
Near Real-Time Data Analysis With FlyData
 
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache KafkaSolutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
 
Kafka - Linkedin's messaging backbone
Kafka - Linkedin's messaging backboneKafka - Linkedin's messaging backbone
Kafka - Linkedin's messaging backbone
 
Real-Time Analytics Visualized w/ Kafka + Streamliner + MemSQL + ZoomData, An...
Real-Time Analytics Visualized w/ Kafka + Streamliner + MemSQL + ZoomData, An...Real-Time Analytics Visualized w/ Kafka + Streamliner + MemSQL + ZoomData, An...
Real-Time Analytics Visualized w/ Kafka + Streamliner + MemSQL + ZoomData, An...
 
How to Migrate from Cassandra to Amazon DynamoDB - AWS Online Tech Talks
How to Migrate from Cassandra to Amazon DynamoDB - AWS Online Tech TalksHow to Migrate from Cassandra to Amazon DynamoDB - AWS Online Tech Talks
How to Migrate from Cassandra to Amazon DynamoDB - AWS Online Tech Talks
 
Change Data Capture using Kafka
Change Data Capture using KafkaChange Data Capture using Kafka
Change Data Capture using Kafka
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!KSQL - Stream Processing simplified!
KSQL - Stream Processing simplified!
 
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
Event Sourcing, Stream Processing and Serverless (Benjamin Stopford, Confluen...
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 

Similar to Scalable Stream Processing with Apache Samza

Similar to Scalable Stream Processing with Apache Samza (20)

Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache KafkaKafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
Kafka Summit NYC 2017 - Data Processing at LinkedIn with Apache Kafka
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
 
Applying ML on your Data in Motion with AWS and Confluent | Joseph Morais, Co...
Applying ML on your Data in Motion with AWS and Confluent | Joseph Morais, Co...Applying ML on your Data in Motion with AWS and Confluent | Joseph Morais, Co...
Applying ML on your Data in Motion with AWS and Confluent | Joseph Morais, Co...
 
Handling Data in Mega Scale Systems
Handling Data in Mega Scale SystemsHandling Data in Mega Scale Systems
Handling Data in Mega Scale Systems
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
Event Streaming Architectures with Confluent and ScyllaDB
Event Streaming Architectures with Confluent and ScyllaDBEvent Streaming Architectures with Confluent and ScyllaDB
Event Streaming Architectures with Confluent and ScyllaDB
 
Data Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEAData Streaming with Apache Kafka & MongoDB - EMEA
Data Streaming with Apache Kafka & MongoDB - EMEA
 
Webinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDBWebinar: Data Streaming with Apache Kafka & MongoDB
Webinar: Data Streaming with Apache Kafka & MongoDB
 
Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...
Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...
Ai tour 2019 Mejores Practicas en Entornos de Produccion Big Data Open Source...
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Data Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDBData Streaming with Apache Kafka & MongoDB
Data Streaming with Apache Kafka & MongoDB
 
Scaling up Near Real-time Analytics @Uber &LinkedIn
Scaling up Near Real-time Analytics @Uber &LinkedInScaling up Near Real-time Analytics @Uber &LinkedIn
Scaling up Near Real-time Analytics @Uber &LinkedIn
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
MongoDB .local Chicago 2019: MongoDB Atlas Jumpstart
MongoDB .local Chicago 2019: MongoDB Atlas JumpstartMongoDB .local Chicago 2019: MongoDB Atlas Jumpstart
MongoDB .local Chicago 2019: MongoDB Atlas Jumpstart
 
Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”
 
Getting Started with AWS Lambda and the Serverless Cloud - AWS Summit Cape T...
 Getting Started with AWS Lambda and the Serverless Cloud - AWS Summit Cape T... Getting Started with AWS Lambda and the Serverless Cloud - AWS Summit Cape T...
Getting Started with AWS Lambda and the Serverless Cloud - AWS Summit Cape T...
 
GWAB 2015 - Data Plaraform
GWAB 2015 - Data PlaraformGWAB 2015 - Data Plaraform
GWAB 2015 - Data Plaraform
 
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and More
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and MoreWSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and More
WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and More
 

Recently uploaded

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 

Recently uploaded (20)

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 

Scalable Stream Processing with Apache Samza

  • 2. Agenda ● Stream Processing at LinkedIn ○ Scale at LinkedIn ○ Scenarios at LinkedIn ● Apache Samza ○ Processing Model ○ Stateful Processing ○ Processing APIs ○ Deployment Model
  • 3. Apache Kafka 5 Trillion+ messages ingested per day 1.5+ PB data per day 100k+ topics, 5M+ partitions Brooklin 2 Trillion+ messages moved per day 10k+ topics mirrored 2k+ change capture streams Apache Samza 1.5 Trillion+ messages processed per day 3k+ jobs in production 500 TB+ local state Scale at LinkedIn
  • 4. Scenarios at LinkedIn DDoS prevention, bot detection, access monitoring Security Email and Push notifications Notifications Topic tagging, NER in news articles, image classification Classification Site speed and health monitoring Site Speed Monitoring inter-service dependencies and SLAs Call Graphs
  • 5. Scenarios at LinkedIn Tracking ad views and clicks Ad CTR Tracking Pre-aggregated real-time counts by dimensions Business Metrics Standardizing titles, companies, education Profile Standardization Updating search indices with new data Index Updates Tracking member page views, dwell-time, sessions Activity Tracking
  • 6. Hardened at Scale In production at LinkedIn, Slack, Intuit, TripAdvisor, VMWare, Redfin, etc. Processing events from Kafka, Brooklin, Kinesis, EventHubs, HDFS, DynamoDB Streams, Databus, etc. Apache Samza Incremental checkpoints for large local state and instant recovery. Local state that works seamlessly across upgrades and failures. APIs for simple and efficient remote I/O Best In Class Stateful Processing Stream and batch processing without changing code. Convenient High-level DSLs and a powerful Low-level API. Universal Processing APIs Write once, run anywhere. Run on a multi-tenant cluster or as an embedded library. Flexible Deployment Model
  • 7. Brooklin Hadoop Task-1 Task-2 Task-3 Container-1 Container-2 Kafka Heartbeat Job Coordinator Samza Application Processing Model Kafka Hadoop Serving Stores (e.g. Espresso, Venice, Pinot) Elasticsearch
  • 8. ● Parallelism across tasks by increasing the number of containers. ○ Up to 1 container per task. ● Parallelism across partitions by increasing the number of tasks. ○ Up to 1 task per partition. ● Parallelism within a partition for out of order processing. ○ Any number of threads. Scaling a Samza Application
  • 9. Hardened at Scale In production at LinkedIn, Uber, Slack, Intuit, TripAdvisor, VMWare, Redfin, etc. Processing events from Kafka, Brooklin, Kinesis, EventHubs, HDFS, DynamoDB Streams, Databus, etc. Apache Samza Incremental checkpoints for large local state and instant recovery. Local state that works seamlessly across upgrades and failures. APIs for simple and efficient remote I/O Best In Class Stateful Processing Stream and batch processing without changing code. Convenient High-level DSLs and a powerful Low-level API. Universal Processing APIs Write once, run anywhere. Run on a multi-tenant cluster or as an embedded library. Flexible Deployment Model
  • 10. • State is used for performing lookups and joins, caching data, buffering/batching data, and writing computed results. • State can be local (in-memory or on disk) or remote. Samza Local Store I/O Samza Why State Matters and Remote DB I/O
  • 11. Why Local State Matters: Throughput on disk w/ caching comparable with in memory changelog adds minimal overhead remote state 30-150x worse than local state Terminology Disk Type: SSD Max-Net: Max network bandwidth CLog: Kafka changelog ReadOnly: read only workloads (lookups) ReadWrite: read - write workloads (counts) Shadi A. Noghabi et al. Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.
  • 12. Why Local State Matters: Latency on disk w/ caching comparable with in memory changelog adds minimal overhead > 2 orders of magnitude slower compared to local state Shadi A. Noghabi et al. Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.
  • 13. Optimizations for Local State Task-1 Container-1 Samza Application Master Durable Container ID – host mapping 1. Log state changes to a Kafka compacted topic for durability. 2. Catch up on only the delta from the change log topic on restart. Task-2 Container-2
  • 14. Optimizations for Local State 1. Host Affinity 2. Parallel Recovery 3. Bulk Load Mode 4. Standby Containers 5. Log Compaction Task-1 Container-1 Samza Application Master Durable Container ID – host mapping Task-2 Container-2
  • 15. Why Remote I/O Matters • Data is only available in the remote store (no change capture). • Need strong consistency or transactions. • Data cannot be partitioned but is too large to copy to every container. • Writing processed results for online serving. • Calling other services to handle complex business logic.
  • 16. Optimizations for Remote I/O: Table API • Async Requests • Rate Limiting • Batching • Caching • Retries • Stream Table Joins
  • 17. Hardened at Scale In production at LinkedIn, Uber, Slack, Intuit, TripAdvisor, VMWare, Redfin, etc. Processing events from Kafka, Brooklin, Kinesis, EventHubs, HDFS, DynamoDB Streams, Databus, etc. Apache Samza Incremental checkpoints for large local state and instant recovery. Local state that works seamlessly across upgrades and failures. APIs for simple and efficient remote I/O Best In Class Stateful Processing Stream and batch processing without changing code. Convenient High-level DSLs and a powerful Low-level API. Universal Processing APIs Write once, run anywhere. Run on a multi-tenant cluster or as an embedded library. Flexible Deployment Model
  • 18. Example Application Count number of "Page Views" for each member in a 5 minute window 18 Page View Page View Per Member Repartition by member id Window Map SendTo Intermediate Stream
  • 19. High Level API ● Complex Processing Pipelines ● Easy Repartitioning ● Stream-Stream and Stream-Table Joins ● Processing Time Windows and Joins
  • 20. High Level API public class PageViewCountApplication implements StreamApplication { @Override public void describe(StreamApplicationDescriptor appDescriptor) { KafkaSystemDescriptor ksd = new KafkaSystemDescriptor("tracking"); KafkaInputDescriptor<PageViewEvent> pageViews = ksd.getInputDescriptor("PageView", serde); KafkaOutputDescriptor<PageViewCount> pageViewCounts = ksd.getOutputDescriptor("PageViewCount", serde); appDescriptor.getInputStream(pageViews) .partitionBy(m -> m.memberId, serde) .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5), initialValue, (m, c) -> c + 1)) .map(PageViewCount::new) .sendTo(appDescriptor.getOutputStream(pageViewCounts)); } }
  • 21. Apache Beam ● Event Time Processing ● Multi-lingual APIs (Java, Python, Go*) ● Advanced Windows and Joins
  • 22. Apache Beam public class PageViewCount { public static void main(String[] args) { ... pipeline .apply(LiKafkaIO.<PageViewEvent>read() .withTopic("PageView") .withTimestampFn(kv -> new Instant(kv.getValue().header.time)) .withWatermarkFn(kv -> new Instant(kv.getValue().header.time - 60000)) .apply(Values.create()) .apply(MapElements .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers())) .via((PageViewEvent pv) -> KV.of(String.valueOf(pv.header.memberId), 1))) .apply(Window.into(TumblingWindows.of(Duration.standardMinutes(5)))) .apply(Count.perKey()) .apply(MapElements .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(Counter.class))) .via(newCounter())) .apply(LiKafkaIO.<Counter>write().withTopic("PageViewCount") pipeline.run(); } }
  • 23. Apache Beam: Python p = Pipeline(options=pipeline_options) (p | 'read' >> ReadFromKafka(cluster="tracking", topic="PageViewEvent", config=config) | 'extract' >> beam.Map(lambda record: (record.value['memberId'], 1)) | "windowing" >> beam.WindowInto(window.FixedWindows(60*5)) | "compute" >> beam.CombinePerKey(beam.combiners.CountCombineFn()) | 'write' >> WriteToKafka(cluster = "queuing", topic = "PageViewCount", config = config) p.run().waitUntilFinish()
  • 24. Samza SQL ● Declarative streaming SQL API ● Managed service at LinkedIn ● Create and deploy applications in minutes using SQL Shell
  • 25. Samza SQL INSERT INTO kafka.tracking.PageViewCount SELECT memberId, count(*) FROM kafka.tracking.PageView GROUP BY memberId, TUMBLE(current_timestamp, INTERVAL '5' MINUTES)
  • 26. Low Level High Level Samza SQL Apache Beam Java Python Samza APIs
  • 27. Hardened at Scale In production at LinkedIn, Uber, Slack, Intuit, TripAdvisor, VMWare, Redfin, etc. Processing events from Kafka, Brooklin, Kinesis, EventHubs, HDFS, DynamoDB Streams, Databus, etc. Apache Samza Incremental checkpoints for large local state and instant recovery. Local state that works seamlessly across upgrades and failures. APIs for simple and efficient remote I/O Best In Class Stateful Processing Stream and batch processing without changing code. Convenient High-level DSLs and a powerful Low-level API. Universal Processing APIs Write once, run anywhere. Run on a multi-tenant cluster or as an embedded library. Flexible Deployment Model
  • 28. Samza on a Multi-Tenant Cluster • Uses a cluster manager (e.g. YARN) for resource management, coordination, liveness monitoring, etc. • Better resource utilization in a multi-tenant environment. • Works well for large number of applications.
  • 29. Samza as an Embedded Library • Embed Samza as a library in an application. No cluster manager dependency. • Dynamically scale out applications by increasing or decreasing the number of processors at run-time. • Supports rolling upgrades and canaries.
  • 30. ● Uses ZooKeeper for leader election and liveness monitoring for processors. ● Leader JobCoordinator performs work assignments among processors. ● Leader redistributes partitions when processors join or leave the group. Samza as a Library ZooKeeper Based Coordination Zookeeper StreamProcessor Samza Container Job Coordinator StreamProcessor Samza Container Job Coordinator StreamProcessor Samza Container Job Coordinator… Leader
  • 31. Apache Samza • Mature, versatile, and scalable processing framework • Best-in-class support for local and remote state • Powerful and flexible APIs • Can be operated as a platform or used as an embedded library