2. • What is Stream Processing?
• What is Samza?
• Stream Processing @LinkedIn
• Upcoming features
Overview
3. • What’s stream processing
– Input: an unbounded sequence of events
• E.g. web server logs, user activity tracking events,
database changelogs, etc.
– Latency: near real-time
• From milliseconds to minutes, instead of hours to days
– Output: an unbounded sequence of changes to
the derived dataset
• The derived dataset is usually a final or partial
analytic result, written either to another stream or to
a serving data store
Stream Processing
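To make the input/output definition concrete, here is a minimal Python sketch (not Samza code). It treats a stream processor as a function from an unbounded iterator of events to an iterator of changes to a derived dataset. The event shape and the function name `page_view_counts` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def page_view_counts(events: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Consume events one at a time and emit changes to the derived dataset."""
    counts: Dict[str, int] = defaultdict(int)
    for event in events:  # in a real deployment this iterable never ends
        counts[event["page"]] += 1
        # Each output record is an update (key, new value), not a full snapshot.
        yield event["page"], counts[event["page"]]


# Hypothetical sample input standing in for user activity tracking events.
sample_events = [{"page": "/home"}, {"page": "/jobs"}, {"page": "/home"}]
changes = list(page_view_counts(sample_events))
print(changes)
```

The output is itself a stream of updates, which is why it can feed either another stream or a serving data store.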
5. • What are the application requirements?
– Scalable, fast, stateful stream processing
– What scale should we operate at?
• Traffic Volume: 1.4 Trillion events/day
• Intermediate State Size: multi TB / colo (*)
– Why is it expensive to run stream processing at
scale?
• Intermediate data set needs to be stored to allow low
latency processing
• Large volume of data needs to be pulled and pushed
via network
Stream Processing
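A back-of-the-envelope calculation gives a feel for the traffic volume quoted above. The 1.4 trillion events/day figure is from the slide; the 1 KB average event size is purely an assumption for illustration, not a number from the talk.

```python
EVENTS_PER_DAY = 1_400_000_000_000  # 1.4 trillion, from the slide
SECONDS_PER_DAY = 24 * 60 * 60

avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{avg_events_per_sec:,.0f} events/second on average")

# ASSUMPTION: 1 KB per event, chosen only to illustrate the network cost.
ASSUMED_EVENT_BYTES = 1_000
avg_bytes_per_sec = avg_events_per_sec * ASSUMED_EVENT_BYTES
print(f"{avg_bytes_per_sec / 1e9:.1f} GB/second under the assumed event size")
```

That works out to roughly 16.2 million events per second on average, before accounting for peaks or for data that is read and written more than once.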
7. • Samza is a distributed Turing machine
– Single Task Samza Job is a stateful Turing
machine
What’s Samza
[Figure: a Samza Task reading an input stream and writing an output stream, with state, changelog, and checkpoint]
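The single-task picture can be sketched as a small state machine in Python. This is a toy model under stated assumptions, not the Samza API: the class name `CountingTask` and its fields are hypothetical, and the changelog and checkpoint are plain in-memory values rather than durable logs.

```python
from typing import Dict, List, Tuple


class CountingTask:
    """Toy stateful task: counts how many times each key has been seen."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}
        self.changelog: List[Tuple[str, int]] = []  # every state write, in order
        self.checkpoint: int = -1  # offset of the last processed input message

    def process(self, offset: int, key: str) -> Tuple[str, int]:
        new_value = self.state.get(key, 0) + 1
        self.state[key] = new_value               # state transition
        self.changelog.append((key, new_value))   # record the write
        self.checkpoint = offset                  # remember input position
        return key, new_value                     # record for the output stream


input_stream = ["alice", "bob", "alice"]
task = CountingTask()
output_stream = [task.process(offset, key) for offset, key in enumerate(input_stream)]
print(output_stream, task.checkpoint)
```

Each input message drives one transition: the state changes, the change is logged, the input position advances, and an output record is emitted.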
8. – Scaling a Samza job: partition the streams
What’s Samza
[Figure: input stream A split into partitions 0, 1, 2, 3, ..., n]
9. – Scaling a Samza job: partition the streams
What’s Samza
[Figure: input stream B split into partitions 0, 1, 2, 3, ..., n]
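Partitioning a stream usually means routing each message by its key, so that all messages for a key land in the same partition. The sketch below illustrates the idea in Python. CRC32 is a stand-in hash chosen because it is stable across runs; it is not claimed to be what Kafka or Samza uses, and `partition_for` is a hypothetical helper.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 4
events = [("alice", "click"), ("bob", "view"), ("alice", "view"), ("carol", "click")]

partitions: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
for key, payload in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, payload))

for index in sorted(partitions):
    print(index, partitions[index])
```

Because the mapping depends only on the key, per-key ordering is preserved within a partition and each partition can be processed independently.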
10. – Scaling a Samza job: replicating the state
machine
What’s Samza
[Figure: a Job made up of replicated task state machines with a shared checkpoint]
12. • States in Samza
– Checkpoints
• Offsets per input stream partition
– State Stores
• In-memory or on-disk (RocksDB) derived data set
What’s Samza
[Figure: Samza Task on Host 1, with output stream partitions, state, changelog partitions, and checkpoint]
13. • States in Samza
– Checkpoints and local state stores are backed by
distributed logs
What’s Samza
[Figure: Samza Task on Host 1, with output stream partitions, state, changelog partitions, and checkpoint]
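Backing checkpoints and state stores with logs is what makes recovery possible: a restarted task can rebuild its local state by replaying the changelog, then resume reading input just after the checkpointed offset. The Python sketch below shows that flow with in-memory lists standing in for the distributed logs. All names and data are hypothetical, and it ignores the question of how changelog writes and checkpoints are kept consistent with each other.

```python
from typing import Dict, List, Tuple


def restore_state(changelog: List[Tuple[str, int]]) -> Dict[str, int]:
    """Rebuild the local store by replaying every logged write in order."""
    state: Dict[str, int] = {}
    for key, value in changelog:
        state[key] = value  # later entries overwrite earlier ones
    return state


# What survived the failure (stand-ins for the durable logs).
changelog = [("alice", 1), ("bob", 1), ("alice", 2)]
checkpoint = 2  # offset of the last input message processed before the crash
input_stream = ["alice", "bob", "alice", "bob"]

# Recovery: restore state, then resume after the checkpointed offset.
state = restore_state(changelog)
for offset in range(checkpoint + 1, len(input_stream)):
    key = input_stream[offset]
    state[key] = state.get(key, 0) + 1
    changelog.append((key, state[key]))
    checkpoint = offset

print(state, checkpoint)
```

The local store stays fast because it is on the same host as the task, while the log provides durability when that host is lost.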
17. Stream Processing @LinkedIn
• Content standardization w/ adjunct data set
[Figure: pipeline with Member Profile DB, Bootstrap Job, Databus, Kafka, and Content Standardization]
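A plausible reading of this pipeline is a stream-table join: the adjunct data set is loaded in full first (the bootstrap step), and incoming events are then standardized by looking values up in it. The Python sketch below illustrates that pattern only. The title mappings, event fields, and function names are invented for the example and do not come from the talk.

```python
from typing import Dict, Iterable, Iterator, Tuple


def bootstrap(adjunct_stream: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Read the whole adjunct data set into a local lookup table first."""
    table: Dict[str, str] = {}
    for raw, standard in adjunct_stream:
        table[raw.lower()] = standard
    return table


def standardize(events: Iterable[dict], table: Dict[str, str]) -> Iterator[dict]:
    """Replace each raw title with its standard form when one is known."""
    for event in events:
        title = event["title"]
        yield {**event, "title": table.get(title.lower(), title)}


# Hypothetical adjunct data and profile-change events.
adjunct = [("sw engineer", "Software Engineer"), ("swe", "Software Engineer")]
profile_updates = [{"member": 1, "title": "SWE"}, {"member": 2, "title": "Chef"}]

table = bootstrap(adjunct)
standardized = list(standardize(profile_updates, table))
print(standardized)
```

Finishing the bootstrap before processing live events avoids emitting unstandardized output simply because the lookup table was not yet populated.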
18. Stream Processing @LinkedIn
• Kafka Deployment
– 1.1 Trillion messages / day
• Databus Deployment
– 300 Billion messages / day
• Samza Deployment
– multiple colos
– 10+ YARN clusters
– 200+ nodes
– 100+ Jobs in production
20. • New features
– Local state store improvements
• RocksDB TTL support
• Fast recovery
– Dynamic configuration
– Easier deployment w/ standalone jobs
– High-level query language for faster
development
Upcoming Features
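To illustrate what TTL support means for a state store, here is a minimal Python sketch of time-to-live semantics: an entry becomes unreadable once it is older than the configured TTL. This is an in-memory toy, not RocksDB's or Samza's API; the class `TtlStore` is hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
from typing import Dict, Optional, Tuple


class TtlStore:
    """Key-value store whose entries expire ttl_seconds after being written."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._data: Dict[str, Tuple[float, str]] = {}  # key -> (written_at, value)

    def put(self, key: str, value: str, now: float) -> None:
        self._data[key] = (now, value)

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        written_at, value = entry
        if now - written_at >= self.ttl_seconds:
            del self._data[key]  # drop the expired entry on read
            return None
        return value


store = TtlStore(ttl_seconds=60)
store.put("session-1", "active", now=0)
fresh = store.get("session-1", now=30)
expired = store.get("session-1", now=61)
print(fresh, expired)
```

Expiring old entries keeps the intermediate state bounded for jobs that only need recent data, such as windowed joins or session tracking.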
21. Contact Us / Get Involved
• Open Source
– Documentation: samza.apache.org
– Mailing list: dev@samza.apache.org
– JIRA: https://issues.apache.org/jira/browse/SAMZA
Editor's notes
RPC: the bulk of what we do, where we expect an immediate response (a web page load sends a bunch of requests).
The other extreme is batch processing, typically done in Hadoop, on the order of hours if not days.
Samza fits in the middle: asynchronous but relatively quick, on the order of milliseconds to minutes.
Stream processing, for us, is anything asynchronous but not batch-computed.