2. • What is Stream Processing?
• What is Samza?
• Samza Programming API
• Stream Processing @LinkedIn
• Upcoming features
Overview
3. • What’s stream processing
– Input: an unbounded sequence of events
• E.g. web server logs, user activity tracking events,
database changelogs, etc.
– Latency: near real-time
• From milliseconds to minutes, instead of hours to days
– Output: an unbounded sequence of changes to
the derived dataset
• The derived dataset is usually the final or partial
analytic results that can either be in another stream, or
a serving data store
Stream Processing
5. • What are the application requirements?
– Scalable, fast, stateful stream processing
– What scale should we operate at?
• Traffic Volume: 1.4 Trillion events/day
• Intermediate State Size: multi TB / colo (*)
– Why is it expensive to run stream processing at
scale?
• Intermediate data set needs to be stored to allow low
latency processing
• Large volume of data needs to be pulled and pushed
via network
Stream Processing
7. • Samza is a distributed Turing machine
– Single Task Samza Job is a stateful Turing
machine
What’s Samza
Samza Task
Input stream Output stream
State
changelog
checkpoint
8. – Scaling a Samza job: partition the streams
What’s Samza
Input stream A
partition 0
partition 1
partition 2
partition 3
partition n
Samza Task
State
9. – Scaling a Samza job: partition the streams
What’s Samza
Input stream B
partition 0
partition 1
partition 2
partition 3
partition n
Samza Task
State
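A minimal sketch of how keyed partitioning might work, assuming a simple hash of the key modulo the partition count. The class name PartitionSketch and the helper partitionFor are hypothetical and are not part of Samza or Kafka; real partitioners may use a different hash function. It shows that events with the same key always land in the same partition, so a single task sees all events for that key in order.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionSketch {
    // Hypothetical helper: maps a key to a partition by hashing.
    // This is not the partitioner that Kafka or Samza ships with.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        // Each event is tagged with its arrival sequence number.
        String[] memberIds = {"member-1", "member-2", "member-1", "member-3", "member-2"};
        for (int seq = 0; seq < memberIds.length; seq++) {
            int p = partitionFor(memberIds[seq], numPartitions);
            partitions.get(p).add(memberIds[seq] + "#" + seq);
        }
        // Events for one member stay in one partition, in arrival order.
        for (int i = 0; i < numPartitions; i++) {
            System.out.println("partition " + i + " -> " + partitions.get(i));
        }
    }
}
```

Because the assignment depends only on the key and the partition count, changing the partition count would move keys between partitions, which is one reason the assignment is treated as fixed once written.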
10. – Scaling a Samza job: replicating the state
machine
What’s Samza
shared checkpoint
Job
14. • States in Samza
– Checkpoints
• Offsets per input stream partitions
– State Stores
• In-memory or on-disk (RocksDB) derived data set
What’s Samza
Samza Task
Output stream partitions
State
changelog partitions
checkpoint Host 1
15. • States in Samza
– Checkpoints and local state stores are backed by
distributed logs
What’s Samza
Samza Task
Output stream partitions
State
changelog partitions
checkpoint Host 1
17. • States in Samza
– Checkpoints and local state stores are backed
by distributed logs
What’s Samza
Samza Task
Output stream partitions
State
changelog partitions
checkpoint Host 2
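The move from Host 1 to Host 2 can be illustrated with a small simulation in plain Java. This is a sketch under stated assumptions, not Samza's implementation: the classes Durable and Task are hypothetical, in-memory lists stand in for the distributed logs, and the checkpoint is written after every message (a simplification chosen here to keep the example short). A new task instance rebuilds its local state by replaying the changelog and resumes from the checkpointed offset.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChangelogSketch {
    // Stand-in for the durable logs that outlive any single host.
    static final class Durable {
        final List<Map.Entry<String, Integer>> changelog = new ArrayList<>();
        long checkpoint = 0; // next input offset to process
    }

    static final class Task {
        private final Map<String, Integer> store = new HashMap<>(); // local state
        private final Durable durable;
        private long offset;

        Task(Durable durable) {
            this.durable = durable;
            // Restore local state by replaying the changelog, then resume at the checkpoint.
            for (Map.Entry<String, Integer> e : durable.changelog) {
                store.put(e.getKey(), e.getValue());
            }
            this.offset = durable.checkpoint;
        }

        // Count occurrences of each key in the input, up to (not including) offset upTo.
        void run(List<String> input, long upTo) {
            while (offset < upTo && offset < input.size()) {
                String key = input.get((int) offset);
                int n = store.merge(key, 1, Integer::sum);
                durable.changelog.add(new AbstractMap.SimpleImmutableEntry<>(key, n));
                offset++;
                durable.checkpoint = offset; // simplification: checkpoint every message
            }
        }

        int count(String key) {
            return store.getOrDefault(key, 0);
        }
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "a", "c", "a");
        Durable durable = new Durable();

        Task onHost1 = new Task(durable);
        onHost1.run(input, 3);           // Host 1 processes three messages, then fails

        Task onHost2 = new Task(durable); // Host 2 restores from changelog and checkpoint
        onHost2.run(input, input.size());
        System.out.println("count(a) = " + onHost2.count("a"));
    }
}
```

If checkpoints were written less often than changelog entries, a restarted task would reprocess some messages; this sketch sidesteps that case on purpose.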
18. • Multiple Jobs in a Dataflow
What’s Samza
Stream A Stream B Stream C
Stream E
Stream F
Job 1 Job 2
Stream D
Job 3
20. Partition 0
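import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.*;
import org.apache.samza.task.*;
class PageKeyViewsCounterTask implements StreamTask {
  // In-memory view counts per page key (not fault tolerant on their own).
  private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();
  // Output stream; the system and stream names are placeholders, not from the talk.
  private final SystemStream countStream = new SystemStream("kafka", "page-key-view-counts");
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String pageKey = record.get("page-key").toString();
    int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
    // Keying by pageKey sends every count for a page to the same output partition.
    collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
  }
}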
Samza Programming API
33. Stream Processing @LinkedIn
• Content standardization w/ adjunct data set
[Diagram labels: Member Profile DB, Bootstrap Job, Databus, Kafka, Content Standardization, Kafka, Kafka]
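One way to picture processing with an adjunct data set is a stream-table join: profile updates are kept in a local store, and each content event is enriched by a lookup in that store. The sketch below is plain Java and purely illustrative. The class AdjunctJoinSketch, its method names, the record fields, and the "standardization" step (trim and lowercase) are all assumptions made for the example, not the logic of the LinkedIn job.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class AdjunctJoinSketch {
    // Local copy of the adjunct data set: memberId -> profile title (hypothetical fields).
    private final Map<String, String> profileStore = new HashMap<>();

    // Called for each profile change, including records replayed during bootstrap.
    void onProfileUpdate(String memberId, String title) {
        profileStore.put(memberId, title);
    }

    // Called for each content event; joins it with the member's profile.
    String onContentEvent(String memberId, String rawText) {
        String title = profileStore.getOrDefault(memberId, "unknown");
        String standardized = rawText.trim().toLowerCase(Locale.ROOT);
        return memberId + "|" + title + "|" + standardized;
    }

    public static void main(String[] args) {
        AdjunctJoinSketch job = new AdjunctJoinSketch();
        // Bootstrap phase: load the adjunct data before handling content events.
        job.onProfileUpdate("m1", "Engineer");
        // Steady state: content events are joined against the local store.
        System.out.println(job.onContentEvent("m1", "  Hello World "));
        System.out.println(job.onContentEvent("m2", "No Profile Yet"));
    }
}
```

Loading the adjunct data first matters because a content event that arrives before its profile record would otherwise be joined against a missing entry, as the second call in main shows.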
34. Stream Processing @LinkedIn
• Kafka Deployment
– 1.1 Trillion messages / day
• Databus Deployment
– 300 Billion messages / day
• Samza Deployment
– multiple colos
– 10+ Yarn clusters
– 200+ nodes
– 100+ Jobs in production
36. • New features
– Local state store improvements
• RocksDB TTL support
• Fast recovery
– Dynamic configuration
– Easier deployment w/ standalone jobs
– High-level query language for faster
development
Upcoming Features
37. Contact Us / Get Involved
• Open Source
– Documentation: samza.apache.org
– Mailing list: dev@samza.apache.org
– JIRA: https://issues.apache.org/jira/browse/SAMZA
Editor's Notes
RPC: the bulk of what we do; we expect an immediate response (a web page, a bunch of requests sent).
The other extreme is batch processing, typically done in Hadoop (on the order of hours, if not days).
Samza fits in the middle: asynchronous and relatively quick, on the order of milliseconds to minutes.
Stream processing for us = anything asynchronous but not batch computed.
Streams are re-playable, ordered, fault tolerant, and infinite.
This is a very heavyweight definition of a stream (vs. S4, Storm, etc.).
Ordering holds within a partition, not across partitions.
Partition assignment happens on write (e.g. by memberID). Once the partition assignment is done, you cannot modify it. It is a bit like writing to a file.
Multiple jobs in Samza can connect together in a dataflow
The interface is called StreamTask.
One call per upstream
Lifecycle of the task: shutdown, commit, window.
countStream: the output stream.
pageKey: the partition key (a page and its count always go to the same partition).