- The document discusses the Samza high-level API, which lets a single program express a multi-stage stream processing pipeline using built-in transformation functions, and a more flexible deployment model that runs Samza applications either embedded in a user program or in a cluster.
- It also covers convergence between batch and stream processing in Samza, where the same application logic can run on either streaming or batch data with only configuration changes.
1. Unified processing with the Samza High-level API
Yi Pan
Streams Team @LinkedIn
Committer and PMC Chair, Apache Samza
2. Agenda
• High-level API
• Flexible Deployment Model
• Convergence between Batch and Stream Processing
3. Application Example
Application logic: count PageViewEvent per member in a 5-minute window and send the counts to PageViewEventPerMemberStream.

Pipeline: PageViewEvent → re-partition by memberId → window → map → sendTo → PageViewEventPerMemberStream
4. Application Example
Application in the low-level API: the pipeline splits into two jobs connected by an intermediate stream.

Job-1 (PageViewRepartitionTask): PageViewEvent → re-partition → PageViewEventByMemberId
Job-2 (PageViewByMemberIdCounterTask): PageViewEventByMemberId → window → map → sendTo → PageViewEventPerMemberStream
5. Application in Low-level API
• Job-1: Repartition job

public class PageViewRepartitionTask implements StreamTask {
  private final SystemStream pageViewByMIDStream =
      new SystemStream("kafka", "PageViewEventByMemberId");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) throws Exception {
    PageViewEvent pve = (PageViewEvent) envelope.getMessage();
    // Re-key each event by memberId so downstream counting is partitioned per member.
    collector.send(new OutgoingMessageEnvelope(pageViewByMIDStream, pve.memberId, pve));
  }
}
6. Application in Low-level API
• Job-2: Window-based counter

public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
  private final SystemStream pageViewCounterStream =
      new SystemStream("kafka", "PageViewEventPerMemberStream");
  private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters;
  private Long windowSize;

  @Override
  public void init(Config config, TaskContext context) throws Exception {
    this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>)
        context.getStore("windowed-counter-store");
    this.windowSize = config.getLong("task.window.ms");
  }

  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
    // Emit the counters accumulated in the expired window.
    getWindowCounterEvent().forEach(counter ->
        collector.send(new OutgoingMessageEnvelope(pageViewCounterStream, counter.memberId, counter)));
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) throws Exception {
    PageViewEvent pve = (PageViewEvent) envelope.getMessage();
    countPageViewEvent(pve);
  }

  // getWindowCounterEvent() and countPageViewEvent() read and update the
  // windowed-counter-store; their bodies are omitted on the slide.
}
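The two helper methods are elided on the slide. A minimal plain-Java sketch of what they might do, with a HashMap standing in for the RocksDB-backed KeyValueStore and a plain String for the member id (all names in this sketch are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the counter-store logic behind
// countPageViewEvent()/getWindowCounterEvent(). A HashMap stands in for
// Samza's KeyValueStore; real state would survive restarts via changelog.
public class WindowedCounterSketch {
    private final Map<String, Integer> windowedCounters = new HashMap<>();

    // process() path: bump the counter for this member in the current window.
    public void countPageViewEvent(String memberId) {
        windowedCounters.merge(memberId, 1, Integer::sum);
    }

    // window() path: snapshot all counters and reset state for the next window.
    public Map<String, Integer> getWindowCounterEvent() {
        Map<String, Integer> snapshot = new HashMap<>(windowedCounters);
        windowedCounters.clear();
        return snapshot;
    }

    public static void main(String[] args) {
        WindowedCounterSketch task = new WindowedCounterSketch();
        task.countPageViewEvent("member-1");
        task.countPageViewEvent("member-1");
        task.countPageViewEvent("member-2");
        System.out.println(task.getWindowCounterEvent().size()); // 2 members counted
        System.out.println(task.getWindowCounterEvent().size()); // 0: window state was reset
    }
}
```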
8. High Level API
• Samza High Level API (NEW)
– Ability to express a multi-stage processing pipeline in a single user program
– Built-in library of high-level stream transformation functions
9. Application in High Level API (NEW)

public class RepartitionAndCounterExample implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    Supplier<Integer> initialValue = () -> 0;
    MessageStream<PageViewEvent> pageViewEvents =
        graph.getInputStream("pageViewEventStream", (k, m) -> (PageViewEvent) m);
    OutputStream<String, MyStreamOutput, MyStreamOutput> pageViewEventPerMemberStream = graph
        .getOutputStream("pageViewEventPerMemberStream", m -> m.memberId, m -> m);
    pageViewEvents
        .partitionBy(m -> m.memberId)                 // re-partition by memberId
        .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5),
            initialValue, (m, c) -> c + 1))           // 5-minute tumbling count
        .map(MyStreamOutput::new)
        .sendTo(pageViewEventPerMemberStream);
  }
}

The whole pipeline is expressed with built-in transform functions (partitionBy, window, map, sendTo).
10. Application in High Level API (NEW)
• Visualized execution plan
[Figure: visualization of the generated execution plan]
11. High Level API
• Built-in transformation functions in the high-level API

Stateless functions:
  filter       select a subset of messages from the stream
  map          map one input message to an output message
  flatMap      map one input message to 0 or more output messages
  merge        union all inputs into a single output stream
I/O functions:
  partitionBy  re-partition the input messages based on a specific field
  sendTo       send the result to an output stream
  sink         send the result to an external system (e.g. an external DB)
Stateful functions:
  window       window aggregation on the input stream
  join         join messages from two input streams
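The stateless functions have the same per-message semantics as their java.util.stream counterparts, which makes a finite-batch illustration straightforward (merge corresponds to Stream.concat here; this is an analogy, not the Samza API):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Finite-batch illustration of the stateless transforms' semantics using
// java.util.stream; Samza applies the same operations continuously, message
// by message, over unbounded streams.
public class StatelessTransforms {
    public static void main(String[] args) {
        // filter: keep a subset of messages
        List<Integer> evens = Stream.of(1, 2, 3, 4)
            .filter(n -> n % 2 == 0)
            .collect(Collectors.toList());
        System.out.println(evens);                    // [2, 4]

        // flatMap: one input message -> 0..n output messages
        List<String> words = Stream.of("a b", "c")
            .flatMap(line -> Stream.of(line.split(" ")))
            .collect(Collectors.toList());
        System.out.println(words);                    // [a, b, c]

        // merge: union two input streams into a single output stream
        List<Integer> merged = Stream.concat(Stream.of(1), Stream.of(2, 3))
            .collect(Collectors.toList());
        System.out.println(merged);                   // [1, 2, 3]
    }
}
```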
12. Agenda
• High-level API
• Flexible Deployment Model
• Convergence between Batch and Stream Processing
13. Limitations with Current Samza Deployment
• Tight dependency on YARN
• Can’t easily port over to non-YARN clusters (e.g. Mesos, Kubernetes, AWS)
• Can’t directly embed stream processing in another application (e.g. a web frontend)
14. Flexible Deployment Model
• Flexible deployment of Samza applications
– Samza-as-a-library (NEW)
• Run embedded stream processing in a user program
• ZooKeeper-based coordination between multiple instances of the user program
– Samza in a cluster
• Run stream processing as a managed program in a cluster (e.g. SamzaContainer in YARN)
• Use the cluster manager (e.g. YARN) to provide deployment, coordination, and resource management
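Concretely, the two models are selected through configuration rather than code. A sketch of the relevant properties (values illustrative; the ZooKeeper coordinator settings also appear on a later slide):

```properties
# Embedded (Samza-as-a-library): processors coordinate through ZooKeeper
job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
job.coordinator.zk.connect=my-zk.server:2191

# Cluster-based: submit the job to YARN instead
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
```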
15. Samza-as-a-library
A Samza job is composed of a collection of standalone processes.
● Full control over
  ● the application’s life cycle
  ● physical resources allocated to Samza processors
  ● configuration and initialization

[Figure: several StreamProcessors, each containing a SamzaContainer and a JobCoordinator; one JobCoordinator acts as leader]
16. Samza-as-a-library
● ZooKeeper-based JobCoordinator (stateful use case)
  ● The JobCoordinator uses ZooKeeper for leader election
  ● The leader performs partition assignment among all active StreamProcessors

[Figure: StreamProcessors (SamzaContainer + JobCoordinator) coordinating through ZooKeeper]
17. Samza-as-a-library
● Embedded application code example

public class WikipediaZkLocalApplication {
  /**
   * Executes the application using the local application runner.
   * It takes two required command-line arguments:
   *   config-factory: the fully-qualified {@link org.apache.samza.config.factories.PropertiesConfigFactory} class name
   *   config-path: path to the application properties
   *
   * @param args command line arguments
   */
  public static void main(String[] args) {
    CommandLine cmdLine = new CommandLine();
    OptionSet options = cmdLine.parser().parse(args);
    Config config = cmdLine.loadConfig(options);
    // Run embedded: the LocalApplicationRunner executes the app in this process.
    LocalApplicationRunner runner = new LocalApplicationRunner(config);
    WikipediaApplication app = new WikipediaApplication();
    runner.run(app);
    runner.waitForFinish();
  }
}
18. Samza-as-a-library
● ZooKeeper coordination is enabled purely through configuration; the embedded application code on the previous slide is unchanged:
job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
job.coordinator.zk.connect=my-zk.server:2191
19. Samza-as-a-library
• Embedded application launch sequence

myApp.main() → StreamApplication → LocalApplicationRunner runner.run() → n StreamProcessors, each started via streamProcessor.start()
20. Samza in a Cluster
• Cluster-based application launch sequence

run-app.sh (app.class=my.app.MyStreamApplication) → RemoteApplicationRunner main() → jobRunner.run() submits the job to the YARN RM → run-jc.sh starts the JobCoordinator (task.execute=run-local-app.sh) → run-local-app.sh → myApp.main() → LocalApplicationRunner runner.run() → n StreamProcessors, each started via streamProcessor.start()
23. Stream Application in Batch
Application logic: count PageViewEvent per member in a 5-minute window and send the counts to PageViewEventPerMemberStream.

Pipeline: PageViewEvent → re-partition by memberId → window → map → sendTo → PageViewEventPerMemberStream, with both endpoints on HDFS:
PageViewEvent: hdfs://mydbsnapshot/PageViewEvent/
PageViewEventPerMemberStream: hdfs://myoutputdb/PageViewEventPerMemberFiles
24. Stream Application in Batch
• No code change in the application; only the stream configuration changes.

old config:
streams.pageViewEventStream.system=kafka
streams.pageViewEventPerMemberStream.system=kafka

new config:
streams.pageViewEventStream.system=hdfs
streams.pageViewEventStream.physical.name=hdfs://mydbsnapshot/PageViewEvent/
streams.pageViewEventPerMemberStream.system=hdfs
streams.pageViewEventPerMemberStream.physical.name=hdfs://myoutputdb/PageViewEventPerMemberFiles
25. Samza 0.13 Architecture

API: High-level API — unified stream & batch processing
Runner: Remote Runner (run in a remote cluster) / Local Runner (run locally)
Deployment: Cluster-based (YARN, Mesos) / Embedded (ZooKeeper, standalone)
Processor: StreamProcessor — streams (Kafka, Kinesis, HDFS, ...), local state (RocksDB, in-memory), remote data, multithreading
26. Future Work
• Samza runner for Apache Beam
• Event-time processing
• Support for exactly-once processing
• Partition expansion for stateful applications
• Easy access to adjunct datasets
• SQL over streams