Apache Samza*: Reliable Stream Processing atop Apache Kafka and YARN
1. Apache Samza*
Reliable Stream Processing atop
Apache Kafka and YARN
Sriram Subramanian
Me on LinkedIn
Me on Twitter - @sriramsub1
* Incubating
2.
3. Agenda
• Why Stream Processing?
• What is Samza’s Design?
• How is Samza’s Design Implemented?
• How can you use Samza?
• Example usage at LinkedIn
42. Stateful Processing
• Windowed Aggregation
– Counting the number of page views for each user per hour
• Stream-Stream Join
– Join stream of ad clicks to stream of ad views to identify the view that
led to the click
• Stream-Table Join
– Join user region info to stream of page views to create an augmented
stream
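The three patterns above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Samza's API: the function names, the tuple and dict layouts, and the rule that a view arrives before its click are all hypothetical.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600

def count_page_views_per_hour(events):
    """Windowed aggregation: (user_id, hour_start) -> number of page views."""
    counts = defaultdict(int)
    for user_id, timestamp in events:
        # Round the timestamp down to the start of its one-hour window.
        hour_start = timestamp - (timestamp % SECONDS_PER_HOUR)
        counts[(user_id, hour_start)] += 1
    return dict(counts)

def join_clicks_to_views(events):
    """Stream-stream join: buffer views by id, emit a pair when the click arrives."""
    pending_views = {}
    for kind, view_id, payload in events:
        if kind == "view":
            pending_views[view_id] = payload
        elif kind == "click" and view_id in pending_views:
            # Assumption: the view is seen before its click.
            yield (pending_views.pop(view_id), payload)

def join_with_region(page_views, user_regions):
    """Stream-table join: look up each user's region and attach it to the view."""
    for user_id, timestamp in page_views:
        yield {"user": user_id, "ts": timestamp, "region": user_regions.get(user_id)}

views = [("alice", 10), ("bob", 20), ("alice", 3599), ("alice", 3600)]
hourly = count_page_views_per_hour(views)

ad_events = [("view", "v1", "ad-42 shown"), ("view", "v2", "ad-7 shown"),
             ("click", "v1", "clicked")]
matched = list(join_clicks_to_views(ad_events))

augmented = list(join_with_region(views, {"alice": "EU", "bob": "US"}))
```

In every case the task has to remember something between messages (the counts, the pending views, the region table), which is what makes the processing stateful.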
43. How do people do this?
• In-memory state with checkpointing
– Periodically save out the task’s in-memory
data
– Becomes very expensive as state grows
– Some implementations checkpoint only diffs, but
this adds complexity
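A minimal sketch of in-memory state with periodic full checkpoints, assuming a hypothetical `CheckpointedCounter` class and a JSON file as the checkpoint target (neither comes from the slides). Each checkpoint rewrites the whole state, which is why the cost grows with the state.

```python
import json
import os
import tempfile

class CheckpointedCounter:
    """Keeps counts in memory and writes a full snapshot every N messages."""

    def __init__(self, path, checkpoint_every=3):
        self.path = path
        self.checkpoint_every = checkpoint_every
        self.counts = {}
        self.processed = 0
        # On restart, recover whatever the last snapshot contained.
        if os.path.exists(path):
            with open(path) as f:
                snapshot = json.load(f)
            self.counts = snapshot["counts"]
            self.processed = snapshot["processed"]

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.processed += 1
        if self.processed % self.checkpoint_every == 0:
            self.checkpoint()

    def checkpoint(self):
        # The entire state is serialized each time, not just the changes.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"counts": self.counts, "processed": self.processed}, f)
        os.replace(tmp_path, self.path)  # atomic swap of the snapshot file

path = os.path.join(tempfile.mkdtemp(), "task-state.json")
task = CheckpointedCounter(path)
for key in ["a", "b", "a", "c"]:
    task.process(key)

# A restarted task only sees the state as of the last checkpoint.
restored = CheckpointedCounter(path)
```

Here the fourth message ("c") was processed after the last checkpoint, so the restored task does not know about it until the input is replayed.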
44. How do people do this?
• Using an external store
– Push state to an external store
– Performance suffers because of remote queries
– Lack of isolation
– Limited query capabilities
62. At LinkedIn
• 10+ billion writes per day
• 172k messages per second (average)
• 60+ billion messages per day to real-time consumers
63. Apache Kafka
• Models streams as topics
• Each topic is partitioned and each partition is
replicated
• Producer sends messages to a topic
• Messages are stored in brokers
• Consumers consume from a topic (pull from broker)
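The model above can be illustrated with a toy in-memory topic. This is not the Kafka client API: the `Topic` class and its `send` and `fetch` methods are hypothetical, replication is left out, and CRC32 is only a stand-in for whatever partitioner a real producer uses.

```python
import zlib

class Topic:
    """A topic as a fixed set of partitions, each an append-only list."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Producer side: the key's hash picks the partition; returns (partition, offset)."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def fetch(self, partition, offset, max_messages=10):
        """Consumer side: pull messages starting at an offset the consumer tracks itself."""
        return self.partitions[partition][offset:offset + max_messages]

topic = Topic(num_partitions=4)
p1, o1 = topic.send("user-1", "view:/home")
p2, o2 = topic.send("user-1", "view:/jobs")

# Messages with the same key land in the same partition, in send order.
batch = topic.fetch(p1, offset=0)
```

Because the consumer pulls by offset, it can re-read a partition from any earlier position.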
64. YARN: Yet Another Resource Negotiator
• Framework to run your code on a grid of
machines
• Distributes our tasks across multiple
machines
• Notifies our framework when a task has
died
• Isolates our tasks from each other
99. OK, now lots of streams with TreeIDs…
[Diagram: *_service_call → Samza job: Repartition-By-TreeID → all_service_calls (partitioned by TreeID) → Samza job: Assemble Call Graph → service_call_graphs]
• Near real-time holistic view of how we’re actually serving data
• Compare day-over-day, cost, changes, outages
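The two jobs on this slide can be sketched as follows. It is a rough illustration only: the record fields (`tree_id`, `caller`, `callee`), the integer TreeIDs, and the modulo partitioning are assumptions, not details taken from the talk.

```python
from collections import defaultdict

def repartition_by_tree_id(service_calls, num_partitions):
    """First job: route every call with the same TreeID to the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for call in service_calls:
        partitions[call["tree_id"] % num_partitions].append(call)
    return partitions

def assemble_call_graphs(partition):
    """Second job: within one partition, group calls into one edge list per TreeID."""
    graphs = defaultdict(list)
    for call in partition:
        graphs[call["tree_id"]].append((call["caller"], call["callee"]))
    return dict(graphs)

calls = [
    {"tree_id": 7, "caller": "frontend", "callee": "profile"},
    {"tree_id": 8, "caller": "frontend", "callee": "search"},
    {"tree_id": 7, "caller": "profile", "callee": "storage"},
]
partitions = repartition_by_tree_id(calls, num_partitions=2)
graphs = assemble_call_graphs(partitions[1])
```

Repartitioning first means a single task sees every call for a given TreeID, so the graph can be assembled locally.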
100. Thank you
• Quick start: bit.ly/hello-samza
• Project homepage:
samza.incubator.apache.org
• Newbie issues: bit.ly/samza_newbie_issues
• Detailed Samza and YARN talk:
bit.ly/samza_and_yarn
• A must-read: http://bit.ly/jay_on_logs
• Twitter: @samzastream
• Me on Twitter: @sriramsub1
Editor's notes
- Stream processing for us = anything asynchronous, but not batch computed.
- 25% of code is async. 50% is RPC/online. 25% is batch.
- Stream processing is the worst supported.
Provide timely, relevant updates to your newsfeed
Update search results with new information as it appears
- Open area of research
- Been around for 20 years
Example – Stream 1 -> Ad Views
partitioned
- re-playable
- ordered
- fault tolerant
- infinite
- very heavyweight definition of a stream (vs. S4, Storm, etc.)
At-least-once messaging. Duplicates are possible. Future: exactly-once semantics. Transparent to the user. No ack’ing API.
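Why at-least-once delivery produces duplicates can be shown with a small simulation. It assumes a hypothetical `run_task` helper that records its input offset only every few messages; after a crash the task resumes from the last recorded offset, so messages processed after that point are seen again.

```python
def run_task(log, start_offset, checkpoint_every, crash_at=None):
    """Process messages from start_offset; return (outputs, last checkpointed offset)."""
    outputs = []
    checkpoint = start_offset
    for offset in range(start_offset, len(log)):
        if crash_at is not None and offset == crash_at:
            # Simulated crash: progress since the last checkpoint is forgotten.
            return outputs, checkpoint
        outputs.append(log[offset])
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = offset + 1  # next offset to read after a restart
    # Assumption: a clean finish records the final position.
    return outputs, len(log)

log = ["m0", "m1", "m2", "m3", "m4"]

# First run checkpoints after m1, processes m2, then crashes before m3.
first_out, resume_from = run_task(log, 0, checkpoint_every=2, crash_at=3)

# The restart replays from the checkpoint, so m2 is processed a second time.
second_out, _ = run_task(log, resume_from, checkpoint_every=2)
all_out = first_out + second_out
```

No message is lost, but downstream consumers need to tolerate the repeated message.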