http://flink-forward.org/kb_sessions/apache-beam-a-unified-model-for-batch-and-streaming-data-processing/
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, et al.) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, touch on its evolution, describe main concepts in the programming model, and compare with similar systems. We’ll go from a simple scenario to a relatively complex data processing pipeline, and finally demonstrate execution of that pipeline on multiple runtimes.
Kenneth Knowles - Apache Beam - A Unified Model for Batch and Streaming Data Processing
1. Apache Beam (incubating)
Kenneth Knowles
Apache Beam (incubating) PPMC
Software Engineer @ Google
klk@google.com / @KennKnowles Flink Forward 2016
https://goo.gl/jzlvD9
A Unified Model for Batch and Streaming Data Processing
2. What is Apache Beam?
Apache Beam is
a unified programming model
for expressing
efficient and portable
data processing pipelines.
3. Agenda
1. Big Data: Infinite & Out of Order
2. The Beam Model
3. Beam Project / Technical Vision
17. The Beam Vision (for users)
Sum Per Key
input.apply(
    Sum.integersPerKey())
Java
input | Sum.PerKey()
Python
Runners: Apache Flink, Apache Spark, Cloud Dataflow, Apache Gearpump (incubating), Apache Apex, ...
18. What your (Java) Code Looks Like
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://..."));
p.run();
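The same pipeline can be sketched in plain Python with no Beam dependency, to show what each stage computes (the `word_count` helper is illustrative, not a Beam API):

```python
import re
from collections import Counter

def word_count(lines):
    # FlatMapElements: split each line on non-letter characters.
    words = (w for line in lines for w in re.split(r"[^a-zA-Z']+", line))
    # Filter: drop empty tokens; Count.perElement: tally occurrences.
    counts = Counter(w for w in words if w)
    # MapElements: format each (word, count) pair.
    return sorted(f"{word}: {n}" for word, n in counts.items())

print(word_count(["to be or", "not to be"]))
# → ['be: 2', 'not: 1', 'or: 1', 'to: 2']
```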
19. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
20. The Beam Model: Asking the Right Questions
What are you computing? → Aggregations, transformations, ...
Where in event time?
When in processing time are results produced?
How do refinements relate?
22. The Beam Model: What are you computing?
Sum Per Key
input.apply(Sum.integersPerKey())
.apply(BigQueryIO.Write.to(...));
Java
input | Sum.PerKey()
| Write(BigQuerySink(...))
Python
25. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time? → Event-time windowing
When in processing time are results produced?
How do refinements relate?
34. Event-Time Windows
(implementing processing-time windows)
[Diagram: events plotted by processing time vs. event time]
Processing-time windows are just a special case: throw away your data's timestamps and replace them with "now()".
35. The Beam Model: Where in Event Time?
Sum Per Key / Window Into
input.apply(
    Window.into(
        FixedWindows.of(
            Duration.standardHours(1))))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...));
Java
input | WindowInto(FixedWindows(3600))
      | Sum.PerKey()
      | Write(BigQuerySink(...))
Python
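A minimal sketch, in plain Python, of what fixed-window assignment does to the grouping key (hypothetical helpers, not the Beam API):

```python
def fixed_window(timestamp, size):
    # A fixed (tumbling) window of width `size` seconds: each event
    # timestamp falls into exactly one window [start, start + size).
    start = timestamp - (timestamp % size)
    return (start, start + size)

def windowed_sums(events, size):
    # events: (key, timestamp, value) triples. Windowing extends the
    # grouping key to (key, window), as in WindowInto + Sum.PerKey.
    sums = {}
    for key, ts, value in events:
        w = fixed_window(ts, size)
        sums[(key, w)] = sums.get((key, w), 0) + value
    return sums
```

For example, with one-hour windows (size 3600), timestamps 10 and 20 land in window (0, 3600) while 3700 lands in (3600, 7200), so the same key produces two separate sums.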
36. The Beam Model: Where in Event Time?
Fixed Windows (also called Tumbling), Sliding Windows, User Sessions
1. Assign each timestamped event to one or more windows
2. Merge those windows according to custom logic
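The two-step definition above (assign, then merge) is what makes session windows expressible. A toy sketch of session merging, assuming a fixed inactivity gap (names are illustrative):

```python
def merge_sessions(timestamps, gap):
    # Step 1: each event is assigned its own proto-window [t, t + gap).
    windows = sorted((t, t + gap) for t in timestamps)
    # Step 2: overlapping windows are merged, leaving one session
    # window per burst of activity separated by silence of >= gap.
    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With a gap of 5, events at times 1, 2, and 10 collapse into two sessions: one covering the burst at 1-2, and one for the lone event at 10.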
38. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced? → Watermarks & triggers
How do refinements relate?
47. The Beam Model: When in Processing Time?
Sum Per Key / Window Into
input
    .apply(Window.into(FixedWindows.of(...))
        .triggering(
            AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...));
Java
input | WindowInto(FixedWindows(3600),
                   trigger=AfterWatermark())
      | Sum.PerKey()
      | Write(BigQuerySink(...))
Python
Trigger after end of window
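A toy model of the watermark trigger's behavior, in plain Python: per-window sums are held as state, and a window's result is emitted only once the watermark has passed its end (function and names are illustrative, not the Beam API):

```python
def on_time_panes(events, watermark, size):
    # Accumulate per-window sums for (timestamp, value) pairs,
    # using fixed windows of width `size`.
    sums = {}
    for ts, value in events:
        start = ts - (ts % size)
        sums[start] = sums.get(start, 0) + value
    # AfterWatermark.pastEndOfWindow(): only windows whose end the
    # watermark has passed are emitted; later windows stay buffered.
    return {s: v for s, v in sums.items() if s + size <= watermark}
```

With the watermark at 10 and windows of width 10, events at timestamps 1 and 2 produce an emitted pane for window start 0, while an event at timestamp 11 is still held back.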
57. Build a finely tuned trigger for your use case
AfterWatermark.pastEndOfWindow()                        // bill at end of month
    .withEarlyFirings(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))  // near-real-time estimates
    .withLateFirings(AfterPane.elementCountAtLeast(1))  // immediate corrections
64. Trigger Catalogue
Basic Triggers:
AfterEndOfWindow()
AfterCount(n)
AfterProcessingTimeDelay(Δ)
Composite Triggers:
AfterEndOfWindow()
    .withEarlyFirings(A)
    .withLateFirings(B)
AfterAny(A, B)
AfterAll(A, B)
Repeat(A)
Sequence(A, B)
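The composite triggers can be read as boolean combinators over trigger firings. A toy sketch where a trigger is a predicate over pane state (here just a `count` of buffered elements and a `watermark_past_end` flag; all names are illustrative, not the Beam API):

```python
def after_count(n):
    # Fires once at least n elements have arrived in the pane.
    return lambda state: state["count"] >= n

def after_end_of_window():
    # Fires once the watermark has passed the end of the window.
    return lambda state: state["watermark_past_end"]

def after_any(*triggers):
    # Fires as soon as any sub-trigger would fire.
    return lambda state: any(t(state) for t in triggers)

def after_all(*triggers):
    # Fires only once every sub-trigger would have fired.
    return lambda state: all(t(state) for t in triggers)
```

For example, `after_any(after_count(3), after_end_of_window())` fires on a pane with one element once the watermark passes, while `after_all` of the same two would keep waiting for two more elements.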
65. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate? → Accumulation mode
66. The Beam Model: How do refinements relate?
[Diagram: the same sequence of trigger firings over one window, shown under both accumulation modes]
Window.into(...)
    .triggering(...)
    .discardingFiredPanes()
Window.into(...)
    .triggering(...)
    .accumulatingFiredPanes()
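The difference between the two accumulation modes can be sketched in a few lines of plain Python (an illustrative model, not the Beam API):

```python
def pane_outputs(arrivals_per_firing, accumulating):
    # arrivals_per_firing: the new values that arrive before each
    # trigger firing of one window.
    # discardingFiredPanes(): each pane emits only the new delta.
    # accumulatingFiredPanes(): each pane emits the running total.
    out, total = [], 0
    for new_values in arrivals_per_firing:
        delta = sum(new_values)
        total += delta
        out.append(total if accumulating else delta)
    return out

# Three firings, with values [3, 2], [4], and [1, 1] arriving between them:
print(pane_outputs([[3, 2], [4], [1, 1]], accumulating=False))  # → [5, 4, 2]
print(pane_outputs([[3, 2], [4], [1, 1]], accumulating=True))   # → [5, 9, 11]
```

Accumulating mode suits sinks that overwrite a value per window (each refinement replaces the last); discarding mode suits sinks that add up whatever they receive.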
67. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
70. The Beam Vision
End users - who want to write pipelines in a language that's familiar.
SDK authors - who want to make Beam concepts available in new languages.
Runner authors - who have a distributed processing environment and want to run Beam pipelines.
Beam Runner API: build and submit a pipeline
Beam Fn API: invoke user-definable functions
SDKs: Beam Java, Beam Python, other languages
Runners (execution): Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating)
73. Why Apache Beam?
Unified - one model handles batch and streaming use cases.
Portable - pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - supports user- and community-driven SDKs, runners, transformation libraries, and IO connectors.
74. Why Apache Beam?
http://data-artisans.com/why-apache-beam/
"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."
- Kostas Tzoumas (data Artisans)
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."
- Tyler Akidau (Google)
75. Beam here at Flink Forward 2016
I hope you saw:
Beaming Flink to the Cloud @ Netflix
Monal Daxini - Netflix
And stay in this room for:
Flink and Beam: Current State & Roadmap
Maximilian Michels - data Artisans
No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - Google
76. http://beam.incubator.apache.org/
Join the community!
User discussions - user-subscribe@beam.incubator.apache.org
Development discussions - dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter
Good Reads
Why Apache Beam? (from Data Artisans)
Why Apache Beam? (from Google)
Streaming 101
Streaming 102
The Dataflow Beam Model
More Beam!