http://flink-forward.org/kb_sessions/apache-beam-a-unified-model-for-batch-and-streaming-data-processing/
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, et al.) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, touch on its evolution, describe main concepts in the programming model, and compare with similar systems. We’ll go from a simple scenario to a relatively complex data processing pipeline, and finally demonstrate execution of that pipeline on multiple runtimes.
Kenneth Knowles - Apache Beam - A Unified Model for Batch and Streaming Data Processing
1. Apache Beam (incubating)
Kenneth Knowles
Apache Beam (incubating) PPMC
Software Engineer @ Google
klk@google.com / @KennKnowles Flink Forward 2016
https://goo.gl/jzlvD9
A Unified Model for Batch and Streaming Data Processing
2. What is Apache Beam?
Apache Beam is
a unified programming model
for expressing
efficient and portable
data processing pipelines.
3. Agenda
1. Big Data: Infinite & Out of Order
2. The Beam Model
3. Beam Project / Technical Vision
17. The Beam Vision (for users)
Sum Per Key
input.apply(
    Sum.integersPerKey())
Java
input | Sum.PerKey()
Python
Runners: Apache Flink, Apache Spark, Cloud Dataflow, Apache Gearpump (incubating), Apache Apex, ...
18. What your (Java) Code Looks Like
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 .apply(Count.perElement())
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://..."));
p.run();
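The same pipeline can be sketched in plain Python with no Beam dependency, to show what each stage computes (the `word_count` helper is illustrative, not a Beam API):

```python
import re
from collections import Counter

def word_count(lines):
    # FlatMapElements: split each line on non-letter characters.
    words = (w for line in lines for w in re.split(r"[^a-zA-Z']+", line))
    # Filter: drop empty tokens; Count.perElement: tally occurrences.
    counts = Counter(w for w in words if w)
    # MapElements: format each (word, count) pair.
    return sorted(f"{word}: {n}" for word, n in counts.items())

print(word_count(["to be or", "not to be"]))
# → ['be: 2', 'not: 1', 'or: 1', 'to: 2']
```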
19. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
20. The Beam Model: Asking the Right Questions
What are you computing? → Aggregations, transformations, ...
Where in event time?
When in processing time are results produced?
How do refinements relate?
22. The Beam Model: What are you computing?
Sum Per Key
input.apply(Sum.integersPerKey())
.apply(BigQueryIO.Write.to(...));
Java
input | Sum.PerKey()
| Write(BigQuerySink(...))
Python
25. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time? → Event-time windowing
When in processing time are results produced?
How do refinements relate?
34. Event-Time Windows
(implementing processing-time windows)
[Diagram: events plotted by processing time vs. event time]
Processing-time windows are just a special case: throw away your data's timestamps and replace them with "now()".
35. The Beam Model: Where in Event Time?
Sum Per Key / Window Into
input.apply(
    Window.into(
        FixedWindows.of(
            Duration.standardHours(1))))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...));
Java
input | WindowInto(FixedWindows(3600))
      | Sum.PerKey()
      | Write(BigQuerySink(...))
Python
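A minimal sketch, in plain Python, of what fixed-window assignment does to the grouping key (hypothetical helpers, not the Beam API):

```python
def fixed_window(timestamp, size):
    # A fixed (tumbling) window of width `size` seconds: each event
    # timestamp falls into exactly one window [start, start + size).
    start = timestamp - (timestamp % size)
    return (start, start + size)

def windowed_sums(events, size):
    # events: (key, timestamp, value) triples. Windowing extends the
    # grouping key to (key, window), as in WindowInto + Sum.PerKey.
    sums = {}
    for key, ts, value in events:
        w = fixed_window(ts, size)
        sums[(key, w)] = sums.get((key, w), 0) + value
    return sums
```

For example, with one-hour windows (size 3600), timestamps 10 and 20 land in window (0, 3600) while 3700 lands in (3600, 7200), so the same key produces two separate sums.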
36. The Beam Model: Where in Event Time?
Fixed Windows (also called Tumbling), Sliding Windows, User Sessions
1. Assign each timestamped event to one or more windows
2. Merge those windows according to custom logic
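The two-step definition above (assign, then merge) is what makes session windows expressible. A toy sketch of session merging, assuming a fixed inactivity gap (names are illustrative):

```python
def merge_sessions(timestamps, gap):
    # Step 1: each event is assigned its own proto-window [t, t + gap).
    windows = sorted((t, t + gap) for t in timestamps)
    # Step 2: overlapping windows are merged, leaving one session
    # window per burst of activity separated by silence of >= gap.
    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With a gap of 5, events at times 1, 2, and 10 collapse into two sessions: one covering the burst at 1-2, and one for the lone event at 10.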
38. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced? → Watermarks & triggers
How do refinements relate?
47. The Beam Model: When in Processing Time?
Sum Per Key / Window Into
input
    .apply(Window.into(FixedWindows.of(...))
        .triggering(
            AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...));
Java
input | WindowInto(FixedWindows(3600),
                   trigger=AfterWatermark())
      | Sum.PerKey()
      | Write(BigQuerySink(...))
Python
Trigger after end of window
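A toy model of the watermark trigger's behavior, in plain Python: per-window sums are held as state, and a window's result is emitted only once the watermark has passed its end (function and names are illustrative, not the Beam API):

```python
def on_time_panes(events, watermark, size):
    # Accumulate per-window sums for (timestamp, value) pairs,
    # using fixed windows of width `size`.
    sums = {}
    for ts, value in events:
        start = ts - (ts % size)
        sums[start] = sums.get(start, 0) + value
    # AfterWatermark.pastEndOfWindow(): only windows whose end the
    # watermark has passed are emitted; later windows stay buffered.
    return {s: v for s, v in sums.items() if s + size <= watermark}
```

With the watermark at 10 and windows of width 10, events at timestamps 1 and 2 produce an emitted pane for window start 0, while an event at timestamp 11 is still held back.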
57. Build a finely tuned trigger for your use case
AfterWatermark.pastEndOfWindow()                        // bill at end of month
    .withEarlyFirings(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))  // near-real-time estimates
    .withLateFirings(AfterPane.elementCountAtLeast(1))  // immediate corrections
64. Trigger Catalogue
Basic Triggers:
AfterEndOfWindow()
AfterCount(n)
AfterProcessingTimeDelay(Δ)
Composite Triggers:
AfterEndOfWindow()
    .withEarlyFirings(A)
    .withLateFirings(B)
AfterAny(A, B)
AfterAll(A, B)
Repeat(A)
Sequence(A, B)
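The composite triggers can be read as boolean combinators over trigger firings. A toy sketch where a trigger is a predicate over pane state (here just a `count` of buffered elements and a `watermark_past_end` flag; all names are illustrative, not the Beam API):

```python
def after_count(n):
    # Fires once at least n elements have arrived in the pane.
    return lambda state: state["count"] >= n

def after_end_of_window():
    # Fires once the watermark has passed the end of the window.
    return lambda state: state["watermark_past_end"]

def after_any(*triggers):
    # Fires as soon as any sub-trigger would fire.
    return lambda state: any(t(state) for t in triggers)

def after_all(*triggers):
    # Fires only once every sub-trigger would have fired.
    return lambda state: all(t(state) for t in triggers)
```

For example, `after_any(after_count(3), after_end_of_window())` fires on a pane with one element once the watermark passes, while `after_all` of the same two would keep waiting for two more elements.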
65. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate? → Accumulation mode
66. The Beam Model: How do refinements relate?
[Diagram: the same sequence of trigger firings over one window, shown under both accumulation modes]
Window.into(...)
    .triggering(...)
    .discardingFiredPanes()
Window.into(...)
    .triggering(...)
    .accumulatingFiredPanes()
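The difference between the two accumulation modes can be sketched in a few lines of plain Python (an illustrative model, not the Beam API):

```python
def pane_outputs(arrivals_per_firing, accumulating):
    # arrivals_per_firing: the new values that arrive before each
    # trigger firing of one window.
    # discardingFiredPanes(): each pane emits only the new delta.
    # accumulatingFiredPanes(): each pane emits the running total.
    out, total = [], 0
    for new_values in arrivals_per_firing:
        delta = sum(new_values)
        total += delta
        out.append(total if accumulating else delta)
    return out

# Three firings, with values [3, 2], [4], and [1, 1] arriving between them:
print(pane_outputs([[3, 2], [4], [1, 1]], accumulating=False))  # → [5, 4, 2]
print(pane_outputs([[3, 2], [4], [1, 1]], accumulating=True))   # → [5, 9, 11]
```

Accumulating mode suits sinks that overwrite a value per window (each refinement replaces the last); discarding mode suits sinks that add up whatever they receive.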
67. The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
70. The Beam Vision
End users - who want to write pipelines in a language that's familiar.
SDK authors - who want to make Beam concepts available in new languages.
Runner authors - who have a distributed processing environment and want to run Beam pipelines.
Beam Runner API: build and submit a pipeline
Beam Fn API: invoke user-definable functions
SDKs: Beam Java, Beam Python, other languages
Runners (execution): Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating)
73. Why Apache Beam?
Unified - one model handles batch and streaming use cases.
Portable - pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - supports user- and community-driven SDKs, runners, transformation libraries, and IO connectors.
74. Why Apache Beam?
http://data-artisans.com/why-apache-beam/
"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."
- Kostas Tzoumas (data Artisans)
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."
- Tyler Akidau (Google)
75. Beam here at Flink Forward 2016
I hope you saw:
Beaming Flink to the Cloud @ Netflix
Monal Daxini - Netflix
And stay in this room for:
Flink and Beam: Current State & Roadmap
Maximilian Michels - data Artisans
No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - Google
76. http://beam.incubator.apache.org/
Join the community!
User discussions - user-subscribe@beam.incubator.apache.org
Development discussions - dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter
Good Reads
Why Apache Beam? (from Data Artisans)
Why Apache Beam? (from Google)
Streaming 101
Streaming 102
The Dataflow Beam Model
More Beam!