Talk from the Strata + Hadoop World conference in London, 2016: Apache Flink and Continuous Processing.
The talk discusses some of the shortcomings of building continuous applications via batch processing, and how a stream processing architecture naturally solves many of these issues.
6. Continuous Data Sources
[Diagram: a partitioned log. A job can process a period of historic data, process the latest data with low latency (the tail of the log), or reprocess the stream: historic data first, then catching up with real-time data.]
7. Continuous Data Sources
[Diagram: a stream of events in Apache Kafka partitions, shown as equivalent to a stream view over a sequence of hourly files (2016-3-1 12:00 am through 2016-3-12 3:00 am).]
10. Apache Flink Stack
[Diagram: libraries on top of the DataStream API (stream processing) and the DataSet API (batch processing), both running on a common runtime, a distributed streaming data flow engine.]
Streaming and batch as first class citizens.
12. What makes Flink flink?
• True Streaming: low latency, high throughput, well-behaved flow control (back pressure)
• Event Time: make more sense of data; works on real-time and historic data
• Stateful Streaming: globally consistent savepoints; exactly-once semantics for fault tolerance; windows & user-defined state
• APIs & Libraries: flexible windows (time, count, session, roll-your-own); Complex Event Processing
14. Different Notions of Time
[Diagram: an event producer writes to a message queue (partitions 1 and 2); a Flink data source reads it and feeds a Flink window operator. Event time is when the event occurred at the producer; ingestion time is when it enters the message queue, storage, or the stream processor; processing time is when the window operator processes it.]
15. Event Time vs. Processing Time
[Example: the Star Wars saga. Ordered by processing time (release year), it runs Episodes IV (1977), V (1980), VI (1983), I (1999), II (2002), III (2005), VII (2015); ordered by event time (story order), it runs Episodes I through VII.]
16. Batch: Implicit Treatment of Time
Time is treated outside of your application. Data is grouped by storage ingestion time.
[Diagram: a sequence of hourly (1h) batch jobs feeding a serving layer.]
20. Processing Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
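The 15-second window sliding every 5 seconds means each event falls into three overlapping windows. A minimal pure-Scala sketch of that assignment (a hypothetical helper for illustration, not Flink's internal code; it assumes non-negative timestamps and no window offset):

```scala
// For an event at time ts, return the start of every sliding window
// [start, start + sizeMs) that contains it, latest window first.
def slidingWindowStarts(ts: Long, sizeMs: Long, slideMs: Long): Seq[Long] = {
  val lastStart = ts - (ts % slideMs) // start of the latest window containing ts
  Iterator.iterate(lastStart)(_ - slideMs)
    .takeWhile(start => start > ts - sizeMs)
    .toSeq
}

// An event at t = 7000 ms with 15 s windows sliding every 5 s
// falls into the windows starting at 5000, 0, and -5000 ms.
val starts = slidingWindowStarts(7000L, 15000L, 5000L)
```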
21. Ingestion Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
22. Event Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignTimestampsAndWatermarks(
new MyTimestampsAndWatermarkGenerator())
tsStream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
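The slide leaves `MyTimestampsAndWatermarkGenerator` undefined. A minimal sketch of one possible implementation, assuming a periodic-watermark assigner with a bounded out-of-orderness of 3.5 seconds (both the bound and the class body are illustrative, not from the talk):

```scala
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// Illustrative implementation: tracks the highest timestamp seen so far and
// emits watermarks that trail it by an assumed out-of-orderness bound.
class MyTimestampsAndWatermarkGenerator extends AssignerWithPeriodicWatermarks[Event] {
  val maxOutOfOrderness = 3500L // 3.5 seconds (an assumption)
  var currentMaxTimestamp = Long.MinValue + maxOutOfOrderness

  override def extractTimestamp(event: Event, previousElementTimestamp: Long): Long = {
    currentMaxTimestamp = math.max(event.timestamp, currentMaxTimestamp)
    event.timestamp
  }

  override def getCurrentWatermark(): Watermark =
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
}
```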
23. The Power of Event Time
Batch processors: event time in ingestion-time batches
• Stable across re-executions
• Wrong grouping at batch boundaries
Traditional stream processors: processing time
• Results depend on when the program runs (different on re-execution)
• Results affected by network speed and delays
Event-time stream processors: event time
• Stable across re-executions
• No incorrect results at batch boundaries
24. The Power of Event Time
Batch processors (event time in ingestion-time batches): a mix of data-driven and wall clock time.
Traditional stream processors (processing time): purely wall clock time.
Event-time stream processors (event time): purely data-driven time.
25. Event Time Progress: Watermarks
A watermark W(t) marks the progress of event time: it asserts that no more events with a timestamp lower than t are expected.
[Diagram: an in-order stream of timestamped events interleaved with watermarks W(11) and W(17), and an out-of-order stream interleaved with watermarks W(11) and W(20).]
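A toy pure-Scala model of periodic watermark progress (the bound of 3 time units and the sample timestamps are illustrative, not taken from the slide's diagram): the watermark trails the highest timestamp seen so far, and an event is late if the watermark has already passed its timestamp when it arrives.

```scala
// After each event, the watermark is (max timestamp seen so far) - bound.
def watermarks(timestamps: Seq[Long], boundMs: Long): Seq[Long] =
  timestamps.scanLeft(Long.MinValue)(math.max).tail.map(_ - boundMs)

// Out-of-order input: the event with timestamp 9 arrives after 11,
// but the watermark (8 at that point) has not yet passed it, so it is not late.
val wm = watermarks(Seq(7L, 11L, 9L, 15L, 12L, 17L), 3L)
// wm == Seq(4, 8, 8, 12, 12, 14): watermarks never move backwards.
```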
26. Bounding the Latency for Results
Triggering on combinations of event time and processing time.
See previous talks by Tyler Akidau & Kenneth Knowles on Apache Beam (incubating): the concepts apply almost 1:1 to Apache Flink; only the syntax varies.
28. Batch vs. Continuous
Batch jobs:
• No state across batches
• Fault tolerance within a job
• Re-processing starts empty
Continuous programs:
• Continuous state across time
• Fault tolerance guards state
• Re-processing starts stateful
30. Re-processing data (in batch)
[Diagram: hourly batches from 2016-3-1 12:00 am through 7:00 am being re-processed.]
31. Re-processing data (in batch)
[Diagram: the same hourly batches; events grouped across batch boundaries yield wrong / corrupt results.]
34. Re-processing data (continuous)
Draw savepoints at the times you will want to start new jobs from (daily, hourly, …).
Reprocess by starting a new job from a savepoint:
• Defines the start position in the stream (for example, Kafka offsets)
• Initializes pending state (like partial sessions)
[Diagram: a savepoint taken mid-stream; a new streaming program runs from that savepoint.]
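Operationally, this workflow maps to two steps with the Flink 1.x command line (the job ID, savepoint path, and jar name below are placeholders, not values from the talk):

```shell
# Trigger a savepoint for a running job; the command prints the savepoint path.
bin/flink savepoint <jobID>

# Start a new (possibly modified) streaming program from that savepoint.
bin/flink run -s <savepointPath> my-streaming-job.jar
```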
35. Forking and Versioning Applications
[Diagram: a series of savepoints taken from one job; applications A, B, and C forked from different savepoints.]
37. Wrap up
Streaming is the architecture for continuous processing
Continuous processing makes data applications
• Simpler: Fewer moving parts
• More correct: No broken state at any boundaries
• More flexible: Reprocess data and fork applications via savepoints
Requires a powerful stream processor, like Apache Flink
38. Upcoming Features
Dynamic Scaling, Resource Elasticity
Stream SQL
CEP enhancements
Incremental & asynchronous state snapshotting
Mesos support
More connectors, end-to-end exactly once
API enhancements (e.g., joins, slowly changing inputs)
Security (data encryption, Kerberos with Kafka)
39. What makes Flink flink?
(Recap of slide 12: true streaming, event time, stateful streaming, and rich APIs & libraries.)
40. Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org