Talk from the Strata + Hadoop World conference in London, 2016: Apache Flink and Continuous Processing.
The talk discusses some of the shortcomings of building continuous applications via batch processing, and how a stream processing architecture naturally solves many of these issues.
6. Continuous Data Sources
[Diagram: a partitioned log. A job can process a period of historic data, process the latest data with low latency (the tail of the log), or reprocess the stream: historic data first, then catching up with real-time data.]
7. Continuous Data Sources
[Diagram: a stream of events in Apache Kafka partitions, shown as equivalent to a stream view over a sequence of hourly files (2016-3-1 12:00 am through 2016-3-12 3:00 am).]
10. Apache Flink Stack
[Diagram: libraries on top of the DataStream API (stream processing) and the DataSet API (batch processing), both running on a common runtime, a distributed streaming data flow engine.]
Streaming and batch as first class citizens.
12. What makes Flink flink?
• True Streaming: low latency, high throughput, well-behaved flow control (back pressure)
• Event Time: make more sense of data; works on real-time and historic data
• Stateful Streaming: globally consistent savepoints; exactly-once semantics for fault tolerance; windows & user-defined state
• APIs & Libraries: flexible windows (time, count, session, roll-your-own); Complex Event Processing
14. Different Notions of Time
[Diagram: an event producer writes to a message queue (partitions 1 and 2); a Flink data source reads it and feeds a Flink window operator. Event time is when the event occurred at the producer; ingestion time is when it enters the message queue, storage, or the stream processor; processing time is when the window operator processes it.]
15. Event Time vs. Processing Time
[Example: the Star Wars saga. Ordered by processing time (release year), it runs Episodes IV (1977), V (1980), VI (1983), I (1999), II (2002), III (2005), VII (2015); ordered by event time (story order), it runs Episodes I through VII.]
16. Batch: Implicit Treatment of Time
Time is treated outside of your application. Data is grouped by storage ingestion time.
[Diagram: a sequence of hourly (1h) batch jobs feeding a serving layer.]
20. Processing Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
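The 15-second window sliding every 5 seconds means each event falls into three overlapping windows. A minimal pure-Scala sketch of that assignment (a hypothetical helper for illustration, not Flink's internal code; it assumes non-negative timestamps and no window offset):

```scala
// For an event at time ts, return the start of every sliding window
// [start, start + sizeMs) that contains it, latest window first.
def slidingWindowStarts(ts: Long, sizeMs: Long, slideMs: Long): Seq[Long] = {
  val lastStart = ts - (ts % slideMs) // start of the latest window containing ts
  Iterator.iterate(lastStart)(_ - slideMs)
    .takeWhile(start => start > ts - sizeMs)
    .toSeq
}

// An event at t = 7000 ms with 15 s windows sliding every 5 s
// falls into the windows starting at 5000, 0, and -5000 ms.
val starts = slidingWindowStarts(7000L, 15000L, 5000L)
```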
21. Ingestion Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
22. Event Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignTimestampsAndWatermarks(
new MyTimestampsAndWatermarkGenerator())
tsStream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
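The slide leaves `MyTimestampsAndWatermarkGenerator` undefined. A minimal sketch of one possible implementation, assuming a periodic-watermark assigner with a bounded out-of-orderness of 3.5 seconds (both the bound and the class body are illustrative, not from the talk):

```scala
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

// Illustrative implementation: tracks the highest timestamp seen so far and
// emits watermarks that trail it by an assumed out-of-orderness bound.
class MyTimestampsAndWatermarkGenerator extends AssignerWithPeriodicWatermarks[Event] {
  val maxOutOfOrderness = 3500L // 3.5 seconds (an assumption)
  var currentMaxTimestamp = Long.MinValue + maxOutOfOrderness

  override def extractTimestamp(event: Event, previousElementTimestamp: Long): Long = {
    currentMaxTimestamp = math.max(event.timestamp, currentMaxTimestamp)
    event.timestamp
  }

  override def getCurrentWatermark(): Watermark =
    new Watermark(currentMaxTimestamp - maxOutOfOrderness)
}
```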
23. The Power of Event Time
Batch processors: event time in ingestion-time batches
• Stable across re-executions
• Wrong grouping at batch boundaries
Traditional stream processors: processing time
• Results depend on when the program runs (different on re-execution)
• Results affected by network speed and delays
Event-time stream processors: event time
• Stable across re-executions
• No incorrect results at batch boundaries
24. The Power of Event Time
Batch processors (event time in ingestion-time batches): a mix of data-driven and wall clock time.
Traditional stream processors (processing time): purely wall clock time.
Event-time stream processors (event time): purely data-driven time.
25. Event Time Progress: Watermarks
A watermark W(t) marks the progress of event time: it asserts that no more events with a timestamp lower than t are expected.
[Diagram: an in-order stream of timestamped events interleaved with watermarks W(11) and W(17), and an out-of-order stream interleaved with watermarks W(11) and W(20).]
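A toy pure-Scala model of periodic watermark progress (the bound of 3 time units and the sample timestamps are illustrative, not taken from the slide's diagram): the watermark trails the highest timestamp seen so far, and an event is late if the watermark has already passed its timestamp when it arrives.

```scala
// After each event, the watermark is (max timestamp seen so far) - bound.
def watermarks(timestamps: Seq[Long], boundMs: Long): Seq[Long] =
  timestamps.scanLeft(Long.MinValue)(math.max).tail.map(_ - boundMs)

// Out-of-order input: the event with timestamp 9 arrives after 11,
// but the watermark (8 at that point) has not yet passed it, so it is not late.
val wm = watermarks(Seq(7L, 11L, 9L, 15L, 12L, 17L), 3L)
// wm == Seq(4, 8, 8, 12, 12, 14): watermarks never move backwards.
```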
26. Bounding the Latency for Results
Triggering on combinations of event time and processing time.
See previous talks by Tyler Akidau & Kenneth Knowles on Apache Beam (incubating): the concepts apply almost 1:1 to Apache Flink; only the syntax varies.
28. Batch vs. Continuous
Batch jobs:
• No state across batches
• Fault tolerance within a job
• Re-processing starts empty
Continuous programs:
• Continuous state across time
• Fault tolerance guards state
• Re-processing starts stateful
30. Re-processing data (in batch)
[Diagram: hourly batches from 2016-3-1 12:00 am through 7:00 am being re-processed.]
31. Re-processing data (in batch)
[Diagram: the same hourly batches; events grouped across batch boundaries yield wrong / corrupt results.]
34. Re-processing data (continuous)
Draw savepoints at the times you will want to start new jobs from (daily, hourly, …).
Reprocess by starting a new job from a savepoint:
• Defines the start position in the stream (for example, Kafka offsets)
• Initializes pending state (like partial sessions)
[Diagram: a savepoint taken mid-stream; a new streaming program runs from that savepoint.]
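Operationally, this workflow maps to two steps with the Flink 1.x command line (the job ID, savepoint path, and jar name below are placeholders, not values from the talk):

```shell
# Trigger a savepoint for a running job; the command prints the savepoint path.
bin/flink savepoint <jobID>

# Start a new (possibly modified) streaming program from that savepoint.
bin/flink run -s <savepointPath> my-streaming-job.jar
```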
35. Forking and Versioning Applications
[Diagram: a series of savepoints taken from one job; applications A, B, and C forked from different savepoints.]
37. Wrap up
Streaming is the architecture for continuous processing
Continuous processing makes data applications
• Simpler: Fewer moving parts
• More correct: No broken state at any boundaries
• More flexible: Reprocess data and fork applications via savepoints
Requires a powerful stream processor, like Apache Flink
38. Upcoming Features
Dynamic Scaling, Resource Elasticity
Stream SQL
CEP enhancements
Incremental & asynchronous state snapshotting
Mesos support
More connectors, end-to-end exactly once
API enhancements (e.g., joins, slowly changing inputs)
Security (data encryption, Kerberos with Kafka)
39. What makes Flink flink?
(Recap of slide 12: true streaming, event time, stateful streaming, and rich APIs & libraries.)
40. Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org