Flink currently features different APIs for bounded/batch (DataSet) and streaming (DataStream) programs. And while the DataStream API can handle batch use cases, it is much less efficient at them than the DataSet API. The Table API was built as a unified API on top of both, to cover batch and streaming with the same API and, under the hood, delegate to either DataSet or DataStream.
In this talk, we present the latest on the Flink community's efforts to rework the APIs and the stack for a better unified batch & streaming experience. We will discuss:
- The future roles and interplay of DataSet, DataStream, and Table API
- The new Flink stack and the abstractions on which these APIs will build
- The new unified batch/streaming sources
- How batch and streaming optimizations differ in the runtime, and what the future interplay of batch and streaming execution could look like
Flink Forward San Francisco 2019: Towards Flink 2.0: Rethinking the stack and APIs to unify Batch & Stream - Stephan Ewen & Aljoscha Krettek
Streaming
- Keep up with real time, with some extra capacity for catch-up
- Receive data roughly in the order produced
- Latency is important
- Time in the data stream must be quasi-monotonous to produce time progress (watermarks)
- Always have close-to-latest incremental results
- Resource requirements change over time
- Recovery must catch up very fast

Batch
- Fast-forward through months/years of history
- Massively parallel unordered reads
- Throughput is most important
- Order of time in the data does not matter (parallel unordered reads)
- Bulk operations (2-phase hash/sort)
- Longer time for recovery is acceptable (no low-latency SLA)
- Resource requirements change fast throughout the execution of a single job
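The watermark point above (event time must be quasi-monotonous for time to make progress) can be sketched in plain Python. This is an illustrative sketch, not the Flink API; the class name and the 1-second bound are assumptions for the example:

```python
# Illustrative sketch (not the Flink API): a bounded-out-of-orderness
# watermark generator. Watermarks only advance if timestamps in the
# stream are roughly increasing, i.e. quasi-monotonous.

class BoundedOutOfOrdernessWatermarks:
    def __init__(self, max_out_of_orderness_ms):
        self.max_out_of_orderness_ms = max_out_of_orderness_ms
        self.max_timestamp = float("-inf")

    def on_event(self, event_timestamp_ms):
        # Track the highest event timestamp seen so far.
        self.max_timestamp = max(self.max_timestamp, event_timestamp_ms)

    def current_watermark(self):
        # The watermark trails the max timestamp by the allowed
        # out-of-orderness; it only ever moves forward.
        return self.max_timestamp - self.max_out_of_orderness_ms

wm = BoundedOutOfOrdernessWatermarks(max_out_of_orderness_ms=1000)
for ts in [5000, 4800, 6000, 5900]:  # roughly ordered event timestamps
    wm.on_event(ts)
print(wm.current_watermark())  # 5000
```

On a massively parallel unordered batch read, `max_timestamp` would jump around arbitrarily, which is why batch execution ignores watermark-based time progress entirely.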
Understanding this difference will help later, when we discuss scheduling changes.
Possibly put these on separate slides, with fewer words. Or even some graphics.
There are some quirks when you use DataStream for batch:
- a groupReduce would be a window with a GlobalWindow
- MapPartition would have to finalize things in close()
- Joins would have to specify a global window
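The groupReduce-as-global-window quirk can be sketched in plain Python (this is an illustrative sketch of the semantics, not the Flink API; the function names are made up for the example):

```python
from collections import defaultdict

# Illustrative sketch: emulating a batch-style groupReduce on a stream
# with a "global window" -- every record is buffered per key, and the
# reduce only fires once, at end of input.

def group_reduce_global_window(records, key_fn, reduce_fn):
    buffered = defaultdict(list)       # the "global window" state
    for record in records:             # streaming phase: buffer everything
        buffered[key_fn(record)].append(record)
    # End-of-input acts as the trigger: reduce each key's full buffer.
    return {k: reduce_fn(vals) for k, vals in buffered.items()}

result = group_reduce_global_window(
    [("a", 1), ("b", 2), ("a", 3)],
    key_fn=lambda r: r[0],
    reduce_fn=lambda vals: sum(v for _, v in vals),
)
print(result)  # {'a': 4, 'b': 2}
```

Note how all input is held in state until the end — which leads directly to the state-size problem below.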
Of course, the state requirements of this naïve approach are bad: large state and inefficient access patterns.
Joins and grouping can be a lot faster with specific algorithms: hash join, merge join, etc.
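As an illustration of why such specific algorithms beat keeping both inputs in streaming state, here is a minimal sketch of a classic two-phase hash join in plain Python (not Flink code; names and data are made up for the example):

```python
from collections import defaultdict

# Illustrative sketch: a two-phase hash join, the kind of bulk
# operation a batch runtime can use once it knows inputs are bounded.

def hash_join(build_side, probe_side, build_key, probe_key):
    # Phase 1: build a hash table over one (ideally smaller) input.
    table = defaultdict(list)
    for row in build_side:
        table[build_key(row)].append(row)
    # Phase 2: stream the other input through, probing the table.
    out = []
    for row in probe_side:
        for match in table[probe_key(row)]:
            out.append((match, row))
    return out

users = [(1, "alice"), (2, "bob")]
orders = [(2, "book"), (1, "pen"), (2, "mug")]
joined = hash_join(users, orders,
                   build_key=lambda u: u[0],
                   probe_key=lambda o: o[0])
print(joined)
```

Only the build side is materialized; the probe side is consumed record by record — unlike a symmetric streaming join, which must keep state for both sides.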
Recall the earlier processing-styles slide:
- batch wants step-by-step (stage-wise) execution
- streaming runs everything at once (pipelined)
This has been mentioned a lot.
Lyft gave a talk about this at the last Flink Forward.
For example:
- a different window operator
- different join implementations
The scheduling and networking changes would be a whole talk on their own; memory management is yet another topic.
The pull-based operator model is how most databases were/are implemented.
Note how the pull model enables hash join, merge join, …
Side inputs benefit from a pull-based model
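The pull-based ("volcano") model mentioned above can be sketched in plain Python, with each operator pulling from its children on demand; the hash join shows why the model enables such algorithms (illustrative sketch only, not Flink's operator interfaces; all class names are made up):

```python
# Illustrative sketch of the pull-based ("volcano") operator model:
# each operator is an iterator that pulls rows from its children.

class Scan:
    def __init__(self, rows):
        self.rows = rows
    def __iter__(self):
        yield from self.rows

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def __iter__(self):
        for row in self.child:          # pull from the child on demand
            if self.predicate(row):
                yield row

class HashJoin:
    def __init__(self, build, probe, key):
        self.build, self.probe, self.key = build, probe, key
    def __iter__(self):
        # The pull model lets the join drain its build side first...
        table = {}
        for row in self.build:
            table.setdefault(self.key(row), []).append(row)
        # ...then lazily pull probe rows one at a time.
        for row in self.probe:
            for match in table.get(self.key(row), []):
                yield (match, row)

plan = HashJoin(
    build=Scan([(1, "a"), (2, "b")]),
    probe=Filter(Scan([(1, "x"), (2, "y"), (3, "z")]),
                 predicate=lambda r: r[0] != 3),
    key=lambda r: r[0],
)
print(list(plan))
```

Because the join decides when to pull from which input, it can fully consume one side before touching the other — exactly what hash join, merge join, and side inputs need.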
Bring the dog-drinking-from-a-hose example, also for the join operator.
This will allow porting batch operators/algorithms to StreamOperator
Note that this nicely jibes with the pull-based model and enables the things we need for batch.
Mention the dog with the hose. Sources just keep spitting out records as fast as they can.