1. Apache Flink Research
A look into the future
Paris Carbone - PhD Candidate
KTH Royal Institute of Technology
<parisc@kth.se, senorcarbone@apache.org>
3. Research in Flink
• Many ideas behind Flink were research products
• Job plan optimiser
• Efficient joins
• Memory management
• The execution engine has always been a streaming engine
4. Our Focus
• Contributions already in the current release
• Streaming Semantics - Expressive Windowing
• State Management (representation, handling)
• Graph Semantics - Gelly
• Exactly-once-processing (checkpointing)
5. Ongoing Research
• Advanced State Management & Fault Tolerance
• Pre-aggregate sharing for sliding windows
• Streaming ML Pipelines
• Streaming Graphs
• Experiment Reproducibility
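One common way to share pre-aggregates across overlapping sliding windows is to cut the stream into panes of one slide each, aggregate every pane once, and combine the pane results per window. The sketch below is plain Python rather than Flink code; it uses count-based windows and assumes the window size is a multiple of the slide. The name `sliding_sums` is hypothetical, and this is not necessarily the technique being researched here.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_sums(values: Iterable[int], size: int, slide: int) -> Iterator[int]:
    """Yield the sum of every count-based sliding window.

    Each element is added to exactly one pane; windows reuse pane sums
    instead of re-reading the elements they overlap on.
    """
    if size % slide != 0:
        raise ValueError("this sketch assumes size is a multiple of slide")
    panes_per_window = size // slide
    # Partial sums, oldest first; the deque evicts the oldest pane when full.
    panes: Deque[int] = deque(maxlen=panes_per_window)
    current, count = 0, 0
    for v in values:
        current += v
        count += 1
        if count == slide:                 # the current pane is complete
            panes.append(current)
            current, count = 0, 0
            if len(panes) == panes_per_window:
                yield sum(panes)           # combine the shared pre-aggregates
```

With `size=4` and `slide=2`, each element is summed once into a pane, and every window is then built from two pane sums.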
7. Lessons Learned from Batch
[Figure: two consecutive batches, batch-1 and batch-2]
• If a batch computation fails, simply repeat computation
as a transaction (if we have repeatable sources)
• Transaction rate is constant
• Can we apply these principles to a true streaming
execution?
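The batch principle above can be sketched in a few lines: work on a private copy of the state, discard it on failure, and repeat the batch from a repeatable source. This is plain Python for illustration only; `run_batch` and `count_words` are hypothetical names, and the sketch says nothing about how Flink itself handles recovery.

```python
from typing import Callable, Dict, List

State = Dict[str, int]


def run_batch(read_batch: Callable[[], List[str]],
              compute: Callable[[State, List[str]], None],
              committed: State,
              max_attempts: int = 3) -> State:
    """Apply one batch to `committed` atomically, retrying on failure."""
    for _ in range(max_attempts):
        scratch = dict(committed)           # private copy of the last committed state
        try:
            # A repeatable source returns the same records on every call.
            compute(scratch, read_batch())
        except RuntimeError:
            continue                        # drop partial results, repeat the batch
        return scratch                      # commit: the copy becomes the new state
    raise RuntimeError("batch failed after %d attempts" % max_attempts)


def count_words(state: State, words: List[str]) -> None:
    """Example computation: update word counts in place."""
    for w in words:
        state[w] = state.get(w, 0) + 1
```

Because a failed attempt only ever touches the copy, the committed state reflects each batch either fully or not at all.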
12. ML Pipelines
• Flink ML (training set, test set): ETL, Transformers, Learners, Evaluators
• Flink Streaming ML (training stream, test stream): stream ETL, concept drift detection, anomaly detection, online learning, online classification
13. ML on Unbounded Data
• We are often interested in:
• Low latency approximations on a single pass
• Instant classification on stream ingestion with higher
error bounds
• Continuous aggregates on unbounded data synopses
(e.g. stream sampling)
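As one concrete example of a single-pass synopsis, reservoir sampling keeps a uniform random sample of fixed size `k` over a stream of unknown length. The sketch below is plain Python, not a Flink operator, and `reservoir_sample` is a hypothetical name.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return a uniform sample of up to k items, reading the stream once."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # inclusive on both ends
            if j < k:
                reservoir[j] = item         # keep the new item with probability k/(i+1)
    return reservoir
```

Memory stays at `k` items no matter how long the stream runs, which is what makes the synopsis usable on unbounded data.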
14. Streaming ML
[Figure: the Streaming API and Batch API with the Table, ML and Gelly libraries]
• Batch: bounded data, multi-pass algorithms, bulk classification
• Streaming: unbounded data, single-pass algorithms, instant classification
15. ML Use Cases
Batch ML | Streaming ML
SVM | Anomaly Detection
Clustering | Concept Drift Detection
Col. Filtering (matrix factorisation) | Incremental Clustering
Rank Estimation | Dec. Tree and Rule Mining
Similarity Matching | Approximations (freq itemsets, distinct items, samples etc.)
16. Stream ML Abstractions
• Reusing the same abstractions from the batch ML library
(e.g. Transformer, Learner, Evaluator)
• plus some more abstractions (e.g. Drift Detector)
https://github.com/senorcarbone/flink/tree/incremental-ml
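To make the roles of these abstractions concrete, here is a minimal test-then-train loop in plain Python. Every class and method name below is hypothetical and chosen for illustration; none of it is taken from the linked branch or from the Flink ML API. The drift detector is a deliberately naive threshold rule.

```python
from typing import Iterable, List, Tuple


class Scaler:
    """Transformer: rescales the input feature."""
    def __init__(self, factor: float) -> None:
        self.factor = factor

    def transform(self, x: float) -> float:
        return x * self.factor


class RunningMeanLearner:
    """Learner: predicts the mean of the labels seen so far."""
    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        self.n, self.mean = 0, 0.0

    def predict(self, x: float) -> float:
        return self.mean

    def update(self, x: float, y: float) -> None:
        self.n += 1
        self.mean += (y - self.mean) / self.n


class ErrorEvaluator:
    """Evaluator: tracks the mean absolute error over the stream."""
    def __init__(self) -> None:
        self.total, self.count = 0.0, 0

    def record(self, error: float) -> None:
        self.total += error
        self.count += 1

    def mean_error(self) -> float:
        return self.total / self.count if self.count else 0.0


class ThresholdDriftDetector:
    """Drift Detector: flags an error above `threshold` after a warm-up."""
    def __init__(self, threshold: float, warmup: int) -> None:
        self.threshold, self.warmup, self.seen = threshold, warmup, 0

    def is_drift(self, error: float) -> bool:
        self.seen += 1
        if self.seen > self.warmup and error > self.threshold:
            self.seen = 0                   # start a new warm-up period
            return True
        return False


def run_pipeline(stream: Iterable[Tuple[float, float]], transformer: Scaler,
                 learner: RunningMeanLearner, evaluator: ErrorEvaluator,
                 detector: ThresholdDriftDetector) -> List[int]:
    """Process (feature, label) pairs once; return positions where drift was flagged."""
    drift_points: List[int] = []
    for i, (x, y) in enumerate(stream):
        features = transformer.transform(x)
        error = abs(learner.predict(features) - y)   # test first ...
        evaluator.record(error)
        if detector.is_drift(error):
            drift_points.append(i)
            learner.reset()                          # forget the outdated model
        learner.update(features, y)                  # ... then train
    return drift_points
```

The Transformer, Learner and Evaluator keep the same roles they have in a batch pipeline; only the Drift Detector is specific to the streaming setting.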
17. Example: Vertical Hoeffding Trees
• Building a decision tree on-the-fly
• Parallelizing attribute metric computation (vertical
parallelization)
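The split decision at the heart of a Hoeffding tree can be sketched as follows. After `n` observations of a metric with range `R`, the Hoeffding bound states that, with probability `1 - delta`, the true mean lies within `epsilon = sqrt(R^2 * ln(1/delta) / (2n))` of the observed mean. A leaf is split once the gap between the two best attributes exceeds `epsilon`. The function names are hypothetical, and the sketch covers only the split test, not the vertical parallelization of the attribute metrics.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the observed mean is within epsilon of the true
    mean with probability 1 - delta after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```

Since `epsilon` shrinks as `n` grows, a leaf that cannot yet separate two attributes simply waits for more stream elements before deciding.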
20. Or even more complex pipelines
[Figure: Integrating Batch and Streaming ML. A Transformer, Learner, Evaluator pipeline alongside a Batch ML Pipeline, with edges labelled change, error, schedule and correct]
21. Unbounded Graph Analysis
• Graphs are often created by a snapshot of a stream of
events: user interactions, product purchases, clicks, etc.
• Can we process the graph as a stream, immediately when
it arrives in the system?
• We can leverage existing research on one-pass streaming
algorithms and Flink’s streaming engine
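Connected components is a simple example of a one-pass graph algorithm: a union-find structure can absorb edges one at a time as they arrive, without ever materialising the full graph. The sketch below is plain Python with a hypothetical class name, not a Gelly or Flink streaming API.

```python
from typing import Dict, Hashable


class StreamingComponents:
    """Maintains connected components over an insert-only edge stream."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, v: Hashable) -> Hashable:
        """Return the representative of v's component, registering v if new."""
        self.parent.setdefault(v, v)
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:       # path compression
            self.parent[v], v = root, self.parent[v]
        return root

    def add_edge(self, u: Hashable, v: Hashable) -> None:
        """Consume one edge from the stream and merge the two components."""
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            self.parent[ru] = rv

    def connected(self, u: Hashable, v: Hashable) -> bool:
        return self.find(u) == self.find(v)
```

The state grows with the number of vertices rather than the number of edges, and a query can be answered at any point in the stream.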
22. Streaming Graphs?
[Figure: the Streaming API and Batch API with the Table, ML and Gelly libraries]
• Batch: bounded graph data, multi-pass algorithms (BSP), exact computations
• Streaming: unbounded graph data, single-pass algorithms, incremental computations
26. Experiments - Reproducibility
• Defining, deploying and orchestrating experiments, and collecting their results, is a big hassle!
• A single experiment will need
• devops hours to allocate VMs, fetch the right versions
and install system dependencies in the correct order
• dev hours to write scripts for data processing/collection
• Repeating a benchmark/experiment is impossible without
all the low level configuration details
27. Introducing Karamel
[Figure: Karamel, with labels: standalone web app, Karamel file, karamelized cookbooks, YAML]
• Simplifying system dependencies to a bare minimum
• Simple integration for existing cookbooks (Chef) by adding a Karamel file
• Compositional cluster definitions
• Tight integration with GitHub