In our fast-moving world, it is increasingly important for companies to gain near real-time insights from their data in order to make faster decisions. These insights not only provide a competitive edge over one's rivals but also enable a company to create entirely new services and products. Predictive user interfaces and online recommendations, among others, become possible when large amounts of data can be processed in real time.
Apache Flink, one of the most advanced open source distributed stream processing platforms, allows you to extract business intelligence from your data in near real-time. With Apache Flink it is possible to process billions of messages with millisecond latency. Moreover, its expressive APIs let you solve problems quickly, ranging from classical analytical workloads to distributed event-driven applications.
In this talk, I will introduce Apache Flink and explain how it enables users to develop distributed applications and process analytical workloads alike. Starting with Flink's basic concepts of fault tolerance, statefulness, and event-time-aware processing, we will take a look at the different APIs and what they allow us to do. The talk concludes with a demonstration of how Flink's higher-level abstractions, such as FlinkCEP and Stream SQL, enable declarative stream processing.
3. What changes faster? Data or query?
Batch processing use case: data changes slowly compared to fast-changing queries (ad-hoc queries, data exploration, ML training and (hyper)parameter tuning).
Stream processing use case: data changes fast while the application logic is long-lived (continuous applications, data pipelines, standing queries, anomaly detection, ML evaluation, ...).
7. Apache Flink in a Nutshell
[Diagram: streams and historic data from queries, applications, devices, databases, and file/object storage feed into Flink applications.]
Stateful computations over streams, real-time and historic: fast, scalable, fault-tolerant, in-memory, event time, large state, exactly-once.
8. The Core Building Blocks
Event streams: real-time and hindsight.
State: complex business logic.
(Event) time: consistency with out-of-order and late data.
Snapshots: forking, versioning, and time-travel.
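The event-time building block can be illustrated without Flink at all. The following is a minimal plain-Scala sketch (all names are illustrative, not Flink APIs) of the core idea: events are buffered as they arrive in any order, and a watermark, i.e. a promise that no earlier events are still in flight, releases them in timestamp order, restoring consistency over out-of-order data.

```scala
// Illustrative sketch only: no Flink dependency, names are invented.
case class Event(id: String, timestamp: Long)

final class EventTimeBuffer {
  private var buffer = Vector.empty[Event]

  // Accept events in any arrival order.
  def onEvent(e: Event): Unit = buffer = buffer :+ e

  // A watermark(t) promises that no event with timestamp <= t is still
  // in flight, so everything up to t can be emitted in event-time order.
  def onWatermark(t: Long): Seq[Event] = {
    val (ready, rest) = buffer.partition(_.timestamp <= t)
    buffer = rest
    ready.sortBy(_.timestamp)
  }
}
```

Late data (an event arriving after its watermark has passed) is exactly what this model surfaces explicitly; Flink lets you choose how to treat it (drop, side output, or window updates).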
9. Powerful Abstractions
Layered abstractions let you navigate from simple to complex use cases:
Process Function (events, state, time): stateful event-driven applications.
DataStream API (streams, windows): stream and batch data processing.
Stream SQL / Tables (dynamic tables): high-level analytics API.

A windowed aggregation with the DataStream API:

val stats = stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .reduce((a, b) => a.add(b))

Low-level control with a ProcessFunction:

def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
  // work with event and state
  (event, state.value) match { … }

  out.collect(…)   // emit events
  state.update(…)  // modify state

  // schedule a timer callback
  ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
}
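To make the semantics of the windowed aggregation concrete, here is a plain-Scala sketch (no Flink dependency; `Reading`, `tumblingSums`, and the field names are illustrative) of what a per-key sum over tumbling event-time windows computes. Flink evaluates this incrementally over unbounded streams; the sketch models it over a finite batch of readings.

```scala
// Illustrative model of keyBy + tumbling window + sum, not Flink API.
case class Reading(sensor: String, timestamp: Long, value: Double)

def tumblingSums(readings: Seq[Reading], windowMillis: Long): Map[(String, Long), Double] =
  readings
    .groupBy(r => (r.sensor, r.timestamp / windowMillis)) // keyBy + window assignment
    .map { case (key, rs) => key -> rs.map(_.value).sum } // aggregate per key and window
```

Each key is (sensor, window index), where the window index is the timestamp divided by the window length, which is exactly how tumbling windows partition time.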
10. Hardened at Scale
Production deployments include:
Athena X streaming SQL platform service
Streaming platform as a service (100s of jobs, 1000s of nodes, TBs of state)
Fraud detection
Streaming analytics platform (metrics, analytics, real-time ML)
Streaming SQL as a platform
15. Modern Distributed Application Architecture
[Diagram: applications, sensors, and APIs all producing to and consuming from a central event stream.]
The limit to what you can do is how sophisticated your computations over the stream can be. It boils down to: how well do you handle state and time?
24. Scaling a Service
Classic tiered architecture: provision compute, and separately provision additional database capacity.
Streaming architecture: provision compute and state together.
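The "compute and state together" point can be sketched in a few lines of plain Scala (illustrative names, no Flink dependency): the operator keeps its keyed state in local storage and updates it as events flow through, instead of making a remote database round-trip per event. Scaling out then means repartitioning keys across more operator instances, and each key's state moves with its operator.

```scala
// Illustrative sketch of an operator with co-located keyed state.
final class CountingOperator {
  // Local keyed state: lives with the operator, not in a remote tier.
  private val counts = scala.collection.mutable.Map.empty[String, Long]

  // Each event updates local state; no per-event network round-trip.
  def process(key: String): Long = {
    val next = counts.getOrElse(key, 0L) + 1
    counts(key) = next
    next
  }
}
```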
25. Rolling Out a New Service
Classic tiered architecture: provision a new database (or add capacity to an existing one).
Streaming architecture: provision compute and state together; the new service simply occupies some additional backup space.
26. What users built on checkpoints…
§ Upgrades and Rollbacks
§ Cross Datacenter Failover
§ State Archiving
§ Application Migration
§ Spot Instance Region Chasing
§ A/B Testing
§ …
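What all of these patterns share is the ability to capture a consistent snapshot of application state and start a new application instance from it. A minimal plain-Scala sketch of that idea (all names are illustrative; Flink's actual mechanism is checkpoints/savepoints):

```scala
// Illustrative: a stateful app whose state can be snapshotted and forked.
final class StatefulApp(initial: Map[String, Long] = Map.empty) {
  private var state = initial

  def update(key: String, delta: Long): Unit =
    state = state.updated(key, state.getOrElse(key, 0L) + delta)

  // An immutable copy of the current state: the "snapshot".
  def snapshot(): Map[String, Long] = state
}

// Fork / A/B test / rollback: a second instance starting from a snapshot.
def fork(snap: Map[String, Long]): StatefulApp = new StatefulApp(snap)
```

Upgrades, failover, migration, and A/B testing all reduce to: take a snapshot, then resume one or more (possibly modified) application versions from it.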
27. What is the next wave of stream processing applications?
28. What changes faster? Data or query? (revisited)
Batch processing use case: data changes slowly compared to fast-changing queries (ad-hoc queries, data exploration, ML training and tuning).
Stream processing use case: data changes fast while the application logic is long-lived (continuous applications, data pipelines, standing queries, anomaly detection, ML evaluation, ...).
29. Analytics & Business Logic
[Diagram: a stream processor consumes events from a stream source and runs analytics (producing metrics) alongside business logic serving an application and its database.]
32. Can one build an entire sophisticated web application (say, a social network) on a stream processor? (Yes, we can!™)
33. A social network implemented using event sourcing and CQRS (Command Query Responsibility Segregation) on Kafka/Flink/Elasticsearch/Redis.
More: https://data-artisans.com/blog/drivetribe-cqrs-apache-flink
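A minimal event-sourcing / CQRS sketch in the spirit of this architecture (plain Scala; the event types and `followerCounts` are illustrative, not from the referenced system): on the write side, commands are turned into immutable events appended to a log; on the query side, a read model is derived as a pure fold over that log. In the referenced stack, Kafka holds the log, Flink computes the views, and Elasticsearch/Redis serve them.

```scala
// Illustrative event-sourcing sketch for a social-network follower count.
sealed trait Ev
case class Followed(follower: String, followee: String) extends Ev
case class Unfollowed(follower: String, followee: String) extends Ev

// Write side: an append-only event log.
type Log = Vector[Ev]

// Read side (CQRS query model): follower counts derived purely from the log.
def followerCounts(log: Log): Map[String, Int] =
  log.foldLeft(Map.empty[String, Int]) {
    case (m, Followed(_, who))   => m.updated(who, m.getOrElse(who, 0) + 1)
    case (m, Unfollowed(_, who)) => m.updated(who, m.getOrElse(who, 0) - 1)
  }
```

Because the view is a deterministic function of the log, it can be rebuilt, reshaped, or A/B-tested at any time by replaying events, which is what makes a stream processor a natural fit for the query side.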
34. The next wave of stream processing applications is all types of stateful applications that react to data and time!
36. The dA Platform Architecture
dA Platform 2 components:
Apache Flink: stateful stream processing.
dA Application Manager: application lifecycle management.
Kubernetes: container platform.
Logging, metrics, and CI/CD integration.
Streams from Kafka, Kinesis, S3, HDFS, databases, ...
Target use cases: real-time analytics, anomaly & fraud detection, real-time data integration, reactive microservices (and more).
37. Versioned Applications, not Jobs/Jars
A stream processing application is managed as code plus an application snapshot. Upgrades move it from version 1 to version 2 to version 3; forking/duplicating from a snapshot spawns a new application line (versions 2a, 3a).
39. Hooks for CI/CD Pipelines
[Diagram: a push triggers the CI service, which calls the dA Application Manager's upgrade API; the Application Manager then performs a stateful upgrade of the application from version 1 to version 2 on the Kubernetes cluster.]