Data Stream Processing - Concepts and Frameworks

Matthias Niehoff

AGENDA
Basic Ideas
Typical Problems
Streaming Frameworks
Current Innovations
Recommendations

Basic Ideas
Data Stream Processing – Why and what is it?

Current Situation of Dealing with (Big) Data
[Figure: Batch Layer and Speed Layer]

Sources for Streaming Data
Streaming data does not only come from the frequently mentioned IoT
area; it arises in many other industries as well. Strictly speaking,
any data can be viewed as a stream. Some of the most popular use cases
and examples are:
•IoT Sensor Data: Industrial Machines, Consumer Electronics, Agriculture
•Click Streams: Online Shops, Self-Service Portals, Comparison Portals
•Monitoring: System Health, Traffic between Systems, Resource Utilization
•Online Gaming: Gamer Interactions, Reward Systems, Custom Content & Experiences
•Automotive Industry: Vehicle Tracking, Predictive Maintenance, Routing Information
•Financial Transactions: Fraud Detection, Trade Monitoring and Management

Distributed Stream Processing

First step – Microbatching
[Figure: Source, Processing, Sink; the stream is processed in microbatches]

Native Streaming
[Figure: Source, Processing, Sink; records are processed one by one]
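The difference between the two engine types can be sketched in a few lines of Python. This is a toy model and not how any particular framework schedules its work: the function names, the `interval` parameter and the sample events are assumptions made for illustration. A native engine handles each record as it arrives, while a microbatching engine first collects the records of one interval and then processes them together.

```python
def process_native(events, fn):
    """Apply fn to every record individually, in arrival order."""
    return [fn(value) for _, value in events]


def process_microbatched(events, interval, fn):
    """Group records by arrival interval, then apply fn batch by batch."""
    batches = {}
    for arrival_time, value in events:
        batches.setdefault(arrival_time // interval, []).append(value)
    return [[fn(v) for v in batches[b]] for b in sorted(batches)]


# (arrival time in seconds, value); hypothetical sample data
events = [(0, 1), (1, 2), (2, 3), (5, 4), (9, 5)]
native = process_native(events, lambda v: v * 10)
batched = process_microbatched(events, interval=3, fn=lambda v: v * 10)
```

In this model a record in a microbatch waits until its interval has ended, which is where the additional latency of microbatching comes from.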

Typical Problems
and the way frameworks tackle them

Event time vs processing time
[Figure: event timeline and processing timeline, t in minutes (1 to 9)]

Windowing - Slicing data into chunks
•Window types: Tumbling Window, Sliding Window, Session Window
•Trigger types: Time Trigger, Count Trigger, Content Trigger

Tumbling & Sliding Windows
Input: 4 5 3 6 1 5 9 2 8 6 7 2
Tumbling windows (sum): 18 17 23
Sliding windows (sum): 18 17 23, plus the overlapping windows 15 and 25
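The sums on this slide can be reproduced with a small Python sketch. The window size of 4 elements and the slide of 2 elements are not stated on the slide; they are assumptions inferred from the numbers, and `window_sums` is a hypothetical helper, not a framework API.

```python
def window_sums(values, size, slide):
    """Sum count-based windows of `size` elements, advancing by `slide` elements."""
    sums = []
    for start in range(0, len(values) - size + 1, slide):
        sums.append(sum(values[start:start + size]))
    return sums


values = [4, 5, 3, 6, 1, 5, 9, 2, 8, 6, 7, 2]

# Tumbling: slide == size, so the windows do not overlap
tumbling = window_sums(values, size=4, slide=4)

# Sliding: slide < size, so consecutive windows share elements
sliding = window_sums(values, size=4, slide=2)
```

With these assumed parameters a tumbling window is simply the special case of a sliding window whose slide equals its size.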

Session Window
[Figure: events of user 1 and user 2 on a time axis, with a logout and a delayed event; open question: which session does the delayed event belong to?]
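Why a delayed event is a problem for session windows can be shown with a minimal Python sketch. It assumes gap-based sessions (a session ends after a period without events), which is one common definition and not necessarily the one drawn on the slide; `sessionize`, the `gap` value and the timestamps are hypothetical.

```python
def sessionize(event_times, gap):
    """Group timestamps into sessions; a new session starts when the
    distance to the previous event exceeds `gap`."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


on_time = [1, 2, 3, 8, 9]

# Without the delayed event there are two separate sessions
before = sessionize(on_time, gap=3)

# A delayed event with event time 5 bridges the gap and merges both sessions
after = sessionize(on_time + [5], gap=3)
```

The result that was already emitted for the two sessions becomes wrong once the delayed event arrives, so a framework has to be able to merge windows and correct earlier results.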

The Dataflow Model:  
A Practical Approach to Balancing Correctness, Latency, and Cost
in Massive-Scale, Unbounded, Out-of-Order Data Processing
[...] stop trying to groom unbounded
datasets into finite pools of information that eventually
become complete, and instead live and breathe under
the assumption that we will never know if or when we have
seen all of our data, only that new data will arrive, old data
may be retracted [...]
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

Watermarks
•Part 1 of "When will the result be calculated?"
•Watermark of all received data
•A watermark of 10:00 means "It is assumed that all
data up to 10:00 has now arrived"
•Fixed watermark
•Heuristic watermark
•A window is materialized/processed when the
watermark reaches the end of the window
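A heuristic watermark can be sketched as follows. This is a simplified model under stated assumptions, not the algorithm of a specific framework: the watermark is taken to be the highest event time seen so far minus an assumed maximum delay, times are integer minutes, and `BoundedDelayWatermark`, `run` and the sample events are hypothetical names and data.

```python
class BoundedDelayWatermark:
    """Heuristic watermark: highest event time seen minus an assumed maximum delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = None

    def observe(self, event_time):
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.max_event_time - self.max_delay


def run(events, window_size, max_delay):
    """events: (event_time, value) pairs in arrival order.
    Returns the fired windows as (window_start, sum) and the events that came too late."""
    tracker = BoundedDelayWatermark(max_delay)
    open_windows = {}  # window start -> running sum
    fired = []
    late = []
    watermark = float("-inf")
    for event_time, value in events:
        start = event_time - event_time % window_size
        if start + window_size <= watermark:
            late.append((event_time, value))  # its window has already fired
            continue
        open_windows[start] = open_windows.get(start, 0) + value
        watermark = tracker.observe(event_time)
        # Materialize every window whose end the watermark has reached
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                fired.append((s, open_windows.pop(s)))
    return fired, late


events = [(1, 4), (7, 5), (12, 3), (9, 6), (13, 1), (8, 2), (21, 9), (24, 1)]
fired, late = run(events, window_size=10, max_delay=3)
```

In this run the out-of-order event at minute 9 is still counted because the watermark had not yet reached minute 10, while the event at minute 8 arrives after its window has fired. What happens to such late events is decided by triggers and allowed lateness.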

Watermarks
[Figure: event time plotted against processing time, with events 3, 4, 5 and 6]

Trigger
•Trigger types: Content, Event Time, Processing Time, Count, Composite
•Part 2 of "When will the result be calculated?"
•Triggers an (additional) materialization of the window
•Example
•every 10 minutes (in processing time)
•& when the watermark reaches the end of the window
•& with each delayed event
•but only for an additional 15 minutes in processing time (allowed lateness)

Accumulators
Joining the individual (triggered) results
•Every result on its own (discarding)
•Results based on each other (accumulating)
•Results based on each other & correction of the old
result (accumulating & retracting)

Accumulators
Input panes: 5 2 | 8 3 | 4
Pane         Discarding   Accumulating   Accumulating & Retracting
(5,2)        7            7              7
(8,3)        11           18             18, -7
(4)          4            22             22, -18
Last value   4            22             22
Total sum    22           47             22
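The three accumulation modes can be reproduced with a short Python sketch. The function `emit` and the mode names are hypothetical and only mirror the columns of this slide; a retraction is modelled as the negated previous result.

```python
def emit(panes, mode):
    """Return the values sent downstream for each triggered pane of one window."""
    emitted = []
    running = 0
    previous = None
    for pane in panes:
        if mode == "discarding":
            emitted.append(sum(pane))  # each pane on its own
            continue
        running += sum(pane)
        emitted.append(running)  # result builds on earlier panes
        if mode == "accumulating_retracting" and previous is not None:
            emitted.append(-previous)  # take back the old result
        previous = running
    return emitted


panes = [(5, 2), (8, 3), (4,)]
discarding = emit(panes, "discarding")
accumulating = emit(panes, "accumulating")
retracting = emit(panes, "accumulating_retracting")
```

Summing the emitted values shows why the mode matters for a downstream consumer: only the discarding and the retracting variant add up to the true total of 22, while plain accumulation counts earlier panes several times.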

Watermarks, Trigger, Accumulators
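cf. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
// Trigger names are shorthand, not literal SDK class names
input
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(60)))
    // fire at the watermark, early every 10 minutes, and again for each late element
    .triggering(
      AtWatermark()
        .withEarlyFirings(AtPeriod(Duration.standardMinutes(10)))
        .withLateFirings(AtCount(1)))
    // accept late data for 15 more minutes
    .withAllowedLateness(Duration.standardMinutes(15))
    // each firing emits only the elements that arrived since the previous firing
    .discardingFiredPanes())
  .apply(Sum.integersPerKey());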

Stream ~ Table Model
•Aggregating a stream over time yields a table
•Changes to a table over time yield a stream
•Table will be updated by every entry in the stream
•Every new entry triggers a computation
•Retention period for late events (cf. allowed lateness)
•Stream/Table ⊆ Dataflow
[Figure: the stream (key1, value1), (key2, value2), (key1, value3) updates a table with the columns key, value and update count]
After (key1, value1):
key1 value1 1
After (key2, value2):
key1 value1 1
key2 value2 1
After (key1, value3):
key1 value3 2
key2 value2 1
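The table states of this slide can be derived by folding the stream into a dictionary. This is a minimal sketch of the idea only; `apply_changelog` is a hypothetical helper and not the API of Kafka Streams or any other framework.

```python
def apply_changelog(stream):
    """Fold (key, value) records into a table and keep a snapshot after each record.
    Each table row holds the latest value and the number of updates for its key."""
    table = {}
    snapshots = []
    for key, value in stream:
        _, updates = table.get(key, (None, 0))
        table[key] = (value, updates + 1)
        snapshots.append(dict(table))
    return snapshots


stream = [("key1", "value1"), ("key2", "value2"), ("key1", "value3")]
snapshots = apply_changelog(stream)
final_table = snapshots[-1]
```

Read in the other direction, the sequence of snapshots is itself a stream of table changes, which is the second half of the duality.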

State & Window Processing
•Non-trivial applications mostly need some kind of (temporarily) persistent state
•e.g. aggregations over a longer time, counters, slowly refreshing metadata
•held in memory, can be stored on disk
•interesting: partitioning, rescaling, node failure?
[Figure: an operation with attached state]

State Implementations
•State is most of the time partitioned
•Distributed over multiple nodes
•Number of nodes might change
•State must be fault-tolerant
•State access must be fast
•Storage backend
•native/own-built: e.g. in Spark Streaming
•existing tools: RocksDB in Kafka Streams
•pluggable: Flink, among others also RocksDB
•Carbone et al. (2017), State Management in Flink,
http://www.vldb.org/pvldb/vol10/p1718-carbone.pdf
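Partitioned state and the effect of a changing number of nodes can be illustrated with a naive Python sketch. It assumes simple hash-modulo partitioning, which real frameworks refine considerably; `owner`, `partition` and the sample keys are hypothetical.

```python
import zlib


def owner(key, num_nodes):
    """Map a key to the node holding its state, using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(state, num_nodes):
    """Distribute a key/value state over num_nodes local stores."""
    nodes = [dict() for _ in range(num_nodes)]
    for key, value in state.items():
        nodes[owner(key, num_nodes)][key] = value
    return nodes


state = {"user-%d" % i: i for i in range(20)}
two_nodes = partition(state, 2)

# Rescaling from 2 to 3 nodes: every key whose owner changes must be moved
three_nodes = partition(state, 3)
moved = [k for k in state if owner(k, 2) != owner(k, 3)]
```

Each key lives on exactly one node, so state access stays local, but rescaling forces part of the state to move. Keeping that movement cheap and fault-tolerant is what the storage backends listed above have to solve.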

Lookup Additional Data
[Figure: Queue, Processing and Results, with a lookup of Metadata during processing]

Lookup - Remote Read
[Figure: Node 1 and Node 2 consume from the Queue and read the Metadata remotely]

Lookup - Local Read
[Figure: Node 1 and Node 2 consume from the Queue and read the Metadata from a local copy]

Deployment & Runtime Environment

Runtime Environment - Cluster vs. Library
[Figure: cluster (YARN) vs. library deployment]

Monitoring
•Framework dependent: UI, REST APIs, Metrics
•Scheduler monitoring
•Own logging: Technical, Business
•Java "classics": JMX, Profiler

Guarantees
•Guarantees: at-most-once, at-least-once, exactly-once
•Mechanisms: Record Acknowledgement, Micro Batching, Snapshots/Checkpoints, Changelogs
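How record acknowledgement leads to at-most-once or at-least-once behaviour can be shown with a small simulation. It is a sketch under simplifying assumptions (a single consumer, one crash, restart from the last acknowledged offset); `run_with_crash` and its parameters are hypothetical and do not model a specific framework.

```python
def run_with_crash(records, crash_at, ack_first):
    """Process records with one simulated crash while handling index `crash_at`.
    After the crash, reading resumes at the last acknowledged offset."""
    processed = []
    committed = 0  # offset of the next record to read after a restart
    crashed = False
    position = 0
    while position < len(records):
        record = records[position]
        if ack_first:
            committed = position + 1  # acknowledge before processing
            if position == crash_at and not crashed:
                crashed = True
                position = committed  # restart: the record is lost
                continue
            processed.append(record)
        else:
            processed.append(record)  # process before acknowledging
            if position == crash_at and not crashed:
                crashed = True
                position = committed  # restart: the record is read again
                continue
            committed = position + 1
        position += 1
    return processed


records = ["a", "b", "c"]
at_most_once = run_with_crash(records, crash_at=1, ack_first=True)
at_least_once = run_with_crash(records, crash_at=1, ack_first=False)
```

Acknowledging first loses the record that was in flight, acknowledging last processes it twice. Exactly-once results need an additional mechanism on top, such as the snapshots or changelogs named above.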


Streaming Frameworks
Helping you implement your solution

"... an execution engine designed for unbounded data sets, and nothing more" (Tyler Akidau)
T. Akidau et al. (2015): The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

Apache Spark
•Open Source (2010) & Apache project (2013)
•Unified Batch & Stream Processing
•Wide distribution, especially in Hadoop environments
•Batch: RDD as base, DataFrames and Datasets as optimization
•Streaming: DStream & Structured Streaming

Apache Spark Streaming
•Microbatching
•Similar, partly unified programming model to batch processing
•State and window operations
•Missing support for event time

Apache Spark Structured Streaming
•Datasets/DataFrames for stream processing
•Data stream as an ever-growing table
•Unified API
•Limited support for event time operations
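// Batch: read a bounded JSON file and write the result once
val batchDs = sparkSession
  .read
  .json("someFile.json")
batchDs
  .write
  .json("otherFile.json")

// Streaming: same API shape, but reading from an unbounded Kafka source
val streamDs = sparkSession
  .readStream
  .format("kafka")
  .option("...", "...")
  .load()
streamDs
  .writeStream
  .outputMode("append") // "complete" is only supported for aggregating queries
  .format("console")
  .start()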

Apache Flink
•Started as research project in 2010 (Stratosphere), Apache project since 2014
•Low latency streaming and high throughput batch processing
•Streaming first
•Flexible state and window handling
•Rich support for event time handling

Apache Kafka Streams API
•Only a library, no runtime environment
•Requires Kafka cluster (>= 0.10)
•Uses Kafka consumer technologies for
•Ordering
•Partitioning
•Scaling
•Source & sink: Kafka topics only
•Kafka Connect for sources & sinks

Current developments
The latest promises and features

Queryable State
•Known as
•Queryable state (Flink)
•Interactive Queries (Kafka Streams)
•Still low level
•Data lifecycle
•(De)Serialization
•Partitioned state discovery
[Figure: an operation with state, exposed through a query interface]

Streaming SQL
•Use SQL to query Streaming Data
•time-varying relations, e.g. [12:00, 12:00)
•query on multiple points in time
•Standard ANSI SQL + some extensions
•SELECT TABLE, SELECT STREAM
•WINDOWS
•TRIGGERS
•Supported by
•Flink
•Kafka Streams (KSQL)
•https://s.apache.org/streaming-sql-strata-nyc
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
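What this query computes can be spelled out in plain Python. The sketch below is an illustration of the semantics only, not how a streaming SQL engine evaluates the query; `possible_fraud`, the timestamps in seconds and the sample attempts are assumptions.

```python
from collections import Counter


def possible_fraud(attempts, window_seconds=5, threshold=3):
    """attempts: (timestamp_seconds, card_number) pairs.
    Returns {(window_start, card_number): count} for counts above the threshold."""
    counts = Counter()
    for timestamp, card_number in attempts:
        # WINDOW TUMBLING (SIZE 5 SECONDS): assign each attempt to its window
        window_start = timestamp - timestamp % window_seconds
        # GROUP BY card_number, per window
        counts[(window_start, card_number)] += 1
    # HAVING count(*) > 3
    return {key: n for key, n in counts.items() if n > threshold}


# Hypothetical sample data: card "A" is used four times within the first window
attempts = [(0, "A"), (1, "A"), (2, "A"), (4, "A"), (3, "B"), (6, "A"), (7, "A")]
suspicious = possible_fraud(attempts)
```

Unlike this batch-style function, the streaming query keeps the table up to date continuously as new authorization attempts arrive.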

Recommendations
Or at least some hints when choosing a framework

Spark Streaming might be an option if
•Spark is already used for batch processing
•Hadoop, and therefore YARN, is used
•A huge community is important
•Scala is not a problem
•Latency is not an important criterion *
•Event time handling is not needed *
•* these change with Structured Streaming
•event time support
•reduced microbatching overhead

Flink is good for...
•flexible event time processing
•watermarks
•triggers
•accumulators
•connectivity to the most important peripheral systems
•low latency stream processing
•excellent state handling

And finally Kafka Streams, if you...
•want an easy deployment
•already have a scheduler/microservice platform
•need low latency and high throughput
•need event time support
•want a lightweight start in streaming
•already use Kafka
•are fine with making Kafka your central backbone

Comparison
                       Spark Streaming       Flink                 Kafka Streams
Engine                 Microbatching         Native                Native
Programming model      Declarative           Declarative           Declarative
Guarantees             Exactly-once          Exactly-once          Exactly-once
Event time handling    No/Yes*               Yes                   Yes
State storage          Own                   Pluggable             RocksDB + topic
Community & ecosystem  Big                   Medium                Big
Deployment             Cluster               Cluster               Library
Monitoring             UI, REST API,         UI, metrics (JMX,     Kafka tools, Confluent
                       Dropwizard Metrics    Ganglia), REST API    Control Center, JMX
* Yes with Structured Streaming

A word on
•Apache Beam
•High-level API for different streaming runners, e.g. Google Cloud Dataflow, Flink and Spark Streaming
•Google Cloud Dataflow
•Cloud streaming by Google
•Apex
•YARN-based with a static topology which can be changed at runtime
•Flume
•Log file shipping, especially into HDFS
•Storm/Heron
•Streaming pioneer by Twitter, Heron as a successor with the same API

Takeaways
•Streaming is not easy
•(Event) Time
•State
•Deployment
•Correctness
•Different concepts and implementations
•Be aware of
•Monitoring
•"Overkill"
•Ongoing research and development

Our mission – to promote agile development, innovation
and technology – extends through everything we do.
Address
codecentric AG
Hochstraße 11
42697 Solingen
Germany
Contact Info
E-Mail: matthias.niehoff@codecentric.de
Twitter: @matthiasniehoff
www.codecentric.de
Stay connected!
