
Flink Streaming

Flink Streaming is the real-time data processing framework of Apache Flink. It provides high-level functional APIs in Scala and Java, backed by a high-performance true-streaming runtime.

  1. Real-time data processing with Flink Streaming. Gyula Fora, gyfora@apache.org, @GyulaFora
  2. What is Flink Streaming • Part of Apache Flink • Real-time data processing • High performance • Expressive functional APIs • Programmable in Java or Scala
  3. This Talk • General introduction • Flink Streaming APIs • Running Flink programs • Overview of Flink internals • Development roadmap • Summary • Questions
  4. Overview of stream processing trends. Apache Storm: true streaming with low latency but lower throughput; low-level API (Bolts, Spouts) plus Trident. Spark Streaming: stream processing on top of a batch system, with high throughput but higher latency; functional API (DStreams), restricted by the batch runtime. Flink Streaming: true streaming with an adjustable latency-throughput trade-off; rich functional API exploiting the streaming runtime, e.g. rich windowing semantics.
  5. Programming model. Data abstraction: Data Stream. [Diagram: a logical program of operators X and Y transforming Data Stream A into B into C, and its parallel execution with two instances of each operator over the partitioned streams A(1)/A(2), B(1)/B(2), C(1)/C(2)]
  6. Flink Streaming APIs
  7. Word count – Java

         DataStream<String> text = env.socketTextStream(host, port);

         DataStream<Tuple2<String, Integer>> result = text
             .flatMap((str, out) -> {
                 // split on non-word characters and count each token once
                 for (String token : str.split("\\W")) {
                     out.collect(new Tuple2<>(token, 1));
                 }
             })
             .groupBy(0)
             .sum(1);

     Pipeline: socket stream → map → reduce → output stream
  8. Word count – Scala

         case class Word(word: String, count: Long)

         val input = env.socketTextStream(host, port)

         val words = input flatMap { line =>
           line.split("\\W+").map(Word(_, 1))
         }

         val counts = words groupBy "word" sum "count"

     Pipeline: socket stream → map → reduce → output stream
  9. Overview of the API • Data stream sources – File system – Message queue connectors – Arbitrary source functionality • Stream transformations – Basic transformations: Map, Reduce, Filter, Aggregations… – Windowing semantics: policy-based flexible windowing (Time, Count, Delta…) – Binary stream transformations: CoMap, CoReduce… – Temporal binary stream operators: Joins, Crosses… – Iterative stream transformations • Data stream outputs
  10. Data stream sources • Process data from anywhere • File-system sources • Socket stream • Message queues – Kafka – RabbitMQ – Flume • Scala/Java collections, streams, sequence generators for development & testing • Arbitrary source functionality using the SourceFunction interface – only an invoke(out: Collector) method has to be implemented (see the sketch below)
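     A minimal sketch of such a custom source in Scala, built only on the invoke(out: Collector) contract named above. The import paths follow the 2015-era streaming API but may differ between versions, and CountingSource, env, and the usage line are illustrative assumptions:

         import org.apache.flink.streaming.api.function.source.SourceFunction
         import org.apache.flink.util.Collector

         // Hypothetical custom source: emits the numbers 0..99, then finishes.
         // Per the slide above, only invoke(out: Collector) is implemented.
         class CountingSource extends SourceFunction[Long] {
           override def invoke(out: Collector[Long]): Unit = {
             for (i <- 0L until 100L) {
               out.collect(i) // push one record downstream
             }
           }
         }

         // Assumed usage with the environment from the earlier examples:
         // val numbers = env.addSource(new CountingSource)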
  11. Basic transformations • Rich set of functional transformations: Map, FlatMap, Reduce, GroupReduce, Filter, Project… • Aggregations by field name or position: Sum, Min, Max, MinBy, MaxBy, Count… (a short example follows below) [Diagram: two sources feeding Map, FlatMap, Reduce, Merge, and Sum operators into a sink]
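     A minimal sketch chaining a few of the transformations and aggregations listed above, reusing the positional groupBy/sum style from the word-count slides; the fromElements test source and the example tuples are assumptions for illustration:

         // Hypothetical (word, count) pairs from a small test source
         val pairs = env.fromElements(("a", 1), ("b", 4), ("a", 2))

         val result = pairs
           .filter(_._2 > 1)            // Filter: drop pairs with count <= 1
           .map(p => (p._1, p._2 * 10)) // Map: scale each count by 10
           .groupBy(0)                  // group on the first tuple field
           .sum(1)                      // Sum aggregation on the second field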
  12. Windowing • Flexible policy-based windowing • Trigger and eviction policies • Built-in policies: – Time: Time.of(length, TimeUnit/custom timestamp) – Count: Count.of(windowSize) – Delta: Delta.of(threshold, delta function, start value) • Window transformations: – Reduce – ReduceGroup – Grouped Reduce/ReduceGroup • Custom trigger and eviction policies can also be implemented easily (see the count-window sketch below)
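     Before the time-window example on the next slide, a minimal count-window sketch combining the Count policy with the window/every pattern used later in the deck; the numbers stream is an assumption, and the exact reduce signature may differ between versions:

         // Hypothetical Long stream: sum the last 100 elements,
         // re-evaluated every 10 elements (Count eviction and trigger).
         val slidingSums = numbers
           .window(Count.of(100)) // eviction: keep the last 100 elements
           .every(Count.of(10))   // trigger: fire every 10 elements
           .reduce(_ + _)         // window Reduce from the list above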
  13. Windowing example

         // Build a new model every minute on the last
         // 5 minutes' worth of data
         val model = trainingData
           .window(Time.of(5, MINUTES))
           .every(Time.of(1, MINUTES))
           .reduceGroup(buildModel)

         // Predict new data using the most up-to-date model
         val prediction = newData
           .connect(model)
           .map(predict)

     [Diagram: training data → M (model builder); new data + latest model → P (predictor) → prediction]
  14. Temporal operators • Binary stream operators that work on time windows • Database-style operators: – Join: s1.join(s2).onWindow(…).every(…).where(key1).equalTo(key2) – Cross: s1.cross(s2).onWindow(…).every(…) • UDFs can also be used for custom operator logic on the elements in the windows
  15. Window join example

         case class Name(id: Long, name: String)
         case class Age(id: Long, age: Int)
         case class Person(name: String, age: Int)

         val names = ...
         val ages = ...

         names.join(ages)
           .onWindow(5, SECONDS)
           .where("id")
           .equalTo("id") { (n, a) => Person(n.name, a.age) }
  16. Iterative stream processing

         def iterate[R](
             stepFunction: DataStream[T] => (DataStream[T], DataStream[R]),
             maxWaitTimeMillis: Long = 0): DataStream[R]

     [Diagram: input stream T → step function → feedback stream looped back, output stream R emitted]
  17. Iterative processing example

         val env = StreamExecutionEnvironment.getExecutionEnvironment

         env.generateSequence(1, 10)
           .iterate(incrementToTen, 1000)
           .print

         env.execute("Iterative example")

         def incrementToTen(input: DataStream[Long]) = {
           val incremented = input.map { _ + 1 }
           val split = incremented.split { x =>
             if (x >= 10) "out" else "feedback"
           }
           (split.select("feedback"), split.select("out"))
         }

     [Diagram: number stream → map → split; "feedback" loops back into the map, "out" goes to the output stream]
  18. Running Flink Programs
  19. Flink programs run everywhere: on a cluster (batch), locally for debugging as Java collection programs, or embedded (e.g., in a web container), on the Flink runtime or Apache Tez.
  20. Little tuning or configuration needed • Requires no memory thresholds to configure – Flink manages its own memory • Requires no complicated network configs – the pipelining engine requires much less memory for data exchange • Requires no serializers to be configured – Flink handles its own type extraction and data representation • Programs can be adjusted to data automatically – Flink’s optimizer can choose execution strategies automatically
  21. Under the hood
  22. Distributed runtime • The master (Job Manager) handles job submission, scheduling, and metadata • Workers (Task Managers) execute operations • Data can be streamed between nodes • Data output is buffered for higher throughput (tunable)
  23. Hybrid batch/streaming • True data streaming on the runtime layer • Dataflow-based runtime • No unnecessary synchronization steps • Batch and stream processing seamlessly integrated
  24. Development roadmap
  25. Roadmap • Fault tolerance – 2015 Q1-2 • Lambda architecture – 2015 Q2 • Integration with other frameworks – SAMOA – 2015 Q1 – Zeppelin – 2015 ? • Streaming machine learning library – 2015 Q3 • Streaming graph processing library – 2015 Q3
  26. Fault tolerance • At-least-once semantics – currently an alpha version – source-level in-memory replication – record acknowledgments • Exactly-once semantics – final goal, current research – upstream backup with state checkpointing
  27. Lambda architecture in other systems [Diagram source: https://www.mapr.com/developercentral/lambda-architecture]
  28. Lambda architecture in Apache Flink: one system, one API, one cluster
  29. Performance
  30. Flink vs Storm
  31. Flink vs Spark [Chart: processing time in ms for 300 million packets vs. memory in GB, comparing Spark and Flink]
  32. Summary • Flink combines a true streaming runtime with expressive high-level APIs for a next-gen stream processing solution • Tunable throughput-latency trade-off with competitive performance at both ends • Iterative processing support opens new horizons in online machine learning • We are just getting started! – Lambda architecture – Integrations
  33. Where to find us • flink.apache.org • github.com/apache/flink • @ApacheFlink • gyfora@apache.org
  34. Thank you!
