This presentation introduces Apache Flink, a massively parallel data processing engine currently undergoing incubation at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it a unique system in the world of Big Data processing.
Introduction to Apache Flink - Fast and reliable big data processing
1. Apache Flink
Fast and reliable big data processing
Till Rohrmann
trohrmann@apache.org
2. What is Apache Flink?
• Project undergoing incubation in the Apache Software Foundation
• Originating from the Stratosphere research project started at TU Berlin in 2009
• http://flink.incubator.apache.org
• 59 contributors (doubled in ~4 months)
• Has awesome squirrel logo
3. What is Apache Flink?
[Architecture diagram: the Flink client submits a program to the master, which distributes the work across the workers]
4. Current state
• Fast - much faster than Hadoop, faster than Spark in many cases
• Reliable - does not suffer from memory problems
5. Outline of this talk
• Introduction to Flink
• Distributed PageRank with Flink
• Other Flink features
• Flink roadmap and closing
6. Where to locate Flink in the Open Source landscape?
• Applications: Hive, Mahout, Pig, Crunch, Cascading
• Data processing engines: MapReduce, Flink, Spark, Storm, Tez
• App and resource management: Yarn, Mesos
• Storage, streams: HDFS, HBase, Kafka, …
7. Distributed data sets
[Diagram: a program is a dataflow of operators X and Y over DataSets A, B and C; in the parallel execution, each DataSet is split into partitions A(1), A(2), B(1), B(2), C(1), C(2), and each operator runs as parallel instances on those partitions]
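The partitioned execution sketched above can be illustrated locally. This is a hypothetical plain-Java sketch (class and method names are illustrative, not Flink's API): a data set is split into partitions, an element-wise operator is applied to each partition independently, and merging the partition results matches a sequential run.

```java
import java.util.*;
import java.util.stream.*;

public class PartitionedMap {
    // Operator X: an element-wise map each worker can apply on its own partition.
    static int operatorX(int x) { return x * 10; }

    public static List<Integer> runPartitioned(List<Integer> dataSetA, int numPartitions) {
        int chunk = (dataSetA.size() + numPartitions - 1) / numPartitions;
        List<Integer> merged = new ArrayList<>();
        for (int start = 0; start < dataSetA.size(); start += chunk) {
            // One partition, e.g. A(1) or A(2); each could live on a different worker.
            List<Integer> partition =
                dataSetA.subList(start, Math.min(start + chunk, dataSetA.size()));
            List<Integer> mapped =
                partition.stream().map(PartitionedMap::operatorX).collect(Collectors.toList());
            merged.addAll(mapped);  // collect the partition results
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(runPartitioned(List.of(1, 2, 3, 4, 5, 6), 2));
    }
}
```

Because the operator is applied element-wise, no coordination between partitions is needed, which is what makes this kind of operator embarrassingly parallel.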
9. PageRank
• Algorithm which made Google a multi-billion-dollar business
• Ranking of search results
• Model: random surfer
• Follows links
• Randomly selects an arbitrary website
10. How can we solve the problem?
PageRankDS = {
(1, 0.3)
(2, 0.5)
(3, 0.2)
}
AdjacencyDS = {
(1, [2, 3])
(2, [1])
(3, [1, 2])
}
11. PageRank{
node: Int
rank: Double
}
Adjacency{
node: Int
neighbours: List[Int]
}
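Before distributing the computation, the rank update can be sketched on a single machine. This is a plain-Java sketch (not Flink code) of the iteration the following slides implement with Flink, using the toy AdjacencyDS from the previous slide: rank(v) = (1 - d) / N + d * sum over in-links u of rank(u) / outdeg(u).

```java
import java.util.*;

public class LocalPagerank {
    public static Map<Integer, Double> run(Map<Integer, int[]> adjacency,
                                           double dampingFactor, int iterations) {
        int numVertices = adjacency.size();
        Map<Integer, Double> rank = new HashMap<>();
        for (int node : adjacency.keySet()) rank.put(node, 1.0 / numVertices);

        for (int i = 0; i < iterations; i++) {
            Map<Integer, Double> next = new HashMap<>();
            // Uniform random-jump contribution: the "partial pagerank" a node emits for itself.
            for (int node : adjacency.keySet())
                next.put(node, (1 - dampingFactor) / numVertices);
            // Each node sends rank/outdegree along its out-edges; summing per target
            // corresponds to the groupBy/reduce step in the distributed version.
            for (Map.Entry<Integer, int[]> e : adjacency.entrySet()) {
                double share = dampingFactor * rank.get(e.getKey()) / e.getValue().length;
                for (int neighbour : e.getValue())
                    next.merge(neighbour, share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> adjacency = Map.of(
            1, new int[]{2, 3}, 2, new int[]{1}, 3, new int[]{1, 2});
        System.out.println(run(adjacency, 0.85, 100));
    }
}
```

Note that the ranks stay normalised: the jump terms contribute (1 - d) in total and the link terms redistribute d times the previous total, so the ranks keep summing to 1.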
13. case class Pagerank(node: Int, rank: Double)
case class Adjacency(node: Int, neighbours: Array[Int])

def main(args: Array[String]): Unit = {
  val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

  val initialPagerank: DataSet[Pagerank] = createInitialPagerank(numVertices, env)
  val adjacency: DataSet[Adjacency] = createRandomAdjacency(numVertices, sparsity, env)

  val solution = initialPagerank.iterate(100) {
    pagerank =>
      val partialPagerank = pagerank.join(adjacency).
        where("node").
        equalTo("node").
        flatMap {
          // generating the partial pageranks
          pair => {
            val (Pagerank(node, rank), Adjacency(_, neighbours)) = pair
            val length = neighbours.length
            neighbours.map {
              neighbour =>
                Pagerank(neighbour, dampingFactor * rank / length)
            } :+ Pagerank(node, (1 - dampingFactor) / numVertices)
          }
        }

      // adding the partial pageranks up
      partialPagerank.
        groupBy("node").
        reduce {
          (left, right) => Pagerank(left.node, left.rank + right.rank)
        }
  }

  solution.print()
  env.execute("Flink pagerank.")
}
15. Memory management
• Flink manages its own memory on the heap
• Caching and data processing happens in managed memory
• Allows graceful spilling, never out-of-memory exceptions
[Diagram: the JVM heap is split into the unmanaged heap (user code), the Flink managed heap (Flink runtime) and network buffers (shuffles/broadcasts)]
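The idea behind the managed heap can be shown with a toy sketch (illustrative only, these are not Flink's actual classes): records are serialized into a pre-allocated fixed-size memory segment, and a full segment is drained ("spilled") instead of letting the process run into an OutOfMemoryError.

```java
import java.nio.ByteBuffer;
import java.util.*;

public class ManagedSegment {
    private final ByteBuffer segment;                // one pre-allocated managed segment
    final List<int[]> spilled = new ArrayList<>();   // stand-in for batches written to disk

    ManagedSegment(int segmentSize) { segment = ByteBuffer.allocate(segmentSize); }

    void write(int record) {
        if (segment.remaining() < Integer.BYTES) drain();  // graceful spilling, no OOM
        segment.putInt(record);
    }

    void drain() {
        segment.flip();
        int[] batch = new int[segment.remaining() / Integer.BYTES];
        for (int i = 0; i < batch.length; i++) batch[i] = segment.getInt();
        if (batch.length > 0) spilled.add(batch);
        segment.clear();
    }

    public static void main(String[] args) {
        ManagedSegment s = new ManagedSegment(32);   // room for 8 ints per segment
        for (int r = 1; r <= 20; r++) s.write(r);
        s.drain();
        System.out.println(s.spilled.size() + " spilled batches");
    }
}
```

Working on serialized bytes in fixed segments also keeps memory usage predictable, because it does not depend on the size of the Java objects the user code creates.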
16. Hadoop compatibility
• Flink supports out of the box:
• Hadoop data types (Writables)
• Hadoop input/output formats
• Hadoop functions and object model
[Diagram: a Hadoop Input → Map → Reduce → Output pipeline embedded as operators inside a larger Flink dataflow]
17. Flink Streaming
Word count with Java API
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple2<String, Integer>> result = env
  .readTextFile(input)
  .flatMap(sentence -> asList(sentence.split(" ")))
  .map(word -> new Tuple2<>(word, 1))
  .groupBy(0)
  .aggregate(SUM, 1);

Word count with Flink Streaming
StreamExecutionEnvironment env =
  StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Tuple2<String, Integer>> result = env
  .readTextFile(input)
  .flatMap(sentence -> asList(sentence.split(" ")))
  .map(word -> new Tuple2<>(word, 1))
  .groupBy(0)
  .sum(1);
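The same dataflow can be run locally as a stand-in, without Flink, to see what the operators compute. This plain-Java-streams sketch mirrors the steps above on a hard-coded input (the input lines and class name are illustrative): flatMap into words, pair each word with a count of 1, then group by word and sum the counts.

```java
import java.util.*;
import java.util.stream.*;

public class LocalWordCount {
    public static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
            .flatMap(sentence -> Arrays.stream(sentence.split(" ")))   // flatMap step
            .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));  // groupBy + sum
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("to be or not to be", "be fast")));
    }
}
```

The difference in Flink is where this runs: the batch version computes the counts once over a finite DataSet, while the streaming version keeps updating them as new records arrive on the DataStream.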
20. Write once, run with any data!
[Diagram: the same program compiles to different execution plans. Plan A: run on a sample on the laptop. Plan B: run on large files on the cluster. Plan C: run a month later, after the data evolved. The optimizer chooses between hash vs. sort, partition vs. broadcast, caching, and reusing an existing partitioning/sort.]
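One of the partition-vs-broadcast decisions above can be sketched as a simple cost rule. This is an illustrative sketch of the kind of choice a plan optimizer makes, not Flink's API; the class name and the 64 MB threshold are assumptions for the example. Broadcast the smaller join input when it is small enough to ship to every worker; otherwise repartition both inputs by the join key.

```java
public class JoinPlanner {
    enum Strategy { BROADCAST_JOIN, PARTITION_JOIN }

    // Assumed threshold: largest input we are willing to copy to every worker.
    static final long BROADCAST_LIMIT = 64L * 1024 * 1024;

    static Strategy choose(long leftBytes, long rightBytes) {
        long smaller = Math.min(leftBytes, rightBytes);
        return smaller < BROADCAST_LIMIT ? Strategy.BROADCAST_JOIN
                                         : Strategy.PARTITION_JOIN;
    }

    public static void main(String[] args) {
        // A small sample broadcasts; two large cluster files repartition.
        System.out.println(choose(1024, 10L * 1024 * 1024 * 1024));
        System.out.println(choose(5L * 1024 * 1024 * 1024, 10L * 1024 * 1024 * 1024));
    }
}
```

Because the decision depends only on input statistics, the same program picks a different plan when the sample on the laptop grows into large files on the cluster, which is the point of the slide.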
21. Little tuning required
• Requires no memory thresholds to configure
• Requires no complicated network configs
• Requires no serializers to be configured
• Programs adjust to data automatically
22. Flink roadmap
• Flink has a major release every 3 months
• Finer grained fault-tolerance
• Logical (SQL-like) field addressing
• Python API
• Flink Streaming, Lambda architecture support
• Flink on Tez
• ML on Flink (Mahout DSL)
• Graph DSL on Flink
• … and much more