This presentation introduces Apache Flink, a massively parallel data processing engine currently undergoing incubation at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it a unique system in the world of Big Data processing.
Introduction to Apache Flink - Fast and reliable big data processing
1. Apache Flink
Fast and reliable big data processing
Till Rohrmann
trohrmann@apache.org
2. What is Apache Flink?
• Project undergoing incubation in the Apache Software Foundation
• Originating from the Stratosphere research project started at TU Berlin in 2009
• http://flink.incubator.apache.org
• 59 contributors (doubled in ~4 months)
• Has awesome squirrel logo
3. What is Apache Flink?
[Architecture diagram: the Flink client submits a program to the master, which distributes the work across the workers]
4. Current state
• Fast - much faster than Hadoop, faster than Spark in many cases
• Reliable - does not suffer from memory problems
5. Outline of this talk
• Introduction to Flink
• Distributed PageRank with Flink
• Other Flink features
• Flink roadmap and closing
6. Where to locate Flink in the Open Source landscape?
• Applications: Hive, Mahout, Pig, Crunch, Cascading
• Data processing engines: MapReduce, Flink, Spark, Storm, Tez
• App and resource management: Yarn, Mesos
• Storage, streams: HDFS, HBase, Kafka, …
7. Distributed data sets
[Diagram: a program is a dataflow of operators X and Y over DataSets A, B and C; in the parallel execution, each DataSet is split into partitions A(1), A(2), B(1), B(2), C(1), C(2), and each operator runs as parallel instances on those partitions]
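The partitioned execution sketched above can be illustrated locally. This is a hypothetical plain-Java sketch (class and method names are illustrative, not Flink's API): a data set is split into partitions, an element-wise operator is applied to each partition independently, and merging the partition results matches a sequential run.

```java
import java.util.*;
import java.util.stream.*;

public class PartitionedMap {
    // Operator X: an element-wise map each worker can apply on its own partition.
    static int operatorX(int x) { return x * 10; }

    public static List<Integer> runPartitioned(List<Integer> dataSetA, int numPartitions) {
        int chunk = (dataSetA.size() + numPartitions - 1) / numPartitions;
        List<Integer> merged = new ArrayList<>();
        for (int start = 0; start < dataSetA.size(); start += chunk) {
            // One partition, e.g. A(1) or A(2); each could live on a different worker.
            List<Integer> partition =
                dataSetA.subList(start, Math.min(start + chunk, dataSetA.size()));
            List<Integer> mapped =
                partition.stream().map(PartitionedMap::operatorX).collect(Collectors.toList());
            merged.addAll(mapped);  // collect the partition results
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(runPartitioned(List.of(1, 2, 3, 4, 5, 6), 2));
    }
}
```

Because the operator is applied element-wise, no coordination between partitions is needed, which is what makes this kind of operator embarrassingly parallel.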
9. PageRank
• Algorithm which made Google a multi-billion-dollar business
• Ranking of search results
• Model: random surfer
• Follows links
• Randomly selects an arbitrary website
10. How can we solve the problem?
PageRankDS = {
(1, 0.3)
(2, 0.5)
(3, 0.2)
}
AdjacencyDS = {
(1, [2, 3])
(2, [1])
(3, [1, 2])
}
11. PageRank{
node: Int
rank: Double
}
Adjacency{
node: Int
neighbours: List[Int]
}
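Before distributing the computation, the rank update can be sketched on a single machine. This is a plain-Java sketch (not Flink code) of the iteration the following slides implement with Flink, using the toy AdjacencyDS from the previous slide: rank(v) = (1 - d) / N + d * sum over in-links u of rank(u) / outdeg(u).

```java
import java.util.*;

public class LocalPagerank {
    public static Map<Integer, Double> run(Map<Integer, int[]> adjacency,
                                           double dampingFactor, int iterations) {
        int numVertices = adjacency.size();
        Map<Integer, Double> rank = new HashMap<>();
        for (int node : adjacency.keySet()) rank.put(node, 1.0 / numVertices);

        for (int i = 0; i < iterations; i++) {
            Map<Integer, Double> next = new HashMap<>();
            // Uniform random-jump contribution: the "partial pagerank" a node emits for itself.
            for (int node : adjacency.keySet())
                next.put(node, (1 - dampingFactor) / numVertices);
            // Each node sends rank/outdegree along its out-edges; summing per target
            // corresponds to the groupBy/reduce step in the distributed version.
            for (Map.Entry<Integer, int[]> e : adjacency.entrySet()) {
                double share = dampingFactor * rank.get(e.getKey()) / e.getValue().length;
                for (int neighbour : e.getValue())
                    next.merge(neighbour, share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<Integer, int[]> adjacency = Map.of(
            1, new int[]{2, 3}, 2, new int[]{1}, 3, new int[]{1, 2});
        System.out.println(run(adjacency, 0.85, 100));
    }
}
```

Note that the ranks stay normalised: the jump terms contribute (1 - d) in total and the link terms redistribute d times the previous total, so the ranks keep summing to 1.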
13. case class Pagerank(node: Int, rank: Double)
case class Adjacency(node: Int, neighbours: Array[Int])

def main(args: Array[String]): Unit = {
  val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

  val initialPagerank: DataSet[Pagerank] = createInitialPagerank(numVertices, env)
  val adjacency: DataSet[Adjacency] = createRandomAdjacency(numVertices, sparsity, env)

  val solution = initialPagerank.iterate(100) {
    pagerank =>
      val partialPagerank = pagerank.join(adjacency).
        where("node").
        equalTo("node").
        flatMap {
          // generating the partial pageranks
          pair => {
            val (Pagerank(node, rank), Adjacency(_, neighbours)) = pair
            val length = neighbours.length
            neighbours.map {
              neighbour =>
                Pagerank(neighbour, dampingFactor * rank / length)
            } :+ Pagerank(node, (1 - dampingFactor) / numVertices)
          }
        }

      // adding the partial pageranks up
      partialPagerank.
        groupBy("node").
        reduce {
          (left, right) => Pagerank(left.node, left.rank + right.rank)
        }
  }

  solution.print()
  env.execute("Flink pagerank.")
}
15. Memory management
• Flink manages its own memory on the heap
• Caching and data processing happens in managed memory
• Allows graceful spilling, never out-of-memory exceptions
[Diagram: the JVM heap is split into the unmanaged heap (user code), the Flink managed heap (Flink runtime) and network buffers (shuffles/broadcasts)]
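The idea behind the managed heap can be shown with a toy sketch (illustrative only, these are not Flink's actual classes): records are serialized into a pre-allocated fixed-size memory segment, and a full segment is drained ("spilled") instead of letting the process run into an OutOfMemoryError.

```java
import java.nio.ByteBuffer;
import java.util.*;

public class ManagedSegment {
    private final ByteBuffer segment;                // one pre-allocated managed segment
    final List<int[]> spilled = new ArrayList<>();   // stand-in for batches written to disk

    ManagedSegment(int segmentSize) { segment = ByteBuffer.allocate(segmentSize); }

    void write(int record) {
        if (segment.remaining() < Integer.BYTES) drain();  // graceful spilling, no OOM
        segment.putInt(record);
    }

    void drain() {
        segment.flip();
        int[] batch = new int[segment.remaining() / Integer.BYTES];
        for (int i = 0; i < batch.length; i++) batch[i] = segment.getInt();
        if (batch.length > 0) spilled.add(batch);
        segment.clear();
    }

    public static void main(String[] args) {
        ManagedSegment s = new ManagedSegment(32);   // room for 8 ints per segment
        for (int r = 1; r <= 20; r++) s.write(r);
        s.drain();
        System.out.println(s.spilled.size() + " spilled batches");
    }
}
```

Working on serialized bytes in fixed segments also keeps memory usage predictable, because it does not depend on the size of the Java objects the user code creates.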
16. Hadoop compatibility
• Flink supports out of the box:
• Hadoop data types (Writables)
• Hadoop input/output formats
• Hadoop functions and object model
[Diagram: a Hadoop Input → Map → Reduce → Output pipeline embedded as operators inside a larger Flink dataflow]
17. Flink Streaming
Word count with Java API
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple2<String, Integer>> result = env
  .readTextFile(input)
  .flatMap(sentence -> asList(sentence.split(" ")))
  .map(word -> new Tuple2<>(word, 1))
  .groupBy(0)
  .aggregate(SUM, 1);

Word count with Flink Streaming
StreamExecutionEnvironment env =
  StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Tuple2<String, Integer>> result = env
  .readTextFile(input)
  .flatMap(sentence -> asList(sentence.split(" ")))
  .map(word -> new Tuple2<>(word, 1))
  .groupBy(0)
  .sum(1);
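The same dataflow can be run locally as a stand-in, without Flink, to see what the operators compute. This plain-Java-streams sketch mirrors the steps above on a hard-coded input (the input lines and class name are illustrative): flatMap into words, pair each word with a count of 1, then group by word and sum the counts.

```java
import java.util.*;
import java.util.stream.*;

public class LocalWordCount {
    public static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
            .flatMap(sentence -> Arrays.stream(sentence.split(" ")))   // flatMap step
            .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));  // groupBy + sum
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("to be or not to be", "be fast")));
    }
}
```

The difference in Flink is where this runs: the batch version computes the counts once over a finite DataSet, while the streaming version keeps updating them as new records arrive on the DataStream.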
20. Write once, run with any data!
[Diagram: the same program compiles to different execution plans. Plan A: run on a sample on the laptop. Plan B: run on large files on the cluster. Plan C: run a month later, after the data evolved. The optimizer chooses between hash vs. sort, partition vs. broadcast, caching, and reusing an existing partitioning/sort.]
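One of the partition-vs-broadcast decisions above can be sketched as a simple cost rule. This is an illustrative sketch of the kind of choice a plan optimizer makes, not Flink's API; the class name and the 64 MB threshold are assumptions for the example. Broadcast the smaller join input when it is small enough to ship to every worker; otherwise repartition both inputs by the join key.

```java
public class JoinPlanner {
    enum Strategy { BROADCAST_JOIN, PARTITION_JOIN }

    // Assumed threshold: largest input we are willing to copy to every worker.
    static final long BROADCAST_LIMIT = 64L * 1024 * 1024;

    static Strategy choose(long leftBytes, long rightBytes) {
        long smaller = Math.min(leftBytes, rightBytes);
        return smaller < BROADCAST_LIMIT ? Strategy.BROADCAST_JOIN
                                         : Strategy.PARTITION_JOIN;
    }

    public static void main(String[] args) {
        // A small sample broadcasts; two large cluster files repartition.
        System.out.println(choose(1024, 10L * 1024 * 1024 * 1024));
        System.out.println(choose(5L * 1024 * 1024 * 1024, 10L * 1024 * 1024 * 1024));
    }
}
```

Because the decision depends only on input statistics, the same program picks a different plan when the sample on the laptop grows into large files on the cluster, which is the point of the slide.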
21. Little tuning required
• Requires no memory thresholds to configure
• Requires no complicated network configs
• Requires no serializers to be configured
• Programs adjust to data automatically
22. Flink roadmap
• Flink has a major release every 3 months
• Finer grained fault-tolerance
• Logical (SQL-like) field addressing
• Python API
• Flink Streaming, Lambda architecture support
• Flink on Tez
• ML on Flink (Mahout DSL)
• Graph DSL on Flink
• … and much more