Batch Processing using Apache Flink
By - Sameer Wadkar
Flink API
DataSet
• Input is in the form of files or collections (Unit Testing)
• Results of transformations are returned as Sinks, which may be files, the command line terminal, or collections (Unit Testing) (see the sketch below)
DataStream
• Similar to DataSet but applies to streaming data
Table
• SQL-like expression language embedded in Java/Scala
• Instead of working with DataSet or DataStream, use the Table abstraction
Source Code
Source Code for examples presented can be downloaded from
https://github.com/sameeraxiomine/FlinkMeetup
Flink DataSet API – Word Count
public class WordCount {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> text = getLines(env); //Create DataSet from lines in file
DataSet<Tuple2<String, Integer>> wordCounts = text
.flatMap(new LineSplitter())
.groupBy(0) //Group by first element of the Tuple
.aggregate(Aggregations.SUM, 1);
wordCounts.print();//Execute the WordCount job
}
/*FlatMap implementation which converts each line to many <Word,1> pairs*/
public static class LineSplitter implements
FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
for (String word : line.split(" ")) {
out.collect(new Tuple2<String, Integer>(word, 1));
}
}
}
}
Source Code -
https://github.com/sameeraxiomine/FlinkMeetup/blob/master/src/main/java/org/a
pache/flink/examples/WordCount.java
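The getLines(env) helper is not shown on the slide; a minimal sketch, assuming the input is either a text file or an in-memory collection (the file path and sample lines are hypothetical):

//Create a DataSet from the lines in a file, or from an in-memory collection for unit testing
public static DataSet<String> getLines(ExecutionEnvironment env) {
  //return env.readTextFile("/path/to/input.txt");
  return env.fromElements("to be or not to be", "that is the question");
}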
Flink Batch API (Table API)
public class WordCountUsingTableAPI {
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment
.getExecutionEnvironment();
TableEnvironment tableEnv = new TableEnvironment();
DataSet<Word> words = getWords(env);
Table table = tableEnv.fromDataSet(words);
Table filtered = table
.groupBy("word")
.select("word.count as wrdCnt, word")
.filter(" wrdCnt = 2");
DataSet<Word> result = tableEnv.toDataSet(filtered, Word.class);
result.print();
}
public static DataSet<Word> getWords(ExecutionEnvironment env) { /*Return DataSet of Word*/ }
public static class Word {
public String word;
public int wrdCnt;
public Word(String word, int wrdCnt) {
this.word = word; this.wrdCnt = wrdCnt;
}
public Word() {} // empty constructor to satisfy POJO requirements
@Override
public String toString() {
return "Word [word=" + word + ", count=" + wrdCnt + "]";
}
}
}
Source Code -
https://github.com/sameeraxiomine/FlinkMeetup/blob/master/src/main/java/org/apach
e/flink/examples/WordCountUsingTableAPI.java
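The getWords(env) helper is elided above; a minimal sketch with hypothetical sample data in which "flink" appears twice, so the wrdCnt = 2 filter keeps it:

public static DataSet<Word> getWords(ExecutionEnvironment env) {
  //Each input Word carries only the word itself; wrdCnt is computed later by the Table API
  return env.fromElements(
      new Word("flink", 1),
      new Word("flink", 1),
      new Word("batch", 1));
}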
Table API – How it works
Table filtered = table
.groupBy("word")
.select("word, word.count as wrdCnt")//count(word)
.filter(" wrdCnt = 2");
DataSet<Word> result = tableEnv.toDataSet(filtered, Word.class);
……
public static DataSet<Word> getWords(ExecutionEnvironment env) { /*Return DataSet of Word*/ }
public static class Word {
public String word;
public int wrdCnt;
…
}
Execution flow:
• Group by Word.word
• Count the words (word.count as wrdCnt) and emit (word, wrdCnt)
• Filter words with wrdCnt == 2
• Transform the result to DataSet<Word> using reflection
Iterative Algorithm
[Diagram: a generic iterative algorithm – (1) read the input data, (2) run an iteration, (3) check whether to continue, (4) if yes, update the input and feed it back into the next iteration, (5) otherwise write the result of the last iteration as the output.]
Iterative Algorithm - MapReduce
[Diagram: the same loop with MapReduce – the input data is read from HDFS (1), each iteration is a MapReduce job (2), the continue/stop decision (3) is made by checking counters or by launching a new MapReduce job, on "yes" the updated input is written back to HDFS for the next job (4), and the result of the last iteration is written to HDFS as the output (5).]
Iterative Algorithm - Spark
[Diagram: the same loop in Spark – the input data is read from HDFS into an RDD (1), each iteration updates and caches the RDD (2), the continue/stop decision (3) is made by a Spark action or by checking counters, on "yes" the cached RDD feeds the next iteration (4), and the final result is written to disk (5).]
Iterative Algorithm - Flink
[Diagram: the same loop in Flink – the input DataSet is read (1) and the whole loop runs inside a single pipelined job (2) using an IterativeDataSet or a DeltaIteration; an Aggregator together with a ConvergenceCriterion decides whether to continue (3), on "yes" the new input DataSet is fed back (4), and the resulting DataSet is written to disk (5).]
Batch Processing - Iteration Operators
• Iterative algorithms are commonly used in
• Graph processing
• Machine Learning – Bayesian methods, numerical solutions, optimization algorithms
• Accumulators can be used as job-level counters (see the sketch after this list)
• Aggregators are used as iteration-level counters
• Reset at the end of each iteration
• Can specify a convergence criterion to exit the loop (iterative process)
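A minimal sketch of an Accumulator used as a job-level counter (the job, accumulator name and sample data are hypothetical; Aggregators, by contrast, are registered on the IterativeDataSet as shown later in this deck):

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;
import org.apache.flink.configuration.Configuration;

public class AccumulatorSketch {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<String> lines = env.fromElements("a", "b", "c");
    lines.map(new RichMapFunction<String, String>() {
      private final IntCounter processed = new IntCounter();
      @Override
      public void open(Configuration parameters) {
        getRuntimeContext().addAccumulator("processed", processed); //job-level counter
      }
      @Override
      public String map(String value) {
        processed.add(1); //guaranteed accurate only once the whole job has finished
        return value;
      }
    }).output(new DiscardingOutputFormat<String>());
    JobExecutionResult result = env.execute("accumulator-demo");
    Integer processedCount = result.getAccumulatorResult("processed");
    System.out.println("Processed: " + processedCount);
  }
}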
Bulk Iterations vs Delta Iterations
• Bulk Iterations are appropriate when the entire dataset is consumed in each iteration
• Example - K-Means Clustering algorithm
• Delta Iterations exploit the following features (see the sketch after this list)
• Each iteration processes only a subset of the full DataSet
• The working dataset becomes smaller with each iteration, allowing each subsequent step to run faster
• Example – Graph processing (propagate the minimum in a graph)
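A minimal sketch, on hypothetical data, contrasting the two entry points of the DataSet API: iterate() for bulk iterations and iterateDelta() for delta iterations. The join in the delta branch is only there to show the wiring; the real step logic appears later in this deck:

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.operators.IterativeDataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public class IterationStyles {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple2<Long, Long>> data = env.fromElements(
        new Tuple2<>(1L, 1L), new Tuple2<>(2L, 2L));

    //Bulk iteration: the entire DataSet is recomputed and fed back in every superstep
    IterativeDataSet<Tuple2<Long, Long>> bulk = data.iterate(10);
    DataSet<Tuple2<Long, Long>> bulkResult = bulk.closeWith(
        bulk.map(new MapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
          @Override
          public Tuple2<Long, Long> map(Tuple2<Long, Long> v) {
            return new Tuple2<>(v.f0, v.f1 + 1);
          }
        }));
    bulkResult.print();

    //Delta iteration: a solution set keyed on field 0 plus a workset that can shrink
    DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> delta =
        data.iterateDelta(data, 10, 0);
    DataSet<Tuple2<Long, Long>> changes = delta.getWorkset()
        .join(delta.getSolutionSet()).where(0).equalTo(0)
        .with(new JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
          @Override
          public Tuple2<Long, Long> join(Tuple2<Long, Long> ws, Tuple2<Long, Long> ss) {
            return ws; //pass-through; a real job would emit only the changed elements
          }
        });
    DataSet<Tuple2<Long, Long>> deltaResult = delta.closeWith(changes, changes);
    deltaResult.print();
  }
}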
Bulk Iteration – Toy Example
• Consider a DataSet<Long> of random numbers from 0-99. This DataSet can be arbitrarily large
• Each number needs to be incremented simultaneously
• Stop when, at the end of an iteration, the sum of all numbers exceeds an arbitrary but user-defined value (e.g. noOfElements * 20000)
[Diagram: the input DataSet of numbers i1, i2, i3, …, in; all numbers are incremented simultaneously (i1+1, i2+1, i3+1, …, in+1); if the sum of all numbers is not yet greater than N the loop repeats, otherwise it ends.]
Bulk Iteration – Sample Dataset of 5 elements
Initial Dataset       Final Dataset
<46,46>               <46, 19999>
<32,32>               <32, 19985>
<48,48>               <48, 20001>
<39,39>               <39, 19992>
<73,73>               <73, 20026>
Initial Total = 238   Final Total = 100,003
• DataSet<Tuple2<Long,Long>> is used as input, where the first element is the key and the second element is incremented in each iteration
• The iteration stops once the sum of all second elements of the Tuple2 exceeds 100,000
Bulk Iteration – Solution
• Solution Highlights
• Counters (Accumulators) cannot be used to determine when to stop; Accumulators are guaranteed to be accurate only at the end of the program
• Aggregators are used at the end of each iteration to verify the terminating condition
• Source Code -
https://github.com/sameeraxiomine/FlinkMeetup/blob/master/src/main/java
/org/apache/flink/examples/AdderBulkIterations.java
Bulk Iteration – Implementation
[Diagram: (1) Input: <46,46> <32,32> <48,48> <39,39> <73,73>. (2) Step function: parallel map tasks add 1 to every element; the loop runs for at most 100,000 iterations. (3) At the end of each iteration the terminating condition is checked (a synchronization point); if it is not met, the result is fed back into the next iteration. (4) Output: <46,19999> <32,19985> <48,20001> <39,19992> <73,20026>. The job terminates after 19,953 iterations.]
Bulk Iteration – Source Code
public static void main(String[] args) throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//First create an initial dataset
IterativeDataSet<Tuple2<Long, Long>> initial = getData(env)
.iterate(MAX_ITERATIONS);
//Register Aggregator and Convergence Criterion Class
initial.registerAggregationConvergenceCriterion("total", new LongSumAggregator(),
new VerifyIfMaxConvergence());
//Iterate
DataSet<Tuple2<Long, Long>> iteration = initial.map(
new RichMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
private LongSumAggregator agg = null;
@Override
public void open(Configuration parameters) {
this.agg = this.getIterationRuntimeContext().getIterationAggregator("total");
}
@Override
public Tuple2<Long, Long> map(Tuple2<Long, Long> input) throws Exception {
long incrementF1 = input.f1 + 1;
Tuple2<Long, Long> out = new Tuple2<>(input.f0, incrementF1);
this.agg.aggregate(out.f1);
return out;
}
});
DataSet<Tuple2<Long, Long>> finalDs = initial.closeWith(iteration); //Close Iteration
finalDs.print(); //Consume output
}
public static class VerifyIfMaxConvergence implements ConvergenceCriterion<LongValue>{
@Override
public boolean isConverged(int iteration, LongValue value) {
return (value.getValue()>AdderBulkIterations.ABSOLUTE_MAX);
}
}
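The getData(env) helper and the two constants referenced above are not shown on the slide; a minimal sketch, reusing the sample values from the earlier slide (the repository version presumably generates random numbers between 0 and 99):

private static final int MAX_ITERATIONS = 100000;
public static final long ABSOLUTE_MAX = 100000;

//Create the initial DataSet<Tuple2<Long, Long>> where key == value
public static DataSet<Tuple2<Long, Long>> getData(ExecutionEnvironment env) {
  return env.fromElements(
      new Tuple2<>(46L, 46L), new Tuple2<>(32L, 32L), new Tuple2<>(48L, 48L),
      new Tuple2<>(39L, 39L), new Tuple2<>(73L, 73L));
}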
Bulk Iteration – Steps
• Create the initial DataSet (IterativeDataSet) and define the maximum number of iterations:
IterativeDataSet<Tuple2<Long, Long>> initial = getData(
  env).iterate(MAX_ITERATIONS);

• Register the convergence criterion:
initial.registerAggregationConvergenceCriterion("total",
  new LongSumAggregator(), new VerifyIfMaxConvergence());
class VerifyIfMaxConvergence implements ConvergenceCriterion{
  public boolean isConverged(int iteration, LongValue value) {
    return (value.getValue()>AdderBulkIterations.ABSOLUTE_MAX);
  }
}

• Execute the iterations; the step function updates the aggregator, and convergence is checked at the end of each iteration:
DataSet<Tuple2<Long, Long>> iteration = initial.map(new
  RichMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
  //map() increments each value and updates the aggregator:
  //return new Tuple2<>(input.f0, input.f1+1);
});

• End the iteration by calling closeWith(DataSet) on the IterativeDataSet:
DataSet<Tuple2<Long, Long>> finalDs =
  initial.closeWith(iteration);
finalDs.print();//Consume results
Bulk Iteration – The Wrong Way
DataSet<Tuple2<Long, Long>> input = getData(env);
DataSet<Tuple2<Long, Long>> output = input;
for(int i=0;i<MAX_ITERATIONS;i++){
output = input.map(new MapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
public Tuple2<Long, Long> map(Tuple2<Long, Long> input) {
return new Tuple2<>(input.f0, input.f1+1);
}
});
//This is what slows down iteration. Job starts immediately here
long sum = output.map(new FixTuple2()).reduce(new ReduceFunc())
.collect().get(0);
input = output;//Prepare for next iteration
System.out.println("Current Sum="+sum);
if(sum>100){
System.out.println("Breaking now:"+i);
break;
}
}
output.print();
• Flink cannot optimize because job executes immediately on
long sum = output.map(new FixTuple2()).reduce(new
ReduceFunc()).collect().get(0);
• https://github.com/sameeraxiomine/FlinkMeetup/blob/master/src/main/java/org/a
pache/flink/examples/AdderBulkIterationsWrongWay.java
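FixTuple2 and ReduceFunc are not shown on the slide; a minimal sketch of what they presumably do (extract the second tuple field and sum the values so that collect().get(0) yields the total):

//Extract the second field of each tuple so the values can be summed
public static class FixTuple2 implements MapFunction<Tuple2<Long, Long>, Long> {
  @Override
  public Long map(Tuple2<Long, Long> value) {
    return value.f1;
  }
}
//Sum the extracted values
public static class ReduceFunc implements ReduceFunction<Long> {
  @Override
  public Long reduce(Long a, Long b) {
    return a + b;
  }
}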
Delta Iteration – Example
[Diagram: a graph of the events – events 1, 2, 3, 4, 5, 11, 12 and 15 form a sub-graph rooted at event 1; events 6 and 7 form a sub-graph rooted at event 6; events 8, 9, 10 and 13 form a sub-graph rooted at event 8.]
Given the following events and their relationships, propagate the root id of each event to its children.
Source Code -
https://github.com/sameeraxiomine/FlinkMeetup/blob/master/src/main/java/org/a
pache/flink/examples/DeltaIterationExample.java
Delta Iteration – Initial and Final Dataset
• Each event is represented as a
Tuple2<Integer,Integer>
instance
• Tuple2.f0 is the EventId
• Tuple2.f1 is the ParentId
Vertex Edge
<1,1> <1,2>
<2,2> <2,3>
<3,3> <2,4>
<4,4> <3,5>
<11,11> <6,7>
<12,12> <8,9>
<15,15> <8,10>
<6,6> <5,6>
<7,7> <7,6>
<8,8> <8,8>
<9,9> <9,8>
<10,10> <10,8>
<13,13> <13,8>
Delta Iteration – Implementation
[Diagram: the initial Workset and the initial SolutionSet feed the step function (2); the step function produces the next Workset (3), which is fed back, and a delta that is merged into the Solution Set (4); the loop checks for convergence or an empty workset, and the final SolutionSet is the result of the Delta Iteration.]
• The initial Workset and SolutionSet are identical
• Each iteration updates the SolutionSet and reduces the size of the Workset
• The iteration terminates when
• the maximum number of iterations is reached, or
• the Workset fed back (3 in the diagram above) is empty
• The SolutionSet at termination is the result of the Iteration job
Delta Iteration – Working Set at end of Iteration 1
[Diagram: the graph with each vertex labeled (id, parent) at the end of iteration 1 – (1,1) (2,1) (3,2) (4,2) (5,3) (6,6) (7,7) (11,5) (12,11) (8,8) (9,8) (10,8) (13,10) (15,1); shaded vertices have dropped off the working set.]
Delta Iteration – Working Set at end of Iteration 2
[Diagram: the graph with each vertex labeled (id, parent) at the end of iteration 2 – (1,1) (2,1) (3,1) (4,1) (5,2) (6,6) (7,6) (11,3) (12,5) (8,8) (9,8) (10,8) (13,8) (15,1); shaded vertices have dropped off the working set.]
Delta Iteration – Working Set at end of Iteration 3
[Diagram: the graph with each vertex labeled (id, parent) at the end of iteration 3 – (1,1) (2,1) (3,1) (4,1) (5,1) (6,6) (7,6) (11,2) (12,3) (8,8) (9,8) (10,8) (13,8) (15,1); shaded vertices have dropped off the working set.]
Delta Iteration – Working Set at end of Iteration 4
[Diagram: the graph with each vertex labeled (id, parent) at the end of iteration 4 – (1,1) (2,1) (3,1) (4,1) (5,1) (6,6) (7,6) (11,1) (12,2) (8,8) (9,8) (10,8) (13,8) (15,1); shaded vertices have dropped off the working set.]
Delta Iteration – Working Set at end of Iteration 5
[Diagram: the graph with each vertex labeled (id, parent) at the end of iteration 5 – (1,1) (2,1) (3,1) (4,1) (5,1) (6,6) (7,6) (11,1) (12,1) (8,8) (9,8) (10,8) (13,8) (15,1); shaded vertices have dropped off the working set.]
Delta Iteration – Working Set at end of Iteration 6
[Diagram: the graph with each vertex labeled (id, parent) at the end of iteration 6 – (1,1) (2,1) (3,1) (4,1) (5,1) (6,6) (7,6) (11,1) (12,1) (8,8) (9,8) (10,8) (13,8) (15,1); shaded vertices have dropped off the working set. The Working Set is empty!]
Delta Iteration – At Scale
• Imagine a dataset of over 10 billion transactions and sub-graphs of average size 10
• That is a total of 1 billion sub-graphs
• Each iteration drops off roughly 1 billion vertices
• The job is over in about 10 iterations
• It can be optimized further with a Tuple3 (maintain a flag indicating whether the root vertex id has already been propagated to a vertex); this saves another iteration at the expense of increased storage requirements
• The first iteration drops 10% of the working set. By the end of the 5th iteration the working set shrinks by 20% relative to the working set at the beginning of that iteration. Iterations get progressively faster
Delta Iteration – Source Code
private static final int MAX_ITERATIONS = 10;
public static void main(String... args) throws Exception {
// set up execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// read vertex and edge data
// initially assign parent vertex id== my vertex id
DataSet<Tuple2<Long, Long>> vertices = GraphData.getDefaultVertexDataSet(env);
DataSet<Tuple2<Long, Long>> edges = GraphData.getDefaultEdgeDataSet(env);
int vertexIdIndex = 0;
// open a delta iteration
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
vertices.iterateDelta(vertices , MAX_ITERATIONS, vertexIdIndex);
// apply the step logic: join with the edges,
// update if the component of the candidate is smaller
DataSet<Tuple2<Long, Long>> changes = iteration.getWorkset()
.join(edges).where(0).equalTo(0)
/* Update the parentVertex=parent.id */
.with(new NeighborWithComponentIDJoin())
/* Merge with solution set */
.join(iteration.getSolutionSet())
.where(0).equalTo(0)
/* Only pass on the changes to next iteration */
.with(new ComponentIdFilter());
// close the delta iteration (delta and new workset are identical)
DataSet<Tuple2<Long, Long>> result = iteration.closeWith(changes, changes);
result.print();
}
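GraphData.getDefaultVertexDataSet and GraphData.getDefaultEdgeDataSet are not shown; a minimal sketch that reproduces the vertex and edge values listed on the next slide (the class in the repository may build them differently):

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class GraphData {
  //Each vertex starts with parent id == its own vertex id
  public static DataSet<Tuple2<Long, Long>> getDefaultVertexDataSet(ExecutionEnvironment env) {
    List<Tuple2<Long, Long>> vertices = new ArrayList<>();
    for (long id : new long[] {1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 15}) {
      vertices.add(new Tuple2<>(id, id));
    }
    return env.fromCollection(vertices);
  }
  //Edges as <parentId, receivingId>
  public static DataSet<Tuple2<Long, Long>> getDefaultEdgeDataSet(ExecutionEnvironment env) {
    return env.fromElements(
        new Tuple2<>(1L, 2L), new Tuple2<>(2L, 3L), new Tuple2<>(2L, 4L),
        new Tuple2<>(3L, 5L), new Tuple2<>(6L, 7L), new Tuple2<>(8L, 9L),
        new Tuple2<>(8L, 10L), new Tuple2<>(5L, 11L), new Tuple2<>(11L, 12L),
        new Tuple2<>(10L, 13L), new Tuple2<>(9L, 14L), new Tuple2<>(1L, 15L));
  }
}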
Delta Iteration – Read Vertices and Edges
// read vertex and edge data
// initially assign parent vertex id== my vertex id
DataSet<Tuple2<Long, Long>> vertices = GraphData.getDefaultVertexDataSet(env);
DataSet<Tuple2<Long, Long>> edges = GraphData.getDefaultEdgeDataSet(env);
Vertex Edge
<1,1> <1,2>
<2,2> <2,3>
<3,3> <2,4>
<4,4> <3,5>
<11,11> <6,7>
<12,12> <8,9>
<15,15> <8,10>
<6,6> <5,11>
<7,7> <11,12>
<8,8> <10,13>
<9,9> <9,14>
<10,10> <1,15>
<13,13>
Vertex – Tuple2<Long,Long>
f0 – Vertex Id
f1 – Root Id
Edge – Tuple2<Long,Long>
f0 – Parent Id
f1 – Receiving Id
Delta Iteration – Initiate Delta Iteration
int vertexIdIndex = 0; //Why does this need to be passed to the iterateDelta function
// open a delta iteration
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
vertices.iterateDelta(vertices , MAX_ITERATIONS, vertexIdIndex );
• After each iteration, during the merge step, only the delta solution set is shuffled
• The elements of the delta solution set end up on the same nodes as the initial solution set and are merged there (always join on the keys)
• This is considerably cheaper than shuffling both the delta solution set and the full solution set
• As the delta solution set shrinks, this optimization yields increasingly higher performance benefits in subsequent iteration steps
[Diagram: at the start, the initial solution set is partitioned by the key indices and each partition is cached; after that the solution set is never shuffled again.]
Delta Iteration – Step Clause (Step 1)
DataSet<Tuple2<Long, Long>> changes = iteration.getWorkset().join(edges).where(0).equalTo(0)
/* Update the parentVertex*/
.with(new NeighborWithComponentIDJoin())
public static final class NeighborWithComponentIDJoin implements
JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>> {
@Override
public Tuple2<Long, Long> join(Tuple2<Long, Long> vertexWithComponent,
Tuple2<Long, Long> edge) {
return new Tuple2<Long, Long>(edge.f1, vertexWithComponent.f1);
}
}
[Diagram: vertex (1,1), i.e. v.f0=1, v.f1=1, is joined with edge (1,2), i.e. e.f0=1, e.f1=2, on v.f0 = e.f0; applying with(NeighborWithComponentIDJoin) emits the new vertex (2,1). This shows how event 2 gets a new parent id.]
Delta Iteration – Merge With Solution Set
changes = ...with(new NeighborWithComponentIDJoin())
.join(iteration.getSolutionSet()).where(0).equalTo(0)
.with(new ComponentIdFilter());
//Close with DeltaSolutionSet and NewWorkingSet. Both are equal to changes variable
DataSet<Tuple2<Long, Long>> result = iteration.closeWith(changes, changes);
public static final class ComponentIdFilter implements FlatJoinFunction {
public void join(..) {
if (candidate.f1 < old.f1) {
out.collect(candidate);
}
}
}
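The join(..) signature above is elided; a sketch of what the full FlatJoinFunction presumably looks like (the parameter names are assumptions):

public static final class ComponentIdFilter implements
    FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>> {
  @Override
  public void join(Tuple2<Long, Long> candidate, Tuple2<Long, Long> old,
      Collector<Tuple2<Long, Long>> out) {
    //Emit the candidate only if it carries a smaller parent id than the current solution set entry
    if (candidate.f1 < old.f1) {
      out.collect(candidate);
    }
  }
}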
[Diagram: the step-function result (2,1) is joined with the initial solution set entries (1,1) and (2,2) on v.f0 = s.f0; applying with(ComponentIdFilter) keeps (2,1) because its parent id is smaller than that of the existing entry (2,2), so (2,1) becomes both the delta solution set and the new work set; the Flink runtime then merges the delta solution set with the solution set on the key indices, giving the updated solution set (1,1), (2,1).]
The above shows how the parent ids of events 1 and 2 transition by the end of iteration 1. Event id 1 does not make it past the step function. Remember – only the delta solution set is shuffled.
Delta Iteration – End of Iteration 1
Initial Dataset / Initial Solution Set | Delta Solution Set / New Workset | Merge Solution Set
<1,1>    |         | <1,1>
<2,2>    | <2,1>   | <2,1>
<3,3>    | <3,2>   | <3,2>
<4,4>    | <4,2>   | <4,2>
<11,11>  | <11,5>  | <11,5>
<12,12>  | <12,11> | <12,11>
<15,15>  | <15,1>  | <15,1>
<6,6>    |         | <6,6>
<7,7>    | <7,6>   | <7,6>
<8,8>    |         | <8,8>
<9,9>    | <9,8>   | <9,8>
<10,10>  | <10,8>  | <10,8>
<13,13>  | <13,10> | <13,10>
Delta Iteration – End of Iteration 2
Working Set | Delta Solution Set / New Workset | Merge Solution Set
         |         | <1,1>
<2,1>    |         | <2,1>
<3,2>    | <3,1>   | <3,1>
<4,2>    | <4,1>   | <4,1>
<11,5>   | <11,3>  | <11,3>
<12,11>  | <12,5>  | <12,5>
<15,1>   |         | <15,1>
         |         | <6,6>
<7,6>    |         | <7,6>
         |         | <8,8>
<9,8>    |         | <9,8>
<10,8>   |         | <10,8>
<13,10>  | <13,8>  | <13,8>
Delta Iteration – End of Iteration 3
Working Set | Delta Solution Set / New Workset | Merge Solution Set
         |         | <1,1>
         |         | <2,1>
<3,1>    |         | <3,1>
<4,1>    |         | <4,1>
<11,3>   | <11,2>  | <11,2>
<12,5>   | <12,3>  | <12,3>
         |         | <15,1>
         |         | <6,6>
         |         | <7,6>
         |         | <8,8>
         |         | <9,8>
         |         | <10,8>
<13,8>   |         | <13,8>
Delta Iteration – End of Iteration 4
Working Set | Delta Solution Set / New Workset | Merge Solution Set
         |         | <1,1>
         |         | <2,1>
         |         | <3,1>
         |         | <4,1>
<11,2>   | <11,1>  | <11,1>
<12,3>   | <12,2>  | <12,2>
         |         | <15,1>
         |         | <6,6>
         |         | <7,6>
         |         | <8,8>
         |         | <9,8>
         |         | <10,8>
         |         | <13,8>
Delta Iteration – End of Iteration 4
Working Set | Delta Solution Set / New Workset | Merge Solution Set
         |         | <1,1>
         |         | <2,1>
         |         | <3,1>
         |         | <4,1>
<11,1>   |         | <11,1>
<12,2>   | <12,1>  | <12,1>
         |         | <15,1>
         |         | <6,6>
         |         | <7,6>
         |         | <8,8>
         |         | <9,8>
         |         | <10,8>
         |         | <13,8>
Delta Iteration – End of Iteration 5
Working Set | Delta Solution Set / New Workset | Merge Solution Set
         |         | <1,1>
         |         | <2,1>
         |         | <3,1>
         |         | <4,1>
         |         | <11,1>
<12,1>   | <12,1>  | <12,1>
         |         | <15,1>
         |         | <6,6>
         |         | <7,6>
         |         | <8,8>
         |         | <9,8>
         |         | <10,8>
         |         | <13,8>
Delta Iteration – End of Iteration 6
Working Set | Delta Solution Set / New Workset | Merge Solution Set
         |         | <1,1>
         |         | <2,1>
         |         | <3,1>
         |         | <4,1>
         |         | <11,1>
         |         | <12,1>
         |         | <15,1>
         |         | <6,6>
         |         | <7,6>
         |         | <8,8>
         |         | <9,8>
         |         | <10,8>
         |         | <13,8>