2. Hadoop MapReduce Jobs
[Diagram: Input → Map → Reduce → Output, implemented by InputFormat, Mapper, Reducer, and OutputFormat]
• Jobs have a static structure.
• Input, Output, Map, Reduce run your custom (or library) code.
• If the application logic is too complex, you need more than one job.
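To make the static structure concrete, a minimal driver sketch using the classic mapred API; Tokenizer and Counter stand in for your Mapper and Reducer classes, and the paths are placeholders:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountJob.class);
    conf.setJobName("wordcount");
    // the four fixed slots: Input -> Map -> Reduce -> Output
    conf.setInputFormat(TextInputFormat.class);   // Input
    conf.setMapperClass(Tokenizer.class);         // Map: your code
    conf.setReducerClass(Counter.class);          // Reduce: your code
    conf.setOutputFormat(TextOutputFormat.class); // Output
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(conf, new Path("hdfs:///input"));
    FileOutputFormat.setOutputPath(conf, new Path("hdfs:///output"));
    JobClient.runJob(conf); // one job; more complex logic means chaining several jobs
  }
}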
3. Flink Programs
[Diagram: a DAG data flow with multiple Sources feeding Map, Reduce, Filter, Join, and CoGroup operators into a Sink]
• Flink programs are DAG data flows.
• Data Sources, Data Sinks, Map and Reduce operators are included.
• Everything that MapReduce provides, and much more (a superset).
• Much better performance
– Especially if more than one MapReduce job is executed.
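A minimal DataSet API sketch of such a DAG; file paths, field types, and key positions are made up for illustration:
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// two sources instead of one fixed input
DataSet<Tuple2<Integer, String>> left =
  env.readCsvFile("hdfs:///left").types(Integer.class, String.class);
DataSet<Tuple2<Integer, Double>> right =
  env.readCsvFile("hdfs:///right").types(Integer.class, Double.class);

// operators compose freely into a DAG: filter, then join on field 0
left.filter(new FilterFunction<Tuple2<Integer, String>>() {
    public boolean filter(Tuple2<Integer, String> t) { return t.f1.startsWith("a"); }
  })
  .join(right).where(0).equalTo(0)
  .writeAsText("hdfs:///result"); // sink

env.execute("DAG data flow");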
4. Run your Hadoop code with Flink?
• Hadoop data types (Writable) are natively supported.
• Hadoop Filesystems are natively supported.
• Flink features Input- & OutputFormats, Map, and Reduce functions, just like Hadoop MapReduce.
• Concepts are the same, but interfaces are not :-(
But Flink provides wrappers for Hadoop code :-)
• mapred.* API: In/OutputFormat, Mappers, & Reducers
• mapreduce.* API: In/OutputFormat (sketch after the example below)
6. Hadoop Compat WordCount
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// set up the Hadoop InputFormat
HadoopInputFormat<LongWritable, Text> hadoopInputFormat =
    new HadoopInputFormat<LongWritable, Text>(
        new TextInputFormat(), LongWritable.class, Text.class, new JobConf());
TextInputFormat.addInputPath(hadoopInputFormat.getJobConf(), new Path(inputPath));

// read data with the Hadoop InputFormat
DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopInputFormat);

DataSet<Tuple2<Text, LongWritable>> words =
    // apply the Hadoop Mapper (Tokenizer is your existing Mapper implementation)
    text.flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(new Tokenizer()))
        // apply the Hadoop Reducer (Counter is your existing Reducer implementation)
        .groupBy(0)
        .reduceGroup(new HadoopReduceFunction<Text, LongWritable, Text, LongWritable>(new Counter()));

// set up the Hadoop OutputFormat
HadoopOutputFormat<Text, LongWritable> hadoopOutputFormat =
    new HadoopOutputFormat<Text, LongWritable>(
        new TextOutputFormat<Text, LongWritable>(), new JobConf());
hadoopOutputFormat.getJobConf().set("mapred.textoutputformat.separator", " ");
TextOutputFormat.setOutputPath(hadoopOutputFormat.getJobConf(), new Path(outputPath));

// write data with the Hadoop OutputFormat
words.output(hadoopOutputFormat);

// execute the program
env.execute("Hadoop Compat WordCount");
Hadoop data types, Hadoop Input- & OutputFormats, and your Hadoop functions, all reused unchanged. Yes, it will…
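The example above uses the mapred.* wrappers. For the newer mapreduce.* API, only Input- & OutputFormats are wrapped; a minimal read sketch, assuming the wrapper package name of Flink's hadoop-compatibility module and a placeholder path:
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.mapreduce.HadoopInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

Job job = Job.getInstance();
HadoopInputFormat<LongWritable, Text> input = new HadoopInputFormat<LongWritable, Text>(
    new TextInputFormat(), LongWritable.class, Text.class, job);
FileInputFormat.addInputPath(job, new Path("hdfs:///input"));

// read with a mapreduce.* InputFormat; Mapper/Reducer wrappers exist only for mapred.*
DataSet<Tuple2<LongWritable, Text>> text = env.createInput(input);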
7. Use MapReduce like you always wanted
• Freely assemble your functions into a program (see the sketch below).
• Very efficient, pipelined execution.
– Program is executed on Flink (no Hadoop involved).
– No writing to/reading from HDFS within a program.
• Caveat: No support for custom Hadoop partitioners & sorters, yet :-(
[Diagram: a freely assembled flow mixing multiple Inputs, chained Map and Reduce operators, and multiple Outputs in one program]
9. Hadoop Job
Do not change a single line of code!
• Inject MapReduce jobs as a whole into Flink programs
– with support for custom partitioners, sorters, groupers.
• Run Hadoop MapReduce jobs on Flink
– without changing a single line of code (see the driver sketch below).
[Diagram: the same DAG data flow (Sources, Filter, Join, CoGroup, Sink) with a complete Hadoop Map/Reduce job embedded as one piece of the flow]
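For reference, the kind of unchanged driver this targets: a complete mapred job whose custom partitioner, sorter, and grouper (MyPartitioner and the comparators are hypothetical user classes) are preserved by whole-job injection, unlike the function-level wrappers above:
JobConf conf = new JobConf();
conf.setMapperClass(Tokenizer.class);
conf.setReducerClass(Counter.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(LongWritable.class);

// kept by whole-job injection, not expressible with the function wrappers:
conf.setPartitionerClass(MyPartitioner.class);                     // custom partitioner
conf.setOutputKeyComparatorClass(MySortComparator.class);          // custom sorter
conf.setOutputValueGroupingComparator(MyGroupingComparator.class); // custom grouper

JobClient.runJob(conf); // the very same code runs on Hadoop or inside a Flink program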