Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
1. FLINK - A CONVENIENT ABSTRACTION LAYER FOR YARN?
VYACHESLAV ZHOLUDEV
2. INTRODUCTION
• YARN opened Hadoop for many more developers
• API to integrate into a Hadoop cluster
• Flexibility
• Applications: MR, Tez, Flink, Spark, …
• Flink has made great use of this opportunity
• Flexible program execution graph
• Operators other than Map and Reduce
• Clean and convenient API
• Efficient with I/O
3. EXPECTATIONS FROM YARN
• New programming models in addition to MapReduce
• More alternatives to cover cases where the MapReduce paradigm is a poor fit
• Flexibility with expressing operations on data
• Elasticity of a cluster
• Ability to write your own applications to distribute computations across the cluster
4. DISTRIBUTING COMPUTATIONAL TASKS
• Writing your own YARN application
• Complicated
• Tedious
• Error-prone
• Somebody must have done something simpler
• Apache Twill
• Still not simple enough
• Execute CLI tools remotely (if everything else fails)
• Flink?
5. FLINK AT RESEARCHGATE
Lots of benefits:
• Made MapReduce jobs more readable
• More compact
• Less boilerplate code
• Easier to understand and maintain
• Got rid of ugly Hive queries and optimised runtime
• Better and cleaner orchestration of workflow subtasks (before we had to glue multiple MR jobs)
• Iterative machine learning algorithms
• Distributing computational tasks across a cluster
7. REAL USE CASE
A MongoDB to Avro Bridge (aka Avrongo)
Used to dump live DB data to HDFS for further batch processing and analytics
• In essence:
• Reads MongoDB documents
• Converts them to Avro records (based on a provided Avro schema)
• Persists them on HDFS
• Avrongo evolution
• Single-threaded program
• Multi-threaded program talking to different shards in parallel
• Distributed across the cluster
• Reasons for distributing:
• The import was CPU-bound
• HDFS load distribution
8. HOW DOES AVRONGO WORK?
Basic Version
• One thread
• Using one MongoDB cursor to iterate the whole collection
• Suitable for smaller collections
9. MONGODB SHARDS AND CHUNKS
Utilizing MongoDB chunks
• Controlling load on the MongoDB cluster
• Deterministic way of splitting collection for input
10. AVRONGO - SHARDED VERSION
• Collecting chunk information (sets of documents living on a particular shard)
• Processing chunks of each shard in a separate group of threads
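The sharded version above can be sketched roughly as follows. This is a minimal illustration, not the Avrongo code: the class name ShardedRunner, the chunk identifiers as plain strings, and the processChunk callback (standing in for "read the chunk from MongoDB, convert to Avro, write to HDFS") are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical sketch: one fixed-size thread group per shard. */
final class ShardedRunner {

    /** Processes every chunk and returns how many chunks were handled. */
    static int run(Map<String, List<String>> chunksByShard,
                   int threadsPerShard,
                   Consumer<String> processChunk) {
        AtomicInteger processed = new AtomicInteger();
        List<ExecutorService> pools = new ArrayList<>();
        for (List<String> chunks : chunksByShard.values()) {
            // Each shard gets its own pool, so load per shard stays bounded.
            ExecutorService pool = Executors.newFixedThreadPool(threadsPerShard);
            pools.add(pool);
            for (String chunk : chunks) {
                pool.submit(() -> {
                    processChunk.accept(chunk); // placeholder for the real work
                    processed.incrementAndGet();
                });
            }
        }
        try {
            for (ExecutorService pool : pools) {
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.HOURS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return processed.get();
    }
}
```

Capping the threads per shard is what keeps the load on each MongoDB shard under control, while separate pools let all shards be read in parallel.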
11. AVRONGO - FLINK VERSION
• Custom InputFormat that distributes MongoDB chunks uniformly
• FlatMap operator
• Number of task nodes = (number of shards) x (parallelism per shard)
• Custom Generic AvroOutputFormat
• Slower shards receive a bit more attention
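The task-count formula and the uniform spread of chunks can be illustrated in plain Java, without the Flink API. The sketch below is an assumption about how such a plan might look, not the actual custom InputFormat: ChunkPlanner, its method names, and the round-robin policy inside each shard's block of tasks are hypothetical, and it does not model the extra attention given to slower shards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: plans how MongoDB chunks could be spread over parallel tasks. */
final class ChunkPlanner {

    /** Number of tasks = (number of shards) x (parallelism per shard). */
    static int taskCount(int shardCount, int parallelismPerShard) {
        return shardCount * parallelismPerShard;
    }

    /**
     * Gives each shard its own block of parallelismPerShard tasks and
     * deals that shard's chunks round-robin within the block.
     */
    static List<List<String>> plan(Map<String, List<String>> chunksByShard,
                                   int parallelismPerShard) {
        int tasks = taskCount(chunksByShard.size(), parallelismPerShard);
        List<List<String>> assignment = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            assignment.add(new ArrayList<>());
        }
        int shardIndex = 0;
        for (List<String> chunks : chunksByShard.values()) {
            int base = shardIndex * parallelismPerShard;
            for (int i = 0; i < chunks.size(); i++) {
                assignment.get(base + i % parallelismPerShard).add(chunks.get(i));
            }
            shardIndex++;
        }
        return assignment;
    }

    public static void main(String[] args) {
        Map<String, List<String>> chunksByShard = new LinkedHashMap<>();
        chunksByShard.put("shardA", Arrays.asList("a1", "a2", "a3", "a4"));
        chunksByShard.put("shardB", Arrays.asList("b1", "b2"));
        // 2 shards x 2 tasks per shard = 4 tasks
        System.out.println(plan(chunksByShard, 2));
        // prints [[a1, a3], [a2, a4], [b1], [b2]]
    }
}
```

In a Flink job, a plan like this would live inside the input format's split creation, and the runtime would hand the splits to the parallel tasks.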
12. FLINK APPROACH
Outcome
• No longer CPU-bound
• Imports to HDFS are faster
• Some collections: from 6h to 2.5h or from 3.5h to 2h
Benefits
• Very few lines of code
• Same command line interface (no effort needed to migrate to the Flink-based version)
• Reusing the same converter as in standalone versions
• All orchestration and parallelisation work is done automatically by Flink
14. HADOOP DISTCP
• Generates a MapReduce job that copies large amounts of data
• List of files as an input to a Map Task
• Two types of Input Formats:
• UniformSizeInputFormat
• DynamicInputFormat
• gives more load to faster mappers
• complicated code
• utilizes FS to feed the mappers
https://hadoop.apache.org/docs/r1.2.1/distcp2.html
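The idea behind giving more load to faster mappers can be shown with a pull-based feed: each worker asks for the next file only when it has finished the previous one. The sketch below is a simplified stand-in, not DistCp code. It swaps the file-system-based feed for an in-memory queue, and the class DynamicFeed, the per-worker delays, and the sleep that stands in for a copy are all assumptions for illustration.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical sketch of pull-based (dynamic) work assignment. */
final class DynamicFeed {

    /**
     * Starts one worker per delay entry. Each worker repeatedly pulls the next
     * file from a shared queue. Returns how many files each worker handled.
     */
    static int[] run(List<String> files, long[] delayMillisPerWorker) {
        Queue<String> pending = new ConcurrentLinkedQueue<>(files);
        int[] handled = new int[delayMillisPerWorker.length];
        Thread[] workers = new Thread[delayMillisPerWorker.length];
        for (int w = 0; w < workers.length; w++) {
            final int id = w;
            workers[w] = new Thread(() -> {
                while (pending.poll() != null) {
                    try {
                        // Stands in for copying one file at this worker's speed.
                        Thread.sleep(delayMillisPerWorker[id]);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    handled[id]++; // only this worker writes its own slot
                }
            });
            workers[w].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return handled;
    }
}
```

With a uniform split, the slowest worker determines the total runtime. With a shared feed, a worker with a shorter delay simply returns to the queue more often and tends to end up with more files.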
15. FLINK DISTCP
• Implements the same logic as the DynamicInputFormat of Hadoop’s distcp
• Far fewer lines of code
• Same runtime as Hadoop distcp
• Available in Flink Java examples
• Not fault-tolerant (yet)
https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/distcp
17. CONCLUSIONS
• Flink - a thin layer for implementing your YARN application for parallelising independent tasks on the cluster
• Thanks to custom input formats that are easy to implement
• No boilerplate code
Would be nice to have:
• Elasticity
• Better progress tracking
• Fault tolerance
Custom input format + a Flink operator with business logic = Happiness