Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?

Flink Forward 2015

Published in: Technology

  2. INTRODUCTION
      • YARN opened Hadoop to many more developers
        • API to integrate into a Hadoop cluster
        • Flexibility
        • Applications: MR, Tez, Flink, Spark, …
      • Flink has made great use of the opportunity
        • Flexible program execution graph
        • Operators other than Map and Reduce
        • Clean and convenient API
        • Efficient with I/O
  3. EXPECTATIONS FROM YARN
      • New programming models in addition to MapReduce
      • More alternatives for cases where the MapReduce paradigm is not a good fit
      • Flexibility in expressing operations on data
      • Elasticity of a cluster
      • Ability to write our own applications that distribute computations across the cluster
  4. DISTRIBUTING COMPUTATIONAL TASKS
      • Writing your own YARN application
        • Complicated
        • Tedious
        • Error-prone
      • Somebody must have done something simpler
        • Apache Twill
        • Still not simple enough
      • Execute CLI tools remotely (if everything else fails)
      • Flink?
  5. FLINK AT RESEARCHGATE
      Lots of benefits:
      • Made MapReduce jobs more readable
        • More compact
        • Less boilerplate code
        • Easier to understand and maintain
      • Got rid of ugly Hive queries and optimised runtime
      • Better and cleaner orchestration of workflow subtasks (before, we had to glue multiple MR jobs together)
      • Iterative machine learning algorithms
      • Distributing computational tasks across a cluster
  7. REAL USE CASE
      A MongoDB to Avro Bridge (aka Avrongo)
      Used to dump live DB data to HDFS for further batch processing and analytics
      • In essence:
        • Reads MongoDB documents
        • Converts them to Avro records (based on a provided Avro schema)
        • Persists them on HDFS
      • Avrongo evolution
        • Single-threaded program
        • Multi-threaded program talking to different shards in parallel
        • Distributed across the cluster
      • Reasons for distributing:
        • We were CPU-bound
        • HDFS load distribution
  8. HOW AVRONGO WORKS
      Basic Version
      • One thread
      • Uses one MongoDB cursor to iterate over the whole collection
      • Suitable for smaller collections
  9. MONGODB SHARDS AND CHUNKS
      Utilizing MongoDB chunks
      • Controlling load on the MongoDB cluster
      • Deterministic way of splitting a collection for input
  10. AVRONGO - SHARDED VERSION
      • Collecting chunk information (sets of documents living on a particular shard)
      • Processing the chunks of each shard in a separate group of threads
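The sharded version above can be sketched in a few lines. This is a minimal illustration in plain Python and not Avrongo's actual code: the chunk tuples, the shard names, and the `process_chunk` stand-in (which only counts keys instead of reading and converting documents) are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical chunk metadata: (shard name, lower key bound, upper key bound).
CHUNKS = [
    ("shard-a", 0, 100),
    ("shard-a", 100, 200),
    ("shard-b", 200, 300),
    ("shard-b", 300, 400),
    ("shard-b", 400, 500),
]

def group_chunks_by_shard(chunks):
    """Collect the key ranges of the chunks living on each shard."""
    by_shard = defaultdict(list)
    for shard, lo, hi in chunks:
        by_shard[shard].append((lo, hi))
    return dict(by_shard)

def process_chunk(shard, key_range):
    """Stand-in for 'read the chunk's documents and convert them'."""
    lo, hi = key_range
    return hi - lo  # pretend there is exactly one document per key

def process_shard(shard, key_ranges, threads_per_shard):
    """Process one shard's chunks in that shard's own group of threads."""
    with ThreadPoolExecutor(max_workers=threads_per_shard) as pool:
        return sum(pool.map(lambda r: process_chunk(shard, r), key_ranges))

def run(chunks, threads_per_shard=2):
    """Read all shards in parallel and return documents processed per shard."""
    by_shard = group_chunks_by_shard(chunks)
    # One outer worker per shard, each driving its own inner thread group.
    with ThreadPoolExecutor(max_workers=len(by_shard)) as outer:
        futures = {
            shard: outer.submit(process_shard, shard, ranges, threads_per_shard)
            for shard, ranges in by_shard.items()
        }
        return {shard: future.result() for shard, future in futures.items()}
```

Giving every shard its own thread group caps the number of concurrent readers per shard at `threads_per_shard`, which is one way to keep the load on each MongoDB shard under control.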
  11. AVRONGO - FLINK VERSION
      • Custom InputFormat that distributes MongoDB chunks uniformly
      • FlatMap operator
      • Number of task nodes = (number of shards) × (parallelism per shard)
      • Custom generic AvroOutputFormat
      • Slower shards receive a bit more attention
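To make the task-count formula concrete, here is a small plain-Python sketch of one possible assignment scheme. It is not the Flink InputFormat from the talk: reserving a contiguous block of task indices per shard and handing out that shard's chunks round-robin are assumptions made for illustration, and the extra attention given to slower shards is not modelled.

```python
def assign_chunks(chunks_by_shard, parallelism_per_shard):
    """Map task index -> list of (shard, chunk) pairs.

    Total tasks = (number of shards) x (parallelism per shard).
    Assumption: each shard owns a contiguous block of task indices and
    its chunks are dealt out round-robin inside that block.
    """
    shards = sorted(chunks_by_shard)
    total_tasks = len(shards) * parallelism_per_shard
    assignment = {task: [] for task in range(total_tasks)}
    for shard_index, shard in enumerate(shards):
        first_task = shard_index * parallelism_per_shard
        for i, chunk in enumerate(chunks_by_shard[shard]):
            task = first_task + i % parallelism_per_shard
            assignment[task].append((shard, chunk))
    return assignment

# Hypothetical input: 2 shards, parallelism 2 per shard -> 4 tasks.
example = assign_chunks(
    {"shard-a": ["a0", "a1", "a2"], "shard-b": ["b0", "b1", "b2", "b3"]},
    parallelism_per_shard=2,
)
```

With this input, tasks 0 and 1 only ever talk to `shard-a` and tasks 2 and 3 only to `shard-b`, so no shard sees more than two concurrent readers.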
  12. FLINK APPROACH
      Outcome
      • No longer CPU-bound
      • Imports to HDFS are faster
        • Some collections: from 6h to 2.5h or from 3.5h to 2h
      Benefits
      • Very few lines of code
      • Same command line interface (no effort needed to migrate to the Flink-based version)
      • Reuses the same converter as the standalone versions
      • All orchestration and parallelisation work is done automatically by Flink
  14. HADOOP DISTCP
      • Generates a MapReduce job that copies large amounts of data
      • List of files as the input to a Map task
      • Two types of input formats:
        • UniformSizeInputFormat
        • DynamicInputFormat
          • gives more load to faster mappers
          • complicated code
          • utilizes the FS to feed the mappers
      https://hadoop.apache.org/docs/r1.2.1/distcp2.html
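The difference between the two input formats can be illustrated with a toy simulation. This is plain Python and not Hadoop's implementation (which, as noted above, feeds the mappers through the file system): the worker speeds, the equal file sizes, and the function names are assumptions chosen to show why dynamic assignment gives more load to faster mappers.

```python
import heapq

def uniform_split(num_files, speeds):
    """Static split: every worker gets the same number of files up front."""
    share, extra = divmod(num_files, len(speeds))
    counts = [share + (1 if i < extra else 0) for i in range(len(speeds))]
    # The job finishes when the slowest worker finishes its share.
    makespan = max(count / speed for count, speed in zip(counts, speeds))
    return counts, makespan

def dynamic_split(num_files, speeds):
    """Dynamic split: a worker fetches the next file whenever it is idle."""
    counts = [0] * len(speeds)
    # Heap of (time at which the worker becomes idle, worker index).
    idle = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(idle)
    makespan = 0.0
    for _ in range(num_files):
        now, worker = heapq.heappop(idle)
        done = now + 1.0 / speeds[worker]  # equal-sized files assumed
        counts[worker] += 1
        makespan = max(makespan, done)
        heapq.heappush(idle, (done, worker))
    return counts, makespan

# Two workers, the second twice as fast (speeds in files per time unit).
uniform = uniform_split(12, [1.0, 2.0])   # ([6, 6], 6.0)
dynamic = dynamic_split(12, [1.0, 2.0])   # ([4, 8], 4.0)
```

In this toy setup the static split leaves the fast worker idle for half of the run, while the dynamic split lets it copy twice as many files and shortens the overall job.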
  15. FLINK DISTCP
      • Implements the same logic as the DynamicInputFormat of Hadoop's distcp
      • Far fewer lines of code
      • Same runtime as Hadoop distcp
      • Available in the Flink Java examples
      • Not fault-tolerant (yet)
      https://github.com/apache/flink/tree/master/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/distcp
  17. CONCLUSIONS
      • Flink: a thin layer for implementing your YARN application for parallelising independent tasks on the cluster
        • Thanks to custom input formats that are easy to implement
        • No boilerplate code
      Would be nice to have:
      • Elasticity
      • Better progress tracking
      • Fault tolerance
      Custom input format + a Flink operator with business logic = Happiness
  18. QUESTIONS?
      https://www.researchgate.net/careers