
Distributed Processing Frameworks

  1. Distributed Processing Frameworks
     Author: Antonios Katsarakis
  2. Literature
     • MapReduce: Simplified Data Processing on Large Clusters. Jeff Dean et al., OSDI '04.
     • Spark: Cluster Computing with Working Sets. M. Zaharia et al., HotCloud '10.
  3. Why Big Data?
     • More data to process: IoT, smart devices, web applications
       - About 2.3 trillion GB of new data are generated every day
     • The growth of CPU performance cannot keep up with the increasing amount of data to process
     • This leads us to the Big Data era
       - Big data: data sets so large that the processing power of a single machine is inadequate to deal with them
     • We need to find ways to process these massive amounts of data
  4. MapReduce
     • Proposed by Jeff Dean et al. (Google), 2004
       - Cited more than 18k times
     • A programming model that enables the parallel and distributed processing of large data sets
     • Typical MapReduce program:
       - Read data
       - Map: filtering of the data
       - Shuffle and sort
       - Reduce: summary operation on the data
       - Write the results
     [Diagram: input data is split across Map tasks; intermediate data is shuffled to Reduce tasks, which write the output data.]
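The map/shuffle/reduce flow above can be sketched in plain Python with the classic word-count example. This is a single-machine simulation for illustration only: the shuffle is a dictionary grouping, not the framework's distributed sort, and the function names are invented here rather than part of any MapReduce API.

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit a (word, 1) pair for every word in one input split
    return [(word, 1) for word in split.split()]

def shuffle_phase(pairs):
    # Shuffle and sort: group all intermediate values by key,
    # as the framework would do between the Map and Reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: a summary operation (here, a sum) over each key's values
    return {key: sum(values) for key, values in groups.items()}

# Input data split into chunks, each handled by its own Map task
splits = ["big data big clusters", "big clusters"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)  # {'big': 3, 'data': 1, 'clusters': 2}
```

Because each `map_phase` call touches only its own split and each key is reduced independently, both phases can run in parallel across machines; only the shuffle requires moving data between them, which is exactly the overhead the slides point out.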
  5. Critical Reflection
     • Outcome:
       - Novel idea that led to a whole new era of distributed systems
       - Big impact on industry (Hadoop MapReduce)
       - Lowered the cost of computation
     • Limitations:
       - Restricted to batch processing
       - It only supports map and reduce operations
       - The shuffling phase introduces overheads
  6. Spark
     • Proposed by Matei Zaharia et al., 2010
       - Cited 1.5k times
     • Another programming model, based on higher-order functions that execute user-defined functions in parallel
     • Aims to replace MapReduce in industry
     • Main ideas:
       - Represent the computations as DAGs
       - Cache datasets in memory
  7. Spark Model
     • Resilient Distributed Datasets (RDDs): immutable collections of objects spread across a cluster
     • Operations over RDDs:
       1. Transformations: lazy operators that create new RDDs
       2. Actions: launch a computation on an RDD
     [Diagram: a job such as readFile(…).map(…).filter(…).reduceByKey(…).count() is split into a graph of RDDs (RDD0–RDD4), grouped into pipelined stages that produce the result.]
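The transformation/action split above can be illustrated with a toy, single-machine sketch. The `MiniRDD` class below is invented for this illustration (real Spark RDDs are distributed, partitioned, and fault-tolerant): transformations merely record a step in the pipeline, and nothing executes until an action such as `count()` is called.

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are recorded lazily;
    nothing runs until an action is invoked."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # deferred pipeline of transformations

    # Transformations: return a new MiniRDD, compute nothing yet
    def map(self, f):
        return MiniRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):
        return MiniRDD(self._data, self._ops + [("filter", p)])

    # Action: run the whole recorded pipeline and return a result
    def count(self):
        items = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:  # filter
                items = [x for x in items if fn(x)]
        return len(items)

rdd = MiniRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)
print(rdd.count())  # 7
```

Laziness is what lets Spark see the whole chain before running it, so it can pipeline narrow transformations into a single stage and cache intermediate datasets in memory, the two main ideas from the previous slide.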
  8. Critical Reflection
     • Benefits:
       - High-level API
       - Supports more application types
       - Performance optimizations
     • Limitations:
       - Detailed performance analysis at the thread level is hard
       - Multipurpose application support makes performance improvements and tuning really challenging
       - The shuffling phase introduces overheads
  9. Conclusion
     • Clusters provide the computational power to process Big Data
     • MapReduce allows developers to build programs for clusters
     • Spark tries to overcome the limitations of MapReduce
     • These systems introduce many challenges in terms of measuring and improving their performance

Editor's notes

  • High-level API: in Scala, Java, Python; usable by non computer scientists
    Supports more application types: streaming, iterative and interactive
    Performance optimizations: memory caching, transformation pipelining, etc.
  • 3× (in terms of performance, application support and user friendliness)
