1. Map & Reduce Christopher Schleiden, Christian Corsten, Michael Lottko, Jinhui Li The slides are licensed under a Creative Commons Attribution 3.0 License
2. Outline Motivation Concept Parallel Map & Reduce Google’s MapReduce Example: Word Count Demo: Hadoop Summary Web Technologies
3. Today the web is all about data! Google: processing of 20 PB/day (2008). LHC: will generate about 15 PB/year. Facebook: 2.5 PB of data + 15 TB/day (4/2009). BUT: it takes ~2.5 hours to read one terabyte off a typical hard disk!
4. Solution: Going Parallel! However, parallel programming is hard: data distribution, synchronization, load balancing, …
5. Map & Reduce: a programming model and framework designed for processing large volumes of data in parallel. Based on the functional map and reduce concept, i.e., the output of functions depends only on their input; there are no side effects.
6. Functional Concept Map: apply a function to each value of a sequence: map(k, v) → <k’, v’>*. Reduce/Fold: combine all elements of a sequence using a binary operator: reduce(k’, <v’>*) → <k’, v’>*
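The two primitives can be illustrated with Python's built-ins (a minimal sketch; the MapReduce framework applies the same idea to key/value pairs distributed across many machines):

```python
from functools import reduce

# map: apply a function to every element of a sequence
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce/fold: combine all elements with a binary operator
# (0 is the initial accumulator value)
total = reduce(lambda acc, x: acc + x, squares, 0)   # 30

print(squares, total)
```

Because each function's output depends only on its input, the map calls can run in any order, or in parallel, without changing the result.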
7. Typical problem: iterate over a large number of records; extract something interesting (Map); shuffle & sort the intermediate results; aggregate the intermediate results (Reduce); write the final output.
9. Parallel Map & Reduce Published (2004) and patented (2010) by Google Inc. C++ runtime with bindings to Java/Python. Other implementations: Apache Hadoop/Hive project (Java), developed at Yahoo!, used by Facebook, Hulu, IBM, and many more; Microsoft COSMOS (Scope, based on SQL and C#); Starfish (Ruby); …
10. Parallel Map & Reduce /2 Parallel execution of the Map and Reduce stages; scheduling through the Master/Worker pattern. The runtime handles: assigning workers to map and reduce tasks, data distribution, and detecting crashed workers.
11. Parallel Map & Reduce Execution [Diagram: input DATA → Map → Shuffle & Sort → Reduce → output RESULT]
13. Google Filesystem (GFS) Stores input data, intermediate results, and final results in 64 MB chunks on at least three different machines. [Diagram: file chunks distributed across nodes]
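The chunking and replication idea can be sketched as follows. This is a simplified illustration: the round-robin placement policy and the node names are assumptions, whereas real GFS placement also considers rack topology and disk usage:

```python
CHUNK_SIZE = 64 * 2**20          # 64 MB, as in GFS
REPLICAS = 3                     # each chunk on at least three machines

def place_chunks(file_size, machines):
    """Split a file into fixed-size chunks and assign each chunk to
    REPLICAS machines (simple round-robin; real GFS is smarter)."""
    n_chunks = -(-file_size // CHUNK_SIZE)     # ceiling division
    placement = {}
    for chunk in range(n_chunks):
        placement[chunk] = [machines[(chunk + r) % len(machines)]
                            for r in range(REPLICAS)]
    return placement

# A 200 MB file becomes 4 chunks, each replicated on 3 of 4 nodes
print(place_chunks(200 * 2**20, ["node-a", "node-b", "node-c", "node-d"]))
```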
14. Scheduling (Master/Worker) One master, many workers. Input data is split into M map tasks (~64 MB in size, matching GFS chunks); the reduce phase is partitioned into R tasks. Tasks are assigned to workers dynamically: the master assigns each map task and each reduce task to a free worker. Fault handling via redundancy: the master checks whether a worker is still alive via heart-beat and reschedules its work item if the worker has died.
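The dynamic task assignment can be sketched with threads pulling from a shared queue (a minimal single-machine illustration; heart-beats and rescheduling of failed workers are omitted for brevity):

```python
import queue
import threading

def master_worker(tasks, n_workers, work_fn):
    """Minimal master/worker sketch: the master fills a task queue,
    and each free worker pulls the next task until the queue is empty.
    Results are collected under a lock."""
    task_q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = task_q.get_nowait()   # free worker grabs next task
            except queue.Empty:
                return                       # no tasks left: worker exits
            out = work_fn(task)
            with lock:
                results.append(out)

    for t in tasks:                          # master enqueues all tasks
        task_q.put(t)
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# e.g. run 8 "map tasks" on 3 workers
print(sorted(master_worker(range(8), 3, lambda x: x * x)))
```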
15. Scheduling Example [Diagram: the master assigns map and reduce tasks to workers; input DATA flows through the map workers into temporary files, then through the reduce workers to the output RESULT]
16. Google’s MapReduce vs. Hadoop Google MapReduce: main language C++; Google Filesystem (GFS); GFS master; GFS chunkserver. Hadoop MapReduce: main language Java; Hadoop Filesystem (HDFS); Hadoop namenode; Hadoop datanode.
18. Word Count – Input Set of text files: foo.txt: “Sweet, this is the foo file”; bar.txt: “This is the bar file”. Expected output: sweet (1), this (2), is (2), the (2), foo (1), bar (1), file (1)
19. Word Count – Map mapper(filename, file-contents): for each word: emit(word, 1) Output: this (1) is (1) the (1) sweet (1) this (1) the (1) is (1) foo (1) bar (1) file (1)
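A runnable version of the mapper pseudo-code above might look like this (punctuation and case handling are simplifications added here):

```python
def mapper(filename, contents):
    """Emit (word, 1) for every word in the file contents."""
    for word in contents.lower().split():
        yield (word.strip(",.!?"), 1)      # strip trailing punctuation

pairs = list(mapper("foo.txt", "Sweet, this is the foo file"))
print(pairs)
# [('sweet', 1), ('this', 1), ('is', 1), ('the', 1), ('foo', 1), ('file', 1)]
```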
20. Word Count – Shuffle & Sort Before: this (1) is (1) the (1) sweet (1) this (1) the (1) is (1) foo (1) bar (1) file (1) After: this (1) this (1) is (1) is (1) the (1) the (1) sweet (1) foo (1) bar (1) file (1)
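The shuffle step groups all intermediate pairs with the same key, so each reducer sees one key with all of its values. A sketch of this grouping:

```python
from collections import defaultdict

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))    # sorted by key

pairs = [("this", 1), ("is", 1), ("the", 1), ("sweet", 1),
         ("this", 1), ("the", 1), ("is", 1),
         ("foo", 1), ("bar", 1), ("file", 1)]
print(shuffle(pairs))
# {'bar': [1], 'file': [1], 'foo': [1], 'is': [1, 1], ...}
```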
21. Word Count – Reduce reducer(word, values): sum = 0 for each value in values: sum = sum + value emit(word, sum) Output: sweet (1) this (2) is (2) the (2) foo (1) bar (1) file (1)
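A runnable version of the reducer pseudo-code above, applied to the grouped pairs from the shuffle step, reproduces the expected output:

```python
def reducer(word, values):
    """Sum the counts emitted for one word (mirrors the slide's loop)."""
    return (word, sum(values))

# Grouped intermediate results, as produced by shuffle & sort
groups = {"sweet": [1], "this": [1, 1], "is": [1, 1], "the": [1, 1],
          "foo": [1], "bar": [1], "file": [1]}

counts = dict(reducer(w, vs) for w, vs in groups.items())
print(counts)
# {'sweet': 1, 'this': 2, 'is': 2, 'the': 2, 'foo': 1, 'bar': 1, 'file': 1}
```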
23. Summary Lots of data processed on the web (e.g., Google) Performance solution: Go parallel Input, Map, Shuffle & Sort, Reduce, Output Google File System Scheduling: Master/Worker Word Count example Hadoop Questions?
24. References Inspirations for this presentation: http://www4.informatik.uni-erlangen.de/Lehre/WS10/V_MW/Uebung/folien/05-Map-Reduce-Framework.pdf http://www.scribd.com/doc/23844299/Map-Reduce-Hadoop-Pig RWTH Map Reduce talk: http://bit.ly/f5oM7p Papers: Dean et al., MapReduce: Simplified Data Processing on Large Clusters, OSDI ’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. Ghemawat et al., The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
Editor’s notes
These days the web is all about data. All major websites rely on huge amounts of data in some form in order to provide services to users, for example Google and Facebook. Facilities like the LHC will also produce data measured in petabytes each year. However, it takes about 2.5 hours to read one terabyte off a typical hard drive. The solution that immediately comes to mind, of course, is going parallel. Concrete example [TODO], [context: cloud computing]
Parallel programming is still hard. Programmers have to deal with a lot of boilerplate code and have to manually write code for things like scheduling and load balancing. People also want to use the company cluster in parallel, so something like a batch system is needed. As more and more companies work with huge amounts of data, a kind of standard framework has emerged in recent years: the Map/Reduce framework.
Map and reduce have been known for years as a functional programming concept.