1. Map & Reduce Christopher Schleiden, Christian Corsten, Michael Lottko, Jinhui Li The slides are licensed under a Creative Commons Attribution 3.0 License
2. Outline Motivation Concept Parallel Map & Reduce Google’s MapReduce Example: Word Count Demo: Hadoop Summary Web Technologies
3. Today the web is all about data! Google: processing of 20 PB/day (2008). LHC: will generate about 15 PB/year. Facebook: 2.5 PB of data + 15 TB/day (4/2009). BUT: it takes ~2.5 hours to read one terabyte off a typical hard disk!
4. Solution: Going Parallel! However, parallel programming is hard: data distribution, synchronization, load balancing, …
5. Map & Reduce: a programming model and framework designed for processing large volumes of data in parallel. Based on the functional map and reduce concept, i.e., the output of functions depends only on their input; there are no side effects.
6. Functional Concept Map: apply a function to each value of a sequence: map(k, v) → <k’, v’>*. Reduce/Fold: combine all elements of a sequence using a binary operator: reduce(k’, <v’>*) → <k’, v’>*
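The two primitives can be illustrated with Python's built-ins (a minimal sketch; the MapReduce framework applies the same idea to key/value pairs distributed across many machines):

```python
from functools import reduce

# map: apply a function to every element of a sequence
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# reduce/fold: combine all elements with a binary operator
# (0 is the initial accumulator value)
total = reduce(lambda acc, x: acc + x, squares, 0)   # 30

print(squares, total)
```

Because each function's output depends only on its input, the map calls can run in any order, or in parallel, without changing the result.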
7. Typical problem: iterate over a large number of records; extract something interesting (Map); shuffle & sort the intermediate results; aggregate the intermediate results (Reduce); write the final output.
9. Parallel Map & Reduce Published (2004) and patented (2010) by Google Inc. C++ runtime with bindings to Java/Python. Other implementations: Apache Hadoop/Hive project (Java), developed at Yahoo!, used by Facebook, Hulu, IBM, and many more; Microsoft COSMOS (Scope, based on SQL and C#); Starfish (Ruby); …
10. Parallel Map & Reduce /2 Parallel execution of the Map and Reduce stages; scheduling through the Master/Worker pattern. The runtime handles: assigning workers to map and reduce tasks, data distribution, and detecting crashed workers.
11. Parallel Map & Reduce Execution [Diagram: input DATA → Map → Shuffle & Sort → Reduce → output RESULT]
13. Google Filesystem (GFS) Stores input data, intermediate results, and final results in 64 MB chunks on at least three different machines. [Diagram: file chunks distributed across nodes]
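The chunking and replication idea can be sketched as follows. This is a simplified illustration: the round-robin placement policy and the node names are assumptions, whereas real GFS placement also considers rack topology and disk usage:

```python
CHUNK_SIZE = 64 * 2**20          # 64 MB, as in GFS
REPLICAS = 3                     # each chunk on at least three machines

def place_chunks(file_size, machines):
    """Split a file into fixed-size chunks and assign each chunk to
    REPLICAS machines (simple round-robin; real GFS is smarter)."""
    n_chunks = -(-file_size // CHUNK_SIZE)     # ceiling division
    placement = {}
    for chunk in range(n_chunks):
        placement[chunk] = [machines[(chunk + r) % len(machines)]
                            for r in range(REPLICAS)]
    return placement

# A 200 MB file becomes 4 chunks, each replicated on 3 of 4 nodes
print(place_chunks(200 * 2**20, ["node-a", "node-b", "node-c", "node-d"]))
```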
14. Scheduling (Master/Worker) One master, many workers. Input data is split into M map tasks (~64 MB in size, matching GFS chunks); the reduce phase is partitioned into R tasks. Tasks are assigned to workers dynamically: the master assigns each map task and each reduce task to a free worker. Fault handling via redundancy: the master checks whether a worker is still alive via heart-beat and reschedules its work item if the worker has died.
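The dynamic task assignment can be sketched with threads pulling from a shared queue (a minimal single-machine illustration; heart-beats and rescheduling of failed workers are omitted for brevity):

```python
import queue
import threading

def master_worker(tasks, n_workers, work_fn):
    """Minimal master/worker sketch: the master fills a task queue,
    and each free worker pulls the next task until the queue is empty.
    Results are collected under a lock."""
    task_q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = task_q.get_nowait()   # free worker grabs next task
            except queue.Empty:
                return                       # no tasks left: worker exits
            out = work_fn(task)
            with lock:
                results.append(out)

    for t in tasks:                          # master enqueues all tasks
        task_q.put(t)
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

# e.g. run 8 "map tasks" on 3 workers
print(sorted(master_worker(range(8), 3, lambda x: x * x)))
```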
15. Scheduling Example [Diagram: the master assigns map and reduce tasks to workers; input DATA flows through the map workers into temporary files, then through the reduce workers to the output RESULT]
16. Google’s MapReduce vs. Hadoop Google MapReduce: main language C++; Google Filesystem (GFS); GFS master; GFS chunkserver. Hadoop MapReduce: main language Java; Hadoop Filesystem (HDFS); Hadoop namenode; Hadoop datanode.
18. Word Count – Input Set of text files: foo.txt: “Sweet, this is the foo file”; bar.txt: “This is the bar file”. Expected output: sweet (1), this (2), is (2), the (2), foo (1), bar (1), file (1)
19. Word Count – Map mapper(filename, file-contents): for each word: emit(word, 1) Output: this (1) is (1) the (1) sweet (1) this (1) the (1) is (1) foo (1) bar (1) file (1)
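A runnable version of the mapper pseudo-code above might look like this (punctuation and case handling are simplifications added here):

```python
def mapper(filename, contents):
    """Emit (word, 1) for every word in the file contents."""
    for word in contents.lower().split():
        yield (word.strip(",.!?"), 1)      # strip trailing punctuation

pairs = list(mapper("foo.txt", "Sweet, this is the foo file"))
print(pairs)
# [('sweet', 1), ('this', 1), ('is', 1), ('the', 1), ('foo', 1), ('file', 1)]
```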
20. Word Count – Shuffle & Sort Before: this (1) is (1) the (1) sweet (1) this (1) the (1) is (1) foo (1) bar (1) file (1) After: this (1) this (1) is (1) is (1) the (1) the (1) sweet (1) foo (1) bar (1) file (1)
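The shuffle step groups all intermediate pairs with the same key, so each reducer sees one key with all of its values. A sketch of this grouping:

```python
from collections import defaultdict

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))    # sorted by key

pairs = [("this", 1), ("is", 1), ("the", 1), ("sweet", 1),
         ("this", 1), ("the", 1), ("is", 1),
         ("foo", 1), ("bar", 1), ("file", 1)]
print(shuffle(pairs))
# {'bar': [1], 'file': [1], 'foo': [1], 'is': [1, 1], ...}
```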
21. Word Count – Reduce reducer(word, values): sum = 0 for each value in values: sum = sum + value emit(word, sum) Output: sweet (1) this (2) is (2) the (2) foo (1) bar (1) file (1)
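A runnable version of the reducer pseudo-code above, applied to the grouped pairs from the shuffle step, reproduces the expected output:

```python
def reducer(word, values):
    """Sum the counts emitted for one word (mirrors the slide's loop)."""
    return (word, sum(values))

# Grouped intermediate results, as produced by shuffle & sort
groups = {"sweet": [1], "this": [1, 1], "is": [1, 1], "the": [1, 1],
          "foo": [1], "bar": [1], "file": [1]}

counts = dict(reducer(w, vs) for w, vs in groups.items())
print(counts)
# {'sweet': 1, 'this': 2, 'is': 2, 'the': 2, 'foo': 1, 'bar': 1, 'file': 1}
```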
23. Summary Lots of data processed on the web (e.g., Google) Performance solution: Go parallel Input, Map, Shuffle & Sort, Reduce, Output Google File System Scheduling: Master/Worker Word Count example Hadoop Questions?
24. References Inspirations for this presentation: http://www4.informatik.uni-erlangen.de/Lehre/WS10/V_MW/Uebung/folien/05-Map-Reduce-Framework.pdf http://www.scribd.com/doc/23844299/Map-Reduce-Hadoop-Pig RWTH Map Reduce talk: http://bit.ly/f5oM7p Papers: Dean et al., MapReduce: Simplified Data Processing on Large Clusters, OSDI ’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. Ghemawat et al., The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
Editor’s notes
These days the web is all about data. All major websites rely on huge amounts of data in some form in order to provide services to users, for example Google and Facebook. Facilities like the LHC will also produce data measured in petabytes each year. However, it takes about 2.5 hours to read one terabyte off a typical hard drive. The solution that immediately comes to mind, of course, is going parallel. Concrete example [TODO], [context: cloud computing]
Parallel programming is still hard. Programmers have to deal with a lot of boilerplate code and have to manually write code for things like scheduling and load balancing. People also want to use the company cluster in parallel, so something like a batch system is needed. As more and more companies work with huge amounts of data, a kind of standard framework has emerged in recent years: the Map/Reduce framework.
Map and reduce have been known for years as a functional programming concept.