An Introduction to MapReduce
1. An Introduction to MapReduce
Presented by Frane Bandov
at the Operating Complex IT-Systems seminar
Berlin, 1/26/2010
2. Outline
• Introduction
• Google MapReduce
– Idea
– Overview
– Fault Tolerance
– GFS: Google File System
– Job Example
• Alternative Implementations
• Reception and Criticism
• Trends and Future Development
• Conclusion
2/16/10 An Introduction to MapReduce 2
4. Introduction – Problem
Sometimes we have to deal with huge amounts
of data
[Bar chart: data volume in TBytes (axis from 0 to 250) for You, Facebook, Yahoo! Groups, and the German Climate Computing Centre]
5. Introduction – Problem
The data needs to be processed, but how?
Can't process all of this data on one machine
Distribute the processing to many machines
6. Introduction – Approach
Distributed computing is the solution
“Let’s write our own distributed computing
software as a solution to our problem”
Checklist
• design protocols
• design data structures
• write the code
• ensure fault tolerance
Development takes a long time
Expensive: Cost-benefit ratio?
Build complex software for simple computations?
8. Google MapReduce – Idea
A framework for distributed computing
Don't care about protocols, failure tolerance, etc.
Just write your simple computation
9. Google MapReduce – Idea
MapReduce Paradigm
Map: Apply function to all elements of a list
square x = x * x;
map square [1, 2, 3, 4, 5];
[1, 4, 9, 16, 25]
Reduce: Combine all elements of a list
reduce (+) [1, 2, 3, 4, 5];
15
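The two primitives on this slide correspond closely to Python's built-in `map` and `functools.reduce`. The following is a minimal sketch in Python, chosen here only for illustration; the slide's own notation is functional pseudocode.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


numbers = [1, 2, 3, 4, 5]

# Map: apply a function to every element of the list.
squares = list(map(square, numbers))

# Reduce: combine all elements of the list into a single value.
total = reduce(add, numbers)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```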
10. Google MapReduce – Idea
Basic functioning
Input → Map → Reduce → Output
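The flow from input to output can be sketched as a single-process word count. This is an illustration under assumptions, not the framework's code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and the grouping-by-key step between the two phases is not shown on the slide. It is included because the reduce step needs all values for one key together.

```python
from collections import defaultdict


def map_phase(document):
    """Emit one (word, 1) pair per word in the input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key (assumed step between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values that share a key into one output record."""
    return key, sum(values)


def run_job(documents):
    intermediate = []
    for doc in documents:
        intermediate.extend(map_phase(doc))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = run_job(["the quick fox", "the lazy dog", "The end"])
print(counts["the"])  # 3
```

In a real deployment the map calls and the reduce calls would run on different machines; here they run sequentially so that the data flow is easy to follow.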
12. MapReduce – Fault Tolerance
• Workers are periodically pinged by master
• No answer within a certain time → worker is considered failed
Mapper fails:
– Reset map job as idle
– Even if the job was completed, its intermediate files are
inaccessible (they are stored on the failed machine's local disk)
– Notify reducers where to get the new intermediate file
Reducer fails:
– Reset its job as idle
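The rules on this slide can be sketched as a small piece of master-side bookkeeping. Everything named here (`Task`, `handle_failures`, `PING_TIMEOUT`, the timeout value) is hypothetical. The sketch also assumes that completed reduce tasks are left alone, because their output already sits in the global file system. Notifying reducers about re-created intermediate files is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

PING_TIMEOUT = 10.0  # seconds; hypothetical value


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress" or "completed"
    worker: Optional[str] = None  # worker the task is assigned to, if any


def handle_failures(tasks, last_reply, now, timeout=PING_TIMEOUT):
    """Reset tasks of workers that have not answered a ping within `timeout`."""
    failed = {w for w, seen in last_reply.items() if now - seen > timeout}
    for task in tasks:
        if task.worker not in failed:
            continue
        # Map tasks are reset even when completed, because their intermediate
        # files are lost with the worker. Reduce tasks are reset only while
        # still in progress (assumption, see lead-in).
        if task.kind == "map" or task.state == "in_progress":
            task.state = "idle"
            task.worker = None
    return failed


tasks = [
    Task("map", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("map", "in_progress", "w2"),
]
failed = handle_failures(tasks, {"w1": 0.0, "w2": 95.0}, now=100.0)
print(failed)                    # {'w1'}
print([t.state for t in tasks])  # ['idle', 'idle', 'in_progress']
```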
13. MapReduce – Fault Tolerance
Master fails:
– Master periodically writes checkpoints of its state
– In case of failure the MapReduce operation is aborted
– Operation can be restarted from the last checkpoint
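A checkpoint-and-restart cycle might look like the following sketch. The JSON format, the file name, and the functions `write_checkpoint` and `restore_checkpoint` are assumptions made for illustration; the slide does not say how the master stores its state.

```python
import json
import os
import tempfile


def write_checkpoint(path, task_states):
    """Write the task table to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(task_states, handle)
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def restore_checkpoint(path):
    """Load the last checkpoint; tasks that were not completed become idle again."""
    with open(path) as handle:
        task_states = json.load(handle)
    return {
        task: ("completed" if state == "completed" else "idle")
        for task, state in task_states.items()
    }


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "master.ckpt")
    write_checkpoint(checkpoint, {"map-0": "completed", "map-1": "in_progress"})
    restored = restore_checkpoint(checkpoint)

print(restored)  # {'map-0': 'completed', 'map-1': 'idle'}
```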
14. Google MapReduce – GFS
Google File System
• In-house distributed file system at Google
• Stores all input and output files
• Stores files…
– divided into 64 MB blocks
– on at least 3 different machines
• Machines running GFS also
run MapReduce
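The block size and replication figures above can be made concrete with a short sketch. The round-robin placement and the function `place_blocks` are assumptions for illustration only; they are not GFS's actual placement policy.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, as stated on the slide
REPLICAS = 3                   # minimum number of copies, as stated on the slide


def place_blocks(file_size, machines, block_size=BLOCK_SIZE, replicas=REPLICAS):
    """Return {block index: [machines holding a copy]} using round-robin placement."""
    if len(machines) < replicas:
        raise ValueError("need at least as many machines as replicas")
    block_count = math.ceil(file_size / block_size)
    placement = {}
    for block in range(block_count):
        placement[block] = [
            machines[(block + i) % len(machines)] for i in range(replicas)
        ]
    return placement


layout = place_blocks(200 * 1024 * 1024, ["m1", "m2", "m3", "m4"])
print(len(layout))  # 4
print(layout[3])    # ['m4', 'm1', 'm2']
```

Because the storage machines also run MapReduce, a scheduler can use a table like `layout` to start a map task on a machine that already holds the block it needs.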
20. Alternative Implementations
Apache Hadoop
• Open-source implementation in Java
• Jobs can be written in C++, Java, Python, etc.
• Used by Yahoo!, Facebook, Amazon and others
• Most commonly used implementation
• HDFS as an open-source implementation of GFS
• Can also use Amazon S3, HTTP(S) or FTP
• Extensions: Hive, Pig, HBase
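One way to run non-Java jobs on Hadoop is Hadoop Streaming, which the slide does not name. It passes records to external programs as text lines and expects tab-separated key/value lines back. The sketch below assumes that convention and simulates it locally; `mapper` and `reducer` are hypothetical names, and a real job would read from standard input and write to standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: sorted() stands in for the framework's sort between phases.
mapped = sorted(mapper(["to be or not", "to be"]))
result = list(reducer(mapped))
print(result)  # ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```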
21. Alternative Implementations
Mars
MapReduce implementation for NVIDIA GPUs
using the CUDA framework
MapReduce-Cell
Implementation for the Cell multi-core
processor
Qizmt
MySpace’s implementation of MapReduce in C#
22. Alternative Implementations
There are many other open- and closed-
source implementations of MapReduce!
24. Reception and Criticism
• Yahoo!: Hadoop on a 10,000-server cluster
• Facebook analyses its daily logs (25 TB) on
a 1,000-server cluster
• Amazon Elastic MapReduce: Hadoop
clusters for rent on EC2 and S3
• IBM and Google: Support university
courses in distributed programming
• UC Berkeley announced that it would teach
freshmen MapReduce programming
26. Reception and Criticism
• Criticism mainly by RDBMS experts
DeWitt and Stonebraker
• MapReduce
– is a step backwards in database access
– is a poor implementation
– is not novel
– is missing features that are routinely provided
by modern DBMSs
– is incompatible with the DBMS tools
27. Reception and Criticism
Response to criticism
MapReduce is not an RDBMS
It is well suited to processing and structuring huge
amounts of unstructured data
MapReduce's big innovation is that it enables
distributing data processing across a network of
cheap and possibly unreliable computers
29. Trends and Future Development
Trend of using MapReduce/Hadoop as a
parallel database
• Hive: Query language for Hadoop
• HBase: Column-oriented distributed database
(modeled after Google’s BigTable)
• Map-Reduce-Merge: Adding merge to the
paradigm allows implementing features of
relational algebra
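To illustrate why a merge step helps with relational operations, the sketch below runs two independent map/reduce passes and then joins their outputs on the key. It is a loose illustration of the idea, not the Map-Reduce-Merge authors' design; the helpers `map_reduce` and `merge` and the sample data are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run one plain map/reduce pass and return {key: reduced value}."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}


def merge(left, right, merge_fn):
    """Merge step: combine two reduced outputs on matching keys (an equi-join)."""
    return {
        key: merge_fn(left[key], right[key])
        for key in left.keys() & right.keys()
    }


employees = [("alice", "sales", 100), ("bob", "sales", 50), ("carol", "it", 70)]
departments = [("sales", 10), ("it", 20)]  # (department, bonus rate in percent)

salary_by_dept = map_reduce(employees, lambda e: [(e[1], e[2])], sum)
rate_by_dept = map_reduce(departments, lambda d: [(d[0], d[1])], max)
bonus_by_dept = merge(
    salary_by_dept, rate_by_dept, lambda total, rate: total * rate // 100
)

print(sorted(bonus_by_dept.items()))  # [('it', 14), ('sales', 15)]
```

With plain map and reduce alone, the two data sets would have to be squeezed into one keyed stream; the separate merge step keeps the two lineages apart until they are joined.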
30. Trends and Future Development
Trend to use the MapReduce paradigm to
better utilize multi-core CPUs
• Qt Concurrent
– Simplified C++ version of MapReduce for distributing
tasks between multiple processor cores
• Mars
• MapReduce-Cell
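The same pattern on a single machine can be sketched with a worker pool from Python's standard library. `mapped_reduced` is a hypothetical helper, not the API of any library listed above. A thread pool is used to keep the sketch short. In CPython, threads do not run pure-Python code on several cores at once, so a process pool would be the usual substitute for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def mapped_reduced(items, map_fn, reduce_fn, workers=4):
    """Apply map_fn to items on a worker pool, then fold the results with reduce_fn."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(map_fn, items)  # results are yielded in input order
        return reduce(reduce_fn, mapped)


sum_of_squares = mapped_reduced(range(1, 6), lambda x: x * x, add)
print(sum_of_squares)  # 55
```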
32. Conclusion
MapReduce
• provides an easy solution for processing
large amounts of data
• brings a paradigm shift in programming
• changed the world, i.e. it made data processing
more efficient and cheaper, and it is the foundation
of many other approaches and solutions