This is a deck of slides from a recent meetup of AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens.
The presentation gives an overview of the MapReduce framework and a description of its open-source implementation, Hadoop. Amazon's own Elastic MapReduce (EMR) service is also mentioned. With the growing interest in Big Data, this is a good introduction to the subject.
Hadoop & MapReduce
1. Hadoop & MapReduce
Dr. Ioannis Konstantinou
http://www.cslab.ntua.gr/~ikons
AWS Usergroup Greece
18/07/2012
Computing Systems Laboratory
School of Electrical and Computer Engineering
National Technical University of Athens
2. Big Data
90% of today's data was created in the last 2 years
Moore's law: Data volume doubles every 18 months
YouTube: 13 million hours and 700 billion views in 2010
Facebook: 20TB/day (compressed)
CERN/LHC: 40TB/day (15PB/year)
Many more examples
Web logs, presentation files, medical files, etc.
3. Problem: Data explosion
1 EB (Exabyte = 10^18 bytes) = 1000 PB (Petabyte = 10^15 bytes)
Data traffic of mobile telephony in the USA in 2010
1.2 ZB (Zettabyte = 10^21 bytes) = 1200 EB
Total of digital data in 2010
35 ZB
Estimate for the volume of total digital data in 2020
7. Parallelization challenges
How to assign units of work to the workers?
What if there are more units of work than workers?
What if the workers need to share intermediate, incomplete data?
How do we aggregate such intermediate data?
How do we know when all workers have completed their assignments?
What if some workers fail?
8. What is MapReduce?
A programming model
A programming framework
Used to develop solutions that will
Process large amounts of data in a parallelized fashion
In clusters of computing nodes
Originally a closed-source implementation at Google
Scientific papers of ’03 & ’04 describe the framework
Hadoop: open-source implementation of the algorithms described in the scientific papers
http://hadoop.apache.org/
9. What is Hadoop?
2 large subsystems, 1 for data management & 1 for computation:
HDFS (Hadoop Distributed File System)
MapReduce computation framework runs on top of HDFS
HDFS is essentially the I/O layer of Hadoop
Written in Java: a set of Java processes running on multiple nodes
Who uses it:
Yahoo!
Amazon
Facebook
Twitter
Plus many more...
10. HDFS – distributed file system
A scalable distributed file system for applications dealing with
large data sets.
Distributed: runs in a cluster
Scalable: 10K nodes, 100K files, 10 PB of storage
Storage appears as a single seamless space across the whole cluster
Files broken into blocks
Typical block size: 128 MB.
Replication: Each block copied to multiple data nodes.
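The block-and-replication arithmetic above can be sketched in a few lines of Python (the function names are illustrative, not part of the HDFS API):

```python
# Sketch: how HDFS splits a file into fixed-size blocks and what replication
# costs in raw cluster storage. Names are illustrative, not Hadoop API.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size mentioned above

def block_count(file_size_bytes):
    """Number of HDFS blocks needed to store a file of the given size."""
    return -(-file_size_bytes // BLOCK_SIZE)  # ceiling division

def storage_with_replication(file_size_bytes, replication=3):
    """Raw cluster bytes consumed, with HDFS's default replication factor of 3."""
    return file_size_bytes * replication

one_gb = 1024 ** 3
print(block_count(one_gb))                        # 8 blocks of 128 MB
print(storage_with_replication(one_gb) // one_gb) # 3x the logical size
```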
11. Architecture of HDFS/MapReduce
Master/Slave scheme
HDFS: a central NameNode administers multiple DataNodes
NameNode: holds information about which DataNode holds which file blocks
DataNodes: "dumb" servers that hold raw file chunks
MapReduce: a central JobTracker administers multiple TaskTrackers
NameNode and JobTracker run on the master
DataNode and TaskTracker run on the slaves
12. MapReduce
The problem is broken down into 2 phases.
Map: non-overlapping sets of input data (<key, value> records) are assigned to different processes (mappers) that produce a set of intermediate <key, value> results
Reduce: the output of the Map phase is fed to a (typically smaller) number of processes (reducers) that aggregate the input into a smaller number of <key, value> records.
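The two phases above can be sketched as a single-process Python simulation; `run_mapreduce` and the example functions are illustrative, not the Hadoop API:

```python
# Minimal single-process sketch of the two MapReduce phases described above.
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    # Map phase: each input record yields intermediate <key, value> pairs,
    # which are collected per key (the "shuffle")
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k, v in mapper(key, value):
            intermediate[k].append(v)
    # Reduce phase: values sharing a key are aggregated by the reducer
    return {k: reducer(k, vs) for k, vs in intermediate.items()}

# Example: maximum temperature per city (a classic MapReduce exercise)
readings = [("f1", ("Athens", 31)), ("f1", ("Patras", 28)), ("f2", ("Athens", 35))]
result = run_mapreduce(
    readings,
    mapper=lambda key, value: [value],   # emit (city, temp) pairs
    reducer=lambda city, temps: max(temps),
)
print(result)  # {'Athens': 35, 'Patras': 28}
```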
14. Initialization phase
Input is uploaded to HDFS and is split into pieces of fixed size
Each TaskTracker node that participates in the computation executes a copy of the MapReduce program
One of the nodes plays the JobTracker (master) role. This node assigns tasks to the rest (the workers). Tasks can be either of type map or reduce.
15. JobTracker (Master)
The JobTracker holds data about:
Status of tasks
Location of input, output and intermediate data (it runs together with the NameNode, the HDFS master)
The master is responsible for scheduling the execution of work tasks.
16. TaskTracker (Slave)
The TaskTracker runs tasks assigned by the master.
It runs on the same node as the DataNode (the HDFS slave)
A task can be either of type Map or type Reduce
Typically the maximum number of concurrent tasks a node can run equals its number of CPU cores (achieving optimal CPU utilization)
17. Map task
A worker (TaskTracker) that has been assigned a map task:
Reads the relevant input data (input split) from HDFS, parses the <key, value> pairs, and passes them as input to the map function.
The map function processes the pairs and produces intermediate pairs that are buffered in memory.
Periodically a partition function is executed, which stores the intermediate key-value pairs on the local node's storage while grouping them into R sets. This function is user-definable.
When the partition function completes the storage of the key-value pairs, it informs the master that the task is complete and where the data are stored.
The master forwards this information to the workers that run the reduce tasks.
18. Reduce task
A worker that has been assigned a reduce task:
Reads from every completed map process the pairs that correspond to it, based on the locations provided by the master.
When all intermediate pairs have been retrieved, they are sorted by key. Entries with the same key are grouped together.
The reduce function is executed with the <key, group_of_values> pairs from the previous step as input.
The reduce task processes the input data and produces the final pairs.
The output pairs are appended to a file in the local file system. When the reduce task is completed, the file becomes available in the distributed file system.
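The sort-and-group step described above can be sketched in plain Python with standard-library tools (the variable names are illustrative):

```python
# Sketch of the shuffle step on the reducer side: intermediate pairs fetched
# from the mappers are sorted by key, and equal keys are grouped together
# before the reduce function runs on each <key, group_of_values> pair.
from itertools import groupby
from operator import itemgetter

intermediate = [("b", 2), ("a", 1), ("b", 5), ("a", 3)]
intermediate.sort(key=itemgetter(0))  # reducer input is always key-sorted

grouped = {k: [v for _, v in pairs]
           for k, pairs in groupby(intermediate, key=itemgetter(0))}
print(grouped)  # {'a': [1, 3], 'b': [2, 5]}
```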
19. Task Completion
When a worker has completed its task, it informs the master.
When all workers have informed the master, the master returns control to the original program of the user.
20. Example
[Diagram: input is split into parts; the Master assigns Map tasks to workers, their intermediate output is passed to workers running Reduce tasks, and the reducers produce the final output.]
22. Example: Word count 1/3
Objective: measure the frequency of appearance of words in a large set
of documents
Potential use case: Discovery of popular url in a set of webserver
logfiles
Implementation plan:
“Upload” documents on MapReduce
Author a map function
Author a reduce function
Run a MapReduce task
Retrieve results
23. Example: Word count 2/3
map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
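The pseudocode above translates almost directly into runnable Python; the driver loop below stands in for the framework's shuffle, and the document names are made up:

```python
# Runnable rendition of the word-count map/reduce pseudocode on this slide.
from collections import defaultdict

def map_fn(doc_name, text):
    # Emit <word, 1> for every word in the document
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum the partial counts for one word
    return word, sum(counts)

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]

intermediate = defaultdict(list)   # stands in for the shuffle phase
for name, text in docs:
    for w, one in map_fn(name, text):
        intermediate[w].append(one)

counts = dict(reduce_fn(w, cs) for w, cs in intermediate.items())
print(counts["the"])  # 2
```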
26. Locality
Move computation near the data: The master tries to
have a task executed on a worker that is as “near” as
possible to the input data, thus reducing the
bandwidth usage
How does the master know?
27. Task distribution
The number of tasks is usually higher than the number of available workers
One worker can execute more than one task
Work-load balance is improved. In the case of a single worker failure there is faster recovery and redistribution of its tasks to other nodes.
28. Redundant task executions
Some tasks may lag, delaying the overall job execution
The solution is to create copies of a task that are executed in parallel by 2 or more different workers (speculative execution)
A task is considered complete when the master is informed of its completion by at least one node.
29. Partitioning
A user can specify a custom function that will partition the intermediate data during shuffling.
The type of input and output data can be defined by the user and has no limitation on the form it may take.
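As a sketch of what such a partition function does: each intermediate key is routed to one of R reducers. Hadoop's default behaves like the hash-based function below; the range partitioner is a made-up example of a custom alternative.

```python
# Sketch of partitioning during shuffle: each intermediate key is routed to
# one of R reducers. Function names here are illustrative, not Hadoop API.
R = 4  # number of reducers

def default_partition(key, num_reducers=R):
    # Hash-based routing, as Hadoop's default partitioner does
    return hash(key) % num_reducers

def range_partition(key, num_reducers=R):
    # Made-up custom partitioner: route keys by first letter so each
    # reducer's output file covers a contiguous alphabetical range.
    return (ord(key[0].lower()) - ord("a")) * num_reducers // 26

print(range_partition("apple"), range_partition("zebra"))  # 0 3
```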
30. The input of a reducer is always sorted
There is the possibility to execute tasks locally in a
serial manner
The master provides web interfaces for:
Monitoring task progress
Browsing HDFS
31. When should I use it?
Good choice for jobs that can be broken into parallelizable tasks:
Indexing/Analysis of log files
Sorting of large data sets
Image processing
Bad choice for serial or low-latency jobs:
Computation of the number π to a precision of 1,000,000 digits
Computation of the Fibonacci sequence
Replacing MySQL
32. Use cases 1/3
Large-scale image conversions
100 Amazon EC2 instances, 4 TB of raw TIFF data
11 million PDFs in 24 hours for $240
Internal log processing
Reporting, analytics and machine learning
Cluster of 1110 machines, 8800 cores and 12 PB raw storage
Open-source contributors (Hive)
Store and process tweets, logs, etc.
Open-source contributors (hadoop-lzo)
Large-scale machine learning
33. Use cases 2/3
100,000 CPUs in 25,000 computers
Content/ads optimization, search index
Machine learning (e.g. spam filtering)
Open-source contributors (Pig)
Natural language search (through Powerset)
400 nodes in EC2, storage in S3
Open-source contributors (!) to HBase
ElasticMapReduce service
On-demand elastic Hadoop clusters for the Cloud
34. Use cases 3/3
ETL processing, statistics generation
Advanced algorithms for behavioral analysis and targeting
Used for discovering "People You May Know" and for other apps
3 x 30-node clusters, 16 GB RAM and 8 TB storage
Leading Chinese-language search engine
Search log analysis, data mining
300 TB per week
10- to 500-node clusters
35. Amazon ElasticMapReduce (EMR)
A hosted Hadoop-as-a-service solution provided by AWS
No need to manage or tune Hadoop clusters yourself
Upload your input data to S3; store your output data on S3
Procure as many EC2 instances as you need and pay only for the time you use them
Hive and Pig support makes it easy to write data-analysis scripts
Java, Perl, Python, PHP, C++ for more sophisticated algorithms
Integrates with DynamoDB (process combined datasets in S3 & DynamoDB)
Support for HBase (NoSQL)
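As a hedged sketch of launching such a cluster: the command below uses today's AWS CLI syntax (the tooling available in 2012 differed), and the cluster name, release label, bucket, and instance sizes are placeholders.

```shell
# Assumes the AWS CLI is configured with valid credentials.
# All names, sizes, and S3 paths below are placeholders.
aws emr create-cluster \
  --name "wordcount-demo" \
  --release-label emr-6.15.0 \
  --applications Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --log-uri s3://my-bucket/emr-logs/
```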