2. What is MapReduce?
From Wikipedia:
MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use
different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a
database (structured).
"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker
nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the
smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way
to form the output – the answer to the problem it was originally trying to solve.
3. The Map
Mapping involves taking raw data and converting it into a
series of symbols.
For example, DNA sequencing:
ddATP -> A
ddGTP -> G
ddCTP -> C
ddTTP -> T
Results in representations like "GATTACA"
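As a minimal sketch of this idea, the lookup below turns a list of raw readings into a string of symbols. The table comes from the slide; the helper name `map_reads` and the sample input are made up for illustration.

```ruby
# Lookup table taken from the slide: each labelled terminator maps to one base symbol.
BASE_FOR_TERMINATOR = {
  "ddATP" => "A",
  "ddGTP" => "G",
  "ddCTP" => "C",
  "ddTTP" => "T"
}.freeze

# Hypothetical helper: convert a list of raw readings into a string of symbols.
def map_reads(reads)
  reads.map { |read| BASE_FOR_TERMINATOR.fetch(read) }.join
end

reads = %w[ddGTP ddATP ddTTP ddTTP ddATP ddCTP ddATP]
puts map_reads(reads) # prints "GATTACA"
```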
4. Practical Mapping
Inputs are generally flat-files containing lines of text.
clever_critters.txt:
foxes are clever
cats are clever
Files are read in and fed to a mapper one line at a time via
STDIN.
cat clever_critters.txt | ruby mapper.rb
5. Practical Mapping Cont'd
The mapper processes the line and outputs a key/value
pair to STDOUT for each symbol it maps
foxes 1
are 1
clever 1
cats 1
are 1
clever 1
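A word-count mapper along these lines could look like the sketch below. The helper name `map_line` is hypothetical, and the tab separator between key and value is an assumption (it is the usual convention for streaming jobs). The sample lines are fed from an array here; a real mapper.rb would iterate over `STDIN.each_line` instead.

```ruby
# Emit one "word<TAB>1" record for every whitespace-separated word in the line.
def map_line(line)
  line.split.map { |word| "#{word}\t1" }
end

# Stand-in for STDIN: the two lines from clever_critters.txt.
lines = ["foxes are clever", "cats are clever"]
lines.each { |line| puts map_line(line) }
```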
6. Work Partitioning
These key/value pairs are passed to a "partition function"
which organizes the output and assigns it to reducer nodes
foxes -> node 1
are -> node 2
clever -> node 3
cats -> node 4
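One common way to build such a partition function is to hash the key and take the result modulo the number of reducers, so that every record with the same key lands on the same node. The sketch below assumes four reducers and uses `Zlib.crc32` as a stand-in hash; it is not the hash Hadoop itself uses, and `partition_for` is a hypothetical name. Unlike the tidy one-key-per-node picture above, a hash may assign several keys to the same node.

```ruby
require "zlib"

# Hypothetical partition function: a stable hash of the key, modulo the
# number of reducers. Zlib.crc32 is used because String#hash is seeded
# differently in every Ruby process.
def partition_for(key, num_reducers)
  Zlib.crc32(key) % num_reducers
end

%w[foxes are clever cats are clever].each do |key|
  puts "#{key} -> node #{partition_for(key, 4) + 1}"
end
```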
7. Practical Reduction
The Reducers each receive the sharded
workload assigned to them by the partitioning.
Typically the work is received as a stream of
key/value pairs via STDIN:
"foxes 1" -> node 1
"are 1|are 1" -> node 2
"clever 1|clever 1" -> node 3
"cats 1" -> node 4
8. Practical Reduction Cont'd
The reduction is essentially whatever you want it to be.
There are common patterns that are often pre-solved by
the map-reduce framework.
See Hadoop's Built-In Reducers
e.g., "Aggregate" - give me a total of all the values for each key
foxes - 1
are - 2
clever - 2
cats - 1
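A hand-written reducer that produces the same totals might look like the sketch below. It assumes tab-separated "key<TAB>value" records, and `reduce_counts` is a hypothetical name. The records come from an array here; a real reducer would read them from STDIN.

```ruby
# Sum the values seen for each key.
def reduce_counts(records)
  totals = Hash.new(0)
  records.each do |record|
    key, value = record.chomp.split("\t", 2)
    totals[key] += value.to_i
  end
  totals
end

# Stand-in for STDIN: the mapper output from the earlier slides.
records = ["foxes\t1", "are\t1", "clever\t1", "cats\t1", "are\t1", "clever\t1"]
reduce_counts(records).each { |key, total| puts "#{key}\t#{total}" }
```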
9. What is Hadoop?
From Wikipedia:
Apache Hadoop is a software framework that supports data-intensive distributed applications under a
free license.[1] It enables applications to work with thousands of computationally independent
computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File
System (GFS) papers.
Essentially, Hadoop is a practical implementation of all the pieces you'd need to
accomplish everything we've discussed thus far. It takes in the data, organizes
the tasks, passes the data through its entire path and finally outputs the
reduction.
13. Look complex? It's not
1. Sign up for the service
2. Download the tools (requires Ruby 1.8)
3. mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
4. Create your credentials.json file
{
  "access_id": "<key>",
  "private_key": "<secret key>",
  "keypair": "<name of keypair>",
  "key-pair-file": "~/.ssh/<key>.pem",
  "log_uri": "s3://<unique s3 bucket>/",
  "region": "us-east-1"
}
5. unzip ~/Downloads/elastic-mapreduce-ruby.zip
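A malformed credentials.json is an easy mistake to make, so a quick check before running the tools can save a round trip. The sketch below only verifies that the file parses as JSON and that the six keys shown above are present and non-blank. Treating all six as required is an assumption, and `missing_credential_keys` is a hypothetical helper, not part of the EMR tools.

```ruby
require "json"

# Keys taken from the credentials.json shown above.
REQUIRED_KEYS = %w[access_id private_key keypair key-pair-file log_uri region].freeze

# Hypothetical helper: return the expected keys that are missing or blank.
# JSON.parse raises JSON::ParserError if the text is not valid JSON.
def missing_credential_keys(json_text)
  config = JSON.parse(json_text)
  REQUIRED_KEYS.reject { |key| config[key].is_a?(String) && !config[key].strip.empty? }
end

sample = '{ "access_id": "<key>", "private_key": "<secret key>", "region": "us-east-1" }'
puts missing_credential_keys(sample).inspect
```

To check the real file, pass `File.read("credentials.json")` instead of the sample string.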
14. Run it
ruby elastic-mapreduce --list
ruby elastic-mapreduce --create --alive
ruby elastic-mapreduce --list
ruby elastic-mapreduce --terminate <JobFlowID>
Note: you can also view the job flow in the Amazon EMR web interface.
Logs can be viewed in the S3 bucket you specified in your
credentials.json file. Just drill down via the S3 web interface and
double-click the file.
15. Creating a minimal job
1. Set up a dedicated s3 bucket
2. Create a folder called "input" in that bucket
3. Upload your inputs into s3://bucket/input
s3cmd put *log s3://bucket/input
16. Minimal Job Cont'd
4. Write a mapper
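e.g.:
#!/usr/bin/env ruby
ARGF.each do |line|
  # remove any newline
  line = line.chomp
  # Emit "key<TAB>1" for each log level found. Hadoop's built-in aggregate
  # reducer expects keys prefixed with an aggregator name such as LongValueSum:
  puts "LongValueSum:ERROR\t1" if /ERROR/.match(line)
  puts "LongValueSum:INFO\t1" if /INFO/.match(line)
  puts "LongValueSum:DEBUG\t1" if /DEBUG/.match(line)
end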
See http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/ for
examples
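Before uploading anything, the mapper logic can be sanity-checked locally. The sketch below is a rough stand-in for the whole pipeline: it maps a few made-up log lines, sorts the records the way the shuffle would, and sums the values per key. The function names and sample lines are hypothetical, and everything before the tab is treated as the key, whatever format the real mapper emits.

```ruby
LEVELS = %w[ERROR INFO DEBUG].freeze

# Mirrors the mapper: one "LEVEL<TAB>1" record per log level found in the line.
def map_log_line(line)
  LEVELS.select { |level| line.include?(level) }.map { |level| "#{level}\t1" }
end

# Stand-in for the shuffle and an aggregating reducer:
# sort the mapped records, then sum the values for each key.
def dry_run(lines)
  totals = Hash.new(0)
  lines.flat_map { |line| map_log_line(line.chomp) }.sort.each do |record|
    key, value = record.split("\t", 2)
    totals[key] += value.to_i
  end
  totals
end

sample = ["12:00 INFO started", "12:01 ERROR disk full", "12:02 INFO retrying"]
p dry_run(sample)
```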
17. Minimal Job Cont'd
5. Upload your mapper to your s3 bucket
s3cmd put mapper.rb s3://bucket
6. Run it
elastic-mapreduce --create --stream \
  --mapper s3://bucket/mapper.rb \
  --input s3://bucket/input \
  --output s3://bucket/output \
  --reducer aggregate
NOTE: This job uses the built-in aggregator.
NOTE: The output directory must NOT exist at the time of the run
Amazon will scale ec2 instances to consume the load dynamically.
7. Pick up your results in the output folder
18. AWS Demo App
AWS has a very cool publicly-available app to
run:
elastic-mapreduce --create --stream \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --input s3://elasticmapreduce/samples/wordcount/input \
  --output s3://bucket/output \
  --reducer aggregate
See Amazon Example Doc
19. Possibilities
EMR is a fully-functional Hadoop
implementation.
Mappers and reducers can be written in Python,
Ruby, PHP and Java
Go crazy.