Elastic MapReduce
   Outsourcing BigData

        Nathan McCourtney
            @beaknit
What is MapReduce?
From Wikipedia:

MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use
different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a
database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker
nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the
smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way
to form the output – the answer to the problem it was originally trying to solve.
The Map
Mapping involves taking raw data and converting it into a
series of symbols.

For example, DNA sequencing:
ddATP   ->   A
ddGTP   ->   G
ddCTP   ->   C
ddTTP   ->   T

Results in representations like "GATTACA"
Practical Mapping
Inputs are generally flat-files containing lines of text.
   clever_critters.txt:
       foxes are clever
       cats are clever




Files are read in and fed to a mapper one line at a time via
STDIN.
   cat clever_critters.txt | mapper.rb
Practical Mapping Cont'd
The mapper processes the line and outputs a key/value
pair to STDOUT for each symbol it maps:
   foxes 1
   are 1
   clever 1
   cats 1
   are 1
   clever 1
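A minimal word-count mapper along these lines might look like this (a sketch only; the slide's actual mapper.rb is not shown, and map_line is a hypothetical helper):

```ruby
# Sketch of a word-count mapper (hypothetical mapper.rb): turn each
# input line into one "word<TAB>1" pair per token, as on this slide.
def map_line(line)
  line.chomp.split(/\s+/).map { |word| "#{word}\t1" }
end

# Stream mode, as Hadoop streaming would invoke it:
#   cat clever_critters.txt | ruby mapper.rb
if __FILE__ == $PROGRAM_NAME
  ARGF.each { |line| puts map_line(line) }
end
```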
Work Partitioning
These key/value pairs are passed to a "partition function"
which organizes the output and assigns it to reducer nodes

   foxes -> node 1
   are -> node 2
   clever -> node 3
   cats -> node 4
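The assignment above can be sketched as a simple hash-style partition function. (Hadoop's default partitioner hashes the key; the byte sum below is an illustrative stand-in, not the real implementation.)

```ruby
# Sketch of a partition function: deterministically map each key to one
# of num_reducers nodes, so identical keys always reach the same reducer.
# (Byte sum used purely for illustration.)
def partition(key, num_reducers)
  key.bytes.sum % num_reducers
end
```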
Practical Reduction
The Reducers each receive the sharded
workload assigned to them by the partitioning.

Typically the work is received as a stream of
key/value pairs via STDIN:
 "foxes 1" -> node 1
 "are 1|are 1" -> node 2
 "clever 1|clever 1" -> node 3
 "cats 1|cats 1" -> node 4
Practical Reduction Cont'd
The reduction is essentially whatever you want it to be.
There are common patterns that are often pre-solved by
the map-reduce framework.

See Hadoop's Built-In Reducers

e.g., "aggregate" - give me a total of all the key/values
  foxes - 1
  are - 2
  clever - 2
  cats - 1
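A count-summing reducer like "aggregate" can be sketched in a few lines (an assumption-laden sketch, not Hadoop's actual implementation; `reduce` is a hypothetical helper operating on the "key<TAB>count" lines routed to one node):

```ruby
# Sketch of an "aggregate"-style reducer: sum the counts for each
# "key<TAB>count" line that the partitioner routed to this node.
def reduce(lines)
  totals = Hash.new(0)            # default every key's total to 0
  lines.each do |line|
    key, count = line.chomp.split("\t")
    totals[key] += count.to_i
  end
  totals
end
```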
What is Hadoop?
From Wikipedia:
Apache Hadoop is a software framework that supports data-intensive distributed applications under a
free license. It enables applications to work with thousands of computationally independent
computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File
System (GFS) papers.


Essentially, Hadoop is a practical implementation of all the pieces you'd need to
accomplish everything we've discussed thus far. It takes in the data, organizes
the tasks, passes the data through its entire path and finally outputs the
reduction.
Hadoop's Guts




source: http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html
Fun to build?



    No
Solution?
Amazon's Elastic MapReduce
Look complex? It's not
1.   Sign up for the service
2.   Download the tools (requires ruby 1.8)
3.   mkdir ~/elastic-mapreduce-cli; cd ~/elastic-mapreduce-cli
4.   Create your credentials.json file
      {
      "access_id": "<key>",
      "private_key": "<secret key>",
      "keypair": "<name of keypair>",
      "key-pair-file": "~/.ssh/<key>.pem",
      "log_uri": "s3://<unique s3 bucket>/",
      "region": "us-east-1"
      }

5. unzip ~/Downloads/elastic-mapreduce-ruby.zip
Run it

  ruby elastic-mapreduce --list
  ruby elastic-mapreduce --create --alive
  ruby elastic-mapreduce --list
  ruby elastic-mapreduce --terminate <JobFlowID>

  Note you can also view it in the Amazon EMR web interface

  Logs can be viewed by looking into the s3 bucket you specified in your
  credentials.json file. Just drill down via the s3 web interface and
  double-click the file.
Creating a minimal job
1. Set up a dedicated s3 bucket

2. Create a folder called "input" in that bucket

3. Upload your inputs into s3://bucket/input
     s3cmd put *log s3://bucket/input
Minimal Job Cont'd
4. Write a mapper
     eg:
     ARGF.each do |line|

        # remove any newline
        line = line.chomp

        if /ERROR/.match(line)
           puts "ERROR\t1"
        end
        if /INFO/.match(line)
           puts "INFO\t1"
        end
        if /DEBUG/.match(line)
           puts "DEBUG\t1"
        end
     end


See http://www.cloudera.com/blog/2011/01/map-reduce-with-ruby-using-apache-hadoop/ for
examples
Minimal Job Cont'd
5. Upload your mapper to your s3 bucket
     s3cmd put mapper.rb s3://bucket


6. Run it
     elastic-mapreduce --create --stream \
          --mapper s3://bucket/mapper.rb \
          --input s3://bucket/input \
          --output s3://bucket/output \
          --reducer aggregate


      NOTE: This job uses the built-in aggregator.
      NOTE: The output directory must NOT exist at the time of the run

      Amazon will scale EC2 instances to consume the load dynamically.

7. Pick up your results in the output folder
AWS Demo App
AWS has a very cool publicly-available app to
run:

elastic-mapreduce --create --stream \
     --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
     --input s3://elasticmapreduce/samples/wordcount/input \
     --output s3://bucket/output \
     --reducer aggregate



See Amazon Example Doc
Possibilities
EMR is a fully-functional Hadoop
implementation.

Mappers and reducers can be written in Python,
Ruby, PHP, and Java

Go crazy.
Further Reading
Tom White's O'Reilly book, Hadoop: The Definitive Guide

AWS EMR Getting Started Guide

Hadoop Wiki
