Getting Started with Hadoop
                      with Amazon’s Elastic MapReduce

                           Scott Hendrickson
                           scott@drskippy.net
          http://drskippy.net/projects/EMR-HadoopMeetup.pdf

                                    Boulder/Denver Hadoop Meetup


                                           8 July 2010




Scott Hendrickson (Hadoop Meetup)            EMR-Hadoop            8 July 2010   1 / 43
Agenda


1   Amazon Web Services

2   Interlude: Solving problems with Map and Reduce

3   Running MapReduce on Amazon Elastic MapReduce
      Example 1: Streaming Work Flow with AWS Management Console
      Example 2 - Word count (Slightly more useful)
      Example 3 - elastic-mapreduce command line tool

4   References and Notes



Amazon Web Services


What is Amazon Web Services?


For a first Hadoop project on AWS, use these services:
       Elastic Compute Cloud (EC2)
       Amazon Simple Storage Service (S3)
       Elastic MapReduce (EMR)
For future projects, AWS is much more:
       SimpleDB, Relational Database Service (RDS)
       Simple Queue Service (SQS), Simple Notification Service (SNS)
       Alexa
       Mechanical Turk
       ...





Signing up for AWS



   1   Create an AWS account - http://aws.amazon.com/
   2   Sign up for EC2 cloud compute services -
       http://aws.amazon.com/ec2/
   3   Set up Security Credentials (under menu Account|Security
       Credentials) - there are 3 kinds of credentials; create an “Access
       Key” and use it to access S3 storage
   4   Sign up for S3 storage services - http://aws.amazon.com/s3/
   5   Sign up for EMR - http://aws.amazon.com/elasticmapreduce/






What are S3 buckets?


Streaming EMR projects use Simple Storage Service (S3) Buckets for
data, code, logging and output.
            Bucket: “A bucket is a container for objects stored in Amazon S3.
                    Every object is contained in a bucket.” Bucket names
                    must be globally unique.
            Object: “Entities stored in Amazon S3. Objects consist of object
                    data and metadata.” Metadata consists of key-value pairs.
                    Object data is opaque.
               Key: “An object is uniquely identified within a bucket by a key
                    (name) and a version ID.”






Accessing objects in S3 buckets

Want to:
   1   Move data into and out of S3 buckets
   2   Set access privileges
Tools:
       S3 console in your AWS control panel is adequate for managing S3
       buckets and objects one at a time
       Other browser options: good for multiple file upload/download -
       Firefox S3
       https://addons.mozilla.org/en-US/firefox/addon/3247/ ; or
       minimal - S3 plug-in for Chrome
       https://chrome.google.com/extensions/detail/appeggcmoaojledegaonmdaakfhjhchf
       Programmatic options: Web Services (both SOAP-y and REST-ful):
       wget, curl, Python, Ruby, Java . . .



S3 Bucket Example 1 - RESTful GET

Example - Image object
Bucket: bsi-test
Key: image.jpg
Object: JPEG structured data from image.jpg
RESTful GET access, use URL:
http://s3.amazonaws.com/bsi-test/image.jpg

Example - Text file object
Bucket: bsi-test
Key: foobar
Object: text
RESTful GET access, use URL:
http://s3.amazonaws.com/bsi-test/foobar
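
Both URLs follow one pattern: endpoint, then bucket, then key. A minimal Python 3 sketch of that pattern follows. The helper name s3_object_url is made up for illustration, and a plain GET on the resulting URL only succeeds if the object is publicly readable.

```python
from urllib.parse import quote

S3_ENDPOINT = "http://s3.amazonaws.com"  # endpoint used on the slide


def s3_object_url(bucket, key):
    """Build the path-style URL for an object (hypothetical helper)."""
    # quote() percent-encodes characters that are unsafe in a URL path;
    # "/" is left alone, so keys such as "oneCount/mapper.py" keep their shape.
    return "%s/%s/%s" % (S3_ENDPOINT, bucket, quote(key))


print(s3_object_url("bsi-test", "image.jpg"))
print(s3_object_url("bsi-test", "foobar"))
```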




S3 Bucket Example 2

Example - Python, Boto, Metadata
#!/usr/bin/env python
from boto.s3.connection import S3Connection

# Replace the placeholders with your AWS Access Key ID and Secret Access Key
conn = S3Connection('key-id', 'secret-key')
bucket = conn.get_bucket('bsi-test')

# Image object: read one metadata value, then download the object data
k = bucket.get_key('image.jpg')
print("Value for key 'x-amz-meta-s3fox-modifiedtime' is:")
print(k.get_metadata('s3fox-modifiedtime'))
k.get_contents_to_filename('deleteme.jpg')

# Text object: print the object data, then one metadata value
k = bucket.get_key('foobar')
print("Object value for key 'foobar' is:")
print(k.get_contents_as_string())
print("Value for key 'x-amz-meta-example-key' is:")
print(k.get_metadata('example-key'))


S3 Bucket Example 2



Example - Python, Boto, Metadata - Output

scott@mowgli-ubuntu:~/Dropbox/hadoop$ ./botoExample.py
Value for key ’x-amz-meta-s3fox-modifiedtime’ is:
1273869756000
Object value for key ’foobar’ is:
This is a test of S3
Value for key ’x-amz-meta-example-key’ is:
This is an example value.
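
Two details of this output, sketched in Python 3. The metadata name passed to get_metadata is the short form of an x-amz-meta- header, as the print statements indicate. The s3fox-modifiedtime value looks like milliseconds since the Unix epoch; that reading is an assumption, not something the slides state. Both helper names are hypothetical.

```python
from datetime import datetime, timezone

META_PREFIX = "x-amz-meta-"  # header prefix shown in the example output


def metadata_header(name):
    """Header name for a user metadata key (hypothetical helper)."""
    return META_PREFIX + name


def modified_time(value):
    """Decode a millisecond epoch string (assumed format) to a UTC datetime."""
    return datetime.fromtimestamp(int(value) / 1000.0, tz=timezone.utc)


print(metadata_header("s3fox-modifiedtime"))
print(modified_time("1273869756000").isoformat())
```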






What is Elastic Map Reduce?




                  Hadoop: Hosted Hadoop framework running on EC2 and S3.
                Job Flow: Processing steps EMR “runs on a specified dataset
                          using a set of Amazon EC2 instances.”
            S3 Bucket(s): Input data, output, scripts, jars, logs.






Controlling Job Flows

Want to:
   1   Configure jobs
   2   Start jobs
   3   Check status or stop jobs
Tools:
       AWS Management Console
       https://console.aws.amazon.com/elasticmapreduce/home
       Command Line Tools
       (requires Ruby [sudo apt-get install ruby libopenssl-ruby])
       http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=262
       API calls defined by the service (REST-ful and SOAP-y)




EMR Example 1 - Running a simple Work Flow from the
AWS Management Console




EMR Example 1
                                        Hold up a minute. . . !

                                    What problem are we solving?




Interlude: Solving problems with Map and Reduce




Central MapReduce Ideas

       Operate on key-value pairs
       Data scientist provides map and reduce
                           (input)  <k1, v1>  --map-->  <k2, v2>
                           <k2, v2>  --combine, sort-->  <k2, v2>
                           <k2, v2>  --reduce-->  <k3, v3>  (output)

       Optional: a combine step provided in map may significantly reduce
       bandwidth between workers
       Efficient sort provided by the MapReduce library; implies an efficient
       compare(k2a, k2b)
       “Implicit” parallelization - splitting and distributing data, starting
       maps and reduces, collecting output
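
The key-value flow can be sketched as a single-process Python 3 driver. This is an illustration only, with no distribution or fault tolerance; run_mapreduce, parity_mapper and sum_reducer are made-up names, and input keys (k1) are left out because the toy input is a plain list of integers.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Run map, sort and reduce in one process (illustrative helper)."""
    # map: each input record may emit any number of <k2, v2> pairs
    intermediate = [pair for record in records for pair in mapper(record)]
    # sort: bring equal keys together, the job the library does for you
    intermediate.sort(key=itemgetter(0))
    # reduce: one call per distinct key, with all of that key's values
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def parity_mapper(n):
    # user-supplied map: key each integer by its parity
    yield ("even" if n % 2 == 0 else "odd", n)


def sum_reducer(key, values):
    # user-supplied reduce: add up the values seen for one key
    yield (key, sum(values))


result = run_mapreduce([3, 4, -1, 4, -3, 1, 1], parity_mapper, sum_reducer)
print(result)
```

Only parity_mapper and sum_reducer are problem-specific; the driver stays the same for any job, which is the division of labor the slide describes.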


Key components of MapReduce framework


(wikipedia http://en.wikipedia.org/wiki/MapReduce)
The frozen part of the MapReduce framework is a large distributed sort.
The hot spots, which the application defines, are:
   1   input reader
   2   Map function
   3   partition function
   4   compare function
   5   Reduce function
   6   output writer
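
Of these hot spots, the partition function is the least visible in the later examples. It decides which reducer receives a given intermediate key. A common choice is a hash of the key modulo the number of reducers; the sketch below (Python 3) assumes that choice, and the name partition is hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Map an intermediate key to a reducer index (hypothetical helper)."""
    # crc32 is stable across runs and machines, unlike Python's built-in
    # hash() for strings, which is randomized per process by default.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


words = ["Hello", "World", "Bye", "World", "Hello", "Hadoop"]
buckets = {}
for word in words:
    buckets.setdefault(partition(word, 3), []).append(word)
print(buckets)
```

Every copy of a key lands in the same bucket, so a single reducer sees all values for that key.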






Google Tutorial View
   1   MapReduce library shards the input files and starts up many copies of
       the program on a cluster.
   2   Master assigns work to workers. There are map and reduce tasks.
   3   Workers assigned map tasks read the contents of their input shard, parse
       key-value pairs and pass the pairs to the map function. Intermediate
       key-value pairs produced by the map function are buffered in memory.
   4   Periodically, buffered pairs are written to local disk, partitioned into
       regions. Locations of these pairs on the local disk are passed to the master.
   5   When a reduce worker has read all intermediate data, it sorts by the
       intermediate keys. All occurrences of a key are grouped together.
   6   Reduce workers pass a key and the corresponding set of intermediate
       values to the reduce function.
   7   Output of the reduce function is appended to a final output file for
       each reduce partition.


MapReduce Example 1 - Word Count - Data




(from Apache Hadoop tutorial)
Example: Word Count
file1:
Hello World Bye World
file2:
Hello Hadoop Goodbye Hadoop






MapReduce Example 1 - Word Count - Map


Example: Word Count
The first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>





MapReduce Example 1 - Word Count - Sort and Combine


Example: Word Count
The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>

The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>






MapReduce Example 1 - Word Count - Sort and Reduce



Example: Word Count
The Reducer method sums up the values for each key.

The output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
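
The three stages of this example can be replayed in a few lines of Python 3. This is a sketch using the two input files from the Data slide; the function names are made up, and everything runs in one process rather than on a cluster.

```python
from collections import Counter

files = {
    "file1": "Hello World Bye World",
    "file2": "Hello Hadoop Goodbye Hadoop",
}


def map_words(text):
    # map: emit <word, 1> for every word
    return [(word, 1) for word in text.split()]


def combine(pairs):
    # combine: sum the counts locally, per map task, and sort by key
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())


def reduce_counts(combined_outputs):
    # reduce: sum the partial counts for each key across all maps
    totals = Counter()
    for pairs in combined_outputs:
        for word, n in pairs:
            totals[word] += n
    return sorted(totals.items())


combined = [combine(map_words(text)) for text in files.values()]
for pairs in combined:
    print(pairs)
print(reduce_counts(combined))
```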






What problems is MapReduce good at solving?


Themes:
       Identify, transform, aggregate, filter, count, sort. . .
       Requirement of global knowledge of data is (a) “occasional” (vs. cost
       of map) (b) confined to ordinality
       Discovery tasks (vs. high repetition of similar transactional tasks,
       many reads)
       Unstructured data (vs. tabular, indexes!)
       Continuously updated data (indexing cost)
       Many, many, many machines (fault tolerance)






What problems is MapReduce good at solving?

Memes:
       MapReduce ⇔ SQL (read the comments too)
       http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html
       MapReduce vs. Message Passing Interface (MPI): “MPI is good for
       task parallelism and Hadoop is good for Data Parallelism.” finite
       differences, finite elements, particle-in-cell. . .
       MapReduce vs. column-oriented DBs: tabular data, indexes
       (cantankerous old farts!)
       http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/
       and
       http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
       MapReduce vs. relational DBs
       http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php

Running MapReduce on Amazon Elastic MapReduce   Example 1: Streaming Work Flow with AWS Management Console




Example 1 - Add up integers

Data
3
4
-1
4
-3
1
1
...

Map
#!/usr/bin/env python
import sys
# Emit every integer under the single key "sum", separated by a tab
for line in sys.stdin:
    print('%s\t%d' % ("sum", int(line)))



Example 1 - Add up integers

Reduce
#!/usr/bin/env python
import sys
sum_of_ints = 0
for line in sys.stdin:
    key, value = line.split('\t')  # key is always the same
    try:
        sum_of_ints += int(value)
    except ValueError:
        pass  # skip values that are not integers
try:
    print("%s\t%d" % (key, sum_of_ints))
except NameError:  # no items processed, so key was never assigned
    pass




Example 1 - Add up integers



Shell test
cat ./input/ints.txt | ./mapper.py > ./inter
cat ./input/ints1.txt | ./mapper.py >> ./inter
cat ./input/ints2.txt | ./mapper.py >> ./inter
cat ./input/ints3.txt | ./mapper.py >> ./inter
echo "Intermediate output:"
cat ./inter
cat ./inter | sort | \
           ./reducer.py > ./output/cmdLineOutput.txt






Example 1 - Add up integers



What was that comment earlier about an optional combiner?
Combiner in map
#!/usr/bin/env python
import sys
sum_of_ints = 0
# Combine inside the mapper: emit one partial sum per input split
for line in sys.stdin:
    sum_of_ints += int(line)
print('%s\t%d' % ("sum", sum_of_ints))






Example 1 - Add up integers



Combiner shell test
cat ./input/ints.txt | ./mapper_combine.py > ./inter
cat ./input/ints1.txt | ./mapper_combine.py >> ./inter
cat ./input/ints2.txt | ./mapper_combine.py >> ./inter
cat ./input/ints3.txt | ./mapper_combine.py >> ./inter
echo "Intermediate output:"
cat ./inter
cat ./inter | sort | \
          ./reducer.py > ./output/cmdLineCombOutput.txt
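
What the combiner test should show: the same final sum, with far fewer intermediate records. A Python 3 sketch of that comparison, run in one process. The function names are made up; the first input split holds the integers from the Data slide, and the second split is invented for illustration.

```python
def mapper(lines):
    # plain mapper: one intermediate record per input integer
    return ["sum\t%d" % int(line) for line in lines]


def mapper_combine(lines):
    # combining mapper: one partial sum per input split
    return ["sum\t%d" % sum(int(line) for line in lines)]


def reducer(records):
    # reducer: add up every value that arrives under the key
    total = 0
    for record in sorted(records):
        _key, value = record.split("\t")
        total += int(value)
    return total


# Two input splits; the second one is hypothetical
splits = [["3", "4", "-1", "4", "-3", "1", "1"], ["10", "20"]]

plain = [rec for split in splits for rec in mapper(split)]
combined = [rec for split in splits for rec in mapper_combine(split)]

print(len(plain), reducer(plain))
print(len(combined), reducer(combined))
```

The reducer is unchanged in both runs, which works here because addition is associative and commutative. A combiner for something like a mean would need to carry counts as well.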






Example 1 - Add up integers - AWS Console


   1   Upload the oneCount directory with Firefox S3 (FFS3)
   2   Create a New Job Flow
       Name: “oneCount”
       Job Flow: Run own app
       Job Type: Streaming
   3   Input: bsi-test/oneCount/input
       Output: bsi-test/oneCount/outputConsole (must not exist)
       Mapper: bsi-test/oneCount/mapper.py
       Reducer: bsi-test/oneCount/reducer.py
       Extra Args: none






Example 1 - Add up integers - AWS Console



   4   Instances: 4
       Type: small
       Keypair: No (Yes allows ssh to Hadoop master)
       Log: yes
       Log Location: bsi-test/oneCount/log
       Hadoop Debug: no
   5   No bootstrap actions
   6   Start it, and wait. . .




Running MapReduce on Amazon Elastic MapReduce   Example 2 - Word count (Slightly more useful)




Example 2 - Word count


Map
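#!/usr/bin/env python
import string
import sys

def read_input(file):
    # Yield each input line as a list of whitespace-separated words
    for line in file:
        yield line.split()

def main(separator='\t'):
    data = read_input(sys.stdin)
    for words in data:
        for word in words:
            # Normalize: lower-case and strip surrounding punctuation
            lword = word.lower().strip(string.punctuation)
            print('%s%s%d' % (lword, separator, 1))

if __name__ == "__main__":
    main()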





Example 2 - Word count
Reduce
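#!/usr/bin/env python
import sys
from itertools import groupby
from operator import itemgetter

def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    data = read_mapper_output(sys.stdin,
                              separator=separator)
    # groupby relies on the input being sorted by key
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count)
                              for _, count in group)
            print("%s%s%d" % (current_word,
                              separator, total_count))
        except ValueError:
            pass  # malformed record or non-numeric count; skip this word

if __name__ == "__main__":
    main()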


Example 2 - Word count




Shell test
echo "foo foo quux labs foo bar quux" | ./mapper.py
echo "foo foo quux labs foo bar quux" | ./mapper.py \
           | sort | ./reducer.py
cat ./input/alice.txt | ./mapper.py \
           | sort | ./reducer.py > ./output/cmdLineOutput.txt
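
The sort in the middle of these pipelines is not optional: the reducer groups consecutive records, so equal keys must arrive next to each other. A Python 3 sketch of the same test in one process, with made-up function names, shows what happens with and without it.

```python
import string
from itertools import groupby
from operator import itemgetter


def map_phase(text):
    # same normalization as the mapper: lower-case, strip punctuation
    return [(w.lower().strip(string.punctuation), 1) for w in text.split()]


def reduce_phase(pairs):
    # same grouping as the reducer: consecutive equal keys are summed
    return [(word, sum(n for _, n in group))
            for word, group in groupby(pairs, key=itemgetter(0))]


pairs = map_phase("foo foo quux labs foo bar quux")
with_sort = reduce_phase(sorted(pairs))  # sorted, as after `sort` or the shuffle
without_sort = reduce_phase(pairs)       # unsorted: repeated keys get split up
print(with_sort)
print(without_sort)
```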






Example 2 - Word count - AWS Console


   1   Upload the myWordCount directory with Firefox S3 (FFS3)
   2   Create a New Job Flow
       Name: “myWordCount”
       Job Flow: Run own app
       Job Type: Streaming
   3   Input: bsi-test/myWordCount/input
       Output: bsi-test/myWordCount/outputConsole (must not exist)
       Mapper: bsi-test/myWordCount/mapper.py
       Reducer: bsi-test/myWordCount/reducer.py
       Extra Args: none






Example 2 - Word count - AWS Console



   4   Instances: 4
       Type: small
       Keypair: No (Yes allows ssh to Hadoop master)
       Log: yes
       Log Location: bsi-test/myWordCount/log
       Hadoop Debug: no
   5   No bootstrap actions
   6   Start it, and wait. . .




Running MapReduce on Amazon Elastic MapReduce   Example 3 - elastic-mapreduce command line tool




Example 3 - elastic-mapreduce command line tool


Word count (again, only better)
/usr/local/emr-ruby/elastic-mapreduce --create \
      --stream \
      --num-instances 2 \
      --name from-elastic-mapreduce \
      --input s3n://bsi-test/myWordCount/input \
      --output s3n://bsi-test/myWordCount/outputRubyTool \
      --mapper s3n://bsi-test/myWordCount/mapper.py \
      --reducer s3n://bsi-test/myWordCount/reducer.py \
      --log-uri s3n://bsi-test/myWordCount/log

/usr/local/emr-ruby/elastic-mapreduce --list
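
When the same job is launched repeatedly, the argument list can be assembled in a script. A minimal Python 3 sketch, using only the flags shown in the command above; the helper name emr_streaming_command and the assumed directory layout (input, mapper.py, reducer.py and log under one S3 prefix) are for illustration only.

```python
def emr_streaming_command(tool, name, bucket_dir, output_name, num_instances=2):
    """Assemble the argument list for a streaming job flow (hypothetical helper)."""
    base = "s3n://" + bucket_dir
    return [
        tool, "--create", "--stream",
        "--num-instances", str(num_instances),
        "--name", name,
        "--input", base + "/input",
        "--output", base + "/" + output_name,
        "--mapper", base + "/mapper.py",
        "--reducer", base + "/reducer.py",
        "--log-uri", base + "/log",
    ]


argv = emr_streaming_command(
    "/usr/local/emr-ruby/elastic-mapreduce",
    name="from-elastic-mapreduce",
    bucket_dir="bsi-test/myWordCount",
    output_name="outputRubyTool",
)
print(" ".join(argv))
# To launch for real: subprocess.check_call(argv)  (starts billable EC2 instances)
```

The output location must not already exist, so a fresh output_name is needed for each run.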


References and Notes




MapReduce Concepts Links



       Google MapReduce Tutorial:
       http://code.google.com/edu/parallel/mapreduce-tutorial.html
       Apache Hadoop tutorial:
       http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
       Google Code University presentation on MapReduce:
       http://code.google.com/edu/submissions/mapreduce/listing.html
       MapReduce framework paper:
       http://labs.google.com/papers/mapreduce-osdi04.pdf






Amazon Web Services Links



       EMR Getting Started documentation:
       http://aws.amazon.com/documentation/elasticmapreduce/
       Getting started with Amazon S3:
       http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/
       PIG on EMR:
       http://s3.amazonaws.com/awsVideos/AmazonElasticMapReduce/ElasticMapReduce-PigTutorial.html
       Boto Python library (multiple Amazon Services):
       http://code.google.com/p/boto/






Machine Learning




       Linear speedup (with processor number) for “locally weighted linear
       regression (LWLR), k-means, logistic regression (LR), naive Bayes
       (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM,
       and backpropagation (NN)”:
       http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
       Mahout framework: http://mahout.apache.org/






Examples Links



       Wordcount example/tutorial:
       http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
       CouchDB and MapReduce (interesting examples of MR
       implementations for common problems)
       http://wiki.apache.org/couchdb/View_Snippets
       This presentation:
       http://drskippy.net/projects/EMR-HadoopMeetup.pdf or
       presentation source, example files etc.:
       http://drskippy.net/projects/EMR-HadoopMeetup.zip




Scott Hendrickson (Hadoop Meetup)                  EMR-Hadoop   8 July 2010   43 / 43

Contenu connexe

Dernier

URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 

Dernier (20)

URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 



Amazon Elastic MapReduce -- Getting started with Hadoop

  • 1. Getting Started with Hadoop with Amazon’s Elastic MapReduce Scott Hendrickson scott@drskippy.net http://drskippy.net/projects/EMR-HadoopMeetup.pdf Boulder/Denver Hadoop Meetup 8 July 2010 Scott Hendrickson (Hadoop Meetup) EMR-Hadoop 8 July 2010 1 / 43
  • 2. Agenda
      1. Amazon Web Services
      2. Interlude: Solving problems with Map and Reduce
      3. Running MapReduce on Amazon Elastic MapReduce
         - Example 1: Streaming Work Flow with AWS Management Console
         - Example 2: Word count (slightly more useful)
         - Example 3: elastic-mapreduce command line tool
      4. References and Notes
  • 3. Amazon Web Services: What is Amazon Web Services?
      For a first Hadoop project on AWS, use these services:
        - Elastic Compute Cloud (EC2)
        - Amazon Simple Storage Service (S3)
        - Elastic MapReduce (EMR)
      For future projects, AWS offers much more:
        - SimpleDB, Relational Database Service (RDS)
        - Simple Queue Service (SQS), Simple Notification Service (SNS)
        - Alexa
        - Mechanical Turk
        - ...
  • 4. Signing up for AWS
      1. Create an AWS account: http://aws.amazon.com/
      2. Sign up for EC2 cloud compute services: http://aws.amazon.com/ec2/
      3. Set up Security Credentials (under the menu Account | Security Credentials). There are 3 kinds of credentials; you need to create an "Access Key" and use it to access S3 storage.
      4. Sign up for S3 storage services: http://aws.amazon.com/s3/
      5. Sign up for EMR: http://aws.amazon.com/elasticmapreduce/
  • 5. What are S3 buckets?
      Streaming EMR projects use Simple Storage Service (S3) buckets for data, code, logging and output.
        - Bucket: "A bucket is a container for objects stored in Amazon S3. Every object is contained in a bucket." Bucket names must be globally unique.
        - Object: "Entities stored in Amazon S3. Objects consist of object data and metadata." Metadata consists of key-value pairs. Object data is opaque.
        - Object keys: "An object is uniquely identified within a bucket by a key (name) and a version ID."
  • 6. Accessing objects in S3 buckets
      Want to:
        1. Move data into and out of S3 buckets
        2. Set access privileges
      Tools:
        - The S3 console in your AWS control panel is adequate for managing S3 buckets and objects one at a time.
        - Other browser options: Firefox S3, good for multiple-file upload/download, https://addons.mozilla.org/en-US/firefox/addon/3247/ ; or, for something minimal, the S3 plug-in for Chrome, https://chrome.google.com/extensions/detail/appeggcmoaojledegaonmdaakfhjhchf
        - Programmatic options: Web Services (both SOAP-y and REST-ful) through wget, curl, Python, Ruby, Java, ...
  • 7. S3 Bucket Example 1: RESTful GET
      Example: image object
        Bucket: bsi-test
        Key: image.jpg
        Object: JPEG structured data from image.jpg
        RESTful GET access, use URL: http://s3.amazonaws.com/bsi-test/image.jpg
      Example: text file object
        Bucket: bsi-test
        Key: foobar
        Object: text
        RESTful GET access, use URL: http://s3.amazonaws.com/bsi-test/foobar
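Both URLs on slide 7 follow the same pattern: endpoint, then bucket, then key. The sketch below only illustrates that pattern. The helper name `s3_get_url` is hypothetical, and the percent-encoding of the key is an assumption added here for keys containing spaces or other reserved characters. A plain GET on such a URL only succeeds if the object is readable by the caller.

```python
from urllib.parse import quote

S3_ENDPOINT = "http://s3.amazonaws.com"  # endpoint shown on the slide


def s3_get_url(bucket, key):
    """Build the path-style GET URL for an object (hypothetical helper)."""
    # Percent-encode the key but keep '/' so nested keys stay readable.
    return "%s/%s/%s" % (S3_ENDPOINT, bucket, quote(key, safe="/"))


if __name__ == "__main__":
    print(s3_get_url("bsi-test", "image.jpg"))
    print(s3_get_url("bsi-test", "foobar"))
```

The resulting string can be handed to wget, curl, or any HTTP client.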
  • 8. S3 Bucket Example 2: Python, Boto, Metadata
      from boto.s3.connection import S3Connection

      # Replace the placeholders with your own Access Key ID and Secret Access Key
      conn = S3Connection('key-id', 'secret-key')
      bucket = conn.get_bucket('bsi-test')

      # Image object: read one metadata value, then download the object to a local file
      k = bucket.get_key('image.jpg')
      print("Value for key 'x-amz-meta-s3fox-modifiedtime' is:")
      print(k.get_metadata('s3fox-modifiedtime'))
      k.get_contents_to_filename('deleteme.jpg')

      # Text object: read the object data and one metadata value
      k = bucket.get_key('foobar')
      print("Object value for key 'foobar' is:")
      print(k.get_contents_as_string())
      print("Value for key 'x-amz-meta-example-key' is:")
      print(k.get_metadata('example-key'))
  • 9. S3 Bucket Example 2: Python, Boto, Metadata (output)
      scott@mowgli-ubuntu:~/Dropbox/hadoop$ ./botoExample.py
      Value for key 'x-amz-meta-s3fox-modifiedtime' is:
      1273869756000
      Object value for key 'foobar' is:
      This is a test of S3
      Value for key 'x-amz-meta-example-key' is:
      This is an example value.
  • 10. What is Elastic MapReduce?
      - Hadoop: hosted Hadoop framework running on EC2 and S3.
      - Job Flow: processing steps EMR "runs on a specified dataset using a set of Amazon EC2 instances."
      - S3 Bucket(s): input data, output, scripts, jars, logs.
  • 11. Controlling Job Flows
      Want to:
        1. Configure jobs
        2. Start jobs
        3. Check status or stop jobs
      Tools:
        - AWS Management Console: https://console.aws.amazon.com/elasticmapreduce/home
        - Command Line Tools (requires Ruby: sudo apt-get install ruby libopenssl-ruby): http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=262
        - API calls defined by the service (REST-ful and SOAP-y)
  • 12. EMR Example 1: Running a simple Work Flow from the AWS Management Console
      Hold up a minute...! What problem are we solving?
  • 13. Agenda, next section: 2. Interlude: Solving problems with Map and Reduce
  • 14. Central MapReduce Ideas
      - Operate on key-value pairs.
      - The data scientist provides map and reduce:
          (input) $\langle k_1, v_1 \rangle \xrightarrow{\text{map}} \langle k_2, v_2 \rangle$
          $\langle k_2, v_2 \rangle \xrightarrow{\text{combine, sort}} \langle k_2, v_2 \rangle$
          $\langle k_2, v_2 \rangle \xrightarrow{\text{reduce}} \langle k_3, v_3 \rangle$ (output)
      - Optional: a combine step provided in the map may significantly reduce bandwidth between workers.
      - An efficient sort is provided by the MapReduce library. This implies an efficient compare($k_{2a}$, $k_{2b}$).
      - "Implicit" parallelization: splitting and distributing data, starting maps and reduces, collecting output.
  • 15. Key components of the MapReduce framework (Wikipedia, http://en.wikipedia.org/wiki/MapReduce)
      The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:
        1. input reader
        2. Map function
        3. partition function
        4. compare function
        5. Reduce function
        6. output writer
  • 16. Google Tutorial View
      1. The MapReduce library shards the input files and starts up many copies on a cluster.
      2. The master assigns work to workers. There are map tasks and reduce tasks.
      3. Workers assigned map tasks read the contents of the input shard, parse key-value pairs and pass the pairs to the map function. Intermediate key-value pairs produced by the map function are buffered in memory.
      4. Periodically, buffered pairs are written to disk, partitioned into regions. Locations of buffered pairs on the local disk are passed to the master.
      5. When a reduce worker has read all intermediate data, it sorts by the intermediate keys. All occurrences of a key are grouped together.
      6. Reduce workers pass a key and the corresponding set of intermediate values to the reduce function.
      7. Output of the reduce function is appended to a final output file for each reduce partition.
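Step 4 says intermediate pairs are "partitioned into regions", and slide 15 lists the partition function as an application-defined hot spot. A minimal sketch of one common choice, hashing the key modulo the number of reduce partitions, is given below. It is an illustration only and not Hadoop's implementation; the function names and the use of CRC32 as the hash are assumptions made for this example.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reduce partition for a key (illustrative, not Hadoop's code)."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_pairs(pairs, num_reducers):
    """Split intermediate (key, value) pairs into one list per reduce partition."""
    regions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        regions[partition(key, num_reducers)].append((key, value))
    return regions


if __name__ == "__main__":
    pairs = [("Hello", 1), ("World", 1), ("Bye", 1), ("World", 1)]
    for index, region in enumerate(partition_pairs(pairs, 2)):
        print(index, region)
```

The property that matters is that every occurrence of a key lands in the same region, so a single reduce worker sees all values for that key.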
  • 17. MapReduce Example 1: Word Count, data (from the Apache Hadoop tutorial)
      file1: Hello World Bye World
      file2: Hello Hadoop Goodbye Hadoop
  • 18. MapReduce Example 1: Word Count, map
      The first map emits:
        <Hello, 1> <World, 1> <Bye, 1> <World, 1>
      The second map emits:
        <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
  • 19. MapReduce Example 1: Word Count, sort and combine
      The output of the first map:
        <Bye, 1> <Hello, 1> <World, 2>
      The output of the second map:
        <Goodbye, 1> <Hadoop, 2> <Hello, 1>
  • 20. MapReduce Example 1: Word Count, sort and reduce
      The Reducer method sums up the values for each key. The output of the job is:
        <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
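The word count walk-through on slides 17 to 20 can be reproduced in a single process. The sketch below is only a toy model of the data flow, not how Hadoop executes a job; the function names are chosen for this example. It uses the two input files from slide 17 and applies map, then sort and combine per map, then reduce.

```python
from collections import Counter


def map_words(text):
    """Map: emit a (word, 1) pair for every word in one input file."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combine: sum counts per word within one map's output, sorted by word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(pairs):
    """Reduce: sum counts per word across all maps, sorted by word."""
    # For a plain sum, the combine and reduce steps are the same operation.
    return combine(pairs)


file1 = "Hello World Bye World"
file2 = "Hello Hadoop Goodbye Hadoop"

combined1 = combine(map_words(file1))   # [('Bye', 1), ('Hello', 1), ('World', 2)]
combined2 = combine(map_words(file2))   # [('Goodbye', 1), ('Hadoop', 2), ('Hello', 1)]
result = reduce_counts(combined1 + combined2)

for word, count in result:
    print("%s\t%d" % (word, count))
```

The combine step can reuse the reduce logic here because addition is associative and commutative; that does not hold for every reduce function.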
  • 21. What problems is MapReduce good at solving? Themes:
      - Identify, transform, aggregate, filter, count, sort...
      - Requirement of global knowledge of data is (a) "occasional" (vs. cost of map), (b) confined to ordinality
      - Discovery tasks (vs. high repetition of similar transactional tasks, many reads)
      - Unstructured data (vs. tabular, indexes!)
      - Continuously updated data (indexing cost)
      - Many, many, many machines (fault tolerance)
  • 22. What problems is MapReduce good at solving? Memes:
      - MapReduce ⇔ SQL (read the comments too): http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html
      - MapReduce vs. Message Passing Interface (MPI): "MPI is good for task parallelism and Hadoop is good for Data Parallelism." (finite differences, finite elements, particle-in-cell...)
      - MapReduce vs. column-oriented DBs: tabular data, indexes (cantankerous old farts!) http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards/ and http://databasecolumn.vertica.com/database-innovation/mapreduce-ii/
      - MapReduce vs. relational DBs: http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php
  • 23. Agenda, next section: 3. Running MapReduce on Amazon Elastic MapReduce. Example 1: Streaming Work Flow with AWS Management Console
  • 24. Example 1: Add up integers
      Data (one integer per line):
        3 4 -1 4 -3 1 1 ...
      Map:
        import sys

        # Emit every integer under the single key "sum", tab-separated
        for line in sys.stdin:
            print('%s%s%d' % ("sum", '\t', int(line)))
  • 25. Example 1: Add up integers
      Reduce:
        import sys

        sum_of_ints = 0
        for line in sys.stdin:
            key, value = line.split('\t')  # key is always the same
            try:
                sum_of_ints += int(value)
            except ValueError:
                pass  # skip values that are not integers
        try:
            print("%s%s%d" % (key, '\t', sum_of_ints))
        except NameError:  # No items processed
            pass
  • 26. Example 1: Add up integers
      Shell test:
        cat ./input/ints.txt | ./mapper.py > ./inter
        cat ./input/ints1.txt | ./mapper.py >> ./inter
        cat ./input/ints2.txt | ./mapper.py >> ./inter
        cat ./input/ints3.txt | ./mapper.py >> ./inter
        echo "Intermediate output:"
        cat ./inter
        cat ./inter | sort | ./reducer.py > ./output/cmdLineOutput.txt
  • 27. Example 1: Add up integers
      What was that comment earlier about an optional combiner?
      Combiner in map:
        import sys

        # Sum locally and emit one partial sum per map instead of one pair per integer
        sum_of_ints = 0
        for line in sys.stdin:
            sum_of_ints += int(line)
        print('%s%s%d' % ("sum", '\t', sum_of_ints))
  • 28. Example 1: Add up integers
      Combiner shell test:
        cat ./input/ints.txt | ./mapper_combine.py > ./inter
        cat ./input/ints1.txt | ./mapper_combine.py >> ./inter
        cat ./input/ints2.txt | ./mapper_combine.py >> ./inter
        cat ./input/ints3.txt | ./mapper_combine.py >> ./inter
        echo "Intermediate output:"
        cat ./inter
        cat ./inter | sort | ./reducer.py > ./output/cmdLineCombOutput.txt
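The two shell tests on slides 26 and 28 should produce the same total, while the combiner version passes far fewer intermediate pairs to the reducer. A minimal in-process sketch of that comparison follows. The input splits are made-up stand-ins for ints.txt, ints1.txt and so on, and the functions mirror the slide scripts rather than reusing them.

```python
def mapper(lines):
    """Plain mapper: one ("sum", n) pair per input integer."""
    return [("sum", int(line)) for line in lines]


def mapper_combine(lines):
    """Mapper with a built-in combiner: one partial sum per input split."""
    return [("sum", sum(int(line) for line in lines))]


def reducer(pairs):
    """Reducer: add up every value (the key is always "sum")."""
    return sum(value for _key, value in pairs)


# Hypothetical input splits, one list of lines per input file
splits = [["3", "4", "-1"], ["4", "-3"], ["1", "1"]]

plain = [pair for split in splits for pair in mapper(split)]
combined = [pair for split in splits for pair in mapper_combine(split)]

print(len(plain), reducer(plain))        # 7 intermediate pairs, total 9
print(len(combined), reducer(combined))  # 3 intermediate pairs, total 9
```

With the combiner, the number of intermediate pairs equals the number of input splits, whatever the size of each split.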
  • 29. Example 1: Add up integers, AWS Console
      1. Upload the oneCount directory with FFS3
      2. Create a New Job Flow
           Name: "oneCount"
           Job Flow: Run own app
           Job Type: Streaming
      3. Input: bsi-test/oneCount/input
         Output: bsi-test/oneCount/outputConsole (must not exist)
         Mapper: bsi-test/oneCount/mapper.py
         Reducer: bsi-test/oneCount/reducer.py
         Extra Args: none
  • 30. Example 1: Add up integers, AWS Console (continued)
      4. Instances: 4
         Type: small
         Keypair: No (Yes allows ssh to the Hadoop master)
         Log: yes
         Log Location: bsi-test/oneCount/log
         Hadoop Debug: no
      5. No bootstrap actions
      6. Start it, and wait...
  • 31. Agenda, next section: 3. Running MapReduce on Amazon Elastic MapReduce. Example 2: Word count (slightly more useful)
  • 32. Example 2: Word count
      Map:
        import string
        import sys

        def read_input(file):
            # Yield each input line as a list of whitespace-separated words
            for line in file:
                yield line.split()

        def main(separator='\t'):
            data = read_input(sys.stdin)
            for words in data:
                for word in words:
                    # Normalize: lower-case and strip leading/trailing punctuation
                    lword = word.lower().strip(string.punctuation)
                    print('%s%s%d' % (lword, separator, 1))

        if __name__ == "__main__":
            main()
  • 33. Example 2: Word count
      Reduce:
        import sys
        from itertools import groupby
        from operator import itemgetter

        def read_mapper_output(file, separator='\t'):
            # Yield [word, count] for each tab-separated input line
            for line in file:
                yield line.rstrip().split(separator, 1)

        def main(separator='\t'):
            data = read_mapper_output(sys.stdin, separator=separator)
            # groupby only groups adjacent items, so the input must arrive sorted by word
            for current_word, group in groupby(data, itemgetter(0)):
                try:
                    total_count = sum(int(count) for current_word, count in group)
                    print("%s%s%d" % (current_word, separator, total_count))
                except ValueError:
                    pass  # skip groups with a non-integer count

        if __name__ == "__main__":
            main()
  • 34. Example 2: Word count
      Shell test:
        echo "foo foo quux labs foo bar quux" | ./mapper.py
        echo "foo foo quux labs foo bar quux" | ./mapper.py | sort | ./reducer.py
        cat ./input/alice.txt | ./mapper.py | sort | ./reducer.py > ./output/cmdLineOutput.txt
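The `sort` between mapper and reducer in the shell test is not optional: the reducer relies on `itertools.groupby`, which only merges adjacent items with equal keys. The sketch below shows the difference on the test sentence from the slide. `count_groups` is a hypothetical helper that mirrors the reducer's grouping logic without reading stdin.

```python
from itertools import groupby
from operator import itemgetter


def count_groups(pairs):
    """Sum counts per run of adjacent equal words, as the reducer does."""
    return [(word, sum(count for _word, count in group))
            for word, group in groupby(pairs, itemgetter(0))]


words = "foo foo quux labs foo bar quux".split()
pairs = [(word, 1) for word in words]

unsorted_result = count_groups(pairs)        # 'foo' and 'quux' appear twice
sorted_result = count_groups(sorted(pairs))  # one total per word

print(unsorted_result)
print(sorted_result)
```

On EMR the framework performs this sort between the map and reduce phases, which is why the same reducer works unchanged there.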
  • 35. Example 2: Word count, AWS Console
      1. Upload the myWordCount directory with FFS3
      2. Create a New Job Flow
           Name: "myWordCount"
           Job Flow: Run own app
           Job Type: Streaming
      3. Input: bsi-test/myWordCount/input
         Output: bsi-test/myWordCount/outputConsole (must not exist)
         Mapper: bsi-test/myWordCount/mapper.py
         Reducer: bsi-test/myWordCount/reducer.py
         Extra Args: none
  • 36. Example 2: Word count, AWS Console (continued)
      4. Instances: 4
         Type: small
         Keypair: No (Yes allows ssh to the Hadoop master)
         Log: yes
         Log Location: bsi-test/myWordCount/log
         Hadoop Debug: no
      5. No bootstrap actions
      6. Start it, and wait...
  • 37. Agenda, next section: 3. Running MapReduce on Amazon Elastic MapReduce. Example 3: elastic-mapreduce command line tool
  • 38. Example 3: elastic-mapreduce command line tool
      Word count (again, only better):
        /usr/local/emr-ruby/elastic-mapreduce --create --stream \
            --num-instances 2 \
            --name from-elastic-mapreduce \
            --input s3n://bsi-test/myWordCount/input \
            --output s3n://bsi-test/myWordCount/outputRubyTool \
            --mapper s3n://bsi-test/myWordCount/mapper.py \
            --reducer s3n://bsi-test/myWordCount/reducer.py \
            --log-uri s3n://bsi-test/myWordCount/log

        /usr/local/emr-ruby/elastic-mapreduce --list
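When the same job is launched repeatedly with different output locations, it can help to assemble the command from a script. The sketch below only builds the argument list and uses nothing beyond the flags and the install path shown on slide 38. The helper name `build_streaming_command` and the assumed directory layout under one base URI (input, mapper.py, reducer.py, log) are assumptions for this example.

```python
EMR_TOOL = "/usr/local/emr-ruby/elastic-mapreduce"  # install path from the slide


def build_streaming_command(base_uri, name, output_dir, num_instances=2):
    """Assemble the --create --stream argument list (hypothetical helper)."""
    return [
        EMR_TOOL, "--create", "--stream",
        "--num-instances", str(num_instances),
        "--name", name,
        "--input", base_uri + "/input",
        "--output", base_uri + "/" + output_dir,
        "--mapper", base_uri + "/mapper.py",
        "--reducer", base_uri + "/reducer.py",
        "--log-uri", base_uri + "/log",
    ]


cmd = build_streaming_command(
    "s3n://bsi-test/myWordCount", "from-elastic-mapreduce", "outputRubyTool")
print(" ".join(cmd))
# To submit for real, pass the list to subprocess.check_call(cmd).
```

Passing the arguments as a list avoids shell quoting problems. The output location still has to be one that does not exist yet, as noted on the console slides.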
  • 39. Agenda, next section: 4. References and Notes
  • 40. References and Notes: MapReduce concepts links
      - Google MapReduce Tutorial: http://code.google.com/edu/parallel/mapreduce-tutorial.html
      - Apache Hadoop tutorial: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
      - Google Code University presentation on MapReduce: http://code.google.com/edu/submissions/mapreduce/listing.html
      - MapReduce framework paper: http://labs.google.com/papers/mapreduce-osdi04.pdf
  • 41. References and Notes: Amazon Web Services links
      - EMR Getting Started documentation: http://aws.amazon.com/documentation/elasticmapreduce/
      - Getting started with Amazon S3: http://docs.amazonwebservices.com/AmazonS3/2006-03-01/gsg/
      - Pig on EMR: http://s3.amazonaws.com/awsVideos/AmazonElasticMapReduce/ElasticMapReduce-PigTutorial.html
      - Boto Python library (multiple Amazon services): http://code.google.com/p/boto/
  • 42. References and Notes: Machine learning
      - Linear speedup (with processor number) for "locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN)": http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
      - Mahout framework: http://mahout.apache.org/
  • 43. References and Notes: Examples links
      - Wordcount example/tutorial: http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
      - CouchDB and MapReduce (interesting examples of MR implementations for common problems): http://wiki.apache.org/couchdb/View_Snippets
      - This presentation: http://drskippy.net/projects/EMR-HadoopMeetup.pdf
      - Presentation source, example files etc.: http://drskippy.net/projects/EMR-HadoopMeetup.zip