Hadoop isn't limited to running Java code; you can write your jobs in a variety of dynamic languages.
This talk is about Hadoop's Streaming API and the best way we found to run Perl jobs on Amazon's Elastic MapReduce platform.
2. A Gentle Introduction to MapReduce
• Distributed computing model
• Mappers process the input and forward
intermediate results to reducers.
• Reducers aggregate these intermediate
results, and emit the final results.
4. MapReduce
• Input data is sent to mappers as (k, v) pairs.
• After processing, mappers emit (k_out, v_out) pairs.
• These pairs are sorted and sent to reducers.
• All (k_out, v_out) pairs for a given k_out are sent to a single reducer.
5. MapReduce
• Reducers get (k, [v_1, v_2, …, v_n]).
• After processing, the reducer emits a (k_f, v_f) per result (see the sketch below).
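To make the contract concrete, here is a minimal streaming reducer in Perl (a sketch, not code from the talk): it relies only on the fact that input lines arrive sorted by key, counts the values seen for each key, and emits one (k_f, v_f) pair per key.
#!/usr/bin/env perl
# Sketch only: count how many values arrive for each key on STDIN.
use strict;
use warnings;

my ( $current_key, $count ) = ( undef, 0 );
while ( <> ) {
    chomp;
    my ( $key, $value ) = split /\t/, $_, 2;
    if ( defined $current_key && $key ne $current_key ) {
        print "$current_key\t$count\n";    # the previous key is finished
        $count = 0;
    }
    $current_key = $key;
    $count++;
}
print "$current_key\t$count\n" if defined $current_key;    # flush the last key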
6. MapReduce
We wanted to have a world map showing
where people were starting our games (like
Mozilla Glow)
8. MapReduce
• Input: ( epoch, IP address )
• Mappers group these into 5-minute blocks and emit ( blockId, IP address ) (see the snippet below)
• Reducers get ( blockId, [ip_1, ip_2, …, ip_n] )
• Do a geo lookup and emit ( epoch, [ ( lat1, lon1 ), ( lat2, lon2 ), … ] )
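The grouping step is just integer division on the timestamp. A minimal version, assuming the epoch is in seconds (the real mapper shown later divides by 1000 first because its timestamps are in milliseconds):
my $block_id = int( $epoch / 300 );      # 300 s = one 5-minute block
print "$block_id\t$ip_address\n";        # emit ( blockId, IP address )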
11. Apache Hadoop
• Distributed programming framework
• Implements MapReduce
• Does all the usual distributed programming
heavy-lifting for you
• Highly fault-tolerant, with automatic task re-assignment in case of failure
• You focus on mappers and reducers
12. Apache Hadoop
• Native Java API
• Streaming API which can use mappers and
reducers written in any programming
language.
• Distributed file system (HDFS)
• Distributed Cache
13. Amazon Elastic MapReduce
• On-demand Hadoop clusters running on
EC2 instances.
• Improved S3 support for storage of input
and output data.
• Build workflows by sending jobs to a
cluster.
14. EMR Downsides
• No control over the machine images.
• Perl 5.8.8
• Ephemeral: when your cluster is shut down (or dies), HDFS is gone.
• HDFS not available at cluster-creation time.
• Debian
17. Composite Keys
• Reducers receive both keys and values sorted.
• Merge 3 tables (see the reducer sketch below):
userid, 0, … # customer info
userid, 1, … # payments history
userid, recordid1, … # clickstream
userid, recordid2, … # clickstream
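A sketch (not the talk's code) of how a reducer can exploit that ordering: the '0' record arrives first for each userid, then '1', then the clickstream records, so the customer info and payment history can be cached and joined onto each clickstream line.
#!/usr/bin/env perl
# Sketch only: merge customer info, payments, and clickstream records
# that arrive sorted on the composite key (userid, subkey).
use strict;
use warnings;

my ( $current_user, $info, $payments ) = ( '', '', '' );
while ( <> ) {
    chomp;
    my ( $userid, $subkey, $rest ) = split /\t/, $_, 3;
    if ( $userid ne $current_user ) {
        # key changed: everything for the previous user has been seen
        ( $current_user, $info, $payments ) = ( $userid, '', '' );
    }
    if    ( $subkey eq '0' ) { $info     = $rest; }    # customer info
    elsif ( $subkey eq '1' ) { $payments = $rest; }    # payments history
    else {                                             # clickstream record
        print join( "\t", $userid, $info, $payments, $rest ), "\n";
    }
}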
18. Streaming vs. Native
• Limited API
• About a 7-10% increase in run time
• About a 1000% decrease in development
time (as reported by a non-representative
sample of developers)
19. Where’s My Towel?
• Tasks run chrooted in a non-deterministic
location.
• It’s easy to store files in HDFS when submitting a job, but impossible to store directory trees.
• For native Java jobs, your dependencies get
packaged in the JAR alongside your code.
20. Streaming’s Little Helpers
Define your inputs and outputs:
--input s3://events/2011-30-10
--output s3://glowfish/output/2011-30-10
21. Streaming’s Little Helpers
You can use any class in Hadoop’s classpath, e.g., as a comparator or partitioner; several come bundled:
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
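For example, with the composite keys from slide 17 you could partition on the userid alone while sorting on the full (userid, subkey) pair. The option names below are the ones documented for Hadoop 0.20/1.x streaming; treat this as a sketch and check them against your Hadoop version:
-D stream.num.map.output.key.fields=2 \
-D mapred.text.key.partitioner.options=-k1,1 \
-D mapred.text.key.comparator.options='-k1,1 -k2,2n' \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner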
22. Streaming’s Little Helpers
• Use S3 to store…
• input data
• output data
• supporting data (e.g., Geo-IP)
• your code
23. Mapper and Reducer
To specify the mapper and reducer to be
used in your streaming job, you can point
Hadoop to S3:
--mapper s3://glowfish/bin/mapper.pl
--reducer s3://glowfish/bin/reducer.pl
24. Support Files
When specifying a file to store in the Distributed Cache, the URI fragment is used as the name of a symlink in the task’s local filesystem:
-cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat
26. Dependencies
But if you store an archive (Zip, TGZ, or JAR)
in the Distributed Cache, …
-cacheArchive s3://glowfish/lib/perllib.tgz
28. Dependencies
But if you store an archive (Zip, TGZ, or JAR)
in the Distributed Cache, …
-cacheArchive s3://glowfish/lib/perllib.tgz#locallib
29. Dependencies
Hadoop will uncompress it and create a link
to whatever directory it created, in the task’s
working directory.
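Putting the pieces above together, a job submission looks roughly like this. This is a sketch only: the streaming jar path and the single- vs. double-dash option spelling depend on whether you run the Hadoop streaming jar directly or go through the EMR tooling; the S3 paths are the ones used throughout these slides.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input s3://events/2011-30-10 \
  -output s3://glowfish/output/2011-30-10 \
  -mapper s3://glowfish/bin/mapper.pl \
  -reducer s3://glowfish/bin/reducer.pl \
  -cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat \
  -cacheArchive s3://glowfish/lib/perllib.tgz#locallib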
32. Mapper
#!/usr/bin/env perl
use strict;
use warnings;
use lib qw/ locallib /;    # modules unpacked from the -cacheArchive
use JSON::PP;

my $decoder    = JSON::PP->new->utf8;
my $missing_ip = 0;
while ( <> ) {
    chomp;
    next unless /load_complete/;
    my @line = split /\t/;
    # column 1 is a millisecond epoch; 300 s = one 5-minute block
    my ( $epoch, $payload ) = ( int( $line[1] / 1000 / 300 ), $line[5] );
    my $json = $decoder->decode( $payload );
    if ( ! exists $json->{'ip'} ) {
        $missing_ip++;
        next;
    }
    print "$epoch\t$json->{'ip'}\n";
}
# counters are reported to Hadoop on STDERR
print STDERR "reporter:counter:Job Counters,MISSING_IP,$missing_ip\n";
34. Reducer
#!/usr/bin/env perl
use strict;
use warnings;
use lib qw/ locallib /;    # modules unpacked from the -cacheArchive
use Geo::IP;
use Regexp::Common qw/ net /;
use Readonly;

Readonly::Scalar my $TAB => "\t";

# GeoLiteCity.dat is the symlink created by -cacheFile
my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
    or die "Could not open GeoIP database: $!\n";

my $format_errors      = 0;
my $invalid_ip_address = 0;
my $geo_lookup_errors  = 0;
my $time_slot;
my $previous_time_slot = -1;
37. Reducer
while ( <> ) {
    chomp;
    my @cols = split( $TAB );
    if ( scalar @cols != 2 ) {
        $format_errors++;
        next;
    }
    my $ip_addr;
    ( $time_slot, $ip_addr ) = @cols;    # assign, don't re-declare $time_slot
    if ( $previous_time_slot != -1
         && $time_slot != $previous_time_slot ) {
        # we've entered a new time slot, write the previous one out
        # (emit() is defined elsewhere in the full script, not shown on the slides)
        emit( $time_slot, $previous_time_slot );
    }
    if ( $ip_addr !~ /$RE{net}{IPv4}/ ) {
        $invalid_ip_address++;
        $previous_time_slot = $time_slot;
        next;
    }
40. Reducer
    my $geo_record = $geo->record_by_addr( $ip_addr );
    if ( ! defined $geo_record ) {
        $geo_lookup_errors++;
        $previous_time_slot = $time_slot;
        next;
    }
    # update entry for time slot with lat and lon
    $previous_time_slot = $time_slot;
} # while ( <> )

# flush the last time slot
emit( $time_slot + 1, $time_slot );

print STDERR "reporter:counter:Job Counters,FORMAT_ERRORS,$format_errors\n";
print STDERR "reporter:counter:Job Counters,INVALID_IPS,$invalid_ip_address\n";
print STDERR "reporter:counter:Job Counters,GEO_LOOKUP_ERRORS,$geo_lookup_errors\n";
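Because streaming mappers and reducers are plain filters reading STDIN and writing STDOUT, the whole chain can be exercised locally before going anywhere near a cluster. A sketch, where events.tsv stands in for a sample of the input logs and sort plays the role of the shuffle:
# GeoLiteCity.dat and an unpacked locallib/ need to be in the current directory
cat events.tsv | ./mapper.pl | sort | ./reducer.pl > slots.tsv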
45. Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single
reducer, sorted.
46. Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single
reducer, sorted.
• Use S3 for everything, and plan your
dataflow ahead.
47. ( On data )
• Store it wisely, e.g., using a directory structure like the following to get free partitioning in Hive and friends:
s3://bucket/path/data/run_date=2011-11-12
• Don’t worry about getting the data out of S3; you can always write a simple job that does that and run it at the end of your workflow.
48. Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single
reducer, sorted. Watch for the key
changing.
• Use S3 for everything, and plan your
dataflow ahead.
• Make carton a part of your life, and
especially of your build tool’s.
49. ( carton )
• Shipwright for humans
• Reads dependencies from Makefile.PL
• Installs them locally to your app
• Deploy your stuff, including carton.lock
• Run carton install --deployment
• Tar the result and upload it to S3 (see the sketch below)
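A possible build sequence for the dependency archive. This is a sketch: carton's default local/ directory and the exact tar layout that ends up under locallib are assumptions, not something the slides spell out.
carton install                              # resolve Makefile.PL deps, write carton.lock
carton install --deployment                 # on the build box, install exactly what carton.lock says
tar czf perllib.tgz -C local/lib/perl5 .    # archive the installed modules
# upload perllib.tgz to s3://glowfish/lib/ and reference it with
#   -cacheArchive s3://glowfish/lib/perllib.tgz#locallib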
50. URLs
• The MapReduce Paper
http://labs.google.com/papers/mapreduce.html
• Apache Hadoop
http://hadoop.apache.org/
• Amazon Elastic MapReduce
http://aws.amazon.com/elasticmapreduce/
53. That’s All, Folks!
Slides available at
http://slideshare.net/pfig/perl-on-amazon-elastic-mapreduce
me@pedrofigueiredo.org
Editor's Notes
• Sort/shuffle between the two steps guarantees that all mapper results for a single key go to the same reducer, and that the workload is distributed evenly.
• The sorting guarantees that all values for a given key are sent to a single reducer.
• Mozilla Glow tracked Firefox 4 downloads on a world map, in near real-time.
• On a 50-node cluster, processing ~3BN events takes 11 minutes, including data transfers. Two hours' worth take 3 minutes, so we can easily have data from 5 minutes ago. One day to modify the Glow protocol, one day to build. Everything stored on S3.
• Hadoop handles serialisation, heartbeat, node management, directory services, etc. Speculative task execution: the first copy to finish wins. Potentially very simple and contained code.
• You supply the mapper, reducer, and driver code.
• S3 gives you virtually unlimited storage with very high redundancy. S3 performance: ~750 MB of uncompressed data (110-byte rows -> ~7M rows/sec). All this is controlled using a REST API. Jobs are called 'steps' in EMR lingo.
• No way to customise the image and, e.g., install your own Perl. So it's a good idea to store the final results of a workflow in S3. No way to store dependencies in HDFS when the cluster is created.
• If you set a value to 0, you know it's going to be the first (k, v) the reducer will see, 1 will be the second, etc. When the userid changes, it's a new user.
• E.g., no control over output file names, many of the API settings can't be configured programmatically (command-line switches only), no separate mappers per input, etc. Because reducer input is also sorted on keys, when the key changes you know you won't be seeing any more of those. You might need to keep track of the current key, to use as the previous one.
• So how do you get all the CPAN goodness you know and love in there? HDFS operations are limited to copy, move, and delete, and the host OS doesn't see it: no untar'ing!
• Can have multiple inputs.
• That -D is a Hadoop define, not a JVM system property definition.
• On a streaming job you specify the programs to use as mapper and reducer.
• In the unknown directory where the task is running, making it accessible to it.
• At the end of the job, Hadoop aggregates counters from all tasks.