Hadoop isn't limited to running Java code; you can write your jobs in a variety of dynamic languages.
This talk is about Hadoop's Streaming API and the best way we found to run Perl jobs on Amazon's Elastic MapReduce platform.
2. A Gentle Introduction to MapReduce
• Distributed computing model
• Mappers process the input and forward
intermediate results to reducers.
• Reducers aggregate these intermediate
results, and emit the final results.
4. MapReduce
• Input data is sent to mappers as (k, v) pairs.
• After processing, mappers emit (k_out, v_out) pairs.
• These pairs are sorted and sent to reducers.
• All (k_out, v_out) pairs for a given k_out are sent to a single reducer.
5. MapReduce
• Reducers get (k, [v_1, v_2, …, v_n]).
• After processing, the reducer emits a (k_f, v_f) per result (see the sketch below).
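To make the contract concrete, here is a minimal streaming reducer in Perl (a sketch, not code from the talk): it relies only on the fact that input lines arrive sorted by key, counts the values seen for each key, and emits one (k_f, v_f) pair per key.
#!/usr/bin/env perl
# Sketch only: count how many values arrive for each key on STDIN.
use strict;
use warnings;

my ( $current_key, $count ) = ( undef, 0 );
while ( <> ) {
    chomp;
    my ( $key, $value ) = split /\t/, $_, 2;
    if ( defined $current_key && $key ne $current_key ) {
        print "$current_key\t$count\n";    # the previous key is finished
        $count = 0;
    }
    $current_key = $key;
    $count++;
}
print "$current_key\t$count\n" if defined $current_key;    # flush the last key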
6. MapReduce
We wanted to have a world map showing
where people were starting our games (like
Mozilla Glow)
8. MapReduce
• Input: ( epoch, IP address )
• Mappers group these into 5-minute blocks and emit ( blockId, IP address ) (see the snippet below)
• Reducers get ( blockId, [ip_1, ip_2, …, ip_n] )
• Do a geo lookup and emit ( epoch, [ ( lat1, lon1 ), ( lat2, lon2 ), … ] )
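The grouping step is just integer division on the timestamp. A minimal version, assuming the epoch is in seconds (the real mapper shown later divides by 1000 first because its timestamps are in milliseconds):
my $block_id = int( $epoch / 300 );      # 300 s = one 5-minute block
print "$block_id\t$ip_address\n";        # emit ( blockId, IP address )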
11. Apache Hadoop
• Distributed programming framework
• Implements MapReduce
• Does all the usual distributed programming
heavy-lifting for you
• Highly fault-tolerant, with automatic task re-assignment in case of failure
• You focus on mappers and reducers
12. Apache Hadoop
• Native Java API
• Streaming API which can use mappers and
reducers written in any programming
language.
• Distributed file system (HDFS)
• Distributed Cache
13. Amazon Elastic MapReduce
• On-demand Hadoop clusters running on
EC2 instances.
• Improved S3 support for storage of input
and output data.
• Build workflows by sending jobs to a
cluster.
14. EMR Downsides
• No control over the machine images.
• Perl 5.8.8
• Ephemeral: when your cluster is shut down (or dies), HDFS is gone.
• HDFS not available at cluster-creation time.
• Debian
17. Composite Keys
• Reducers receive both keys and values sorted.
• Merge 3 tables (see the reducer sketch below):
userid, 0, … # customer info
userid, 1, … # payments history
userid, recordid1, … # clickstream
userid, recordid2, … # clickstream
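A sketch (not the talk's code) of how a reducer can exploit that ordering: the '0' record arrives first for each userid, then '1', then the clickstream records, so the customer info and payment history can be cached and joined onto each clickstream line.
#!/usr/bin/env perl
# Sketch only: merge customer info, payments, and clickstream records
# that arrive sorted on the composite key (userid, subkey).
use strict;
use warnings;

my ( $current_user, $info, $payments ) = ( '', '', '' );
while ( <> ) {
    chomp;
    my ( $userid, $subkey, $rest ) = split /\t/, $_, 3;
    if ( $userid ne $current_user ) {
        # key changed: everything for the previous user has been seen
        ( $current_user, $info, $payments ) = ( $userid, '', '' );
    }
    if    ( $subkey eq '0' ) { $info     = $rest; }    # customer info
    elsif ( $subkey eq '1' ) { $payments = $rest; }    # payments history
    else {                                             # clickstream record
        print join( "\t", $userid, $info, $payments, $rest ), "\n";
    }
}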
18. Streaming vs. Native
• Limited API
• About a 7-10% increase in run time
• About a 1000% decrease in development
time (as reported by a non-representative
sample of developers)
19. Where’s My Towel?
• Tasks run chrooted in a non-deterministic
location.
• It’s easy to store files in HDFS when submitting a job, but impossible to store directory trees.
• For native Java jobs, your dependencies get
packaged in the JAR alongside your code.
20. Streaming’s Little Helpers
Define your inputs and outputs:
--input s3://events/2011-30-10
--output s3://glowfish/output/2011-30-10
21. Streaming’s Little Helpers
You can use any class in Hadoop’s classpath, e.g., as a comparator or partitioner; several come bundled:
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
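For example, with the composite keys from slide 17 you could partition on the userid alone while sorting on the full (userid, subkey) pair. The option names below are the ones documented for Hadoop 0.20/1.x streaming; treat this as a sketch and check them against your Hadoop version:
-D stream.num.map.output.key.fields=2 \
-D mapred.text.key.partitioner.options=-k1,1 \
-D mapred.text.key.comparator.options='-k1,1 -k2,2n' \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner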
22. Streaming’s Little Helpers
• Use S3 to store…
• input data
• output data
• supporting data (e.g., Geo-IP)
• your code
23. Mapper and Reducer
To specify the mapper and reducer to be
used in your streaming job, you can point
Hadoop to S3:
--mapper s3://glowfish/bin/mapper.pl
--reducer s3://glowfish/bin/reducer.pl
24. Support Files
When specifying a file to store in the Distributed Cache, the URI fragment is used as the name of a symlink in the task’s local filesystem:
-cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat
26. Dependencies
But if you store an archive (Zip, TGZ, or JAR)
in the Distributed Cache, …
-cacheArchive s3://glowfish/lib/perllib.tgz
28. Dependencies
But if you store an archive (Zip, TGZ, or JAR)
in the Distributed Cache, …
-cacheArchive s3://glowfish/lib/perllib.tgz#locallib
29. Dependencies
Hadoop will uncompress it and create a link
to whatever directory it created, in the task’s
working directory.
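Putting the pieces above together, a job submission looks roughly like this. This is a sketch only: the streaming jar path and the single- vs. double-dash option spelling depend on whether you run the Hadoop streaming jar directly or go through the EMR tooling; the S3 paths are the ones used throughout these slides.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -input s3://events/2011-30-10 \
  -output s3://glowfish/output/2011-30-10 \
  -mapper s3://glowfish/bin/mapper.pl \
  -reducer s3://glowfish/bin/reducer.pl \
  -cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat \
  -cacheArchive s3://glowfish/lib/perllib.tgz#locallib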
32. Mapper
#!/usr/bin/env perl
use strict;
use warnings;
use lib qw/ locallib /;    # modules unpacked from the -cacheArchive
use JSON::PP;

my $decoder    = JSON::PP->new->utf8;
my $missing_ip = 0;
while ( <> ) {
    chomp;
    next unless /load_complete/;
    my @line = split /\t/;
    # column 1 is a millisecond epoch; 300 s = one 5-minute block
    my ( $epoch, $payload ) = ( int( $line[1] / 1000 / 300 ), $line[5] );
    my $json = $decoder->decode( $payload );
    if ( ! exists $json->{'ip'} ) {
        $missing_ip++;
        next;
    }
    print "$epoch\t$json->{'ip'}\n";
}
# counters are reported to Hadoop on STDERR
print STDERR "reporter:counter:Job Counters,MISSING_IP,$missing_ip\n";
34. Reducer
#!/usr/bin/env perl
use strict;
use warnings;
use lib qw/ locallib /;    # modules unpacked from the -cacheArchive
use Geo::IP;
use Regexp::Common qw/ net /;
use Readonly;

Readonly::Scalar my $TAB => "\t";

# GeoLiteCity.dat is the symlink created by -cacheFile
my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
    or die "Could not open GeoIP database: $!\n";

my $format_errors      = 0;
my $invalid_ip_address = 0;
my $geo_lookup_errors  = 0;
my $time_slot;
my $previous_time_slot = -1;
37. Reducer
while ( <> ) {
    chomp;
    my @cols = split( $TAB );
    if ( scalar @cols != 2 ) {
        $format_errors++;
        next;
    }
    my $ip_addr;
    ( $time_slot, $ip_addr ) = @cols;    # assign, don't re-declare $time_slot
    if ( $previous_time_slot != -1
         && $time_slot != $previous_time_slot ) {
        # we've entered a new time slot, write the previous one out
        # (emit() is defined elsewhere in the full script, not shown on the slides)
        emit( $time_slot, $previous_time_slot );
    }
    if ( $ip_addr !~ /$RE{net}{IPv4}/ ) {
        $invalid_ip_address++;
        $previous_time_slot = $time_slot;
        next;
    }
40. Reducer
    my $geo_record = $geo->record_by_addr( $ip_addr );
    if ( ! defined $geo_record ) {
        $geo_lookup_errors++;
        $previous_time_slot = $time_slot;
        next;
    }
    # update entry for time slot with lat and lon
    $previous_time_slot = $time_slot;
} # while ( <> )

# flush the last time slot
emit( $time_slot + 1, $time_slot );

print STDERR "reporter:counter:Job Counters,FORMAT_ERRORS,$format_errors\n";
print STDERR "reporter:counter:Job Counters,INVALID_IPS,$invalid_ip_address\n";
print STDERR "reporter:counter:Job Counters,GEO_LOOKUP_ERRORS,$geo_lookup_errors\n";
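Because streaming mappers and reducers are plain filters reading STDIN and writing STDOUT, the whole chain can be exercised locally before going anywhere near a cluster. A sketch, where events.tsv stands in for a sample of the input logs and sort plays the role of the shuffle:
# GeoLiteCity.dat and an unpacked locallib/ need to be in the current directory
cat events.tsv | ./mapper.pl | sort | ./reducer.pl > slots.tsv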
45. Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single
reducer, sorted.
46. Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single
reducer, sorted.
• Use S3 for everything, and plan your
dataflow ahead.
47. ( On data )
• Store it wisely, e.g., using a directory structure like the following to get free partitioning in Hive and friends:
s3://bucket/path/data/run_date=2011-11-12
• Don’t worry about getting the data out of S3; you can always write a simple job that does that and run it at the end of your workflow.
48. Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single
reducer, sorted. Watch for the key
changing.
• Use S3 for everything, and plan your
dataflow ahead.
• Make carton a part of your life, and
especially of your build tool’s.
49. ( carton )
• Shipwright for humans
• Reads dependencies from Makefile.PL
• Installs them locally to your app
• Deploy your stuff, including carton.lock
• Run carton install --deployment
• Tar the result and upload it to S3 (see the sketch below)
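A possible build sequence for the dependency archive. This is a sketch: carton's default local/ directory and the exact tar layout that ends up under locallib are assumptions, not something the slides spell out.
carton install                              # resolve Makefile.PL deps, write carton.lock
carton install --deployment                 # on the build box, install exactly what carton.lock says
tar czf perllib.tgz -C local/lib/perl5 .    # archive the installed modules
# upload perllib.tgz to s3://glowfish/lib/ and reference it with
#   -cacheArchive s3://glowfish/lib/perllib.tgz#locallib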
50. URLs
• The MapReduce Paper
http://labs.google.com/papers/mapreduce.html
• Apache Hadoop
http://hadoop.apache.org/
• Amazon Elastic MapReduce
http://aws.amazon.com/elasticmapreduce/
53. That’s All, Folks!
Slides available at
http://slideshare.net/pfig/perl-on-amazon-elastic-mapreduce
me@pedrofigueiredo.org
Editor's Notes
• Sort/shuffle between the two steps guarantees that all mapper results for a single key go to the same reducer, and that the workload is distributed evenly.
• The sorting guarantees that all values for a given key are sent to a single reducer.
• Mozilla Glow tracked Firefox 4 downloads on a world map, in near real-time.
• On a 50-node cluster, processing ~3BN events takes 11 minutes, including data transfers. Two hours' worth take 3 minutes, so we can easily have data from 5 minutes ago. One day to modify the Glow protocol, one day to build. Everything stored on S3.
• Hadoop handles serialisation, heartbeat, node management, directory services, etc. Speculative task execution: the first copy to finish wins. Potentially very simple and contained code.
• You supply the mapper, reducer, and driver code.
• S3 gives you virtually unlimited storage with very high redundancy. S3 performance: ~750 MB of uncompressed data (110-byte rows -> ~7M rows/sec). All this is controlled using a REST API. Jobs are called 'steps' in EMR lingo.
• No way to customise the image and, e.g., install your own Perl. So it's a good idea to store the final results of a workflow in S3. No way to store dependencies in HDFS when the cluster is created.
• If you set a value to 0, you know it's going to be the first (k, v) the reducer will see, 1 will be the second, etc. When the userid changes, it's a new user.
• E.g., no control over output file names, many of the API settings can't be configured programmatically (command-line switches only), no separate mappers per input, etc. Because reducer input is also sorted on keys, when the key changes you know you won't be seeing any more of those. You might need to keep track of the current key, to use as the previous one.
• So how do you get all the CPAN goodness you know and love in there? HDFS operations are limited to copy, move, and delete, and the host OS doesn't see it: no untar'ing!
• Can have multiple inputs.
• That -D is a Hadoop define, not a JVM system property definition.
• On a streaming job you specify the programs to use as mapper and reducer.
• In the unknown directory where the task is running, making it accessible to it.
• At the end of the job, Hadoop aggregates counters from all tasks.