3. Word Count
Input
MapReduce is a framework for processing huge datasets on
certain kinds of distributable problems using a large number
of computers (nodes), collectively referred to as a cluster
...
Tuesday, April 5, 2011
4. Word Count
Input
MapReduce is a framework for processing huge datasets on
certain kinds of distributable problems using a large number
of computers (nodes), collectively referred to as a cluster
...
Output
a             3
as            1
certain       1
cluster       1
collectively  1
computers     1
datasets      1
distributable 1
for           1
framework     1
huge          1
is            1
kinds         1
large         1
MapReduce     1
(nodes),      1
number        1
of            2
on            1
problems      1
processing    1
referred      1
to            1
using         1
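The whole computation fits in a few lines of plain Java; a sketch of what map + shuffle + reduce compute together (a TreeMap stands in for the shuffle's grouping; no Hadoop types involved):

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {
    // Map: tokenize on whitespace; Shuffle + Reduce: group by word and sum.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count(
            "MapReduce is a framework for processing huge datasets on "
          + "certain kinds of distributable problems using a large number "
          + "of computers (nodes), collectively referred to as a cluster");
        System.out.println(c.get("a") + " " + c.get("of")); // 3 2
    }
}
```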
5. Word Count (Mapper)
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
6. Word Count (Mapper)
Same mapper as above; the highlighted Extract step tokenizes the input line,
setting word = "MapReduce", word = "is", word = "a", ... on successive iterations.
7. Word Count (Mapper)
Same mapper; the highlighted Emit step writes one (word, 1) pair per token:
("MapReduce", 1), ("is", 1), ("a", 1), ...
8. Word Count (Reducer)
public static class IntSumReducer
    extends Reducer<Text,IntWritable,Text,IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
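The reduce step is just a sum over the grouped values; a plain-Java sketch (no Hadoop types, with the values hard-coded to what the shuffle would deliver for the key "of"):

```java
import java.util.List;

public class SumDemo {
    // Mirrors IntSumReducer: sum all values grouped under one key.
    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // "of" appears twice in the input, so the shuffle delivers (of, [1, 1]).
        System.out.println(sum(List.of(1, 1))); // 2
    }
}
```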
9. Word Count (Reducer)
Same reducer; the highlighted Sum step accumulates the grouped values,
e.g. for key = "of" the loop leaves sum = 2.
10. Word Count (Reducer)
Same reducer; the highlighted Emit step writes the final (key, sum) pair,
e.g. ("of", 2).
11. Word Count (Running)
$ hadoop jar ./.versions/0.20/hadoop-0.20-examples.jar wordcount \
    -D mapred.reduce.tasks=3 \
    input_file out
11/04/03 21:21:27 INFO mapred.JobClient: Default number of map tasks: 2
11/04/03 21:21:27 INFO mapred.JobClient: Default number of reduce tasks: 3
11/04/03 21:21:28 INFO input.FileInputFormat: Total input paths to process : 1
11/04/03 21:21:29 INFO mapred.JobClient: Running job: job_201103252110_0659
11/04/03 21:21:30 INFO mapred.JobClient: map 0% reduce 0%
11/04/03 21:21:37 INFO mapred.JobClient: map 100% reduce 0%
11/04/03 21:21:49 INFO mapred.JobClient: map 100% reduce 33%
11/04/03 21:21:52 INFO mapred.JobClient: map 100% reduce 66%
11/04/03 21:22:05 INFO mapred.JobClient: map 100% reduce 100%
11/04/03 21:22:08 INFO mapred.JobClient: Job complete: job_201103252110_0659
11/04/03 21:22:08 INFO mapred.JobClient: Counters: 17
...
11/04/03 21:22:08 INFO mapred.JobClient: Map output bytes=286
11/04/03 21:22:08 INFO mapred.JobClient: Combine input records=27
11/04/03 21:22:08 INFO mapred.JobClient: Map output records=27
11/04/03 21:22:08 INFO mapred.JobClient: Reduce input records=24
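The counters line up with the input: the passage contains 27 whitespace-separated tokens (Map output records), of which 24 are distinct (Reduce input records, after the combiner merges duplicates within the map task). A quick plain-Java check of that arithmetic, using the passage text from the earlier slide:

```java
import java.util.Arrays;
import java.util.HashSet;

public class CounterCheck {
    // The input passage from the word count slides.
    static final String PASSAGE =
        "MapReduce is a framework for processing huge datasets on "
      + "certain kinds of distributable problems using a large number "
      + "of computers (nodes), collectively referred to as a cluster";

    static int mapOutputRecords() {      // one record per token
        return PASSAGE.split("\\s+").length;
    }

    static int reduceInputRecords() {    // one record per distinct word
        return new HashSet<>(Arrays.asList(PASSAGE.split("\\s+"))).size();
    }

    public static void main(String[] args) {
        System.out.println(mapOutputRecords() + " " + reduceInputRecords()); // 27 24
    }
}
```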
14. Word Count
Input → Split → Map → Shuffle/Sort → Reduce → Output
[Dataflow diagram: the input passage is split into chunks, one per map task;
each map emits (word, 1) pairs; the shuffle/sort groups the pairs by word
across the reducers; each reduce sums its group and writes (word, count),
e.g. (of, 2), (a, 3).]
17. Log Processing (Mapper)
public static final Pattern LOG_PATTERN = Pattern.compile(
    "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+-]\\d{4})\\] " +
    "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

public static class ExtractDateAndIpMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text ip = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException {
    String text = value.toString();
    Matcher matcher = LOG_PATTERN.matcher(text);
    while (matcher.find()) {
      try {
        ip.set(matcher.group(5) + "\t" + matcher.group(1));
        context.write(ip, one);
      } catch (InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
}
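The pattern follows the Apache combined log format: group(1) is the client IP and group(5) the date portion of the timestamp. A standalone sketch against a made-up log line (the sample line and its field values are invented for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogPatternDemo {
    static final Pattern LOG_PATTERN = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+-]\\d{4})\\] " +
        "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    // Invented sample line in combined log format.
    static final String SAMPLE =
        "189.186.9.181 - - [18/Jul/2010:21:21:27 +0000] "
      + "\"GET /index.html HTTP/1.1\" 200 1234 "
      + "\"http://example.com/\" \"Mozilla/5.0\"";

    // Returns the mapper's key, "<date>\t<ip>", or null if the line doesn't match.
    static String extractKey(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.find()) {
            return null;
        }
        return m.group(5) + "\t" + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(extractKey(SAMPLE)); // "18/Jul/2010" TAB "189.186.9.181"
    }
}
```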
18. Log Processing (Mapper)
Same mapper; the highlighted Extract step pulls the client IP out of each
matched log line: ip = "189.186.9.181", ip = "201.201.16.82",
ip = "66.249.67.57", ...
19. Log Processing (Mapper)
Same mapper; the highlighted Emit step writes one (date + "\t" + ip, 1) pair
per matched line, e.g. ("18/Jul/2010\t189.186.9.181", 1), ...
20. Log Processing (main)
public class LogAggregator {
  ...
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs =
        new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: LogAggregator <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "LogAggregator");
    job.setJarByClass(LogAggregator.class);
    job.setMapperClass(ExtractDateAndIpMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
21. Log Processing (main)
Same driver; the highlighted line wires in the Mapper:
job.setMapperClass(ExtractDateAndIpMapper.class);
22. Log Processing (main)
Same driver; the highlighted lines wire in the Combiner and Reducer, both
reusing WordCount.IntSumReducer.
23. Log Processing (main)
Same driver; the highlighted lines are the input/output settings: the output
key/value classes plus FileInputFormat.addInputPath(...) and
FileOutputFormat.setOutputPath(...).
24. Log Processing (main)
Same driver; the final line runs it:
System.exit(job.waitForCompletion(true) ? 0 : 1);
25. Log Processing (Running)
$ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.LogAggregator \
    -libjars hadoop-examples.jar data/access.log log_results
11/04/04 00:51:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
11/04/04 00:51:30 INFO input.FileInputFormat: Total input paths to process : 1
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Creating hadoop-
examples.jar in /tmp/hadoop-masahji/mapred/local/
archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/
hadoop-recipes-work--8125788655475885988 with rwxr-xr-x
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///
Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/
mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/
Development/hadoop-recipes/hadoop-examples.jar
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///
Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/
mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/
Development/hadoop-recipes/hadoop-examples.jar
11/04/04 00:51:32 INFO mapred.JobClient: map 100% reduce 100%
26. Log Processing (Running)
Same run; the filecache.TrackerDistributedCacheManager lines show the
-libjars JAR being placed into the Distributed Cache.
35. Ruby Example (ignore ip list)
#!/usr/bin/env ruby
ignore = %w(127.0.0.1 192.168 10)
log_regex = /^([\d.]+)\s/

# read STDIN, write surviving lines to STDOUT
while (line = STDIN.gets)
  next unless line =~ log_regex
  ip = $1
  # keep the line only if no ignore prefix matches the IP
  print line if ignore.reject { |ignore_ip| ip !~ /^#{ignore_ip}(\.|$)/ }.empty?
end
36. Ruby Example (ignore ip list)
Same script as above, shown without annotations.
37. Ruby Example (ignore ip list)
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input data/access.log \
    -output out/streaming/filter_ips \
    -mapper './script/filter_ips' \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer
11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not
11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1
11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/
11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001
11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop)
...
40. Simple Query
Query
SELECT first_name, last_name FROM people
WHERE first_name = 'John'
   OR favorite_movie_id = 2
41. Simple Query
Query
SELECT first_name, last_name FROM people
WHERE first_name = 'John'
   OR favorite_movie_id = 2
Input
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2
42. Simple Query
Query
SELECT first_name, last_name FROM people
WHERE first_name = 'John'
   OR favorite_movie_id = 2

Input
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2

Output
first_name  last_name
John        Mulligan
Royce       Rollins
John        Smith
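The mapper on the next slide implements exactly this SELECT and WHERE; as a plain-Java stand-in over the four input rows (the column indices mirror the table above, not the Hadoop code's constants):

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleQueryDemo {
    static final int FIRST_NAME = 1, LAST_NAME = 2, FAVORITE_MOVIE_ID = 3;

    // WHERE first_name = 'John' OR favorite_movie_id = 2,
    // projecting (first_name, last_name)
    static List<String> select(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            if (row[FIRST_NAME].equals("John")
                    || row[FAVORITE_MOVIE_ID].equals("2")) {
                out.add(row[FIRST_NAME] + " " + row[LAST_NAME]);
            }
        }
        return out;
    }

    // The four rows from the Input table.
    static List<String[]> demoRows() {
        return List.of(
            new String[]{"1", "John", "Mulligan", "3"},
            new String[]{"2", "Samir", "Ahmed", "5"},
            new String[]{"3", "Royce", "Rollins", "2"},
            new String[]{"4", "John", "Smith", "2"});
    }

    public static void main(String[] args) {
        System.out.println(select(demoRows()));
        // [John Mulligan, Royce Rollins, John Smith]
    }
}
```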
43. Simple Query (Mapper)
public class SimpleQuery {
  ...
  public static class SelectAndFilterMapper
      extends Mapper<Object, Text, TextArrayWritable, Text> {
    ...
    public void map(Object key, Text value, Context context)
        throws IOException {
      String[] row = value.toString().split(DELIMITER);
      try {
        if (row[FIRST_NAME_COLUMN].equals("John") ||
            row[FAVORITE_MOVIE_ID_COLUMN].equals("2")) {
          columns.set(new String[] {
            row[FIRST_NAME_COLUMN],
            row[LAST_NAME_COLUMN]
          });
          context.write(columns, blank);
        }
      } catch (InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
  ...
}
44. Simple Query (Mapper)
Same mapper; the highlighted Extract step splits each delimited input line
into its columns.
45. Simple Query (Mapper)
Same mapper; the highlighted if implements the WHERE clause:
first_name = 'John' OR favorite_movie_id = 2.
46. Simple Query (Mapper)
Same mapper; the highlighted columns.set(...) implements the SELECT list:
first_name, last_name.
47. Simple Query (Mapper)
Same mapper; the highlighted context.write(columns, blank) is the Emit step:
the selected columns become the key, with an empty value.
48. Simple Query (Running)
$ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.SimpleQuery \
    data/people.tsv out/simple_query
...
11/04/04 09:19:15 INFO mapred.JobClient: map 100% reduce 100%
11/04/04 09:19:15 INFO mapred.JobClient: Job complete: job_local_0001
11/04/04 09:19:15 INFO mapred.JobClient: Counters: 13
11/04/04 09:19:15 INFO mapred.JobClient: FileSystemCounters
11/04/04 09:19:15 INFO mapred.JobClient: FILE_BYTES_READ=306296
11/04/04 09:19:15 INFO mapred.JobClient: FILE_BYTES_WRITTEN=398676
11/04/04 09:19:15 INFO mapred.JobClient: Map-Reduce Framework
11/04/04 09:19:15 INFO mapred.JobClient: Reduce input groups=3
11/04/04 09:19:15 INFO mapred.JobClient: Combine output records=0
11/04/04 09:19:15 INFO mapred.JobClient: Map input records=4
11/04/04 09:19:15 INFO mapred.JobClient: Reduce shuffle bytes=0
11/04/04 09:19:15 INFO mapred.JobClient: Reduce output records=3
11/04/04 09:19:15 INFO mapred.JobClient: Spilled Records=6
11/04/04 09:19:15 INFO mapred.JobClient: Map output bytes=54
11/04/04 09:19:15 INFO mapred.JobClient: Combine input records=0
11/04/04 09:19:15 INFO mapred.JobClient: Map output records=3
11/04/04 09:19:15 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
11/04/04 09:19:15 INFO mapred.JobClient: Reduce input records=3
...
50. Join Query
Query
SELECT first_name, last_name, movies.name name,
       movies.image
FROM people JOIN movies ON (
  people.favorite_movie_id = movies.id
)
51. Join Query
Input
people
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2

movies
id  name        image
2   The Matrix  http://bit.ly/matrix.jpg
3   Gatacca     http://bit.ly/g.jpg
4   AI          http://bit.ly/ai.jpg
5   Avatar      http://bit.ly/avatar.jpg
52. Join Query
Input
people
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2

movies
id  name        image
2   The Matrix  http://bit.ly/matrix.jpg
3   Gatacca     http://bit.ly/g.jpg
4   AI          http://bit.ly/ai.jpg
5   Avatar      http://bit.ly/avatar.jpg

Output
first_name  last_name  name        image
John        Mulligan   Gatacca     http://bit.ly/g.jpg
Samir       Ahmed      Avatar      http://bit.ly/avatar.jpg
Royce       Rollins    The Matrix  http://bit.ly/matrix.jpg
John        Smith      The Matrix  http://bit.ly/matrix.jpg
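The join itself can be sketched without Hadoop: index one table by the join key, then probe it from the other, which is what the shuffle and reduce compute together (a plain-Java hash-join sketch, not the mapper shown on the next slide):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinDemo {
    // people.favorite_movie_id = movies.id,
    // projecting (first_name, last_name, movie name)
    static List<String> join(List<String[]> people, List<String[]> movies) {
        // Index movies by id, as the shuffle would group them.
        Map<String, String> movieById = new HashMap<>();
        for (String[] m : movies) {
            movieById.put(m[0], m[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] p : people) {
            String name = movieById.get(p[3]); // favorite_movie_id
            if (name != null) {
                out.add(p[1] + " " + p[2] + " -> " + name);
            }
        }
        return out;
    }

    // The rows from the Input tables above.
    static List<String[]> people() {
        return List.of(
            new String[]{"1", "John", "Mulligan", "3"},
            new String[]{"2", "Samir", "Ahmed", "5"},
            new String[]{"3", "Royce", "Rollins", "2"},
            new String[]{"4", "John", "Smith", "2"});
    }

    static List<String[]> movies() {
        return List.of(
            new String[]{"2", "The Matrix", "http://bit.ly/matrix.jpg"},
            new String[]{"3", "Gatacca", "http://bit.ly/g.jpg"},
            new String[]{"4", "AI", "http://bit.ly/ai.jpg"},
            new String[]{"5", "Avatar", "http://bit.ly/avatar.jpg"});
    }

    public static void main(String[] args) {
        System.out.println(join(people(), movies()));
        // [John Mulligan -> Gatacca, Samir Ahmed -> Avatar,
        //  Royce Rollins -> The Matrix, John Smith -> The Matrix]
    }
}
```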
53. Join Query (Mapper)
public static class SelectAndFilterMapper
    extends Mapper<Object, Text, Text, TextArrayWritable> {
  ...
  public void map(Object key, Text value, Context context)
      throws IOException {
    String[] row = value.toString().split(DELIMITER);
    String fileName =
        ((FileSplit) context.getInputSplit()).getPath().getName();
    try {
      if (fileName.startsWith("people")) {
        columns.set(new String[] {
          "people",
          row[PEOPLE_FIRST_NAME_COLUMN],
          row[PEOPLE_LAST_NAME_COLUMN]
        });
        joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
      }
      else if (fileName.startsWith("movies")) {
        columns.set(new String[] {
          "movies",
          row[MOVIES_NAME_COLUMN],
          row[MOVIES_IMAGE_COLUMN]
        });
        joinKey.set(row[MOVIES_ID_COLUMN]);
      }
      context.write(joinKey, columns);
    } catch (InterruptedException ex) {
      throw new IOException(ex);
    }
    ...
63. What is Hive?
“Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable
easy data ETL, a mechanism to put structures on the data, and the capability to querying
and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query
language, called QL, that enables users familiar with SQL to query the data. At the same
time, this language also allows programmers who are familiar with the MapReduce
framework to be able to plug in their custom mappers and reducers to perform more
sophisticated analysis that may not be supported by the built-in capabilities of the
language.”
64. Hive Features
SerDe
MetaStore
Query Processor
  Compiler
  Processor
Functions / UDFs, UDAFs, UDTFs