3. Word Count
Input
MapReduce is a framework for processing huge datasets on
certain kinds of distributable problems using a large number
of computers (nodes), collectively referred to as a cluster
...
Tuesday, April 5, 2011
4. Word Count
Input
MapReduce is a framework for processing huge datasets on
certain kinds of distributable problems using a large number
of computers (nodes), collectively referred to as a cluster
...
Output
a             3
as            1
certain       1
cluster       1
collectively  1
computers     1
datasets      1
distributable 1
for           1
framework     1
huge          1
is            1
kinds         1
large         1
MapReduce     1
(nodes),      1
number        1
of            2
on            1
problems      1
processing    1
referred      1
to            1
using         1
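The whole computation fits in a few lines of plain Java; a sketch of what map + shuffle + reduce compute together (a TreeMap stands in for the shuffle's grouping; no Hadoop types involved):

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {
    // Map: tokenize on whitespace; Shuffle + Reduce: group by word and sum.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count(
            "MapReduce is a framework for processing huge datasets on "
          + "certain kinds of distributable problems using a large number "
          + "of computers (nodes), collectively referred to as a cluster");
        System.out.println(c.get("a") + " " + c.get("of")); // 3 2
    }
}
```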
5. Word Count (Mapper)
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
6. Word Count (Mapper)
Same mapper as above; the highlighted Extract step tokenizes the input line,
setting word = "MapReduce", word = "is", word = "a", ... on successive iterations.
7. Word Count (Mapper)
Same mapper; the highlighted Emit step writes one (word, 1) pair per token:
("MapReduce", 1), ("is", 1), ("a", 1), ...
8. Word Count (Reducer)
public static class IntSumReducer
    extends Reducer<Text,IntWritable,Text,IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
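The reduce step is just a sum over the grouped values; a plain-Java sketch (no Hadoop types, with the values hard-coded to what the shuffle would deliver for the key "of"):

```java
import java.util.List;

public class SumDemo {
    // Mirrors IntSumReducer: sum all values grouped under one key.
    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // "of" appears twice in the input, so the shuffle delivers (of, [1, 1]).
        System.out.println(sum(List.of(1, 1))); // 2
    }
}
```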
9. Word Count (Reducer)
Same reducer; the highlighted Sum step accumulates the grouped values,
e.g. for key = "of" the loop leaves sum = 2.
10. Word Count (Reducer)
Same reducer; the highlighted Emit step writes the final (key, sum) pair,
e.g. ("of", 2).
11. Word Count (Running)
$ hadoop jar ./.versions/0.20/hadoop-0.20-examples.jar wordcount \
    -D mapred.reduce.tasks=3 \
    input_file out
11/04/03 21:21:27 INFO mapred.JobClient: Default number of map tasks: 2
11/04/03 21:21:27 INFO mapred.JobClient: Default number of reduce tasks: 3
11/04/03 21:21:28 INFO input.FileInputFormat: Total input paths to process : 1
11/04/03 21:21:29 INFO mapred.JobClient: Running job: job_201103252110_0659
11/04/03 21:21:30 INFO mapred.JobClient: map 0% reduce 0%
11/04/03 21:21:37 INFO mapred.JobClient: map 100% reduce 0%
11/04/03 21:21:49 INFO mapred.JobClient: map 100% reduce 33%
11/04/03 21:21:52 INFO mapred.JobClient: map 100% reduce 66%
11/04/03 21:22:05 INFO mapred.JobClient: map 100% reduce 100%
11/04/03 21:22:08 INFO mapred.JobClient: Job complete: job_201103252110_0659
11/04/03 21:22:08 INFO mapred.JobClient: Counters: 17
...
11/04/03 21:22:08 INFO mapred.JobClient: Map output bytes=286
11/04/03 21:22:08 INFO mapred.JobClient: Combine input records=27
11/04/03 21:22:08 INFO mapred.JobClient: Map output records=27
11/04/03 21:22:08 INFO mapred.JobClient: Reduce input records=24
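The counters line up with the input: the passage contains 27 whitespace-separated tokens (Map output records), of which 24 are distinct (Reduce input records, after the combiner merges duplicates within the map task). A quick plain-Java check of that arithmetic, using the passage text from the earlier slide:

```java
import java.util.Arrays;
import java.util.HashSet;

public class CounterCheck {
    // The input passage from the word count slides.
    static final String PASSAGE =
        "MapReduce is a framework for processing huge datasets on "
      + "certain kinds of distributable problems using a large number "
      + "of computers (nodes), collectively referred to as a cluster";

    static int mapOutputRecords() {      // one record per token
        return PASSAGE.split("\\s+").length;
    }

    static int reduceInputRecords() {    // one record per distinct word
        return new HashSet<>(Arrays.asList(PASSAGE.split("\\s+"))).size();
    }

    public static void main(String[] args) {
        System.out.println(mapOutputRecords() + " " + reduceInputRecords()); // 27 24
    }
}
```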
14. Word Count
Input → Split → Map → Shuffle/Sort → Reduce → Output
[Dataflow diagram: the input passage is split into chunks, one per map task;
each map emits (word, 1) pairs; the shuffle/sort groups the pairs by word
across the reducers; each reduce sums its group and writes (word, count),
e.g. (of, 2), (a, 3).]
17. Log Processing (Mapper)
public static final Pattern LOG_PATTERN = Pattern.compile(
    "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+-]\\d{4})\\] " +
    "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

public static class ExtractDateAndIpMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text ip = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException {
    String text = value.toString();
    Matcher matcher = LOG_PATTERN.matcher(text);
    while (matcher.find()) {
      try {
        ip.set(matcher.group(5) + "\t" + matcher.group(1));
        context.write(ip, one);
      } catch (InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
}
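The pattern follows the Apache combined log format: group(1) is the client IP and group(5) the date portion of the timestamp. A standalone sketch against a made-up log line (the sample line and its field values are invented for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogPatternDemo {
    static final Pattern LOG_PATTERN = Pattern.compile(
        "^([\\d.]+) (\\S+) (\\S+) \\[(([\\w/]+):([\\d:]+)\\s[+-]\\d{4})\\] " +
        "\"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"");

    // Invented sample line in combined log format.
    static final String SAMPLE =
        "189.186.9.181 - - [18/Jul/2010:21:21:27 +0000] "
      + "\"GET /index.html HTTP/1.1\" 200 1234 "
      + "\"http://example.com/\" \"Mozilla/5.0\"";

    // Returns the mapper's key, "<date>\t<ip>", or null if the line doesn't match.
    static String extractKey(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.find()) {
            return null;
        }
        return m.group(5) + "\t" + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(extractKey(SAMPLE)); // "18/Jul/2010" TAB "189.186.9.181"
    }
}
```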
18. Log Processing (Mapper)
Same mapper; the highlighted Extract step pulls the client IP out of each
matched log line: ip = "189.186.9.181", ip = "201.201.16.82",
ip = "66.249.67.57", ...
19. Log Processing (Mapper)
Same mapper; the highlighted Emit step writes one (date + "\t" + ip, 1) pair
per matched line, e.g. ("18/Jul/2010\t189.186.9.181", 1), ...
20. Log Processing (main)
public class LogAggregator {
  ...
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs =
        new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: LogAggregator <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "LogAggregator");
    job.setJarByClass(LogAggregator.class);
    job.setMapperClass(ExtractDateAndIpMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
21. Log Processing (main)
Same driver; the highlighted line wires in the Mapper:
job.setMapperClass(ExtractDateAndIpMapper.class);
22. Log Processing (main)
Same driver; the highlighted lines wire in the Combiner and Reducer, both
reusing WordCount.IntSumReducer.
23. Log Processing (main)
Same driver; the highlighted lines are the input/output settings: the output
key/value classes plus FileInputFormat.addInputPath(...) and
FileOutputFormat.setOutputPath(...).
24. Log Processing (main)
Same driver; the final line runs it:
System.exit(job.waitForCompletion(true) ? 0 : 1);
25. Log Processing (Running)
$ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.LogAggregator \
    -libjars hadoop-examples.jar data/access.log log_results
11/04/04 00:51:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
11/04/04 00:51:30 INFO input.FileInputFormat: Total input paths to process : 1
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Creating hadoop-
examples.jar in /tmp/hadoop-masahji/mapred/local/
archive/-8850340642758714312_382885124_516658918/file/Users/masahji/Development/
hadoop-recipes-work--8125788655475885988 with rwxr-xr-x
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///
Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/
mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/
Development/hadoop-recipes/hadoop-examples.jar
11/04/04 00:51:31 INFO filecache.TrackerDistributedCacheManager: Cached file:///
Users/masahji/Development/hadoop-recipes/hadoop-examples.jar as /tmp/hadoop-masahji/
mapred/local/archive/-8850340642758714312_382885124_516658918/file/Users/masahji/
Development/hadoop-recipes/hadoop-examples.jar
11/04/04 00:51:32 INFO mapred.JobClient: map 100% reduce 100%
26. Log Processing (Running)
Same run; the filecache.TrackerDistributedCacheManager lines show the
-libjars JAR being placed into the Distributed Cache.
35. Ruby Example (ignore ip list)
#!/usr/bin/env ruby
ignore = %w(127.0.0.1 192.168 10)
log_regex = /^([\d.]+)\s/

# read STDIN, write surviving lines to STDOUT
while (line = STDIN.gets)
  next unless line =~ log_regex
  ip = $1
  # keep the line only if no ignore prefix matches the IP
  print line if ignore.reject { |ignore_ip| ip !~ /^#{ignore_ip}(\.|$)/ }.empty?
end
36. Ruby Example (ignore ip list)
Same script as above, shown without annotations.
37. Ruby Example (ignore ip list)
$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input data/access.log \
    -output out/streaming/filter_ips \
    -mapper './script/filter_ips' \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer
11/04/04 07:08:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with ...
11/04/04 07:08:08 WARN mapred.JobClient: No job jar file set. User classes may not
11/04/04 07:08:08 INFO mapred.FileInputFormat: Total input paths to process : 1
11/04/04 07:08:09 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-masahji/
11/04/04 07:08:09 INFO streaming.StreamJob: Running job: job_local_0001
11/04/04 07:08:09 INFO streaming.StreamJob: Job running in-process (local Hadoop)
...
40. Simple Query
Query
SELECT first_name, last_name FROM people
WHERE first_name = 'John'
   OR favorite_movie_id = 2
41. Simple Query
Query
SELECT first_name, last_name FROM people
WHERE first_name = 'John'
   OR favorite_movie_id = 2
Input
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2
42. Simple Query
Query
SELECT first_name, last_name FROM people
WHERE first_name = 'John'
   OR favorite_movie_id = 2

Input
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2

Output
first_name  last_name
John        Mulligan
Royce       Rollins
John        Smith
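The mapper on the next slide implements exactly this SELECT and WHERE; as a plain-Java stand-in over the four input rows (the column indices mirror the table above, not the Hadoop code's constants):

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleQueryDemo {
    static final int FIRST_NAME = 1, LAST_NAME = 2, FAVORITE_MOVIE_ID = 3;

    // WHERE first_name = 'John' OR favorite_movie_id = 2,
    // projecting (first_name, last_name)
    static List<String> select(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            if (row[FIRST_NAME].equals("John")
                    || row[FAVORITE_MOVIE_ID].equals("2")) {
                out.add(row[FIRST_NAME] + " " + row[LAST_NAME]);
            }
        }
        return out;
    }

    // The four rows from the Input table.
    static List<String[]> demoRows() {
        return List.of(
            new String[]{"1", "John", "Mulligan", "3"},
            new String[]{"2", "Samir", "Ahmed", "5"},
            new String[]{"3", "Royce", "Rollins", "2"},
            new String[]{"4", "John", "Smith", "2"});
    }

    public static void main(String[] args) {
        System.out.println(select(demoRows()));
        // [John Mulligan, Royce Rollins, John Smith]
    }
}
```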
43. Simple Query (Mapper)
public class SimpleQuery {
  ...
  public static class SelectAndFilterMapper
      extends Mapper<Object, Text, TextArrayWritable, Text> {
    ...
    public void map(Object key, Text value, Context context)
        throws IOException {
      String[] row = value.toString().split(DELIMITER);
      try {
        if (row[FIRST_NAME_COLUMN].equals("John") ||
            row[FAVORITE_MOVIE_ID_COLUMN].equals("2")) {
          columns.set(new String[] {
            row[FIRST_NAME_COLUMN],
            row[LAST_NAME_COLUMN]
          });
          context.write(columns, blank);
        }
      } catch (InterruptedException ex) {
        throw new IOException(ex);
      }
    }
  }
  ...
}
44. Simple Query (Mapper)
Same mapper; the highlighted Extract step splits each delimited input line
into its columns.
45. Simple Query (Mapper)
Same mapper; the highlighted if implements the WHERE clause:
first_name = 'John' OR favorite_movie_id = 2.
46. Simple Query (Mapper)
Same mapper; the highlighted columns.set(...) implements the SELECT list:
first_name, last_name.
47. Simple Query (Mapper)
Same mapper; the highlighted context.write(columns, blank) is the Emit step:
the selected columns become the key, with an empty value.
48. Simple Query (Running)
$ hadoop jar target/hadoop-recipes-1.0.jar com.synctree.hadoop.recipes.SimpleQuery \
    data/people.tsv out/simple_query
...
11/04/04 09:19:15 INFO mapred.JobClient: map 100% reduce 100%
11/04/04 09:19:15 INFO mapred.JobClient: Job complete: job_local_0001
11/04/04 09:19:15 INFO mapred.JobClient: Counters: 13
11/04/04 09:19:15 INFO mapred.JobClient: FileSystemCounters
11/04/04 09:19:15 INFO mapred.JobClient: FILE_BYTES_READ=306296
11/04/04 09:19:15 INFO mapred.JobClient: FILE_BYTES_WRITTEN=398676
11/04/04 09:19:15 INFO mapred.JobClient: Map-Reduce Framework
11/04/04 09:19:15 INFO mapred.JobClient: Reduce input groups=3
11/04/04 09:19:15 INFO mapred.JobClient: Combine output records=0
11/04/04 09:19:15 INFO mapred.JobClient: Map input records=4
11/04/04 09:19:15 INFO mapred.JobClient: Reduce shuffle bytes=0
11/04/04 09:19:15 INFO mapred.JobClient: Reduce output records=3
11/04/04 09:19:15 INFO mapred.JobClient: Spilled Records=6
11/04/04 09:19:15 INFO mapred.JobClient: Map output bytes=54
11/04/04 09:19:15 INFO mapred.JobClient: Combine input records=0
11/04/04 09:19:15 INFO mapred.JobClient: Map output records=3
11/04/04 09:19:15 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
11/04/04 09:19:15 INFO mapred.JobClient: Reduce input records=3
...
50. Join Query
Query
SELECT first_name, last_name, movies.name name,
       movies.image
FROM people JOIN movies ON (
  people.favorite_movie_id = movies.id
)
51. Join Query
Input
people
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2

movies
id  name        image
2   The Matrix  http://bit.ly/matrix.jpg
3   Gatacca     http://bit.ly/g.jpg
4   AI          http://bit.ly/ai.jpg
5   Avatar      http://bit.ly/avatar.jpg
52. Join Query
Input
people
id  first_name  last_name  favorite_movie_id
1   John        Mulligan   3
2   Samir       Ahmed      5
3   Royce       Rollins    2
4   John        Smith      2

movies
id  name        image
2   The Matrix  http://bit.ly/matrix.jpg
3   Gatacca     http://bit.ly/g.jpg
4   AI          http://bit.ly/ai.jpg
5   Avatar      http://bit.ly/avatar.jpg

Output
first_name  last_name  name        image
John        Mulligan   Gatacca     http://bit.ly/g.jpg
Samir       Ahmed      Avatar      http://bit.ly/avatar.jpg
Royce       Rollins    The Matrix  http://bit.ly/matrix.jpg
John        Smith      The Matrix  http://bit.ly/matrix.jpg
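The join itself can be sketched without Hadoop: index one table by the join key, then probe it from the other, which is what the shuffle and reduce compute together (a plain-Java hash-join sketch, not the mapper shown on the next slide):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinDemo {
    // people.favorite_movie_id = movies.id,
    // projecting (first_name, last_name, movie name)
    static List<String> join(List<String[]> people, List<String[]> movies) {
        // Index movies by id, as the shuffle would group them.
        Map<String, String> movieById = new HashMap<>();
        for (String[] m : movies) {
            movieById.put(m[0], m[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] p : people) {
            String name = movieById.get(p[3]); // favorite_movie_id
            if (name != null) {
                out.add(p[1] + " " + p[2] + " -> " + name);
            }
        }
        return out;
    }

    // The rows from the Input tables above.
    static List<String[]> people() {
        return List.of(
            new String[]{"1", "John", "Mulligan", "3"},
            new String[]{"2", "Samir", "Ahmed", "5"},
            new String[]{"3", "Royce", "Rollins", "2"},
            new String[]{"4", "John", "Smith", "2"});
    }

    static List<String[]> movies() {
        return List.of(
            new String[]{"2", "The Matrix", "http://bit.ly/matrix.jpg"},
            new String[]{"3", "Gatacca", "http://bit.ly/g.jpg"},
            new String[]{"4", "AI", "http://bit.ly/ai.jpg"},
            new String[]{"5", "Avatar", "http://bit.ly/avatar.jpg"});
    }

    public static void main(String[] args) {
        System.out.println(join(people(), movies()));
        // [John Mulligan -> Gatacca, Samir Ahmed -> Avatar,
        //  Royce Rollins -> The Matrix, John Smith -> The Matrix]
    }
}
```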
53. Join Query (Mapper)
public static class SelectAndFilterMapper
    extends Mapper<Object, Text, Text, TextArrayWritable> {
  ...
  public void map(Object key, Text value, Context context)
      throws IOException {
    String[] row = value.toString().split(DELIMITER);
    String fileName =
        ((FileSplit) context.getInputSplit()).getPath().getName();
    try {
      if (fileName.startsWith("people")) {
        columns.set(new String[] {
          "people",
          row[PEOPLE_FIRST_NAME_COLUMN],
          row[PEOPLE_LAST_NAME_COLUMN]
        });
        joinKey.set(row[PEOPLE_FAVORITE_MOVIE_ID_COLUMN]);
      }
      else if (fileName.startsWith("movies")) {
        columns.set(new String[] {
          "movies",
          row[MOVIES_NAME_COLUMN],
          row[MOVIES_IMAGE_COLUMN]
        });
        joinKey.set(row[MOVIES_ID_COLUMN]);
      }
      context.write(joinKey, columns);
    } catch (InterruptedException ex) {
      throw new IOException(ex);
    }
    ...
63. What is Hive?
“Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools to enable
easy data ETL, a mechanism to put structures on the data, and the capability to querying
and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query
language, called QL, that enables users familiar with SQL to query the data. At the same
time, this language also allows programmers who are familiar with the MapReduce
framework to be able to plug in their custom mappers and reducers to perform more
sophisticated analysis that may not be supported by the built-in capabilities of the
language.”
64. Hive Features
SerDe
MetaStore
Query Processor
  Compiler
  Processor
Functions / UDFs, UDAFs, UDTFs