1. Introduction to HDFS and MapReduce
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
Thursday, January 10, 13
2. Who Am I
- Ryan Tabora
- Data Developer at Think Big Analytics
- Big Data consulting
- Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.
4. Think Big is the leading professional services firm purpose-built for Big Data.
• One of Silicon Valley's fastest-growing Big Data startups
• 100% focus on Big Data consulting & Data Science solution services
• Management background: Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture; C-bridge Internet Solutions (CBIS) founder 1996 & executives, IPO 1999
• Clients: 40+
• North America locations
  • US East: Boston, New York, Washington D.C.
  • US Central: Chicago, Austin
  • US West: HQ Mountain View, San Diego, Salt Lake City
• EMEA & APAC
5. Think Big Recognized as a Top Pure-Play Big Data Vendor
Source: Forbes, February 2012
6. Agenda
- Big Data
- Hadoop Ecosystem
- HDFS
- MapReduce in Hadoop
- The Hadoop Java API
- Conclusions
7. Big Data
8. A Data Shift...
Source: EMC Digital Universe Study*
9. Motivation
“Simple algorithms and lots of data trump complex models.”
— Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems
10. Pioneers
• Google and Yahoo:
- Index 850+ million websites, over one trillion URLs.
• Facebook ad targeting:
- 840+ million users, > 50% of whom are active daily.
11. Hadoop Ecosystem
12. Common Tool?
• Hadoop
- Cluster: distributed computing platform.
- Commodity*, server-class hardware.
- Extensible platform.
13. Hadoop Origins
• MapReduce and the Google File System (GFS) were pioneered at Google.
• Hadoop is the open-source equivalent, with commercial support available.
14. What Is Hadoop?
• Hadoop is a platform.
• Distributes and replicates data.
• Manages parallel tasks created by users.
• Runs as several processes on a cluster.
• The term Hadoop generally refers to a toolset, not a single tool.
15. Why Hadoop?
• Handles unstructured, semi-structured, and structured data.
• Handles enormous data volumes.
• Flexible data analysis and machine learning tools.
• Cost-effective scalability.
16. The Hadoop Ecosystem
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore, allowing SQL-style manipulation of data stored on HDFS.
• Pig - A high-level scripting language for manipulating data on Hadoop.
• HBase - A NoSQL data store for random, real-time access.
18. HDFS
19. What Is HDFS?
• Hadoop Distributed File System.
• Stores files in blocks across many nodes in a cluster.
• Replicates the blocks across nodes for durability.
• Master/slave architecture.
20. HDFS Traits
• Not fully POSIX compliant.
• No file updates.
• Write once, read many times.
• Large blocks, sequential read patterns.
• Designed for batch processing.
21. HDFS Master
• NameNode
- Runs on a single node as a master process
‣ Holds file metadata (which blocks are where)
‣ Directs client access to files in HDFS
• SecondaryNameNode
- Not a hot failover
- Maintains a copy of the NameNode metadata
22. HDFS Slaves
• DataNode
- Generally runs on all nodes in the cluster
‣ Block creation/replication/deletion/reads
‣ Takes orders from the NameNode
23. HDFS Illustrated
[Diagram, built up over several slides: a client issues "Put File" to the NameNode. The file is split into blocks 1, 2, and 3, which are written to DataNodes 1, 2, and 3 on a six-node cluster. Each block is then replicated until it lives on three DataNodes (e.g., block 1 on DataNodes 1, 4, and 6; block 2 on DataNodes 2, 5, and 3; block 3 on DataNodes 3, 2, and 6), and the NameNode records which blocks are where.]
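The placement in the diagram can be sketched in a few lines of Java. This is an illustrative toy only, not HDFS's real placement policy (which is rack-aware and load-sensitive); the class and method names are made up for this sketch.

```java
import java.util.*;

// Toy model of the diagram above: every block of a file is assigned to
// `replication` distinct DataNodes, spreading replicas across the cluster.
public class BlockPlacementSketch {
    // Returns, for each block, the DataNode ids (1-based) holding a replica.
    public static List<List<Integer>> place(int numBlocks, int numDataNodes, int replication) {
        List<List<Integer>> placements = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // consecutive nodes, wrapping around the cluster;
                // distinct as long as replication <= numDataNodes
                replicas.add((b + r) % numDataNodes + 1);
            }
            placements.add(replicas);
        }
        return placements;
    }

    public static void main(String[] args) {
        // A 3-block file on a 6-node cluster with replication factor 3,
        // as in the diagram.
        for (List<Integer> replicas : place(3, 6, 3)) {
            System.out.println("block stored on DataNodes " + replicas);
        }
    }
}
```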
30. Power of Hadoop
[Diagram, built up over several slides: a client reads the file's blocks from several DataNodes in parallel. When DataNode 1 fails, the NameNode redirects reads to the surviving replicas (e.g., block 1 is now served from DataNode 5) and re-replicates the lost blocks. Read time = transfer rate x number of machines*: at 100 MB/s per node across 3 DataNodes, aggregate throughput is 300 MB/s.]
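The slide's throughput arithmetic, as a one-liner (an idealized upper bound; in practice network, disk contention, and skew reduce it):

```java
// Aggregate read throughput when blocks are spread over the cluster and
// read in parallel: it scales (ideally) with the number of machines.
public class ReadThroughput {
    // Aggregate MB/s when `machines` nodes each stream at `perNodeMBps`.
    public static int aggregateMBps(int perNodeMBps, int machines) {
        return perNodeMBps * machines;
    }

    public static void main(String[] args) {
        // 100 MB/s per node, 3 DataNodes holding the file's blocks
        System.out.println(aggregateMBps(100, 3) + " MB/s"); // 300 MB/s
    }
}
```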
38. HDFS Shell
• Easy to use command line interface.
• Create, copy, move, and delete files.
• Administrative duties - chmod, chown, chgrp.
• Set replication factor for a file.
• Head, tail, cat to view files.
39. The Hadoop Ecosystem
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore, allowing SQL-style manipulation of data stored on HDFS.
• Pig - A high-level scripting language for manipulating data on Hadoop.
• HBase - A NoSQL data store for random, real-time access.
41. MapReduce in Hadoop
42. MapReduce Basics
• Logical functions: Mappers and Reducers.
• Developers write map and reduce functions,
then submit a jar to the Hadoop cluster.
• Hadoop handles distributing the Map and
Reduce tasks across the cluster.
• Typically batch oriented.
43. MapReduce Daemons
• JobTracker (Master)
- Manages MapReduce jobs, assigning tasks to different nodes and handling task failure
• TaskTracker (Slave)
- Creates individual map and reduce tasks
- Reports task status to the JobTracker
45. MapReduce in Hadoop
Let’s look at how MapReduce actually works in Hadoop, using WordCount.
46. Input → Mappers → Sort, Shuffle → Reducers → Output
[Diagram, built up over several slides: WordCount as a MapReduce dataflow. We need to convert the Input into the Output.
- Input: the documents (doc1) "Hadoop uses MapReduce", (doc2) "There is a Map phase", (doc4) "There is a Reduce phase"; doc3 is empty.
- Mappers: emit one (word, 1) pair per token, e.g. (hadoop, 1), (uses, 1), (mapreduce, 1).
- Sort, Shuffle: partition the pairs by key range (0-9 a-l, m-q, r-z) and group the values per key, e.g. (a, [1,1]), (hadoop, [1]), (phase, [1,1]).
- Reducers: sum each value list to produce the Output: a 2, hadoop 1, is 2, map 1, mapreduce 1, phase 2, reduce 1, there 2, uses 1.
Map: transform one input to 0-N outputs. Reduce: collect multiple inputs into one output.]
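The dataflow pictured above can be simulated in plain Java, with no Hadoop dependency. This sketch runs the three phases in-process on the slide's sample lines; in a real job, each phase runs distributed across the cluster.

```java
import java.util.*;

// In-memory sketch of WordCount's MapReduce dataflow:
// map each line to (word, 1) pairs, group pairs by key (the sort/shuffle),
// then reduce each group by summing its values.
public class WordCountFlow {
    public static SortedMap<String, Integer> run(List<String> lines) {
        // Map phase: emit (word, 1) for every token
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String w : line.toLowerCase().split("\\s+"))
                if (!w.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        // Sort/shuffle + reduce: group pairs by word, summing the 1s;
        // the TreeMap keeps keys sorted, like the shuffle's sort
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // The slide's input documents (the empty doc contributes nothing)
        System.out.println(run(Arrays.asList(
            "Hadoop uses MapReduce",
            "There is a Map phase",
            "",
            "There is a Reduce phase")));
    }
}
```

Running it reproduces the slide's output table (a 2, hadoop 1, is 2, ... uses 1).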
57. Cluster View of MapReduce
[Diagram, built up over several slides: the client submits a job jar containing map (M) and reduce (R) code to the JobTracker, which runs alongside the NameNode. The JobTracker hands map tasks to TaskTrackers running next to the DataNodes. The map phase produces intermediate (k,v) data, which is stored locally, not in HDFS. The shuffle/sort moves that intermediate data to the nodes that will reduce it; the reduce phase runs the R tasks and writes the final output. Job complete!]
68. The Hadoop Java API
70. MapReduce in Java
Let’s look at WordCount written in the MapReduce Java API.
71. Map Code
public class SimpleWordCountMapper
    extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {

  static final Text word = new Text();
  static final IntWritable one = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text documentContents,
      OutputCollector<Text, IntWritable> collector, Reporter reporter)
      throws IOException {
    String[] tokens = documentContents.toString().split("\\s+");
    for (String wordString : tokens) {
      if (wordString.length() > 0) {
        word.set(wordString.toLowerCase());
        collector.collect(word, one);
      }
    }
  }
}
72. Map Code, annotated
Let’s drill into this code. The callouts on the listing:
• Mapper class with 4 type parameters for the input key-value types and output types.
• Output key-value objects we’ll reuse.
• Map method with input, output “collector”, and reporting object.
• Tokenize the line, “collect” each (word, 1).
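A note on the mapper's split call: the delimiter must be the whitespace regex "\\s+" (slide transcripts often drop the backslashes, leaving "s+", which splits on runs of the letter s instead). A quick standalone demo of the difference:

```java
import java.util.Arrays;

// In a Java regex, \s is the whitespace character class, and the backslash
// itself must be escaped inside a string literal, hence "\\s+".
public class SplitDemo {
    public static void main(String[] args) {
        String line = "Hadoop uses MapReduce";
        // Correct: split on runs of whitespace
        System.out.println(Arrays.toString(line.split("\\s+"))); // [Hadoop, uses, MapReduce]
        // Wrong: splits on runs of the letter 's'
        System.out.println(Arrays.toString(line.split("s+")));
    }
}
```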
78. Reduce Code
public class SimpleWordCountReducer
    extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> counts,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int count = 0;
    while (counts.hasNext()) {
      count += counts.next().get();
    }
    output.collect(key, new IntWritable(count));
  }
}
79. Reduce Code, annotated
Let’s drill into this code. The callouts on the listing:
• Reducer class with 4 type parameters for the input key-value types and output types.
• Reduce method with input, output “collector”, and reporting object.
• Count the counts per word and emit (word, N).
84. Other Options
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore, allowing SQL-style manipulation of data stored on HDFS.
• Pig - A high-level scripting language for manipulating data on Hadoop.
• HBase - A NoSQL data store for random, real-time access.
87. Conclusions
88. Hadoop Benefits
• A cost-effective, scalable way to:
- Store massive data sets.
- Perform arbitrary analyses on those data sets.
89. Hadoop Tools
• Offers a variety of tools for:
- Application development.
- Integration with other platforms (e.g., databases).
90. Hadoop Distributions
• A rich, open-source ecosystem.
- Free to use.
- Commercially-supported distributions.
91. Thank You!
- Feel free to contact me at
‣ ryan.tabora@thinkbiganalytics.com
- Or our solutions consultant
‣ matt.mcdevitt@thinkbiganalytics.com
- As always, THINK BIG!
92. Bonus Content
93. The Hadoop Ecosystem
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like language with a metastore, allowing SQL-style manipulation of data stored on HDFS.
• Pig - A high-level scripting language for manipulating data on Hadoop.
• HBase - A NoSQL data store for random, real-time access.
95. Hive: SQL for Hadoop
97. Hive
Let’s look at WordCount written in Hive, the SQL for Hadoop.
98. CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'docs'
OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s')) AS word
   FROM docs) w
GROUP BY word ORDER BY word;
99. CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs'
OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, 's')) AS word
FROM docs) w
GROUP BY word ORDER BY word; Let’s drill into this code...
Copyright © 2012-2013, Think Big Analytics, All
59 Rights Reserved
Thursday, January 10, 13
100. CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs'
OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, 's')) AS word
FROM docs) w
GROUP BY word ORDER BY word;
Copyright © 2012-2013, Think Big Analytics, All
60 Rights Reserved
Thursday, January 10, 13
101. Create a table to hold the raw text we’re counting. Each line is a “column”:

CREATE TABLE docs (line STRING);
102. Load the text in the “docs” directory into the table:

LOAD DATA INPATH 'docs'
OVERWRITE INTO TABLE docs;
103. Create the final table and fill it with the results from a nested query of the docs table that performs WordCount on the fly:

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word
FROM docs) w
GROUP BY word ORDER BY word;
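The query’s logic, splitting each line on whitespace, exploding the array into one row per word, then grouping and counting, can be sketched in plain Python. The `docs` list below is hypothetical sample data standing in for the rows of the `docs` table:

```python
from collections import Counter

# Hypothetical stand-in for the rows of the docs table (one line per row).
docs = ["the quick brown fox", "the lazy dog"]

# explode(split(line, '\\s')): one row per whitespace-separated word.
words = [word for line in docs for word in line.split()]

# GROUP BY word with count(1), then ORDER BY word.
word_counts = sorted(Counter(words).items())

print(word_counts)
```

Hive compiles the real query into one or more MapReduce jobs; this sketch only mirrors the relational logic on a single machine.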
104. Hive
105. Hive
Because so many Hadoop users come from SQL backgrounds, Hive is one of the most essential tools in the ecosystem!
106. The Hadoop Ecosystem
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL-like syntax with a metastore that allows SQL manipulation of data stored on HDFS.
• Pig - A data flow scripting language for manipulating data on HDFS.
• HBase - A NoSQL data store supporting random (non-sequential) access.
108. Pig: Data Flow for Hadoop
109. Pig
110. Pig
Let’s look at WordCount written in Pig, the Data Flow language for Hadoop.
111. inpt = LOAD 'docs' USING TextLoader
AS (line:chararray);

words = FOREACH inpt
GENERATE flatten(TOKENIZE(line)) AS word;

grpd = GROUP words BY word;

cntd = FOREACH grpd
GENERATE group, COUNT(words);

STORE cntd INTO 'output';
112. Let’s drill into this code...
114. Like the Hive example, load the “docs” content; each line is a “field”:

inpt = LOAD 'docs' USING TextLoader
AS (line:chararray);
115. Tokenize each line into words (an array) and “flatten” the array into separate records:

words = FOREACH inpt
GENERATE flatten(TOKENIZE(line)) AS word;
116. Collect the same words together:

grpd = GROUP words BY word;
117. Count each word:

cntd = FOREACH grpd
GENERATE group, COUNT(words);
118. Save the results. Profit!

STORE cntd INTO 'output';
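The same dataflow (tokenize, flatten, group into bags, count each bag) can be sketched step by step in Python; `inpt` is hypothetical sample data standing in for the lines loaded from 'docs':

```python
from collections import defaultdict

# Hypothetical stand-in for the lines loaded from 'docs'.
inpt = ["the quick brown fox", "the lazy dog"]

# FOREACH inpt GENERATE flatten(TOKENIZE(line)): one record per word.
words = [word for line in inpt for word in line.split()]

# GROUP words BY word: collect identical words into bags keyed by the word.
grpd = defaultdict(list)
for word in words:
    grpd[word].append(word)

# FOREACH grpd GENERATE group, COUNT(words): count the bag for each group.
cntd = {group: len(bag) for group, bag in grpd.items()}

print(cntd)
```

Note how each Pig relation (`words`, `grpd`, `cntd`) maps to one intermediate value here; Pig compiles the real script into MapReduce jobs, with the GROUP step performed by the shuffle.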
119. Pig
120. Pig
Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.
121. Questions?