17. 17
Hadoop v1 : drawbacks
– One NameNode : single point of failure (SPOF)
– One JobTracker : SPOF and does not scale (limit on the
number of nodes)
– MapReduce only : the platform needs to be opened to
non-MapReduce applications
18. 18
Hadoop v2
Improvements :
– HDFS v2 : NameNode high availability (standby NameNode)
– YARN (Yet Another Resource Negotiator)
● JobTracker => ResourceManager + ApplicationMaster
(one per application)
● Can be used by non-MapReduce applications
– MapReduce v2 : runs on YARN
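The JobTracker split can be pictured with a toy model. This is only a sketch and not the YARN API: the class names mirror the slide, while the method names, the container counts and the application names are assumptions made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class ResourceManager:
    """Cluster-wide role: hands out containers, knows nothing about job logic."""
    free_containers: int

    def allocate(self, requested: int) -> int:
        # Grant as many containers as are still free (toy policy, an assumption).
        granted = max(0, min(requested, self.free_containers))
        self.free_containers -= granted
        return granted


@dataclass
class ApplicationMaster:
    """Per-application role: tracks the tasks of one application only."""
    name: str
    pending_tasks: int
    containers: int = 0

    def request_containers(self, rm: ResourceManager) -> int:
        # Ask the ResourceManager for what this application still needs.
        self.containers += rm.allocate(self.pending_tasks - self.containers)
        return self.containers


rm = ResourceManager(free_containers=10)
mr_app = ApplicationMaster("mapreduce-wordcount", pending_tasks=6)   # hypothetical
other_app = ApplicationMaster("non-mr-app", pending_tasks=8)         # hypothetical

print(mr_app.request_containers(rm))     # the MapReduce application gets its 6 containers
print(other_app.request_containers(rm))  # the other application gets the 4 that remain
```

The point of the design is that scheduling state per application lives in its own master, so the cluster-wide component only arbitrates resources.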
27. 27
Pig
● With Pig, writing MR jobs becomes easy
● Dataflow model : data is the key !
● Language : Pig Latin
● No limit : User Defined Functions (UDFs)
http://pig.apache.org/docs/r0.13.0/
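As an illustration of the UDF bullet, here is a minimal sketch of a Python UDF. The function `normalize` and its schema are hypothetical. The `outputSchema` decorator comes from Pig's Python UDF support (`pig_util`), and the fallback definition is only an assumption added so the file also runs outside Pig.

```python
try:
    # Provided by Pig when the script is registered as a Python UDF file.
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        """Stand-in decorator so the sketch can be run and tested without Pig."""
        def decorator(func):
            func.output_schema = schema
            return func
        return decorator


@outputSchema("word:chararray")
def normalize(word):
    """Hypothetical UDF: lower-case a word and strip surrounding punctuation."""
    if word is None:
        return None
    return word.strip().lower().strip(".,;:!?")


print(normalize("  Hello, "))
```

According to the Pig documentation for Python UDFs, such a file would be registered with something like `REGISTER 'udfs.py' USING streaming_python AS myfuncs;` (the file name is an assumption) and then called as `myfuncs.normalize(word)` inside a FOREACH.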
28. 28
Pig
● Pig-Wordcount
lines = LOAD '/user/XXX/file.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
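To make the dataflow explicit, each Pig statement above can be mirrored by one plain Python step. This is a sketch of the semantics only, not of how Pig executes the script. The function names and the sample text are assumptions, and tokenizing on whitespace alone is a simplification of TOKENIZE.

```python
from collections import defaultdict


def load(text):
    # LOAD ... AS (line:chararray): one record per input line
    return text.splitlines()


def tokenize_flatten(lines):
    # FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word: one record per word
    return [word for line in lines for word in line.split()]


def group_by_word(words):
    # GROUP words BY word: each key holds the bag of records sharing that word
    grouped = defaultdict(list)
    for word in words:
        grouped[word].append(word)
    return dict(grouped)


def count_groups(grouped):
    # FOREACH grouped GENERATE group, COUNT(words): size of each bag
    return {group: len(bag) for group, bag in grouped.items()}


sample = "to be or not\nto be"  # hypothetical input file content
wordcount = count_groups(group_by_word(tokenize_flatten(load(sample))))
print(wordcount)  # DUMP wordcount
```

Each intermediate value plays the role of the Pig alias of the same step (lines, words, grouped, wordcount), which is what "data is the key" means in practice.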
29. 29
Pig
import …
public class WordCount2 {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    static enum CountersEnum { INPUT_WORDS }
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private boolean caseSensitive;
    private Set<String> patternsToSkip = new HashSet<String>();
    private Configuration conf;
    private BufferedReader fis;
    ...
=> 130 lines of Java code for the same word count !
30. 30
Hive
● SQL-like language : HiveQL (HQL)
● UDFs
● Hive-Wordcount
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s+')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
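The key step in this query is explode, which turns one row holding an array into one row per element. A rough Python equivalent, assuming the split is meant to be on whitespace and using made-up rows for the docs table:

```python
import re
from collections import Counter

docs = ["to be or not", "to  be"]  # hypothetical rows of the docs table

# explode(split(line, <whitespace regex>)): one output row per array element
words = [word for line in docs for word in re.split(r"\s+", line)]

# GROUP BY word with count(1), then ORDER BY word
word_counts = sorted(Counter(words).items())

for word, count in word_counts:
    print(word, count)
```

As in the query, the inner step produces the exploded rows and the outer step aggregates and sorts them.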