1. Big DataProblem
Map-Reduce
Computer Lab
Big Data processing using MapReduce
N. Venkatesh
Asst. Professor of CSE
JNTUK University College of Engineering, Vizianagaram
Email: nvenkatesh@jntukucev.ac.in
27 January 2015
N.Venkatesh Big Data processing using MapReduce
Table of contents
1 Big Data Problem
2 Map-Reduce
Little-bit About Map-Reduce
Elements of MapReduce Programs
Hello World and More
3 Computer Lab
Data is growing much faster than computation speeds.
Reasons: data sources such as the web, sensors, telescopes, RFID, and mobile devices; cheaper storage.
Motivation
Distribute the data across a set of nodes that are connected in a network.
The question is: how do we program this? Issues while programming: how to divide the work across nodes (scheduling), and how to deal with node failures and stragglers ("my machine got stuck, yaar").
Data-parallel model: automatically takes care of scheduling and node failures. A good example of a data-parallel model is Map-Reduce.
It was invented by engineers at Google as a system for building the search index.
Map-Reduce Programming Model
Data type: key-value records
Map function:
(K_in, V_in) ⇒ list(K_inter, V_inter)
Reduce function:
(K_inter, list(V_inter)) ⇒ list(K_out, V_out)
Keys and values can be of any type.
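The two signatures above can be sketched in plain Java without any cluster. This is only an in-memory illustration under stated assumptions: the class `MiniMapReduce`, its `Pair` type, and the `run` and `demo` methods are hypothetical names invented here, not part of Hadoop or of the original slides. The demo job sums even and odd numbers separately.

```java
import java.util.*;
import java.util.function.BiFunction;

// Hypothetical in-memory illustration; not part of any framework.
class MiniMapReduce {
    // One (key, value) record.
    static final class Pair<K, V> {
        final K key;
        final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    // mapFn:    (K_in, V_in)             => list(K_inter, V_inter)
    // reduceFn: (K_inter, list(V_inter)) => list(K_out, V_out)
    static <KI, VI, KM extends Comparable<KM>, VM, KO, VO> List<Pair<KO, VO>> run(
            List<Pair<KI, VI>> input,
            BiFunction<KI, VI, List<Pair<KM, VM>>> mapFn,
            BiFunction<KM, List<VM>, List<Pair<KO, VO>>> reduceFn) {
        // Shuffle: group every intermediate value under its key (keys kept sorted).
        Map<KM, List<VM>> groups = new TreeMap<>();
        for (Pair<KI, VI> rec : input) {
            for (Pair<KM, VM> out : mapFn.apply(rec.key, rec.value)) {
                groups.computeIfAbsent(out.key, k -> new ArrayList<>()).add(out.value);
            }
        }
        // Reduce: called once per intermediate key.
        List<Pair<KO, VO>> result = new ArrayList<>();
        for (Map.Entry<KM, List<VM>> group : groups.entrySet()) {
            result.addAll(reduceFn.apply(group.getKey(), group.getValue()));
        }
        return result;
    }

    // Demo job: sum the even and the odd numbers separately.
    static String demo() {
        List<Pair<Integer, Integer>> input = Arrays.asList(
                new Pair<>(0, 1), new Pair<>(1, 2), new Pair<>(2, 3), new Pair<>(3, 4));
        // Map ignores the input key and emits ("even" | "odd", n).
        BiFunction<Integer, Integer, List<Pair<String, Integer>>> mapFn =
                (id, n) -> Collections.singletonList(new Pair<>(n % 2 == 0 ? "even" : "odd", n));
        // Reduce adds up all values that share a key.
        BiFunction<String, List<Integer>, List<Pair<String, Integer>>> reduceFn =
                (k, vs) -> {
                    int sum = 0;
                    for (int v : vs) sum += v;
                    return Collections.singletonList(new Pair<>(k, sum));
                };
        StringBuilder sb = new StringBuilder();
        for (Pair<String, Integer> p : run(input, mapFn, reduceFn)) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(p.key).append('=').append(p.value);
        }
        return sb.toString();
    }
}
```

Calling `MiniMapReduce.demo()` returns `even=6, odd=4`. A real framework runs the same three steps (map, shuffle, reduce) across many nodes and adds scheduling and failure handling.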
Writing MapReduce from Scratch
Whatever the data set and whatever the application, everything should be converted into <Key K, Value V> pairs.
InputFormat<K,V>
Defines the input splits and the record reader, and so the input to the Mapper.
Mapper<K,V,K,V>
Uses the map function to produce intermediate <K,V> pairs.
Combiner<K,V,K,V> and Partitioner<K,V>
Combiner: locally aggregates multiple values associated with the same key on the same mapper.
Partitioner: partitions the key space based on the number of reducers.
Reducer<K,V,K,V>
Uses the reduce function (executed once per key).
OutputFormat<K,V> and Driver (containing the main function with the job details).
The beauty is that everything can be customized, including the key and value types.
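The roles of the Combiner and the Partitioner can be sketched in plain Java. This is a minimal illustration, assuming string keys and counts of 1 per emitted word; the class `LocalAggregation` and its method names are hypothetical and are not Hadoop classes. The partition formula follows the usual hash-partitioning idea of taking a non-negative hash modulo the number of reducers.

```java
import java.util.*;

// Hypothetical illustration of local aggregation and key-space partitioning.
class LocalAggregation {
    // Combiner idea: sum the values one mapper emitted for the same key
    // before anything is sent over the network.
    static Map<String, Long> combine(List<String> emittedWords) {
        Map<String, Long> partial = new TreeMap<>();
        for (String w : emittedWords) {
            partial.merge(w, 1L, Long::sum);
        }
        return partial;
    }

    // Partitioner idea: hash the key into one of numReducers buckets.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because the partition depends only on the key, every value for a given key reaches the same reducer, which is what allows reduce to run once per key.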
Deep Dive into the MapReduce Hello World (WordCount)
Definition: find the number of occurrences of every word in a document or set of documents.
Solution (think!): how do we convert this into a problem that starts from some <key,value> pairs and ends at <key,value> pairs, i.e. each word and its count?
Select an InputFormat that fits your needs; if none does, write a customized InputFormat.
Write the map function of the Mapper based on the records returned by the RecordReader of the InputFormat.
Add an optional Combiner if there is any possibility of local aggregation.
Write the reduce function of the Reducer based on the intermediate <key, list(values)> groups produced by the mappers.
Mapper of WordCount
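import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Reused output objects avoid allocating a new Writable per record.
    private final LongWritable one = new LongWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // With the default text input, key is the line's byte offset and value is the line.
        String line = value.toString();
        String[] wordsInLine = line.trim().split("\\s+");
        for (String w : wordsInLine) {
            if (w.isEmpty()) continue; // skip the empty token of a blank line
            word.set(w);
            context.write(word, one);
        }
    }
}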
Reducer of WordCount
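import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    // Reused output value holding the total count for the current word.
    private final LongWritable totalWC = new LongWritable();

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long wordCount = 0;
        for (LongWritable val : values) {
            wordCount += val.get(); // unwrap the Writable before adding
        }
        totalWC.set(wordCount);
        context.write(key, totalWC);
    }
}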
Job Configuration and Job Submission
Jobs are controlled through Configuration and Job class objects.
A Configuration is a map from attribute names to string values, specified by using either set or addResource:
conf.set(propName, propValue); or
conf.addResource(pathToPropertiesFile);
conf.set("mapreduce.job.jar", "/home/hadoop/x.jar");
A Job object takes the Configuration object and parameters such as the input path, output path, Mapper, Reducer, etc.
Job Driver Example
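import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.jar", "x.jar");
        conf.addResource(new Path("conf.xml"));
        Job job = Job.getInstance(conf, "xyz");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        // Input and output paths are set through the file-based format classes.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Block until the job finishes; exit non-zero on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}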
Building a Word Co-occurrence Matrix from a Large Corpus
In general, a co-occurrence matrix tracks, for each event, which other events occur within a given window of time or space. In this context the events are words, and the window is the relative position around the target word.
Example: "The way to love anything is to realize that it may be lost"
The co-occurrences for the word "love" are [way, to, anything, is] for window size 2.
Solution (think!): how do we convert this into a problem that starts from some <key,value> pairs and ends at <key,value> pairs, i.e. each pair of words and its count?
This is similar to WordCount; the only difference is that <word,neighbor> should be the mapper output key instead of <word>.
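The <word,neighbor> idea can be tried out in plain Java before writing the Hadoop job. The sketch below is illustrative only: the class `CoOccurrence`, the `"word,neighbor"` string encoding of the key, and the lowercasing of the input are assumptions made here, not something the slides prescribe.

```java
import java.util.*;

// Hypothetical plain-Java sketch of the co-occurrence map and reduce steps.
class CoOccurrence {
    // Map step: emit one "word,neighbor" key for every neighbor within the window.
    static List<String> mapLine(String line, int window) {
        String[] words = line.toLowerCase().trim().split("\\s+");
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(words.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    keys.add(words[i] + "," + words[j]);
                }
            }
        }
        return keys;
    }

    // Reduce step: each emitted key carries an implicit count of 1; sum per key.
    static Map<String, Integer> countPairs(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String k : keys) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts;
    }
}
```

For the example sentence with window size 2, the keys emitted for "love" are `love,way`, `love,to`, `love,anything`, and `love,is`, matching the neighbor list on the slide. The pair `to,anything` is counted twice because "to" appears twice in the sentence.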
More Examples
Simple search: display all the lines that contain a given word.
Map: filter the lines that contain the word.
Reduce: identity reducer.
A typical map-only job.
Sort the list of words according to their counts:
Nontrivial: two jobs, or add a custom key (word, count).
Find the number of lines in a file:
Tricky one: ....
Interesting Examples
Social networking site common friends list: when you visit someone’s profile, you see a list of friends that you have in common. This list doesn’t change frequently, so it would be wasteful to recalculate it every time you visit the profile.
venky: priya mouni santosh suresh kumar
suresh: santi kumar santosh srinivas divakar ....
When venky visits suresh’s profile, he should get two common friends: santosh, kumar
Map:
key (venky, suresh) : venky’s friends, emitted after processing venky’s friend list
key (venky, suresh) : suresh’s friends, emitted after processing suresh’s friend list
Reduce:
(venky, suresh): intersection of venky’s and suresh’s friends.
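The map and reduce steps for common friends can be sketched in plain Java as follows. This is an illustration under assumptions: the class `CommonFriends` and the `"a,b"` encoding of a sorted pair key are hypothetical, and the demo adds venky to suresh’s list (the slide elides the rest of that list) so that the pair key receives both friend lists.

```java
import java.util.*;

// Hypothetical plain-Java sketch of the common-friends job.
class CommonFriends {
    // Map: for each friend f of person p, emit (sorted pair of p and f) -> p's friend list.
    static Map<String, List<Set<String>>> mapPhase(Map<String, List<String>> friendLists) {
        Map<String, List<Set<String>>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : friendLists.entrySet()) {
            String person = e.getKey();
            for (String friend : e.getValue()) {
                // Sorting the two names makes (venky, suresh) and (suresh, venky) the same key.
                String key = person.compareTo(friend) < 0
                        ? person + "," + friend
                        : friend + "," + person;
                grouped.computeIfAbsent(key, k -> new ArrayList<>())
                       .add(new TreeSet<>(e.getValue()));
            }
        }
        return grouped;
    }

    // Reduce: a pair that received both friend lists gets their intersection.
    static Map<String, Set<String>> reducePhase(Map<String, List<Set<String>>> grouped) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<Set<String>>> e : grouped.entrySet()) {
            List<Set<String>> lists = e.getValue();
            if (lists.size() < 2) continue; // only one side's list was seen
            Set<String> common = new TreeSet<>(lists.get(0));
            common.retainAll(lists.get(1));
            result.put(e.getKey(), common);
        }
        return result;
    }

    // Demo with the slide's two records; "venky" in suresh's list is an assumption.
    static String demo() {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("venky", Arrays.asList("priya", "mouni", "santosh", "suresh", "kumar"));
        input.put("suresh", Arrays.asList("santi", "kumar", "santosh", "srinivas", "divakar", "venky"));
        return reducePhase(mapPhase(input)).toString();
    }
}
```

`CommonFriends.demo()` returns `{suresh,venky=[kumar, santosh]}`. The result can be stored once and looked up on each profile visit instead of being recomputed.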
Interesting Examples
Social networking site friend recommender: "people you might know" systems based on common friends.
venky: priya mouni santosh kumar srinivas
sahaja: priya mouni santi kumar santosh srinivas divakar ....
A "people you might know" system should suggest sahaja to venky and also venky to sahaja. Taking Facebook as an example, you are talking about 1 billion records; in the context of India, about 100 million records.
Map: after processing every record, recommend every friend in the list to every other friend in the list: (priya; mouni, c=venky), (mouni; priya, c=venky) ...
Precaution: they might already be friends, so also emit (venky; priya, c=null).
Reduce: combine the values for the same key: (priya: mouni 2 (venky, sahaja))
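The recommender steps above can be sketched in plain Java. This is a rough illustration, not the author’s implementation: the class `FriendRecommender`, the two-element string array used as the value, and the `demo` method are hypothetical. The demo uses only the two records from the slide, so it can recommend friends of venky and sahaja to each other, but recommending venky to sahaja would additionally need the friend lists of their common friends.

```java
import java.util.*;

// Hypothetical plain-Java sketch of the "people you might know" job.
class FriendRecommender {
    // Returns person -> (candidate -> common friends), skipping existing friends.
    static Map<String, Map<String, Set<String>>> recommend(Map<String, List<String>> friendLists) {
        // Map phase: key = person, value = {candidate, commonFriendOrNull}.
        Map<String, List<String[]>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : friendLists.entrySet()) {
            String owner = e.getKey();
            List<String> friends = e.getValue();
            for (String a : friends) {
                // Precaution marker: owner and a are already friends (c = null).
                grouped.computeIfAbsent(owner, k -> new ArrayList<>())
                       .add(new String[] {a, null});
                for (String b : friends) {
                    if (!a.equals(b)) {
                        // a might know b through their common friend, the owner.
                        grouped.computeIfAbsent(a, k -> new ArrayList<>())
                               .add(new String[] {b, owner});
                    }
                }
            }
        }
        // Reduce phase: per person, gather common friends per candidate.
        Map<String, Map<String, Set<String>>> result = new TreeMap<>();
        for (Map.Entry<String, List<String[]>> e : grouped.entrySet()) {
            Map<String, Set<String>> candidates = new TreeMap<>();
            Set<String> alreadyFriends = new HashSet<>();
            for (String[] v : e.getValue()) {
                if (v[1] == null) {
                    alreadyFriends.add(v[0]);
                } else {
                    candidates.computeIfAbsent(v[0], k -> new TreeSet<>()).add(v[1]);
                }
            }
            // Drop candidates who are already friends of this person.
            candidates.keySet().removeAll(alreadyFriends);
            if (!candidates.isEmpty()) {
                result.put(e.getKey(), candidates);
            }
        }
        return result;
    }

    // Demo with the slide's two records: whom does priya get for candidate mouni?
    static String demo() {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("venky", Arrays.asList("priya", "mouni", "santosh", "kumar", "srinivas"));
        input.put("sahaja", Arrays.asList("priya", "mouni", "santi", "kumar", "santosh", "srinivas", "divakar"));
        return recommend(input).get("priya").get("mouni").toString();
    }
}
```

`FriendRecommender.demo()` returns `[sahaja, venky]`, which corresponds to the slide’s reduce output (priya: mouni 2 (venky, sahaja)). The size of each set gives the number of common friends and can be used to rank the candidates.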
A Little Bit About the Lab Environment
Two clusters (23-node, 11-node)
Ubuntu
Windows compilation environment
Eclipse