Distributed Computing with
Apache Hadoop
Introduction to MapReduce
Konstantin V. Shvachko
Birmingham Big Data Science Group
October 19, 2011
Computing
• The history of computing started a long time ago
• Fascination with numbers
– Vast universe with simple strict rules
– Computing devices
– Crunch numbers
• The Internet
– Universe of words, fuzzy rules
– Different type of computing
– Understand meaning of things
– Human thinking
– Errors & deviations are a part of study
Computer History Museum, Mountain View
Words vs. Numbers
• In 1997 IBM built Deep Blue supercomputer
– Playing chess game with the champion G. Kasparov
– Human race was defeated
– Strict rules for Chess
– Fast deep analyses of current state
– Still numbers
• In 2011 IBM built Watson computer to play Jeopardy
– Questions and hints in human terms
– Analysis of texts from library and the Internet
– Human champions defeated
Big Data
• Computations that need the power of many computers
– Large datasets: hundreds of TBs, PBs
– Or use of thousands of CPUs in parallel
– Or both
• Cluster as a computer
What is a PB?
1 KB = 1000 Bytes
1 MB = 1000 KB
1 GB = 1000 MB
1 TB = 1000 GB
1 PB = 1000 TB
???? = 1000 PB
Examples – Science
• Fundamental physics: Large Hadron Collider (LHC)
– Smashing high-energy protons at nearly the speed of light
– 1 PB of event data per sec, most filtered out
– 15 PB of data per year
– 150 computing centers around the World
– 160 PB of disk + 90 PB of tape storage
• Math: Big Numbers
– 2 quadrillionth (10^15) digit of π is 0
– pure CPU workload
– 12 days of cluster time
– 208 years of CPU-time on a cluster with 7600 CPU cores
• Big Data – Big Science
Examples – Web
• Search engine Webmap
– Map of the Internet
– 2008 @ Yahoo, 1500 nodes, 5 PB raw storage
• Internet Search Index
– Traditional application
• Social Network Analysis
– Intelligence
– Trends
The Sorting Problem
• Classic in-memory sorting
– Complexity: number of comparisons
• External sorting
– Cannot load all data in memory
– 16 GB RAM vs. 200 GB file
– Complexity: + disk IOs (bytes read or written)
• Distributed sorting
– Cannot load data on a single server
– 12 drives * 2 TB = 24 TB disk space vs. 200 TB data set
– Complexity: + network transfers
Algorithm     Worst        Average      Space
Bubble Sort   O(n^2)       O(n^2)       In-place
Quicksort     O(n^2)       O(n log n)   In-place
Merge Sort    O(n log n)   O(n log n)   Double
What do we do?
• Need a lot of computers
• How to make them work together
Hadoop
• Apache Hadoop is an ecosystem of tools for processing “Big Data”
• Started in 2005 by D. Cutting and M. Cafarella
• Consists of two main components: Providing unified cluster view
1. HDFS – a distributed file system
– File system API connecting thousands of drives
2. MapReduce – a framework for distributed computations
– Splitting jobs into parts executable on one node
– Scheduling and monitoring of job execution
• Today used everywhere: Becoming a standard of distributed computing
• Hadoop is an open source project
MapReduce
• MapReduce
– 2004 Jeffrey Dean, Sanjay Ghemawat. Google.
– “MapReduce: Simplified Data Processing on Large Clusters”
• Computational model
– What is a computational model?
• Examples: Turing machine, Java
– Split large input data into small enough pieces, process in parallel
• Execution framework
– Compilers, interpreters
– Scheduling, Processing, Coordination
– Failure recovery
Functional Programming
• Map a higher-order function
– applies a given function to each element of a list
– returns the list of results
• Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ]
• Example. Map( x^2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25]
Functional Programming: reduce
• Reduce / fold a higher-order function
– Iterates given function over a list of elements
– Applies function to previous result and current element
– Return single result
• Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15
• Reduce( x * y, [0,1,2,3,4,5] ) = ?
• Reduce( x * y, [0,1,2,3,4,5] ) = 0
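The map and reduce examples above can be reproduced with the Java Streams API. This is a minimal sketch, not part of the original deck; the class and method names are made up for illustration, and it assumes Java 9 or later for `List.of`.

```java
import java.util.List;
import java.util.stream.Collectors;

class MapReduceBasics {
    // Map: apply x -> x * x to every element and collect the results.
    static List<Integer> squares(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // Reduce (fold) with addition, starting from the identity 0.
    static int sum(List<Integer> xs) {
        return xs.stream().reduce(0, (x, y) -> x + y);
    }

    // Reduce (fold) with multiplication, starting from the identity 1.
    static int product(List<Integer> xs) {
        return xs.stream().reduce(1, (x, y) -> x * y);
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(0, 1, 2, 3, 4, 5);
        System.out.println(squares(xs)); // [0, 1, 4, 9, 16, 25]
        System.out.println(sum(xs));     // 15
        System.out.println(product(xs)); // 0, because the list contains 0
    }
}
```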
Example: Sum of Squares
• Composition of
– a map followed by
– a reduce applied to the results of the map
• Example.
– Map( x^2, [1,2,3,4,5] ) = [1,4,9,16,25]
– Reduce( x + y, [1,4,9,16,25] ) = ((((1 + 4) + 9) + 16) + 25) = 55
• Map easily parallelizable
– Compute x^2 for 1,2,3 on one node and for 4,5 on another
• Reduce notoriously sequential
– Need all squares at one node to compute the total sum.
Square Pyramid Number
$$1 + 4 + \dots + n^2 = \frac{n(n+1)(2n+1)}{6}$$
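A small sketch of the sum-of-squares example, with the squares computed in two chunks as if on two nodes. Because addition is associative, each chunk can also be summed locally, so only the two partial sums need to meet at one place. The class and method names are hypothetical, and the result is checked against the square pyramid formula.

```java
import java.util.List;

class SumOfSquares {
    // Map step plus a local sum: the partial sum of squares for one chunk.
    static long partialSum(List<Integer> chunk) {
        return chunk.stream().mapToLong(x -> (long) x * x).sum();
    }

    // Closed form for 1 + 4 + ... + n^2.
    static long squarePyramid(long n) {
        return n * (n + 1) * (2 * n + 1) / 6;
    }

    public static void main(String[] args) {
        long nodeA = partialSum(List.of(1, 2, 3)); // 1 + 4 + 9 = 14
        long nodeB = partialSum(List.of(4, 5));    // 16 + 25 = 41
        System.out.println(nodeA + nodeB);         // 55
        System.out.println(squarePyramid(5));      // 55
    }
}
```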
Computational Model
• MapReduce is a Parallel Computational Model
• Map-Reduce algorithm = job
• Operates with key-value pairs: (k, V)
– Primitive types, Strings or more complex Structures
• Map-Reduce job input and output is a list of pairs {(k, V)}
• MR Job as defined by 2 functions
• map: (k1; v1) → {(k2; v2)}
• reduce: (k2; {v2}) → {(k3; v3)}
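The two signatures above can be illustrated with a tiny in-memory engine. This is only a single-process sketch of the model, not how Hadoop is implemented; `MiniMapReduce`, `Pair`, `run` and `wordCount` are names invented here, and records require Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

class MiniMapReduce {
    // A key-value pair (k, v).
    record Pair<K, V>(K key, V value) {}

    // map: (k1, v1) -> {(k2, v2)};  reduce: (k2, {v2}) -> {(k3, v3)}.
    // Intermediate values are grouped by key in sorted key order before reduce runs.
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Pair<K3, V3>> run(
            List<Pair<K1, V1>> input,
            BiFunction<K1, V1, List<Pair<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<Pair<K3, V3>>> reduce) {
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Pair<K1, V1> in : input) {
            for (Pair<K2, V2> out : map.apply(in.key(), in.value())) {
                groups.computeIfAbsent(out.key(), k -> new ArrayList<>()).add(out.value());
            }
        }
        List<Pair<K3, V3>> result = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> e : groups.entrySet()) {
            result.addAll(reduce.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Example job (an illustration, not from the deck): word frequencies.
    static List<Pair<String, Integer>> wordCount(List<String> words) {
        List<Pair<Object, String>> input = new ArrayList<>();
        for (String w : words) input.add(new Pair<>(null, w)); // key is unused
        BiFunction<Object, String, List<Pair<String, Integer>>> map =
                (k, w) -> List.of(new Pair<>(w, 1));
        BiFunction<String, List<Integer>, List<Pair<String, Integer>>> reduce =
                (w, ones) -> List.of(new Pair<>(w, ones.size()));
        return run(input, map, reduce);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("dogs", "like", "cats", "dogs")));
    }
}
```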
Job Workflow
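[Figure: job workflow for the input words dogs, like, cats. Map emits (C, 3), (V, 1) for "dogs"; (C, 2), (V, 2) for "like"; (C, 3), (V, 1) for "cats". Reduce sums them to (C, 8) and (V, 4).]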
The Algorithm
Map ( null, word)
nC = Consonants(word)
nV = Vowels(word)
Emit(“Consonants”, nC)
Emit(“Vowels”, nV)
Reduce(key, {n1, n2, …})
nRes = n1 + n2 + …
Emit(key, nRes)
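The pseudocode above can be tried out in plain Java without a cluster. This sketch makes two assumptions that the slides do not state: only a, e, i, o, u count as vowels, and every other letter counts as a consonant. The map and reduce steps are fused into one loop through `Map.merge`; the class name is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ConsonantsAndVowels {
    // Assumption: the vowels are exactly a, e, i, o, u (case-insensitive).
    static int vowels(String word) {
        int n = 0;
        for (char c : word.toLowerCase().toCharArray()) {
            if ("aeiou".indexOf(c) >= 0) n++;
        }
        return n;
    }

    // Every letter that is not one of the vowels above counts as a consonant.
    static int consonants(String word) {
        int n = 0;
        for (char c : word.toLowerCase().toCharArray()) {
            if (Character.isLetter(c) && "aeiou".indexOf(c) < 0) n++;
        }
        return n;
    }

    // "Map" emits two pairs per word; "Reduce" sums the values per key.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String w : words) {
            totals.merge("Consonants", consonants(w), Integer::sum);
            totals.merge("Vowels", vowels(w), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("dogs", "like", "cats"))); // {Consonants=8, Vowels=4}
    }
}
```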
Computation Framework
• Two virtual clusters: HDFS and MapReduce
– Physically tightly coupled. Designed to work together
• Hadoop Distributed File System. View data as files and directories
• MapReduce is a Parallel Computation Framework
– Job scheduling and execution framework
HDFS Architecture Principles
• The name space is a hierarchy of files and directories
• Files are divided into blocks (typically 128 MB)
• Namespace (metadata) is decoupled from data
– Fast namespace operations, not slowed down by data streaming
• Single NameNode keeps the entire name space in RAM
• DataNodes store data blocks on local drives
• Blocks are replicated on 3 DataNodes for redundancy and availability
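As a back-of-the-envelope illustration of the block and replication figures above, the sketch below computes how many blocks a file occupies and how much raw disk its replicas use. It assumes binary units (128 MB = 128 × 2^20 bytes) and that the last block holds only the remaining bytes; the class and method names are hypothetical.

```java
class HdfsBlockMath {
    // Number of blocks for a file: ceiling of fileBytes / blockBytes.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes stored across the cluster for a given replication factor.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20; // 128 MB, assuming binary units
        long fileSize = 1L << 30;    // a 1 GB example file
        System.out.println(blockCount(fileSize, blockSize));     // 8 blocks
        System.out.println(blockCount(fileSize, blockSize) * 3); // 24 block replicas
        System.out.println(rawBytes(fileSize, 3));               // 3221225472 bytes
    }
}
```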
MapReduce Framework
• Job Input is a file or a set of files in a distributed file system (HDFS)
– Input is split into blocks of roughly the same size
– Blocks are replicated to multiple nodes
– Block holds a list of key-value pairs
• Map task is scheduled to one of the nodes containing the block
– Map task input is node-local
– Map task result is node-local
• Map task results are grouped: one group per reducer
– Each group is sorted
• Reduce task is scheduled to a node
– Reduce task transfers the targeted groups from all mapper nodes
– Computes and stores results in a separate HDFS file
• Job Output is a set of files in HDFS. With #files = #reducers
Map Reduce Example: Mean
• Mean
• Input: large text file
• Output: average length of words in the file µ
• Example: µ({dogs, like, cats}) = 4
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Mean Mapper
• Map input is the set of words {w} in the partition
– Key = null Value = w
• Map computes
– Number of words in the partition
– Total length of the words ∑length(w)
• Map output
– <“count”, #words>
– <“length”, #totalLength>
Map (null, w)
Emit(“count”, 1)
Emit(“length”, length(w))
Single Mean Reducer
• Reduce input
– {<key, {value}>}, where
– key = “count”, “length”
– value is an integer
• Reduce computes
– Total number of words: N = sum of all “count” values
– Total length of words: L = sum of all “length” values
• Reduce Output
– <“count”, N>
– <“length”, L>
• The result
– µ = L / N
Reduce(key, {n1, n2, …})
nRes = n1 + n2 + …
Emit(key, nRes)
Analyze ()
read(“part-r-00000”)
print(“mean = ” + L/N)
Mean: Mapper, Reducer
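import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordMean {
  private final static Text COUNT_KEY = new Text("count");
  private final static Text LENGTH_KEY = new Text("length");
  private final static LongWritable ONE = new LongWritable(1);
  // Mapper: emits <"length", word length> and <"count", 1> for each word.
  public static class WordMeanMapper
      extends Mapper<Object, Text, Text, LongWritable> {
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        context.write(LENGTH_KEY, new LongWritable(word.length()));
        context.write(COUNT_KEY, ONE);
      }
    }
  }
  // Reducer: sums the values of each key; also used as the combiner.
  public static class WordMeanReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterable<LongWritable> values,
        Context context) throws IOException, InterruptedException {
      long sum = 0; // long matches LongWritable and avoids int overflow
      for (LongWritable val : values)
        sum += val.get();
      context.write(key, new LongWritable(sum));
    }
  }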
. . . . . . . . . . . . . . . .
Mean: main()
. . . . . . . . . . . . . . . .
  // waitForCompletion also throws InterruptedException and
  // ClassNotFoundException, hence the broader throws clause.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(
        conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordmean <in> <out>");
      System.exit(2);
    }
    // Later Hadoop releases prefer Job.getInstance(conf, "word mean").
    Job job = new Job(conf, "word mean");
    job.setJarByClass(WordMean.class);
    job.setMapperClass(WordMeanMapper.class);
    job.setReducerClass(WordMeanReducer.class);
    job.setCombinerClass(WordMeanReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setNumReduceTasks(1); // single reducer: all output in part-r-00000
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    Path outputpath = new Path(otherArgs[1]);
    FileOutputFormat.setOutputPath(job, outputpath);
    boolean result = job.waitForCompletion(true);
    if (result) analyzeResult(outputpath); // only read output of a successful job
    System.exit(result ? 0 : 1);
  }
. . . . . . . . . . . . . . . .
Mean: analyzeResult()
. . . . . . . . . . . . . . . .
  // Reads the single reducer output file and prints mean = length / count.
  private static void analyzeResult(Path outDir) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path reduceFile = new Path(outDir, "part-r-00000");
    if (!fs.exists(reduceFile)) return;
    long count = 0, length = 0;
    // try-with-resources closes the stream even if parsing fails
    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(reduceFile)))) {
      String line;
      while ((line = in.readLine()) != null) {
        StringTokenizer st = new StringTokenizer(line);
        if (st.countTokens() < 2) continue; // skip blank or malformed lines
        String key = st.nextToken();
        String value = st.nextToken();
        if (key.equals("count")) count = Long.parseLong(value);
        else if (key.equals("length")) length = Long.parseLong(value);
      }
    }
    double average = (double) length / count;
    System.out.println("The mean is: " + average);
  }
} // end WordMean
MapReduce Implementation
• Single master JobTracker shepherds the distributed herd of TaskTrackers
1. Job scheduling and resource allocation
2. Job monitoring and job lifecycle coordination
3. Cluster health and resource tracking
• Job is defined
– Program: myJob.jar file
– Configuration: conf.xml
– Input, output paths
• JobClient submits the job to the JobTracker
– Calculates and creates splits based on the input
– Writes myJob.jar and conf.xml to HDFS
MapReduce Implementation
• JobTracker divides the job into tasks: one map task per split.
– Assigns a TaskTracker for each task, collocated with the split
• TaskTrackers execute tasks and report status to the JobTracker
– TaskTracker can run multiple map and reduce tasks
– Map and Reduce Slots
• Failed attempts reassigned to other TaskTrackers
• Job execution status and results reported back to the client
• Scheduler lets many jobs run in parallel
Example: Standard Deviation
• Standard deviation
• Input: large text file
• Output: standard deviation σ of word lengths
• Example: σ({dogs, like, cats}) = 0
• How many jobs
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$
Standard Deviation: Hint
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - 2\mu\cdot\frac{1}{n}\sum_{i=1}^{n}x_i + \mu^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \mu^2$$
Standard Deviation Mapper
• Map input is the set of words {w} in the partition
– Key = null Value = w
• Map computes
– Number of words in the partition
– Total length of the words ∑length(w)
– The sum of lengths squared ∑length(w)^2
• Map output
– <“count”, #words>
– <“length”, #totalLength>
– <“squared”, #sumLengthSquared>
Map (null, w)
Emit(“count”, 1)
Emit(“length”, length(w))
Emit(“squared”, length(w)^2)
Standard Deviation Reducer
• Reduce input
– {<key, {value}>}, where
– key = “count”, “length”, “squared”
– value is an integer
• Reduce computes
– Total number of words: N = sum of all “count” values
– Total length of words: L = sum of all “length” values
– Sum of length squares: S = sum of all “squared” values
• Reduce Output
– <“count”, N>
– <“length”, L>
– <“squared”, S>
• The result
– µ = L / N
– σ = sqrt(S / N - µ^2)
Reduce(key, {n1, n2, …})
nRes = n1 + n2 + …
Emit(key, nRes)
Analyze ()
read(“part-r-00000”)
print(“mean = ” + L/N)
print(“std.dev = ” +
sqrt(S/N – (L*L) / (N*N)))
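The three sums N, L and S can be computed per partition and then added together, which is why a single job suffices. The sketch below shows this in plain Java under stated assumptions: `WordStats` and `Stats` are hypothetical names, records require Java 16 or later, and the variance is clamped at zero to guard against floating-point rounding.

```java
import java.util.List;

class WordStats {
    // Partial aggregate as produced for one partition: N, L and S.
    record Stats(long count, long length, long squared) {
        static Stats of(List<String> words) {
            long n = 0, l = 0, s = 0;
            for (String w : words) {
                long len = w.length();
                n += 1;
                l += len;
                s += len * len;
            }
            return new Stats(n, l, s);
        }

        // Reduce step: component-wise sum of two partial aggregates.
        Stats plus(Stats o) {
            return new Stats(count + o.count, length + o.length, squared + o.squared);
        }

        double mean() {
            return (double) length / count;
        }

        // sigma = sqrt(S / N - mu^2); Math.max guards against tiny negative rounding errors.
        double stdDev() {
            double mu = mean();
            return Math.sqrt(Math.max(0.0, (double) squared / count - mu * mu));
        }
    }

    public static void main(String[] args) {
        Stats a = Stats.of(List.of("dogs", "like")); // first partition
        Stats b = Stats.of(List.of("cats"));         // second partition
        Stats total = a.plus(b);
        System.out.println(total.mean());   // 4.0
        System.out.println(total.stdDev()); // 0.0
    }
}
```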
Combiner, Partitioner
• Combiners perform local aggregation before the shuffle & sort phase
– Optimization to reduce data transfers during shuffle
– In the Mean example a mapper can send as few as two key-value pairs instead of two per word
• Partitioners assign intermediate (map) key-value pairs to reducers
– Responsible for dividing up the intermediate key space
– Not used with single Reducer
[Figure: job pipeline with the stages Input, Map, Combiner, Partitioner, Shuffle & sort, Reduce, Output]
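To make the effect of a combiner concrete, the sketch below builds the map output of the Mean example for one partition and then aggregates it locally. It is a plain-Java illustration with hypothetical names, not Hadoop API code; in a real job the framework decides when the combiner runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    // Mapper output for one partition: two pairs per word.
    static List<Map.Entry<String, Long>> mapOutput(List<String> words) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String w : words) {
            out.add(Map.entry("count", 1L));
            out.add(Map.entry("length", (long) w.length()));
        }
        return out;
    }

    // Combiner: sums values per key locally, before the shuffle.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> combined = new TreeMap<>();
        for (Map.Entry<String, Long> kv : pairs) {
            combined.merge(kv.getKey(), kv.getValue(), Long::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> pairs = mapOutput(List.of("dogs", "like", "cats"));
        System.out.println(pairs.size());   // 6 pairs before combining
        System.out.println(combine(pairs)); // {count=3, length=12}: 2 pairs to transfer
    }
}
```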
Distributed Sorting
• Sort a dataset, which cannot be entirely stored on one node.
• Input:
– Set of files. 100 byte records.
– The first 10 bytes of each record is the key and the rest is the value.
• Output:
– Ordered list of files: f1, … fN
– Each file fi is sorted, and
– If i < j then for any keys k ∈ fi and r ∈ fj: k ≤ r
– Concatenation of files in the given order must form a completely sorted record set
Naïve MapReduce Sorting
• If the output could be stored on one node
• The input to any Reducer is always sorted by key
– Shuffle sorts Map outputs
• One identity Mapper and one identity Reducer would do the trick
– Identity: <k,v> → <k,v>
[Figure: Input (dogs, like, cats) → Map → Shuffle → Reduce → Output (cats, dogs, like)]
Naïve Sorting: Multiple Maps
• Multiple identity Mappers and one identity Reducer – same result
– Does not work for multiple Reducers
[Figure: Input (dogs, like, cats) → three Map tasks → Shuffle → one Reduce → Output (cats, dogs, like)]
Sorting: Generalization
• Define a hash function, such that
– h: {k} → [1,N]
– Preserves the order: k ≤ s → h(k) ≤ h(s)
– h(k) is a fixed size prefix of string k (2 first bytes)
• Identity Mapper
• With a specialized Partitioner
– Computes the hash h(k) of the key and assigns <k,v> to reducer R_h(k)
• Identity Reducer
– Number of reducers is N: R1, …, RN
– Inputs for Ri are all pairs whose key k satisfies h(k) = i
– Ri is an identity reducer, which writes output to HDFS file fi
– Hash function choice guarantees that
keys from fi are less than keys from fj if i < j
• The algorithm was implemented to win Gray’s Terasort Benchmark in 2008
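One possible order-preserving h is sketched below: it reads the first two bytes of the key as an unsigned number and scales it onto the reducer range. This is an illustration under assumptions, not the benchmark implementation: reducers are numbered from 0 rather than 1, and keys are assumed to be spread evenly over the prefix range, which real data often is not. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class PrefixPartitioner {
    // Interprets the first two bytes of the key as an unsigned 16-bit number.
    // Missing bytes of very short keys are treated as 0.
    static int prefix(byte[] key) {
        int hi = key.length > 0 ? key[0] & 0xFF : 0;
        int lo = key.length > 1 ? key[1] & 0xFF : 0;
        return (hi << 8) | lo;
    }

    // Maps the prefix range [0, 65535] monotonically onto reducers 0..numReducers-1,
    // so k <= s (compared as unsigned bytes) implies partition(k) <= partition(s).
    static int partition(byte[] key, int numReducers) {
        return (int) ((long) prefix(key) * numReducers / 65536);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"cats", "dogs", "like"}) {
            byte[] key = w.getBytes(StandardCharsets.US_ASCII);
            // With 256 reducers the partition equals the first byte of the key.
            System.out.println(w + " -> reducer " + partition(key, 256));
        }
    }
}
```

With few reducers and ASCII text, most keys land in the same partition, which is why a skew-aware choice of split points matters in practice.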
Undirected Graphs
• “A Discipline of Programming” E. W. Dijkstra. Ch. 23.
– Good old classics
• Graph is defined by V = {v}, E = {<v,w> | v,w ∈ V}
• Undirected graph. E is symmetrical, that is <v,w> ∈ E ≡ <w,v> ∈ E
• Different representations of E
1. Set of pairs
2. <v, {direct neighbors}>
3. Adjacency matrix
• From 1 to 2 in one MR job
– Identity Mapper
– Combiner = Reducer
– Reducer joins values for each vertex
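The conversion from representation 1 to representation 2 amounts to grouping edges by vertex. The sketch below does this in memory; it assumes each undirected edge may be listed only once, so it records both directions itself. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class AdjacencyLists {
    // Groups a set of pairs <v,w> into <v, {direct neighbors}>.
    static Map<Integer, Set<Integer>> fromEdges(List<int[]> edges) {
        Map<Integer, Set<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            // Record both directions, since the graph is undirected.
            adj.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        return adj;
    }

    public static void main(String[] args) {
        List<int[]> edges = List.of(new int[] {1, 2}, new int[] {2, 3}, new int[] {4, 5});
        System.out.println(fromEdges(edges)); // {1=[2], 2=[1, 3], 3=[2], 4=[5], 5=[4]}
    }
}
```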
Connected Components
• Partition set of nodes V into disjoint subsets V1, …, VN
– V = V1 ∪ … ∪ VN
– No paths using E from Vi to Vj if i ≠ j
– Gi = <Vi, Ei >
• Representation of connected component
– key = min{Vi}
– value = Vi
• Chain of MR jobs
• Initial data representation
– E is partitioned into sets of records (blocks)
– <v,w> ∈ E → <min(v,w), {v,w}> = <k, C>
MR Connected Components
• Mapper / Reducer Input
– {<k, C>}, where C is a subset of V, k = min(C)
• Mapper
• Reducer
• Iterate. Stop when stabilized
Map {<k, C>}
  For all <ki, Ci> and <kj, Cj>
    if Ci ∩ Cj ≠ ∅ then
      C = Ci ∪ Cj
      Emit(min(C), C)
Reduce(k, {C1, C2, …})
  resC = C1 ∪ C2 ∪ …
  Emit(k, resC)
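The iterate-until-stable idea can be tried sequentially in memory. The sketch below is not a distributed job and not the author's implementation: it starts from one set {v,w} per edge, merges overlapping sets pass by pass, and stops when a pass changes nothing. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class ConnectedComponents {
    // Returns <min(Vi), Vi> for every connected component that has at least one edge.
    static Map<Integer, TreeSet<Integer>> components(List<int[]> edges) {
        // Initial representation: <v,w> becomes the set {v,w}.
        List<TreeSet<Integer>> sets = new ArrayList<>();
        for (int[] e : edges) {
            sets.add(new TreeSet<>(List.of(e[0], e[1])));
        }
        boolean changed = true;
        while (changed) { // iterate, stop when stabilized
            changed = false;
            List<TreeSet<Integer>> merged = new ArrayList<>();
            for (TreeSet<Integer> c : sets) {
                TreeSet<Integer> target = null;
                for (TreeSet<Integer> m : merged) {
                    if (!Collections.disjoint(m, c)) { // Ci ∩ Cj ≠ ∅
                        target = m;
                        break;
                    }
                }
                if (target == null) {
                    merged.add(new TreeSet<>(c));
                } else {
                    target.addAll(c); // C = Ci ∪ Cj
                    changed = true;
                }
            }
            sets = merged;
        }
        Map<Integer, TreeSet<Integer>> result = new TreeMap<>();
        for (TreeSet<Integer> c : sets) {
            result.put(c.first(), c); // key = min(C)
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> edges = List.of(
                new int[] {1, 2}, new int[] {2, 3}, new int[] {4, 5}, new int[] {6, 4});
        System.out.println(components(edges)); // {1=[1, 2, 3], 4=[4, 5, 6]}
    }
}
```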
The End
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 

Distributed Computing with Apache Hadoop. Introduction to MapReduce.

  • 1. Distributed Computing with Apache Hadoop Introduction to MapReduce Konstantin V. Shvachko Birmingham Big Data Science Group October 19, 2011
  • 2. Computing • The history of computing started a long time ago • Fascination with numbers – Vast universe with simple strict rules – Computing devices – Crunch numbers • The Internet – Universe of words, fuzzy rules – Different type of computing – Understand meaning of things – Human thinking – Errors & deviations are a part of study [Image: Computer History Museum, San Jose]
  • 3. Words vs. Numbers • In 1997 IBM built the Deep Blue supercomputer – Played chess against the champion G. Kasparov – Human race was defeated – Strict rules for chess – Fast, deep analysis of the current state – Still numbers • In 2011 IBM built the Watson computer to play Jeopardy – Questions and hints in human terms – Analysis of texts from libraries and the Internet – Human champions defeated
  • 4. Big Data • Computations that need the power of many computers – Large datasets: hundreds of TBs, PBs – Or use of thousands of CPUs in parallel – Or both • Cluster as a computer • What is a PB? 1 KB = 1000 Bytes, 1 MB = 1000 KB, 1 GB = 1000 MB, 1 TB = 1000 GB, 1 PB = 1000 TB, ???? = 1000 PB
  • 5. Examples – Science • Fundamental physics: Large Hadron Collider (LHC) – Smashing high-energy protons at nearly the speed of light – 1 PB of event data per sec, most filtered out – 15 PB of data per year – 150 computing centers around the world – 160 PB of disk + 90 PB of tape storage • Math: Big Numbers – The 2 quadrillionth (quadrillion = 10^15) digit of π is 0 – Pure CPU workload – 12 days of cluster time – 208 years of CPU-time on a cluster with 7600 CPU cores • Big Data – Big Science
  • 6. Examples – Web • Search engine Webmap – Map of the Internet – 2008 @ Yahoo, 1500 nodes, 5 PB raw storage • Internet Search Index – Traditional application • Social Network Analysis – Intelligence – Trends
  • 7. The Sorting Problem • Classic in-memory sorting – Complexity: number of comparisons • External sorting – Cannot load all data in memory – 16 GB RAM vs. 200 GB file – Complexity: + disk IOs (bytes read or written) • Distributed sorting – Cannot load data on a single server – 12 drives * 2 TB = 24 TB disk space vs. 200 TB data set – Complexity: + network transfers
      Algorithm     Worst        Average      Space
      Bubble Sort   O(n^2)       O(n^2)       In-place
      Quicksort     O(n^2)       O(n log n)   In-place
      Merge Sort    O(n log n)   O(n log n)   Double
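The external-sorting case can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: the class name `ExternalSortSketch` and the `memLimit` parameter are hypothetical, and the sorted runs are kept in lists here, whereas a real external sort would write each run to disk and stream it back during the merge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSortSketch {
    // Phase 1: cut the input into runs of at most memLimit items and sort each run "in memory".
    public static List<List<Integer>> sortedRuns(List<Integer> input, int memLimit) {
        List<List<Integer>> runs = new ArrayList<>();
        for (int i = 0; i < input.size(); i += memLimit) {
            List<Integer> run = new ArrayList<>(input.subList(i, Math.min(i + memLimit, input.size())));
            Collections.sort(run);
            runs.add(run);
        }
        return runs;
    }

    // Phase 2: k-way merge. The heap holds one cursor {value, runIndex, offset} per run,
    // so only one element per run has to be resident at any time.
    public static List<Integer> merge(List<List<Integer>> runs) {
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) heap.add(new int[] {runs.get(r).get(0), r, 0});
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            List<Integer> run = runs.get(top[1]);
            int next = top[2] + 1;
            if (next < run.size()) heap.add(new int[] {run.get(next), top[1], next});
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(5, 3, 9, 1, 7, 2, 8);
        System.out.println(merge(sortedRuns(data, 3)));
    }
}
```

Every element is read and written once per phase, which is where the extra disk-IO term in the complexity comes from.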
  • 8. What do we do? • Need a lot of computers • How to make them work together?
  • 9. Hadoop • Apache Hadoop is an ecosystem of tools for processing “Big Data” • Started in 2005 by D. Cutting and M. Cafarella • Consists of two main components that provide a unified cluster view: 1. HDFS – a distributed file system – File system API connecting thousands of drives 2. MapReduce – a framework for distributed computations – Splitting jobs into parts executable on one node – Scheduling and monitoring of job execution • Today used everywhere: becoming a standard of distributed computing • Hadoop is an open source project
  • 10. MapReduce • MapReduce – 2004 Jeffrey Dean, Sanjay Ghemawat. Google. – “MapReduce: Simplified Data Processing on Large Clusters” • Computational model – What is a computational model? • Turing machine, Java – Split large input data into small enough pieces, process in parallel • Execution framework – Compilers, interpreters – Scheduling, Processing, Coordination – Failure recovery
  • 11. Functional Programming • Map: a higher-order function – applies a given function to each element of a list – returns the list of results • Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ] • Example. Map( x^2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25]
  • 12. Functional Programming: reduce • Map: a higher-order function – applies a given function to each element of a list – returns the list of results • Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ] • Example. Map( x^2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25] • Reduce / fold: a higher-order function – Iterates a given function over a list of elements – Applies the function to the previous result and the current element – Returns a single result • Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15
  • 13. Functional Programming • Map: a higher-order function – applies a given function to each element of a list – returns the list of results • Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ] • Example. Map( x^2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25] • Reduce / fold: a higher-order function – Iterates a given function over a list of elements – Applies the function to the previous result and the current element – Returns a single result • Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15 • Reduce( x * y, [0,1,2,3,4,5] ) = ?
  • 14. Functional Programming • Map: a higher-order function – applies a given function to each element of a list – returns the list of results • Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ] • Example. Map( x^2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25] • Reduce / fold: a higher-order function – Iterates a given function over a list of elements – Applies the function to the previous result and the current element – Returns a single result • Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15 • Reduce( x * y, [0,1,2,3,4,5] ) = 0
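The same map and reduce operations exist in the Java streams API, so the three examples can be checked directly. This is a minimal sketch; the class and method names are hypothetical, and the stream version of reduce takes an explicit starting value (0 for sums, 1 for products), which the slide notation leaves out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceBasics {
    // Map: apply f(x) = x^2 to every element and return the list of results.
    public static List<Integer> squares(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // Reduce (fold): combine the running result with each element in turn, starting from 0.
    public static int sum(List<Integer> xs) {
        return xs.stream().reduce(0, (acc, x) -> acc + x);
    }

    // Same fold with multiplication, starting from 1.
    public static int product(List<Integer> xs) {
        return xs.stream().reduce(1, (acc, x) -> acc * x);
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(0, 1, 2, 3, 4, 5);
        System.out.println(squares(xs));  // [0, 1, 4, 9, 16, 25]
        System.out.println(sum(xs));      // 15
        System.out.println(product(xs));  // 0, because the list contains 0
    }
}
```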
  • 15. Example: Sum of Squares • Composition of – a map followed by – a reduce applied to the results of the map • Example. – Map( x^2, [1,2,3,4,5] ) = [1,4,9,16,25] – Reduce( x + y, [1,4,9,16,25] ) = ((((1 + 4) + 9) + 16) + 25) = 55 • Map is easily parallelizable – Compute x^2 for 1,2,3 on one node and for 4,5 on another • Reduce is notoriously sequential – Need all squares at one node to compute the total sum • Square pyramidal number: 1 + 4 + … + n^2 = n(n+1)(2n+1) / 6
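The split across two nodes can be simulated as follows. This is a sketch, not Hadoop code: the "nodes" are plain lists, the names are hypothetical, and the per-node partial sum is an addition that works only because + is associative. The final addition still needs every partial result in one place, which is the sequential part.

```java
import java.util.Arrays;
import java.util.List;

public class SumOfSquares {
    // Work done on one "node": square each element of its partition (map)
    // and add the squares into a local partial sum.
    public static long partialSum(List<Integer> partition) {
        long sum = 0;
        for (int x : partition) sum += (long) x * x;
        return sum;
    }

    // Final reduce: all partial sums must arrive at a single place to be added up.
    public static long sumOfSquares(List<List<Integer>> partitions) {
        long total = 0;
        for (List<Integer> p : partitions) total += partialSum(p);
        return total;
    }

    // Closed form n(n+1)(2n+1)/6, used here only as a cross-check.
    public static long pyramid(long n) {
        return n * (n + 1) * (2 * n + 1) / 6;
    }

    public static void main(String[] args) {
        List<List<Integer>> nodes = Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(4, 5));
        System.out.println(sumOfSquares(nodes));  // 14 + 41 = 55
        System.out.println(pyramid(5));           // 55
    }
}
```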
  • 16. Computational Model • MapReduce is a parallel computational model • Map-Reduce algorithm = job • Operates with key-value pairs: (k, v) – Primitive types, strings or more complex structures • Map-Reduce job input and output are lists of pairs {(k, v)} • An MR job is defined by two functions • map: (k1; v1) → {(k2; v2)} • reduce: (k2; {v2}) → {(k3; v3)}
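  • 17. Job Workflow [Figure: the input words dogs, like, cats are mapped to consonant (C) and vowel (V) counts: dogs → (C, 3), (V, 1); like → (C, 2), (V, 2); cats → (C, 3), (V, 1). The reduce step sums the counts per key: (C, 8), (V, 4).]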
  • 18. The Algorithm
      Map(null, word)
        nC = Consonants(word)
        nV = Vowels(word)
        Emit(“Consonants”, nC)
        Emit(“Vowels”, nV)
      Reduce(key, {n1, n2, …})
        nRes = n1 + n2 + …
        Emit(key, nRes)
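The pseudocode above can be run as a single-process simulation without a cluster. This is a hedged sketch: `LetterCount`, `run` and the in-memory grouping step are illustrative stand-ins for what the framework does during shuffle, and the sketch assumes that only a, e, i, o, u count as vowels.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LetterCount {
    // Map: for one word, emit ("Consonants", nC) and ("Vowels", nV).
    public static List<Map.Entry<String, Integer>> map(String word) {
        int nC = 0, nV = 0;
        for (char c : word.toLowerCase().toCharArray()) {
            if ("aeiou".indexOf(c) >= 0) nV++;
            else if (Character.isLetter(c)) nC++;
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new AbstractMap.SimpleEntry<String, Integer>("Consonants", nC));
        out.add(new AbstractMap.SimpleEntry<String, Integer>("Vowels", nV));
        return out;
    }

    // Reduce: sum all values that were emitted for one key.
    public static int reduce(String key, List<Integer> values) {
        int res = 0;
        for (int n : values) res += n;
        return res;
    }

    // Driver: map every word, group the emitted values by key (the "shuffle"),
    // then reduce each group.
    public static Map<String, Integer> run(List<String> words) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String w : words) {
            for (Map.Entry<String, Integer> kv : map(w)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("dogs", "like", "cats")));  // {Consonants=8, Vowels=4}
    }
}
```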
  • 19. Computation Framework • Two virtual clusters: HDFS and MapReduce – Physically tightly coupled. Designed to work together • Hadoop Distributed File System. View data as files and directories • MapReduce is a parallel computation framework – Job scheduling and execution framework
  • 20. HDFS Architecture Principles • The name space is a hierarchy of files and directories • Files are divided into blocks (typically 128 MB) • Namespace (metadata) is decoupled from data – Fast namespace operations, not slowed down by data streaming • Single NameNode keeps the entire name space in RAM • DataNodes store data blocks on local drives • Blocks are replicated on 3 DataNodes for redundancy and availability
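A quick arithmetic sketch shows what the block size and replication factor mean for a single file. The class and method names are hypothetical, and the sketch assumes a 128 MB block counted as 128 × 1024 × 1024 bytes and a replication factor of 3; the last block of a file is assumed to occupy only the bytes it actually holds.

```java
public class HdfsBlocks {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;  // assumed block size in bytes
    static final int REPLICATION = 3;                   // assumed replication factor

    // Number of blocks the file is divided into (ceiling division).
    public static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Number of block replicas the DataNodes store for this file.
    public static long replicaCount(long fileBytes) {
        return blockCount(fileBytes) * REPLICATION;
    }

    // Raw disk space consumed across the cluster.
    public static long rawBytes(long fileBytes) {
        return fileBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        System.out.println(blockCount(oneGiB));    // 8
        System.out.println(replicaCount(oneGiB));  // 24
    }
}
```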
  • 21. MapReduce Framework • Job Input is a file or a set of files in a distributed file system (HDFS) – Input is split into blocks of roughly the same size – Blocks are replicated to multiple nodes – Each block holds a list of key-value pairs • Map task is scheduled to one of the nodes containing the block – Map task input is node-local – Map task result is node-local • Map task results are grouped: one group per reducer; each group is sorted • Reduce task is scheduled to a node – Reduce task transfers the targeted groups from all mapper nodes – Computes and stores results in a separate HDFS file • Job Output is a set of files in HDFS, with #files = #reducers
  • 22. Map Reduce Example: Mean • Mean: µ = (1/n) ∑_{i=1}^{n} x_i • Input: large text file • Output: average length of words in the file, µ • Example: µ({dogs, like, cats}) = 4
  • 23. Mean Mapper • Map input is the set of words {w} in the partition – Key = null, Value = w • Map computes – Number of words in the partition – Total length of the words ∑length(w) • Map output – <“count”, #words> – <“length”, #totalLength>
      Map(null, w)
        Emit(“count”, 1)
        Emit(“length”, length(w))
  • 24. Single Mean Reducer • Reduce input – {<key, {value}>}, where – key = “count”, “length” – value is an integer • Reduce computes – Total number of words: N = sum of all “count” values – Total length of words: L = sum of all “length” values • Reduce Output – <“count”, N> – <“length”, L> • The result – µ = L / N
      Reduce(key, {n1, n2, …})
        nRes = n1 + n2 + …
        Emit(key, nRes)
      Analyze()
        read(“part-r-00000”)
        print(“mean = ” + L/N)
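  • 25. Mean: Mapper, Reducer
      import java.io.BufferedReader;
      import java.io.IOException;
      import java.io.InputStreamReader;
      import java.util.StringTokenizer;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.GenericOptionsParser;

      public class WordMean {
        private final static Text COUNT_KEY = new Text("count");
        private final static Text LENGTH_KEY = new Text("length");
        private final static LongWritable ONE = new LongWritable(1);

        // Emits <"length", length(word)> and <"count", 1> for every word of the input line.
        public static class WordMeanMapper extends Mapper<Object, Text, Text, LongWritable> {
          @Override
          public void map(Object key, Text value, Context context)
              throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
              String word = itr.nextToken();
              context.write(LENGTH_KEY, new LongWritable(word.length()));
              context.write(COUNT_KEY, ONE);
            }
          }
        }

        // Sums all values of a key; the same class is reused as the combiner.
        public static class WordMeanReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
          @Override
          public void reduce(Text key, Iterable<LongWritable> values, Context context)
              throws IOException, InterruptedException {
            long sum = 0;  // long rather than int, so large totals do not overflow
            for (LongWritable val : values)
              sum += val.get();
            context.write(key, new LongWritable(sum));
          }
        }
      . . . . . . . . . . . . . . . .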
  • 26. Mean: main()
      . . . . . . . . . . . . . . . .
        // waitForCompletion() can also throw InterruptedException and
        // ClassNotFoundException, so main declares the general Exception.
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
          if (otherArgs.length != 2) {
            System.err.println("Usage: wordmean <in> <out>");
            System.exit(2);
          }
          Job job = new Job(conf, "word mean");
          job.setJarByClass(WordMean.class);
          job.setMapperClass(WordMeanMapper.class);
          job.setReducerClass(WordMeanReducer.class);
          job.setCombinerClass(WordMeanReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(LongWritable.class);
          job.setNumReduceTasks(1);  // single reducer, so the output is one file
          FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
          Path outputpath = new Path(otherArgs[1]);
          FileOutputFormat.setOutputPath(job, outputpath);
          boolean result = job.waitForCompletion(true);
          analyzeResult(outputpath);
          System.exit(result ? 0 : 1);
        }
      . . . . . . . . . . . . . . . .
  • 27. Mean: analyzeResult()
      . . . . . . . . . . . . . . . .
        // Reads the single reducer output file and prints the mean L / N.
        private static void analyzeResult(Path outDir) throws IOException {
          FileSystem fs = FileSystem.get(new Configuration());
          Path reduceFile = new Path(outDir, "part-r-00000");
          if (!fs.exists(reduceFile))
            return;
          long count = 0, length = 0;
          // try-with-resources closes the reader even if parsing fails
          try (BufferedReader in =
                   new BufferedReader(new InputStreamReader(fs.open(reduceFile)))) {
            String line;
            while ((line = in.readLine()) != null) {
              StringTokenizer st = new StringTokenizer(line);
              String key = st.nextToken();
              String value = st.nextToken();
              if (key.equals("count"))
                count = Long.parseLong(value);
              else if (key.equals("length"))
                length = Long.parseLong(value);
            }
          }
          double average = (double) length / count;
          System.out.println("The mean is: " + average);
        }
      } // end WordMean
  • 28. MapReduce Implementation • Single master JobTracker shepherds the distributed herd of TaskTrackers 1. Job scheduling and resource allocation 2. Job monitoring and job lifecycle coordination 3. Cluster health and resource tracking • Job is defined by – Program: myJob.jar file – Configuration: conf.xml – Input, output paths • JobClient submits the job to the JobTracker – Calculates and creates splits based on the input – Writes myJob.jar and conf.xml to HDFS
  • 29. MapReduce Implementation • JobTracker divides the job into tasks: one map task per split. – Assigns a TaskTracker for each task, collocated with the split • TaskTrackers execute tasks and report status to the JobTracker – TaskTracker can run multiple map and reduce tasks – Map and Reduce Slots • Failed attempts are reassigned to other TaskTrackers • Job execution status and results are reported back to the client • Scheduler lets many jobs run in parallel
  • 30. Example: Standard Deviation • Standard deviation: σ = sqrt( (1/n) ∑_{i=1}^{n} (x_i − µ)^2 ) • Input: large text file • Output: standard deviation σ of word lengths • Example: σ({dogs, like, cats}) = 0 • How many jobs?
  • 32. Standard Deviation Mapper • Map input is the set of words {w} in the partition – Key = null, Value = w • Map computes – Number of words in the partition – Total length of the words ∑length(w) – The sum of lengths squared ∑length(w)^2 • Map output – <“count”, #words> – <“length”, #totalLength> – <“squared”, #sumLengthSquared>
      Map(null, w)
        Emit(“count”, 1)
        Emit(“length”, length(w))
        Emit(“squared”, length(w)^2)
  • 33. Standard Deviation Reducer • Reduce input – {<key, {value}>}, where – key = “count”, “length”, “squared” – value is an integer • Reduce computes – Total number of words: N = sum of all “count” values – Total length of words: L = sum of all “length” values – Sum of length squares: S = sum of all “squared” values • Reduce Output – <“count”, N> – <“length”, L> – <“squared”, S> • The result – µ = L / N – σ = sqrt(S / N − µ^2)
      Reduce(key, {n1, n2, …})
        nRes = n1 + n2 + …
        Emit(key, nRes)
      Analyze()
        read(“part-r-00000”)
        print(“mean = ” + L/N)
        print(“std.dev = ” + sqrt(S/N − (L*L) / (N*N)))
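The formula σ = sqrt(S / N − µ^2) is what lets a single job suffice: it needs only the three sums N, L and S, not a second pass over the data with a known µ. A short derivation, using the symbols from the slide and checking the dogs/like/cats example:

```latex
\begin{align*}
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2
          = \frac{1}{N}\sum_{i=1}^{N}x_i^2 \;-\; 2\mu\cdot\frac{1}{N}\sum_{i=1}^{N}x_i \;+\; \mu^2 \\
         &= \frac{S}{N} - 2\mu^2 + \mu^2
          = \frac{S}{N} - \mu^2,
          \qquad \mu = \frac{L}{N},\quad S = \sum_{i=1}^{N} x_i^2 . \\[4pt]
\text{Example: } & x = (4, 4, 4),\quad N = 3,\ L = 12,\ S = 48,\quad
          \mu = 4,\quad \sigma = \sqrt{48/3 - 4^2} = 0 .
\end{align*}
```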
  • 34. Combiner, Partitioner • Combiners perform local aggregation before the shuffle & sort phase – Optimization to reduce data transfers during shuffle – In the Mean example, reduces the transfer of many key-value pairs to as few as two per map task • Partitioners assign intermediate (map) key-value pairs to reducers – Responsible for dividing up the intermediate key space – Not used with a single Reducer [Figure: Input Data → Map → Combiner → Partitioner → Shuffle & sort → Reduce → Output]
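Both ideas fit in a few lines of plain Java. This is a sketch with hypothetical names, not the Hadoop `Combiner` or `Partitioner` API: `combine` aggregates one map task's output for the Mean example locally, and `partition` uses the common hash-modulo idiom, masking the sign bit so that the index is never negative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerPartitionerSketch {
    // Combiner for the Mean example: instead of shipping two pairs per word,
    // one map task ships a single <"count", n> and a single <"length", total>.
    public static Map<String, Long> combine(List<String> wordsOfOneMapTask) {
        Map<String, Long> local = new TreeMap<>();
        for (String w : wordsOfOneMapTask) {
            local.merge("count", 1L, Long::sum);
            local.merge("length", (long) w.length(), Long::sum);
        }
        return local;
    }

    // Partitioner: pick the reducer index in [0, numReducers) for an intermediate key.
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(combine(Arrays.asList("dogs", "like", "cats")));  // {count=3, length=12}
        System.out.println(partition("count", 1));                           // 0: one reducer gets everything
    }
}
```

With a single reducer every key maps to index 0, which is why the partitioner plays no role in the Mean job.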
  • 35. Distributed Sorting • Sort a dataset that cannot be entirely stored on one node. • Input: – Set of files. 100-byte records. – The first 10 bytes of each record are the key and the rest is the value. • Output: – Ordered list of files: f_1, …, f_N – Each file f_i is sorted, and – If i < j then for any keys k ∈ f_i and r ∈ f_j: k ≤ r – Concatenation of the files in the given order must form a completely sorted record set
  • 36. Naïve MapReduce Sorting • If the output could be stored on one node • The input to any Reducer is always sorted by key – Shuffle sorts Map outputs • One identity Mapper and one identity Reducer would do the trick – Identity: <k,v> → <k,v> [Figure: Input (dogs, like, cats) → Map → Shuffle → Reduce → Output (cats, dogs, like)]
  • 37. Naïve Sorting: Multiple Maps • Multiple identity Mappers and one identity Reducer – same result – Does not work for multiple Reducers [Figure: Input Data (dogs, like, cats) → three Map tasks → Shuffle → one Reduce → Output Data (cats, dogs, like)]
  • 38. Sorting: Generalization • Define a hash function, such that – h: {k} → [1,N] – Preserves the order: k ≤ s → h(k) ≤ h(s) – h(k) is computed from a fixed-size prefix of string k (first 2 bytes) • Identity Mapper • With a specialized Partitioner – Computes the hash of the key h(k) and assigns <k,v> to reducer R_h(k) • Identity Reducer – Number of reducers is N: R_1, …, R_N – Inputs for R_i are all pairs whose key has h(k) = i – R_i is an identity reducer, which writes output to HDFS file f_i – Hash function choice guarantees that keys from f_i are less than or equal to keys from f_j if i < j • The algorithm was implemented to win Gray’s Terasort Benchmark in 2008
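One possible order-preserving h(k) is sketched below. It is an illustration under stated assumptions rather than the benchmark code: the class name is hypothetical, reducer indices are 0-based instead of 1..N, keys are compared as unsigned bytes, and keys are assumed to be spread evenly over the 2-byte prefix space (skewed keys would need split points chosen from the data).

```java
public class PrefixPartitionSketch {
    // h(k): read the first two bytes of the key as an unsigned 16-bit number (0..65535)
    // and scale it into a reducer index in [0, numReducers). Missing bytes count as 0.
    public static int partition(byte[] key, int numReducers) {
        int b0 = key.length > 0 ? key[0] & 0xFF : 0;
        int b1 = key.length > 1 ? key[1] & 0xFF : 0;
        int prefix = (b0 << 8) | b1;
        // Scaling by a constant factor is monotone, so k <= s implies h(k) <= h(s).
        return (int) ((long) prefix * numReducers / 65536);
    }

    public static void main(String[] args) {
        int n = 4;
        System.out.println(partition(new byte[] {0x00, 0x00}, n));                // 0
        System.out.println(partition(new byte[] {(byte) 0x80, 0x00}, n));         // 2
        System.out.println(partition(new byte[] {(byte) 0xFF, (byte) 0xFF}, n));  // 3
    }
}
```

Because every key sent to reducer i has a prefix no larger than any key sent to reducer j > i, concatenating the sorted reducer outputs in index order yields a globally sorted result.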
  • 39. Undirected Graphs • “A Discipline of Programming” E. W. Dijkstra. Ch. 23. – Good old classics • Graph is defined by V = {v}, E = {<v,w> | v,w ∈ V} • Undirected graph. E is symmetrical, that is <v,w> ∈ E ≡ <w,v> ∈ E • Different representations of E 1. Set of pairs 2. <v, {direct neighbors}> 3. Adjacency matrix • From 1 to 2 in one MR job – Identity Mapper – Combiner = Reducer – Reducer joins values for each vertex
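The conversion from representation 1 to representation 2 can be simulated in memory. In this sketch (hypothetical names) the identity map treats each pair <v,w> as key v and value w, and the grouping map plays the role of the reducer that joins the values per vertex; the input is assumed to be symmetric, so both <v,w> and <w,v> are present.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class AdjacencySketch {
    // For every pair {v, w}: key = v, value = w (identity map).
    // All values that share a key are joined into one neighbor set (reduce).
    public static Map<Integer, Set<Integer>> toAdjacency(List<int[]> edges) {
        Map<Integer, Set<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
        }
        return adj;
    }

    public static void main(String[] args) {
        // Symmetric edge set of the path 1 - 2 - 3.
        List<int[]> edges = Arrays.asList(
            new int[] {1, 2}, new int[] {2, 1}, new int[] {2, 3}, new int[] {3, 2});
        System.out.println(toAdjacency(edges));  // {1=[2], 2=[1, 3], 3=[2]}
    }
}
```

Set union is associative and commutative, which is why the same logic can run as the combiner.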
  • 40. Connected Components • Partition the set of nodes V into disjoint subsets V_1, …, V_N – V = V_1 ∪ … ∪ V_N – No paths using E from V_i to V_j if i ≠ j – G_i = <V_i, E_i> • Representation of a connected component – key = min{V_i} – value = V_i • Chain of MR jobs • Initial data representation – E is partitioned into sets of records (blocks) – <v,w> ∈ E → <min(v,w), {v,w}> = <k, C>
  • 41. MR Connected Components • Mapper / Reducer Input – {<k, C>}, where C is a subset of V, k = min(C) • Iterate. Stop when stabilized
      Mapper:
        Map({<k, C>})
          For all <ki, Ci> and <kj, Cj>
            if Ci ∩ Cj ≠ ∅ then C = Ci ∪ Cj
            Emit(min(C), C)
      Reducer:
        Reduce(k, {C1, C2, …})
          resC = C1 ∪ C2 ∪ …
          Emit(k, resC)
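The iteration can be tried out as a single-process simulation. This is a sketch under explicit assumptions, not the author's implementation: all names are hypothetical, and because the slide does not say how records are regrouped into blocks between rounds, the sketch doubles the block size after every round so that the last rounds see all records in one block. That choice is what guarantees that the loop stops only when the components are complete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class ConnectedComponentsSketch {
    // Map step on one block: union all sets of the block that share a vertex.
    public static List<TreeSet<Integer>> mapBlock(List<TreeSet<Integer>> block) {
        List<TreeSet<Integer>> merged = new ArrayList<>();
        for (TreeSet<Integer> c : block) {
            TreeSet<Integer> acc = new TreeSet<>(c);
            for (Iterator<TreeSet<Integer>> it = merged.iterator(); it.hasNext();) {
                TreeSet<Integer> m = it.next();
                if (!Collections.disjoint(m, acc)) { acc.addAll(m); it.remove(); }
            }
            merged.add(acc);
        }
        return merged;
    }

    // One MR round: map every block, then reduce by key = min(C), taking the union per key.
    public static List<TreeSet<Integer>> round(List<TreeSet<Integer>> sets, int blockSize) {
        Map<Integer, TreeSet<Integer>> byKey = new TreeMap<>();
        for (int i = 0; i < sets.size(); i += blockSize) {
            List<TreeSet<Integer>> block = sets.subList(i, Math.min(i + blockSize, sets.size()));
            for (TreeSet<Integer> c : mapBlock(block)) {
                byKey.computeIfAbsent(c.first(), k -> new TreeSet<>()).addAll(c);
            }
        }
        return new ArrayList<>(byKey.values());
    }

    // Chain of rounds; blockSize must be at least 1 and is doubled each round (sketch assumption).
    public static List<TreeSet<Integer>> components(List<int[]> edges, int blockSize) {
        List<TreeSet<Integer>> sets = new ArrayList<>();
        for (int[] e : edges) sets.add(new TreeSet<>(Arrays.asList(e[0], e[1])));
        while (true) {
            List<TreeSet<Integer>> next = round(sets, blockSize);
            if (blockSize >= sets.size() && next.equals(sets)) return next;  // stabilized
            sets = next;
            blockSize *= 2;
        }
    }

    public static void main(String[] args) {
        List<int[]> edges = Arrays.asList(
            new int[] {1, 2}, new int[] {3, 4}, new int[] {2, 3}, new int[] {5, 6});
        System.out.println(components(edges, 2));  // [[1, 2, 3, 4], [5, 6]]
    }
}
```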