5. A task is broken up into multiple parts with a software tool, and each part is
distributed to a processor; each processor then performs its assigned part.
Finally, the parts are reassembled to deliver
the final solution or execute the task.
10. Motivation
• Motivations
● Large-scale data processing on clusters
● Massively parallel (hundreds or thousands of CPUs)
● Reliable execution with easy data access
• Functions
● Fault-tolerance
● Status and monitoring tools
● A clean abstraction for programmers
12.
Lisp map function
● Input parameters: a function and a set of values
● The function is applied to each of the values.
Lisp reduce function
● Given a binary function and a set of values,
● it combines all the values together using the
binary function.
(map 'length '(() (a) (ab) (abc)))
⇒ ((length ()) (length (a)) (length (ab)) (length (abc)))
⇒ (0 1 2 3)
Use the + (add) function to reduce the list:
(reduce #'+ '(0 1 2 3))
⇒ 6
Example
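The same two steps can be sketched in Python, where the built-in `map` and `functools.reduce` play the roles of the Lisp functions above:

```python
from functools import reduce

# map: apply len to each value, mirroring (map 'length '(() (a) (ab) (abc)))
lengths = list(map(len, [(), ("a",), ("a", "b"), ("a", "b", "c")]))  # [0, 1, 2, 3]

# reduce: combine all values with +, mirroring (reduce #'+ '(0 1 2 3))
total = reduce(lambda x, y: x + y, lengths)  # 6
```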
16.
Instead of having one person cover the whole book,
we hire several!
The first group is called mappers,
the second is called reducers.
Divide the book into several parts
and give one part to each mapper.
21.
Input/output specification of the WordCount MapReduce job
Input: a set of (key, value) pairs stored in files
● key: document ID
● value: the list of words making up the document's content
Output: a set of (key, value) pairs stored in files
● key: word ID
● value: the word's frequency across all documents
MapReduce function specification:
map(String input_key, String input_value):
reduce(String output_key, Iterator intermediate_values):
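A minimal in-memory sketch of this specification in Python (the shuffle/grouping step that a real MapReduce framework performs between the two phases is simulated here with a dict; the function names are illustrative):

```python
from collections import defaultdict

def wc_map(input_key, input_value):
    """map(document ID, document content) -> list of (word, 1) pairs."""
    return [(word, 1) for word in input_value.split()]

def wc_reduce(output_key, intermediate_values):
    """reduce(word, [1, 1, ...]) -> (word, total frequency)."""
    return (output_key, sum(intermediate_values))

def run_wordcount(documents):
    # Map phase: emit intermediate (word, 1) pairs for every document.
    intermediate = defaultdict(list)
    for doc_id, content in documents.items():
        for word, count in wc_map(doc_id, content):
            intermediate[word].append(count)  # shuffle: group values by key
    # Reduce phase: one reduce call per distinct word.
    return dict(wc_reduce(w, counts) for w, counts in intermediate.items())
```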
25.
"MapReduce has been implemented in many
programming languages and frameworks, such
as Apache Hadoop, Pig, Hive, etc."
26.
MapReduce daemons (a brief introduction for later use):
● One daemon divides the work among the mappers and reducers.
● A daemon runs on each node to execute the actual MapReduce tasks.
28.
Sum array elements using MapReduce with Java
Map: split the array of 1000 elements into 10 small
data chunks (each chunk has 100 elements).
Each chunk is processed by a separate thread
concurrently: we have 10 threads, and each thread
iterates over its 100 elements to produce their sum.
Reducer: takes the output of these 10 threads and
sums it again to produce the final result.
30.
● Split the array of 1000 elements into chunks of 100 each.
● Create a thread pool of 10 threads.
● Save the task for each chunk in a queue.
● Save the map result of each chunk into mapOutput.
31.
● Get the output of the map phase and aggregate the results.
● Iterate over each element in mapOut (the result from the previous map phase).
source code link : https://github.com/HabibaAbderrahim/thread_mapReduce
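The chunk-and-sum steps above can be sketched in Python with a thread pool (the linked repository implements the same idea in Java; the function names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk):
    """Map task: compute the partial sum of one chunk."""
    return sum(chunk)

def map_reduce_sum(array, chunk_size=100, workers=10):
    # Split the array into chunks of chunk_size elements each.
    chunks = [array[i:i + chunk_size] for i in range(0, len(array), chunk_size)]
    # Map phase: each chunk is summed by a worker thread concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        map_output = list(pool.map(map_chunk, chunks))
    # Reduce phase: aggregate the partial sums into the final result.
    return sum(map_output)
```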
33.
Environment
Pseudo-distributed environment
PS: this is a pseudo environment that simulates a fully distributed environment, since
we have only one server / one PC.
● Java should be installed.
● Create a hadoop sudo user.
● Install Hadoop following the official website.
● Check that Hadoop is installed (version: 3.2.1).
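The steps above roughly correspond to the following commands (a sketch for a Debian-style system; the user name "hadoop" is a convention, and `hadoop` is assumed to be on the PATH after installation):

```shell
# Check that Java is installed (Hadoop 3.2.x requires Java 8+)
java -version

# Create a dedicated hadoop user with sudo rights
sudo adduser hadoop
sudo usermod -aG sudo hadoop

# After installing Hadoop from the official website, verify the install
hadoop version   # should report 3.2.1
```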
34.
Environment
Pseudo-distributed environment: files configuration (version: 3.2.1)
● Set JAVA_HOME and HADOOP_HOME, and add the Java path.
● HDFS (Hadoop Distributed File System) configuration: namenode / datanode / replication.
● MapReduce configuration: MapReduce runs on YARN.
● Verify the Hadoop daemons.
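The configuration bullets above correspond to the standard pseudo-distributed properties from the official Hadoop single-node setup guide (one `<property>` per file; shown together here for brevity):

```xml
<!-- core-site.xml: point the default filesystem at the local HDFS namenode -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- hdfs-site.xml: a single node can only hold one replica of each block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- yarn-site.xml: enable the shuffle service MapReduce needs -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
```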
35. 35
We decided to work with python
just to test hadoop
streaming Features
Environnement
version : 3.2.1
version : 3.5.1
word count using mapReduce in Hadoop with python
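A Hadoop Streaming word count can be sketched as two small Python phases. In a real streaming job the mapper and reducer are separate scripts (conventionally mapper.py and reducer.py) that read `sys.stdin` and print their output lines; here they are written as functions over iterables so the logic is visible on its own:

```python
from itertools import groupby

def map_words(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_counts(lines):
    """Reducer: input arrives sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

Hadoop Streaming sorts the mapper output by key before the reducer sees it, which is why `reduce_counts` only needs to group adjacent lines.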
39. The ideas, concepts, and diagrams are taken from the following websites:
● http://www.metz.supelec.fr/metz/personnel/vialle/course/BigData-2A-CS/poly-pdf/Poly-chap6.pdf
● https://sites.cs.ucsb.edu/~tyang/class/240a17/slides/CS240TopicMapReduce.pdf
● https://fr.slideshare.net/LiliaSfaxi/bigdatachp2-hadoop-mapreduce
● https://algodaily.com/lessons/what-is-mapreduce-and-how-does-it-work
Parallel processing should not be confused with multiprocessing, where multiple processors or cores work on different tasks, rather than on parts of the same task as in parallel processing.
Before diving into the details, take a moment and ask yourself what this means:
"Functional programming meets distributed computing", a batch data processing system.
Traditional approach
In this approach we iterate over each element of the array and add it to a running total to produce the final sum.
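The traditional, single-threaded approach is a plain loop (a baseline sketch for comparison with the MapReduce version):

```python
def traditional_sum(array):
    """Iterate over every element and accumulate the final sum."""
    total = 0
    for element in array:
        total += element
    return total
```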