Embarrassingly/Delightfully Parallel Problems

Embarrassingly Parallel Problems
CS5225 Parallel and Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Dr. Srinath Perera

Embarrassingly Parallel Problems
 A.k.a. Delightfully Parallel Problems
 Can be easily parallelized
 Usually use simple communication patterns
 Tasks usually work without much communication
among each other
 Map-Reduce programming model provides a
powerful abstraction to handle embarrassingly
parallel problems
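As a rough illustration of a problem whose tasks need no communication, the sketch below tests several numbers for primality independently. The function `is_prime` and the sample numbers are made up for this example. Threads are used only to keep the sketch short; `ProcessPoolExecutor` offers the same `map` interface when real CPU parallelism is needed.

```python
from concurrent.futures import ThreadPoolExecutor

def is_prime(n):
    # Each call depends only on its own argument: no shared state, no messages
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

numbers = [2, 15, 17, 21, 97, 100]  # hypothetical input

with ThreadPoolExecutor(max_workers=4) as pool:
    flags = list(pool.map(is_prime, numbers))  # results keep the input order

primes = [n for n, flag in zip(numbers, flags) if flag]
print(primes)  # prints: [2, 17, 97]
```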

Map-Reduce
 Common pattern to solve parallel problems
 Based on 2 constructs from functional programming,
map & reduce
 Introduced by Google
 Dean & Ghemawat, “MapReduce: Simplified Data Processing
on Large Clusters,” OSDI, 2004
 Extensible for different applications
 Scales to a very large number of nodes
 Hides details like failures from users

Higher-Order Functions
 Many programming languages (e.g., Java) pass data
as parameters & results of functions
 Higher-order functions take functions, as well as data,
as parameters or return them as results
 E.g., Python, Ruby, JavaScript
 For example
def f(x):
    return x + 3

def g(function, x):
    # g is higher-order: it receives a function as a parameter
    return function(x) * function(x)

print(g(f, 7))  # prints 100

Map-Reduce
 Accepts 2 functions as inputs
1. Map function
 Y fn1(X)
 Accepts an input X & outputs a value Y
2. Reduce function
 Z fn2(List<Y>)
 Accepts a list of Y’s & returns an output Z
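The two signatures can be tried out with Python's built-in `map` and `functools.reduce`. This is only a small sketch: the names `fn1` and `fn2` are taken from the slide, while squaring and summing are arbitrary choices made for illustration.

```python
from functools import reduce

def fn1(x):
    # Map function: one input X -> one output Y
    return x * x

def fn2(ys):
    # Reduce function: a list of Y's -> a single output Z
    return reduce(lambda a, b: a + b, ys, 0)

xs = [1, 2, 3, 4]
ys = list(map(fn1, xs))  # apply fn1 to every element independently
z = fn2(ys)
print(ys, z)  # prints: [1, 4, 9, 16] 30
```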

Map-Reduce (Contd.)
 Map-reduce support is provided by a function
like the following
 Z map-reduce(mapfn, reducefn, List<X>)
 A map-reduce implementation takes a list of inputs
& does the following
 Applies the map function to each entry in the list, which
emits (key, value) pairs
 Collects the results, groups them by key, & then passes each
group to the reduce function as an array
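The steps above can be sketched as a small in-memory function. This is a minimal illustration, not how any particular framework implements it: the name `map_reduce`, the word-count functions, and the sample lines are assumptions, and the reduce function here also receives the key along with the list of values.

```python
from collections import defaultdict

def map_reduce(map_fn, reduce_fn, inputs):
    """Apply map_fn to every input, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for item in inputs:
        # map_fn yields zero or more (key, value) pairs per input item
        for key, value in map_fn(item):
            groups[key].append(value)
    # reduce_fn receives one key and the list of all values emitted for it
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def word_map(line):
    for word in line.lower().split():
        yield word, 1

def word_reduce(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]  # hypothetical input
counts = map_reduce(word_map, word_reduce, lines)
print(counts["the"], counts["fox"], counts["dog"])  # prints: 3 2 1
```

Because each call to `word_map` touches only its own line, the map calls could run on different machines; only the grouping step needs data from all of them.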

Map-Reduce (Contd.)
Source: www.datasciencecentral.com/profiles/blogs/practical-illustration-of-map-reduce-hadoop-style-on-real-data

Map-Reduce for Word Counting
Source: http://xiaochongzhang.me/blog/?p=338
How to do this for a large dataset using a distributed system?

In Class Activity
1. Card sorting
2. Card sorting with 2 rounds
3. Identify missing cards
Inspired by Marcio Silva's “The MapReduce Card Game” at
http://blog.marciosilva.com/2012/10/the-mapreduce-card-game.html

Why Map-Reduce?
 Implementing the same pattern in a distributed
system isn’t that easy
 Need to worry about communication, failures,
initialization, etc.
 MapReduce frameworks take care of all of those
 You write map & reduce functions & call the
framework
 It forces you to think in parallel at design time
 It gives you a higher level of abstraction to think in
 It’s very generic, & covers a lot of use cases
 See http://wiki.apache.org/hadoop/PoweredBy

Map-Reduce Implementations
 Can be implemented in many ways
 In-memory implementation
 Distributed implementation
 Communication by messages
 Communication by file system
 Communication by databases
 Communication Requirements
 Need broadcast & reduce operations only

Map-Reduce with Hadoop
 Apache Hadoop is an implementation of Map-Reduce
 Handles all details about distributed execution
 You just have to give Map & Reduce functions

Map-Reduce Data Model
Source: http://slides.com/bearrito/pittsburgh-nosql-_-mapreduce#/

Map-Reduce Data Model (Cont.)
 Hadoop splits input data into multiple data items at
newlines (by default) & runs the map function once for each data item
 When executed, the map function outputs (key, value) pairs
 Hadoop collects all (key, value) pairs generated by the map
function, sorts them by key, & groups values with the
same key together
 For each distinct key, Hadoop runs the reduce function once,
passing the key & the list of values for that key as input
 The reduce function outputs (key, value) pairs, & Hadoop
writes them to a file as the final result
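The data flow above can be imitated on a single machine to see what each phase produces. The sketch below does not use Hadoop's API; `run_job`, the `region,amount` record format, and the sample data are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

def run_job(text, map_fn, reduce_fn):
    # 1. Split the input into data items at newlines
    records = text.splitlines()
    # 2. Run the map function once per item, collecting (key, value) pairs
    pairs = [kv for record in records for kv in map_fn(record)]
    # 3. Sort by key so that pairs with equal keys become adjacent
    pairs.sort(key=itemgetter(0))
    # 4. Run reduce once per distinct key with the list of its values
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        output.extend(reduce_fn(key, values))
    # 5. Format the (key, value) results as the lines of an output file
    return [f"{key}\t{value}" for key, value in output]

def sales_map(record):
    region, amount = record.split(",")
    yield region.strip(), float(amount)

def sales_reduce(region, amounts):
    yield region, sum(amounts)

data = "north,10.5\nsouth,4\nnorth,2.5\neast,7\nsouth,1"  # hypothetical input
result = run_job(data, sales_map, sales_reduce)
print(result)  # prints: ['east\t7.0', 'north\t13.0', 'south\t5.0']
```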

Execution on a Cluster/Cloud
Source: www.cbsolution.net/techniques/ontarget/mapreduce_vs_data_warehouse

MapReduce Execution
Source: Dean & Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI, 2004

Designing Map-Reduce Applications
 You control task granularity by changing the number of
map & reduce tasks
 How many map tasks?
 How many reduce tasks?
 Finer grain → more parallelism → more
communication overhead, & vice versa
 Usually frameworks handle load balancing &
failures
 If there are a large number of map tasks, you need a
combine function as well, to merge map outputs locally
before they are sent to the reducers
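To see why a combine function matters, the sketch below counts how many (key, value) pairs leave the map tasks with and without local combining. It is a single-machine illustration under assumed names (`map_task`, `reduce_all`) and made-up input splits, not a framework API.

```python
from collections import Counter

def map_task(lines, use_combiner):
    """One map task: emit (word, 1) per word, optionally combined locally."""
    pairs = [(word, 1) for line in lines for word in line.split()]
    if use_combiner:
        # Combine: sum counts per word inside this task before sending anything
        local = Counter()
        for word, count in pairs:
            local[word] += count
        pairs = list(local.items())
    return pairs

def reduce_all(all_pairs):
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)

# Two hypothetical input splits, one per map task
splits = [["a b a a", "b a"], ["a a c", "c a"]]

plain = [p for split in splits for p in map_task(split, use_combiner=False)]
combined = [p for split in splits for p in map_task(split, use_combiner=True)]

print(len(plain), len(combined))  # prints: 11 4
print(reduce_all(plain) == reduce_all(combined))  # prints: True
```

The final counts are identical, but fewer pairs cross the network. This works here because addition is associative and commutative; a reduce function without those properties cannot be reused as a combiner in this way.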

Examples
 Sorting
 How to sort an array of 1 million integers using
MapReduce?
 Inverted Index
 A normal index is a mapping from documents to terms
 An inverted index is a mapping from terms to documents
 If we have a million documents, how do we build an
inverted index using MapReduce?
 Frequency Distribution of Word Occurrences
 Count number of occurrences & build a histogram
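One possible answer to the inverted index question, sketched in memory: map emits a (term, document id) pair for each distinct term in a document, and reduce turns the ids collected for a term into a posting list. The function names, the whitespace tokenization, and the three sample documents are assumptions for illustration.

```python
from collections import defaultdict

def index_map(doc_id, text):
    # Emit (term, doc_id) once per distinct term in the document
    for term in set(text.lower().split()):
        yield term, doc_id

def index_reduce(term, doc_ids):
    # Posting list: sorted, duplicate-free document ids
    return sorted(set(doc_ids))

def build_inverted_index(documents):
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for term, d in index_map(doc_id, text):
            groups[term].append(d)
    return {term: index_reduce(term, ids) for term, ids in groups.items()}

docs = {"d1": "map reduce hadoop", "d2": "map tasks", "d3": "reduce tasks hadoop"}
index = build_inverted_index(docs)
print(index["map"], index["hadoop"])  # prints: ['d1', 'd2'] ['d1', 'd3']
```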

Examples (Cont.)
 Stitch Imagery
 For Google Maps, Google needs to combine many
pieces of map data into a single data set
 Business Intelligence
 A business wants to create a graph of income
generated by each region & marketing money spent
on each region

Examples (Cont.)
 K-Means
 Assume you are given a list of coordinates of earthquakes
that happened around the world in the last 50 years.
 You are asked to use the K-Means clustering algorithm
to find 10 locations around which those earthquakes
were located.
 K-Means starts with 10 random cluster locations.
 It proceeds iteratively, & at each iteration, it assigns each
data point (earthquake) to the closest cluster location
 At the end of each iteration, it recalculates each cluster location
as the mean of all data point coordinates assigned to that
location
 It stops when cluster locations don’t change after
recalculation

K-Means Algorithm
List kmeans(datapointsList, initialClustersList) {
    oldLocations = null;
    newLocations = initialClustersList;
    while (oldLocations != newLocations) {
        oldLocations = newLocations;
        for (d in datapointsList) {
            // assign d to closest location in oldLocations
        }
        newLocations = // recalculate each location as mean of its assigned points
    }
    return newLocations;
}
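The pseudocode can be turned into a small runnable sketch in which each iteration is one map-reduce round: map assigns a point to its closest cluster, and reduce averages the points of each cluster. This version makes several assumptions: coordinates are treated as points on a flat plane (real earthquake data would need a distance measure suited to latitude and longitude), the initial locations are passed in rather than chosen at random, and `max_iters` is an added safety limit.

```python
import math
from collections import defaultdict

def closest(point, centers):
    # Index of the center nearest to the point (Euclidean distance)
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def kmeans(points, centers, max_iters=100):
    for _ in range(max_iters):
        # Map: emit (cluster index, point) for every data point
        groups = defaultdict(list)
        for p in points:
            groups[closest(p, centers)].append(p)
        # Reduce: the new location of each cluster is the mean of its points
        new_centers = []
        for i, c in enumerate(centers):
            members = groups.get(i)
            if members:
                new_centers.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(c)  # keep a center that attracted no points
        if new_centers == centers:
            break  # locations did not change: converged
        centers = new_centers
    return centers

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]  # hypothetical data
result = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(result)  # prints: [(0.0, 1.0), (10.0, 11.0)]
```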
