Relational Algebra and MapReduce
Towards High-level Programming Languages
Pietro Michiardi
Eurecom
Pietro Michiardi (Eurecom) Relational Algebra and MapReduce 1 / 23
Sources and Acks
Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with
MapReduce,” Morgan & Claypool Publishers, 2010.
http://lintool.github.io/MapReduceAlgorithms/
Tom White, “Hadoop, The Definitive Guide,” O’Reilly / Yahoo
Press, 2012
Anand Rajaraman, Jeffrey D. Ullman, Jure Leskovec, “Mining of
Massive Datasets”, Cambridge University Press, 2013
Relational Algebra
Relational Algebra and
MapReduce
Relational Algebra Introduction
Introduction
Disclaimer
This is not a full course on Relational Algebra
Nor is it a course on SQL
Introduction to Relational Algebra, RDBMS and SQL
Follow the video lectures of the Stanford class on RDBMS
https://www.coursera.org/course/db
→ Note that you have to sign up for an account
Overview of this part
Brief introduction to simplified relational algebra
Useful to understand Pig, Hive and HBase
Relational Algebra Operators
A number of operations on data fit the relational algebra
model well
In traditional RDBMS, queries involve retrieval of small amounts of
data
In this course, and in particular in this class, we should keep in
mind the particular workload underlying MapReduce
→ Full scans of large amounts of data
→ Queries are not selective¹: they process all data
A review of some terminology
A relation is a table
Attributes are the column headers of the table
The set of attributes of a relation is called a schema
Example: R(A1, A2, ..., An) indicates a relation called R whose
attributes are A1, A2, ..., An
¹ This is true in general. However, most ETL jobs involve selection and projection to
do data preparation.
Relational Algebra Operators
Operators
Let’s start with an example
Below, we have part of a relation called Links describing the
structure of the Web
There are two attributes: From and To
A row, or tuple, of the relation is a pair of URLs, indicating the
existence of a link between them
→ The number of tuples in a real dataset is on the order of billions (10^9)
From    To
url1    url2
url1    url3
url2    url3
url2    url4
...     ...
Operators
Relations (however big) can be stored in a distributed
filesystem
If they don’t fit in a single machine, they’re broken into pieces (think
HDFS)
Next, we review and describe a set of relational algebra
operators
Intuitive explanation of what they do
“Pseudo-code” of their implementation in/by MapReduce
Operators
Selection: σC(R)
Apply condition C to each tuple of relation R
Produce in output a relation containing only tuples that satisfy C
Projection: πS(R)
Given a subset S of the attributes of relation R
Produce in output a relation containing, for each tuple, only the
components for the attributes in S
Union, Intersection and Difference
Well known operators on sets
Apply to the set of tuples in two relations that have the same
schema
Variations on the theme: work on bags
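The operators above can be sketched on a small in-memory relation. This is only an illustration, assuming a relation is modelled as a Python set of tuples and attributes are addressed by position; the names `links`, `other`, `select` and `project` are hypothetical and do not come from the slides.

```python
# A relation is modelled as a set of tuples; the schema here is (From, To).
links = {("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")}

def select(relation, condition):
    """Selection: keep only the tuples that satisfy the condition."""
    return {t for t in relation if condition(t)}

def project(relation, positions):
    """Projection: keep only the components at the given positions."""
    return {tuple(t[i] for i in positions) for t in relation}

# Selection with the condition From = "url1".
from_url1 = select(links, lambda t: t[0] == "url1")
# Projection onto the From attribute.
sources = project(links, [0])

# A second relation with the same schema, for the set operators.
other = {("url2", "url4"), ("url4", "url1")}
print(sorted(from_url1))
print(sorted(sources))
print(sorted(links | other))  # union
print(sorted(links & other))  # intersection
print(sorted(links - other))  # difference
```

Because Python sets are used, duplicates produced by the projection disappear automatically (set semantics). Working on bags would require lists or counters instead.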
Operators
Natural join: R ⋈ S
Given two relations, compare each pair of tuples, one from each
relation
If the tuples agree on all the attributes common to both schemas →
produce an output tuple that has components on each attribute
Otherwise produce nothing
Join condition can be on a subset of attributes
Let’s work with an example
Recall the Links relation from previous slides
Query (or data processing job): find the paths of length
two in the Web
Join Example
Informally, to satisfy the query we must:
find the triples of URLs in the form (u, v, w) such that there is a link
from u to v and a link from v to w
Using the join operator
Imagine we have two relations (with different schemas), and let’s try
to apply the natural join operator
There are two copies of Links: L1(U1, U2) and L2(U2, U3)
Let’s compute L1 ⋈ L2
For each tuple t1 of L1 and each tuple t2 of L2, check whether their U2
components are the same
If yes, then produce a tuple in output, with the schema (U1, U2, U3)
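The procedure just described can be sketched as a nested loop over the two copies of Links. This is a minimal single-machine illustration, assuming the relation fits in memory as a list of pairs; the function name `two_hop_paths` is hypothetical.

```python
# Sample tuples of the Links relation, schema (From, To).
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]

def two_hop_paths(l1, l2):
    """Nested-loop natural join of L1(U1, U2) and L2(U2, U3) on U2."""
    result = []
    for (u1, u2) in l1:
        for (v2, u3) in l2:
            if u2 == v2:  # the two tuples agree on U2
                result.append((u1, u2, u3))
    return result

# Self-join: both inputs are the same Links relation.
paths = two_hop_paths(links, links)
print(paths)
```

Every tuple of L1 is compared with every tuple of L2, so this direct approach performs one comparison per pair of tuples.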
Join Example
What we have seen is called (to be precise) a self-join
Question: How would you implement a self join in your favorite
programming language?
Question: What is the time complexity of your algorithm?
Question: What is the space complexity of your algorithm?
To continue the example
Say you are not interested in the entire two-hop path but just the
start and end nodes
Then you do a projection and the notation would be: πU1,U3(L1 ⋈ L2)
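A possible sketch of this join-then-project query follows. It swaps the pairwise comparison for a dictionary index on U2 (a hash join), which is a choice made here for illustration and not something the slides prescribe; `start_end_pairs` and `by_source` are hypothetical names.

```python
from collections import defaultdict

# Sample tuples of the Links relation, schema (From, To).
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]

def start_end_pairs(relation):
    """Join the relation with itself on U2, then project onto (U1, U3)."""
    # Index the second copy by its U2 component so matches are found by lookup.
    by_source = defaultdict(list)
    for (u2, u3) in relation:
        by_source[u2].append(u3)
    # The set comprehension drops duplicate (U1, U3) pairs after projection.
    return {(u1, u3) for (u1, u2) in relation for u3 in by_source.get(u2, [])}

pairs = start_end_pairs(links)
print(sorted(pairs))
```

The intermediate node U2 is used only to match tuples and never appears in the output, which is exactly what the projection expresses.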
Operators
Grouping and Aggregation: γX (R)
Given a relation R, partition its tuples according to their values in
one set of attributes G
The set G is called the grouping attributes
Then, for each group, aggregate the values in certain other
attributes
Aggregation functions: SUM, COUNT, AVG, MIN, MAX, ...
In the notation, X is a list of elements that can be:
A grouping attribute
An expression θ(A), where θ is one of the (five) aggregation
functions and A is an attribute NOT among the grouping attributes
Operators
Grouping and Aggregation: γX (R)
The result of this operation is a relation with one tuple for each
group
That tuple has a component for each of the grouping attributes, with
the value common to tuples of that group
That tuple has another component for each aggregation, with the
aggregate value for that group
Let’s work with an example
Imagine that a social-networking site has a relation
Friends(User, Friend)
The tuples are pairs (a, b) such that b is a friend of a
Query: compute the number of friends each member
has
Grouping and Aggregation Example
How to satisfy the query
γUser,COUNT(Friend)(Friends)
This operation groups all the tuples by the value in their first
component
→ There is one group for each user
Then, for each group, it counts the number of friends
Some details
The COUNT operation applied to an attribute does not consider the
values of that attribute
In fact, it counts the number of tuples in the group
In SQL, there is a “count distinct” operator that counts the number
of different values
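The difference between COUNT and "count distinct" can be illustrated with a small sketch. It assumes the Friends relation is held as a Python list treated as a bag, so that one deliberately duplicated tuple makes the two counts differ; the sample users are invented for the example.

```python
from collections import Counter

# Friends(User, Friend) as a bag: the tuple ("alice", "bob") appears twice.
friends = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "alice"),
    ("alice", "bob"),
]

# COUNT(Friend): number of tuples in each group, regardless of the values.
friend_count = Counter(user for user, _ in friends)

# "Count distinct": number of different Friend values in each group.
distinct_count = {
    user: len({f for u, f in friends if u == user}) for user in friend_count
}

print(dict(friend_count))
print(distinct_count)
```

For the user `alice` the plain count is 3 while the distinct count is 2, because the repeated tuple is counted once per occurrence by COUNT.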
Relational Algebra Operators and MapReduce
Computing Selection
In practice, selections do not need a full-blown MapReduce
implementation
They can be implemented in the map phase alone
Actually, they could also be implemented in the reduce portion
A MapReduce implementation of σC(R)
Map: For each tuple t in R, check if t satisfies C
If so, emit a key/value pair (t, t)
Reduce: Identity reducer
Question: single or multiple reducers?
NOTE: the output is not exactly a relation
WHY?
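The selection job above can be simulated in a few lines. This is a sketch under stated assumptions: `run_mapreduce` is a hypothetical single-process stand-in for the framework (it groups mapper output by key and calls the reducer once per key in sorted order), and the condition C is arbitrarily chosen as From = "url2".

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for the MapReduce framework."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # reduce phase, keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]

def condition(t):
    """The selection condition C (assumed here: From equals "url2")."""
    return t[0] == "url2"

def select_map(t):
    if condition(t):
        yield (t, t)                       # emit (t, t) only if t satisfies C

def identity_reduce(key, values):
    for value in values:
        yield (key, value)

selected = run_mapreduce(links, select_map, identity_reduce)
print(selected)
```

Printing `selected` shows that each output element is a key/value pair in which the tuple appears twice, not a bare tuple.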
Computing Projections
Similar process to selection
But projection may cause the same tuple to appear several times
A MapReduce implementation of πS(R)
Map: For each tuple t in R, construct a tuple t′ by eliminating those
components whose attributes are not in S
Emit a key/value pair (t′, t′)
Reduce: For each key t′ produced by any of the Map tasks, fetch (t′, [t′, · · · , t′])
Emit a key/value pair (t′, t′)
NOTE: the reduce operation is duplicate elimination
This operation is associative and commutative, so it is possible to
optimize MapReduce by using a Combiner in each mapper
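A sketch of the projection job follows, again using a hypothetical single-process `run_mapreduce` helper in place of the real framework. The attribute subset S is assumed to be the single attribute From, addressed by position.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for the MapReduce framework."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # reduce phase, keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]
S = [0]  # positions of the attributes to keep (assumed: only From)

def project_map(t):
    t_prime = tuple(t[i] for i in S)       # drop components not in S
    yield (t_prime, t_prime)

def dedup_reduce(t_prime, values):
    # The value list holds one copy of t_prime per input tuple that
    # projected onto it; emitting a single pair eliminates the duplicates.
    yield (t_prime, t_prime)

projected = run_mapreduce(links, project_map, dedup_reduce)
print(projected)
```

Since `dedup_reduce` ignores how many copies it receives, the same function could also be applied to partial groups on the map side, which is the role a Combiner would play.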
Computing Unions
Suppose relations R and S have the same schema
Map tasks will be assigned chunks from either R or S
Mappers don’t do much: they just pass tuples through to the reducers
Reducers do duplicate elimination
A MapReduce implementation of union
Map²: For each tuple t in R or S, emit a key/value pair (t, t)
Reduce: For each key t there will be either one or two values
Emit (t, t) in either case
² Hadoop MapReduce supports reading multiple inputs.
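The union job can be sketched as follows. The helper `run_mapreduce` is a hypothetical single-process stand-in for the framework, and reading multiple inputs is approximated by concatenating the two lists; the sample tuples are invented, and each relation is assumed to contain no duplicates.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for the MapReduce framework."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # reduce phase, keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]

def union_map(t):
    yield (t, t)                           # identity mapper

def union_reduce(t, values):
    # One value if t is in only one relation, two if it is in both:
    # emit (t, t) once in either case.
    yield (t, t)

result = run_mapreduce(R + S, union_map, union_reduce)
union = [t for t, _ in result]
print(union)
```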
Computing Intersections
Very similar to computing unions
Suppose relations R and S have the same schema
The map function is the same (an identity mapper) as for union
The reduce function must produce a tuple only if both relations
have that tuple
A MapReduce implementation of intersection
Map: For each tuple t in R or S, emit a key/value pair (t, t)
Reduce: If key t has value list [t, t] then emit the key/value pair (t, t)
Otherwise, emit the key/value pair (t, NULL)
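A sketch of the intersection job, under the assumptions that each relation contains no duplicate tuples and that `None` plays the role of NULL. The helper `run_mapreduce` is a hypothetical single-process stand-in for the framework, and the final filtering of `None` values is an extra step added here to recover the relation.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for the MapReduce framework."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # reduce phase, keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]

def intersect_map(t):
    yield (t, t)                           # same identity mapper as for union

def intersect_reduce(t, values):
    if len(values) == 2:                   # value list [t, t]: t is in R and S
        yield (t, t)
    else:
        yield (t, None)                    # t is in only one relation

result = run_mapreduce(R + S, intersect_map, intersect_reduce)
# Keep only the keys whose value is not NULL.
intersection = [t for t, value in result if value is not None]
print(intersection)
```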
Computing difference
Assume we have two relations R and S with the same
schema
The only way a tuple t can appear in the output is if it is in R but not
in S
The map function passes tuples from R and S to the reducer
NOTE: it must inform the reducer whether the tuple came from R or
S
A MapReduce implementation of difference
Map: For a tuple t in R, emit a key/value pair (t, "R"), and for a tuple t in S,
emit a key/value pair (t, "S")
Reduce: For each key t, do the following:
If it is associated with ["R"], then emit (t, t)
If it is associated with ["R", "S"], ["S", "R"], or ["S"], emit the key/value
pair (t, NULL)
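The difference job can be sketched as below. As an assumption of this sketch, each input record carries the name of its relation so that the mapper can tag it; `run_mapreduce` is a hypothetical single-process stand-in for the framework, `None` stands for NULL, and each relation is assumed to contain no duplicates.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for the MapReduce framework."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # reduce phase, keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

R = [(1, 2), (3, 4), (5, 6)]
S = [(3, 4), (7, 8)]
# Each record is (relation name, tuple), so the origin is known to the mapper.
records = [("R", t) for t in R] + [("S", t) for t in S]

def difference_map(record):
    name, t = record
    yield (t, name)                        # tell the reducer where t came from

def difference_reduce(t, names):
    if names == ["R"]:                     # t is in R and not in S
        yield (t, t)
    else:                                  # ["R", "S"], ["S", "R"] or ["S"]
        yield (t, None)

result = run_mapreduce(records, difference_map, difference_reduce)
difference = [t for t, value in result if value is not None]
print(difference)
```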
Computing the natural Join
This topic is subject to continuous refinements
There are many JOIN operators and many different
implementations
We’ve seen some of them in the laboratory sessions
Let’s look at two relations R(A, B) and S(B, C)
We must find tuples that agree on their B components
We shall use the B-value of tuples from either relation as the key
The value will be the other component and the name of the relation
That way the reducer knows which relation each tuple comes from
Computing the natural Join
A MapReduce implementation of Natural Join
Map: For each tuple (a, b) of R emit the key/value pair (b, ("R", a))
For each tuple (b, c) of S emit the key/value pair (b, ("S", c))
Reduce: Each key b will be associated with a list of pairs that are either ("R", a)
or ("S", c)
Emit key/value pairs of the form
(b, [(a1, b, c1), (a2, b, c2), · · · , (an, b, cn)])
NOTES
Question: what if the MapReduce framework did not implement
the distributed (and sorted) group by?
In general, for n tuples in relation R and m tuples in relation S all
with a common B-value, we end up with nm tuples in the result
If all tuples of both relations have the same B-value, then we’re
computing the Cartesian product
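The reduce-side join above can be sketched as follows. The helper `run_mapreduce` is a hypothetical single-process stand-in for the framework, input records are assumed to carry the name of their relation, and the sample tuples are invented. Skipping keys that have tuples from only one relation is a choice made in this sketch.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for the MapReduce framework."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # reduce phase, keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]   # R(A, B)
S = [("b1", "c1"), ("b1", "c2"), ("b3", "c3")]   # S(B, C)
records = [("R", t) for t in R] + [("S", t) for t in S]

def join_map(record):
    name, t = record
    if name == "R":
        a, b = t
        yield (b, ("R", a))                # key on B, tag the value with "R"
    else:
        b, c = t
        yield (b, ("S", c))                # key on B, tag the value with "S"

def join_reduce(b, tagged):
    a_values = [v for name, v in tagged if name == "R"]
    c_values = [v for name, v in tagged if name == "S"]
    joined = [(a, b, c) for a in a_values for c in c_values]
    if joined:                             # no output if one side is missing
        yield (b, joined)

result = run_mapreduce(records, join_map, join_reduce)
print(result)
```

For the key `b1` there are n = 2 tuples from R and m = 2 from S, so the reducer emits nm = 4 joined tuples, while `b2` and `b3` produce nothing.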
[Figure: Overview of SQL Joins]
Grouping and Aggregation in MapReduce
Let R(A, B, C) be a relation to which we apply γA,θ(B)(R)
The map operation prepares the grouping
The grouping is done by the framework
The reducer computes the aggregation
Simplifying assumptions: one grouping attribute and one
aggregation function
MapReduce implementation of γA,θ(B)(R)³
Map: For each tuple (a, b, c) emit the key/value pair (a, b)
Reduce: Each key a represents a group
Apply θ to the list [b1, b2, · · · , bn]
Emit the key/value pair (a, x) where x = θ([b1, b2, · · · , bn])
³ Note here that we are also projecting.
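A sketch of the grouping and aggregation job, with θ passed in as an ordinary Python function. The helper `run_mapreduce` is a hypothetical single-process stand-in for the framework, `make_reduce` is an illustrative name, and the sample relation is invented.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for the MapReduce framework."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # reduce phase, keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

# R(A, B, C): A is the grouping attribute, B is aggregated, C is dropped.
R = [("x", 1, "p"), ("y", 5, "q"), ("x", 2, "r")]

def group_map(t):
    a, b, c = t
    yield (a, b)                           # projecting away C at the same time

def make_reduce(theta):
    """Build a reducer that applies the aggregation function theta."""
    def aggregate_reduce(a, b_values):
        yield (a, theta(b_values))         # x = theta([b1, b2, ..., bn])
    return aggregate_reduce

sums = run_mapreduce(R, group_map, make_reduce(sum))
maxima = run_mapreduce(R, group_map, make_reduce(max))
print(sums)
print(maxima)
```

The grouping itself is done by the helper's shuffle step, so the mapper and reducer only prepare the key and apply θ.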
Pietro Michiardi (Eurecom) Relational Algebra and MapReduce 23 / 23

More Related Content

What's hot

Single Layer Rosenblatt Perceptron
Single Layer Rosenblatt PerceptronSingle Layer Rosenblatt Perceptron
Single Layer Rosenblatt PerceptronAndriyOleksiuk
 
Scheduling in Cloud Computing
Scheduling in Cloud ComputingScheduling in Cloud Computing
Scheduling in Cloud ComputingHitesh Mohapatra
 
Neural Networks: Multilayer Perceptron
Neural Networks: Multilayer PerceptronNeural Networks: Multilayer Perceptron
Neural Networks: Multilayer PerceptronMostafa G. M. Mostafa
 
Dichotomy of parallel computing platforms
Dichotomy of parallel computing platformsDichotomy of parallel computing platforms
Dichotomy of parallel computing platformsSyed Zaid Irshad
 
Introduction to prolog
Introduction to prologIntroduction to prolog
Introduction to prologHarry Potter
 
Back propagation
Back propagationBack propagation
Back propagationNagarajan
 
Multi-agent systems
Multi-agent systemsMulti-agent systems
Multi-agent systemsR A Akerkar
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & UnderfittingSOUMIT KAR
 
Design cycles of pattern recognition
Design cycles of pattern recognitionDesign cycles of pattern recognition
Design cycles of pattern recognitionAl Mamun
 
Machine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative ModelsMachine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative Modelsbutest
 
Introdution and designing a learning system
Introdution and designing a learning systemIntrodution and designing a learning system
Introdution and designing a learning systemswapnac12
 
Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Sri Prasanna
 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityRushali Deshmukh
 
Adversarial search
Adversarial searchAdversarial search
Adversarial searchNilu Desai
 

What's hot (20)

Single Layer Rosenblatt Perceptron
Single Layer Rosenblatt PerceptronSingle Layer Rosenblatt Perceptron
Single Layer Rosenblatt Perceptron
 
Scheduling in Cloud Computing
Scheduling in Cloud ComputingScheduling in Cloud Computing
Scheduling in Cloud Computing
 
Neural Networks: Multilayer Perceptron
Neural Networks: Multilayer PerceptronNeural Networks: Multilayer Perceptron
Neural Networks: Multilayer Perceptron
 
Dichotomy of parallel computing platforms
Dichotomy of parallel computing platformsDichotomy of parallel computing platforms
Dichotomy of parallel computing platforms
 
Voting protocol
Voting protocolVoting protocol
Voting protocol
 
Introduction to prolog
Introduction to prologIntroduction to prolog
Introduction to prolog
 
Greedy method by Dr. B. J. Mohite
Greedy method by Dr. B. J. MohiteGreedy method by Dr. B. J. Mohite
Greedy method by Dr. B. J. Mohite
 
Link state routing protocol
Link state routing protocolLink state routing protocol
Link state routing protocol
 
Back propagation
Back propagationBack propagation
Back propagation
 
Multi-agent systems
Multi-agent systemsMulti-agent systems
Multi-agent systems
 
Task programming
Task programmingTask programming
Task programming
 
Overfitting & Underfitting
Overfitting & UnderfittingOverfitting & Underfitting
Overfitting & Underfitting
 
Design cycles of pattern recognition
Design cycles of pattern recognitionDesign cycles of pattern recognition
Design cycles of pattern recognition
 
Language models
Language modelsLanguage models
Language models
 
Machine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative ModelsMachine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative Models
 
Introdution and designing a learning system
Introdution and designing a learning systemIntrodution and designing a learning system
Introdution and designing a learning system
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)Logical Clocks (Distributed computing)
Logical Clocks (Distributed computing)
 
Data mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarityData mining Measuring similarity and desimilarity
Data mining Measuring similarity and desimilarity
 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
 

Similar to Relational Algebra and MapReduce

Unit-II DBMS presentation for students.pdf
Unit-II DBMS presentation for students.pdfUnit-II DBMS presentation for students.pdf
Unit-II DBMS presentation for students.pdfajajkhan16
 
Lecture 07 leonidas guibas - networks of shapes and images
Lecture 07   leonidas guibas - networks of shapes and imagesLecture 07   leonidas guibas - networks of shapes and images
Lecture 07 leonidas guibas - networks of shapes and imagesmustafa sarac
 
RelationalAlgebra-RelationalCalculus-SQL.pdf
RelationalAlgebra-RelationalCalculus-SQL.pdfRelationalAlgebra-RelationalCalculus-SQL.pdf
RelationalAlgebra-RelationalCalculus-SQL.pdf10GUPTASOUMYARAMPRAK
 
Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Yueshen Xu
 
Integrative Parallel Programming in HPC
Integrative Parallel Programming in HPCIntegrative Parallel Programming in HPC
Integrative Parallel Programming in HPCVictor Eijkhout
 
Ch4.mapreduce algorithm design
Ch4.mapreduce algorithm designCh4.mapreduce algorithm design
Ch4.mapreduce algorithm designAllenWu
 
Map reduce
Map reduceMap reduce
Map reducexydii
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingcoolmirza143
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)anh tuan
 
Automatic Synthesis of Combiners in the MapReduce Framework
Automatic Synthesis of Combiners in the MapReduce FrameworkAutomatic Synthesis of Combiners in the MapReduce Framework
Automatic Synthesis of Combiners in the MapReduce FrameworkKinoshita Minoru
 
Graph Matching
Graph MatchingGraph Matching
Graph Matchinggraphitech
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsDilum Bandara
 
Introduction to Relational Database Management Systems
Introduction to Relational Database Management SystemsIntroduction to Relational Database Management Systems
Introduction to Relational Database Management SystemsAdri Jovin
 

Similar to Relational Algebra and MapReduce (20)

Unit-II DBMS presentation for students.pdf
Unit-II DBMS presentation for students.pdfUnit-II DBMS presentation for students.pdf
Unit-II DBMS presentation for students.pdf
 
Lecture 07 leonidas guibas - networks of shapes and images
Lecture 07   leonidas guibas - networks of shapes and imagesLecture 07   leonidas guibas - networks of shapes and images
Lecture 07 leonidas guibas - networks of shapes and images
 
RelationalAlgebra-RelationalCalculus-SQL.pdf
RelationalAlgebra-RelationalCalculus-SQL.pdfRelationalAlgebra-RelationalCalculus-SQL.pdf
RelationalAlgebra-RelationalCalculus-SQL.pdf
 
Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)
 
Integrative Parallel Programming in HPC
Integrative Parallel Programming in HPCIntegrative Parallel Programming in HPC
Integrative Parallel Programming in HPC
 
Ch4.mapreduce algorithm design
Ch4.mapreduce algorithm designCh4.mapreduce algorithm design
Ch4.mapreduce algorithm design
 
Main map reduce
Main map reduceMain map reduce
Main map reduce
 
Map reduce
Map reduceMap reduce
Map reduce
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreading
 
Map reduce
Map reduceMap reduce
Map reduce
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
 
Relational Algebra-23-04-2023.pdf
Relational Algebra-23-04-2023.pdfRelational Algebra-23-04-2023.pdf
Relational Algebra-23-04-2023.pdf
 
Automatic Synthesis of Combiners in the MapReduce Framework
Automatic Synthesis of Combiners in the MapReduce FrameworkAutomatic Synthesis of Combiners in the MapReduce Framework
Automatic Synthesis of Combiners in the MapReduce Framework
 
Graph Matching
Graph MatchingGraph Matching
Graph Matching
 
MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...
MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...
MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...
 
Lecture 1 mapreduce
Lecture 1  mapreduceLecture 1  mapreduce
Lecture 1 mapreduce
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel Problems
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
An Introduction to MATLAB with Worked Examples
An Introduction to MATLAB with Worked ExamplesAn Introduction to MATLAB with Worked Examples
An Introduction to MATLAB with Worked Examples
 
Introduction to Relational Database Management Systems
Introduction to Relational Database Management SystemsIntroduction to Relational Database Management Systems
Introduction to Relational Database Management Systems
 

Recently uploaded

ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...amitlee9823
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 

Recently uploaded (20)

ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Relational Algebra and MapReduce

  • 1. Relational Algebra and MapReduce: Towards High-level Programming Languages. Pietro Michiardi (Eurecom)
  • 2. Sources and Acks
      • Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with MapReduce,” Morgan & Claypool Publishers, 2010. http://lintool.github.io/MapReduceAlgorithms/
      • Tom White, “Hadoop: The Definitive Guide,” O’Reilly / Yahoo Press, 2012.
      • Anand Rajaraman, Jeffrey D. Ullman, Jure Leskovec, “Mining of Massive Datasets,” Cambridge University Press, 2013.
  • 3. Relational Algebra: Relational Algebra and MapReduce
  • 4. Relational Algebra: Introduction
      • Disclaimer
          • This is not a full course on Relational Algebra
          • Nor is it a course on SQL
      • Introduction to Relational Algebra, RDBMS and SQL
          • Follow the video lectures of the Stanford class on RDBMS: https://www.coursera.org/course/db
          → Note that you have to sign up for an account
      • Overview of this part
          • Brief introduction to simplified relational algebra
          • Useful to understand Pig, Hive and HBase
  • 5. Relational Algebra Operators
      • A number of operations on data fit the relational algebra model well
          • In a traditional RDBMS, queries involve the retrieval of small amounts of data
          • In this course, and in this class in particular, we should keep in mind the workload underlying MapReduce
          → Full scans of large amounts of data
          → Queries are not selective [1]; they process all data
      • A review of some terminology
          • A relation is a table
          • Attributes are the column headers of the table
          • The set of attributes of a relation is called a schema
          • Example: R(A1, A2, ..., An) indicates a relation called R whose attributes are A1, A2, ..., An
      [1] This is true in general. However, most ETL jobs involve selection and projection to do data preparation.
  • 6. Relational Algebra: Operators
      • Let’s start with an example
          • Below, we have part of a relation called Links describing the structure of the Web
          • There are two attributes: From and To
          • A row, or tuple, of the relation is a pair of URLs, indicating the existence of a link between them
          → The number of tuples in a real dataset is on the order of billions (10^9)
      From   To
      url1   url2
      url1   url3
      url2   url3
      url2   url4
      ···    ···
  • 7. Operators
      • Relations (however big) can be stored in a distributed filesystem
          • If they don’t fit on a single machine, they’re broken into pieces (think HDFS)
      • Next, we review and describe a set of relational algebra operators
          • Intuitive explanation of what they do
          • “Pseudo-code” of their implementation in/by MapReduce
  • 8. Operators
      • Selection: σ_C(R)
          • Apply condition C to each tuple of relation R
          • Output a relation containing only the tuples that satisfy C
      • Projection: π_S(R)
          • Given a subset S of the attributes of relation R
          • Output a relation whose tuples contain only the components for the attributes in S
      • Union, Intersection and Difference
          • Well-known operators on sets
          • Apply to the sets of tuples of two relations that have the same schema
          • Variations on the theme: work on bags
  • 9. Operators
      • Natural join: R ⋈ S
          • Given two relations, compare each pair of tuples, one from each relation
          • If the tuples agree on all the attributes common to both schemas → produce an output tuple that has a component for each attribute
          • Otherwise produce nothing
          • The join condition can be on a subset of attributes
      • Let’s work with an example
          • Recall the Links relation from the previous slides
          • Query (or data processing job): find the paths of length two in the Web
  • 10. Join Example
      • Informally, to satisfy the query we must:
          • find the triples of URLs of the form (u, v, w) such that there is a link from u to v and a link from v to w
      • Using the join operator
          • Imagine we have two relations (with different schemas), and let’s try to apply the natural join operator
          • There are two copies of Links: L1(U1, U2) and L2(U2, U3)
          • Let’s compute L1 ⋈ L2
              • For each tuple t1 of L1 and each tuple t2 of L2, check whether their U2 components are the same
              • If yes, then output a tuple with the schema (U1, U2, U3)
  • 11. Join Example
      • What we have seen is called (to be precise) a self-join
          • Question: How would you implement a self-join in your favorite programming language?
          • Question: What is the time complexity of your algorithm?
          • Question: What is the space complexity of your algorithm?
      • To continue the example
          • Say you are not interested in the entire two-hop path but just in the start and end nodes
          • Then you do a projection, and the notation would be: π_{U1,U3}(L1 ⋈ L2)
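As one possible answer to the questions on the self-join slide, here is a minimal single-machine sketch in Python. The sample `links` data and the function names are hypothetical, chosen only for illustration. It contrasts a nested-loop join, which compares every pair of tuples, with a version that first indexes the second copy of Links by its U2 component, and then applies the projection onto (U1, U3).

```python
from collections import defaultdict

# Hypothetical sample of the Links(From, To) relation.
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]

def self_join_nested(links):
    """Compare every pair of tuples: quadratic time in the number of tuples."""
    return [(u, v, w) for (u, v) in links for (v2, w) in links if v == v2]

def self_join_hashed(links):
    """Index the second copy by its U2 component first.

    Time is linear in input plus output size, at the cost of an
    in-memory index that is linear in the input size.
    """
    by_source = defaultdict(list)
    for v, w in links:
        by_source[v].append(w)
    return [(u, v, w) for (u, v) in links for w in by_source.get(v, [])]

paths = self_join_nested(links)
# Projection onto (U1, U3); the set removes duplicate endpoint pairs.
endpoints = sorted({(u, w) for (u, _, w) in paths})
print(paths)
print(endpoints)
```

Neither version fits the workload described earlier once Links has billions of tuples, which is the motivation for expressing the join in MapReduce.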
  • 12. Operators
      • Grouping and Aggregation: γ_X(R)
          • Given a relation R, partition its tuples according to their values in one set of attributes G
              • The set G is called the grouping attributes
          • Then, for each group, aggregate the values in certain other attributes
              • Aggregation functions: SUM, COUNT, AVG, MIN, MAX, ...
      • In the notation, X is a list of elements that can be:
          • A grouping attribute
          • An expression θ(A), where θ is one of the (five) aggregation functions and A is an attribute NOT among the grouping attributes
  • 13. Operators
      • Grouping and Aggregation: γ_X(R)
          • The result of this operation is a relation with one tuple for each group
          • That tuple has a component for each of the grouping attributes, with the value common to the tuples of that group
          • That tuple has another component for each aggregation, with the aggregate value for that group
      • Let’s work with an example
          • Imagine that a social-networking site has a relation Friends(User, Friend)
          • The tuples are pairs (a, b) such that b is a friend of a
          • Query: compute the number of friends each member has
  • 14. Grouping and Aggregation Example
      • How to satisfy the query: γ_{User, COUNT(Friend)}(Friends)
          • This operation groups all the tuples by the value in their first component
          → There is one group for each user
          • Then, for each group, it counts the number of friends
      • Some details
          • The COUNT operation applied to an attribute does not consider the values of that attribute
          • In fact, it counts the number of tuples in the group
          • In SQL, there is a “count distinct” operator that counts the number of different values
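The difference between COUNT and “count distinct” can be illustrated with a small Python sketch. The `friends` data below is hypothetical and deliberately contains a repeated tuple (that is, it is treated as a bag) so that the two counts differ.

```python
from collections import Counter, defaultdict

# Hypothetical Friends(User, Friend) bag; ("ann", "bob") appears twice.
friends = [("ann", "bob"), ("ann", "cat"), ("ann", "bob"), ("bob", "ann")]

# COUNT(Friend): counts the tuples in each group, ignoring the attribute values.
friend_count = Counter(user for user, _ in friends)

# "count distinct": counts the different Friend values in each group.
distinct_friends = defaultdict(set)
for user, friend in friends:
    distinct_friends[user].add(friend)
distinct_count = {user: len(fs) for user, fs in distinct_friends.items()}

print(dict(friend_count))
print(distinct_count)
```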
  • 15. Operators and MapReduce: Computing Selection
      • In practice, selections do not need a full-blown MapReduce implementation
          • They can be implemented in the map phase alone
          • Actually, they could also be implemented in the reduce portion
      • A MapReduce implementation of σ_C(R)
          • Map: for each tuple t in R, check whether t satisfies C; if so, emit a key/value pair (t, t)
          • Reduce: identity reducer
              • Question: single or multiple reducers?
      • NOTE: the output is not exactly a relation. WHY?
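The selection job can be sketched in Python with a toy, single-process stand-in for the framework. The helper `run_mapreduce`, the sample data and the condition are all assumptions made for illustration; this is not Hadoop code. The output also suggests one reading of the slide's closing question: the job emits key/value pairs (t, t) rather than bare tuples, so only the keys (or only the values) need to be kept to recover a relation.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group values by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key in sorted(groups) for pair in reducer(key, groups[key])]

# Hypothetical sample of the Links(From, To) relation.
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]

def condition(t):
    """Hypothetical condition C: From = url2."""
    return t[0] == "url2"

def select_map(t):
    if condition(t):
        yield (t, t)

def identity_reduce(key, values):
    for value in values:
        yield (key, value)

pairs = run_mapreduce(links, select_map, identity_reduce)
selected = [key for key, _ in pairs]  # keep only the keys to get a relation
print(selected)
```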
  • 16. Computing Projections
      • Similar process to selection
          • But projection may cause the same tuple to appear several times
      • A MapReduce implementation of π_S(R)
          • Map: for each tuple t in R, construct a tuple t′ by eliminating those components whose attributes are not in S; emit a key/value pair (t′, t′)
          • Reduce: for each key t′ produced by any of the Map tasks, fetch (t′, [t′, ···, t′]); emit a key/value pair (t′, t′)
      • NOTE: the reduce operation is duplicate elimination
          • This operation is associative and commutative, so it is possible to optimize MapReduce by using a Combiner in each mapper
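A minimal Python sketch of the projection job, including the combiner optimization, might look as follows. The sample data, the choice of S (the first attribute only), and the split into two chunks are assumptions for illustration; the shuffle is simulated with a dictionary.

```python
from collections import defaultdict

# Hypothetical sample of the Links(From, To) relation.
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]
S = (0,)  # hypothetical: keep only the attribute at index 0 (From)

def project_map(t):
    t_prime = tuple(t[i] for i in S)
    return (t_prime, t_prime)

def combine(pairs):
    """Combiner: drop duplicates locally, before the shuffle."""
    return sorted(set(pairs))

def dedup_reduce(key, values):
    """Emit one pair per key, however long the value list is."""
    return (key, key)

chunks = [links[:2], links[2:]]  # pretend each chunk goes to one mapper
shuffled = defaultdict(list)
for chunk in chunks:
    for key, value in combine([project_map(t) for t in chunk]):
        shuffled[key].append(value)

result = [dedup_reduce(key, values) for key, values in sorted(shuffled.items())]
print(result)
```

With the combiner, each mapper in this example sends one pair per distinct projected tuple instead of one pair per input tuple.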
  • 17. Computing Unions
      • Suppose relations R and S have the same schema
          • Map tasks will be assigned chunks from either R or S
          • Mappers don’t do much; they just pass tuples through to the reducers
          • Reducers do duplicate elimination
      • A MapReduce implementation of union
          • Map [2]: for each tuple t in R or S, emit a key/value pair (t, t)
          • Reduce: for each key t there will be either one or two values; emit (t, t) in either case
      [2] Hadoop MapReduce supports reading multiple inputs.
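The union job can be sketched in a few lines of Python. The relations `R` and `S` and the `shuffle` helper are hypothetical; the helper only imitates the framework's group-by-key step on one machine.

```python
from collections import defaultdict

# Hypothetical relations with the same schema.
R = [("a", 1), ("b", 2)]
S = [("b", 2), ("c", 3)]

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

mapped = [(t, t) for t in R + S]   # identity mapper over both inputs
groups = shuffle(mapped)
# Reducer: one or two values per key; emit (t, t) once in either case.
union = [(t, t) for t in sorted(groups)]
print([t for t, _ in union])
```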
  • 18. Computing Intersections
      • Very similar to computing unions
          • Suppose relations R and S have the same schema
          • The map function is the same (an identity mapper) as for union
          • The reduce function must produce a tuple only if both relations have that tuple
      • A MapReduce implementation of intersection
          • Map: for each tuple t in R or S, emit a key/value pair (t, t)
          • Reduce: if key t has value list [t, t], then emit the key/value pair (t, t); otherwise, emit the key/value pair (t, NULL)
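A small Python sketch of the intersection job follows. It assumes, as the reduce rule implicitly does, that R and S are sets, so a value list [t, t] can only arise when t is in both relations. The data and the `shuffle` helper are hypothetical, and Python's `None` stands in for NULL.

```python
from collections import defaultdict

# Hypothetical relations with the same schema, each without duplicates.
R = [("a", 1), ("b", 2)]
S = [("b", 2), ("c", 3)]

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def intersect_reduce(t, values):
    # Two copies of t mean that both R and S contain it.
    return (t, t) if values == [t, t] else (t, None)

groups = shuffle((t, t) for t in R + S)
output = [intersect_reduce(t, groups[t]) for t in sorted(groups)]
intersection = [t for t, value in output if value is not None]
print(output)
print(intersection)
```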
  • 19. Computing Difference
      • Assume we have two relations R and S with the same schema
          • The only way a tuple t can appear in the output is if it is in R but not in S
          • The map function passes tuples from R and S to the reducer
          • NOTE: it must inform the reducer whether the tuple came from R or S
      • A MapReduce implementation of difference
          • Map: for a tuple t in R, emit a key/value pair (t, “R”); for a tuple t in S, emit a key/value pair (t, “S”)
          • Reduce: for each key t, do the following:
              • If it is associated with [“R”], then emit (t, t)
              • If it is associated with [“R”, “S”], [“S”, “R”], or [“S”], emit the key/value pair (t, NULL)
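The difference R − S can be sketched in Python along the same lines. The relations and the `shuffle` helper are hypothetical, the relation names are passed as the strings "R" and "S", and `None` stands in for NULL.

```python
from collections import defaultdict

# Hypothetical relations with the same schema, each without duplicates.
R = [("a", 1), ("b", 2)]
S = [("b", 2), ("c", 3)]

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Map: tag every tuple with the name of the relation it came from.
mapped = [(t, "R") for t in R] + [(t, "S") for t in S]

def difference_reduce(t, names):
    # Only a tuple seen in R and never in S survives.
    return (t, t) if names == ["R"] else (t, None)

groups = shuffle(mapped)
output = [difference_reduce(t, groups[t]) for t in sorted(groups)]
difference = [t for t, value in output if value is not None]
print(difference)
```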
  • 20. Computing the Natural Join
      • This topic is subject to continuous refinements
          • There are many JOIN operators and many different implementations
          • We’ve seen some of them in the laboratory sessions
      • Let’s look at two relations R(A, B) and S(B, C)
          • We must find tuples that agree on their B components
          • We shall use the B-value of tuples from either relation as the key
          • The value will be the other component and the name of the relation
          • That way the reducer knows which relation each tuple comes from
  • 21. Computing the Natural Join
      • A MapReduce implementation of natural join
          • Map: for each tuple (a, b) of R, emit the key/value pair (b, (“R”, a)); for each tuple (b, c) of S, emit the key/value pair (b, (“S”, c))
          • Reduce: each key b will be associated with a list of pairs that are either (“R”, a) or (“S”, c); emit key/value pairs of the form (b, [(a1, b, c1), (a2, b, c2), ···, (an, b, cn)])
      • NOTES
          • Question: what if the MapReduce framework did not implement the distributed (and sorted) group by?
          • In general, for n tuples in relation R and m tuples in relation S, all with a common B-value, we end up with nm tuples in the result
          • If all tuples of both relations have the same B-value, then we’re computing the Cartesian product
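The reduce-side join described on this slide can be sketched in Python as follows. The helper `run_mapreduce` is a toy, single-process stand-in for the framework, and the relations are hypothetical. For the key b1 there are two tuples from R and two from S, so the reducer produces 2 × 2 = 4 joined tuples, in line with the nm remark in the notes.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group values by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [pair for key in sorted(groups) for pair in reducer(key, groups[key])]

# Hypothetical relations R(A, B) and S(B, C).
R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
S = [("b1", "c1"), ("b1", "c2"), ("b3", "c3")]

def join_map(record):
    name, t = record
    if name == "R":
        a, b = t
        yield (b, ("R", a))
    else:
        b, c = t
        yield (b, ("S", c))

def join_reduce(b, values):
    a_values = [v for name, v in values if name == "R"]
    c_values = [v for name, v in values if name == "S"]
    # Pair every R-side value with every S-side value sharing this B-value.
    triples = [(a, b, c) for a in a_values for c in c_values]
    if triples:
        yield (b, triples)

tagged = [("R", t) for t in R] + [("S", t) for t in S]
joined = run_mapreduce(tagged, join_map, join_reduce)
print(joined)
```

Note that the reducer buffers all values for one key in memory, which is a simplification that can become a problem when a single B-value is very frequent.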
  • 22. Overview of SQL Joins
  • 23. Grouping and Aggregation in MapReduce
      • Let R(A, B, C) be a relation to which we apply γ_{A,θ(B)}(R)
          • The map operation prepares the grouping
          • The grouping is done by the framework
          • The reducer computes the aggregation
          • Simplifying assumptions: one grouping attribute and one aggregation function
      • MapReduce implementation of γ_{A,θ(B)}(R) [3]
          • Map: for each tuple (a, b, c), emit the key/value pair (a, b)
          • Reduce: each key a represents a group; apply θ to the list [b1, b2, ···, bn]; emit the key/value pair (a, x), where x = θ([b1, b2, ···, bn])
      [3] Note here that we are also projecting.
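A compact Python sketch of this job, under the same simplifying assumptions (one grouping attribute, one aggregation function), is given below. The relation and the function name `aggregate` are hypothetical, and the three phases run in a single process; θ is passed in as an ordinary Python function such as `sum` or `max`.

```python
from collections import defaultdict

# Hypothetical relation R(A, B, C).
R = [("x", 1, "p"), ("y", 5, "q"), ("x", 3, "r")]

def aggregate(relation, theta):
    """Compute the grouping on A with aggregation theta over B."""
    groups = defaultdict(list)
    for a, b, _c in relation:   # map: emit (a, b); attribute C is projected away
        groups[a].append(b)     # shuffle: the framework groups values by key
    # reduce: apply theta to the list of B-values of each group
    return [(a, theta(bs)) for a, bs in sorted(groups.items())]

print(aggregate(R, sum))
print(aggregate(R, max))
```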