MRShare: Sharing Across Multiple Queries in MapReduce. Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (University of Toronto, currently Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)
Data management landscape. [Diagram labels: flexibility, large scale setups, time performance, σ, π, efficiency.] MRShare – sharing framework for MR.
MRShare – a sharing framework for Map Reduce. The MRShare framework: is inspired by sharing primitives from the relational domain; introduces a cost model for Map Reduce jobs; searches for the optimal sharing strategies; does not change the Map Reduce computational model.
Outline: Introduction; Map Reduce recap; MRShare – sharing primitives in Map-Reduce; MRShare – cost based approach to sharing; MRShare evaluation; Summary.
Map Reduce recap. [Diagram labels: HDFS, input, Map, network, Reduce, output, HDFS.]
Sharing primitives – sharing scans. SQL: SELECT COUNT(*) FROM user GROUP BY hometown; SELECT AVG(age) FROM user GROUP BY hometown. [Diagram: both Map functions read the same record (id1, student, Toronto); the COUNT job emits (Toronto, 1) and the AVG job emits (Toronto, 17); the Reduce functions aggregate the values per hometown (Toronto, Montreal, Ottawa).]
MRShare – sharing scans (map). [Diagram: Input, a Meta-map containing Map 1 to Map 4, Map output.]
MRShare – sharing scans (reduce). [Diagram: a Meta-reduce containing Reduce 1 to Reduce 4.]
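The meta-map and meta-reduce idea for sharing scans can be sketched in plain Python. This is a minimal illustration, not the modified Hadoop engine: the record layout, the `JOBS` table, and the choice to tag each emitted key with a job id are assumptions made for the example, using the two SQL queries from the sharing scans slide.

```python
from collections import defaultdict

# Hypothetical input records: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "engineer", "Toronto", 19),
    ("id3", "student", "Ottawa", 23),
]

def map_count(rec):
    # SELECT COUNT(*) ... GROUP BY hometown
    yield rec[2], 1

def map_avg(rec):
    # SELECT AVG(age) ... GROUP BY hometown
    yield rec[2], rec[3]

def reduce_count(key, values):
    return sum(values)

def reduce_avg(key, values):
    return sum(values) / len(values)

# Assumed registry of the jobs merged into one meta-job.
JOBS = {"count": (map_count, reduce_count), "avg": (map_avg, reduce_avg)}

def meta_map(rec):
    # Run every job's map function on the record and tag its output
    # with the job id, so the outputs stay separable downstream.
    for job_id, (mapper, _) in JOBS.items():
        for key, value in mapper(rec):
            yield (job_id, key), value

def run_shared(records):
    groups = defaultdict(list)
    for rec in records:  # the input is scanned once for all jobs
        for tagged_key, value in meta_map(rec):
            groups[tagged_key].append(value)
    # Meta-reduce: dispatch each group to the reduce function of its job.
    return {
        (job_id, key): JOBS[job_id][1](key, values)
        for (job_id, key), values in groups.items()
    }
```

Calling `run_shared(USERS)` reads each record once and returns one result per (job, hometown) pair, which is the saving the slide attributes to sharing scans.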
Outline: Introduction; Map Reduce recap; MRShare – sharing primitives in Map-Reduce (sharing scans, sharing intermediate data); MRShare – cost based approach to sharing; MRShare evaluation; Summary.
Sharing primitives – sharing intermediate data. SQL: SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown; SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown. [Diagram: both Map functions read the same record (id1, student, Toronto), test their predicate (occupation = 'student'? age > 18?) and emit the same pair (Toronto, 1); the Reduce functions count per hometown (Toronto, Ottawa, Montreal).]
MRShare – sharing intermediate data (map). [Diagram: Input, a Meta-map containing Map 1 to Map 4, Map output.]
MRShare – sharing intermediate data (reduce). [Diagram: a Meta-reduce containing Reduce 1 to Reduce 4.]
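Sharing intermediate data can be sketched the same way. In this hedged example both queries emit the identical pair (hometown, 1), so the meta-map emits it once and attaches the set of jobs it belongs to. The tag-set encoding, the record layout and the names `PREDICATES`, `meta_map` and `meta_reduce` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical input records: (id, occupation, hometown, age).
USERS = [
    ("id1", "student", "Toronto", 17),
    ("id2", "student", "Toronto", 19),
    ("id3", "engineer", "Toronto", 30),
    ("id4", "student", "Ottawa", 23),
    ("id5", "engineer", "Ottawa", 16),
]

# The WHERE clauses of the two queries on the slide.
PREDICATES = {
    "students": lambda rec: rec[1] == "student",
    "adults": lambda rec: rec[3] > 18,
}

def meta_map(rec):
    # Emit (hometown, 1) at most once, tagged with every job that wants it.
    tags = frozenset(job for job, pred in PREDICATES.items() if pred(rec))
    if tags:
        yield rec[2], (tags, 1)

def meta_reduce(key, tagged_values):
    # Add each value to the counter of every job named in its tag set.
    counts = defaultdict(int)
    for tags, value in tagged_values:
        for job in tags:
            counts[job] += value
    return dict(counts)

def run_shared(records):
    groups = defaultdict(list)
    emitted = 0
    for rec in records:
        for key, tagged_value in meta_map(rec):
            groups[key].append(tagged_value)
            emitted += 1
    results = {key: meta_reduce(key, vals) for key, vals in groups.items()}
    return results, emitted
```

On this toy input the shared job emits 4 intermediate pairs, whereas running the two queries separately would emit 6 (3 each), so less data has to be sorted and copied.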
Outline: Introduction; Map Reduce recap; MRShare – sharing primitives in Map-Reduce; MRShare – cost based approach to sharing (cost model for finding the optimal sharing strategy; SplitJobs – cost based algorithm for sharing scans; MultiSplitJobs – an improvement of SplitJobs; γ-MultiSplitJobs – the algorithm for sharing intermediate data); MRShare evaluation; Summary.
Cost model for Map Reduce (single job). Reading input: f(input size). Sorting intermediate data: f(intermediate data size). Copying: f(intermediate data size). Writing output: f(output size).
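The four-part cost model can be written down as a small function. The slide only says that each component is some function f of a data size; the linear form and the constants `C_READ`, `C_SORT`, `C_COPY` and `C_WRITE` below are placeholder assumptions, not values from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    name: str
    input_size: float         # size of the data read by the map phase
    intermediate_size: float  # size of the map output
    output_size: float        # size of the reduce output

# Hypothetical per-unit costs; the slide does not give the actual functions.
C_READ, C_SORT, C_COPY, C_WRITE = 1.0, 2.0, 1.5, 1.0

def job_cost(job: Job) -> float:
    """Total cost as read + sort + copy + write, each assumed linear."""
    read = C_READ * job.input_size
    sort = C_SORT * job.intermediate_size
    copy = C_COPY * job.intermediate_size
    write = C_WRITE * job.output_size
    return read + sort + copy + write
```

For example, `job_cost(Job("J1", 100, 10, 1))` adds a read cost of 100, a sort cost of 20, a copy cost of 15 and a write cost of 1.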
Cost of executing a group of jobs. [Diagram: the Read, Sort, Copy and Write costs of J1, J2 and J3 executed separately, next to those of the merged job J1+J2+J3; components are marked as savings, potential savings, or potential costs.]
Finding the optimal sharing strategy. [Diagram: different ways of arranging jobs J1 to J5 into groups, including “NoShare” and “GreedyShare”.]
Sharing scans – cost based optimization. [Diagram: Read and Sort costs of J1, J2 and J3 next to those of J1+J2+J3, marked as savings and potential costs.] Savings come from the reduced number of scans. The sorting cost might change. The costs of copying and writing the output do not change.
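The trade-off between fewer scans and a possibly higher sorting cost can be made concrete with a toy calculation. Everything numeric here is an assumption for illustration: a linear read cost, a sort that needs extra merge passes once the intermediate data exceeds a hypothetical `BUFFER`, a merge `FANIN` of 10, and jobs that all read the same input file.

```python
# Hypothetical constants; none of these values come from the slides.
C_READ = 1.0    # cost per unit of input read
C_SORT = 0.5    # cost per unit of intermediate data per sort pass
BUFFER = 100.0  # intermediate data that can be sorted in a single pass
FANIN = 10      # number of sorted runs merged per extra pass

def sort_passes(intermediate_size):
    # One pass if everything fits in the buffer, plus one pass per merge level.
    passes = 1
    runs = intermediate_size / BUFFER
    while runs > 1:
        runs /= FANIN
        passes += 1
    return passes

def sort_cost(intermediate_size):
    return C_SORT * intermediate_size * sort_passes(intermediate_size)

def separate_cost(jobs):
    # jobs: list of (input_size, intermediate_size); each job scans on its own.
    return sum(C_READ * inp + sort_cost(d) for inp, d in jobs)

def shared_cost(jobs):
    # All jobs are assumed to read the same file, so it is scanned once,
    # but the merged job sorts the combined intermediate data.
    input_size = jobs[0][0]
    merged = sum(d for _, d in jobs)
    return C_READ * input_size + sort_cost(merged)

scan_heavy = [(1000, 60)] * 3  # large input, small map output
sort_heavy = [(100, 400)] * 3  # small input, large map output
```

With these numbers, sharing helps the scan-heavy jobs (3090 versus 1180) and hurts the sort-heavy jobs (1500 versus 1900), because the merged intermediate data needs an additional sort pass.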
SplitJobs – a dynamic programming (DP) solution for sharing scans. We reduce the problem of grouping to the problem of splitting a sorted list of jobs, by approximating the cost of sorting. [Diagram: SplitJobs splits the sorted list J1 to J6 into groups G1, G2 and G3.]
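Splitting a sorted list into contiguous groups is a classic dynamic program, sketched below. This is not the authors' implementation: `split_jobs` takes any gain function, and `toy_gain`, `SCAN_SAVING` and `BUFFER` are made-up stand-ins for the approximated cost model, with each job represented only by its intermediate data size.

```python
SCAN_SAVING = 100.0  # hypothetical saving per scan avoided
BUFFER = 100         # hypothetical size above which sorting gets costlier

def toy_gain(group):
    # Savings of executing `group` as one merged job instead of separately.
    k = len(group)
    if k == 1:
        return 0.0
    total = sum(group)
    penalty = 2.0 * total if total > BUFFER else 0.0
    return (k - 1) * SCAN_SAVING - penalty

def split_jobs(jobs, group_gain):
    """Split an already sorted job list into contiguous groups of maximal gain."""
    n = len(jobs)
    best = [0.0] * (n + 1)  # best[i]: best total gain for the first i jobs
    cut = [0] * (n + 1)     # cut[i]: start index of the last group in that split
    for i in range(1, n + 1):
        best[i] = float("-inf")
        for j in range(i):
            candidate = best[j] + group_gain(jobs[j:i])
            if candidate > best[i]:
                best[i], cut[i] = candidate, j
    # Walk the cut points backwards to recover the groups.
    groups = []
    i = n
    while i > 0:
        groups.append(jobs[cut[i]:i])
        i = cut[i]
    groups.reverse()
    return best[n], groups
```

On the sorted sizes `[10, 20, 30, 80, 90]` the toy gain favors merging the three small jobs and leaving the two large ones alone, so `split_jobs` returns a gain of 200.0 with groups `[[10, 20, 30], [80], [90]]`. The table has n + 1 entries and each entry tries at most n cut points, so the search evaluates O(n²) candidate groups.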
MultiSplitJobs – an improvement of SplitJobs. [Diagram: jobs J1 to J8; successive applications of SplitJobs produce groups G1 and G2, then G3, then G4.]
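One reading of the diagram is that SplitJobs is applied repeatedly: groups with more than one job are kept, and the jobs left on their own are fed back into SplitJobs until nothing new can be merged. The sketch below follows that reading, which is an assumption, and reuses a hypothetical gain function (`toy_gain`, `SCAN_SAVING`, `BUFFER`) in which each job is just its intermediate data size.

```python
SCAN_SAVING = 100.0  # hypothetical saving per scan avoided
BUFFER = 100         # hypothetical size above which sorting gets costlier

def toy_gain(group):
    k = len(group)
    if k == 1:
        return 0.0
    total = sum(group)
    penalty = 2.0 * total if total > BUFFER else 0.0
    return (k - 1) * SCAN_SAVING - penalty

def split_jobs(jobs, group_gain):
    # Dynamic program over contiguous splits of a sorted job list.
    n = len(jobs)
    best = [0.0] * (n + 1)
    cut = [0] * (n + 1)
    for i in range(1, n + 1):
        best[i] = float("-inf")
        for j in range(i):
            candidate = best[j] + group_gain(jobs[j:i])
            if candidate > best[i]:
                best[i], cut[i] = candidate, j
    groups, i = [], n
    while i > 0:
        groups.append(jobs[cut[i]:i])
        i = cut[i]
    groups.reverse()
    return best[n], groups

def multi_split_jobs(jobs, group_gain):
    remaining = sorted(jobs)
    final = []
    while remaining:
        _, groups = split_jobs(remaining, group_gain)
        shared = [g for g in groups if len(g) > 1]
        singles = [g[0] for g in groups if len(g) == 1]
        if not shared:
            # No further merging is possible: keep the rest as single jobs.
            final.extend([s] for s in singles)
            break
        final.extend(shared)
        remaining = singles  # retry with the jobs that stayed alone
    return final
```

For the sizes `[10, 45, 50, 55]`, a single SplitJobs pass returns `[[10], [45, 50], [55]]` with a gain of 100.0. The second pass then merges the two leftover jobs, which were not adjacent in the sorted list, giving `[[45, 50], [10, 55]]` with a total gain of 200.0.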
Sharing intermediate data – cost based optimization. [Diagram: Read, Sort and Copy costs of J1, J2 and J3 next to those of J1+J2+J3, marked as savings, potential savings, and potential costs or savings; overlapping intermediate data of J1, J2 and J3.] The sorting and copying costs change, depending on the size of the intermediate data. We would need to estimate the size of the intermediate data for all combinations of jobs, and the cost of maintaining such statistics is prohibitive.
γ-MultiSplitJobs – the solution for sharing intermediate data. Approximate the size of the intermediate data. [Diagram: the intermediate data of the merged job J1+J2+J3 written as an equation of the form "... = ... + γ * ..." over the data of J1, J2 and J3.] γ is set heuristically.
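The equation on the slide did not survive extraction, so the following is only one plausible reading of a γ-based approximation: take the largest job's intermediate data in full and add a fraction γ of the other jobs' data. The function name and this exact form are assumptions, not the formula from the talk.

```python
def estimate_shared_size(sizes, gamma):
    """Approximate the intermediate data size of a merged group of jobs.

    Assumed form: largest + gamma * (sum of the others).
    gamma = 0 models full overlap with the largest job's data,
    gamma = 1 models no overlap at all.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma is expected to lie in [0, 1]")
    if not sizes:
        return 0.0
    largest = max(sizes)
    return largest + gamma * (sum(sizes) - largest)
```

Under this assumed form a single parameter replaces per-combination statistics: for sizes 30, 20 and 10, the estimate ranges from 30 (γ = 0) to 60 (γ = 1), and γ = 0.5 gives 45.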
Evaluation setup. 40 EC2 small instance virtual machines. Modified Hadoop engine. 30 GB text dataset consisting of blogs. Multiple grep-wordcount queries: each counts words matching a regular expression; allows for variable intermediate data sizes; is a generic aggregation Map Reduce job.
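A grep-wordcount query can be sketched in a few lines of Python. This is a single-process illustration of the query's logic, not the Hadoop job used in the evaluation; the function name and the choice to require the whole word to match (`fullmatch`) are assumptions.

```python
import re
from collections import Counter

def grep_wordcount(lines, pattern):
    """Count the occurrences of each word that matches `pattern`."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        for word in line.split():
            # Map side: only matching words produce intermediate data.
            if regex.fullmatch(word):
                counts[word] += 1  # reduce side: sum per word
    return counts
```

For example, `grep_wordcount(["map reduce map", "merge sort"], r"m.*")` counts `map` twice and `merge` once. A more selective pattern lets fewer words through, which is how such queries produce different intermediate data sizes over the same input.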
Evaluation goals. Sharing is not always beneficial: the ‘GreedyShare’ policy. How much can we save on sharing scans? MRShare MultiSplitJobs evaluation. How much can we save on sharing intermediate data? MRShare γ-MultiSplitJobs evaluation.
Is sharing always beneficial? The ‘GreedyShare’ policy.
How much do we save on sharing scans? MRShare MultiSplitJobs.
How much do we save on sharing intermediate data? MRShare γ-MultiSplitJobs.
Summary. We introduced MRShare – a framework for automatic work sharing in Map Reduce. We identified sharing primitives and demonstrated their implementation in a Map Reduce engine. We established a cost model and solved several work sharing optimization problems. We demonstrated vast savings when using MRShare.
Thank you! Questions?

Mr Share 11 Sep 2010

  • 37. Ongoing work – sharing expensive computation. Sharing across multiple Map Reduce jobs with expensive predicates. [Diagram: Input, a Meta-map containing Map 1 to Map 4.]
  • 38. Ongoing work – dynamic sharing. [Diagram labels: progress over time for J1, J2 and the merged J1+J2.]

Editor's notes

  1. Talk about the different possible ways of arranging jobs, and the question of which one is optimal.