Optimising Joins in MR via Lookup Service
Rohit Kochar
Inmobi
Problem Statement
• Table A, a.k.a. the fact table => huge data set (100+ GB)
• Table B, a.k.a. the dimension table => relatively small data set (1-2 GB)
• R = A X B => required result, i.e. A joined with B
Types of Joins
Broadly, there are two approaches for performing joins in a Hadoop job:
• Fragment-replicate joins
• Reduce-side joins
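For contrast with the map-side approach used in the rest of the deck, here is a minimal sketch of a reduce-side join in plain Python. It only imitates the shuffle with an in-memory dictionary; the function name and the sample records are assumptions made for illustration and are not from the talk. The point it shows is that both tables, including the huge fact table, must be grouped by join key before any record can be joined.

```python
from collections import defaultdict


def reduce_side_join(fact_rows, dim_rows):
    """Inner-join two (key, value) datasets by grouping both on the join key."""
    # "Map" phase: tag each record with its source table and emit it under the join key.
    shuffled = defaultdict(lambda: {"fact": [], "dim": []})
    for key, value in fact_rows:
        shuffled[key]["fact"].append(value)
    for key, value in dim_rows:
        shuffled[key]["dim"].append(value)

    # "Reduce" phase: for each key, pair every fact record with every dimension record.
    joined = []
    for key in sorted(shuffled):
        group = shuffled[key]
        for fact_value in group["fact"]:
            for dim_value in group["dim"]:
                joined.append((key, fact_value, dim_value))
    return joined


# Hypothetical sample data: key 3 has no dimension row, so it is dropped.
fact = [(1, "click"), (2, "view"), (1, "view"), (3, "click")]
dim = [(1, "IN"), (2, "US")]
result = reduce_side_join(fact, dim)
```

A fragment-replicate join avoids this shuffle by copying the small table to every mapper, which is why it suits a small dimension table.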
Our Initial Approach
• Dimension data was small
• Map-side joins by loading dimension data into HashMaps
• Stream the fact table
• UDFs for Pig scripts
• Good for fat maps
Our Initial Approach (contd.)
Example:
R1 = JOIN A BY A1, B BY B1;
R2 = JOIN R1 BY A2, C BY C1;
R3 = JOIN R2 BY A3, D BY D1;
• This will result in multiple MR jobs in Pig
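A minimal sketch of the HashMap-based alternative, assuming each dimension table fits in memory: the three dimensions are loaded into dictionaries once, and the fact table is streamed through a single pass that resolves all three keys. The function names, the record layout, and the sample data are hypothetical and only stand in for the UDFs mentioned above.

```python
def load_dimension(rows):
    """Load a small dimension table into an in-memory dict (the HashMap of the slides)."""
    return {key: value for key, value in rows}


def map_side_join(fact_stream, dim_b, dim_c, dim_d):
    """Stream fact records and resolve all three dimension lookups in one pass."""
    for a1, a2, a3, measure in fact_stream:
        b = dim_b.get(a1)
        c = dim_c.get(a2)
        d = dim_d.get(a3)
        # Inner-join semantics: drop the record if any dimension key is missing.
        if b is None or c is None or d is None:
            continue
        yield (measure, b, c, d)


# Hypothetical dimension tables B, C and D.
dim_b = load_dimension([(1, "IN"), (2, "US")])
dim_c = load_dimension([(10, "mobile"), (20, "tablet")])
dim_d = load_dimension([(100, "banner")])

# Hypothetical fact records: (A1, A2, A3, measure).
fact = [(1, 10, 100, 5), (2, 20, 100, 7), (3, 10, 100, 9)]
rows = list(map_side_join(fact, dim_b, dim_c, dim_d))
```

The cost of this design is that every mapper has to build its own copy of the dictionaries before it can process a single record.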
Cons of this approach
• Increased memory footprint of jobs
• Increased map setup time
• Large number of mappers => the same dimension data is read multiple times
Dimension Store
• In-memory data backed by disk
• High read throughput
• Schema- and data-type-aware lookup service
• Client library for lookups
• Built-in client-side cache in the library
• ETL job to load dimensions into the store
• Multi-version data to support dimension analytics
• Single source of truth for all processing
Joins using Dimension Store
• Instead of a local cache, use DimStore in the mapper for joins
• 99.5% of lookups satisfied from the local client cache
• Cache size is 1-30% of the corresponding dimension table size
• 30-40% gain in time taken for jobs
• Joins in real-time processing
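A minimal sketch of how a client-side cache in front of a lookup service might behave. The class name, the LRU eviction policy, and the `remote_lookup` callable are assumptions for illustration; the slides do not describe the DimStore client API or its eviction strategy. The sketch only shows why a small cache can absorb most lookups when the fact data references a small working set of dimension keys.

```python
from collections import OrderedDict


class CachedDimClient:
    """Hypothetical lookup client: a bounded LRU cache in front of a remote store."""

    def __init__(self, remote_lookup, capacity):
        self._remote_lookup = remote_lookup  # callable standing in for the remote service
        self._capacity = capacity
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.misses += 1
        value = self._remote_lookup(key)  # only cache misses reach the remote store
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Hypothetical remote dimension table with 1000 entries.
remote_table = {i: f"name-{i}" for i in range(1000)}
remote_calls = []


def remote_lookup(key):
    remote_calls.append(key)
    return remote_table[key]


client = CachedDimClient(remote_lookup, capacity=10)
keys = [i % 5 for i in range(100)]  # the fact stream touches only 5 distinct keys
values = [client.get(k) for k in keys]
```

In this toy run only the first lookup of each distinct key reaches the remote store, so the cache holds a small fraction of the dimension while serving most requests locally.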
Improvements on a real job
Parameter | New Job | Existing Job
Avg map time | 731 sec (12.2 mins) | 1312 sec (21.9 mins)
Total time by all mappers | 41 mins, 55 sec | 1 hr, 34 mins, 10 sec

Cache Stats
Dimension Lookup | Cardinality of Dimension | Elements Loaded in Cache | Cache Hit | Cache Size / Total
Dimension1 | 542K | 11K | 99.75% | 2%
Dimension2 | 558K | 9K | 99.94% | 1.6%
Dimension3 | 2590K | 113K | 97.51% | 4.3%
Dimension4 | 514 | 432 | 99.98% | 84.04%
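The reductions implied by the timing figures can be worked out directly. The sketch below uses only the numbers reported above; the helper name `reduction_pct` is introduced here for illustration.

```python
def reduction_pct(new, old):
    """Percentage reduction of `new` relative to `old`, rounded to one decimal."""
    return round((old - new) / old * 100, 1)


# Average map time, in seconds, as reported for the new and existing jobs.
avg_map_reduction = reduction_pct(731, 1312)

# Total time by all mappers, converted to seconds.
total_new = 41 * 60 + 55              # 41 mins, 55 sec
total_old = 1 * 3600 + 34 * 60 + 10   # 1 hr, 34 mins, 10 sec
total_map_reduction = reduction_pct(total_new, total_old)

# Cache size relative to dimension cardinality for Dimension4 (exact counts are given).
dimension4_ratio = round(432 / 514 * 100, 2)
```

For the other dimensions the counts are rounded to thousands, so recomputing the ratio from them can differ slightly from the reported percentage.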
Technologies Evaluated for DimStore Server
• HSQL DB => in-memory, in-process relational database
• Redis => in-memory key-value store, also referred to as a data structure store
• Aerospike => in-memory, disk-backed key-value store
HSQL DB
• Throughput of 60k queries/sec
• Latency ~8 ms
• Built-in support for joins
• Querying on a non-indexed column was a problem
Redis
• Throughput of 70k queries/sec
• Latency 1-2 ms
• No native support for sharding and HA
• No disk persistence
• No support for tuples
Aerospike (Community Edition)
• Throughput of 120k queries/sec
• Latency ~1 ms
• Support for auto-sharding and HA
• Disk persistence
• Secondary indexes
• Support for tuples
Limitations
• Ratio of dimension cardinality to input per batch is high
• Staleness of data is not acceptable
• Dimension data size is very small
Q & A
Thanks
