Optimizing Joins in MapReduce Jobs via a Lookup Service

Rohit Kochar
Inmobi

Problem Statement
• Table A, a.k.a. the fact table => huge data set (100+ GB)
• Table B, a.k.a. the dimension table => relatively small data set (1-2 GB)
• R = A JOIN B => required result

Types of Joins
Broadly, there are two approaches for performing joins in a Hadoop job:
• Fragment-replicate joins
• Reduce-side joins
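To make the reduce-side approach concrete, here is a minimal single-process sketch in Python. It only imitates the map, shuffle and reduce phases with in-memory structures; the function name `reduce_side_join` and the sample rows are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(fact_rows, dim_rows, fact_key, dim_key):
    """Inner-join two lists of dicts the way a reduce-side join would."""
    # "Map" + "shuffle": tag each record with its source table and
    # group it under its join key, as the shuffle would do across reducers.
    groups = defaultdict(lambda: {"fact": [], "dim": []})
    for row in fact_rows:
        groups[row[fact_key]]["fact"].append(row)
    for row in dim_rows:
        groups[row[dim_key]]["dim"].append(row)

    # "Reduce": for each key, pair every fact record with every
    # dimension record that shares the key.
    joined = []
    for key in sorted(groups):
        for fact, dim in product(groups[key]["fact"], groups[key]["dim"]):
            joined.append({**fact, **dim})
    return joined


# Hypothetical sample data.
fact = [{"a1": 1, "clicks": 10}, {"a1": 2, "clicks": 5}, {"a1": 3, "clicks": 7}]
dim = [{"b1": 1, "country": "IN"}, {"b1": 2, "country": "US"}]
result = reduce_side_join(fact, dim, "a1", "b1")
```

The fact row with key 3 has no dimension match and is dropped, as in an inner join. In a real job both tables cross the network during the shuffle, which is the cost a fragment-replicate (map-side) join avoids.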

Our Initial Approach
• Dimension data was small
• Map-side joins by loading dimension data into HashMaps
• Stream the fact table through the mappers
• UDFs for Pig scripts
• Good for fat maps
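The approach above can be sketched in a few lines of Python: the dimension table is loaded into a hash map once, and fact rows are streamed past it. This is a minimal illustration under assumed names (`build_dimension_index`, `map_side_join`) and sample data, not the original UDF code.

```python
def build_dimension_index(dim_rows, dim_key):
    """Load the (small) dimension table into an in-memory hash map."""
    return {row[dim_key]: row for row in dim_rows}


def map_side_join(fact_stream, dim_index, fact_key):
    """Stream fact rows and join each one with a hash-map lookup."""
    for row in fact_stream:
        dim_row = dim_index.get(row[fact_key])
        if dim_row is not None:  # inner join: skip facts with no match
            yield {**row, **dim_row}


# Hypothetical sample data.
dimension = [{"b1": 1, "country": "IN"}, {"b1": 2, "country": "US"}]
facts = iter([{"a1": 2, "clicks": 5}, {"a1": 9, "clicks": 1}, {"a1": 1, "clicks": 10}])

index = build_dimension_index(dimension, "b1")
joined = list(map_side_join(facts, index, "a1"))
```

Every mapper has to build its own copy of `index` during setup, so the memory and load time grow with the dimension table.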

Our Initial Approach (contd.)
Example:
R1 = JOIN A BY A1, B BY B1;
R2 = JOIN R1 BY A2, C BY C1;
R3 = JOIN R2 BY A3, D BY D1;
• This results in multiple MR jobs in Pig

Cons of this approach
• Increased memory footprint of jobs
• Increased map setup time
• Large number of mappers => the same dimension data is read multiple times
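A rough back-of-the-envelope calculation shows how quickly the repeated reads add up. The table sizes come from the problem statement; the 128 MB input split size, and the idea that every mapper loads the whole dimension table once, are assumptions made only for this illustration.

```python
from math import ceil

FACT_TABLE_GB = 100      # from the problem statement (100+ GB)
DIMENSION_TABLE_GB = 1   # lower end of the 1-2 GB range
SPLIT_SIZE_MB = 128      # assumption: one mapper per 128 MB input split

# One mapper per split of the fact table.
mappers = ceil(FACT_TABLE_GB * 1024 / SPLIT_SIZE_MB)

# Each mapper loads the full dimension table during setup.
total_dimension_read_gb = mappers * DIMENSION_TABLE_GB
```

Under these assumptions, 800 mappers read about 800 GB of dimension data in total, several times the size of the fact table itself.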

Dimension Store
• In-memory data, backed by disk
• High read throughput
• Schema- and data-type-aware lookup service
• Client library for lookups
• Built-in client-side cache in the library
• ETL job to load dimensions into the store
• Multi-version data to support dimension analytics
• Single source of truth for all processing
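A client library with a built-in cache and versioned lookups might look roughly like the sketch below. The slides do not describe the actual API, so the class `DimStoreClient`, its `lookup` signature, the `fetch` callback and the in-memory `SERVER` stand-in are all hypothetical.

```python
from collections import OrderedDict


class DimStoreClient:
    """Hypothetical lookup client: an LRU cache in front of a remote store."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch          # callable(table, key, version) -> row or None
        self._capacity = capacity    # maximum number of cached entries
        self._cache = OrderedDict()  # insertion order doubles as recency order
        self.hits = 0
        self.misses = 0

    def lookup(self, table, key, version):
        cache_key = (table, key, version)
        if cache_key in self._cache:
            self._cache.move_to_end(cache_key)  # mark as most recently used
            self.hits += 1
            return self._cache[cache_key]
        self.misses += 1
        row = self._fetch(table, key, version)  # round trip to the server
        self._cache[cache_key] = row            # misses (None) are cached too
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)     # evict least recently used
        return row


# In-memory stand-in for the server, holding two versions of one dimension.
SERVER = {
    ("country", "v1"): {1: {"name": "India"}},
    ("country", "v2"): {1: {"name": "Bharat"}},
}


def fetch(table, key, version):
    return SERVER.get((table, version), {}).get(key)


client = DimStoreClient(fetch, capacity=2)
first = client.lookup("country", 1, "v1")
again = client.lookup("country", 1, "v1")   # served from the client cache
newer = client.lookup("country", 1, "v2")   # a different version, fetched again
```

Keying the cache on the version keeps lookups against different dimension versions from interfering with one another.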

Joins using Dimension store
• Use DimStore in the mapper for joins instead of a local cache
• 99.5% of lookups satisfied from the local client cache
• Cache size is 1-30% of the corresponding dimension table size
• 30-40% gain in job run time
• Joins in real-time processing
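One plausible reason a small cache can serve almost all lookups is that fact data tends to reference a small set of "hot" dimension keys; the slides do not state this, so treat it as an assumption. The sketch below joins a skewed fact stream through a cached lookup and measures the hit rate. The `DIMENSION` dict stands in for the remote store, and all names and sizes are illustrative.

```python
from functools import lru_cache

# Stand-in for the remote dimension store: 1000 dimension rows.
DIMENSION = {i: f"name-{i}" for i in range(1000)}


@lru_cache(maxsize=50)  # cache holds at most 5% of the dimension
def lookup(key):
    """Pretend remote lookup; only cache misses reach DIMENSION."""
    return DIMENSION.get(key)


def join(fact_rows):
    """Join each (key, value) fact row with its dimension attribute."""
    return [(key, value, lookup(key)) for key, value in fact_rows]


# Assumed skew: 1000 fact rows that reference only 10 distinct keys.
fact_rows = [(i % 10, i) for i in range(1000)]
joined = join(fact_rows)

info = lookup.cache_info()
hit_rate = info.hits / (info.hits + info.misses)   # 990 / 1000 = 0.99
cached_fraction = info.currsize / len(DIMENSION)   # 10 / 1000 = 0.01
```

Here only the first lookup of each key reaches the store; the remaining 990 lookups are served locally while the cache holds 1% of the dimension.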

Improvements on a real job
Parameter                   New Job               Existing Job
Avg map time                731 sec (12.2 min)    1312 sec (21.9 min)
Total time by all mappers   41 min 55 sec         1 hr 34 min 10 sec

Cache Stats
Dimension Lookup   Cardinality of Dimension   Elements Loaded in Cache   Cache Hit   Cache Size / Total
Dimension1         542K                       11K                        99.75%      2%
Dimension2         558K                       9K                         99.94%      1.6%
Dimension3         2590K                      113K                       97.51%      4.3%
Dimension4         514                        432                        99.98%      84.04%
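The figures on this slide can be cross-checked with a little arithmetic. The sketch below derives the relative reduction in map time and recomputes the cache-size ratios from the cardinality and loaded-element counts. The "K" values are rounded in the source, so the recomputed percentages are approximate.

```python
# Timings from the slide, in seconds.
avg_map_new, avg_map_old = 731, 1312
total_new = 41 * 60 + 55              # 41 min 55 sec
total_old = 1 * 3600 + 34 * 60 + 10   # 1 hr 34 min 10 sec

avg_map_reduction = 1 - avg_map_new / avg_map_old   # about 0.443
total_map_reduction = 1 - total_new / total_old     # about 0.555

# (cardinality of dimension, elements loaded in cache) from the slide.
cache_stats = {
    "Dimension1": (542_000, 11_000),
    "Dimension2": (558_000, 9_000),
    "Dimension3": (2_590_000, 113_000),
    "Dimension4": (514, 432),
}

# Cache size as a percentage of the full dimension.
cache_ratio_pct = {
    name: 100 * loaded / cardinality
    for name, (cardinality, loaded) in cache_stats.items()
}
```

The recomputed ratios (roughly 2.0%, 1.6%, 4.4% and 84%) agree with the reported column to within rounding.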

Technologies Evaluated for DimStore Server
• HSQLDB => In-memory, in-process relational database
• Redis => In-memory key-value store, also referred to as a data structure store
• Aerospike => In-memory, disk-backed key-value store

HSQLDB
(Charts: throughput and latency)
• Throughput: 60k queries/sec
• Latency: ~8 ms
• Built-in support for joins
• Queries on non-indexed columns were a problem

Redis
(Charts: throughput and latency)
• Throughput: 70k queries/sec
• Latency: 1-2 ms
• No native support for sharding and HA
• No disk persistence
• No support for tuples

Aerospike (Community Edition)
(Charts: throughput and latency)
• Throughput: 120k queries/sec
• Latency: ~1 ms
• Support for auto-sharding and HA
• Disk persistence
• Secondary indexes
• Support for tuples

Limitations
The approach is less suitable when:
• The ratio of dimension cardinality to input per batch is high
• Staleness of data is not acceptable
• Dimension data size is very small
