This document describes Yahoo's attribution framework for efficiently performing sparse joins on massive data to attribute display ad impressions and clicks to the serving events. The key challenges are the size disparity between the impression/click events (hundreds of MBs) and serving events (multiple TBs), and the sparse nature of the joins. The framework uses aggressive data partitioning and pruning strategies, and partition-aware efficient join query plans, to enable attributing events over long lookback windows efficiently. Performance comparisons show it significantly outperforms alternative approaches like hash and replicated joins in Pig.
Yahoo Display Advertising Attribution
1. Yahoo! Display Ads Attribution Framework:
A Problem of Efficient Sparse Joins on Massive Data
Supreeth, Sundeep, Chenjie, Chinmay
Data Team, Yahoo!
2. Agenda
§ Problem description
› Serves, impressions, clicks
› Attribution
§ Class of problems and applications in other use cases
§ Attribution framework
§ Performance comparison
§ Conclusion
3. Serves, Impressions, Clicks
(Diagram: Web Servers and Ad Servers emitting events)
§ Serves – server-side logged event for an ad served; the serve has the complete context
§ Impressions – client-side event for an ad shown
§ Clicks – client-side event for a click on an ad
§ Interactions – client-side events for interactions within an ad
§ Serve events are heavy: a few tens of KBs each
§ Impressions, clicks, and conversions are a few bytes each
The join:
Impression/click/interaction record: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
Serve record: Serve Guid + Serve timestamp + {other fields of serve}
* Guid is a globally unique identifier
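The shared join key described above can be sketched as a pair of record types that both carry the serve guid and serve timestamp. This is an illustrative sketch in Python; the field names beyond the join key are assumptions, since the deck only specifies the key itself.

```python
from dataclasses import dataclass

@dataclass
class ServeEvent:
    serve_guid: str   # globally unique identifier, logged server-side
    serve_ts: int     # timestamp of the serve
    context: dict     # full serving context -- a few tens of KBs per event

@dataclass
class ImpressionEvent:
    serve_guid: str   # carried in the client-side beacon at serve time
    serve_ts: int     # lets the framework locate the serve partition later
    fields: dict      # a few bytes of impression-specific data

def join_key(event):
    """The composite key both event types share: (serve guid, serve timestamp)."""
    return (event.serve_guid, event.serve_ts)
```

Because the client-side event already carries the serve timestamp, the framework can later narrow the join to the serve partitions from that time window.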
4. Need for Attribution
§ Serves are logged every 5 minutes
§ Impressions/clicks arrive every 5 minutes, but may correspond to serve instances from several hours to days earlier
§ Goal: attribute each impression/click to its serve
5. Distribution of % Impressions Arrived From the Client Side wrt Serves
(Chart: "% of impressions for a serve" vs. the 5-minute time period in which the serve happened; t1 = 201205301000, t2 = 201205300955, t3 = 201205300950, …, t13.)
6. Distribution of % Clicks Arrived From the Client Side wrt Serves
(Chart: "% of clicks for a serve" vs. the 5-minute time period in which the serve happened; t1 = 201205301000, t2 = 201205300955, t3 = 201205300950, …, t13.)
7. Class of Problems
§ Sparse joins spanning TBs of data on the grid
§ Joining a few MBs against a few TBs
§ Left outer join, or any other outer join

Data Set    Impressions    Serves (5m * 288)
Data Size   400MB          20GB * 288 ~= 5.6 TB
(compressed sizes)
8. Similar Use Cases
§ Associating video, click, and social interactions back to the activity data
§ Attributing a small client-side beacon back to a large dataset
§ Within Yahoo
› Audience view/click attribution
› Weblog-based investigation
› Joining dimensional data with web traffic data
9. Pig Joins and Problem Fit

Join Strategy     Comments                                      Cost
Merge join        The datasets are not sorted                   High
Hash join         Shuffle and reduce time                       High
Replicated join   Does not meet performance needs; left outer   High
                  join on the replicated dataset
Skewed join       Data set is not skewed                        N/A
10. Problem Statement
Perform a sparse outer join on a very large dataset, with high performance requirements, for display ad attribution on the grid.
11. Attribution Framework – Overview
§ Smart Instrumentation Strategies
§ Aggressive Partitioning and Selection
§ Partition Aware Efficient Join Query Plan
12. Instrument for Attribution
§ Serve guid
§ Clues which can help you partition better
› Timestamp of the serve
§ Partition keys used in event instrumentation
§ In the impression attribution example:
Impression: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
Serves: Serve Guid + Serve timestamp + {other fields of serve}
13. Partitioning Approach
§ Join-key based partitioning
§ Keys for leveraging physical partitioning
› Timestamp
§ Use of hashes in partitioning
› FNV, Murmur

Key         Partition Type
Join keys   Hash
Timestamp   Range
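The two partitioning schemes in the table above can be sketched as follows: hash partitioning on the join key (the deck mentions FNV and Murmur; FNV-1a is shown here) and range partitioning on the serve timestamp. The bucket count and 5-minute window size are illustrative, not taken from the deck.

```python
# 32-bit FNV-1a constants (standard published values).
FNV_PRIME = 0x01000193
FNV_OFFSET = 0x811C9DC5

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & 0xFFFFFFFF
    return h

def hash_partition(serve_guid: str, num_partitions: int) -> int:
    """Hash-partition on the join key (the serve guid)."""
    return fnv1a_32(serve_guid.encode()) % num_partitions

def range_partition(serve_ts: int, window_secs: int = 300) -> int:
    """Range-partition on timestamp: one bucket per 5-minute serve window."""
    return serve_ts // window_secs
```

Each serve record then lands in one (range bucket, hash bucket) partition, and any event carrying the same guid and timestamp maps to the same partition deterministically.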
14. Pruning/Selection
§ Hashing of keys in the data sets
§ Pruning of partitions
› Timestamp
› Hash of the join key
§ IO costs and partitions
§ Configurable partitions

Key         Partition Type   Pruning
Join keys   Hash             Yes
Timestamp   Range            Yes
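Pruning can be sketched as deriving, from the small impression set, exactly the (time bucket, hash bucket) serve partitions that can contain a match, and skipping everything else. This is an illustrative sketch: the partition layout, CRC32 stand-in hash, and bucket counts are assumptions, not the deck's actual implementation.

```python
import zlib

def partitions_to_read(impressions, num_hash_buckets=16, window_secs=300):
    """Map each impression's join key (guid, serve timestamp) to the one
    serve partition that can contain its matching serve, and return the
    union of those partitions -- everything else is pruned."""
    needed = set()
    for guid, serve_ts in impressions:
        time_bucket = serve_ts // window_secs                       # range pruning on timestamp
        hash_bucket = zlib.crc32(guid.encode()) % num_hash_buckets  # hash pruning on join key
        needed.add((time_bucket, hash_bucket))
    return needed
```

With a 24-hour lookback there are 288 five-minute windows times `num_hash_buckets` serve partitions on disk, but only the partitions a live impression actually points at are read, which is where the "effective 5.6 TB, 1.1 TB with pruning" gap on the comparison slide comes from.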
15. Partition Aware Efficient Join Query Plans
§ Step 1: stream the selected impression event keys (size: MBs) and inner-join them with the pruned serve event partitions (size: TBs), producing annotated impressions (size: MBs)
§ Step 2: stream the full impression events (size: hundreds of MBs) and left-outer-join them with the annotated impressions
§ Output: complete annotated impressions with serve data
(In the diagram, the small side of each join is held in memory; the large side is streamed.)
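The two-phase plan above can be sketched in a few lines: an inner join of the small key set against the pruned serve partitions builds an in-memory annotation map, and a left outer join over the full impression stream guarantees a record per impression even when no serve matched. Record shapes and field names here are illustrative assumptions.

```python
def attribute(impressions, serve_partitions):
    """Two-phase partition-aware join sketch.

    impressions:      list of dicts, each with a "key" field (small side)
    serve_partitions: iterable of partitions, each a list of serve dicts
                      with "key" and "context" fields (large side, pruned)
    """
    # Phase 1: inner join of impression keys vs. selected serve partitions.
    wanted = {imp["key"] for imp in impressions}
    annotated = {}                       # key -> serve context; fits in memory (MBs)
    for partition in serve_partitions:   # TBs on disk, but pruned before reading
        for serve in partition:
            if serve["key"] in wanted:
                annotated[serve["key"]] = serve["context"]

    # Phase 2: map-side left outer join over the full impression stream.
    # Every impression yields a record; serve_context is None when unmatched.
    for imp in impressions:
        yield {**imp, "serve_context": annotated.get(imp["key"])}
```

Because the annotation map is small, phase 2 needs no shuffle or reduce, which matches the framework's zero-reducer rows on the comparison slide.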
16. Attribution Framework: Capabilities
§ Left outer join on impression/click/interaction
› As long as the impression/click/interaction exists, we will get a record in the output
§ Complete annotation with the serve
§ Distinct join with serves
§ Sparse joins achieved by pruning the partitions
§ Map-side joins
18. Attribution Framework: Tuning Parameters
§ Serve partitions: trade-off between IO and namespace used (lookback = 24 hours)
(Chart: bytes read (GB) and namespace used (number of files) vs. number of partitions, for partitions = 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024; bytes read falls as partitions increase while the number of files grows.)
19. Attribution Framework: Tuning Parameters
§ Split size: trade-off between number of mappers and map task run time (partitions = 16, lookback = 24 hours)
(Chart: number of mappers and time taken (s) vs. split size, for split sizes of 128MB, 1GB, 2GB, 3GB, 4GB; larger splits mean fewer mappers but longer-running map tasks.)
20. Comparison With Other Pig Joins

Join                    Mappers  Reducers  Lookback  Input Size                         Time to complete
Left Outer Hash Join    2800     45        40 mins   180GB                              42.5m*
Replicated Join         5680     0         5 hours   1TB                                7m**
Attribution Framework   5760     0         24 hours  Effective 5.6 TB; 1.1 TB w/ pruning  6m***

* Best case for hash join: 1.5m + 15.5m + 25.5m (mapper + shuffle + reducer)
** Map time taken
*** 1 min + 2 mins + 3 mins (selection/pruning + impression partitioning + join)
21. Conclusion
§ For the sparse lookup problem, the attribution framework works very well and within the performance needs
§ Effective partitioning enables longer lookbacks and reduced IO
§ The levers in the framework allow tuning based on computation/IO requirements
22. Future Steps
§ Use HBase/Cassandra to store the event-grain serve data and do lookups
§ Use a Bloom filter along with an index format
§ Compare the strategy with what Hive does and build a framework using Hive