This document describes Yahoo's attribution framework for efficiently performing sparse joins on massive data to attribute display ad impressions and clicks to the serving events. The key challenges are the size disparity between the impression/click events (hundreds of MBs) and serving events (multiple TBs), and the sparse nature of the joins. The framework uses aggressive data partitioning and pruning strategies, and partition-aware efficient join query plans, to enable attributing events over long lookback windows efficiently. Performance comparisons show it significantly outperforms alternative approaches like hash and replicated joins in Pig.
Yahoo Display Advertising Attribution
1. Yahoo! Display Ads Attribution Framework:
A Problem of Efficient Sparse Joins on Massive Data
Supreeth, Sundeep, Chenjie, Chinmay
Data Team, Yahoo!
2. Agenda
§ Problem description
› Serves, impressions, clicks
› Attribution
§ Class of problems and applications in other use cases
§ Attribution framework
§ Performance comparison
§ Conclusion
3. Serves, Impressions, Clicks
(Diagram: Web Servers and Ad Servers emitting events)
§ Serves – server-side logged event for an ad served; the serve has the complete context
§ Impressions – client-side event for an ad shown
§ Clicks – client-side event for a click on an ad
§ Interactions – client-side events for interactions within an ad
§ Serve events are heavy: a few tens of KBs each
§ Impressions, clicks, and conversions are a few bytes each
The join:
Impression/click/interaction record: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
Serve record: Serve Guid + Serve timestamp + {other fields of serve}
* Guid is a globally unique identifier
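The shared join key described above can be sketched as a pair of record types that both carry the serve guid and serve timestamp. This is an illustrative sketch in Python; the field names beyond the join key are assumptions, since the deck only specifies the key itself.

```python
from dataclasses import dataclass

@dataclass
class ServeEvent:
    serve_guid: str   # globally unique identifier, logged server-side
    serve_ts: int     # timestamp of the serve
    context: dict     # full serving context -- a few tens of KBs per event

@dataclass
class ImpressionEvent:
    serve_guid: str   # carried in the client-side beacon at serve time
    serve_ts: int     # lets the framework locate the serve partition later
    fields: dict      # a few bytes of impression-specific data

def join_key(event):
    """The composite key both event types share: (serve guid, serve timestamp)."""
    return (event.serve_guid, event.serve_ts)
```

Because the client-side event already carries the serve timestamp, the framework can later narrow the join to the serve partitions from that time window.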
4. Need for Attribution
§ Serves are logged every 5 minutes
§ Impressions/clicks arrive every 5 minutes, but may correspond to serve instances from several hours to days earlier
§ Goal: attribute each impression/click to its serve
5. Distribution of % Impressions Arrived From the Client Side wrt Serves
(Chart: "% of impressions for a serve" vs. the 5-minute time period in which the serve happened; t1 = 201205301000, t2 = 201205300955, t3 = 201205300950, …, t13.)
6. Distribution of % Clicks Arrived From the Client Side wrt Serves
(Chart: "% of clicks for a serve" vs. the 5-minute time period in which the serve happened; t1 = 201205301000, t2 = 201205300955, t3 = 201205300950, …, t13.)
7. Class of Problems
§ Sparse joins spanning TBs of data on the grid
§ Joining a few MBs against a few TBs
§ Left outer join, or any other outer join

Data Set    Impressions    Serves (5m * 288)
Data Size   400MB          20GB * 288 ~= 5.6 TB
(compressed sizes)
8. Similar Use Cases
§ Associating video, click, and social interactions back to the activity data
§ Attributing a small client-side beacon back to a large dataset
§ Within Yahoo
› Audience view/click attribution
› Weblog-based investigation
› Joining dimensional data with web traffic data
9. Pig Joins and Problem Fit

Join Strategy     Comments                                      Cost
Merge join        The datasets are not sorted                   High
Hash join         Shuffle and reduce time                       High
Replicated join   Does not meet performance needs; left outer   High
                  join on the replicated dataset
Skewed join       Data set is not skewed                        N/A
10. Problem Statement
Perform a sparse outer join on a very large dataset, with high performance requirements, for display ad attribution on the grid.
11. Attribution Framework – Overview
§ Smart Instrumentation Strategies
§ Aggressive Partitioning and Selection
§ Partition Aware Efficient Join Query Plan
12. Instrument for Attribution
§ Serve guid
§ Clues which can help you partition better
› Timestamp of the serve
§ Partition keys used in event instrumentation
§ In the impression attribution example:
Impression: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
Serves: Serve Guid + Serve timestamp + {other fields of serve}
13. Partitioning Approach
§ Join-key based partitioning
§ Keys for leveraging physical partitioning
› Timestamp
§ Use of hashes in partitioning
› FNV, Murmur

Key         Partition Type
Join keys   Hash
Timestamp   Range
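The two partitioning schemes in the table above can be sketched as follows: hash partitioning on the join key (the deck mentions FNV and Murmur; FNV-1a is shown here) and range partitioning on the serve timestamp. The bucket count and 5-minute window size are illustrative, not taken from the deck.

```python
# 32-bit FNV-1a constants (standard published values).
FNV_PRIME = 0x01000193
FNV_OFFSET = 0x811C9DC5

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & 0xFFFFFFFF
    return h

def hash_partition(serve_guid: str, num_partitions: int) -> int:
    """Hash-partition on the join key (the serve guid)."""
    return fnv1a_32(serve_guid.encode()) % num_partitions

def range_partition(serve_ts: int, window_secs: int = 300) -> int:
    """Range-partition on timestamp: one bucket per 5-minute serve window."""
    return serve_ts // window_secs
```

Each serve record then lands in one (range bucket, hash bucket) partition, and any event carrying the same guid and timestamp maps to the same partition deterministically.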
14. Pruning/Selection
§ Hashing of keys in the data sets
§ Pruning of partitions
› Timestamp
› Hash of the join key
§ IO costs and partitions
§ Configurable partitions

Key         Partition Type   Pruning
Join keys   Hash             Yes
Timestamp   Range            Yes
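Pruning can be sketched as deriving, from the small impression set, exactly the (time bucket, hash bucket) serve partitions that can contain a match, and skipping everything else. This is an illustrative sketch: the partition layout, CRC32 stand-in hash, and bucket counts are assumptions, not the deck's actual implementation.

```python
import zlib

def partitions_to_read(impressions, num_hash_buckets=16, window_secs=300):
    """Map each impression's join key (guid, serve timestamp) to the one
    serve partition that can contain its matching serve, and return the
    union of those partitions -- everything else is pruned."""
    needed = set()
    for guid, serve_ts in impressions:
        time_bucket = serve_ts // window_secs                       # range pruning on timestamp
        hash_bucket = zlib.crc32(guid.encode()) % num_hash_buckets  # hash pruning on join key
        needed.add((time_bucket, hash_bucket))
    return needed
```

With a 24-hour lookback there are 288 five-minute windows times `num_hash_buckets` serve partitions on disk, but only the partitions a live impression actually points at are read, which is where the "effective 5.6 TB, 1.1 TB with pruning" gap on the comparison slide comes from.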
15. Partition Aware Efficient Join Query Plans
§ Step 1: stream the selected impression event keys (size: MBs) and inner-join them with the pruned serve event partitions (size: TBs), producing annotated impressions (size: MBs)
§ Step 2: stream the full impression events (size: hundreds of MBs) and left-outer-join them with the annotated impressions
§ Output: complete annotated impressions with serve data
(In the diagram, the small side of each join is held in memory; the large side is streamed.)
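The two-phase plan above can be sketched in a few lines: an inner join of the small key set against the pruned serve partitions builds an in-memory annotation map, and a left outer join over the full impression stream guarantees a record per impression even when no serve matched. Record shapes and field names here are illustrative assumptions.

```python
def attribute(impressions, serve_partitions):
    """Two-phase partition-aware join sketch.

    impressions:      list of dicts, each with a "key" field (small side)
    serve_partitions: iterable of partitions, each a list of serve dicts
                      with "key" and "context" fields (large side, pruned)
    """
    # Phase 1: inner join of impression keys vs. selected serve partitions.
    wanted = {imp["key"] for imp in impressions}
    annotated = {}                       # key -> serve context; fits in memory (MBs)
    for partition in serve_partitions:   # TBs on disk, but pruned before reading
        for serve in partition:
            if serve["key"] in wanted:
                annotated[serve["key"]] = serve["context"]

    # Phase 2: map-side left outer join over the full impression stream.
    # Every impression yields a record; serve_context is None when unmatched.
    for imp in impressions:
        yield {**imp, "serve_context": annotated.get(imp["key"])}
```

Because the annotation map is small, phase 2 needs no shuffle or reduce, which matches the framework's zero-reducer rows on the comparison slide.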
16. Attribution Framework: Capabilities
§ Left outer join on impression/click/interaction
› As long as the impression/click/interaction exists, we will get a record in the output
§ Complete annotation with the serve
§ Distinct join with serves
§ Sparse joins achieved by pruning the partitions
§ Map-side joins
18. Attribution Framework: Tuning Parameters
§ Serve partitions: trade-off between IO and namespace used (lookback = 24 hours)
(Chart: bytes read (GB) and namespace used (number of files) vs. number of partitions, for partitions = 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024; bytes read falls as partitions increase while the number of files grows.)
19. Attribution Framework: Tuning Parameters
§ Split size: trade-off between number of mappers and map task run time (partitions = 16, lookback = 24 hours)
(Chart: number of mappers and time taken (s) vs. split size, for split sizes of 128MB, 1GB, 2GB, 3GB, 4GB; larger splits mean fewer mappers but longer-running map tasks.)
20. Comparison With Other Pig Joins

Join                    Mappers  Reducers  Lookback  Input Size                         Time to complete
Left Outer Hash Join    2800     45        40 mins   180GB                              42.5m*
Replicated Join         5680     0         5 hours   1TB                                7m**
Attribution Framework   5760     0         24 hours  Effective 5.6 TB; 1.1 TB w/ pruning  6m***

* Best case for hash join: 1.5m + 15.5m + 25.5m (mapper + shuffle + reducer)
** Map time taken
*** 1 min + 2 mins + 3 mins (selection/pruning + impression partitioning + join)
21. Conclusion
§ For the sparse lookup problem, the attribution framework works very well and within the performance needs
§ Effective partitioning enables longer lookbacks and reduced IO
§ The levers in the framework allow tuning based on computation/IO requirements
22. Future Steps
§ Use HBase/Cassandra to store the event-grain serve data and do lookups
§ Use a Bloom filter along with an index format
§ Compare the strategy with what Hive does and build a framework using Hive