Yahoo! Display Ads Attribution Framework:
A Problem Of Efficient Sparse Joins On Massive Data



            Supreeth, Sundeep, Chenjie, Chinmay

                    Data Team, Yahoo!




                              1
Agenda

§  Problem description
 ›    Serves impressions clicks
 ›    Attribution
§  Class of problems and application in other use cases
§  Attribution framework
§  Performance comparison
§  Conclusion




                              2
Serves Impressions Clicks


[Diagram: Web Servers and Ad Servers]

§  Client side events
 ›    Impressions - client side event for an ad shown
 ›    Clicks - client side event for a click on an ad
 ›    Interactions - client side events for interactions within an ad
 ›    Impressions, clicks and conversions are a few bytes each
§  Server side events
 ›    Serves - server logged event for an ad served; a serve has complete context
 ›    Serve events are heavy, a few 10s of KBs each
§  Join
 ›    Client side:  Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
 ›    Server side:  Serve Guid + Serve timestamp + {other fields of serve}


  * Guid is global unique identifier
                                                   3
Need For Attribution

§  Serves are logged in 5 minute (5m) instances
§  Impressions/clicks arrive every 5 mins
§  The matching serve can be several hours to days old (older instances)
§  Attribute an impression/click with the serve



                          4
Distribution Of % Impressions Arrived
From The Client Side wrt Serves
[Chart: % of impressions for a serve (y-axis, 0 to 90) against the time period from when the serves happened (x-axis, t1 to t13), where t1 = 201205301000, t2 = 201205300955, t3 = 201205300950, ...]
                                              5
Distribution Of % Clicks Arrived From
The Client Side wrt Serves
[Chart: % of clicks for a serve (y-axis, 0 to 45) against the time period from when the serves happened (x-axis, t1 to t13), where t1 = 201205301000, t2 = 201205300955, t3 = 201205300950, ...]
                                              6
Class Of Problems


§  Sparse joins spanning TBs of data on grid
§  A few MBs joined to a few TBs
§  Left outer join or any other outer join


      Data Set                  Impressions   Serves (5m * 288)
      Data Size (compressed)    400MB         20GB * 288 ~= 5.6 TB
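
The serve volume in the table can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 288 serve instances are the 5 minute instances in a 24 hour lookback and that 1 TB = 1024 GB; the variable names are illustrative and not from the deck.

```python
# Back-of-the-envelope sizing for the two join inputs (figures from the table above).
# Assumption: 288 = number of 5 minute serve instances in a 24 hour lookback.
INSTANCE_MINUTES = 5
LOOKBACK_HOURS = 24
SERVE_INSTANCE_GB = 20   # compressed size of one 5 minute serve instance
IMPRESSIONS_MB = 400     # compressed size of the impressions being attributed

instances = LOOKBACK_HOURS * 60 // INSTANCE_MINUTES
serves_gb = SERVE_INSTANCE_GB * instances
serves_tb = serves_gb / 1024                          # assumption: binary units
size_ratio = serves_gb * 1024 / IMPRESSIONS_MB        # serves vs impressions

print(instances)             # 288
print(serves_gb)             # 5760
print(round(serves_tb, 1))   # 5.6
print(round(size_ratio))     # 14746
```

Under these assumptions the large side of the join is roughly four orders of magnitude bigger than the small side, which is what makes the join sparse.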




                                 7
Similar Use Cases

§  Associating video, click, social interactions back to the
    activity data
§  Attribute back a small size client beacon to a large
    dataset
§  Within Yahoo
 ›    Audience view/click attribution
 ›    Weblog based investigation
 ›    Joining dimensional data with web traffic data




                                 8
Pig Joins And Problem Fit


   Join Strategy     Comments                                                                     Cost
   Merge join        The datasets are not sorted                                                  High
   Hash join         Shuffle and reduce time                                                      High
   Replicated join   Does not meet performance needs; left outer join on the replicated dataset   High
   Skewed join       Data set is not skewed                                                       N/A




                                  9
Problem Statement




 To do a sparse outer join on a very large
dataset with high performance requirements
      for display ad attribution on grid




                    10
Attribution Framework - Overview


§  Smart Instrumentation Strategies
§  Aggressive partitioning and selection
§  Partition Aware Efficient Join Query Plan




                             11
Instrument For Attribution

                                                                    Ø Smart Instrumentation
                                                                           Strategies
 §  Serve guid                                                     Aggressive partitioning and
                                                                            selection
 §  Clues which can help you partition better                      Partition Aware Efficient Join
                                                                              Query Plan
     ›    Timestamp of the serve
 §  Partition keys used in event instrumentation
 §  In the impression attribution example:

            Impression                                              Serves


Serve Guid + Serve timestamp + {other fields of        Serve Guid + Serve timestamp + {other
       impressions/clicks/interactions}                           fields of serve}




                                                  12
Partitioning approach

Framework step: Aggressive partitioning and selection

§  Join key based partitioning
§  Keys for leveraging physical partitioning
 ›    timestamp
§  Use of hashes in partitioning
 ›    HashFNV, Murmur

         Key                           Partition Type
         Join keys                     Hash
         Timestamp                     Range
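
The two partition types in the table can be sketched as follows. This is an illustration only, not the framework's code: it assumes a 32-bit FNV-1a hash over the serve guid, 5 minute range partitions labelled in the `YYYYMMDDHHMM` style seen in the charts, and 16 hash partitions. All function names are hypothetical.

```python
from datetime import datetime

FNV_OFFSET_32 = 0x811C9DC5  # standard FNV-1a 32-bit offset basis
FNV_PRIME_32 = 0x01000193   # standard FNV 32-bit prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


def hash_partition(serve_guid: str, num_partitions: int) -> int:
    """Hash partition for the join key (assumed to be the serve guid)."""
    return fnv1a_32(serve_guid.encode("utf-8")) % num_partitions


def range_partition(serve_ts: datetime, instance_minutes: int = 5) -> str:
    """Range partition for the serve timestamp: floor to the 5 minute instance."""
    floored = serve_ts.replace(
        minute=serve_ts.minute - serve_ts.minute % instance_minutes,
        second=0,
        microsecond=0,
    )
    return floored.strftime("%Y%m%d%H%M")


def partition_of(serve_guid: str, serve_ts: datetime, num_partitions: int = 16):
    """Physical location of an event: (time instance, hash bucket)."""
    return range_partition(serve_ts), hash_partition(serve_guid, num_partitions)


instance, bucket = partition_of("example-guid", datetime(2012, 5, 30, 9, 57, 12))
print(instance)  # 201205300955
```

Because both the serve and the impression carry the serve guid and serve timestamp, the same function can locate the serve partition from either side of the join.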




                                  13
Pruning/Selection

Framework step: Aggressive partitioning and selection

§  Hashing of keys in the data sets
§  Pruning of partitions
 ›    Timestamp
 ›    Hash of the join key
§  IO costs and partitions
§  Configurable partitions

        Key                   Partition Type   Pruning
        Join keys             Hash             Yes
        Timestamp             Range            Yes
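
A minimal sketch of pruning on both keys, under stated assumptions: each impression already carries its serve instance label, and `zlib.crc32` stands in for the FNV or Murmur hash named earlier. The function names are hypothetical.

```python
import zlib


def hash_partition(serve_guid: str, num_partitions: int) -> int:
    # Stand-in hash (CRC32); the deck names FNV and Murmur instead.
    return zlib.crc32(serve_guid.encode("utf-8")) % num_partitions


def select_partitions(impressions, num_partitions):
    """Return the (serve instance, hash bucket) pairs that must be read.

    impressions: iterable of (serve_guid, serve_instance) pairs.
    Every serve partition not in the result can be pruned.
    """
    return {
        (instance, hash_partition(guid, num_partitions))
        for guid, instance in impressions
    }


def pruned_fraction(needed, lookback_instances, num_partitions):
    """Share of serve partitions in the lookback that are never opened."""
    total = len(lookback_instances) * num_partitions
    return 1 - len(needed) / total


impressions = [
    ("guid-1", "201205301000"),
    ("guid-2", "201205301000"),
    ("guid-3", "201205300955"),
]
lookback = ["201205301000", "201205300955", "201205300950"]

needed = select_partitions(impressions, num_partitions=16)
skipped = pruned_fraction(needed, lookback, num_partitions=16)
print(sorted(needed))
print(skipped)
```

More hash partitions mean each selected partition is smaller, so fewer bytes are read, but the number of files grows with the partition count.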




                                  14
Partition Aware Efficient Join Query
Plans
Framework step: Partition Aware Efficient Join Query Plan

§  Inner join
 ›    Input: impression event keys (size: MBs)
 ›    Input: stream the selected serve event partitions (size: TBs)
 ›    Output: annotated impression (size: MBs)
§  Left outer join
 ›    Input: stream the full impression event (size: hundreds of MBs)
 ›    Input: annotated impression from the inner join (size: MBs)
 ›    Output: complete annotated impression data with serve data

Legend: in memory, stream
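
The two-stage plan can be sketched in plain Python. This is not the Pig implementation: it assumes the small side of each join fits in memory, that events are dicts with a `serve_guid` field, and that a distinct join keeps the first matching serve. All names are hypothetical.

```python
def inner_join_keys(impression_keys, serve_stream):
    """Stage 1: hold the small key set in memory and stream the serves."""
    keys = set(impression_keys)
    annotated = {}
    for serve in serve_stream:
        guid = serve["serve_guid"]
        # Assumption: distinct join, so the first matching serve wins.
        if guid in keys and guid not in annotated:
            annotated[guid] = serve
    return annotated


def left_outer_join(impression_stream, annotated):
    """Stage 2: stream the full impressions and look up the annotations.

    Every impression yields a record; "serve" is None when no serve matched.
    """
    for impression in impression_stream:
        yield {**impression, "serve": annotated.get(impression["serve_guid"])}


serves = [
    {"serve_guid": "a", "ad": 1},
    {"serve_guid": "b", "ad": 2},
    {"serve_guid": "a", "ad": 3},
]
impressions = [
    {"serve_guid": "a", "position": 1},
    {"serve_guid": "z", "position": 2},
]

annotated = inner_join_keys((i["serve_guid"] for i in impressions), iter(serves))
result = list(left_outer_join(iter(impressions), annotated))
for row in result:
    print(row)
```

Only the impression keys and the annotated matches are held in memory; the large serve partitions and the full impression events are each read once as streams.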
                                          15
Attribution Framework: Capabilities

§  Left outer on impression/click/interaction
 ›    As long as the impression/click/interaction exists, we will get a record in output
§  Complete annotation with the serve
§  Distinct join with serves
§  Sparse joins achieved by pruning the partitions
§  Map side joins




                                16
Attribution Framework: Implementation

§  Python embedded PIG
§  Dynamic partitioning/pruning (UDFs)
§  Configurable parameters
 ›    Lookbacks
 ›    Partitions
 ›    CombinedSplitSize




                              17
Attribution Framework: Tuning Parameters

§  Serve Partitions: trade off between IO & namespace used
    (lookback = 24 hours)

[Chart: bytes read in GB (left axis, 0 to 4000) and namespace used as number of files (right axis, 0 to 180000) against the number of partitions (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024)]
                                                         18
Attribution Framework: Tuning Parameters

§  Split Size: trade off between number of mappers and map task run time
    (partitions = 16, lookback = 24 hours)

[Chart: number of mappers (left axis, 0 to 35000) and time taken in seconds (right axis, 0 to 1200) against split size (128MB, 1 GB, 2 GB, 3 GB, 4 GB)]
                                                            19
Comparison With Other PIG Joins

Join                    Mappers   Reducers   Lookback   Input Size                              Time to complete
Left Outer Hash Join    2800      45         40mins     180GB                                   42.5m*
Replicated Join         5680      0          5hours     1TB                                     7m**
Attribution Framework   5760      0          24hours    Effective 5.6 TB; with pruning 1.1 TB   6m***

 * Best case for hash join: 1.5m + 15.5m + 25.5m (Mapper + Shuffle + Reducer)
 ** Map time taken
 *** 1 min + 2 mins + 3 mins (Selection/Pruning + Impression partitioning + Join)
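
The footnote totals and the effect of pruning can be checked directly from the figures above. This is a rough sketch: the three runs use different lookbacks and input sizes, so the derived numbers are indicative only, and the variable names are illustrative.

```python
# Stage timings in minutes, taken from the footnotes above.
hash_join_min = 1.5 + 15.5 + 25.5   # mapper + shuffle + reducer
framework_min = 1 + 2 + 3           # selection/pruning + impression partitioning + join

# Input sizes in TB for the attribution framework run.
effective_tb = 5.6
pruned_tb = 1.1
read_share = pruned_tb / effective_tb   # share of the serve data actually read

print(hash_join_min)             # 42.5
print(framework_min)             # 6
print(round(100 * read_share))   # 20
```

On these numbers the framework reads about a fifth of the 24 hour serve data, while the hash join spends most of its time in shuffle and reduce, stages that a map side join avoids.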



                                             20
Conclusion


§  For the sparse lookup problem, the attribution framework
    works very well and stays within the performance needs
§  Effective partitioning allows longer lookbacks and reduces
    IO
§  The levers in the framework allow for tuning based on the
    computation/IO requirements




                              21
Future Steps


§  Use HBase/Cassandra to store the event grain serve data
    and do lookups
§  Use of a bloom filter along with an index format
§  Compare the strategy with what Hive does and come up
    with a framework using Hive




                               22
Questions?




             23

Contenu connexe

En vedette

HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
Chango - DDM Alliance Summit Marketing on Facebook
Chango - DDM Alliance Summit Marketing on FacebookChango - DDM Alliance Summit Marketing on Facebook
Chango - DDM Alliance Summit Marketing on Facebook
DDM Alliance
 

En vedette (9)

State of digital ad fraud 2017 by augustine fou
State of digital ad fraud 2017 by augustine fouState of digital ad fraud 2017 by augustine fou
State of digital ad fraud 2017 by augustine fou
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Chango - DDM Alliance Summit Marketing on Facebook
Chango - DDM Alliance Summit Marketing on FacebookChango - DDM Alliance Summit Marketing on Facebook
Chango - DDM Alliance Summit Marketing on Facebook
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similaire à Yahoo Display Advertising Attribution

5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
Compuware APM
 
Managed services
Managed servicesManaged services
Managed services
rakeysh001
 
Warranty Outsourcing For Strategic Gains
Warranty Outsourcing For Strategic GainsWarranty Outsourcing For Strategic Gains
Warranty Outsourcing For Strategic Gains
ImranMasood
 

Similaire à Yahoo Display Advertising Attribution (20)

MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks - Performance Labs - Why Observability Matters!MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks - Performance Labs - Why Observability Matters!
 
Softtek Break Through Savings No Need Offshore 2011 Asug Final
Softtek Break Through Savings No Need Offshore 2011 Asug FinalSofttek Break Through Savings No Need Offshore 2011 Asug Final
Softtek Break Through Savings No Need Offshore 2011 Asug Final
 
Dreamforce'12 - Automate Business Processes with Force.com
Dreamforce'12 - Automate Business Processes with Force.comDreamforce'12 - Automate Business Processes with Force.com
Dreamforce'12 - Automate Business Processes with Force.com
 
Samanage Benchmarking: Better Service Performance Starts Here
Samanage Benchmarking: Better Service Performance Starts HereSamanage Benchmarking: Better Service Performance Starts Here
Samanage Benchmarking: Better Service Performance Starts Here
 
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
5 Best Practices for Successful Cloud Deployments – and the Pitfalls to Avoid
 
Managed services
Managed servicesManaged services
Managed services
 
Rundeck Overview
Rundeck OverviewRundeck Overview
Rundeck Overview
 
What does performance mean in the cloud
What does performance mean in the cloudWhat does performance mean in the cloud
What does performance mean in the cloud
 
Managed services
Managed servicesManaged services
Managed services
 
CCCC Neustar Lenny Rachitsky
CCCC Neustar Lenny RachitskyCCCC Neustar Lenny Rachitsky
CCCC Neustar Lenny Rachitsky
 
Prelim survey data 9 17-11
Prelim survey data 9 17-11Prelim survey data 9 17-11
Prelim survey data 9 17-11
 
Pinning Down Cloud Computing
Pinning Down Cloud ComputingPinning Down Cloud Computing
Pinning Down Cloud Computing
 
Warranty Outsourcing For Strategic Gains
Warranty Outsourcing For Strategic GainsWarranty Outsourcing For Strategic Gains
Warranty Outsourcing For Strategic Gains
 
JDX Suite - A Product by Ad2pro Group
JDX Suite - A Product by Ad2pro GroupJDX Suite - A Product by Ad2pro Group
JDX Suite - A Product by Ad2pro Group
 
Soa To The Rescue
Soa To The RescueSoa To The Rescue
Soa To The Rescue
 
IT Infrastructure Outsourcing Benefits Demystified
IT Infrastructure Outsourcing Benefits Demystified IT Infrastructure Outsourcing Benefits Demystified
IT Infrastructure Outsourcing Benefits Demystified
 
Daniel Jasník - ITSMF pro cloudové služby - AID2019
Daniel Jasník - ITSMF pro cloudové služby - AID2019Daniel Jasník - ITSMF pro cloudové služby - AID2019
Daniel Jasník - ITSMF pro cloudové služby - AID2019
 
IT Service Level Agreement
IT Service Level AgreementIT Service Level Agreement
IT Service Level Agreement
 
Brotight China - Professional Service
Brotight China - Professional ServiceBrotight China - Professional Service
Brotight China - Professional Service
 
Sciencelogic - A Leader in IT Transformation
Sciencelogic - A Leader in IT Transformation Sciencelogic - A Leader in IT Transformation
Sciencelogic - A Leader in IT Transformation
 

Plus de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...

Yahoo Display Advertising Attribution

  • 1. Yahoo! Display Ads Attribution Framework: A Problem Of Efficient Sparse Joins On Massive Data
       Supreeth, Sundeep, Chenjie, Chinmay
       Data Team, Yahoo!
  • 2. Agenda
       §  Problem description
          ›  Serves impressions clicks
          ›  Attribution
       §  Class of problems and application in other use cases
       §  Attribution framework
       §  Performance comparison
       §  Conclusion
  • 3. Serves Impressions Clicks
       [Figure: web servers and ad servers]
       §  Impressions: client side event for an ad shown
       §  Clicks: client side event for a click on an ad
       §  Interactions: client side events for interactions within an ad
       §  Impressions, clicks and conversions are a few bytes each
       §  Serves: server logged event for an ad served. A serve has the complete context
       §  Serve events are heavy, a few tens of KBs each
       §  Join: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
          with Serve Guid + Serve timestamp + {other fields of serve}
       * Guid is a globally unique identifier
  • 4. Need For Attribution
       [Figure: timeline of serve instances in 5-minute (5m) buckets, with older instances reaching back several hours to days; impressions/clicks arrive every 5 mins]
       §  Attribute an impression/click with the serve
  • 5. Distribution Of % Impressions Arrived From The Client Side wrt Serves
       [Figure: % of impressions for a serve (y-axis, 0 to 90) against time period from when the serves happened (x-axis, t1 to t13), where t1 -> 201205301000, t2 -> 201205300955, t3 -> 201205300950, ...]
  • 6. Distribution Of % Clicks Arrived From The Client Side wrt Serves
       [Figure: % of clicks for a serve (y-axis, 0 to 45) against time period from when the serves happened (x-axis, t1 to t13), where t1 -> 201205301000, t2 -> 201205300955, t3 -> 201205300950, ...]
  • 7. Class Of Problems
       §  Sparse joins spanning TBs of data on grid
       §  Few MBs to a few TBs
       §  Left outer join or any other outer join
       Data Set          | Data Size (compressed)
       Impressions       | 400 MB
       Serves (5m * 288) | 20 GB * 288 ~= 5.6 TB
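The size figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes that "5m * 288" means 288 five-minute serve instances (one day), that TB here is counted as 1024 GB, and that the 400 MB figure is one impression batch. The variable names are illustrative.

```python
# Sizes quoted on the slide (compressed).
SERVE_INSTANCE_GB = 20      # one 5-minute serve instance
IMPRESSION_BATCH_MB = 400   # assumed to be one impression batch
INSTANCE_MINUTES = 5

# Number of 5-minute instances in a 24-hour lookback.
instances_per_day = (24 * 60) // INSTANCE_MINUTES

# Total serve data the join has to look back over.
serves_total_gb = SERVE_INSTANCE_GB * instances_per_day
serves_total_tb = serves_total_gb / 1024

# How much larger the serve side is than the impression side.
size_ratio = (serves_total_gb * 1024) / IMPRESSION_BATCH_MB

print(instances_per_day, serves_total_gb, round(serves_total_tb, 1), round(size_ratio))
```

Under these assumptions the serve side is roughly four orders of magnitude larger than the impression side, which is what makes the join sparse.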
  • 8. Similar Use Cases
       §  Associating video, click, social interactions back to the activity data
       §  Attribute back a small size client beacon to a large dataset
       §  Within Yahoo
          ›  Audience view/click attribution
          ›  Weblog based investigation
          ›  Joining dimensional data with web traffic data
  • 9. Pig Joins And Problem Fit
       Join Strategy   | Comments                                                                      | Cost
       Merge join      | The datasets are not sorted                                                   | High
       Hash join       | Shuffle and reduce time                                                       | High
       Replicated Join | Does not meet performance needs; left outer join on the replicated dataset   | High
       Skewed Join     | Data set is not skewed                                                        | N/A
  • 10. Problem Statement
       To do a sparse outer join on a very large dataset with high performance requirements for display ad attribution on grid
  • 11. Attribution Framework - Overview
       §  Smart Instrumentation Strategies
       §  Aggressive partitioning and selection
       §  Partition Aware Efficient Join Query Plan
  • 12. Instrument For Attribution (framework stage: Smart Instrumentation Strategies)
       §  Serve guid
       §  Clues which can help you partition better
          ›  Timestamp of the serve
       §  Partition keys used in event instrumentation
       §  In the impression attribution example:
          ›  Impression: Serve Guid + Serve timestamp + {other fields of impressions/clicks/interactions}
          ›  Serves: Serve Guid + Serve timestamp + {other fields of serve}
  • 13. Partitioning approach (framework stage: Aggressive partitioning and selection)
       §  Join key based partitioning
       §  Keys for leveraging physical partitioning
          ›  timestamp
       §  Use of hashes in partitioning
          ›  HashFNV, Murmur
       Key       | Partition Type
       Join keys | Hash
       Timestamp | Range
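The two partition types on this slide can be sketched as follows. This is an illustration, not the deck's implementation: it assumes the hash is 32-bit FNV-1a (the slide only names "HashFNV, Murmur"), that the range partition is a 5-minute bucket of the serve timestamp in the `yyyymmddHHMM` form used on the chart slides, and that there are 16 hash partitions. All function names are hypothetical.

```python
from datetime import datetime, timedelta

FNV32_OFFSET = 0x811C9DC5   # standard 32-bit FNV offset basis
FNV32_PRIME = 0x01000193    # standard 32-bit FNV prime
TS_FORMAT = "%Y%m%d%H%M"    # e.g. 201205301000


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash (assumed variant; the deck does not say which)."""
    h = FNV32_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


def hash_partition(serve_guid: str, num_partitions: int) -> int:
    """Hash partition of the join key."""
    return fnv1a_32(serve_guid.encode("utf-8")) % num_partitions


def time_partition(serve_ts: str, minutes: int = 5) -> str:
    """Range partition: floor the serve timestamp to its 5-minute bucket."""
    t = datetime.strptime(serve_ts, TS_FORMAT)
    floored = t - timedelta(minutes=t.minute % minutes)
    return floored.strftime(TS_FORMAT)


def partition_of(serve_guid: str, serve_ts: str, num_partitions: int = 16):
    """(time bucket, hash partition) pair that holds a given serve."""
    return time_partition(serve_ts), hash_partition(serve_guid, num_partitions)


print(partition_of("example-guid", "201205300957"))
```

Because the impression carries the serve guid and serve timestamp, the same function applied to an impression names the one serve partition that can contain its match.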
  • 14. Pruning/Selection (framework stage: Aggressive partitioning and selection)
       §  Hashing of keys in the data sets
       §  Pruning of partitions
          ›  Timestamp
          ›  Hash of the join key
       §  IO costs and partitions
       §  Configurable partitions
       Key       | Partition Type | Pruning
       Join keys | Hash           | Yes
       Timestamp | Range          | Yes
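Pruning can be pictured as computing, from the small impression batch alone, the set of serve partitions worth reading. The sketch below is an assumption-laden illustration: it uses `zlib.crc32` as a stand-in for the deck's FNV/Murmur hash, 5-minute time buckets, 16 hash partitions and a 24-hour lookback. The function names and sample records are hypothetical.

```python
import zlib
from datetime import datetime, timedelta

TS_FORMAT = "%Y%m%d%H%M"


def bucket_of(serve_ts, minutes=5):
    """Floor a serve timestamp to its 5-minute range partition."""
    t = datetime.strptime(serve_ts, TS_FORMAT)
    return (t - timedelta(minutes=t.minute % minutes)).strftime(TS_FORMAT)


def hash_partition(serve_guid, num_partitions):
    """Stable hash partition of the join key (crc32 as a stand-in hash)."""
    return zlib.crc32(serve_guid.encode("utf-8")) % num_partitions


def all_partitions(latest_bucket, lookback_minutes, num_partitions, minutes=5):
    """Every (time bucket, hash partition) pair inside the lookback window."""
    end = datetime.strptime(latest_bucket, TS_FORMAT)
    partitions = set()
    for i in range(lookback_minutes // minutes):
        bucket = (end - timedelta(minutes=i * minutes)).strftime(TS_FORMAT)
        for p in range(num_partitions):
            partitions.add((bucket, p))
    return partitions


def select_partitions(impressions, num_partitions):
    """Only the partitions that can contain a serve for some impression."""
    return {
        (bucket_of(ts), hash_partition(guid, num_partitions))
        for guid, ts in impressions
    }


# Hypothetical impression batch: (serve guid, serve timestamp).
impressions = [
    ("guid-a", "201205300957"),
    ("guid-b", "201205300957"),
    ("guid-c", "201205300312"),
]
candidates = all_partitions("201205301000", 24 * 60, 16)
selected = select_partitions(impressions, 16) & candidates
print(len(selected), "of", len(candidates), "partitions need to be read")
```

More hash partitions make each selected partition smaller, at the price of more files, which is the IO versus namespace trade-off the framework exposes as a configurable parameter.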
  • 15. Partition Aware Efficient Join Query Plans (framework stage: Partition Aware Efficient Join Query Plan)
       §  Inner join
          ›  Impression event keys (size: MBs), in memory
          ›  Stream the selected serve event partitions (size: TBs)
          ›  Output: annotated impression (size: MBs)
       §  Left outer join
          ›  Stream the full impression event (size: hundreds of MBs)
          ›  Annotated impression (size: MBs), in memory
          ›  Output: complete annotated impression with serve data
       [Figure legend: in-memory data, stream]
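The two-step plan on this slide can be sketched in plain Python. This is a single-process illustration under stated assumptions, not the Pig plan itself: records are dictionaries, the join key is the pair (`serve_guid`, `serve_ts`), and the first matching serve per key is kept. The function and field names are hypothetical.

```python
def inner_join_keys_with_serves(impression_keys, serve_stream):
    """Step 1: hold the small key set in memory and stream the selected
    serve partitions past it, keeping one serve per matching key."""
    annotated = {}
    for serve in serve_stream:
        key = (serve["serve_guid"], serve["serve_ts"])
        if key in impression_keys and key not in annotated:
            annotated[key] = serve
    return annotated


def left_outer_join(impression_stream, annotated):
    """Step 2: stream the full impression events and attach the serve,
    emitting a record even when no serve was found."""
    for impression in impression_stream:
        key = (impression["serve_guid"], impression["serve_ts"])
        yield {**impression, "serve": annotated.get(key)}


# Hypothetical sample data.
impressions = [
    {"serve_guid": "g1", "serve_ts": "201205300955", "event": "impression"},
    {"serve_guid": "g9", "serve_ts": "201205300950", "event": "click"},
]
serves = [
    {"serve_guid": "g1", "serve_ts": "201205300955", "ad_id": "ad-42"},
    {"serve_guid": "g2", "serve_ts": "201205300955", "ad_id": "ad-7"},
]

keys = {(i["serve_guid"], i["serve_ts"]) for i in impressions}
annotated = inner_join_keys_with_serves(keys, iter(serves))
result = list(left_outer_join(iter(impressions), annotated))
for row in result:
    print(row)
```

Only the small side of each step is held in memory, so both steps can run map side without a shuffle.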
  • 16. Attribution Framework: Capabilities
       §  Left outer on impression/click/interaction
          ›  As long as the impression/click/interaction exists, we will get a record in output
       §  Complete annotation with the serve
       §  Distinct join with serves
       §  Sparse joins achieved by pruning the partitions
       §  Map side joins
  • 17. Attribution Framework: Implementation
       §  Python embedded PIG
       §  Dynamic partitioning/pruning (UDFs)
       §  Configurable parameters
          ›  Lookbacks
          ›  Partitions
          ›  CombinedSplitSize
  • 18. Attribution Framework: Tuning Parameters
       §  Serve Partitions: trade off between IO & namespace used (lookback = 24 hours)
       [Figure: bytes read (GB, left axis, 0 to 4000) and number of files, i.e. namespace used (right axis, 0 to 180000), against number of partitions (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024)]
  • 19. Attribution Framework: Tuning Parameters
       §  Split Size: trade off between number of mappers and map task run time (partitions = 16, lookback = 24 hours)
       [Figure: number of mappers (left axis, 0 to 35000) and time taken (s, right axis, 0 to 1200) against split size (128 MB, 1 GB, 2 GB, 3 GB, 4 GB)]
  • 20. Comparison With Other PIG Joins
       Join                  | Mappers | Reducers | Lookback | Input Size                             | Time to complete
       Left Outer Hash Join  | 2800    | 45       | 40 mins  | 180 GB                                 | 42.5m*
       Replicated Join       | 5680    | 0        | 5 hours  | 1 TB                                   | 7m**
       Attribution Framework | 5760    | 0        | 24 hours | Effective 5.6 TB; with pruning 1.1 TB  | 6m***
       *   Best case for hash join: 1.5m + 15.5m + 25.5m (Mapper + Shuffle + Reducer)
       **  Map time taken
       *** 1 min + 2 mins + 3 mins (Selection/Pruning + Impression partitioning + Join)
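The footnoted timings and the pruning figure on this slide can be cross-checked with a short calculation. The "lookback minutes covered per minute of run time" ratio below is an illustrative metric of my own, not one the deck reports; all inputs are the slide's numbers.

```python
# (lookback in minutes, time to complete in minutes), from the table.
runs = {
    "Left Outer Hash Join": (40, 1.5 + 15.5 + 25.5),   # mapper + shuffle + reducer
    "Replicated Join": (5 * 60, 7),                    # map time only
    "Attribution Framework": (24 * 60, 1 + 2 + 3),     # pruning + partitioning + join
}

# Lookback minutes covered per minute of run time (illustrative metric).
coverage = {name: lookback / minutes for name, (lookback, minutes) in runs.items()}

# Share of the effective 5.6 TB that is actually read after pruning.
pruned_fraction = 1.1 / 5.6

for name, value in coverage.items():
    print(f"{name}: {value:.1f} lookback minutes per run minute")
print(f"Data read after pruning: {pruned_fraction:.0%} of the effective input")
```

The component times add up to the totals in the table, and pruning leaves roughly a fifth of the effective input to be read.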
  • 21. Conclusion
       §  For the sparse lookup problem, the attribution framework works very well and within the performance needs
       §  Effective partitioning aids longer lookbacks and reduced IO
       §  The levers in the framework allow for tuning based on the computation/IO requirements
  • 22. Future Steps
       §  Use HBase/Cassandra to store the event grain serve data and do lookups
       §  Use of Bloom filter along with an index format
       §  Compare the strategy with what Hive does and come up with a framework using Hive
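The Bloom filter idea listed as a future step could look roughly like the sketch below: build a filter over the impression serve guids, then use it to skip serve records that cannot match before doing an exact lookup. The deck gives no design for this, so everything here is an assumption, including the class name, the bit-array size, the number of hashes and the use of SHA-256 with double hashing.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical usage: filter a serve stream by the impression guids.
impression_guids = ["g1", "g7", "g9"]
bloom = BloomFilter()
for guid in impression_guids:
    bloom.add(guid)

serve_guids = ["g1", "g2", "g3", "g7"]
possible_matches = [g for g in serve_guids if bloom.might_contain(g)]
print(possible_matches)
```

Records that pass the filter still need an exact key comparison, since a Bloom filter can report false positives.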