Statistical Models for Massive Web Data
Deepak Agarwal, LinkedIn, USA
Director, Applied Relevance Science (ARS)


CATS Big Data Panel, October 11, 2012
Hosted by National Academy of Sciences
Washington D.C., USA
Disclaimer

 The opinions expressed here are mine and in
  no way represent the official position of LinkedIn

 The case studies presented today are work done
  while I was at Yahoo!




                                     NRC BIG DATA PANEL, AGARWAL, 2012
Big Data Applications in Business

 Big Data: competitive advantage, innovation, reduced
  uncertainty in decision making

 High Frequency Data
   – Large number of heterogeneous transactions per unit time
        Web visits, financial trading, credit card transactions, telephone
         calls, packet flows in an IP network,…


 I will focus on statistical modeling for one such data
  source
   – User visits to websites




Example 1: Yahoo! front page

 Today module
   – Recommends content links in slots F1, F2, F3, F4 (out of 30-40,
     editorially programmed)
   – 4 slots exposed, F1 has maximum exposure
   – Routes traffic to other Y! properties


LinkedIn News




LinkedIn Ads




Data Generation

 [Figure: data-flow diagram. Labels: http request, user information, NEWS, ADS, Ranking Service, Model Updates]
DATA

 CONTEXT
   – User i visits, with (user, context) covariates $x_{it}$
     (profile information, device id, first degree connections,
     browse information, …)
   – Select item j with item covariates $x_j$
     (keywords, content categories, ...)
   – (i, j): response $y_{ij}$ (click/no-click)

Statistical Problem

 Rank items (from an admissible pool) for user visits in
  some context to maximize a utility of interest
 Examples of utility functions
   – Click-rates (CTR)
   – Share-rates (CTR * Pr[Share | Click])
   – Revenue per page-view = CTR*bid (more complex due to
     second price auction)

 CTR is a fundamental measure that opens the door to a
  more principled approach to rank items
 Converge rapidly to maximum utility items
   – Sequential decision making process
   – Models: help cope with data sparseness (curse of dimensionality)
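
The utility functions listed above can be sketched as interchangeable scoring rules over the same pool. This is a minimal illustration, not the production ranker: the `Item` class, its field names, and all numbers are hypothetical, and the revenue rule ignores the second-price auction effects the slide mentions.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    ctr: float                      # estimated click probability
    share_given_click: float = 0.0  # estimated Pr[Share | Click]
    bid: float = 0.0                # advertiser bid per click


def click_utility(item):
    return item.ctr


def share_utility(item):
    # Share rate = CTR * Pr[Share | Click]
    return item.ctr * item.share_given_click


def revenue_utility(item):
    # Revenue per page-view = CTR * bid (second-price effects ignored here)
    return item.ctr * item.bid


def rank(items, utility):
    """Return item names sorted by decreasing utility."""
    return [it.name for it in sorted(items, key=utility, reverse=True)]


# Hypothetical pool of three items.
pool = [
    Item("a", ctr=0.10, share_given_click=0.01, bid=0.20),
    Item("b", ctr=0.05, share_given_click=0.10, bid=1.00),
    Item("c", ctr=0.08, share_given_click=0.05, bid=0.10),
]

by_click = rank(pool, click_utility)
by_share = rank(pool, share_utility)
by_revenue = rank(pool, revenue_utility)
```

The same CTR estimate feeds all three rules, which is why the slide treats CTR as the fundamental quantity to model.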


Illustrate with Y! front Page Application

 Simplify: Maximize CTR on first slot (F1)

 Article Pool
   – Editorially selected for high quality and brand image
   – Few articles in the pool at any time, but the pool is dynamic




 We want to provide personalized recommendations
   – Users with many prior visits see recommendations “tailored” to
     their taste, others see the best for the “group” they belong to



Types of user covariates
 Demographics, geo:
   – Not useful in front-page application

 Browse behavior: activity on Y! network ($x_{it}$)
   – Previous visits to property, search, ad views, clicks,..
   – This is useful for the front-page application

 Latent user factors based on previous clicks on the
  module ($u_{it}$)
   – Useful for active module users, obtained via factor models



Approach: Online + Offline

 Offline computation
  – Intensive computations done infrequently (once
    a day/week) to update parameters that are less
    time-sensitive

 Online computation
  – Lightweight computations done frequently (once every 5-10
    minutes) to update parameters that are time-sensitive
  – Adaptive experiments (explore-exploit) are also done
    online


Online computation: per-item online logistic regression

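 For item j, the state-space model is

\[
\begin{aligned}
y_{ijt} &\sim \mathrm{Bernoulli}(p_{ijt}) \\
\mathrm{logit}(p_{ijt}) &= u_i' v_{jt} + x_{it}' \beta_{jt} \\
(v_{j,t+1}, \beta_{j,t+1}) &= (v_{j,t}, \beta_{j,t}) + \delta_{j,t+1}, \quad \delta_{j,t+1} \sim (0, \tau^2) \\
(v_{j0}, \beta_{j0}) &= (D x_j, 0) + \epsilon_{j0}, \quad \epsilon_{j0} \sim (0, \sigma^2)
\end{aligned}
\]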




 Item coefficients are updated online via a Kalman filter (discounting
  approach of West and Harrison)
    – Item covariates are used to initialize coefficients at epoch zero
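
A per-item online update of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the deployed filter. It stacks the item coefficients into one vector (playing the role of (v_jt, β_jt)) and inflates the previous posterior covariance by a discount factor before each batch. It then uses a Newton/Laplace approximation for the logistic likelihood in place of an exact Kalman recursion. The class name, the `discount` value, and the simulated data are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class OnlineItemModel:
    """Gaussian-approximate posterior over one item's coefficient vector."""

    def __init__(self, prior_mean, prior_var=1.0, discount=0.95):
        self.mean = np.asarray(prior_mean, dtype=float)
        self.cov = prior_var * np.eye(self.mean.size)
        self.discount = discount  # in (0, 1]; smaller means faster forgetting

    def update(self, Z, y, n_newton=25):
        """Z: (n, d) rows of stacked user features; y: (n,) clicks in {0, 1}."""
        Z = np.asarray(Z, dtype=float)
        y = np.asarray(y, dtype=float)
        # Discounting step: inflate the previous posterior covariance.
        prior_prec = np.linalg.inv(self.cov / self.discount)
        prior_mean = self.mean
        m = prior_mean.copy()
        for _ in range(n_newton):
            p = sigmoid(Z @ m)
            grad = Z.T @ (y - p) - prior_prec @ (m - prior_mean)
            hess = Z.T @ (Z * (p * (1 - p))[:, None]) + prior_prec
            step = np.linalg.solve(hess, grad)
            m = m + step
            if np.max(np.abs(step)) < 1e-8:
                break
        # Laplace approximation: covariance is the inverse Hessian at the mode.
        p = sigmoid(Z @ m)
        hess = Z.T @ (Z * (p * (1 - p))[:, None]) + prior_prec
        self.mean = m
        self.cov = np.linalg.inv(hess)

    def predict(self, z):
        return sigmoid(np.asarray(z, dtype=float) @ self.mean)


# Simulated stream of 10 batches for one item (hypothetical coefficients).
rng = np.random.default_rng(0)
true_theta = np.array([1.5, -1.0])
model = OnlineItemModel(prior_mean=np.zeros(2))
for _ in range(10):
    Z = rng.normal(size=(500, 2))
    y = rng.binomial(1, sigmoid(Z @ true_theta))
    model.update(Z, y)
```

In the setting of the slide, `prior_mean` at epoch zero would come from the item covariates rather than zeros.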


Closer Look at online model

 Different components of
\[ \mathrm{logit}(p_{ijt}) = u_i' v_{jt} + x_{it}' \beta_{jt} \]

 $u_i$ ($r \times 1$): User latent factors, useful for heavy users

 $x_{it}' \beta_{jt}$: Residual item affinity to user covariates (old items)




Online Adaptive Experimentation
(Explore/Exploit)
 Three schemes (all work reasonably well for the
  front page application)
   – epsilon-greedy: Show the article with maximum posterior
     mean, except that with a small probability epsilon an
     article is chosen at random.


   – Upper confidence bound (UCB): Show the article with
     maximum score, where score = posterior mean + k * posterior std

   – Thompson sampling: Draw a sample (δ,β) from posterior
     to compute article CTR and show article with maximum
     CTR
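
The three schemes can be sketched side by side. As a simplifying assumption, each article's CTR posterior is summarized here by an independent Gaussian with a given mean and standard deviation, rather than being derived from a posterior draw of the model coefficients as on the slide. Function names and numbers are hypothetical.

```python
import numpy as np


def epsilon_greedy(post_mean, rng, epsilon=0.05):
    """With probability epsilon pick a random article, else the best mean."""
    if rng.random() < epsilon:
        return int(rng.integers(len(post_mean)))
    return int(np.argmax(post_mean))


def ucb(post_mean, post_std, k=2.0):
    """Pick the article maximizing posterior mean + k * posterior std."""
    return int(np.argmax(np.asarray(post_mean) + k * np.asarray(post_std)))


def thompson(post_mean, post_std, rng):
    """Draw one CTR sample per article and pick the largest draw."""
    draws = rng.normal(post_mean, post_std)
    return int(np.argmax(draws))


# Hypothetical posterior summaries for three articles.
post_mean = np.array([0.02, 0.05, 0.03])
post_std = np.array([0.001, 0.001, 0.03])
rng = np.random.default_rng(0)

choice_greedy = epsilon_greedy(post_mean, rng, epsilon=0.0)
choice_ucb = ucb(post_mean, post_std, k=2.0)
choice_ts = thompson(post_mean, np.zeros(3), rng)
```

With `epsilon=0.0`, or with zero posterior spread, the randomized schemes reduce to picking the article with the highest mean. UCB instead favors the third article, whose CTR is still uncertain.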


Offline computation

 Computing user latent factors and item
  coefficient prior

  – These are computed offline once a day using
    retrospective (user, item) interaction data for the last
    X days (X = 30 in our case)
  – Computations are done on Hadoop




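Offline: Regression based Latent Factor Model

\[
\begin{aligned}
y_{ij} &\sim \mathrm{Bernoulli}(p_{ij}) && \text{(\# obs. per user has wide variation)} \\
\mathrm{logit}(p_{ij}) &= \sum_k u_{ik} v_{jk} = u_i' v_j && \text{(need shrinkage on factors)} \\
u_i &= G x_i + \epsilon_i^u, \quad \epsilon_i^u \sim N(0, \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_r^2)) \\
v_j &= D x_j + \epsilon_j^v, \quad \epsilon_j^v \sim N(0, I) \\
v_{jk} &\ge 0
\end{aligned}
\]

 $G$, $D$: regression weight matrices
 $\epsilon_i^u$, $\epsilon_j^v$: user/item-specific correction terms (learnt from data)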
Role of shrinkage (consider Gaussian for
simplicity)
 For new user/article, factor estimates based on
  covariates
\[ u_{new} = G x_{new}^{user}, \qquad v_{new} = D x_{new}^{item} \]
 For old user, factor estimates
\[ E(u_i \mid \text{Rest}) = \Big(\lambda I + \sum_{j \in N_i} v_j v_j'\Big)^{-1} \Big(\lambda G x_i + \sum_{j \in N_i} y_{ij} v_j\Big) \]

 Linear combination of prior regression function
  and user feedback on items
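
The Gaussian posterior mean above can be computed directly with a linear solve. This is a minimal sketch of that one formula. The function name, argument names, and example numbers are hypothetical, and `V_seen` is assumed to stack the factors $v_j$ of the items in $N_i$ as rows.

```python
import numpy as np


def posterior_user_factor(G, x_i, V_seen, y_seen, lam):
    """E(u_i | Rest) = (lam*I + sum v_j v_j')^{-1} (lam*G x_i + sum y_ij v_j)."""
    prior_mean = G @ x_i
    r = prior_mean.shape[0]
    A = lam * np.eye(r) + V_seen.T @ V_seen
    b = lam * prior_mean + V_seen.T @ y_seen
    return np.linalg.solve(A, b)


# Hypothetical 2-factor example.
G = np.array([[0.5, 0.0],
              [0.0, 0.5]])
x_i = np.array([1.0, 2.0])

# A user with no feedback falls back to the regression prior G x_i.
u_cold = posterior_user_factor(G, x_i, np.zeros((0, 2)), np.zeros(0), lam=1.0)

# A user with feedback on three items is pulled toward that feedback.
V_seen = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
y_seen = np.array([1.0, 0.0, 1.0])
u_warm = posterior_user_factor(G, x_i, V_seen, y_seen, lam=1.0)
```

The weight `lam` controls how strongly the estimate is shrunk toward the regression prior. As the number of observed items grows, the feedback terms dominate.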

Estimating the Regression function via EM

 Maximize
\[ \int \prod_{ij} f(u_i, v_j, \text{Data}) \prod_i g(u_i, G) \prod_j g(v_j, D) \prod_i du_i \prod_j dv_j \]




The integral cannot be computed in closed form; it is
approximated by Monte Carlo using Gibbs sampling

For logistic, we use adaptive rejection sampling (ARS; Gilks and Wild) to sample the
latent factors within the Gibbs sampler
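
The slides do not spell out the M-step, so the following is an assumption-laden sketch rather than the author's procedure. Under a Gaussian prior $u_i = G x_i + \epsilon_i^u$ with diagonal covariance, maximizing the expected log prior over $G$ reduces to a least-squares regression of the posterior means of $u_i$ on $x_i$. Here those posterior means are assumed to be already available (for example as averages of Gibbs draws). The function name and the data are hypothetical.

```python
import numpy as np


def m_step_G(U_mean, X):
    """Least-squares update of G.

    U_mean: (n_users, r) posterior means of user factors from the E-step.
    X:      (n_users, p) user covariates.
    Returns G with shape (r, p) minimizing sum_i ||U_mean[i] - G @ X[i]||^2.
    """
    B, *_ = np.linalg.lstsq(X, U_mean, rcond=None)
    return B.T


# Noise-free check: recover a known (hypothetical) G exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
G_true = np.array([[1.0, -2.0, 0.5],
                   [0.0, 3.0, -1.0]])
U_mean = X @ G_true.T
G_hat = m_step_G(U_mean, X)
```

The analogous update for $D$ would regress item-factor posterior means on item covariates. The non-negativity constraint on item factors is not handled in this sketch.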



Scaling to large data on Hadoop
 Randomly partition the data by user in the Map phase
 Run a separate model on each partition
   – Care taken to initialize each partition model with
     same values, constraints on factors ensure
     identifiability within each partition

 Create ensembles by using different user partitions,
  average across ensembles to obtain estimates of
  user factors and regression functions

   – Estimates of user factors across ensembles are uncorrelated, so
     averaging reduces variance
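
The partition-and-average structure can be sketched on a single machine. This is a skeleton under stated assumptions, not the Hadoop job. `fit_partition` is a hypothetical callback standing in for the per-partition model fit, and the toy example averages a shared parameter rather than per-user factors.

```python
import numpy as np


def partition_users(user_ids, n_partitions, rng):
    """Randomly assign each user to one partition (the role of the Map step)."""
    assignment = rng.integers(0, n_partitions, size=len(user_ids))
    return [user_ids[assignment == p] for p in range(n_partitions)]


def ensemble_estimate(user_ids, fit_partition, n_partitions, n_ensembles, seed=0):
    """Average per-partition estimates within a run, then across runs."""
    user_ids = np.asarray(user_ids)
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_ensembles):
        parts = partition_users(user_ids, n_partitions, rng)
        estimates = [fit_partition(p) for p in parts if len(p) > 0]
        runs.append(np.mean(estimates, axis=0))
    return np.mean(runs, axis=0)


# Toy stand-in for a model fit: the mean of a hypothetical per-user statistic.
user_ids = np.arange(1000)
signal = np.random.default_rng(1).normal(loc=2.0, size=1000)


def toy_fit(ids):
    return np.array([signal[ids].mean()])


estimate = ensemble_estimate(user_ids, toy_fit, n_partitions=10, n_ensembles=5)
```

Each ensemble run uses a different random partition, so the per-run estimates differ and their average is more stable than any single run.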


Data Example
 1B events, 8M users, 6K articles
 Offline training produced user factor $u_i$
 Baseline: Online logistic without $u_i$
   – Covariate-only online logistic model
\[ \mathrm{logit}(p_{ijt}) = x_{it}' \beta_{jt} \]

 Overall click lift: 9.7%
 Heavy users (> 10 clicks last month): 26%
 Cold users (not seen in the past): 3%


Click-lift for heavy users



 [Figure: CTR lift relative to the covariate-only logistic model]




Computational Advertising: Matching ads to opportunities




 [Figure: diagram with labels User, Opportunity, Page, Publisher, Ads, Pick best ads, Ad Network, Advertisers. Ad network examples: Yahoo, Google, MSN; ad exchanges (network of “networks”), …]



Ad exchange (RightMedia) [Agarwal et al. KDD 10]

   Advertisers participate in different ways
    – CPM (pay per ad-view)
    – CPC (pay per click)
    – CPA (pay per conversion)



   To conduct an auction, normalize across pricing types
    – Compute eCPM (expected CPM)
        Click-based: eCPM = click-rate * CPC
        Conversion-based: eCPM = conv-rate * CPA

 Similar strategy of computing offline and online components
    – Process 90B records for each model fit
    – Model has hundreds of millions of parameters
    – Model fully deployed on RightMedia today
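
The eCPM normalization described above can be sketched as a single scoring function. This follows the slide's formulas literally and assumes all prices are in comparable per-view units. CPM is conventionally quoted per thousand views, so a real system would apply that scaling. The function name, argument names, and the example ads are hypothetical.

```python
def ecpm(pricing_type, price, click_rate=None, conv_rate=None):
    """Expected payment per ad view, used to compare ads across pricing types."""
    if pricing_type == "CPM":
        return price               # advertiser pays per view
    if pricing_type == "CPC":
        return click_rate * price  # eCPM = click-rate * CPC
    if pricing_type == "CPA":
        return conv_rate * price   # eCPM = conv-rate * CPA
    raise ValueError(f"unknown pricing type: {pricing_type}")


# Hypothetical candidate ads for one opportunity.
candidates = [
    ("cpm_ad", ecpm("CPM", 0.002)),
    ("cpc_ad", ecpm("CPC", 0.50, click_rate=0.01)),
    ("cpa_ad", ecpm("CPA", 20.0, conv_rate=0.0002)),
]
ranking = [name for name, score in sorted(candidates, key=lambda c: c[1], reverse=True)]
```

The click and conversion rates are the quantities the statistical models estimate. The prices come from the advertisers.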


Summary

 Estimating interactions in high-dimensional
  sparse data is important in web applications



 Scaling such models to Big Data is a
  challenging statistical problem

 Combining offline + online modeling with
  explore/exploit is a good practical strategy


Some Challenges

 Very high-dimensional modeling with very large and noisy data
    – A few categorical variables, each with a large number of levels, interact
      with each other to produce the response
    – Scalability


 Designing sequential experiments
    – Multi-armed bandits are back in a big way


 Data fusion
    – From multiple and disparate sources


 Availability of data, and the ability to run experiments, for researchers





Contenu connexe

Tendances

Collaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsCollaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsNavisro Analytics
 
How to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkHow to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkCaserta
 
Recommendation engines
Recommendation enginesRecommendation engines
Recommendation enginesGeorgian Micsa
 
How Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
How Artificial Intelligence & Machine Learning Are Transforming Modern MarketingHow Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
How Artificial Intelligence & Machine Learning Are Transforming Modern MarketingCleverTap
 
The Universal Recommender
The Universal RecommenderThe Universal Recommender
The Universal RecommenderPat Ferrel
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Xavier Amatriain
 
Distributed Representation-based Recommender Systems in E-commerce
Distributed Representation-based Recommender Systems in E-commerceDistributed Representation-based Recommender Systems in E-commerce
Distributed Representation-based Recommender Systems in E-commerceRakuten Group, Inc.
 
Artificial Intelligence at LinkedIn
Artificial Intelligence at LinkedInArtificial Intelligence at LinkedIn
Artificial Intelligence at LinkedInBill Liu
 
Azure Machine Learning Dotnet Campus 2015
Azure Machine Learning Dotnet Campus 2015 Azure Machine Learning Dotnet Campus 2015
Azure Machine Learning Dotnet Campus 2015 antimo musone
 
Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsNYC Predictive Analytics
 
Collaborative Filtering
Collaborative FilteringCollaborative Filtering
Collaborative FilteringTayfun Sen
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Lucidworks
 
Survey of Recommendation Systems
Survey of Recommendation SystemsSurvey of Recommendation Systems
Survey of Recommendation Systemsyoualab
 
Recsys 2014 Tutorial - The Recommender Problem Revisited
Recsys 2014 Tutorial - The Recommender Problem RevisitedRecsys 2014 Tutorial - The Recommender Problem Revisited
Recsys 2014 Tutorial - The Recommender Problem RevisitedXavier Amatriain
 
Item Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation AlgorithmsItem Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation Algorithmsnextlib
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
Scalable advertising recommender systems
Scalable advertising recommender systemsScalable advertising recommender systems
Scalable advertising recommender systemsJoaquin Delgado PhD.
 

Tendances (20)

Collaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsCollaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro Analytics
 
How to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on SparkHow to Build a Recommendation Engine on Spark
How to Build a Recommendation Engine on Spark
 
Recommendation engines
Recommendation enginesRecommendation engines
Recommendation engines
 
How Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
How Artificial Intelligence & Machine Learning Are Transforming Modern MarketingHow Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
How Artificial Intelligence & Machine Learning Are Transforming Modern Marketing
 
The Universal Recommender
The Universal RecommenderThe Universal Recommender
The Universal Recommender
 
How Artificial Intelligence & Machine Learning Are Transforming Modern Market...
How Artificial Intelligence & Machine Learning Are Transforming Modern Market...How Artificial Intelligence & Machine Learning Are Transforming Modern Market...
How Artificial Intelligence & Machine Learning Are Transforming Modern Market...
 
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
 
Distributed Representation-based Recommender Systems in E-commerce
Distributed Representation-based Recommender Systems in E-commerceDistributed Representation-based Recommender Systems in E-commerce
Distributed Representation-based Recommender Systems in E-commerce
 
Artificial Intelligence at LinkedIn
Artificial Intelligence at LinkedInArtificial Intelligence at LinkedIn
Artificial Intelligence at LinkedIn
 
Azure Machine Learning Dotnet Campus 2015
Azure Machine Learning Dotnet Campus 2015 Azure Machine Learning Dotnet Campus 2015
Azure Machine Learning Dotnet Campus 2015
 
Graph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media AnalyticsGraph Based Machine Learning with Applications to Media Analytics
Graph Based Machine Learning with Applications to Media Analytics
 
Collaborative Filtering
Collaborative FilteringCollaborative Filtering
Collaborative Filtering
 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
 
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
Search Accuracy Metrics and Predictive Analytics - A Big Data Use Case: Prese...
 
Survey of Recommendation Systems
Survey of Recommendation SystemsSurvey of Recommendation Systems
Survey of Recommendation Systems
 
Recsys 2014 Tutorial - The Recommender Problem Revisited
Recsys 2014 Tutorial - The Recommender Problem RevisitedRecsys 2014 Tutorial - The Recommender Problem Revisited
Recsys 2014 Tutorial - The Recommender Problem Revisited
 
Item Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation AlgorithmsItem Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation Algorithms
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
Scalable advertising recommender systems
Scalable advertising recommender systemsScalable advertising recommender systems
Scalable advertising recommender systems
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 

En vedette

Statistical Modeling - Cereal Data Project
Statistical Modeling - Cereal Data Project  Statistical Modeling - Cereal Data Project
Statistical Modeling - Cereal Data Project Hubert Lo
 
The active inclusion of young people
The active inclusion of young peopleThe active inclusion of young people
The active inclusion of young peoplepesec
 
STEP Inspire: Non-Profit Assignment
STEP Inspire: Non-Profit AssignmentSTEP Inspire: Non-Profit Assignment
STEP Inspire: Non-Profit Assignmentmdc5070
 
חוברת קורס
חוברת קורסחוברת קורס
חוברת קורסStas Segin
 
הראיות לפסיכולוגיה מונחית ראיות
הראיות לפסיכולוגיה מונחית ראיותהראיות לפסיכולוגיה מונחית ראיות
הראיות לפסיכולוגיה מונחית ראיותTsviGil
 
Instafxng weekly analysis 13th - 17th August
Instafxng weekly analysis 13th - 17th AugustInstafxng weekly analysis 13th - 17th August
Instafxng weekly analysis 13th - 17th AugustInstaforex Nigeria
 
Mike Todd Noris - Resume - 110116
Mike Todd Noris - Resume - 110116Mike Todd Noris - Resume - 110116
Mike Todd Noris - Resume - 110116mtnorris814
 
中央财政入户问卷最终版
中央财政入户问卷最终版中央财政入户问卷最终版
中央财政入户问卷最终版yongnianlou
 
Science m6
Science m6Science m6
Science m6Biobiome
 
Starter courses knife skills aug2012
Starter courses  knife skills aug2012Starter courses  knife skills aug2012
Starter courses knife skills aug2012Rachael Mann
 

En vedette (17)

Statistical Modeling - Cereal Data Project
Statistical Modeling - Cereal Data Project  Statistical Modeling - Cereal Data Project
Statistical Modeling - Cereal Data Project
 
The active inclusion of young people
The active inclusion of young peopleThe active inclusion of young people
The active inclusion of young people
 
STEP Inspire: Non-Profit Assignment
STEP Inspire: Non-Profit AssignmentSTEP Inspire: Non-Profit Assignment
STEP Inspire: Non-Profit Assignment
 
72 pat2
72 pat272 pat2
72 pat2
 
חוברת קורס
חוברת קורסחוברת קורס
חוברת קורס
 
הראיות לפסיכולוגיה מונחית ראיות
הראיות לפסיכולוגיה מונחית ראיותהראיות לפסיכולוגיה מונחית ראיות
הראיות לפסיכולוגיה מונחית ראיות
 
space
space space
space
 
Instafxng weekly analysis 13th - 17th August
Instafxng weekly analysis 13th - 17th AugustInstafxng weekly analysis 13th - 17th August
Instafxng weekly analysis 13th - 17th August
 
Mike Todd Noris - Resume - 110116
Mike Todd Noris - Resume - 110116Mike Todd Noris - Resume - 110116
Mike Todd Noris - Resume - 110116
 
中央财政入户问卷最终版
中央财政入户问卷最终版中央财政入户问卷最终版
中央财政入户问卷最终版
 
Science m6
Science m6Science m6
Science m6
 
Latiff
LatiffLatiff
Latiff
 
Quijote
QuijoteQuijote
Quijote
 
Human capital 2013
Human capital 2013Human capital 2013
Human capital 2013
 
Starter courses knife skills aug2012
Starter courses  knife skills aug2012Starter courses  knife skills aug2012
Starter courses knife skills aug2012
 
Proyecto Labmovel
Proyecto LabmovelProyecto Labmovel
Proyecto Labmovel
 
Dobronovskyi
DobronovskyiDobronovskyi
Dobronovskyi
 

Similaire à Statistical Models for Massive Web Data

Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsIRJET Journal
 
IRJET- Real Time Implementation of Bi-Histogram Equalization Method on Androi...
IRJET- Real Time Implementation of Bi-Histogram Equalization Method on Androi...IRJET- Real Time Implementation of Bi-Histogram Equalization Method on Androi...
IRJET- Real Time Implementation of Bi-Histogram Equalization Method on Androi...IRJET Journal
 
IRJET- Design and Implementation of ATM Security System using Vibration Senso...
IRJET- Design and Implementation of ATM Security System using Vibration Senso...IRJET- Design and Implementation of ATM Security System using Vibration Senso...
IRJET- Design and Implementation of ATM Security System using Vibration Senso...IRJET Journal
 
Webpage Personalization and User Profiling
Webpage Personalization and User ProfilingWebpage Personalization and User Profiling
Webpage Personalization and User Profilingyingfeng
 
Recommender Systems Tutorial (Part 3) -- Online Components
Recommender Systems Tutorial (Part 3) -- Online ComponentsRecommender Systems Tutorial (Part 3) -- Online Components
Recommender Systems Tutorial (Part 3) -- Online ComponentsBee-Chung Chen
 
GAN in medical imaging
GAN in medical imagingGAN in medical imaging
GAN in medical imagingCheng-Bin Jin
 
Recommender Systems from A to Z – The Right Dataset
Recommender Systems from A to Z – The Right DatasetRecommender Systems from A to Z – The Right Dataset
Recommender Systems from A to Z – The Right DatasetCrossing Minds
 
Flow Trajectory Approach for Human Action Recognition
Flow Trajectory Approach for Human Action RecognitionFlow Trajectory Approach for Human Action Recognition
Flow Trajectory Approach for Human Action RecognitionIRJET Journal
 
Knowledge Graph Embeddings for Recommender Systems
Knowledge Graph Embeddings for Recommender SystemsKnowledge Graph Embeddings for Recommender Systems
Knowledge Graph Embeddings for Recommender SystemsEnrico Palumbo
 
Unpaired Image Translations Using GANs: A Review
Unpaired Image Translations Using GANs: A ReviewUnpaired Image Translations Using GANs: A Review
Unpaired Image Translations Using GANs: A ReviewIRJET Journal
 
The Role of Selfies in Creating the Next Generation Computer Vision Infused O...
The Role of Selfies in Creating the Next Generation Computer Vision Infused O...The Role of Selfies in Creating the Next Generation Computer Vision Infused O...
The Role of Selfies in Creating the Next Generation Computer Vision Infused O...hanumayamma
 
IRJET- Generating 3D Models Using 3D Generative Adversarial Network
IRJET- Generating 3D Models Using 3D Generative Adversarial NetworkIRJET- Generating 3D Models Using 3D Generative Adversarial Network
IRJET- Generating 3D Models Using 3D Generative Adversarial NetworkIRJET Journal
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix DatasetBen Mabey
 
PredictionIO - Building Applications That Predict User Behavior Through Big D...
PredictionIO - Building Applications That Predict User Behavior Through Big D...PredictionIO - Building Applications That Predict User Behavior Through Big D...
PredictionIO - Building Applications That Predict User Behavior Through Big D...predictionio
 
REAL-TIME OBJECT DETECTION USING OPEN COMPUTER VISION
REAL-TIME OBJECT DETECTION USING OPEN COMPUTER VISIONREAL-TIME OBJECT DETECTION USING OPEN COMPUTER VISION
REAL-TIME OBJECT DETECTION USING OPEN COMPUTER VISIONIRJET Journal
 
Creating Objects for Metaverse using GANs and Autoencoders
Creating Objects for Metaverse using GANs and AutoencodersCreating Objects for Metaverse using GANs and Autoencoders
Creating Objects for Metaverse using GANs and AutoencodersIRJET Journal
 
Generation of Deepfake images using GAN and Least squares GAN.ppt
Generation of Deepfake images using GAN and Least squares GAN.pptGeneration of Deepfake images using GAN and Least squares GAN.ppt
Generation of Deepfake images using GAN and Least squares GAN.pptDivyaGugulothu
 
Feature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditFeature Importance Analysis with XGBoost in Tax audit
Feature Importance Analysis with XGBoost in Tax auditMichael BENESTY
 
Software tookits for machine learning and graphical models
Software tookits for machine learning and graphical modelsSoftware tookits for machine learning and graphical models
Software tookits for machine learning and graphical modelsbutest
 
Big data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesBig data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesArvind Rapaka
 

Similaire à Statistical Models for Massive Web Data (20)

Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather Conditions
 
IRJET- Real Time Implementation of Bi-Histogram Equalization Method on Androi...
IRJET- Real Time Implementation of Bi-Histogram Equalization Method on Androi...IRJET- Real Time Implementation of Bi-Histogram Equalization Method on Androi...
IRJET- Real Time Implementation of Bi-Histogram Equalization Method on Androi...
 
IRJET- Design and Implementation of ATM Security System using Vibration Senso...
IRJET- Design and Implementation of ATM Security System using Vibration Senso...IRJET- Design and Implementation of ATM Security System using Vibration Senso...
IRJET- Design and Implementation of ATM Security System using Vibration Senso...
 
Webpage Personalization and User Profiling
Webpage Personalization and User ProfilingWebpage Personalization and User Profiling
Webpage Personalization and User Profiling
 
Recommender Systems Tutorial (Part 3) -- Online Components
Recommender Systems Tutorial (Part 3) -- Online ComponentsRecommender Systems Tutorial (Part 3) -- Online Components
Recommender Systems Tutorial (Part 3) -- Online Components
 
GAN in medical imaging
GAN in medical imagingGAN in medical imaging
GAN in medical imaging
 
Recommender Systems from A to Z – The Right Dataset
Statistical Models for Massive Web Data

  • 1. Statistical Models for Massive Web Data. Deepak Agarwal, Director, Applied Relevance Science (ARS), LinkedIn, USA. CATS Big Data Panel, October 11, 2012, hosted by the National Academy of Sciences, Washington D.C., USA
  • 2. Disclaimer
      – The opinions expressed here are mine and in no way represent the official position of LinkedIn
      – The case studies presented today are work done while I was at Yahoo!
  • 3. Big Data Applications in Business
      – Big Data: competitive advantage, innovation, reduced uncertainty in decision making
      – High-frequency data
          – Large number of heterogeneous transactions per unit time
          – Web visits, financial trading, credit card transactions, telephone calls, packet flows in an IP network, …
      – I will focus on statistical modeling for one such data source
          – User visits to websites
  • 4. Example 1: Yahoo! front page, Today module
      – Recommend content links (out of 30-40, editorially programmed)
      – 4 slots exposed (F1, F2, F3, F4); F1 has maximum exposure
      – Routes traffic to other Y! properties
  • 5. LinkedIn News
  • 6. LinkedIn Ads
  • 7. Data Generation [diagram labels: user information, http request, NEWS, ADS, Ranking Service, Model Updates]
  • 8. Data
      – Context: user i visits; (user, context) covariates $x_{it}$ (profile information, device id, first-degree connections, browse information, …)
      – Select item j with item covariates $x_j$ (keywords, content categories, ...)
      – (i, j): response $y_{ij}$ (click/no-click)
  • 9. Statistical Problem
      – Rank items (from an admissible pool) for user visits in some context to maximize a utility of interest
      – Examples of utility functions
          – Click-rates (CTR)
          – Share-rates (CTR × Pr[Share | Click])
          – Revenue per page-view = CTR × bid (more complex due to second-price auction)
      – CTR is a fundamental measure that opens the door to a more principled approach to ranking items
      – Converge rapidly to maximum-utility items
          – Sequential decision-making process
          – Models help cope with data sparseness (curse of dimensionality)
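The utility functions listed on slide 9 can be illustrated with a small ranking sketch. It is only an illustration under stated assumptions: the `Candidate` class, its field names, and the `rank` helper are hypothetical, the CTR and share estimates are taken as given, and the revenue utility is the plain CTR × bid product, ignoring the second-price auction that the slide flags as a complication.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for one item in the admissible pool."""
    item_id: str
    ctr: float                       # estimated click probability
    share_given_click: float = 0.0   # estimated Pr[Share | Click]
    bid: float = 0.0                 # advertiser bid per click


def utility(c: Candidate, kind: str) -> float:
    """Expected utility of showing candidate c, for the three slide examples."""
    if kind == "ctr":
        return c.ctr
    if kind == "share":
        return c.ctr * c.share_given_click
    if kind == "revenue":
        # Simplification: plain CTR * bid, no second-price adjustment.
        return c.ctr * c.bid
    raise ValueError(f"unknown utility: {kind}")


def rank(pool, kind):
    """Sort the pool by decreasing expected utility."""
    return sorted(pool, key=lambda c: utility(c, kind), reverse=True)
```

The same pool can rank differently under different utilities: an item with a lower CTR but a higher bid may come first when ranking by revenue.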
  • 10. Illustrate with Y! Front Page Application
      – Simplify: maximize CTR on the first slot (F1)
      – Article pool
          – Editorially selected for high quality and brand image
          – Few articles in the pool, but the pool is dynamic
      – We want to provide personalized recommendations
          – Users with many prior visits see recommendations “tailored” to their taste; others see the best for the “group” they belong to
  • 11. Types of user covariates
      – Demographics, geo
          – Not useful in the front-page application
      – Browse behavior: activity on the Y! network ($x_{it}$)
          – Previous visits to property, search, ad views, clicks, ...
          – This is useful for the front-page application
      – Latent user factors based on previous clicks on the module ($u_{it}$)
          – Useful for active module users, obtained via factor models
  • 12. Approach: Online + Offline
      – Offline computation
          – Intensive computations done infrequently (once a day/week) to update parameters that are less time-sensitive
      – Online computation
          – Lightweight computations done frequently (once every 5-10 minutes) to update parameters that are time-sensitive
          – Adaptive experiments (explore-exploit) also done online
  • 13. Online computation: per-item online logistic regression
      – For item j, the state-space model is
          $y_{ijt} \sim \mathrm{Ber}(p_{ijt})$
          $\mathrm{logit}(p_{ijt}) = u_i' v_{jt} + x_{it}' \beta_{jt}$
          $(v_{j,t+1}, \beta_{j,t+1}) = (v_{j,t}, \beta_{j,t}) + \delta_{j,t+1}, \quad \delta_{j,t+1} \sim (0, \tau^2)$
          $(v_{j0}, \beta_{j0}) = (D x_j, 0) + \epsilon_{j0}, \quad \epsilon_{j0} \sim (0, \sigma^2)$
      – Item coefficients are updated online via a Kalman filter (discounting approach of West and Harrison)
          – Item covariates are used to initialize coefficients at epoch zero
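One way to picture the per-item update on slide 13 is the sketch below. It is not the deployed filter: it assumes a Gaussian approximation to the posterior of the stacked item coefficients (v, β), inflates the covariance by a discount factor before each batch (in the spirit of West and Harrison's discounting), and then takes a single Newton step on the logistic likelihood. The class name `OnlineItemLogit`, the default `discount=0.95`, and the batch interface are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


class OnlineItemLogit:
    """Gaussian-approximate online logistic regression for one item (sketch)."""

    def __init__(self, prior_mean, prior_var=1.0, discount=0.95):
        # prior_mean plays the role of (D x_j, 0) at epoch zero.
        self.mean = np.asarray(prior_mean, dtype=float).copy()
        self.cov = prior_var * np.eye(self.mean.size)
        self.discount = discount  # in (0, 1]; smaller means faster forgetting

    def update(self, Z, y):
        """Z: rows are stacked features (u_i, x_it); y: 0/1 clicks for the batch."""
        Z = np.atleast_2d(np.asarray(Z, dtype=float))
        y = np.asarray(y, dtype=float)
        # Evolution step: discounting inflates the prior covariance.
        prior_prec = np.linalg.inv(self.cov / self.discount)
        # Observation step: one Newton step from the prior mean.
        p = sigmoid(Z @ self.mean)
        w = p * (1.0 - p)
        post_prec = prior_prec + (Z * w[:, None]).T @ Z
        self.cov = np.linalg.inv(post_prec)
        self.mean = self.mean + self.cov @ (Z.T @ (y - p))

    def predict(self, z):
        """Point estimate of click probability at the posterior mean."""
        return float(sigmoid(np.asarray(z, dtype=float) @ self.mean))
```

In use, one such object would be kept per item and `update` called with each 5-10 minute batch of visits; the posterior mean and covariance it maintains are what an explore/exploit scheme would consume.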
  • 14. Closer Look at online model
      – Different components of $\mathrm{logit}(p_{ijt}) = u_i' v_{jt} + x_{it}' \beta_{jt}$
          – $u_i$ ($r \times 1$): user latent factors, useful for heavy users
          – $x_{it}' \beta_{jt}$: residual item affinity to user covariates (old items)
  • 15. Online Adaptive Experimentation (Explore/Exploit)
      – Three schemes (all work reasonably well for the front-page application)
          – Epsilon-greedy: show the article with maximum posterior mean, except that with a small probability epsilon an article is chosen at random
          – Upper confidence bound (UCB): show the article with maximum score, where score = post-mean + k × post-std
          – Thompson sampling: draw a sample of the item coefficients $(v_{jt}, \beta_{jt})$ from the posterior to compute article CTR, and show the article with maximum CTR
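The three schemes on slide 15 can be sketched in a few lines. The sketch assumes each article's CTR posterior is summarized by a mean and a standard deviation; the function names and the default values of `epsilon` and `k` are illustrative choices, not values from the talk. The Thompson variant here samples CTR directly from a Gaussian approximation, a simplification of sampling the item coefficients and then computing CTR.

```python
import numpy as np


def epsilon_greedy(post_mean, rng, epsilon=0.05):
    """With probability epsilon pick a random article, else the best posterior mean."""
    if rng.random() < epsilon:
        return int(rng.integers(len(post_mean)))
    return int(np.argmax(post_mean))


def ucb(post_mean, post_std, k=2.0):
    """Pick the article with the largest score = post-mean + k * post-std."""
    score = np.asarray(post_mean) + k * np.asarray(post_std)
    return int(np.argmax(score))


def thompson(post_mean, post_std, rng):
    """Draw one CTR sample per article and pick the largest draw."""
    draws = rng.normal(post_mean, post_std)
    return int(np.argmax(draws))
```

All three return the index of the article to show in the slot; a random generator such as `np.random.default_rng()` is passed in so that runs can be made reproducible.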
  • 16. Offline computation
      – Computing user latent factors and item coefficient prior
          – Computed offline once a day using retrospective (user, item) interaction data for the last X days (X = 30 in our case)
          – Computations are done on Hadoop
  • 17. Offline: Regression-based Latent Factor Model
      – $y_{ij} \sim \mathrm{Ber}(p_{ij})$ (# obs. per user has wide variation)
      – $\mathrm{logit}(p_{ij}) = \sum_k u_{ik} v_{jk} = u_i' v_j$ (need shrinkage on factors)
      – $u_i = G x_i + \epsilon_i^u, \quad \epsilon_i^u \sim N(0, \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_r^2))$
      – $v_j = D x_j + \epsilon_j^v, \quad \epsilon_j^v \sim N(0, I), \quad v_{jk} \ge 0$
      – $G$, $D$: regression weight matrices; $\epsilon_i^u$, $\epsilon_j^v$: user/item-specific correction terms (learnt from data)
  • 18. Role of shrinkage (consider Gaussian for simplicity)
      – For a new user/article, factor estimates are based on covariates:
          $u_{new} = G x_{new}^{user}, \quad v_{new} = D x_{new}^{item}$
      – For an old user, the factor estimate is
          $E(u_i \mid \text{Rest}) = (\lambda I + \sum_{j \in N_i} v_j v_j')^{-1} (\lambda G x_i + \sum_{j \in N_i} y_{ij} v_j)$
      – Linear combination of prior regression function and user feedback on items
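The posterior-mean formula on slide 18 is short enough to compute directly. The sketch below assumes the Gaussian case the slide considers, with the item factors of the articles the user has seen stacked as rows of `V_seen`; the function and argument names are hypothetical.

```python
import numpy as np


def posterior_user_factor(G, x_i, V_seen, y_seen, lam):
    """E(u_i | Rest) = (lam*I + sum_j v_j v_j')^{-1} (lam*G x_i + sum_j y_ij v_j).

    G:      (r, p) regression weight matrix
    x_i:    (p,)   user covariates
    V_seen: (n_i, r) item factors v_j for the items j in N_i
    y_seen: (n_i,) responses y_ij on those items
    lam:    prior precision (shrinkage strength)
    """
    prior_mean = G @ x_i
    r = prior_mean.size
    A = lam * np.eye(r) + V_seen.T @ V_seen      # lam*I + sum v_j v_j'
    b = lam * prior_mean + V_seen.T @ y_seen     # lam*G x_i + sum y_ij v_j
    return np.linalg.solve(A, b)
```

With no observed items the estimate falls back to the regression prediction G x_i, and as feedback accumulates (or as `lam` shrinks) it moves toward the least-squares fit to the user's own responses, which is the linear-combination behaviour the slide describes.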
  • 19. Estimating the Regression function via EM
      – Maximize $\int \prod_{ij} f(u_i, v_j, \text{Data}) \prod_i g(u_i, G) \prod_j g(v_j, D) \prod_i du_i \prod_j dv_j$
      – The integral cannot be computed in closed form; it is approximated by Monte Carlo using Gibbs sampling
      – For logistic, we use ARS (adaptive rejection sampling, Gilks and Wild) to sample the latent factors within the Gibbs sampler
  • 20. Scaling to large data on Hadoop
      – Randomly partition by users in the Map
      – Run a separate model on each partition
          – Care taken to initialize each partition model with the same values; constraints on factors ensure identifiability within each partition
      – Create ensembles by using different user partitions; average across ensembles to obtain estimates of user factors and regression functions
          – Estimates of user factors in ensembles are uncorrelated, so averaging reduces variance
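The partition-and-average scheme on slide 20 can be mimicked on a single machine. The sketch below is only a serial stand-in for the Map step: `fit_partition` is a hypothetical callable that fits a model to one partition's users and returns a per-user estimate, and the partition and ensemble counts are arbitrary. Averaging of the regression functions across partitions is omitted for brevity.

```python
import numpy as np


def ensemble_user_estimates(user_ids, fit_partition,
                            n_partitions=4, n_ensembles=3, seed=0):
    """Average per-user estimates over several random user partitionings.

    fit_partition(members) -> dict mapping each user in `members` to an estimate
    (a float or a NumPy array); it is assumed to be supplied by the caller.
    """
    rng = np.random.default_rng(seed)
    user_ids = np.asarray(user_ids)
    sums = {u: 0.0 for u in user_ids.tolist()}
    for _ in range(n_ensembles):
        # A fresh random partition of users for each ensemble member.
        assignment = rng.integers(n_partitions, size=user_ids.size)
        for part in range(n_partitions):
            members = user_ids[assignment == part].tolist()
            if not members:
                continue
            for u, est in fit_partition(members).items():
                sums[u] = sums[u] + est
    # Each user falls in exactly one partition per ensemble member.
    return {u: s / n_ensembles for u, s in sums.items()}
```

Because each ensemble member sees a given user alongside a different random set of other users, the per-user estimates differ across members, and averaging them is what reduces variance.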
  • 21. Data Example
      – 1B events, 8M users, 6K articles
      – Offline training produced user factors $u_i$
      – Baseline: online logistic without $u_i$
          – Covariate-only online logistic model: $\mathrm{logit}(p_{ijt}) = x_{it}' \beta_{jt}$
      – Overall click lift: 9.7%
      – Heavy users (> 10 clicks last month): 26%
      – Cold users (not seen in the past): 3%
  • 22. Click-lift for heavy users [figure: CTR lift relative to the covariate-only logistic model]
  • 23. Computational Advertising: Matching ads to opportunities
      – [diagram labels: Advertisers, Ads, Ad Network, Pick best ads, Page, User, Publisher, Opportunity]
      – Examples: Yahoo, Google, MSN, ad exchanges (network of “networks”), …
  • 24. Ad exchange (RightMedia) [Agarwal et al. KDD 10]
      – Advertisers participate in different ways
          – CPM (pay per ad view)
          – CPC (pay per click)
          – CPA (pay per conversion)
      – To conduct an auction, normalize across pricing types
          – Compute eCPM (expected CPM)
          – Click-based: eCPM = click-rate × CPC
          – Conversion-based: eCPM = conv-rate × CPA
      – Similar strategy of computing offline and online components
          – Process 90B records for each model fit
          – Model has hundreds of millions of parameters
          – Model fully deployed on RightMedia today
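The eCPM normalization on slide 24 amounts to a small case analysis. The sketch below follows the slide's formulas literally and makes two assumptions of its own: all prices are expressed per single ad view so the three pricing types are directly comparable, and `conv_rate` is the conversion probability per ad view. The function name and argument names are hypothetical.

```python
def ecpm(pricing_type, price, click_rate=0.0, conv_rate=0.0):
    """Expected payment per ad view, used to compare bids across pricing types."""
    if pricing_type == "CPM":
        return price                 # paid on every ad view
    if pricing_type == "CPC":
        return click_rate * price    # eCPM = click-rate * CPC
    if pricing_type == "CPA":
        return conv_rate * price     # eCPM = conv-rate * CPA
    raise ValueError(f"unknown pricing type: {pricing_type}")
```

For example, a CPC bid of 2.0 with an estimated click rate of 1% and a CPA bid of 50.0 with an estimated conversion rate of 0.1% can then be ranked in the same auction; the quality of that ranking depends entirely on the rate estimates.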
  • 25. Summary
      – Estimating interactions in high-dimensional sparse data is important in web applications
      – Scaling such models to Big Data is a challenging statistical problem
      – Combining offline + online modeling with explore/exploit is a good practical strategy
  • 26. Some Challenges
      – Very high-dimensional modeling with very large and noisy data
          – Few categorical variables with a large number of levels interacting with each other to produce the response
          – Scalability
      – Designing sequential experiments
          – Multi-armed bandits are back in a big way
      – Data fusion
          – From multiple and disparate sources
      – Availability of data, and the ability to run experiments, for researchers

Editor's notes

  1. Important module, 100s of millions of user visits.