1. Statistical Models for Massive Web Data
Deepak Agarwal, LinkedIn, USA
Director, Applied Relevance Science (ARS)
CATS Big Data Panel, October 11, 2012
Hosted by National Academy of Sciences
Washington D.C., USA
2. Disclaimer
The opinions expressed here are mine and in
no way represent the official position of LinkedIn
The case studies presented today were work done
while I was at Yahoo!
NRC BIG DATA PANEL, AGARWAL, 2012
3. Big Data Applications in Business
Big Data: competitive advantage, innovation, reduced
uncertainty in decision making
High Frequency Data
– Large number of heterogeneous transactions per unit time
Web visits, financial trading, credit card transactions, telephone
calls, packet flows in an IP network,…
I will focus on statistical modeling for one such data
source
– User visits to websites
4. Example 1: Yahoo! front page
Today module: recommend content links F1-F4
(out of 30-40 editorially programmed articles)
4 slots exposed; F1 has maximum exposure
Routes traffic to other Y! properties
7. Data Generation
[Diagram: a user's http request, together with user information, goes to the Ranking Service, which serves NEWS and ADS; model updates flow back into the Ranking Service]
8. DATA
Context: select item j with item covariates x_j
(keywords, content categories, ...)
User i visits; (user, context) covariates x_it
(profile information, device id, first-degree connections, browse information, ...)
Response y_ij for the pair (i, j): click/no-click
9. Statistical Problem
Rank items (from an admissible pool) for user visits in
some context to maximize a utility of interest
Examples of utility functions
– Click-rate (CTR)
– Share-rate: CTR × P(Share | Click)
– Revenue per page-view: CTR × bid (more complex due to the
second-price auction)
CTR is a fundamental measure that opens the door to a
more principled approach to rank items
Converge rapidly to maximum-utility items
– Sequential decision-making process
– Models help cope with data sparseness (curse of dimensionality)
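As a toy illustration of ranking by the utilities listed above (the pool, rates, and bids are made-up numbers, not from the talk):

```python
# Rank a small item pool by three utilities from the slide:
# click-rate (CTR), share-rate = CTR * P(share | click),
# and revenue per page-view = CTR * bid (ignoring second-price effects).
items = {
    "A": {"ctr": 0.040, "p_share_given_click": 0.10, "bid": 0.50},
    "B": {"ctr": 0.030, "p_share_given_click": 0.30, "bid": 1.00},
    "C": {"ctr": 0.050, "p_share_given_click": 0.05, "bid": 0.20},
}

def rank(utility):
    # Sort item ids by the chosen utility, best first.
    return sorted(items, key=lambda j: utility(items[j]), reverse=True)

by_ctr = rank(lambda it: it["ctr"])
by_share = rank(lambda it: it["ctr"] * it["p_share_given_click"])
by_revenue = rank(lambda it: it["ctr"] * it["bid"])
```

Note that the three utilities produce different rankings of the same pool, which is why the choice of utility matters.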
10. Illustrate with Y! front Page Application
Simplify: Maximize CTR on first slot (F1)
Article Pool
– Editorially selected for high quality and brand image
– Few articles in the pool, but the pool is dynamic
We want to provide personalized recommendations
– Users with many prior visits see recommendations “tailored” to
their taste, others see the best for the “group” they belong to
11. Types of user covariates
Demographics, geo:
– Not useful in front-page application
Browse behavior: activity on Y! network ( xit )
– Previous visits to property, search, ad views, clicks,..
– This is useful for the front-page application
Latent user factors based on previous clicks on the
module ( uit )
– Useful for active module users, obtained via factor models
12. Approach: Online + Offline
Offline computation
– Intensive computations done infrequently (once
a day/week) to update parameters that are less
time-sensitive
Online computation
– Lightweight computations done frequently (once every 5-10
minutes) to update parameters that are time-sensitive
– Adaptive experiments(explore-exploit) also done
online
13. Online computation: per-item online logistic regression
For item j, the state-space model is
y_ijt ~ Ber(p_ijt)
logit(p_ijt) = u_i' v_jt + x_it' b_jt
(v_{j,t+1}, b_{j,t+1}) = (v_{j,t}, b_{j,t}) + δ_{j,t+1},   δ_{j,t+1} ~ N(0, τ²)
(v_{j,0}, b_{j,0}) = (D x_j, 0) + ε_{j,0},   ε_{j,0} ~ N(0, σ²)
Item coefficients are updated online via a Kalman filter (discounting
approach of West and Harrison)
– Item covariates are used to initialize coefficients at epoch zero
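A minimal sketch of one per-item online update, assuming a diagonal Gaussian posterior and a one-step assumed-density update in place of the full Kalman filter; the class name, discount factor, and data are illustrative, not the deployed system:

```python
import math

class OnlineItemModel:
    """Per-item coefficients with a diagonal Gaussian posterior N(mean, var).
    Discounting (West & Harrison style): inflate the variance each epoch so
    old data is gradually forgotten."""

    def __init__(self, init_mean, init_var=1.0, discount=0.95):
        self.mean = list(init_mean)        # initialized offline from item covariates
        self.var = [init_var] * len(init_mean)
        self.discount = discount

    def new_epoch(self):
        # Evolution step: posterior variance grows by a factor 1/discount.
        self.var = [v / self.discount for v in self.var]

    def update(self, x, y):
        # One observation: covariate vector x, click y in {0, 1}.
        p = 1.0 / (1.0 + math.exp(-sum(m * xi for m, xi in zip(self.mean, x))))
        for k, xk in enumerate(x):
            # Gradient step scaled by the posterior variance of coordinate k.
            self.mean[k] += self.var[k] * (y - p) * xk
            # Precision grows with the Fisher information of the observation.
            self.var[k] = 1.0 / (1.0 / self.var[k] + p * (1.0 - p) * xk * xk)
        return p

model = OnlineItemModel(init_mean=[0.0, 0.0])
model.new_epoch()
# Toy stream: feature 0 attracts clicks, feature 1 does not.
for x, y in [([1.0, 0.0], 1), ([1.0, 1.0], 0), ([1.0, 0.0], 1)]:
    model.update(x, y)
```

After the stream, the coefficient for the clicked feature has moved up and the other down, while every variance has shrunk, mimicking the Kalman measurement update.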
14. Closer Look at online model
Different components of logit(p_ijt) = u_i' v_jt + x_it' b_jt
u_i : user latent factors (r × 1), useful for heavy users
x_it' b_jt : residual item affinity to user covariates (old items)
15. Online Adaptive Experimentation
(Explore/Exploit)
Three schemes (all work reasonably well for the
front page application)
– epsilon-greedy: Show article with maximum posterior
mean except with a small probability epsilon, choose an
article at random.
– Upper confidence bound (UCB): Show article with
maximum score, where score = posterior mean + k · posterior std
– Thompson sampling: Draw a sample of the coefficients from the
posterior to compute each article's CTR and show the article with
maximum sampled CTR
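The three schemes can be sketched as follows, assuming each article's CTR posterior is summarized by a (mean, std) pair; the article ids and numbers are illustrative:

```python
import random

# Posterior summaries (mean CTR, posterior std) per article -- made-up numbers.
posterior = {"a1": (0.040, 0.010), "a2": (0.035, 0.030), "a3": (0.020, 0.002)}

def epsilon_greedy(eps=0.05, rng=random):
    # With probability eps explore uniformly; otherwise exploit the posterior mean.
    if rng.random() < eps:
        return rng.choice(sorted(posterior))
    return max(posterior, key=lambda a: posterior[a][0])

def ucb(k=2.0):
    # Score = posterior mean + k * posterior std; pick the highest score.
    return max(posterior, key=lambda a: posterior[a][0] + k * posterior[a][1])

def thompson(rng=random):
    # Draw one CTR sample per article from its posterior; pick the largest draw.
    return max(posterior, key=lambda a: rng.gauss(*posterior[a]))
```

Note how UCB and Thompson sampling favor "a2" despite its lower mean, because its posterior is wide: uncertainty itself earns exploration.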
16. Offline computation
Computing user latent factors and item
coefficient prior
– This is computed offline once a day using
retrospective (user,item) interaction data for last
X days (X = 30 in our case)
– Computations are done on Hadoop
17. Offline: Regression based Latent Factor
Model
y_ij ~ Ber(p_ij)   (# obs. per user has wide variation)
logit(p_ij) = Σ_k u_ik v_jk = u_i' v_j   (need shrinkage on factors)
u_i = G x_i + ε_i^u,   ε_i^u ~ N(0, diag(σ_1², σ_2², ..., σ_r²))
v_j = D x_j + ε_j^v,   ε_j^v ~ N(0, I),   with v_jk ≥ 0
G, D : regression weight matrices; the ε terms are user/item-specific
correction terms (learnt from data)
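A small simulation of this generative model (pure-Python sketch; the dimensions, weight matrices, and noise scales are all illustrative toy values):

```python
import math, random

random.seed(0)
r = 2   # latent dimension (toy size)

# Regression weight matrices G (r x p_user) and D (r x p_item) -- toy values.
G = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]]
D = [[0.4, 0.1, 0.0], [-0.1, 0.2, 0.3]]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def draw_user(x_i, sigmas=(0.3, 0.3)):
    # u_i = G x_i + eps,  eps ~ N(0, diag(sigma_k^2))
    return [m + random.gauss(0, s) for m, s in zip(matvec(G, x_i), sigmas)]

def draw_item(x_j):
    # v_j = D x_j + eps,  eps ~ N(0, I), truncated at 0 so v_jk >= 0
    return [max(0.0, m + random.gauss(0, 1)) for m in matvec(D, x_j)]

def click_prob(u, v):
    # logit(p_ij) = u_i' v_j
    return 1.0 / (1.0 + math.exp(-sum(uk * vk for uk, vk in zip(u, v))))

u = draw_user([1.0, 0.0, 1.0])
v = draw_item([0.0, 1.0, 1.0])
p = click_prob(u, v)
```

The covariate regressions G x_i and D x_j act as priors; the per-user and per-item noise terms are the corrections learnt from interaction data.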
18. Role of shrinkage (consider Gaussian for simplicity)
For a new user/article, factor estimates are based on covariates alone:
u_new ≈ G x_new (user),   v_new ≈ D x_new (item)
For an old user, the factor estimate is
E(u_i | Rest) = (λ I + Σ_{j∈N_i} v_j v_j')⁻¹ (λ G x_i + Σ_{j∈N_i} y_ij v_j)
a linear combination of the prior regression function
and user feedback on items
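The old-user formula can be computed directly; a sketch for r = 2, where λ, the prior mean G x_i, and the feedback pairs are toy numbers:

```python
# E(u_i | Rest) = (lam*I + sum_j v_j v_j')^{-1} (lam*G x_i + sum_j y_ij v_j)
lam = 1.0
prior_mean = [0.2, -0.1]               # G x_i, precomputed offline (toy values)
feedback = [([1.0, 0.0], 1),           # (v_j, y_ij) pairs for items j in N_i
            ([0.5, 0.5], 0),
            ([0.0, 1.0], 1)]

# Accumulate A = lam*I + sum v v'  and  b = lam*G x_i + sum y*v.
A = [[lam, 0.0], [0.0, lam]]
b = [lam * prior_mean[0], lam * prior_mean[1]]
for v, y in feedback:
    for k in range(2):
        for l in range(2):
            A[k][l] += v[k] * v[l]
        b[k] += y * v[k]

# Solve the 2x2 system A u = b by Cramer's rule.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
u_hat = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
         (A[0][0] * b[1] - A[1][0] * b[0]) / det]
```

With little feedback, A ≈ λI and the estimate stays near the prior G x_i; as clicks accumulate, the data terms dominate and pull the factors toward observed behavior.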
19. Estimating the Regression function via EM
Maximize over (G, D):
∫ Π_ij f(y_ij | u_i, v_j) Π_i g(u_i ; G) Π_j g(v_j ; D) du dv
Integral cannot be computed in closed form,
approximated by Monte Carlo using Gibbs Sampling
For logistic, we use adaptive rejection sampling (ARS; Gilks and Wild)
to sample the latent factors within the Gibbs sampler
20. Scaling to large data on Hadoop
Randomly partition by users in the Map
Run separate model on each partition
– Care taken to initialize each partition model with
same values, constraints on factors ensure
identifiability within each partition
Create ensembles by using different user partitions,
average across ensembles to obtain estimates of
user factors and regression functions
– Estimates of user factors in ensembles uncorrelated,
averaging reduces variance
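The partition-ensemble idea in miniature, with a hypothetical stand-in for the per-partition fit (in reality each partition runs the full factor model on Hadoop):

```python
import random
random.seed(1)

users = list(range(100))
truth = 0.5          # a scalar "regression weight" to recover (toy target)

def noisy_fit(user_subset):
    # Stand-in for fitting the model on one partition: an unbiased but
    # noisy estimate whose noise depends on the random partition.
    return truth + random.gauss(0, 0.2)

def one_run(n_parts=4):
    # One ensemble member: randomly partition users, fit each partition
    # separately (same initialization), then combine the partition fits.
    random.shuffle(users)
    parts = [users[i::n_parts] for i in range(n_parts)]
    return sum(noisy_fit(p) for p in parts) / n_parts

# Average across ensembles built from different random partitions.
ensemble = [one_run() for _ in range(20)]
estimate = sum(ensemble) / len(ensemble)
```

Because the partition-level errors are roughly uncorrelated across runs, averaging shrinks the variance of the final estimate, which is the point made on the slide.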
21. Data Example
1B events, 8M users, 6K articles
Offline training produced user factor ui
Baseline: Online logistic without ui
– Covariate-only online logistic model:
logit(p_ijt) = x_it' b_jt
Overall click lift: 9.7%,
Heavy users (> 10 clicks last month): 26%
Cold users (not seen in the past): 3%
22. Click-lift for heavy users
[Figure: CTR lift relative to the covariate-only logistic model]
23. Computational Advertising: Matching ads to opportunities
[Diagram: an ad network matches advertisers' ads to opportunities (a user viewing a publisher's page) and picks the best ads. Examples: Yahoo, Google, MSN, ad exchanges (networks of "networks"), ...]
24. Ad-exchange (RightMedia) [Agarwal et al. KDD 10]
Advertisers participate in different ways
– CPM (pay by ad-view)
– CPC (pay per click)
– CPA (pay per conversion)
To conduct an auction, normalize across pricing types
– Compute eCPM (expected CPM)
Click-based ---- eCPM = click-rate*CPC
Conversion-based ---- eCPM = conv-rate*CPA
Similar strategy of computing offline and online components
– Process 90B records for each model fit
– Model has hundreds of millions of parameters
– Model fully deployed on RightMedia today
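The eCPM normalization can be illustrated with hypothetical prices and rates (per-impression values; the "per-mille" ×1000 factor is omitted, as on the slide):

```python
# Normalize heterogeneous pricing types to a common eCPM scale so they can
# compete in one auction.
def ecpm(bid_type, price, click_rate=0.0, conv_rate=0.0):
    if bid_type == "CPM":
        return price                  # advertiser already pays per ad-view
    if bid_type == "CPC":
        return click_rate * price     # expected revenue per view
    if bid_type == "CPA":
        return conv_rate * price      # expected revenue per view
    raise ValueError(bid_type)

# Toy bids: a CPM buyer, a CPC buyer, and a CPA buyer competing for one view.
bids = [
    ("CPM", ecpm("CPM", 0.002)),
    ("CPC", ecpm("CPC", 0.50, click_rate=0.01)),
    ("CPA", ecpm("CPA", 20.0, conv_rate=0.0001)),
]
winner = max(bids, key=lambda b: b[1])
```

Here the CPC bid wins with an expected 0.005 per view; note the outcome hinges entirely on the estimated click and conversion rates, which is why rate estimation is the core statistical problem.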
25. Summary
Estimating interactions in high-dimensional
sparse data is important in web applications
Scaling such models to Big Data is a
challenging statistical problem
Combining offline + online modeling with
explore/exploit a good practical strategy
26. Some Challenges
Very high-dimensional modeling with very large and noisy data
– A few categorical variables, each with a large number of levels,
interacting with each other to produce the response
– Scalability
Designing sequential experiments
– Multi-armed bandits are back in a big way
Data fusion
– From multiple and disparate sources
Availability of data, and the ability to run experiments, for researchers
Editor's notes
Important module, 100s of millions of user visits.