Sequential Selection of Correlated Ads by POMDPs
1. Sequential Selection of Correlated Ads by POMDPs
Shuai Yuan, Jun Wang
University College London
October 29, 2012
2. Motivations and contributions
Motivations,
• help publishers gain more profit by displaying ads;
• go further than offline, content-based matching of
webpages and ads;
Contributions,
• a framework of ad selection for revenue optimisation;
• formulating the sequential selection problem as a partially
observable Markov decision process (POMDP) and providing
exact and approximate solutions;
• a public keyword-bid-ad-webpage dataset for reproducible
research¹.
¹ http://www.computational-advertising.org
3. Related works
Contextual advertising,
• A semantic approach to contextual advertising [Broder 2007]
• Impedance coupling in content-targeted advertising [Ribeiro 2005]
• Contextual advertising by combining relevance with click feedback [Chakrabarti 2008]
Inventory management (contracts),
• Targeted advertising on the Web with inventory management [Chickering 2003]
• Revenue management for online advertising: Impatient advertisers [Fridgeirsdottir 2007]
• Dynamic revenue management for online display advertising [Roels 2009]
Optimal pricing model,
• Pricing of Online Advertising: Cost-Per-Click-Through Vs. Cost-Per-Action [Hu 2010]
• Online advertising: Pay-per-view versus pay-per-click [Mangani 2004]
• Online advertising: Pay-per-view versus pay-per-click: A comment [Fjell 2009]
• Single period balancing of pay-per-click and pay-per-view online display advertisements [Kwon 2011]
4. Related works (cont.)
Ad scheduling,
• Scheduling advertisements on a web page to maximize revenue [Kumar 2006]
• Scheduling of dynamic in-game advertising [Turner 2011]
Multi-armed bandits,
• Using confidence bounds for exploitation-exploration trade-offs [Auer 2003]
• Multi-armed bandit problems with dependent arms [Pandey 2007]
POMDPs,
• A survey of POMDP applications [Cassandra 1998]
• Monte Carlo POMDPs [Thrun 2000]
• Perseus: Randomized point-based value iteration for POMDPs [Spaan 2005]
5. Problem statement - setup
[Figure: three payoff time-series panels, one per ad, values 0-500 over 1000 time steps.]
Figure: 1 webpage, 1 ad slot, M impressions at each time step.
The payoff of ads follows X ∼ N(µ, I·σ_0^2), where µ is generated by µ ∼ N(θ, Σ).
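As a concrete illustration, here is a minimal NumPy sketch of this generative model; the toy sizes and parameter values are our own assumptions, not from the paper:

    import numpy as np

    rng = np.random.default_rng(0)

    N, T = 3, 100                 # toy setting: 3 candidate ads, 100 time steps
    sigma0 = 5.0                  # observation noise std, sigma_0
    theta = np.array([300.0, 250.0, 280.0])      # prior means of mu
    Sigma = np.array([[900.0, 600.0,   0.0],     # prior covariance: ads 0 and 1 correlated
                      [600.0, 900.0,   0.0],
                      [  0.0,   0.0, 900.0]])

    mu = rng.multivariate_normal(theta, Sigma)   # hidden mean payoffs, mu ~ N(theta, Sigma)

    def observe(i):
        """One observed payoff of ad i: X_i ~ N(mu_i, sigma_0^2)."""
        return rng.normal(mu[i], sigma0)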
6. Problem statement - graphical model
[Figure: influence diagram — the hidden state (θ, Σ) generates µ(1), ..., µ(T); at each stage the action s(t) and the noise σ_0^2 produce the observation x(t); the belief (θ(t), Σ(t)) carries the remaining horizon T−t.]
Figure : The payoff model illustrated by an influence diagram
representation with generative processes of a finite horizon POMDP.
s(t) is the selection action. θ(t), Σ(t) is the belief at some stage.
7. Problem statement - objective function
To maximise the expected cumulative payoff over time,
π* = arg max_π E[R_π(T)] = arg max_π E[ ∑_{t=1}^T X_{s(t)}(t) ] = arg max_π ∑_{t=1}^T E[ X_{s(t)}(t) ]
   = arg max_π ∑_{t=1}^T ∫_x x_{s(t)}(t)·p(x_{s(t)}(t) | Ψ(t)) dx = arg max_π ∑_{t=1}^T θ_{s(t)}(t)   (1)
where,
• s(t) is the selection decision;
• Ψ(t) is the available information;
• π is a selection policy and π ∗ is the optimal one;
• the constant factor “M impressions” is dropped from the objective function.
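To make the objective concrete, here is a hedged Monte Carlo estimate of E[R_π(T)] for an arbitrary policy, reusing the sketch above; `policy` is a hypothetical callable of ours that returns the ad index s(t):

    def estimate_cumulative_payoff(policy, theta, Sigma, T=100, runs=200, sigma0=5.0):
        """Average R_pi(T) over simulated runs of the generative model above."""
        rng = np.random.default_rng(1)
        total = 0.0
        for _ in range(runs):
            mu = rng.multivariate_normal(theta, Sigma)   # draw a hidden state
            for t in range(T):
                i = policy(t)                            # selection decision s(t)
                total += rng.normal(mu[i], sigma0)       # observed payoff x_{s(t)}(t)
        return total / runs

    # e.g. the RANDOM policy picks an ad uniformly at every step
    rng2 = np.random.default_rng(2)
    print(estimate_cumulative_payoff(lambda t: int(rng2.integers(3)), theta, Sigma))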
8. Belief update
[Figure: belief densities over payoff, one panel per step t = 1, 2, ...]
Figure: Updating belief on ads' performance over time.
9. Belief update - the selected ad
We update the belief using Bayes’ theorem.
p(x_1 | x_1(t), Ψ(t)) = ∫ p(x_1 | x_1(t), Ψ(t), µ_1) · p(µ_1 | x_1(t), Ψ(t)) dµ_1   (2)

by “completing squares”,

p(µ_1 | x_1(t), Ψ(t)) ∝ p(x_1(t) | µ_1, Ψ(t)) · p(µ_1 | Ψ(t))
                     ∝ exp( −(x_1(t) − µ_1)^2 / (2σ_0^2) − (µ_1 − θ_1(t))^2 / (2σ_1^2(t)) )   (3)

we obtain the new belief,

µ_1 | x_1(t) ∼ N( θ_1(t+1), σ_1^2(t+1) )   (4)

θ_1(t+1) = ( σ_1^2(t)·x_1(t) + σ_0^2·θ_1(t) ) / ( σ_1^2(t) + σ_0^2 ),   σ_1^2(t+1) = σ_1^2(t)·σ_0^2 / ( σ_1^2(t) + σ_0^2 )   (5)

We write θ_i(t) and σ_i^2(t) as shorthand for θ_i | Ψ(t) and σ_i^2 | Ψ(t).
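Equation (5) is the standard Gaussian conjugate update; a small sketch (the function name is ours):

    def update_selected(theta_i, var_i, x, sigma0=5.0):
        """Posterior mean and variance of mu_i after observing payoff x
        of the selected ad i, following Equations (4)-(5)."""
        s2 = sigma0 ** 2
        theta_new = (var_i * x + s2 * theta_i) / (var_i + s2)   # Eq. (5), mean
        var_new = var_i * s2 / (var_i + s2)                     # Eq. (5), variance
        return theta_new, var_new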
10. Belief update - the correlated ad
We also update the belief of non-selected ads,
p(x_2 | x_1(t), Ψ(t)) = ∫ p(x_2 | µ_2, x_1(t), Ψ(t)) · p(µ_2 | x_1(t), Ψ(t)) dµ_2   (6)

with the linear Gaussian property,

µ_1 | µ_2 ∼ N( θ_1|µ_2, σ_1^2|µ_2 )   (7)

θ_1|µ_2 = θ_1 + (σ_{1,2} / σ_2^2)·(µ_2 − θ_2),   σ_1^2|µ_2 = σ_1^2 − σ_{1,2}^2 / σ_2^2   (8)

we obtain the new belief on a correlated ad,

µ_2 | x_1(t) ∼ N( θ_2(t+1), σ_2^2(t+1) )   (9)

θ_2(t+1) = θ_2(t) + σ_{1,2}·( x_1(t) − θ_1(t) ) / ( σ_1^2(t) + σ_0^2 ),   σ_2^2(t+1) = σ_2^2(t) − σ_{1,2}^2 / ( σ_1^2(t) + σ_0^2 )   (10)
11. Belief update - expected payoff
We also obtain the expected payoff of the selected ad,
X_1 | x_1(t), Ψ(t) ∼ N( θ_1(t+1), σ_0^2 + σ_1^2(t+1) )   (11)

and the expected payoff of the correlated ad,

X_2 | x_1(t), Ψ(t) ∼ N( θ_2(t+1), σ_0^2 + σ_2^2(t+1) )   (12)

The final objective function is,

π* = arg max_π ∑_{t=1}^T θ_{s(t)}(t)   subject to   (13)

θ_{s(t+1)}(t+1) = θ_{s(t+1)}(t) + σ_{s(t),s(t+1)}·( x_{s(t)}(t) − θ_{s(t)}(t) ) / ( σ_{s(t)}^2(t) + σ_0^2 )   (14)

σ_{s(t+1)}^2(t+1) = σ_{s(t+1)}^2(t) − σ_{s(t),s(t+1)}^2 / ( σ_{s(t)}^2(t) + σ_0^2 )   (15)
12. POMDP formulation and solution
[Figure: the POMDP loop over the payoff panels — belief state (θ(t), Σ(t)) → action s(t) → observation & reward x(t), with the hidden state (θ, Σ) generating the payoffs.]
Figure : The POMDP model for the revenue optimisation problem.
(θ(t), Σ(t)) is belief at some stage; x(t) is observation and reward;
s(t) is action; (θ, Σ) is the hidden state. There is no state transition.
13. Value iteration and MAB approximation
The value function can be expressed as,

s(t) = arg max_{s(t)∈N} V_{s(t)}(Ψ(t)) = arg max_{i∈N} [ x̄_i + ξ(Ψ(t), i) ]   (16)

where x̄_i is the expected immediate reward and ξ(Ψ(t), i) is the expected future reward.

The exact solution uses value iteration²:

V*(θ, Σ, T) = max_{s(1)∈N} E[ X_{s(1)}(1) + V*( θ|X_{s(1)}(1), Σ|X_{s(1)}(1), T−1 ) ]   (17)

The approximation is based on the multi-armed bandit³:

ξ_{UCB1-NORMAL} = √( 16 · ( q_i − t_i·θ_i^2(t) ) / ( t_i − 1 ) · log(t−1) / t_i )   (18)
² R. E. Bellman (1957), “Dynamic Programming”.
³ Auer, P. et al. (2002), “Finite-time analysis of the multi-armed bandit problem”.
14. Value iteration with Monte Carlo sampling⁴
We use sampling to reduce the computational complexity,
 1: function ValueFunc(θ, Σ, t)
 2:     array V ← 0                              ▷ expected reward vector
 3:     loop i ← 1 to N
 4:         V[i] ← θ_i(t)                        ▷ expected immediate reward
 5:         if t < T then
 6:             for all s in Sample(θ, Σ) do
 7:                 [θ′, Σ′] ← UpdateBelief(θ, Σ, s, i)
                        ▷ new belief after selecting i and observing s; Equations (14)-(15)
 8:                 V[i] ← V[i] + (1/M_0)·ValueFunc(θ′, Σ′, t + 1)
 9:             end for
10:         end if
11:     end loop
12:     return [Max(V), MaxIndex(V)]
13: end function
⁴ Thrun, S. (2000), “Monte Carlo POMDPs”.
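A runnable Python rendering of ValueFunc, reusing update_belief from the earlier sketch; sampling the predictive distribution of ad i stands in for Sample(θ, Σ), and M_0 is the sample count (both our assumptions):

    import numpy as np

    rng = np.random.default_rng(3)

    def value_func(theta, Sigma, t, T, M0=4, sigma0=5.0):
        """Monte Carlo value iteration (slide 14): returns (value, best ad index)
        for belief (theta, Sigma) at stage t. Cost grows exponentially in T - t,
        so this sketch is for toy horizons only."""
        V = np.array(theta, dtype=float)        # expected immediate rewards theta_i(t)
        if t < T:
            for i in range(len(theta)):
                # sample M0 hypothetical observations of ad i under the current belief
                xs = rng.normal(theta[i], np.sqrt(Sigma[i, i] + sigma0 ** 2), M0)
                for x in xs:
                    th2, Sg2 = update_belief(theta, Sigma, i, x, sigma0)
                    V[i] += value_func(th2, Sg2, t + 1, T, M0, sigma0)[0] / M0
        return V.max(), int(V.argmax())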
15. Multi-armed bandit based approximation (cont.)
The UCB1-NORMAL-COR algorithm:
 1: function Plan(θ, Σ, Ψ(t))
 2:     array V ← 0
 3:     loop i ← 1 to N
 4:         if t_i < 8·log t then                ▷ t_i is the number of times ad i has been selected
 5:             return i
 6:         end if
 7:     end loop
 8:     [θ′, Σ′] ← UpdateBelief(θ, Σ, Ψ(t))
            ▷ new belief of all ads with all available information; Equations (14)-(15)
 9:     loop i ← 1 to N
10:         V[i] ← θ′_i + √( 16 · ( q_i − t_i·θ′_i^2 ) / ( t_i − 1 ) · log(t−1) / t_i )   ▷ expected reward
11:     end loop
12:     return [Max(V), MaxIndex(V)]
13: end function
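And a hedged Python sketch of Plan; counts[i] is t_i, q[i] is the sum of squared payoffs observed for ad i, and the max(0, ·) guard is our addition to keep the square root real:

    import math

    def plan(theta, counts, q, t):
        """UCB1-NORMAL-COR selection (slide 15), assuming theta already holds
        belief means updated with all information via Equations (14)-(15)."""
        N = len(theta)
        for i in range(N):
            if counts[i] < 8 * math.log(max(t, 2)):   # forced exploration, as in UCB1-NORMAL
                return i
        V = [theta[i] + math.sqrt(16 * max(0.0, q[i] - counts[i] * theta[i] ** 2)
                                  / (counts[i] - 1) * math.log(t - 1) / counts[i])
             for i in range(N)]                       # index of Equation (18)
        return V.index(max(V))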
16. Experiment datasets
[Figure: the ad network/exchange sits between advertisers and publishers; data was collected by querying the Google AdWords Traffic Estimator service.]
• publishers gain 68% of advertisers’ spending (2003);
• data was collected from 12/2011 to 05/2012;
• 512 different keywords, 310 with non-zero mean payoff, 8
categories;
• 20% for training and 80% for testing;
• we consider each keyword to be an ad.
17. Competing algorithms
We compare the following algorithms,
• the RANDOM policy, which selects candidates uniformly at random;
• the MYOPIC policy, which selects based on the expected immediate reward;
• the UCB1 policy, which assumes independence between arms and makes no assumption about the reward distribution;
• the UCB1-NORMAL policy, which assumes independence between arms and Gaussian-distributed rewards;
• the VI-COR policy, which solves value iteration using Monte Carlo sampling; and
• the UCB1-NORMAL-COR policy, which considers the dependencies between candidates.
18. Results
Datasets     MYOPIC  RANDOM  UCB1  UCB1-N  VI-COR  UCB1-N-COR
Education 21.9 23.0 30.9 30.9 41.2* 27.6
Finance-1 38.5 27.8 40.9 26.4 44.5 27.4
Finance-2 22.1 16.5 30.6 22.8 38.0* 22.9
Information 14.1 12.9 27.8 15.9 29.4 15.9
P&O 41.6 30.4 50.5 31.4 72.9* 63.3
Shopping-1 17.4 10.6 42.3 16.1 40.2 16.4
Shopping-2 29.9 14.5 34.3 75.3 52.9 79.2*
Shopping-3 9.7 4.3 21.9 18.3 27.3 19.4
P&S 24.7 26.0 47.2 57.1 67.9* 59.9
Medical 30.5 19.6 52.7 32.2 58.0* 33.5
Table: The cumulative payoffs are averaged over 8 chunks, then normalized w.r.t. the
GOLDEN policy for better presentation. The highest cumulative payoff in each row is
marked with * when its difference from the second best is significant under the Wilcoxon
signed-rank test. P&O is “People & organisations” and P&S is “Products & services”.
19. Results (cont.)
[Figure: cumulative payoff curves over 100 steps for VI-COR, UCB1-NORMAL-COR, UCB1-NORMAL, UCB1, GOLDEN, MYOPIC and RANDOM; payoff axis up to 4000.]
Figure : Cumulative payoff on “People & organization” category, 5
candidates.
20. Results (cont.)
[Figure: bar chart of normalized cumulative payoff (0-1) for MYOPIC, VI-COR, UCB1-NORMAL and UCB1-NORMAL-COR on each dataset: Edu, F-1, F-2, Info, P&O, S-1, S-2, S-3, P&S, Med.]
Figure : Comparison of accumulated payoffs on the 10 datasets.
VI-COR always performed better than MYOPIC and UCB1-NORMAL-COR
always performed better than UCB1-NORMAL across all datasets.
21. Results (cont.)
[Figure: daily payoff over 150 days for two candidate keywords, “best phones” and “term insurance”.]
Figure : Special case: the daily payoff of two candidates with a
sudden change.
22. Results (cont.)
[Figure: cumulative payoff of GOLDEN, MYOPIC, VI-COR and UCB1-NORMAL-COR as the noise factor σ_0^2 sweeps from 10^-2 to 10^4 on a log scale.]
Figure: The impact of the noise factor σ_0^2 for the situation in the previous figure.
Recall the mean update (Equation 14):

θ_{s(t+1)}(t+1) = θ_{s(t+1)}(t) + σ_{s(t),s(t+1)}·( x_{s(t)}(t) − θ_{s(t)}(t) ) / ( σ_{s(t)}^2(t) + σ_0^2 )
23. Future work
• correlated update: if ad a1 on webpage w1 was shown to
user u1 and we observed its performance, what is the belief
about the performance of ad a2 on webpage w2 when shown
to user u2, given known correlations?
• selecting multiple ads with diversification (another
exploration-exploitation dilemma);
• a better solution for our continuous POMDP problem.