Sequential Selection of Correlated Ads by POMDPs
1. Sequential Selection of Correlated Ads by POMDPs
Shuai Yuan, Jun Wang
University College London
October 29, 2012
2. Motivations and contributions
Motivations,
• help publishers gain more profit by displaying ads;
• go further than offline, content-based matching of
webpages and ads;
Contributions,
• a framework of ad selection for revenue optimisation;
• formulating the sequential selection problem as a partially
observable Markov decision process (POMDP) and providing
exact and approximate solutions;
• a public keyword-bid-ad-webpage dataset for reproducible
research¹.
¹ http://www.computational-advertising.org
3. Related works
Contextual advertising,
• A semantic approach to contextual advertising [Broder 2007]
• Impedance coupling in content-targeted advertising [Ribeiro 2005]
• Contextual advertising by combining relevance with click feedback [Chakrabarti 2008]
Inventory management (contracts),
• Targeted advertising on the Web with inventory management [Chickering 2003]
• Revenue management for online advertising: Impatient advertisers [Fridgeirsdottir 2007]
• Dynamic revenue management for online display advertising [Roels 2009]
Optimal pricing model,
• Pricing of Online Advertising: Cost-Per-Click-Through Vs. Cost-Per-Action [Hu 2010]
• Online advertising: Pay-per-view versus pay-per-click [Mangani 2004]
• Online advertising: Pay-per-view versus pay-per-click: A comment [Fjell 2009]
• Single period balancing of pay-per-click and pay-per-view online display advertisements [Kwon 2011]
4. Related works (cont.)
Ad scheduling,
• Scheduling advertisements on a web page to maximize revenue [Kumar 2006]
• Scheduling of dynamic in-game advertising [Turner 2011]
Multi-armed bandits,
• Using confidence bounds for exploitation-exploration trade-offs [Auer 2003]
• Multi-armed bandit problems with dependent arms [Pandey 2007]
POMDPs,
• A survey of POMDP applications [Cassandra 1998]
• Monte Carlo POMDPs [Thrun 2000]
• Perseus: Randomized point-based value iteration for POMDPs [Spaan 2005]
5. Problem statement - setup
[Figure: three payoff time-series panels, one per ad, values 0-500 over 1000 time steps.]
Figure: 1 webpage, 1 ad slot, M impressions at each time step.
The payoff of ads follows X ∼ N(µ, I·σ_0^2), where µ is generated by µ ∼ N(θ, Σ).
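As a concrete illustration, here is a minimal NumPy sketch of this generative model; the toy sizes and parameter values are our own assumptions, not from the paper:

    import numpy as np

    rng = np.random.default_rng(0)

    N, T = 3, 100                 # toy setting: 3 candidate ads, 100 time steps
    sigma0 = 5.0                  # observation noise std, sigma_0
    theta = np.array([300.0, 250.0, 280.0])      # prior means of mu
    Sigma = np.array([[900.0, 600.0,   0.0],     # prior covariance: ads 0 and 1 correlated
                      [600.0, 900.0,   0.0],
                      [  0.0,   0.0, 900.0]])

    mu = rng.multivariate_normal(theta, Sigma)   # hidden mean payoffs, mu ~ N(theta, Sigma)

    def observe(i):
        """One observed payoff of ad i: X_i ~ N(mu_i, sigma_0^2)."""
        return rng.normal(mu[i], sigma0)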
6. Problem statement - graphical model
[Figure: influence diagram — the hidden state (θ, Σ) generates µ(1), ..., µ(T); at each stage the action s(t) and the noise σ_0^2 produce the observation x(t); the belief (θ(t), Σ(t)) carries the remaining horizon T−t.]
Figure : The payoff model illustrated by an influence diagram
representation with generative processes of a finite horizon POMDP.
s(t) is the selection action. θ(t), Σ(t) is the belief at some stage.
7. Problem statement - objective function
To maximise the expected cumulative payoff over time,
π* = arg max_π E[R_π(T)] = arg max_π E[ ∑_{t=1}^T X_{s(t)}(t) ] = arg max_π ∑_{t=1}^T E[ X_{s(t)}(t) ]
   = arg max_π ∑_{t=1}^T ∫_x x_{s(t)}(t)·p(x_{s(t)}(t) | Ψ(t)) dx = arg max_π ∑_{t=1}^T θ_{s(t)}(t)   (1)
where,
• s(t) is the selection decision;
• Ψ(t) is the available information;
• π is a selection policy and π ∗ is the optimal one;
• the constant factor “M impressions” is dropped from the objective function.
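To make the objective concrete, here is a hedged Monte Carlo estimate of E[R_π(T)] for an arbitrary policy, reusing the sketch above; `policy` is a hypothetical callable of ours that returns the ad index s(t):

    def estimate_cumulative_payoff(policy, theta, Sigma, T=100, runs=200, sigma0=5.0):
        """Average R_pi(T) over simulated runs of the generative model above."""
        rng = np.random.default_rng(1)
        total = 0.0
        for _ in range(runs):
            mu = rng.multivariate_normal(theta, Sigma)   # draw a hidden state
            for t in range(T):
                i = policy(t)                            # selection decision s(t)
                total += rng.normal(mu[i], sigma0)       # observed payoff x_{s(t)}(t)
        return total / runs

    # e.g. the RANDOM policy picks an ad uniformly at every step
    rng2 = np.random.default_rng(2)
    print(estimate_cumulative_payoff(lambda t: int(rng2.integers(3)), theta, Sigma))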
8. Belief update
[Figure: belief densities over payoff, one panel per step t = 1, 2, ...]
Figure: Updating belief on ads' performance over time.
9. Belief update - the selected ad
We update the belief using Bayes’ theorem.
p(x_1 | x_1(t), Ψ(t)) = ∫ p(x_1 | x_1(t), Ψ(t), µ_1) · p(µ_1 | x_1(t), Ψ(t)) dµ_1   (2)

by “completing squares”,

p(µ_1 | x_1(t), Ψ(t)) ∝ p(x_1(t) | µ_1, Ψ(t)) · p(µ_1 | Ψ(t))
                     ∝ exp( −(x_1(t) − µ_1)^2 / (2σ_0^2) − (µ_1 − θ_1(t))^2 / (2σ_1^2(t)) )   (3)

we obtain the new belief,

µ_1 | x_1(t) ∼ N( θ_1(t+1), σ_1^2(t+1) )   (4)

θ_1(t+1) = ( σ_1^2(t)·x_1(t) + σ_0^2·θ_1(t) ) / ( σ_1^2(t) + σ_0^2 ),   σ_1^2(t+1) = σ_1^2(t)·σ_0^2 / ( σ_1^2(t) + σ_0^2 )   (5)

We write θ_i(t) and σ_i^2(t) as shorthand for θ_i | Ψ(t) and σ_i^2 | Ψ(t).
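Equation (5) is the standard Gaussian conjugate update; a small sketch (the function name is ours):

    def update_selected(theta_i, var_i, x, sigma0=5.0):
        """Posterior mean and variance of mu_i after observing payoff x
        of the selected ad i, following Equations (4)-(5)."""
        s2 = sigma0 ** 2
        theta_new = (var_i * x + s2 * theta_i) / (var_i + s2)   # Eq. (5), mean
        var_new = var_i * s2 / (var_i + s2)                     # Eq. (5), variance
        return theta_new, var_new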
10. Belief update - the correlated ad
We also update the belief of non-selected ads,
p(x_2 | x_1(t), Ψ(t)) = ∫ p(x_2 | µ_2, x_1(t), Ψ(t)) · p(µ_2 | x_1(t), Ψ(t)) dµ_2   (6)

with the linear Gaussian property,

µ_1 | µ_2 ∼ N( θ_1|µ_2, σ_1^2|µ_2 )   (7)

θ_1|µ_2 = θ_1 + (σ_{1,2} / σ_2^2)·(µ_2 − θ_2),   σ_1^2|µ_2 = σ_1^2 − σ_{1,2}^2 / σ_2^2   (8)

we obtain the new belief on a correlated ad,

µ_2 | x_1(t) ∼ N( θ_2(t+1), σ_2^2(t+1) )   (9)

θ_2(t+1) = θ_2(t) + σ_{1,2}·( x_1(t) − θ_1(t) ) / ( σ_1^2(t) + σ_0^2 ),   σ_2^2(t+1) = σ_2^2(t) − σ_{1,2}^2 / ( σ_1^2(t) + σ_0^2 )   (10)
11. Belief update - expected payoff
We also obtain the expected payoff of the selected ad,
X_1 | x_1(t), Ψ(t) ∼ N( θ_1(t+1), σ_0^2 + σ_1^2(t+1) )   (11)

and the expected payoff of the correlated ad,

X_2 | x_1(t), Ψ(t) ∼ N( θ_2(t+1), σ_0^2 + σ_2^2(t+1) )   (12)

The final objective function is,

π* = arg max_π ∑_{t=1}^T θ_{s(t)}(t)   subject to   (13)

θ_{s(t+1)}(t+1) = θ_{s(t+1)}(t) + σ_{s(t),s(t+1)}·( x_{s(t)}(t) − θ_{s(t)}(t) ) / ( σ_{s(t)}^2(t) + σ_0^2 )   (14)

σ_{s(t+1)}^2(t+1) = σ_{s(t+1)}^2(t) − σ_{s(t),s(t+1)}^2 / ( σ_{s(t)}^2(t) + σ_0^2 )   (15)
12. POMDP formulation and solution
[Figure: the POMDP loop over the payoff panels — belief state (θ(t), Σ(t)) → action s(t) → observation & reward x(t), with the hidden state (θ, Σ) generating the payoffs.]
Figure : The POMDP model for the revenue optimisation problem.
(θ(t), Σ(t)) is belief at some stage; x(t) is observation and reward;
s(t) is action; (θ, Σ) is the hidden state. There is no state transition.
13. Value iteration and MAB approximation
The value function can be expressed as,

s(t) = arg max_{s(t)∈N} V_{s(t)}(Ψ(t)) = arg max_{i∈N} [ x̄_i + ξ(Ψ(t), i) ]   (16)

where x̄_i is the expected immediate reward and ξ(Ψ(t), i) is the expected future reward.

The exact solution uses value iteration²:

V*(θ, Σ, T) = max_{s(1)∈N} E[ X_{s(1)}(1) + V*( θ|X_{s(1)}(1), Σ|X_{s(1)}(1), T−1 ) ]   (17)

The approximation is based on the multi-armed bandit³:

ξ_{UCB1-NORMAL} = √( 16 · ( q_i − t_i·θ_i^2(t) ) / ( t_i − 1 ) · log(t−1) / t_i )   (18)
² R. E. Bellman (1957), “Dynamic Programming”.
³ Auer, P. et al. (2002), “Finite-time analysis of the multi-armed bandit problem”.
14. Value iteration with Monte Carlo sampling⁴
We use sampling to reduce the computational complexity,
 1: function ValueFunc(θ, Σ, t)
 2:     array V ← 0                              ▷ expected reward vector
 3:     loop i ← 1 to N
 4:         V[i] ← θ_i(t)                        ▷ expected immediate reward
 5:         if t < T then
 6:             for all s in Sample(θ, Σ) do
 7:                 [θ′, Σ′] ← UpdateBelief(θ, Σ, s, i)
                        ▷ new belief after selecting i and observing s; Equations (14)-(15)
 8:                 V[i] ← V[i] + (1/M_0)·ValueFunc(θ′, Σ′, t + 1)
 9:             end for
10:         end if
11:     end loop
12:     return [Max(V), MaxIndex(V)]
13: end function
⁴ Thrun, S. (2000), “Monte Carlo POMDPs”.
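A runnable Python rendering of ValueFunc, reusing update_belief from the earlier sketch; sampling the predictive distribution of ad i stands in for Sample(θ, Σ), and M_0 is the sample count (both our assumptions):

    import numpy as np

    rng = np.random.default_rng(3)

    def value_func(theta, Sigma, t, T, M0=4, sigma0=5.0):
        """Monte Carlo value iteration (slide 14): returns (value, best ad index)
        for belief (theta, Sigma) at stage t. Cost grows exponentially in T - t,
        so this sketch is for toy horizons only."""
        V = np.array(theta, dtype=float)        # expected immediate rewards theta_i(t)
        if t < T:
            for i in range(len(theta)):
                # sample M0 hypothetical observations of ad i under the current belief
                xs = rng.normal(theta[i], np.sqrt(Sigma[i, i] + sigma0 ** 2), M0)
                for x in xs:
                    th2, Sg2 = update_belief(theta, Sigma, i, x, sigma0)
                    V[i] += value_func(th2, Sg2, t + 1, T, M0, sigma0)[0] / M0
        return V.max(), int(V.argmax())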
15. Multi-armed bandit based approximation (cont.)
The UCB1-NORMAL-COR algorithm:
 1: function Plan(θ, Σ, Ψ(t))
 2:     array V ← 0
 3:     loop i ← 1 to N
 4:         if t_i < 8·log t then                ▷ t_i is the number of times ad i has been selected
 5:             return i
 6:         end if
 7:     end loop
 8:     [θ′, Σ′] ← UpdateBelief(θ, Σ, Ψ(t))
            ▷ new belief of all ads with all available information; Equations (14)-(15)
 9:     loop i ← 1 to N
10:         V[i] ← θ′_i + √( 16 · ( q_i − t_i·θ′_i^2 ) / ( t_i − 1 ) · log(t−1) / t_i )   ▷ expected reward
11:     end loop
12:     return [Max(V), MaxIndex(V)]
13: end function
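And a hedged Python sketch of Plan; counts[i] is t_i, q[i] is the sum of squared payoffs observed for ad i, and the max(0, ·) guard is our addition to keep the square root real:

    import math

    def plan(theta, counts, q, t):
        """UCB1-NORMAL-COR selection (slide 15), assuming theta already holds
        belief means updated with all information via Equations (14)-(15)."""
        N = len(theta)
        for i in range(N):
            if counts[i] < 8 * math.log(max(t, 2)):   # forced exploration, as in UCB1-NORMAL
                return i
        V = [theta[i] + math.sqrt(16 * max(0.0, q[i] - counts[i] * theta[i] ** 2)
                                  / (counts[i] - 1) * math.log(t - 1) / counts[i])
             for i in range(N)]                       # index of Equation (18)
        return V.index(max(V))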
16. Experiment datasets
[Figure: the ad network/exchange sits between advertisers and publishers; data was collected by querying the Google AdWords Traffic Estimator service.]
• publishers gain 68% of advertisers’ spending (2003);
• data was collected from 12/2011 to 05/2012;
• 512 different keywords, 310 with non-zero mean payoff, 8
categories;
• 20% for training and 80% for testing;
• we consider each keyword to be an ad.
17. Competing algorithms
We compare the following algorithms,
• the RANDOM policy, which selects candidates uniformly at random;
• the MYOPIC policy, which selects based on the expected immediate reward;
• the UCB1 policy, which assumes independence between arms and makes no assumption about the reward distribution;
• the UCB1-NORMAL policy, which assumes independence between arms and Gaussian-distributed rewards;
• the VI-COR policy, which solves value iteration using Monte Carlo sampling; and
• the UCB1-NORMAL-COR policy, which considers the dependencies between candidates.
18. Results
Datasets     MYOPIC  RANDOM  UCB1  UCB1-N  VI-COR  UCB1-N-COR
Education 21.9 23.0 30.9 30.9 41.2* 27.6
Finance-1 38.5 27.8 40.9 26.4 44.5 27.4
Finance-2 22.1 16.5 30.6 22.8 38.0* 22.9
Information 14.1 12.9 27.8 15.9 29.4 15.9
P&O 41.6 30.4 50.5 31.4 72.9* 63.3
Shopping-1 17.4 10.6 42.3 16.1 40.2 16.4
Shopping-2 29.9 14.5 34.3 75.3 52.9 79.2*
Shopping-3 9.7 4.3 21.9 18.3 27.3 19.4
P&S 24.7 26.0 47.2 57.1 67.9* 59.9
Medical 30.5 19.6 52.7 32.2 58.0* 33.5
Table: The cumulative payoffs are averaged over 8 chunks, then normalized w.r.t. the
GOLDEN policy for better presentation. The highest cumulative payoff in each row is
marked with * when its difference from the second best is significant under the Wilcoxon
signed-rank test. P&O is “People & organisations” and P&S is “Products & services”.
19. Results (cont.)
[Figure: cumulative payoff curves over 100 steps for VI-COR, UCB1-NORMAL-COR, UCB1-NORMAL, UCB1, GOLDEN, MYOPIC and RANDOM; payoff axis up to 4000.]
Figure : Cumulative payoff on “People & organization” category, 5
candidates.
20. Results (cont.)
[Figure: bar chart of normalized cumulative payoff (0-1) for MYOPIC, VI-COR, UCB1-NORMAL and UCB1-NORMAL-COR on each dataset: Edu, F-1, F-2, Info, P&O, S-1, S-2, S-3, P&S, Med.]
Figure : Comparison of accumulated payoffs on the 10 datasets.
VI-COR always performed better than MYOPIC and UCB1-NORMAL-COR
always performed better than UCB1-NORMAL across all datasets.
21. Results (cont.)
[Figure: daily payoff over 150 days for two candidate keywords, “best phones” and “term insurance”.]
Figure : Special case: the daily payoff of two candidates with a
sudden change.
22. Results (cont.)
[Figure: cumulative payoff of GOLDEN, MYOPIC, VI-COR and UCB1-NORMAL-COR as the noise factor σ_0^2 sweeps from 10^-2 to 10^4 on a log scale.]
Figure: The impact of the noise factor σ_0^2 for the situation in the previous figure.
Recall the mean update (Equation 14):

θ_{s(t+1)}(t+1) = θ_{s(t+1)}(t) + σ_{s(t),s(t+1)}·( x_{s(t)}(t) − θ_{s(t)}(t) ) / ( σ_{s(t)}^2(t) + σ_0^2 )
23. Future work
• correlated update: if ad a1 on webpage w1 was shown to
user u1 and we observed its performance, what is the belief
about the performance of ad a2 on webpage w2 when shown
to user u2, given known correlations?
• selecting multiple ads with diversification (another
exploration-exploitation dilemma);
• a better solution for our continuous POMDP problem.