Sponsored content in contextual bandits. Deconfounding Targeting Not At Random

MIUE 2023
Hubert Drążkowski
GRAPE|FAME, Warsaw University of Technology
September 22, 2023
Motivational examples

Recommender systems
• Suggest the best ads/movies a ∈ {a_1, a_2, ..., a_K}
• Users X_1, X_2, ..., X_T
• Design of the study {n_{a_1}, n_{a_2}, ..., n_{a_K}}, with Σ_i n_{a_i} = T
• Measured satisfaction {R_t(a_1), ..., R_t(a_K)}, t = 1, ..., T
The framework

Exploration vs exploitation
• Allocating limited resources under uncertainty
• Decisions made sequentially
• Partial feedback (bandit feedback)
• Adaptive (non-i.i.d.) data
• Maximizing cumulative gain
• Current actions do not change the future environment
The elements of the bandit model
(see Lattimore and Szepesvári (2020))
• Context X_t ∈ 𝒳
  • X_t ∼ D_X
• Actions A_t ∈ 𝒜 = {a_1, ..., a_K}
  • A_t ∼ π_t(a|x)
• Policy π ∈ Π
  • π = {π_t}_{t=1}^T
  • π_t : 𝒳 → P(𝒜), where P(𝒜) := {q ∈ [0, 1]^K : Σ_{a∈𝒜} q_a = 1}
• Rewards R_t ∈ ℝ₊
  • potential rewards (R(a_1), R(a_2), ..., R(a_K)) and R_t = Σ_{k=1}^K 1(A_t = a_k) R(a_k)
  • R_t ∼ D_{R|A,X}
• History H_t ∈ ℋ_t
  • H_t = σ({(X_s, A_s, R_s)}_{s=1}^t)
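To make the protocol concrete, here is a minimal sketch of the interaction loop, assuming a toy linear reward model and a uniform placeholder policy (all names and parameters here are illustrative, not from the talk):

```python
import numpy as np

# Minimal contextual-bandit loop; the reward model and policy are toy stand-ins.
rng = np.random.default_rng(0)
K, T, d = 5, 1000, 3                 # arms, horizon, context dimension
theta = rng.normal(size=(K, d))      # unknown per-arm parameters (assumed)

def policy_t(x):
    """Placeholder for pi_t(.|x): a point in P(A), here uniform."""
    return np.full(K, 1.0 / K)

history = []
for t in range(T):
    x = rng.normal(size=d)           # X_t ~ D_X
    p = policy_t(x)                  # pi_t(.|x)
    a = rng.choice(K, p=p)           # A_t ~ pi_t(.|x)
    r = theta[a] @ x + rng.normal()  # bandit feedback: only R_t(A_t) is observed
    history.append((x, a, r))        # builds H_t = sigma({(X_s, A_s, R_s)})
```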
Details
• In short, (X_t, A_t, R_t) ∼ D(π_t)
• We know π_t(a|x) (the propensity score)
• We don't know D_{X,R⃗}
• We have 1(A_t = a) ⊥⊥ R(a) | X_t
• We want to choose π to maximize E_{D(π)}[ Σ_{t=1}^T R_t(A_t) ]
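A quick numerical illustration of why knowing the propensity score matters: under the conditional independence above, weighting the observed reward by 1(A_t = a)/π_t(a|x) recovers E[R(a) | X = x]. The numbers below are a toy check, not the talk's construction:

```python
import numpy as np

# Toy check: importance weighting by the known propensity recovers E[R(a)|x].
rng = np.random.default_rng(1)
pi = np.array([0.2, 0.8])            # known pi_t(.|x) at a fixed context x
mu = np.array([1.0, 3.0])            # E[R(a)|X=x], unknown in practice
n = 200_000
A = rng.choice(2, size=n, p=pi)      # logged actions
R = mu[A] + rng.normal(size=n)       # logged rewards
for a in range(2):
    est = np.mean((A == a) / pi[a] * R)   # inverse-propensity-weighted mean
    print(a, round(est, 3), "vs", mu[a])  # ~1.0 and ~3.0
```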
The flow of information
Inverse Gap Weighting
1 Authoritarian Sponsor
2 Deconfounding
3 Experiment
4 Conclusions
Authoritarian Sponsor model
• The act of sponsoring
  • Recommender systems: marketing campaigns, testing products
  • Healthcare: funding experiments, lobbying doctors
• The sponsor, a pair (ς, ξ) of a targeting rule ς and an action policy ξ, intervenes in an authoritarian manner:

A_t = S_t Ã_t + (1 − S_t) Ā_t,
S_t ∈ {0, 1}, S_t ∼ ς_t(·|X),
Ā_t ∼ π_t(·|X), Ã_t ∼ ξ_t(·|X),

so the observed (mixture) policy is

π̃_t(a|x) = ς_t(1|x) ξ_t(a|x) + ς_t(0|x) π_t(a|x).

• The lack of knowledge about the sponsor's policy (ς, ξ):
  • Not sharing technology or strategy
  • Lost in human-to-algorithm translation
  • Hard-to-model processes like auctions
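A minimal sketch of one sponsored round, assuming illustrative forms for the learner's policy, the sponsor's gate, and the sponsor's action policy (sigma_t and xi_t below; none of these functional forms come from the talk):

```python
import numpy as np

# One sponsored round; sigma_t (gate) and xi_t (sponsor policy) are
# illustrative stand-ins, unknown to the learner in the model.
rng = np.random.default_rng(2)
K = 3

def pi_t(x):        # learner's policy pi_t(.|x): softmax of a toy score
    s = np.array([x, -x, 0.5 * x])
    e = np.exp(s - s.max())
    return e / e.sum()

def sigma_t(x):     # sponsor gate: P(S_t = 1 | X = x)
    return 1.0 / (1.0 + np.exp(-x))

def xi_t(x):        # sponsor's action policy xi_t(.|x)
    return np.array([0.7, 0.2, 0.1])

x = rng.normal()
S = rng.random() < sigma_t(x)            # S_t ~ sigma_t(.|x)
A_bar = rng.choice(K, p=pi_t(x))         # A_bar_t ~ pi_t(.|x)
A_til = rng.choice(K, p=xi_t(x))         # A_tilde_t ~ xi_t(.|x)
A = A_til if S else A_bar                # A_t = S_t*A~_t + (1 - S_t)*A_bar_t
mix = sigma_t(x) * xi_t(x) + (1 - sigma_t(x)) * pi_t(x)  # observed mixture policy
```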
Targeting mechanisms
Introducing an unobserved confounder Z (the three regimes are sketched in code below):
1 Targeting Completely At Random (TCAR)
  • ς(X) = ς, ξ(a|X, R, Z) = ξ(a)
  • analogous to MCAR
2 Targeting At Random (TAR)
  • ς(X) = ς(X), ξ(a|X, R, Z) = ξ(a|X)
  • analogous to MAR
3 Targeting Not At Random (TNAR)
  • ξ(a|X, R, Z) depends on (R, Z), so that R(a) ̸⊥⊥ A | X, S = 1
  • analogous to MNAR
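The three regimes differ only in what the sponsor's action policy ξ may depend on; a hypothetical sketch (the logistic forms are assumptions for illustration):

```python
import numpy as np

# Three sponsor policies over two actions; only TNAR uses the latent Z.
def xi_tcar(x, r, z):    # depends on nothing: analogous to MCAR
    return np.array([0.5, 0.5])

def xi_tar(x, r, z):     # depends on the observed context only: analogous to MAR
    p = 1.0 / (1.0 + np.exp(-x))
    return np.array([1 - p, p])

def xi_tnar(x, r, z):    # depends on the unobserved Z: analogous to MNAR
    p = 1.0 / (1.0 + np.exp(-z))    # Z also shifts rewards => confounding
    return np.array([1 - p, p])
```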
Causal interpretation
Figure 1: TCAR
Figure 2: TAR
Figure 3: TNAR
2 Deconfounding
Data fusion
(see Colnet et al. (2020))

                      RCT    OS    Learner    Sponsor
Internal validity
External validity      ∼                ∼
Propensity score              ?                  ?

Table 2: Differences and similarities between data sources

• Unsolved challenge: sampling in interaction!
CATE
• CATE and its plug-in estimate:

τ_{a_1,a_2}(x) = E_{D_{R|A,X=x}}[R(a_1) − R(a_2)]  and  τ̂_{a_1,a_2}(x) = μ̂_{a_1}(x) − μ̂_{a_2}(x)

• Assumptions
  • SUTVA: R_t = Σ_{a∈𝒜} 1(A_t = a) R_t(a)
  • Ignorability: 1(A_t = a) ⊥⊥ R(a) | X_t, S_t = 0
  • Ignorability of study participation: R_t(a) ⊥⊥ S_t | X_t
  • TNAR: R(a) ̸⊥⊥ A | X, S = 1
• Biased CATE on the sponsor sample:

ρ_{a_1,a_2}(x) = E[R | A = a_1, X = x, S = 1] − E[R | A = a_2, X = x, S = 1]

• Bias measurement:

η_{a_1,a_2}(x) = τ_{a_1,a_2}(x) − ρ_{a_1,a_2}(x)
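As a concrete baseline for the plug-in estimate τ̂_{a_1,a_2}(x) = μ̂_{a_1}(x) − μ̂_{a_2}(x), here is a T-learner sketch on synthetic data, using scikit-learn linear models (any metalearner could be substituted):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# T-learner: fit mu_hat per arm on its own subsample, then difference.
rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=(n, 1))
A = rng.integers(0, 2, size=n)
R = 1 + A + X[:, 0] + 2 * A * X[:, 0] + rng.normal(size=n)

mu0 = LinearRegression().fit(X[A == 0], R[A == 0])
mu1 = LinearRegression().fit(X[A == 1], R[A == 1])

def tau_hat(x):                      # CATE estimate mu_hat_1(x) - mu_hat_0(x)
    return mu1.predict(x) - mu0.predict(x)
```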
Two step deconfounding
(see Kallus et al. (2018)), 𝒜 = {a_0, a_1}
1 On the observational (sponsor) sample, use a metalearner to obtain ρ̂_{a_1,a_0}(X).
2 Postulate a function q_t(X, a_0) such that E[q_t(X, a_0) R | X = x, S = 0] = τ_{a_1,a_0}(x), where

q_t(X, a_0) = 1(A = a_1) / π_t(a_1|X) − 1(A = a_0) / π_t(a_0|X).

3 Using q_t(X, a_0), apply the definition of η_{a_1,a_0}(x) to adjust the ρ̂ term by solving an optimization problem on the unconfounded sample:

η̂_{a_1,a_0}(X) = argmin_η Σ_{t: S_t = 0} ( q_t(x_t, a_0) r_t − ρ̂_{a_1,a_0}(x_t) − η(x_t) )².

4 Finally, τ̂_{a_1,a_0}(x) = ρ̂_{a_1,a_0}(x) + η̂_{a_1,a_0}(x) (sketched in code below).
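A hedged sketch of the two-step procedure, in the spirit of Kallus et al. (2018): ρ̂ is fit on the confounded sponsor sample, and the least-squares problem for η̂ reduces to regressing q_t(x_t, a_0) r_t − ρ̂(x_t) on x_t over the unconfounded sample (the linear models and function names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def q_t(A, pi1, pi0):
    """q_t(X, a_0) = 1(A=a_1)/pi_t(a_1|X) - 1(A=a_0)/pi_t(a_0|X)."""
    return (A == 1) / pi1 - (A == 0) / pi0

def deconfound(X, A, R, S, pi1, pi0):
    conf, unconf = (S == 1), (S == 0)
    # Step 1: biased CATE rho_hat from the sponsor (observational) sample.
    m0 = LinearRegression().fit(X[conf & (A == 0)], R[conf & (A == 0)])
    m1 = LinearRegression().fit(X[conf & (A == 1)], R[conf & (A == 1)])
    rho = lambda x: m1.predict(x) - m0.predict(x)
    # Steps 2-3: regress q_t * r - rho_hat(x) on x over the S=0 sample.
    target = q_t(A[unconf], pi1[unconf], pi0[unconf]) * R[unconf] - rho(X[unconf])
    eta = LinearRegression().fit(X[unconf], target)
    # Step 4: tau_hat = rho_hat + eta_hat.
    return lambda x: rho(x) + eta.predict(x)
```

Under TNAR, ρ̂ alone is biased; the correction η̂ is identified on the S = 0 rounds because the learner's propensities there are known.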
Deconfounded CATE IGW (D-CATE-IGW)
• Let b = argmax_a μ̂^m_a(x_t). Then

π(a|x) = 1 / (K + γ_m (μ̂^m_b(x) − μ̂^m_a(x)))   for a ≠ b,
π(b|x) = 1 − Σ_{c≠b} π(c|x),

which, since μ̂^m_b(x) − μ̂^m_a(x) = τ̂_{b,a}(x), equals

π(a|x) = 1 / (K + γ_m τ̂_{b,a}(x))   for a ≠ b.

• Each round/epoch, deconfound the CATE.
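A sketch of the inverse-gap-weighting rule as a function of the deconfounded gaps τ̂_{b,a}(x); gamma plays the role of the epoch's exploration parameter γ_m:

```python
import numpy as np

def igw_policy(tau_gaps, gamma):
    """tau_gaps[a] = tau_hat_{b,a}(x) >= 0, equal to 0 at the best arm b."""
    gaps = np.asarray(tau_gaps, dtype=float)
    K = gaps.size
    b = int(np.argmin(gaps))          # empirical best arm: zero gap
    p = 1.0 / (K + gamma * gaps)      # 1 / (K + gamma_m * tau_hat_{b,a}(x))
    p[b] = 0.0
    p[b] = 1.0 - p.sum()              # remaining mass goes to the best arm
    return p

probs = igw_policy([0.0, 0.8, 1.5], gamma=10.0)  # ~[0.85, 0.09, 0.06]
```

Larger gaps and larger γ_m push probability mass toward the empirical best arm, which is the usual exploration vs exploitation dial of IGW.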
3 Experiment
Setup I
• S_t ∼ Bern(ρ)
• No-overlap scenario: X_t | S_t = 0 ∼ Unif([−1, 1]), U_t | S_t = 0 ∼ N(0, 1)
• Full-overlap scenario: X_t | S_t = 0 ∼ N(0, 1), U_t | S_t = 0 ∼ N(0, 1)
• On the sponsor sample,

(X_t, U_t)ᵀ | {A_t, S_t = 1} ∼ N( (0, 0)ᵀ, [ 1, (2A_t − 1)σ_A ; (2A_t − 1)σ_A, 1 ] )

• σ_A ∈ {0.6, 0.9}, ρ ∈ {0.3, 0.6}
• Rewards:

R_t(A_t) = 1 + A_t + X_t + 2 A_t X_t + (1/2) X_t² + (3/4) A_t X_t² + 2 U_t + (1/2) ε_t,

where ε_t ∼ N(0, 1), so that τ(X_t) = (3/4) X_t² + 2 X_t + 1 (the DGP is sketched in code below).
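A sketch of the data-generating process above; the sponsor's action draw is left uniform as a stand-in for ξ, and the learner's action in S = 0 rounds is a uniform placeholder (the experiment runs a bandit policy there):

```python
import numpy as np

rng = np.random.default_rng(5)

def draw(T, rho=0.3, sigma_A=0.6, full_overlap=True):
    S = rng.random(T) < rho                    # S_t ~ Bern(rho)
    X = np.empty(T); U = np.empty(T); A = np.empty(T, dtype=int)
    for t in range(T):
        if S[t]:                               # sponsor round: (X,U)|A correlated
            A[t] = rng.integers(0, 2)
            c = (2 * A[t] - 1) * sigma_A
            X[t], U[t] = rng.multivariate_normal([0, 0], [[1, c], [c, 1]])
        else:                                  # learner round
            X[t] = rng.normal() if full_overlap else rng.uniform(-1, 1)
            U[t] = rng.normal()
            A[t] = rng.integers(0, 2)
    eps = rng.normal(size=T)
    R = (1 + A + X + 2 * A * X + 0.5 * X**2
         + 0.75 * A * X**2 + 2 * U + 0.5 * eps)
    tau = 0.75 * X**2 + 2 * X + 1              # true CATE tau(X_t)
    return X, U, A, R, S, tau
```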
Result I
Figure 4: Normed cumulative regret for different scenarios
Result II
Figure 5: True and estimated CATE values for different scenarios
4 Conclusions
Contribution
1 A pioneering model for sponsored content in the contextual-bandit framework
2 Bandits treated not as experimental studies but as observational studies
3 A confounding scenario and a deconfounding application
4 D-CATE-IGW works
Future research
• Theoretical
  • Mathematically model the complicated sampling, especially the flow of information
  • Consistency proof of the CATE estimator in this scenario
  • High-probability regret bounds for D-CATE-IGW: P(REWARD(π) ≥ BOUND(δ)) ≥ 1 − δ
• Empirical
  • More metalearners (X-learner, R-learner) (see Künzel et al. (2019))
  • Other deconfounding methods (see Wu and Yang (2022))
  • A more comprehensive empirical study

Expansion
• Policy evaluation:

V(π) = E_X E_{A∼π(·|X)} E_{R|A,X}[R],

with V̂_t(π) estimated on {(X_s, A_s, R_s)}_{s=1}^t ∼ D(π̃) (a sketch follows below).
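One natural starting point for V̂_t(π) is an inverse-propensity estimate over the logged triples, valid only where the mixture propensities π̃_s(A_s|X_s) are known or estimable (an assumption here, not the talk's estimator):

```python
import numpy as np

def v_hat(R, pi_eval, pi_log):
    """IPS estimate of V(pi): pi_eval[s] = pi(A_s|X_s), pi_log[s] = pi_tilde_s(A_s|X_s)."""
    w = np.asarray(pi_eval) / np.asarray(pi_log)   # importance weights
    return float(np.mean(w * np.asarray(R)))
```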
The beginning ...
References

Colnet, B., I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang (2020). Causal inference methods for combining randomized trials and observational studies: a review. arXiv preprint arXiv:2011.08047.

Kallus, N., A. M. Puli, and U. Shalit (2018). Removing hidden confounding by experimental grounding. Advances in Neural Information Processing Systems 31.

Künzel, S. R., J. S. Sekhon, P. J. Bickel, and B. Yu (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences 116(10), 4156–4165.

Lattimore, T. and C. Szepesvári (2020). Bandit Algorithms. Cambridge University Press.

Wu, L. and S. Yang (2022). Integrative R-learner of heterogeneous treatment effects combining experimental and observational studies. In Conference on Causal Learning and Reasoning, pp. 904–926. PMLR.
1 sur 68

Recommandé

Sequential Monte Carlo algorithms for agent-based models of disease transmission par
Sequential Monte Carlo algorithms for agent-based models of disease transmissionSequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmissionJeremyHeng10
62 vues57 diapositives
block-mdp-masters-defense.pdf par
block-mdp-masters-defense.pdfblock-mdp-masters-defense.pdf
block-mdp-masters-defense.pdfJunghyun Lee
63 vues75 diapositives
Classification par
ClassificationClassification
ClassificationArthur Charpentier
16.7K vues199 diapositives
Micro to macro passage in traffic models including multi-anticipation effect par
Micro to macro passage in traffic models including multi-anticipation effectMicro to macro passage in traffic models including multi-anticipation effect
Micro to macro passage in traffic models including multi-anticipation effectGuillaume Costeseque
108 vues28 diapositives
Locality-sensitive hashing for search in metric space par
Locality-sensitive hashing for search in metric space Locality-sensitive hashing for search in metric space
Locality-sensitive hashing for search in metric space Eliezer Silva
104 vues47 diapositives
Low Complexity Regularization of Inverse Problems par
Low Complexity Regularization of Inverse ProblemsLow Complexity Regularization of Inverse Problems
Low Complexity Regularization of Inverse ProblemsGabriel Peyré
1.3K vues56 diapositives

Contenu connexe

Similaire à Sponsored content in contextual bandits. Deconfounding targeting not at random

2017-03, ICASSP, Projection-based Dual Averaging for Stochastic Sparse Optimi... par
2017-03, ICASSP, Projection-based Dual Averaging for Stochastic Sparse Optimi...2017-03, ICASSP, Projection-based Dual Averaging for Stochastic Sparse Optimi...
2017-03, ICASSP, Projection-based Dual Averaging for Stochastic Sparse Optimi...asahiushio1
119 vues22 diapositives
ijcai09submodularity.ppt par
ijcai09submodularity.pptijcai09submodularity.ppt
ijcai09submodularity.ppt42HSQuangMinh
7 vues154 diapositives
Physics-driven Spatiotemporal Regularization for High-dimensional Predictive... par
 Physics-driven Spatiotemporal Regularization for High-dimensional Predictive... Physics-driven Spatiotemporal Regularization for High-dimensional Predictive...
Physics-driven Spatiotemporal Regularization for High-dimensional Predictive...Hui Yang
113 vues42 diapositives
Sequential Monte Carlo algorithms for agent-based models of disease transmission par
Sequential Monte Carlo algorithms for agent-based models of disease transmissionSequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmissionJeremyHeng10
38 vues52 diapositives
ppt0320defenseday par
ppt0320defensedayppt0320defenseday
ppt0320defensedayXi (Shay) Zhang, PhD
542 vues48 diapositives
main par
mainmain
mainDavid Mateos
194 vues75 diapositives

Similaire à Sponsored content in contextual bandits. Deconfounding targeting not at random(20)

2017-03, ICASSP, Projection-based Dual Averaging for Stochastic Sparse Optimi... par asahiushio1
2017-03, ICASSP, Projection-based Dual Averaging for Stochastic Sparse Optimi...2017-03, ICASSP, Projection-based Dual Averaging for Stochastic Sparse Optimi...
2017-03, ICASSP, Projection-based Dual Averaging for Stochastic Sparse Optimi...
asahiushio1119 vues
Physics-driven Spatiotemporal Regularization for High-dimensional Predictive... par Hui Yang
 Physics-driven Spatiotemporal Regularization for High-dimensional Predictive... Physics-driven Spatiotemporal Regularization for High-dimensional Predictive...
Physics-driven Spatiotemporal Regularization for High-dimensional Predictive...
Hui Yang113 vues
Sequential Monte Carlo algorithms for agent-based models of disease transmission par JeremyHeng10
Sequential Monte Carlo algorithms for agent-based models of disease transmissionSequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmission
JeremyHeng1038 vues
ESRA2015 course: Latent Class Analysis for Survey Research par Daniel Oberski
ESRA2015 course: Latent Class Analysis for Survey ResearchESRA2015 course: Latent Class Analysis for Survey Research
ESRA2015 course: Latent Class Analysis for Survey Research
Daniel Oberski4.7K vues
Query Answering in Probabilistic Datalog+/{ Ontologies under Group Preferences par Oana Tifrea-Marciuska
Query Answering in Probabilistic Datalog+/{ Ontologies under Group PreferencesQuery Answering in Probabilistic Datalog+/{ Ontologies under Group Preferences
Query Answering in Probabilistic Datalog+/{ Ontologies under Group Preferences
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI par Jack Clark
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Jack Clark2.9K vues
[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習 par Deep Learning JP
[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習
[DL輪読会]Hindsight Experience Replayを応用した再ラベリングによる効率的な強化学習
Deep Learning JP661 vues
Hierarchical Reinforcement Learning with Option-Critic Architecture par Necip Oguz Serbetci
Hierarchical Reinforcement Learning with Option-Critic ArchitectureHierarchical Reinforcement Learning with Option-Critic Architecture
Hierarchical Reinforcement Learning with Option-Critic Architecture
Linear Discriminant Analysis and Its Generalization par 일상 온
Linear Discriminant Analysis and Its GeneralizationLinear Discriminant Analysis and Its Generalization
Linear Discriminant Analysis and Its Generalization
일상 온3.6K vues
Applied machine learning for search engine relevance 3 par Charles Martin
Applied machine learning for search engine relevance 3Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3
Charles Martin1.7K vues

Plus de GRAPE

ENTIME_GEM___GAP.pdf par
ENTIME_GEM___GAP.pdfENTIME_GEM___GAP.pdf
ENTIME_GEM___GAP.pdfGRAPE
5 vues15 diapositives
Boston_College Slides.pdf par
Boston_College Slides.pdfBoston_College Slides.pdf
Boston_College Slides.pdfGRAPE
4 vues208 diapositives
Presentation_Yale.pdf par
Presentation_Yale.pdfPresentation_Yale.pdf
Presentation_Yale.pdfGRAPE
9 vues207 diapositives
Presentation_Columbia.pdf par
Presentation_Columbia.pdfPresentation_Columbia.pdf
Presentation_Columbia.pdfGRAPE
4 vues187 diapositives
Presentation.pdf par
Presentation.pdfPresentation.pdf
Presentation.pdfGRAPE
4 vues175 diapositives
Presentation.pdf par
Presentation.pdfPresentation.pdf
Presentation.pdfGRAPE
18 vues113 diapositives

Plus de GRAPE(20)

ENTIME_GEM___GAP.pdf par GRAPE
ENTIME_GEM___GAP.pdfENTIME_GEM___GAP.pdf
ENTIME_GEM___GAP.pdf
GRAPE5 vues
Boston_College Slides.pdf par GRAPE
Boston_College Slides.pdfBoston_College Slides.pdf
Boston_College Slides.pdf
GRAPE4 vues
Presentation_Yale.pdf par GRAPE
Presentation_Yale.pdfPresentation_Yale.pdf
Presentation_Yale.pdf
GRAPE9 vues
Presentation_Columbia.pdf par GRAPE
Presentation_Columbia.pdfPresentation_Columbia.pdf
Presentation_Columbia.pdf
GRAPE4 vues
Presentation.pdf par GRAPE
Presentation.pdfPresentation.pdf
Presentation.pdf
GRAPE4 vues
Presentation.pdf par GRAPE
Presentation.pdfPresentation.pdf
Presentation.pdf
GRAPE18 vues
Presentation.pdf par GRAPE
Presentation.pdfPresentation.pdf
Presentation.pdf
GRAPE16 vues
Slides.pdf par GRAPE
Slides.pdfSlides.pdf
Slides.pdf
GRAPE14 vues
Slides.pdf par GRAPE
Slides.pdfSlides.pdf
Slides.pdf
GRAPE16 vues
DDKT-Munich.pdf par GRAPE
DDKT-Munich.pdfDDKT-Munich.pdf
DDKT-Munich.pdf
GRAPE7 vues
DDKT-Praga.pdf par GRAPE
DDKT-Praga.pdfDDKT-Praga.pdf
DDKT-Praga.pdf
GRAPE11 vues
DDKT-Southern.pdf par GRAPE
DDKT-Southern.pdfDDKT-Southern.pdf
DDKT-Southern.pdf
GRAPE25 vues
DDKT-SummerWorkshop.pdf par GRAPE
DDKT-SummerWorkshop.pdfDDKT-SummerWorkshop.pdf
DDKT-SummerWorkshop.pdf
GRAPE15 vues
DDKT-SAET.pdf par GRAPE
DDKT-SAET.pdfDDKT-SAET.pdf
DDKT-SAET.pdf
GRAPE29 vues
The European Unemployment Puzzle: implications from population aging par GRAPE
The European Unemployment Puzzle: implications from population agingThe European Unemployment Puzzle: implications from population aging
The European Unemployment Puzzle: implications from population aging
GRAPE53 vues
Matching it up: non-standard work and job satisfaction.pdf par GRAPE
Matching it up: non-standard work and job satisfaction.pdfMatching it up: non-standard work and job satisfaction.pdf
Matching it up: non-standard work and job satisfaction.pdf
GRAPE20 vues
Investment in human capital: an optimal taxation approach par GRAPE
Investment in human capital: an optimal taxation approachInvestment in human capital: an optimal taxation approach
Investment in human capital: an optimal taxation approach
GRAPE22 vues
slides_cef.pdf par GRAPE
slides_cef.pdfslides_cef.pdf
slides_cef.pdf
GRAPE22 vues
Fertility, contraceptives and gender inequality par GRAPE
Fertility, contraceptives and gender inequalityFertility, contraceptives and gender inequality
Fertility, contraceptives and gender inequality
GRAPE24 vues
The European Unemployment Puzzle: implications from population aging par GRAPE
The European Unemployment Puzzle: implications from population agingThe European Unemployment Puzzle: implications from population aging
The European Unemployment Puzzle: implications from population aging
GRAPE50 vues

Dernier

Stabilizing Algorithmic Stablecoins: the TerraLuna case study par
Stabilizing Algorithmic Stablecoins: the TerraLuna case studyStabilizing Algorithmic Stablecoins: the TerraLuna case study
Stabilizing Algorithmic Stablecoins: the TerraLuna case studyFedericoCalandra1
6 vues118 diapositives
Debt Watch | ICICI Prudential Mutual Fund par
Debt Watch | ICICI Prudential Mutual FundDebt Watch | ICICI Prudential Mutual Fund
Debt Watch | ICICI Prudential Mutual Fundiciciprumf
20 vues2 diapositives
Embracing the eFarming Challenge.pdf par
Embracing the eFarming Challenge.pdfEmbracing the eFarming Challenge.pdf
Embracing the eFarming Challenge.pdframadhan04116
9 vues1 diapositive
QNBFS Daily Market Report November 29, 2023 par
QNBFS Daily Market Report November 29, 2023QNBFS Daily Market Report November 29, 2023
QNBFS Daily Market Report November 29, 2023QNB Group
10 vues9 diapositives
Motilal Oswal Small Cap Fund One Pager.pdf par
Motilal Oswal Small Cap Fund One Pager.pdfMotilal Oswal Small Cap Fund One Pager.pdf
Motilal Oswal Small Cap Fund One Pager.pdfmultigainfinancial
290 vues2 diapositives
Debt Watch | ICICI Prudential Mutual Fund par
Debt Watch | ICICI Prudential Mutual FundDebt Watch | ICICI Prudential Mutual Fund
Debt Watch | ICICI Prudential Mutual Fundiciciprumf
8 vues2 diapositives

Dernier(20)

Stabilizing Algorithmic Stablecoins: the TerraLuna case study par FedericoCalandra1
Stabilizing Algorithmic Stablecoins: the TerraLuna case studyStabilizing Algorithmic Stablecoins: the TerraLuna case study
Stabilizing Algorithmic Stablecoins: the TerraLuna case study
Debt Watch | ICICI Prudential Mutual Fund par iciciprumf
Debt Watch | ICICI Prudential Mutual FundDebt Watch | ICICI Prudential Mutual Fund
Debt Watch | ICICI Prudential Mutual Fund
iciciprumf20 vues
Embracing the eFarming Challenge.pdf par ramadhan04116
Embracing the eFarming Challenge.pdfEmbracing the eFarming Challenge.pdf
Embracing the eFarming Challenge.pdf
ramadhan041169 vues
QNBFS Daily Market Report November 29, 2023 par QNB Group
QNBFS Daily Market Report November 29, 2023QNBFS Daily Market Report November 29, 2023
QNBFS Daily Market Report November 29, 2023
QNB Group10 vues
Debt Watch | ICICI Prudential Mutual Fund par iciciprumf
Debt Watch | ICICI Prudential Mutual FundDebt Watch | ICICI Prudential Mutual Fund
Debt Watch | ICICI Prudential Mutual Fund
iciciprumf8 vues
The implementation of government subsidies and tax incentives to enhance the ... par Fardeen Ahmed
The implementation of government subsidies and tax incentives to enhance the ...The implementation of government subsidies and tax incentives to enhance the ...
The implementation of government subsidies and tax incentives to enhance the ...
Fardeen Ahmed6 vues
Stock Market Brief Deck 1129.pdf par Michael Silva
Stock Market Brief Deck 1129.pdfStock Market Brief Deck 1129.pdf
Stock Market Brief Deck 1129.pdf
Michael Silva56 vues
Supplier Sourcing presentation.pdf par AllenSingson
Supplier Sourcing presentation.pdfSupplier Sourcing presentation.pdf
Supplier Sourcing presentation.pdf
AllenSingson20 vues
Housing Discrimination in America.pptx par ecobbins1
Housing Discrimination in America.pptxHousing Discrimination in America.pptx
Housing Discrimination in America.pptx
ecobbins125 vues
Topic 37 copy.pptx par saleh176
Topic 37 copy.pptxTopic 37 copy.pptx
Topic 37 copy.pptx
saleh1765 vues
Indias Sparkling Future : Lab-Grown Diamonds in Focus par anujadeodhar4
Indias Sparkling Future : Lab-Grown Diamonds in FocusIndias Sparkling Future : Lab-Grown Diamonds in Focus
Indias Sparkling Future : Lab-Grown Diamonds in Focus
anujadeodhar49 vues
Digital4Climate-Leveraging Digital innovations & data for climate action par Soren Gigler
Digital4Climate-Leveraging Digital innovations & data for climate action Digital4Climate-Leveraging Digital innovations & data for climate action
Digital4Climate-Leveraging Digital innovations & data for climate action
Soren Gigler67 vues
Blockchain, AI & Metaverse for Football Clubs - 2023.pdf par kelroyjames1
Blockchain, AI & Metaverse for Football Clubs - 2023.pdfBlockchain, AI & Metaverse for Football Clubs - 2023.pdf
Blockchain, AI & Metaverse for Football Clubs - 2023.pdf
kelroyjames112 vues
List of Qataris Sanctioned by the U.S. Treasury Department for Links to Al-Qa... par aljazeeramasoom
List of Qataris Sanctioned by the U.S. Treasury Department for Links to Al-Qa...List of Qataris Sanctioned by the U.S. Treasury Department for Links to Al-Qa...
List of Qataris Sanctioned by the U.S. Treasury Department for Links to Al-Qa...
The breath of the investment grade and the unpredictability of inflation - Eu... par Antonis Zairis
The breath of the investment grade and the unpredictability of inflation - Eu...The breath of the investment grade and the unpredictability of inflation - Eu...
The breath of the investment grade and the unpredictability of inflation - Eu...
Antonis Zairis12 vues

Sponsored content in contextual bandits. Deconfounding targeting not at random

  • 1. Authoritarian Sponsor Deconfounding Experiment Conclusions References Sponsored content in contextual bandits. Deconfounding Targeting Not At Random MIUE 2023 Hubert Drążkowski GRAPE|FAME, Warsaw University of Technology September 22, 2023 Hubert Drążkowski GRAPE|FAME, Warsaw University of Technology Sponsored content in contextual bandits. Deconfounding Targeting Not At Random 1 / 26
  • 2. Authoritarian Sponsor Deconfounding Experiment Conclusions References Motivational examples Recommender systems • Suggest best ads/movies a ∈ {a1, a2, ...aK } • Users X1, X2, ...., XT • Design of the study {na1 , na2 , ..., naK }, P i nai = T • Measured satisfaction {Rt(a1), ...Rt(aK )}T t=1 Hubert Drążkowski GRAPE|FAME, Warsaw University of Technology Sponsored content in contextual bandits. Deconfounding Targeting Not At Random 2 / 26
  • 3. Authoritarian Sponsor Deconfounding Experiment Conclusions References The framework Exploration vs exploitation • Allocating limited resources under uncertainty • Sequential manner • Partial feedback (bandit feedback) • Adaptive (non-iid) data • Maximizing cumulative gain • Current actions do not change future environment Hubert Drążkowski GRAPE|FAME, Warsaw University of Technology Sponsored content in contextual bandits. Deconfounding Targeting Not At Random 3 / 26
  • 4. Authoritarian Sponsor Deconfounding Experiment Conclusions References The framework Exploration vs exploitation • Allocating limited resources under uncertainty • Sequential manner • Partial feedback (bandit feedback) • Adaptive (non-iid) data • Maximizing cumulative gain • Current actions do not change future environment Hubert Drążkowski GRAPE|FAME, Warsaw University of Technology Sponsored content in contextual bandits. Deconfounding Targeting Not At Random 3 / 26
  • 5. Authoritarian Sponsor Deconfounding Experiment Conclusions References The framework Exploration vs exploitation • Allocating limited resources under uncertainty • Sequential manner • Partial feedback (bandit feedback) • Adaptive (non-iid) data • Maximizing cumulative gain • Current actions do not change future environment Hubert Drążkowski GRAPE|FAME, Warsaw University of Technology Sponsored content in contextual bandits. Deconfounding Targeting Not At Random 3 / 26
  • 6. Authoritarian Sponsor Deconfounding Experiment Conclusions References The framework Exploration vs exploitation • Allocating limited resources under uncertainty • Sequential manner • Partial feedback (bandit feedback) • Adaptive (non-iid) data • Maximizing cumulative gain • Current actions do not change future environment Hubert Drążkowski GRAPE|FAME, Warsaw University of Technology Sponsored content in contextual bandits. Deconfounding Targeting Not At Random 3 / 26
  • 7. Authoritarian Sponsor Deconfounding Experiment Conclusions References The framework Exploration vs exploitation • Allocating limited resources under uncertainty • Sequential manner • Partial feedback (bandit feedback) • Adaptive (non-iid) data • Maximizing cumulative gain • Current actions do not change future environment Hubert Drążkowski GRAPE|FAME, Warsaw University of Technology Sponsored content in contextual bandits. Deconfounding Targeting Not At Random 3 / 26
  • 8. Authoritarian Sponsor Deconfounding Experiment Conclusions References The framework Exploration vs exploitation • Allocating limited resources under uncertainty • Sequential manner • Partial feedback (bandit feedback) • Adaptive (non-iid) data • Maximizing cumulative gain • Current actions do not change future environment Hubert Drążkowski GRAPE|FAME, Warsaw University of Technology Sponsored content in contextual bandits. Deconfounding Targeting Not At Random 3 / 26
The elements of the bandit model (see Lattimore and Szepesvári (2020))
• Context: $X_t \in \mathcal{X}$, with $X_t \sim D_X$
• Actions: $A_t \in \mathcal{A} = \{a_1, \ldots, a_K\}$, with $A_t \sim \pi_t(a|x)$
• Policy: $\pi \in \Pi$, where $\pi = \{\pi_t\}_{t=1}^{T}$ and $\pi_t : \mathcal{X} \mapsto \mathcal{P}(\mathcal{A})$, with $\mathcal{P}(\mathcal{A}) := \{q \in [0,1]^K : \sum_{a \in \mathcal{A}} q_a = 1\}$
• Rewards: $R_t \in \mathbb{R}_+$; the potential rewards are $(R(a_1), R(a_2), \ldots, R(a_K))$ and $R_t = \sum_{k=1}^{K} \mathbb{1}(A_t = a_k)\, R(a_k)$, with $R_t \sim D_{R|A,X}$
• History: $H_t \in \mathcal{H}_t$, where $H_t = \sigma\big(\{(X_s, A_s, R_s)\}_{s=1}^{t}\big)$
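To make the protocol concrete, here is a minimal sketch of one run through the model above. The uniform policy and the linear reward function are illustrative placeholders (the slides do not specify them); the point is only the order of events: draw $X_t \sim D_X$, draw $A_t \sim \pi_t(\cdot|x)$, observe only the chosen arm's reward, append to the history.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 1000        # number of arms, horizon
history = []          # H_t = sigma({(X_s, A_s, R_s)}_{s<=t})

def policy(x, history):
    """pi_t(.|x): a point in P(A). Placeholder: uniform exploration;
    a real bandit algorithm would compute this from the history."""
    return np.full(K, 1.0 / K)

def draw_reward(x, a):
    """Stand-in for D_{R|A,X}: arbitrary mean reward plus noise."""
    return 0.1 * a + x + rng.normal()

for t in range(T):
    x = rng.normal()              # X_t ~ D_X
    p = policy(x, history)        # pi_t(.|x)
    a = rng.choice(K, p=p)        # A_t ~ pi_t(.|x)
    r = draw_reward(x, a)         # bandit feedback: only R_t(A_t) is observed
    history.append((x, a, r))     # update H_t
```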
Details
• In short, $(X_t, A_t, R_t) \sim D(\pi_t)$
• We know $\pi_t(a|x)$ (the propensity score)
• We do not know $D_{X, \vec{R}}$
• We have $\mathbb{1}(A_t = a) \perp\!\!\!\perp R(a) \mid X_t$
• We want to find the $\pi$ maximizing $\mathbb{E}_{D(\pi)}\Big[\sum_{t=1}^{T} R_t(A_t)\Big]$
The flow of information (figure)
Inverse Gap Weighting (figure)
Outline
1 Authoritarian Sponsor
2 Deconfounding
3 Experiment
4 Conclusions
Authoritarian Sponsor model
• The act of sponsoring:
  • Recommender systems: marketing campaigns, testing products
  • Healthcare: funding experiments, lobbying doctors
• The sponsor $(\mathcal{E}, \mathcal{H})$ intervenes in an authoritarian manner:
  $A_t = S_t \tilde{A}_t + (1 - S_t)\bar{A}_t$, with $S_t \in \{0, 1\}$, $S_t \sim \mathcal{E}_t(\cdot|X)$, $\bar{A}_t \sim \pi_t(\cdot|X)$, $\tilde{A}_t \sim \mathcal{H}_t(\cdot|X)$,
  so the action actually played follows the mixture $\check{\pi}_t(a|x) = \mathcal{E}_t(1|x)\, \mathcal{H}_t(a|x) + \mathcal{E}_t(0|x)\, \pi_t(a|x)$ (see the sketch below).
• The sponsor's policy $(\mathcal{E}, \mathcal{H})$ is unknown to the learner:
  • Technology and strategy are not shared
  • Detail is lost in human-to-algorithm translation
  • Processes such as auctions are hard to model
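A sketch of the intervention mechanism, under assumed toy choices for the sponsor pair $(\mathcal{E}, \mathcal{H})$: the constant gate probability and the fixed preference over arms are invented for illustration, since by construction the learner never observes these objects.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3

def learner_policy(x):
    """pi_t(.|x): known to the learner."""
    return np.full(K, 1.0 / K)

def sponsor_gate(x):
    """E_t(1|x): probability the sponsor intervenes. Unknown in practice;
    a constant is assumed here purely for illustration."""
    return 0.4

def sponsor_policy(x):
    """H_t(.|x): the sponsor's action distribution. Also unknown;
    a fixed preference for arm 0 is assumed here."""
    return np.array([0.7, 0.2, 0.1])

def play(x):
    """A_t = S_t * A~_t + (1 - S_t) * A-_t, so the played action follows
    the mixture E_t(1|x) H_t(.|x) + E_t(0|x) pi_t(.|x)."""
    s = rng.random() < sponsor_gate(x)                 # S_t ~ E_t(.|x)
    if s:
        return rng.choice(K, p=sponsor_policy(x)), s   # A~_t ~ H_t(.|x)
    return rng.choice(K, p=learner_policy(x)), s       # A-_t ~ pi_t(.|x)
```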
Targeting mechanisms
Introducing an unobserved confounder $Z$:
1 Targeting Completely At Random (TCAR): $S(X) = S$ and $\mathcal{H}(a|X, R, Z) = \mathcal{H}(a)$; the analogue of MCAR
2 Targeting At Random (TAR): $S(X)$ may depend on $X$ and $\mathcal{H}(a|X, R, Z) = \mathcal{H}(a|X)$; the analogue of MAR
3 Targeting Not At Random (TNAR): $\mathcal{H}(a|X, R, Z)$ depends on $(R, Z)$, so that $R(a) \not\perp\!\!\!\perp A \mid X, S = 1$; the analogue of MNAR (see the sketch below)
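The taxonomy can be read directly off what the sponsor's action distribution is allowed to condition on. A small sketch for two arms, with invented logistic forms standing in for $\mathcal{H}$:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def h_tcar():
    """TCAR: H(a) ignores X, R, Z entirely (like MCAR)."""
    return np.array([0.5, 0.5])

def h_tar(x):
    """TAR: H(a|X) may use the observed context only (like MAR)."""
    p1 = sigmoid(x)
    return np.array([1.0 - p1, p1])

def h_tnar(x, z):
    """TNAR: H(a|X, Z) also uses the unobserved confounder Z (like MNAR).
    If Z drives rewards too, then R(a) is no longer independent of A
    given X on the sponsored rounds (S = 1)."""
    p1 = sigmoid(x + 2.0 * z)
    return np.array([1.0 - p1, p1])
```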
Causal interpretation
Figure 1: TCAR · Figure 2: TAR · Figure 3: TNAR (causal graphs for the three targeting mechanisms)
Section 2: Deconfounding
Data fusion (see Colnet et al. (2020))

                    RCT     OS
Internal validity   ✓       ✗
External validity   ✗       ✓
Propensity score    known   ?

Table 1: Differences and similarities between data sources
Data fusion (see Colnet et al. (2020))

                    RCT     OS    Learner   Sponsor
Internal validity   ✓       ✗     ✓         ✗
External validity   ✗       ✓     ∼         ∼
Propensity score    known   ?     known     ?

Table 2: Differences and similarities between data sources

• Unsolved challenge: sampling in interaction!
CATE
• CATE: $\tau_{a_1,a_2}(x) = \mathbb{E}_{D_{R|A,X=x}}[R(a_1) - R(a_2)]$, estimated by $\hat{\tau}_{a_1,a_2}(x) = \hat{\mu}_{a_1}(x) - \hat{\mu}_{a_2}(x)$
• Assumptions:
  • SUTVA: $R_t = \sum_{a \in \mathcal{A}} \mathbb{1}(A_t = a)\, R_t(a)$
  • Ignorability: $\mathbb{1}(A_t = a) \perp\!\!\!\perp R(a) \mid X_t, S_t = 0$
  • Ignorability of the study participation: $R_t(a) \perp\!\!\!\perp S_t \mid X_t$
  • TNAR: $R(a) \not\perp\!\!\!\perp A \mid X, S = 1$
• Biased CATE on the sponsor sample: $\rho_{a_1,a_2}(x) = \mathbb{E}[R \mid A = a_1, X = x, S = 1] - \mathbb{E}[R \mid A = a_2, X = x, S = 1]$
• Bias measurement: $\eta_{a_1,a_2}(x) = \tau_{a_1,a_2}(x) - \rho_{a_1,a_2}(x)$
Two step deconfounding (see Kallus et al. (2018)), $\mathcal{A} = \{a_0, a_1\}$
1 On the observational (sponsor) sample, use a metalearner to obtain $\hat{\rho}_{a_1,a_0}(X)$.
2 Postulate a function $q_t(X, a_0)$ such that $\mathbb{E}[q_t(X, a_0)\, R \mid X = x, S = 0] = \tau_{a_1,a_0}(x)$, namely
  $q_t(X, a_0) = \dfrac{\mathbb{1}(A = a_1)}{\pi_t(a_1|X)} - \dfrac{\mathbb{1}(A = a_0)}{\pi_t(a_0|X)}$;
  this works because the learner sample is unconfounded and the propensities $\pi_t$ are known, so $q_t R$ is an unbiased IPW signal for the true CATE.
3 Using $q_t(X, a_0)$, apply the definition of $\eta_{a_1,a_0}(x)$ to adjust the $\hat{\rho}$ term by solving an optimization problem on the unconfounded sample:
  $\hat{\eta}_{a_1,a_0} = \arg\min_{\eta} \sum_{t : S_t = 0} \big(q_t(x_t, a_0)\, r_t - \hat{\rho}_{a_1,a_0}(x_t) - \eta(x_t)\big)^2$.
4 Finally, $\hat{\tau}_{a_1,a_0}(x) = \hat{\rho}_{a_1,a_0}(x) + \hat{\eta}_{a_1,a_0}(x)$ (a sketch follows).
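A minimal sketch of the four steps for binary actions, with a random forest standing in for the metalearner in step 1 and for the regression in step 3 (the slides leave both model choices open):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def two_step_deconfounding(x_s, a_s, r_s,          # sponsor sample (S = 1), confounded
                           x_u, a_u, r_u, pi_u):   # learner sample (S = 0), unconfounded
    """Sketch of Kallus et al. (2018) for A = {0, 1}.
    x_s, x_u: 2-D context arrays; a_*, r_*: 1-D action/reward arrays;
    pi_u[t] = pi_t(1 | x_u[t]) is the learner's known propensity."""
    # Step 1: biased CATE rho_hat on the sponsor sample (here: a T-learner).
    mu1 = RandomForestRegressor().fit(x_s[a_s == 1], r_s[a_s == 1])
    mu0 = RandomForestRegressor().fit(x_s[a_s == 0], r_s[a_s == 0])
    rho_hat = lambda x: mu1.predict(x) - mu0.predict(x)

    # Step 2: IPW pseudo-outcome q * R, unbiased for tau(x) on S = 0.
    q = a_u / pi_u - (1 - a_u) / (1 - pi_u)
    pseudo = q * r_u

    # Step 3: regress the residual on x to estimate the bias eta_hat.
    eta_model = RandomForestRegressor().fit(x_u, pseudo - rho_hat(x_u))

    # Step 4: corrected CATE tau_hat = rho_hat + eta_hat.
    return lambda x: rho_hat(x) + eta_model.predict(x)
```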
Deconfounded CATE IGW (D-CATE-IGW)
• Let $b = \arg\max_a \hat{\mu}^m_a(x_t)$. Then
  $\pi(a|x) = \dfrac{1}{K + \gamma_m\big(\hat{\mu}^m_b(x) - \hat{\mu}^m_a(x)\big)} = \dfrac{1}{K + \gamma_m \hat{\tau}_{b,a}(x)}$ for $a \neq b$, and $\pi(b|x) = 1 - \sum_{c \neq b} \pi(c|x)$.
• Each round/epoch, deconfound the CATE (sketch below).
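A sketch of the action probabilities, assuming the gaps are computed from the current deconfounded estimates; the function name and signature are invented here:

```python
import numpy as np

def igw_probabilities(mu_hat, gamma):
    """Inverse Gap Weighting over K arms for one context x.
    mu_hat: length-K array of current estimates mu^m_a(x); in D-CATE-IGW,
    mu_hat[b] - mu_hat[a] is the deconfounded tau_hat_{b,a}(x).
    gamma: the exploration parameter gamma_m."""
    K = len(mu_hat)
    b = int(np.argmax(mu_hat))            # greedy arm
    gaps = mu_hat[b] - mu_hat             # nonnegative gaps tau_hat_{b,a}(x)
    p = 1.0 / (K + gamma * gaps)          # probability shrinks with the gap
    p[b] = 0.0
    p[b] = 1.0 - p.sum()                  # leftover mass goes to the greedy arm
    return p
```

Each off-greedy arm receives at most $1/K$, so the leftover mass for $b$ is always nonnegative, and a larger $\gamma_m$ means greedier play.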
Section 3: Experiment
Setup I
• $S_t \sim \text{Bern}(\rho)$
• No-overlap scenario: $X_t \mid S_t = 0 \sim \text{Unif}([-1, 1])$, $U_t \mid S_t = 0 \sim N(0, 1)$
• Full-overlap scenario: $X_t \mid S_t = 0 \sim N(0, 1)$, $U_t \mid S_t = 0 \sim N(0, 1)$
• On the sponsor sample:
  $\begin{pmatrix} X_t \\ U_t \end{pmatrix} \Big|\, \{A_t, S_t = 1\} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & (2A_t - 1)\sigma_A \\ (2A_t - 1)\sigma_A & 1 \end{pmatrix} \right)$
• $\sigma_A \in \{0.6, 0.9\}$, $\rho \in \{0.3, 0.6\}$
• Rewards: $R_t(A_t) = 1 + A_t + X_t + 2 A_t X_t + \tfrac{1}{2} X_t^2 + \tfrac{3}{4} A_t X_t^2 + 2 U_t + \tfrac{1}{2}\epsilon_t$, where $\epsilon_t \sim N(0, 1)$
• True CATE: $\tau(X_t) = \tfrac{3}{4} X_t^2 + 2 X_t + 1$ (a sketch of this process follows)
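A sketch of this data-generating process. The sponsor's own action rule on $S_t = 1$ rounds is not specified on the slide, so a fair coin stands in for it here (as it does for the learner, whose actual policy would be the bandit algorithm under test):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(T, rho=0.3, sigma_a=0.6, full_overlap=True):
    """Setup I: returns (X, U, A, S, R) for T rounds."""
    S = rng.random(T) < rho                      # S_t ~ Bern(rho)
    X = np.empty(T); U = np.empty(T); A = np.empty(T, dtype=int)
    for t in range(T):
        if S[t]:                                 # sponsored round: draw A first,
            A[t] = rng.integers(2)               # then (X, U) correlated through A
            c = (2 * A[t] - 1) * sigma_a
            X[t], U[t] = rng.multivariate_normal([0, 0], [[1, c], [c, 1]])
        else:                                    # learner round
            X[t] = rng.normal() if full_overlap else rng.uniform(-1, 1)
            U[t] = rng.normal()
            A[t] = rng.integers(2)               # placeholder for pi_t(.|X)
    eps = rng.normal(size=T)
    R = (1 + A + X + 2 * A * X + 0.5 * X**2
         + 0.75 * A * X**2 + 2 * U + 0.5 * eps)
    return X, U, A, S, R

def true_cate(x):
    return 0.75 * x**2 + 2 * x + 1
```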
Result I
Figure 4: Normed cumulative regret for different scenarios
Result II
Figure 5: True and estimated CATE values for different scenarios
Section 4: Conclusions
Contribution
1 A pioneering model of sponsored content in the contextual-bandit framework
2 Bandits treated not as experimental studies but as observational studies
3 A confounding scenario and an application of deconfounding
4 D-CATE-IGW works
Future research
• Theoretical:
  • Mathematically model the complicated sampling, especially the flow of information
  • A consistency proof for the CATE estimator in this scenario
  • High-probability regret bounds for D-CATE-IGW: $P\big(\text{REWARD}(\pi) \geq \text{BOUND}(\delta)\big) \geq 1 - \delta$
• Empirical:
  • More metalearners (X-learner, R-learner) (see Künzel et al. (2019))
  • Other deconfounding methods (see Wu and Yang (2022))
  • A more comprehensive empirical study
• Expansion:
  • Policy evaluation: $V(\pi) = \mathbb{E}_X \mathbb{E}_{A \sim \pi(\cdot|X)} \mathbb{E}_{R|A,X}[R]$, estimated by $\hat{V}_t(\pi)$ on $\{(X_s, A_s, R_s)\}_{s=1}^{t} \sim D(\check{\pi})$, data logged under the sponsored, effective policy (a sketch follows)
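For reference, the standard inverse-propensity estimator of $V(\pi)$ from logged data, which is precisely what breaks under sponsorship: the weights require the effective logging propensities $\check{\pi}_s(a|x)$, and the unknown sponsor makes these unavailable. A minimal sketch; the function and its arguments are invented for illustration:

```python
import numpy as np

def ips_value(x, a, r, logged_prop, target_pi):
    """V_hat_t(pi): inverse-propensity estimate of the target policy's value
    from logged tuples {(X_s, A_s, R_s)}_{s<=t}.
    logged_prop[s]: probability the logging policy gave to the action taken;
    under sponsorship this is check_pi_s(a_s|x_s), which is unknown.
    target_pi(x, a): the evaluated policy's probability of action a at x."""
    w = np.array([target_pi(x[s], a[s]) for s in range(len(a))]) / logged_prop
    return float(np.mean(w * r))
```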
The beginning ...
References
Colnet, B., I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang (2020). Causal inference methods for combining randomized trials and observational studies: a review. arXiv preprint arXiv:2011.08047.
Kallus, N., A. M. Puli, and U. Shalit (2018). Removing hidden confounding by experimental grounding. Advances in Neural Information Processing Systems 31.
Künzel, S. R., J. S. Sekhon, P. J. Bickel, and B. Yu (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences 116(10), 4156–4165.
Lattimore, T. and C. Szepesvári (2020). Bandit Algorithms. Cambridge University Press.
Wu, L. and S. Yang (2022). Integrative R-learner of heterogeneous treatment effects combining experimental and observational studies. In Conference on Causal Learning and Reasoning, pp. 904–926. PMLR.