Python and the Holy Grail of Causal Inference - Dennis Ramondt, Huib Keemink

[Diagram: Glass of wine a day, Health, Income]

Experiments you thought were good can still be invalid
Experiments you thought were bad can still be valid

Randomized testing: the set-up
• Random subsample of population is chosen
• Sample is randomly split into two groups: INTERVENTION and CONTROL
• Intervention: the same for all participants
• Outcome in both groups is measured (legend: no change / improved outcome)
• Result: AVERAGE TREATMENT EFFECT
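
The set-up above can be sketched as a small simulation. This is a minimal illustration with made-up numbers, not data from the talk: the population, the sample size and `true_effect` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: every unit has a baseline outcome.
n_population = 100_000
baseline = rng.normal(loc=10.0, scale=3.0, size=n_population)
true_effect = 2.0  # assumed effect of the intervention

# 1. A random subsample of the population is chosen.
sample_idx = rng.choice(n_population, size=5_000, replace=False)

# 2. The sample is randomly split into intervention and control.
in_intervention = rng.random(sample_idx.size) < 0.5

# 3. The intervention is the same for all participants who receive it;
#    the outcome is measured in both groups.
outcome = baseline[sample_idx] + true_effect * in_intervention

# 4. Average treatment effect = difference between the two group means.
ate_estimate = outcome[in_intervention].mean() - outcome[~in_intervention].mean()
print(round(ate_estimate, 2))
```

Because assignment is random, the two groups have the same baseline on average, so the simple difference in means recovers the assumed effect up to sampling noise.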

USE CASE: heat pump savings @ Eneco

Measurement data: daily gas usage ~ outside temperature
[Plot: gas usage (m3) against average outside temperature (°C)]
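
A relationship like this can be used to extrapolate a few months of measurements to a full year. The sketch below is an assumption-laden illustration, not the model used at Eneco: the data are synthetic, and the heating threshold of 18 °C, the linear form and the yearly temperature profile are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def heating_degrees(temp_c, base_c=18.0):
    """Degrees below the (assumed) heating threshold; zero when it is warm."""
    return np.clip(base_c - np.asarray(temp_c), 0.0, None)

# Synthetic stand-in for a few months of daily measurements (not Eneco data).
temp_measured = rng.uniform(-2.0, 15.0, size=120)  # average outside temperature (°C)
gas_measured = 0.6 * heating_degrees(temp_measured) + rng.normal(0, 0.5, size=120)  # m3/day

# Fit gas usage ~ heating degrees by least squares.
slope, intercept = np.polyfit(heating_degrees(temp_measured), gas_measured, deg=1)

# Extrapolate to a full year with a hypothetical daily temperature profile
# (cold in winter, warm in summer, between 2 °C and 18 °C).
day = np.arange(365)
temp_year = 10.0 - 8.0 * np.cos(2 * np.pi * day / 365)
yearly_gas = np.sum(intercept + slope * heating_degrees(temp_year))
print(round(slope, 2), round(yearly_gas))
```

Fitting the same kind of model per household for test and control would give comparable yearly figures even when each household was only measured for part of the year.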

The experiment in the randomized test framework
• Sample is based on “friendly users”: Eneco employees, early adopters and energy enthusiasts
• Rental homes are excluded from the study
• Participation is initiated by customer
• Outcome: average yearly gas savings
• Placements over many months
• Changes made to intervention halfway through study
[Diagram labels: INTERVENTION, CONTROL, AVERAGE TREATMENT EFFECT]

Fixing group imbalance: match test and control
Available covariates:
• House size (m2)
• Building type (terraced, apartment, detached, semi-detached)
• Construction period (<1946, 1946-1965, …, > 2010)
• Number of inhabitants (1, 2, 3, 4, 5+)
Number of possibilities: 10 x 4 x 6 x 5 = 1200
Our sample population is only 2500, so exact matches are infeasible → partial matching
Propensity Score Matching

Propensity score matching – concept
• Calculate chance of receiving treatment given X (house type, etc.)
• Match test subject to k control subjects on this probability
  [Diagram: test subject A at 38%; other propensities shown: 83%, 39%, 41%, 12%, 22%]
• Calculate effect for test and (matched) control: test −500 m3, matched controls −20 m3 on average, effect −480 m3
• Repeat for all participants → average effect over test group
RUN
AWAY!
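
The matching concept can be sketched with scikit-learn. This is an illustration under stated assumptions rather than the speakers' implementation: the households are synthetic, the propensity model is a plain logistic regression, `k = 5` is arbitrary, and the assumed true effect of −480 m3 only echoes the number on the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
n = 2500
true_effect = -480.0  # assumed yearly gas change caused by treatment (m3)

# Synthetic households: larger houses are more likely to be in the test group
# and also use more gas, which biases a naive test-vs-control comparison.
house_size = rng.uniform(50, 250, size=n)
inhabitants = rng.integers(1, 6, size=n)
treated = rng.random(n) < 1 / (1 + np.exp(-(house_size - 150) / 40))
y = 2.0 * house_size + true_effect * treated + rng.normal(0, 30, size=n)

# Step 1: chance of receiving treatment given X.
X = np.column_stack([house_size, inhabitants])
propensity = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

# Step 2: match each test subject to the k controls with the closest propensity.
k = 5
test_idx = np.flatnonzero(treated)
control_idx = np.flatnonzero(~treated)
nn = NearestNeighbors(n_neighbors=k).fit(propensity[control_idx].reshape(-1, 1))
_, neighbours = nn.kneighbors(propensity[test_idx].reshape(-1, 1))

# Step 3: effect per test subject = own outcome minus mean of matched controls.
effects = y[test_idx] - y[control_idx][neighbours].mean(axis=1)

# Step 4: average over the test group.
att = effects.mean()
naive = y[treated].mean() - y[~treated].mean()
print(round(naive), round(att))
```

The result is an average effect over the test group only; it says nothing about households that look unlike any test subject.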

Recap heat pump use case
• Experiment fails (almost) all standard assumptions
• Each of the “faults” can be corrected
• Measure months, need year → extrapolate with model
• Bias in test group → match with equally biased control using propensity
• Outcome: average effect over test group, not whole population
• We cannot say anything about rental households without making additional assumptions

USE CASE: effect of cooler placement @ HEINEKEN

POPULATION
• 13K off-trade* outlets
• Selling HEINEKEN beer brands
• May receive cooler
* Small to medium shops, e.g. mom and pop shops, groceries and kiosks; not retail
• Pool for ‘experiment’ is all outlets, sample is the population
• Observational approach: coolers are already placed
• Gold outlets have a higher probability of getting a cooler than others
• Need effect on individual outlets, to prioritize future placements
• Outcome: yearly profit** uplift
• Placements over many years, movements not tracked → sales before/after unknown
[Diagram labels: INTERVENTION, CONTROL, The same for all participants, AVERAGE TREATMENT EFFECT]
** Profit is measured as FGP/hl, a company-wide calculation of profit per hl sales

Problem 1: test and control group are statistically different
Distribution of relevant characteristics* is different between test and control
Fig. Histograms showing the distribution of total profit per outlet, when broken down by ranking and cooler setup
* A relevant characteristic is one that influences the probability of being selected for treatment

Problem 1: test and control group are statistically different
Distribution of relevant characteristics* is different between test and control
• Outlet ranking (gold, silver, bronze)
• Outlet sub-channel (kiosk, grocery, convenience, etc.)
• Outlet area type (city, urban, village)
• Area (name of neighborhood)
• Seasonality (is outlet only open in summer)
• Sales rep visits per month
• Volume of competitor vs HEINEKEN sales
• Number of assortment deals with HEINEKEN
• Amount of investment by HEINEKEN
• Number of HEINEKEN branding materials
• Census demographics in km2 (population, age, gender)
• Google Maps metrics in 500m2 (average venue rating, # venues with photo, # of unique venue types, average venue opening times)
* A relevant characteristic is one that influences the probability of being selected for treatment

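import numpy as np
import pandas as pd

n = 10000  # outlets per ranking group (assumed; not stated on the slide)

# Non-gold outlets: baseline profit 20, 1/3 chance of a cooler, cooler adds 3
data_nongold = pd.DataFrame({
    'y_profit': 20 + 5*np.random.randn(n),
    'X_gold': 0,
    'w_cooler': np.random.choice([0, 1], size=(n,), p=[2./3, 1./3])
}).assign(y_profit=lambda df: np.where(df.w_cooler, df.y_profit + 3, df.y_profit))
# Gold outlets: baseline profit 25, 2/3 chance of a cooler, cooler adds 5
data_gold = pd.DataFrame({
    'y_profit': 25 + 5*np.random.randn(n),
    'X_gold': 1,
    'w_cooler': np.random.choice([0, 1], size=(n,), p=[1./3, 2./3])
}).assign(y_profit=lambda df: np.where(df.w_cooler, df.y_profit + 5, df.y_profit))
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
data = pd.concat([data_nongold, data_gold], ignore_index=True)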

The need for effect correction – staging an experiment
Definition: conditional mean
Mean of y for given values of X, i.e. average of one variable as a function of some other variables
$E[Y \mid X] = X\beta$
Effect = mean treated – mean untreated
$E[Y \mid w = 1] - E[Y \mid w = 0] = 27.70 - 21.66 = 6.04$ ??

Only gold:
$ATE_{ins} = E[Y \mid X = 1, w = 1] - E[Y \mid X = 1, w = 0] = 30.07 - 24.90 = 5.17$
Only non-gold:
$ATE_{nonins} = E[Y \mid X = 0, w = 1] - E[Y \mid X = 0, w = 0] = 22.96 - 20.00 = 2.96$

What would be the effect if all the imbalance in treatment caused by gold ranking is removed?
50% of outlets are gold; if the probability of placement were equal for all of them, the effect would be ...
$ATE = E[Y \mid X, w = 1] - E[Y \mid X, w = 0] = 4.06$
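
The three numbers (naive effect, effect per ranking, balanced effect) can be reproduced by hand before reaching for a regression. The sketch below re-simulates the staged experiment; the group size `n`, the seed and the helper `simulate` are assumptions for illustration, so the printed values will be close to, but not identical to, those on the slides.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000  # outlets per ranking group (assumed)

def simulate(gold, base, effect, p_cooler):
    """One ranking group: baseline profit, cooler probability, cooler effect."""
    w = rng.random(n) < p_cooler
    return pd.DataFrame({
        "X_gold": gold,
        "w_cooler": w.astype(int),
        "y_profit": base + 5 * rng.standard_normal(n) + effect * w,
    })

data = pd.concat([
    simulate(gold=0, base=20, effect=3, p_cooler=1 / 3),
    simulate(gold=1, base=25, effect=5, p_cooler=2 / 3),
], ignore_index=True)

# Naive effect: mean treated minus mean untreated, ignoring ranking.
means = data.groupby("w_cooler")["y_profit"].mean()
naive = means.loc[1] - means.loc[0]

# Effect within each ranking: E[Y | X, w=1] - E[Y | X, w=0].
cell = data.groupby(["X_gold", "w_cooler"])["y_profit"].mean().unstack("w_cooler")
ate_by_rank = cell[1] - cell[0]

# Weight each ranking by its share of outlets (50% gold here).
share = data["X_gold"].value_counts(normalize=True)
ate = (ate_by_rank * share).sum()
print(round(naive, 2), ate_by_rank.round(2).to_dict(), round(ate, 2))
```

The naive estimate is inflated because gold outlets both earn more and receive coolers more often; weighting the per-ranking effects by the ranking shares removes that imbalance.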

Procedure
With $\bar{X}$ the sample mean of the covariates, fit the regression
$Y$ on $1, w, X, w(X - \bar{X})$
and the coefficient on $w$ will be the average treatment effect

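from sklearn.linear_model import LinearRegression

# Interact the treatment with the demeaned covariate, so that the
# coefficient on w_cooler is the average treatment effect
data_reg = data.assign(
    demeaned_interaction=lambda df:
        df.w_cooler * (df.X_gold - df.X_gold.mean())
)
lm_all = LinearRegression()
lm_all.fit(
    data_reg[['X_gold', 'demeaned_interaction', 'w_cooler']],
    data_reg.y_profit
)
lm_all.coef_[2]  # coefficient on w_cooler
4.0637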

Estimating the ATE with regression – assumptions
Conditional mean independence
Mean dependence between treatment assignment w and treatment-specific outcomes $Y_i$ can be removed by conditioning on some variables X, provided that they are observable (AKA weak ignorability)
$E[Y_i \mid X, w] = E[Y_i \mid X]$ for $i \in \{0, 1\}$

Individual treatment effect estimation – assumptions
Many approaches exist, but most of your bias will be due to not observing enough confounders X!
Conditional independence
Any dependence between treatment assignment w and treatment-specific outcomes $Y_i$ can be removed by conditioning on some variables X, provided that they are observable (AKA strong ignorability)
$(Y_0, Y_1) \perp\!\!\!\perp w \mid X$

Estimating ITE with Virtual Twins*
[Diagram: decision tree on Sales with nodes Rating = Bronze/Silver, Rating = Gold, Cooler = 0, Cooler = 1 and leaf values €2000, €3000]
Procedure
• Fit a tree ensemble with target Y and features X, w, and interactions** between X and w
• Predict all units with w=1, predict all units with w=0
• Subtract to get $\tau_{ite,i} = m_1(X_i) - m_0(X_i)$
• Early stopping and OOB predictions reduce overfitting, quantile objective can help to trim outliers
* Foster, J. C., Taylor, J. M., and Ruberg, S. J. (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880.
** Scaling like we did with the linear ATE estimator is generally not needed with tree-based estimators
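
The Virtual Twins procedure can be sketched with a random forest from scikit-learn. This is a minimal illustration under assumptions, not the model built at HEINEKEN: the outlets are synthetic, `visits` is a hypothetical covariate, the helper `design` is made up for the example, and early stopping, OOB predictions and a quantile objective are left out for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 4000

# Synthetic outlets: gold outlets get coolers more often and benefit more.
gold = rng.integers(0, 2, size=n)
visits = rng.uniform(0, 4, size=n)  # hypothetical covariate (sales rep visits)
w = (rng.random(n) < np.where(gold == 1, 2 / 3, 1 / 3)).astype(int)
true_ite = np.where(gold == 1, 5.0, 3.0)  # assumed individual effects
y = 20 + 5 * gold + 0.5 * visits + true_ite * w + rng.normal(0, 1, size=n)

def design(gold, visits, w):
    """Features X, treatment w, and the interactions between X and w."""
    return np.column_stack([gold, visits, w, gold * w, visits * w])

# Fit one ensemble with target Y on X, w and the X*w interactions.
forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
forest.fit(design(gold, visits, w), y)

# Predict every unit twice: once as if treated, once as if untreated.
m1 = forest.predict(design(gold, visits, np.ones(n, dtype=int)))
m0 = forest.predict(design(gold, visits, np.zeros(n, dtype=int)))

# The difference between the two "twins" is the estimated individual effect.
ite = m1 - m0
print(round(ite[gold == 1].mean(), 2), round(ite[gold == 0].mean(), 2))
```

Sorting outlets without a cooler by their estimated `ite` would give a simple priority list for future placements, which is the stated goal of the use case.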

Overview
Fig. Model predicted profit versus actual profit, by cooler type (all outlets)

Coolers to consider
cooler type (outlets within 90% confidence interval)

Coolers to upgrade
cooler type (outlets to upgrade / install)

• Your perfect experiment is likely ruined by harsh reality
• But you may be able to fix it:
  • Propensity score matching
  • Average and individual treatment effect estimation
• Make sure you collect enough data:
  • When is the treatment done?
  • Measure Y before and after experiment
  • What covariates X influence both treatment w and outcome Y?

Looking for:
• Senior Data Scientist
• Senior Data Engineer
Contact: ciaran.jetten@heineken.com

Estimating ITE with Honest RF*
[Diagram: tree on Cooler 1/0 with nodes Rating = Bronze/Silver and Rating = Gold]
$E[Y \mid w = 1] - E[Y \mid w = 0]$: €3000 − €2000 = €1000
Procedure
• Fit a tree ensemble with target w and features X, with constraint of minimum k units per class in each DT leaf
• Per leaf K in each DT, calculate mean difference in Y between treatment and control units to get
$\tau_{ite,i} = N^{-1} \sum_{j=1}^{N} [Y_{j1} - Y_{j0}]$ for $i \in K$ and $j \in K$
* Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353-7360.
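
The leaf-wise calculation described in the procedure can be approximated with scikit-learn. This sketch makes several simplifying assumptions that should be read as such: the data are synthetic, the per-class constraint is imitated by a large `min_samples_leaf` plus a check that skips leaves with fewer than `k` units of either class, and the "honest" sample splitting of Athey and Imbens (separate data for building the tree and for estimating effects) is not implemented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 6000

# Synthetic outlets (assumed): effect of a cooler is 5 for gold, 3 otherwise.
gold = rng.integers(0, 2, size=n)
visits = rng.uniform(0, 4, size=n)  # hypothetical covariate
X = np.column_stack([gold, visits])
w = (rng.random(n) < np.where(gold == 1, 2 / 3, 1 / 3)).astype(int)
y = 20 + 5 * gold + 0.5 * visits + np.where(gold == 1, 5.0, 3.0) * w + rng.normal(0, 1, size=n)

# Tree ensemble with target w and features X; large leaves keep both classes present.
forest = RandomForestClassifier(
    n_estimators=50, min_samples_leaf=200, max_features=None, random_state=0
)
forest.fit(X, w)
leaves = forest.apply(X)  # leaf index per unit and per tree, shape (n, n_trees)

k = 10  # minimum number of treated and of control units required in a leaf
per_tree = np.full(leaves.shape, np.nan)
for t in range(leaves.shape[1]):
    for leaf in np.unique(leaves[:, t]):
        in_leaf = leaves[:, t] == leaf
        treated = in_leaf & (w == 1)
        control = in_leaf & (w == 0)
        if treated.sum() >= k and control.sum() >= k:
            # Mean difference in Y between treatment and control units in this leaf.
            per_tree[in_leaf, t] = y[treated].mean() - y[control].mean()

# Average the leaf-level differences over the trees to get one estimate per unit.
ite = np.nanmean(per_tree, axis=1)
print(round(ite[gold == 1].mean(), 2), round(ite[gold == 0].mean(), 2))
```

Units that share a leaf have a similar probability of treatment, so within a leaf the treated and control units are roughly comparable and their mean difference serves as a local effect estimate.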

Estimating ITE using Counterfactual Regression*
* Shalit, U., Johansson, F., & Sontag, D. (2016). Estimating individual treatment effect: generalization bounds
and algorithms. arXiv preprint arXiv:1606.03976.
Procedure
• Learn a representation Φ of X → split samples according to w → regress Y0 and Y1 on the representation separately
• Regularize Φ using an integral probability metric (IPM), which is the distance between the distribution of X in w=1 and of X in w=0
• The joint objective is thus to minimize predictive error while guaranteeing a balanced representation of X
