CAUSALLY REGULARIZED
MACHINE LEARNING
Peng Cui, Tsinghua University
Kun Kuang, Zhejiang University
Bo Li, Tsinghua University
ML techniques are impacting our life
2
• A day in our life with ML techniques
Now we are stepping into risk-sensitive areas
3
Shifting from Performance Driven to Risk Sensitive
Problems of today’s ML - Explainability
4
Most machine learning models are black-box models
5
Most ML methods are developed under the i.i.d. hypothesis
Problems of today’s ML - Stability
6
Yes
Maybe
No
Problems of today’s ML - Stability
7
• Cancer survival rate prediction
Training Data → Predictive Model → Testing Data
• City Hospital (training data): higher income, higher survival rate.
• University Hospital (testing data): survival rate is not so correlated with income.
Problems of today’s ML - Stability
8
A plausible reason: Correlation
Correlation is the very basis of machine learning.
9
Correlation is not explainable
10
Correlation is ‘unstable’
11
It’s not the fault of correlation, but the way we use it
• Three sources of correlation:
• Causation (T → Y)
• Causal mechanism
• Stable and explainable
• Example: Income → Accepting a financial product offer
• Confounding (T ← X → Y)
• Ignoring X
• Spurious Correlation
• Example: Summer → Ice Cream Sales
• Sample Selection Bias (T → S ← Y)
• Conditional on S
• Spurious Correlation
• Example: Dog and Grass (sample selection: dogs are usually photographed on grass)
A Practical Definition of Causality
Definition: T causes Y if and only if
changing T leads to a change in Y,
while keeping everything else constant.
Causal effect is defined as the magnitude by which Y is
changed by a unit change in T.
Called the “interventionist” interpretation of causality.
12
http://plato.stanford.edu/entries/causation-mani/
X
T Y
13
The benefits of bringing causality into learning
Causal Framework
T grass
X dog nose
Y label
Grass—Label: Strong correlation
Weak causation
Dog nose—Label: Strong correlation
Strong causation
X
T Y
More Explainable and More Stable
14
The gap between causality and learning
• How to evaluate the outcome?
• Wild environments
• High-dimensional
• Highly noisy
• Little prior knowledge (model specification, confounding structures)
• Targeting problems
• Understanding vs. Prediction
• Depth vs. Scale and Performance
How to bridge the gap between causality and (stable) learning?
Outline
ØCorrelation vs. Causality
ØCausal Inference
ØStable Learning
ØNICO: An Image Dataset for Stable Learning
ØConclusions
15
16
• Causal identification with the back-door criterion
• Causal estimation with do-calculus
Paradigms - Structural Causal Model
A graphical model to describe the causal mechanisms of a system
How to discover the causal structure?
17
• Causal Discovery
• Constraint-based: conditional independence
• Functional causal model based
Paradigms – Structural Causal Model
A generative model with strong expressive power.
But it induces high complexity.
Paradigms - Potential Outcome Framework
• A simpler setting
• Suppose the confounders of T are known a priori
• The computational complexity is affordable
• Under stronger assumptions
• E.g. all confounders need to be observed
18
More like a discriminative way to estimate treatment’s
partial effect on outcome.
Causal Effect Estimation
• Treatment Variable: T = 1 or T = 0
• Treated Group (T = 1) and Control Group (T = 0)
• Potential Outcomes: Y(T = 1) and Y(T = 0)
• Average Causal Effect of Treatment (ATE):
19
ATE = E[Y(T = 1) − Y(T = 0)]
Counterfactual Problem
• Two key points for causal effect
estimation
• Changing T
• Keeping everything else constant
• For each person, observe only one:
either Y(T = 1) or Y(T = 0)
• For different groups (T = 1 and T = 0),
everything else is not constant
20
Person  T  Y(T=1)  Y(T=0)
P1      1  0.4     ?
P2      0  ?       0.6
P3      1  0.3     ?
P4      0  ?       0.1
P5      1  0.5     ?
P6      0  ?       0.5
P7      0  ?       0.1
Ideal Solution: Counterfactual World
• Reason about a world that does not exist
• Everything in the counterfactual world is the same as the
real world, except the treatment
21
! " = 1 ! " = 0
Randomized Experiments are the “Gold Standard”
• Drawbacks of randomized experiments:
• Cost
• Unethical
• Unrealistic
22
What can we do when an experiment is
not possible?
Observational Studies!
Recap: Causal Effect and Potential Outcome
• Two key points for causal effect estimation
• Changing T
• Keeping everything else (X) constant
• Counterfactual Problem
• Ideal Solution: Counterfactual World
• “Gold Standard”: Randomized Experiments
• We will discuss other solutions in the next section.
23
! " = 1 or ! " = 0
Outline
ØCorrelation vs. Causality
ØCausal Inference
ØMethods for Causal Inference
ØStable Learning
ØNICO: An Image Dataset for Stable Learning
ØConclusions
24
Causal Inference with Observational Data
• Average Treatment Effect (ATE) represents the mean
(average) difference between the potential outcome of
units under treated (T=1) and control (T=0) status.
• Treated (T=1): taking a particular medication
• Control (T=0): not taking any medications
• ATE: the causal effect of the particular medication
25
!"# = #[& " = 1 − & " = 0 ]
Causal Inference with Observational Data
• Counterfactual Problem:
• Can we estimate ATE by directly comparing the average
outcome between treated and control groups?
• Yes with randomized experiments (X are the same)
• No with observational data (X might be different)
• Two key points:
• Changing T (T=1 and T=0)
• Keeping everything else (Confounder X) constant
26
! " = 1 or ! " = 0
Balancing Confounders’ Distribution
Methods for Causal Inference
• Matching
• Propensity Score Based Methods
• Propensity Score Matching
• Inverse of Propensity Weighting (IPW)
• Doubly Robust
• Data-Driven Variable Decomposition (D2VD)
• Directly Confounder Balancing
• Entropy Balancing
• Approximate Residual Balancing
• Differentiated Confounder Balancing (DCB)
27
Assumptions of Causal Inference
• A1: Stable Unit Treatment Value Assumption (SUTVA): The effect of treatment on
a unit is independent of the treatment assignment of other units
P(Y_i | T_i, T_j, X_i) = P(Y_i | T_i, X_i)
• A2: Unconfoundedness: The distribution of treatment is independent
of the potential outcomes given the observed variables
T ⊥ (Y(0), Y(1)) | X
No unmeasured confounders
• A3: Overlap: Each unit has nonzero probability to receive either
treatment status when given the observed variables
0 < P(T = 1 | X = x) < 1
28
Methods for Causal Inference
• Matching
• Propensity Score Based Methods
• Propensity Score Matching
• Inverse of Propensity Weighting (IPW)
• Doubly Robust
• Data-Driven Variable Decomposition (D2VD)
• Directly Confounder Balancing
• Entropy Balancing
• Approximate Residual Balancing
• Differentiated Confounder Balancing
29
Matching
30
T = 0        T = 1
Matching
31
Matching
• Identify pairs of treated (T=1) and control (T=0)
units whose confounders X are similar or even
identical to each other
• Paired units provide the everything else
(Confounders) approximate constant
• Estimating average causal effect by comparing
average outcome in the paired dataset
• Smaller ε: less bias, but higher variance
32
Distance(X_i, X_j) ≤ ε
Matching
• Exact Matching:
• Easy to implement, but limited to low-dimensional settings
• In high-dimensional settings, there will be few exact matches
33
!"#$%&'( )*, ), ≤ .
!"#$%&'( )*, ), = 0
0, )* = ),
∞, )* ≠ ), 4 5
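The caliper-matching idea above can be sketched in a few lines. This is only an illustrative sketch, not the tutorial's code: the function name `matched_ate`, the 1-nearest-neighbor pairing, and the Euclidean distance are all assumed choices.

```python
import numpy as np

def matched_ate(X, T, Y, eps=0.1):
    """Pair each treated unit with its nearest control unit in
    confounder space; keep the pair only if Distance(Xi, Xj) <= eps,
    then average the within-pair outcome differences."""
    treated = np.where(T == 1)[0]
    control = np.where(T == 0)[0]
    diffs = []
    for i in treated:
        d = np.linalg.norm(X[control] - X[i], axis=1)  # distances to all controls
        j = control[np.argmin(d)]
        if d.min() <= eps:              # caliper: smaller eps, less bias
            diffs.append(Y[i] - Y[j])
    return np.mean(diffs) if diffs else np.nan
```

A smaller `eps` discards poorly matched pairs (less bias) at the cost of fewer pairs (higher variance), mirroring the trade-off on the slide.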
Methods for Causal Inference
• Matching
• Propensity Score Based Methods
• Propensity Score Matching
• Inverse of Propensity Weighting (IPW)
• Doubly Robust
• Data-Driven Variable Decomposition (D2VD)
• Directly Confounder Balancing
• Entropy Balancing
• Approximate Residual Balancing
• Differentiated Confounder Balancing
34
Propensity Score Based Methods
• The propensity score p(x) is the probability of a unit being treated:
p(x) = P(T = 1 | x)
• Rosenbaum and Rubin show that the propensity score is sufficient to
control or summarize the information of the confounders:
T ⫫ X | p(x),  T ⫫ (Y(1), Y(0)) | p(x)
• Propensity scores are rarely observed and need to be estimated
35
Propensity Score Matching
• Estimating the propensity score p̂(x) = P̂(T = 1 | x):
• Supervised learning: predicting the known label T
from the observed covariates X
• Conventionally, with logistic regression
• Matching pairs by distance between propensity scores:
Distance(X_i, X_j) = |p̂(X_i) − p̂(X_j)| ≤ ε
• High-dimensional challenge: transferred from matching to propensity score estimation
36
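The two steps can be sketched as follows, assuming scikit-learn is available. The function name and the 1-nearest-neighbor score matching are illustrative choices, and this sketch matches each treated unit to a control (an ATT-style contrast):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def psm_effect(X, T, Y):
    """Propensity score matching: estimate p(x) = P(T=1|x) with
    logistic regression, then match each treated unit to the
    control unit with the closest estimated score."""
    ps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    treated = np.where(T == 1)[0]
    control = np.where(T == 0)[0]
    diffs = [Y[i] - Y[control[np.argmin(np.abs(ps[control] - ps[i]))]]
             for i in treated]
    return np.mean(diffs)
```

Matching now happens on a single scalar per unit, which is exactly how the high-dimensional burden is moved from matching into propensity score estimation.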
Methods for Causal Inference
• Matching
• Propensity Score Based Methods
• Propensity Score Matching
• Inverse of Propensity Weighting (IPW)
• Doubly Robust
• Data-Driven Variable Decomposition (D2VD)
• Directly Confounder Balancing
• Entropy Balancing
• Approximate Residual Balancing
• Differentiated Confounder Balancing
37
Inverse of Propensity Weighting (IPW)
• Why is weighting by the inverse of the propensity score helpful?
• The propensity score p(x) = P(T = 1 | x) induces a distribution bias on the confounders X:
38
Unit  p(x)  1−p(x)  #units  #units (T=1)  #units (T=0)
A     0.7   0.3     10      7             3
B     0.6   0.4     50      30            20
C     0.2   0.8     40      8             32
Reweighting by the inverse of the propensity score, w_i = T_i / p(x_i) + (1 − T_i) / (1 − p(x_i)):
Unit  #units (T=1)  #units (T=0)
A     10            10
B     50            50
C     40            40
After reweighting, the confounders' distributions are the same.
Inverse of Propensity Weighting (IPW)
• Estimating ATE by IPW [1]:
• Interpretation: IPW creates a pseudo-population where the
confounders are the same between treated and control groups.
• Why does this work? Consider
39
w_i = T_i / p(x_i) + (1 − T_i) / (1 − p(x_i))
Inverse of Propensity Weighting (IPW)
• If p̂(x) = P(T = 1 | x), the true propensity score
• Similarly:
40
Y = T · Y(1) + (1 − T) · Y(0)
T ⊥ (Y(1), Y(0)) | X
p(x) = P(T = 1 | x)
ATE = E[Y(1) − Y(0)]
Inverse of Propensity Weighting (IPW)
• If p̂(x) equals the true propensity score, the IPW
estimator is unbiased
• Widely used in many applications
• But requires that the propensity score model be correct
• High variance when p(x) is close to 0 or 1
41
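A minimal sketch of the IPW estimator, assuming scikit-learn for the propensity model; the logistic regression choice and the score clipping (a common practical guard against the high-variance regime where p(x) → 0 or 1) are assumptions, not part of the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, T, Y, clip=1e-3):
    """IPW estimate of the ATE: weight unit i by
    w_i = T_i/p_i + (1 - T_i)/(1 - p_i) so that the confounder
    distribution is balanced in the resulting pseudo-population."""
    p = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    p = np.clip(p, clip, 1 - clip)   # guard against extreme scores
    return np.mean(T * Y / p) - np.mean((1 - T) * Y / (1 - p))
```

If the propensity model is misspecified, this estimator can be arbitrarily biased, which motivates the doubly robust construction that follows.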
Methods for Causal Inference
• Matching
• Propensity Score Based Methods
• Propensity Score Matching
• Inverse of Propensity Weighting (IPW)
• Doubly Robust
• Data-Driven Variable Decomposition (D2VD)
• Directly Confounder Balancing
• Entropy Balancing
• Approximate Residual Balancing
• Differentiated Confounder Balancing
42
Doubly Robust
• Recap:
• Simple outcome regression:
• Unbiased if the regression models are correct
• IPW estimator:
• Unbiased if the propensity score model is correct
• Doubly Robust [2]: combine both approaches
43
!"# = #[& " = 1 − & " = 0 ]
and
Doubly Robust
• Estimating ATE with Doubly Robust estimator:
• Unbiased if either propensity score or regression model is correct
• This property is referred to as double robustness
44
Doubly Robust
• Theoretical Proof:
45
Doubly Robust
• Estimating ATE with Doubly Robust estimator:
• Unbiased if propensity score or regression model is correct
• This property is referred to as double robustness
• But may be very biased if both models are incorrect
46
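A sketch of the augmented-IPW form of the doubly robust estimator, assuming scikit-learn; the linear outcome models and logistic propensity model are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def dr_ate(X, T, Y, clip=1e-3):
    """Doubly robust (augmented IPW) ATE estimator: combines outcome
    regressions (m1, m0) with inverse propensity weighting; unbiased
    if either the propensity model or the outcome model is correct."""
    p = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
    p = np.clip(p, clip, 1 - clip)
    m1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
    m0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)
    mu1 = np.mean(T * (Y - m1) / p + m1)              # augmented treated mean
    mu0 = np.mean((1 - T) * (Y - m0) / (1 - p) + m0)  # augmented control mean
    return mu1 - mu0
```

The IPW term corrects the regression's residual bias, and the regression term tames the IPW weights, which is where the "double" protection comes from.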
Propensity Score based Methods
•Recap:
• Propensity Score Matching
• Inverse of Propensity Weighting
• Doubly Robust
•Need to estimate the propensity score
• Treats all observed variables as confounders
• High-dimensional data in the big data era
• But not all variables are confounders
47
Propensity Score based Methods
•Recap:
• Propensity Score Matching
• Inverse of Propensity Weighting
• Doubly Robust
•Need to estimate the propensity score
• Treats all observed variables as confounders
• High-dimensional data in the big data era
• But not all variables are confounders
48
How to automatically separate
the confounders?
Methods for Causal Inference
• Matching
• Propensity Score Based Methods
• Propensity Score Matching
• Inverse of Propensity Weighting (IPW)
• Doubly Robust
• Data-Driven Variable Decomposition (D2VD)
• Directly Confounder Balancing
• Entropy Balancing
• Approximate Residual Balancing
• Differentiated Confounder Balancing (DCB)
49
50
Inverse of Propensity Weighting (IPW)
• Treat all observed variables U as
confounders X
• Propensity Score Estimation:
• Adjusted Outcome:
• IPW ATE Estimator:
Data-Driven Variable Decomposition (D2VD)
51
• Separateness Assumption:
• All observed variables U can be decomposed into
three sets: Confounders X, Adjustment Variables Z,
and Irrelevant variables I (Omitted).
• Propensity Score Estimation:
• Adjusted Outcome:
• Our D2VD ATE Estimator:
Data-Driven Variable Decomposition (D2VD)
• Confounders Separation & ATE Estimation.
• With our D2VD estimator:
• By minimizing following objective function:
• We can estimate the ATE as:
52
Data-Driven Variable Decomposition (D2VD)
53
• Adjustment variables:
• Confounders:
• Treatment Effect:
Replace X, Z with U
Data-Driven Variable Decomposition (D2VD)
54
Bias Analysis:
Our D2VD algorithm is unbiased for causal effect estimation
Variance Analysis:
The asymptotic variance of our D2VD algorithm is smaller
Data-Driven Variable Decomposition (D2VD)
•OUR: Data-Driven Variable Decomposition (D2VD)
•Baselines
•Direct Estimator (Dir): ignores confounding bias
•IPW Estimator (IPW): treats all variables as confounders
•Doubly Robust Estimator (DR): IPW + regression
•Non-Separation Estimator (D2VD-): no variable separation
55
Data-Driven Variable Decomposition (D2VD)
• Dataset generation:
• Sample size m={1000,5000}
• Dimension of observed variables n={50,100,200}
• Observed variables:
• Treatment: logistic and misspecified
• Outcome:
56
Data-Driven Variable Decomposition (D2VD)
• Dataset generation:
• Sample size m={1000,5000}
• Dimension of observed variables n={50,100,200}
• Observed variables:
• Treatment: logistic and misspecified
• Outcome:
57
The true treatment effect in synthetic data is 1.
Data-Driven Variable Decomposition (D2VD)
• Experimental Results on Synthetic Data:
58
Data-Driven Variable Decomposition (D2VD)
• Experimental Results on Synthetic Data:
59
1. The direct estimator fails under all settings.
2. IPW and DR estimators perform well when T = T_logit, but poorly when T = T_missp.
3. D2VD(-), with no variable separation, gets results similar to the DR estimator.
4. D2VD improves accuracy and reduces variance for ATE estimation.
Data-Driven Variable Decomposition (D2VD)
• Experimental Results on Synthetic Data:
60
TPR: true positive rate
TNR: true negative rate
Our D2VD algorithm
can precisely separate
the confounders and
adjustment variables.
Experiments on Real World Data
• Dataset Description:
• Online advertising campaign (LONGCHAMP)
• Users Feedback: 14,891 LIKE; 93,108 DISLIKE
• 56 Features for each user
• Age, gender, #friends, device, user setting on WeChat
• Experimental Setting:
• Outcome Y: users feedback
• Treatment T: one feature
• Observed Variables U: other features
61
2015
Y = 1, if LIKE
Y = 0, if DISLIKE
Experiments Results
• ATE Estimation.
62
1. Our D2VD estimator estimates the ATE more accurately.
2. Our D2VD estimator reduces the variance of the estimated ATE.
3. Younger women are more likely to like the LONGCHAMP ads.
Experiments Results
• Variables Decomposition.
63
1. The identified confounders are mostly features about other ways of adding friends on WeChat.
2. The adjustment variables have a significant effect on the outcome.
3. Our D2VD algorithm can precisely separate the confounders and
adjustment variables.
Summary: Propensity Score based Methods
• Propensity Score Matching (PSM):
• Units matching by their propensity score
• Inverse of Propensity Weighting (IPW):
• Units reweighted by inverse of propensity score
• Doubly Robust (DR):
• Combining IPW and regression
• Data-Driven Variable Decomposition (D2VD):
• Automatically separate the confounders and adjustment variables
• Confounder: estimate propensity score for IPW
• Adjustment variables: regression on outcome for reducing variance
• Improving accuracy and reducing variance on treatment effect estimation
• But these methods require the propensity score model to be correct
64
Treat all observed variables as confounders, ignoring non-confounders
p(x) = P(T = 1 | x)
Methods for Causal Inference
• Matching
• Propensity Score Based Methods
• Propensity Score Matching
• Inverse of Propensity Weighting (IPW)
• Doubly Robust
• Data-Driven Variable Decomposition (D2VD)
• Directly Confounder Balancing
• Entropy Balancing
• Approximate Residual Balancing
• Differentiated Confounder Balancing (DCB)
65
Causal Inference with Observational Data
• Average Treatment Effect (ATE):
• Average Treatment effect on the Treated (ATT):
• Two key points:
• Changing T (T=1 and T=0)
• Keeping everything else (Confounder X) constant
66
!"# = #[& " = 1 − & " = 0 ]
!"" = #[& 1 |" = 1] − #[& 0 |" = 1]
Causal Inference with Observational Data
• Average Treatment Effect (ATE):
• Average Treatment effect on the Treated (ATT):
• Two key points:
• Changing T (T=1 and T=0)
• Keeping everything else (Confounder X) constant
67
!"# = #[& " = 1 − & " = 0 ]
!"" = #[& 1 |" = 1] − #[& 0 |" = 1]
Balancing Confounders’ Distribution
Directly Confounder Balancing
• Recap: propensity score based methods
• Sample reweighting for confounder balancing
• But they require the propensity score model to be correct
• Weights become very large when the propensity score is close to 0 or 1
• Can we directly learn sample weights that balance the
confounders' distribution between treated and control groups?
68
w_i = T_i / p_i + (1 − T_i) / (1 − p_i)
Yes!
Directly Confounder Balancing
• Motivation: The collection of all the moments of variables
uniquely determines their distributions.
• Methods: Learning sample weights by directly balancing
confounders’ moments as follows
69
The first moments of X
on the Control Group
The first moments of X
on the Treated Group
With moments, the sample weights can be learned
without any model specification.
Directly Confounder Balancing
• Motivation: The collection of all the moments of variables
uniquely determines their distributions.
• Methods: Learning sample weights by directly balancing
confounders’ moments as follows
• Estimating ATT by:
70
The first moments of X
on the Control Group
The first moments of X
on the Treated Group
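A sketch of this idea for the ATT: learn nonnegative, normalized weights on the control units so that their weighted first moments match the treated-group mean, then contrast outcomes. The softmax parameterization and gradient-descent solver here are assumed choices; Entropy Balancing and DCB add entropy terms and confounder weights on top of this basic scheme.

```python
import numpy as np

def balancing_weights(Xc, treated_mean, iters=2000, lr=0.5):
    """Learn weights w >= 0 with sum(w) = 1 on the control units so
    that the weighted first moments Xc^T w match the treated-group
    mean, by gradient descent on a softmax parameterization.
    No propensity score model is specified anywhere."""
    theta = np.zeros(len(Xc))
    for _ in range(iters):
        w = np.exp(theta - theta.max())
        w /= w.sum()
        g = 2.0 * Xc @ (Xc.T @ w - treated_mean)  # dLoss/dw, Loss = ||Xc^T w - mean||^2
        theta -= lr * w * (g - w @ g)             # chain rule through the softmax
    w = np.exp(theta - theta.max())
    return w / w.sum()

def balanced_att(X, T, Y):
    """ATT estimate: mean treated outcome minus moment-balanced control outcome."""
    Xc, Yc = X[T == 0], Y[T == 0]
    w = balancing_weights(Xc, X[T == 1].mean(axis=0))
    return Y[T == 1].mean() - w @ Yc
```

Matching only the first moments is a finite truncation of the motivation above; higher moments can be appended as extra columns of `Xc`.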
Methods for Causal Inference
• Matching
• Propensity Score Based Methods
• Propensity Score Matching
• Inverse of Propensity Weighting (IPW)
• Doubly Robust
• Data-Driven Variable Decomposition (D2VD)
• Directly Confounder Balancing
• Entropy Balancing
• Approximate Residual Balancing
• Differentiated Confounder Balancing (DCB)
71
Entropy Balancing
• Directly confounder balancing by sample weights W
• Maximize the entropy of sample weights W
• But, treat all variables as confounders and balance
them equally
72
Approximate Residual Balancing
• 1. Compute approximate balancing weights W
• 2. Fit β̂ in a linear outcome model using the lasso or an elastic net
• 3. Estimate the ATT
• Double robustness: unbiased if either exact confounder balancing is achieved or the regression model is correct
• But treats all variables as confounders and balances them equally
73
Directly Confounder Balancing
• Recap:
• Entropy Balancing, Approximate Residual Balancing etc.
• Moments uniquely determine variables’ distribution
• Learning sample weights by balancing confounders’ moments
• But, treat all variables as confounders, and balance them equally
• Different confounders make different confounding bias
74
The first moments of X
on the Control Group
The first moments of X
on the Treated Group
Directly Confounder Balancing
• Recap:
• Entropy Balancing, Approximate Residual Balancing etc.
• Moments uniquely determine variables’ distribution
• Learning sample weights by balancing confounders’ moments
• But, treat all variables as confounders, and balance them equally
• Different confounders make different confounding bias
75
The first moments of X
on the Control Group
The first moments of X
on the Treated Group
How to differentiate confounders and
their bias?
Methods for Causal Inference
• Matching
• Propensity Score Based Methods
• Propensity Score Matching
• Inverse of Propensity Weighting (IPW)
• Doubly Robust
• Data-Driven Variable Decomposition (D2VD)
• Directly Confounder Balancing
• Entropy Balancing
• Approximate Residual Balancing
• Differentiated Confounder Balancing (DCB)
76
Differentiated Confounder Balancing
•Idea: simultaneously learn confounder weights β and
sample weights W.
•Confounder weights determine which variables are
confounders and their contributions to the confounding bias.
•Sample weights are designed for confounder balancing.
77
How to learn the confounder weights?
Confounder Weights Learning
• General relationship among the sample weights, confounder weights, and confounding bias:
78
Confounder weights determine the confounding bias.
If β_j = 0, then X_j is not a confounder, and there is no need to balance it.
Different confounders have different confounding weights.
Confounder Weights Learning
Propositions:
• In observational studies, not all observed variables are confounders, and
different confounders make unequal confounding bias on ATT with their
own weights.
• The confounder weights can be learned by regressing the potential outcome
Y(0) on the augmented variables.
79
Differentiated Confounder Balancing
• Objective Function
81
The ENT [3] and ARB [4] algorithms are special cases of our DCB
algorithm, obtained by setting the confounder weights to a unit vector.
Our DCB algorithm is more general for
treatment effect estimation.
Differentiated Confounder Balancing
• Algorithm
• Training complexity: O(n · d)
• n: sample size, d: dimension of variables
82
In each iteration, we first
update W with β fixed, and
then update β with W fixed.
Experiments
•Experimental Tasks:
ØRobustness Test (high-dimensional and noisy)
ØAccuracy Test (real world dataset)
ØPredictive Power Test (real ad application)
83
Experiments
• Baselines:
• Direct Estimator: comparing the average outcome between treated and control units.
• IPW Estimator [1]: reweighting via inverse of propensity score
• Doubly Robust Estimator [2]: IPW + regression method
• Entropy Balancing Estimator [3]: directly confounder balancing with entropy loss
• Approximate Residual Balancing [4]: confounder balancing + regression
• Evaluation Metric:
84
Experiments - Robustness Test
• Dataset
ØSample size: n = {2000, 5000}
ØVariables' dimensions: d = {50, 100}
ØObserved Variables:
ØTreatment: from a logistic function T_logit and a misspecified function T_missp
• Confounding rate r_c: the ratio of confounders to all observed variables.
• Confounding strength s_c: the bias strength of confounders
ØOutcome: from a linear function Y_linear and a nonlinear function Y_nonlin
85
Experiments - Robustness Test
• The direct estimator fails in all settings, since it ignores confounding bias.
• IPW and DR estimators make large errors when facing high-dimensional
variables or incorrect model specifications.
• ENT and ARB estimators perform poorly since they balance all
variables equally.
86
More results see our paper!
Experiments - Robustness Test
87
Our DCB estimator achieves significant improvements over
the baselines in different settings.
More results see our paper!
Our DCB estimator is very robust!
Experiments - Robustness Test
88
The MAE of our DCB estimator is consistently
stable and small.
• Sample Size
• Dimension of variables
• Confounding rate
• Confounding strength
Experiments - Robustness Test
89
Our DCB algorithm is very robust for
treatment effect estimation.
Experiments - Accuracy Test
• LaLonde Dataset [5]: Would the job training program increase people's earnings
in 1978?
• Randomized experiments: provide ground truth of treatment effect
• Observational studies: check the performance of all estimators
• Experimental Setting:
• V-RAW: the set of 10 raw observed variables, including employment,
education, age, ethnicity, and marital status.
• V-INTERACTION: the set of raw variables plus their pairwise one-way
interactions and their squared terms.
90
Experiments - Accuracy Test
91
Results of ATT estimation
Our DCB estimator is more accurate than the baselines.
Our DCB estimator achieves better confounder
balancing under the V-INTERACTION setting.
Experiments - Predictive Power
• Dataset Description:
• Online advertising campaign (LONGCHAMP)
• Users Feedback: 14,891 LIKE; 93,108 DISLIKE
• 56 Features for each user
• Age, gender, #friends, device, user settings on WeChat
• Experimental Setting:
• Outcome Y: users feedback
• Treatment T: one feature
• Observed Variables X: other features
93
2015
Y = 1, if LIKE
Y = 0, if DISLIKE
Select the top k features with high causal effect for prediction
Experiments - Predictive Power
• Two correlation-based feature
selection baselines:
• MRel [6]: maximum relevance
• mRMR [7]: Maximum relevance
and minimum redundancy.
94
ØOur DCB estimator achieves the best prediction accuracy.
ØCorrelation based methods perform worse than causal methods.
Summary: Directly Confounder Balancing
• Motivation: Moments can uniquely determine distribution
• Entropy Balancing
• Confounder balancing with maximizing entropy of sample weights
• Approximate Residual Balancing
• Combine confounder balancing and regression for doubly robust
• Treat all variables as confounders, and balance them equally
• But different confounders make different bias
• Differentiated Confounder Balancing (DCB)
• Theoretical proof of the necessity of differentiating confounders
• Improving the accuracy and robustness of treatment effect estimation
95
Sectional Summary: Methods for Causal Inference
• Matching
• Propensity Score Based Methods
• Propensity Score Matching
• Inverse of Propensity Weighting (IPW)
• Doubly Robust
• Data-Driven Variable Decomposition (D2VD)
• Directly Confounder Balancing
• Entropy Balancing
• Approximate Residual Balancing
• Differentiated Confounder Balancing (DCB)
96
Not all observed variables
are confounders
Different confounders
make different bias
Treat all observed
variables as confounders
Balance all confounders
equally
Limited to low-dimensional settings
Sectional Summary: Methods for Causal Inference
97
• Progress has been made to draw causality from
big data.
• From single to group
• From binary to continuous
• Weak assumptions
Ready for Learning?
Outline
ØCorrelation vs. Causality
ØCausal Inference
ØStable Learning
ØNICO: An Image Dataset for Stable Learning
ØFuture Directions and Conclusions
98
Stability and Prediction
99
True Model
Learning Process
Prediction
Performance
Traditional Learning
Stable Learning
Bin Yu (2016), Three Principles of Data Science: predictability, computability, stability
Stable Learning
100
Model
Distribution 1
Distribution 1
Distribution 2
Distribution 3
Distribution n
…
Accuracy 1
Accuracy 2
Accuracy 3
Accuracy n
…
I.I.D. Learning
Transfer Learning
VAR (Acc)
Stable
Learning
Training
Testing
Stability and Robustness
• Robustness
• More on prediction performance over data perturbations
• Prediction performance-driven
• Stability
• More on the true model
• Lays more emphasis on bias
• Sufficient for robustness
101
Stable learning is an (intrinsic?) way to realize robust prediction
Domain Generalization / Invariant Learning
105
• Given data from different
observed environments :
• The task is to predict Y given X
such that the prediction works
well (is “robust”) for “all possible”
(including unseen) environments
Domain Generalization
• Assumption: the conditional probability P(Y|X) is stable or
invariant across different environments.
• Idea: taking knowledge acquired from a number of related domains
and applying it to previously unseen domains
• Theorem: Under reasonable technical assumptions, a generalization
bound across domains holds with high probability
106
Muandet K, Balduzzi D, Schölkopf B. Domain Generalization via Invariant Feature Representation. ICML 2013.
Invariant Prediction
• Invariant Assumption: There exists a subset S ⊆ X that is causal for the prediction
of Y, and the conditional distribution P(Y|S) is stable across all environments.
• Idea: Linking to causality
• Structural Causal Model (Pearl 2009):
• The parent variables of Y in SCM satisfies Invariant Assumption
• The causal variables lead to invariance w.r.t. “all” possible environments
107
Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference by using invariant prediction: identification and
confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016
From Variable Selection to Sample Reweighting
108
X
T Y
Typical Causal Framework
Sample reweighting can make a variable independent of other
variables.
Given a feature T
Assign different weights to samples so that
the samples with T and the samples without
T have similar distributions in X
Calculate the difference of the Y distribution
between the treated and control groups (the
correlation between T and Y)
Global Balancing: Decorrelating Variables
109
X
T Y
Typical Causal Framework
Partial effect can be regarded as causal effect. Predicting with causal
variables is stable across different environments.
Given ANY feature T
Assign different weights to samples so that the
samples with T and the samples without T have
similar distributions in X
Calculate the difference of the Y distribution
between the treated and control groups (the
correlation between T and Y)
Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.
Theoretical Guarantee
110
Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.
Causal Regularizer
111
Set feature j as treatment variable
Zheyan Shen, Peng Cui, Kun Kuang, Bo Li. Causally Regularized Learning on Data with Agnostic Bias. ACM MM, 2018.
Causally Regularized Logistic Regression
112
Zheyan Shen, Peng Cui, Kun Kuang, Bo Li. Causally Regularized Learning on Data with Agnostic Bias. ACM MM, 2018.
From Shallow to Deep - DGBR
113
Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.
Experiment 1 – non-i.i.d. image classification
114
Experimental Result - insights
Experimental Result - insights
116
From Causal problem to Learning problem
118
• Previous logic:
• More direct logic:
Sample
Reweighting
Independent
Variables
Causal
Variable
Stable
Prediction
Sample
Reweighting
Independent
Variables
Stable
Prediction
Thinking from the Learning end
119
!"#$%&(() !"*+"(()
,-.// 01121
/.130 01121
Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
Stable Learning of Linear Models
• Consider linear regression with a misspecification bias term b(x), bounded by |b(x)| ≤ δ
• By accurately estimating β with the property that b(x) is uniformly
small for all x, we can achieve stable learning.
• However, the estimation error caused by the misspecification term can
be as bad as a quantity inversely related to λ_min, where λ_min is the smallest
eigenvalue of the centered covariance matrix.
120
Bias term with bound |b(x)| ≤ δ. Goes to infinity when perfect collinearity exists!
Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
Toy Example
• Assume the design matrix X consists of two variables X1, X2,
generated from a multivariate normal distribution.
• By changing the correlation ρ, we can simulate different extents of collinearity.
• To induce bias related to collinearity, we generate the bias term b(X)
with b(X) = Xv, where v is the eigenvector of the centered covariance
matrix corresponding to its smallest eigenvalue λ_min.
• The bias term is sensitive to collinearity.
121
Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
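The mechanism behind the toy example can be checked numerically. This sketch only reproduces the key quantity (the smallest eigenvalue of the centered covariance shrinking as ρ grows); the sample size and seed are illustrative choices.

```python
import numpy as np

def smallest_eigenvalue(rho, n=5000, seed=0):
    """Sample X = (X1, X2) from a bivariate normal with correlation
    rho and return the smallest eigenvalue of the centered
    covariance matrix (the population value is 1 - rho)."""
    rng = np.random.RandomState(seed)
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    Xc = X - X.mean(axis=0)                       # center the design matrix
    return np.linalg.eigvalsh(Xc.T @ Xc / n)[0]   # eigvalsh returns ascending order
```

As ρ → 1 the smallest eigenvalue goes to 0, and the misspecification-driven error bound on the previous slide blows up.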
Simulation Results
122
!"#$% %##&# (%()*+")*&, -*"()
!"#$% /"#*",0% *, 1*22%#%,) 1*()#*-3)*&,(
*,0#%"(% 0&!!*,%"#*)4
Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
Reducing collinearity by sample reweighting
123
Idea: Learn a new set of sample weights w(x) to decorrelate the
input variables and increase the smallest eigenvalue
• Weighted Least Square Estimation
which is equivalent to
So, how to find an “oracle” distribution which holds the desired
property?
Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
Sample Reweighted Decorrelation Operator (cont.)
124
Decorrelation
where the resampled row indices are drawn from 1 … n at random
• By treating the different columns independently while performing
random resampling, we can obtain a column-decorrelated design
matrix with the same marginal as before.
• Then we can use density ratio estimation to get +(-).
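The pipeline above (random column resampling → density-ratio estimation → weighted least squares) can be sketched as follows. This is a minimal illustration under my own assumptions, not the paper's exact estimator: the density ratio is approximated with a logistic-regression classifier on degree-2 polynomial features, which can represent the quadratic log-density-ratio of Gaussian data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def decorrelation_weights(X):
    """Estimate w(x) ~ q(x)/p(x), where q is the column-decorrelated
    version of the data distribution p."""
    n, p = X.shape
    # Resample each column independently: same marginals as before,
    # but the cross-column dependence is broken.
    X_dec = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
    # Density ratio via a probabilistic classifier on quadratic features
    phi = PolynomialFeatures(degree=2, include_bias=False)
    Z = phi.fit_transform(np.vstack([X, X_dec]))
    t = np.r_[np.zeros(n), np.ones(n)]  # 0 = original, 1 = decorrelated
    clf = LogisticRegression(max_iter=2000).fit(Z, t)
    prob = clf.predict_proba(phi.transform(X))
    w = prob[:, 1] / prob[:, 0]
    return w / w.mean()

def weighted_least_squares(X, y, w):
    # WLS solved as OLS on sqrt(w)-scaled rows
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta
```

After reweighting, the weighted correlation between the input variables shrinks, which lifts the smallest eigenvalue of the weighted covariance matrix and makes the WLS estimate less sensitive to the misspecification bias.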
Experimental Results
• Simulation Study
125
Experimental Results
• Regression
• Classification
126
Disentanglement Representation Learning
• Learning Multiple Levels of Abstraction
• The big payoff of deep learning is that it allows learning higher levels of
abstraction
• Higher-level abstractions disentangle the factors of variation,
which allows much easier generalization and transfer
127
Yoshua Bengio, From Deep Learning of Disentangled Representations to Higher-level Cognition. (2019). YouTube. Retrieved 22 February 2019.
From decorrelating input variables to learning
disentangled representation
Disentanglement for Causality
• Causal / mechanism independence
• Independently Controllable Factors (Thomas, Bengio et al., 2017)
• Optimize both a policy π_k and a representation f_k to minimize a selectivity loss
128
Each policy π_k selectively changes one factor of variation; the
corresponding representation f_k captures that factor's value.
This requires subtle design of the
policy set to guarantee causality.
Sectional Summary
129
• Causal inference provides valuable insights for stable learning
• A complete causal structure characterizes the data generation process,
necessarily leading to stable prediction
• Stable learning can also help to advance causal inference
• For performance-driven and practical applications, a benchmark is important!
Outline
• Correlation vs. Causality
• Causal Inference
• Stable Learning
• NICO: An Image Dataset for Stable Learning
• Future Directions and Conclusions
130
Non-I.I.D. Image Classification
• Non-I.I.D. Image Classification: P(X_train, Y_train) ≠ P(X_test, Y_test),
where the training distribution is known but the testing distribution is unknown
• Two tasks
• Targeted Non-I.I.D. Image Classification
• Prior knowledge about the testing data is available
• e.g. transfer learning, domain adaptation
• General Non-I.I.D. Image Classification
• The testing distribution is unknown, no prior
• More practical & realistic
131
Existence of Non-I.I.Dness
• One metric (NI) quantifies the degree of Non-I.I.Dness, i.e. the
normalized distribution shift between training and testing data
• Existence of Non-I.I.Dness on a dataset consisting of 10 subclasses from ImageNet
• For each class, we split training and testing data and use a CNN for prediction
• Findings: distribution shift is ubiquitous, and NI correlates strongly
with prediction error
132
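The exact NI definition is given in the NICO paper; as a rough sketch under my own assumptions, one plausible form measures the distance between the mean feature vectors (e.g. CNN features) of the training and testing sets, divided by a pooled standard deviation for normalization. The function name and normalization choice here are illustrative, not the paper's.

```python
import numpy as np

def non_iid_index(feat_train, feat_test):
    # Distance between the first moments of train/test features,
    # normalized by the pooled feature standard deviation.
    mu_tr = feat_train.mean(axis=0)
    mu_te = feat_test.mean(axis=0)
    sigma = np.vstack([feat_train, feat_test]).std()  # for normalization
    return np.linalg.norm(mu_tr - mu_te) / sigma
```

Under this form, identically distributed train/test features yield an NI near zero, while a shift in the feature distribution drives NI up.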
Related Datasets
• ImageNet & PASCAL VOC & MSCOCO
• NI is ubiquitous, but small on these datasets (average NI on ImageNet: 2.7)
• NI is uncontrollable on these datasets, not friendly for the Non-I.I.D. setting
133
A dataset for Non-I.I.D. image classification is demanded.
NICO - Non-I.I.D. Image Dataset with Contexts
• NICO dataset:
• Object labels: e.g. dog
• Contextual labels (contexts):
• the background or scene of an object, e.g. grass/water
• Structure of NICO
• 2 superclasses (Animal, Vehicle)
• 10 classes per superclass, e.g. dog, train
• 10 contexts per class, e.g. grass, on bridge: diverse, meaningful,
and overlapping across classes
134
NICO - Non-I.I.D. Image Dataset with Contexts
• Data size of each class in NICO
• Sample size: thousands for each class
• Each superclass: 10,000 images
• Sufficient for training some basic convolutional neural networks (CNNs)
• Samples with contexts in NICO
135
Controlling NI on NICO Dataset
•Minimum Bias (comparing with ImageNet)
•Proportional Bias (controllable)
• Number of samples in each context
•Compositional Bias (controllable)
• Number of observed contexts
136
Minimum Bias
• In this setting, random sampling leads to the minimum distribution shift between
training and testing distributions in the dataset, which simulates a nearly i.i.d. scenario.
• 8000 samples for training and 2000 samples for testing in each superclass (ConvNet)
137

Superclass   Average NI   Testing Accuracy
Animal       3.85         49.6%
Vehicle      3.20         63.0%

Images in NICO come with rich contextual information, which makes them more
challenging for image classification. For comparison, the average NI on
ImageNet is 2.7: NICO is more non-i.i.d., hence more challenging.
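The minimum-bias setting is just a uniform random split over the pooled images of a superclass. A sketch (the function and parameter names are mine, not from the paper):

```python
import random

def minimum_bias_split(images, n_train=8000, n_test=2000, seed=0):
    # Uniform random sampling minimizes the distribution shift between
    # training and testing sets, simulating a nearly i.i.d. scenario.
    rnd = random.Random(seed)
    pool = rnd.sample(images, n_train + n_test)
    return pool[:n_train], pool[n_train:]
```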
Proportional Bias
• Given a class, when sampling positive samples, we use all contexts for both training and
testing, but the percentage of each context differs between the training and testing datasets.
138
[Figure: NI rises from about 4.1 to about 4.5 as the dominant ratio in the
training data grows from 1:1 to 6:1, while testing keeps a 1:1 ratio across
contexts. In the example shown, the dominant context takes 55% of the training
samples and each of the other nine contexts takes 5%.]
We can control NI by varying the dominant ratio.
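One plausible way to realize proportional-bias sampling (names and rounding scheme are my assumptions, not the paper's): give the dominant context r shares and each remaining context one share, for a dominant ratio of r:1.

```python
import random

def proportional_sample(images_by_context, dominant, n_total, ratio, seed=0):
    # ratio r means the dominant context gets r shares and every other
    # context gets 1 share, i.e. a dominant ratio of r:1.
    rnd = random.Random(seed)
    others = [c for c in images_by_context if c != dominant]
    n_dom = round(n_total * ratio / (ratio + len(others)))
    n_other = (n_total - n_dom) // len(others)
    sample = rnd.sample(images_by_context[dominant], n_dom)
    for c in others:
        sample += rnd.sample(images_by_context[c], n_other)
    return sample
```

With 10 contexts and ratio = 11, this reproduces the 55% / 5% split from the slide above.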
Compositional Bias
• Given a class, the observed contexts differ between the training and testing data.
139
[Figures: in the moderate setting (training and testing contexts overlap), NI
rises to 4.34 as the number of contexts in the training data decreases from
7 to 3; in the radical setting (no overlap, combined with a dominant ratio),
NI rises from 4.44 towards 5.0 as the dominant ratio in the training data
grows from 1:1 to 5:1, with testing kept at a 1:1 ratio.]
NICO - Non-I.I.D. Image Dataset with Contexts
• Large and controllable NI
140
[Figure: NI on NICO is larger than on existing datasets and can be controlled
from small to large.]
NICO - Non-I.I.D. Image Dataset with Contexts
• The dataset can be downloaded from (temporary address):
• https://www.dropbox.com/sh/8mouawi5guaupyb/AAD4fdySrA6fn3PgSmhKwFgva?dl=0
• Please refer to the following paper for details:
• Yue He, Zheyan Shen, Peng Cui. NICO: A Dataset Towards Non-I.I.D. Image Classification. https://arxiv.org/pdf/1906.02899.pdf
141
Outline
• Correlation vs. Causality
• Causal Inference
• Stable Learning
• NICO: An Image Dataset for Stable Learning
• Conclusions
142
Conclusions
• Predictive modeling is not only about Accuracy.
• Stability is critical for us to trust a predictive model.
• Causality has been demonstrated to be useful in stable prediction.
• How to marry causality with predictive modeling effectively and
efficiently is still an open problem.
143
Conclusions
144
[Roadmap: from causal inference to stable learning. Debiasing via causal
inference: propensity score, direct confounder balancing. Stable prediction:
global balancing, linear stable learning, disentangled representation learning.]
Reference
• Shen Z, Cui P, Kuang K, et al. Causally regularized learning with agnostic data
selection bias[C]//2018 ACM Multimedia Conference on Multimedia Conference.
ACM, 2018: 411-419.
• Kuang K, Cui P, Athey S, et al. Stable prediction across unknown
environments[C]//Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining. ACM, 2018: 1617-1626.
• Kuang K, Cui P, Li B, et al. Estimating treatment effect in the wild via differentiated
confounder balancing[C]//Proceedings of the 23rd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, 2017: 265-274.
• Kuang K, Cui P, Li B, et al. Treatment effect estimation with data-driven variable
decomposition[C]//Thirty-First AAAI Conference on Artificial Intelligence. 2017.
• Kuang K, Jiang M, Cui P, et al. Steering social media promotions with effective
strategies[C]//2016 IEEE 16th International Conference on Data Mining (ICDM).
IEEE, 2016: 985-990.
145
Reference
• Pearl J. Causality[M]. Cambridge University Press, 2009.
• Austin P C. An introduction to propensity score methods for reducing the effects of
confounding in observational studies[J]. Multivariate behavioral research, 2011, 46(3): 399-
424.
• Johansson F, Shalit U, Sontag D. Learning representations for counterfactual
inference[C]//International conference on machine learning. 2016: 3020-3029.
• Shalit U, Johansson F D, Sontag D. Estimating individual treatment effect: generalization
bounds and algorithms[C]//Proceedings of the 34th International Conference on Machine
Learning-Volume 70. JMLR. org, 2017: 3076-3085.
• Johansson F D, Kallus N, Shalit U, et al. Learning weighted representations for generalization
across designs[J]. arXiv preprint arXiv:1802.08598, 2018.
• Louizos C, Shalit U, Mooij J M, et al. Causal effect inference with deep latent-variable
models[C]//Advances in Neural Information Processing Systems. 2017: 6446-6456.
• Thomas V, Bengio E, Fedus W, et al. Disentangling the independently controllable factors of
variation by interacting with the world[J]. arXiv preprint arXiv:1802.09484, 2018.
• Bengio Y, Deleu T, Rahaman N, et al. A Meta-Transfer Objective for Learning to Disentangle
Causal Mechanisms[J]. arXiv preprint arXiv:1901.10912, 2019.
146
Reference
• Yu B. Stability[J]. Bernoulli, 2013, 19(4): 1484-1500.
• Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks[J]. arXiv
preprint arXiv:1312.6199, 2013.
• Volpi R, Namkoong H, Sener O, et al. Generalizing to unseen domains via adversarial data
augmentation[C]//Advances in Neural Information Processing Systems. 2018: 5334-5344.
• Ye N, Zhu Z. Bayesian adversarial learning[C]//Proceedings of the 32nd International
Conference on Neural Information Processing Systems. Curran Associates Inc., 2018: 6892-
6901.
• Muandet K, Balduzzi D, Schölkopf B. Domain generalization via invariant feature
representation[C]//International Conference on Machine Learning. 2013: 10-18.
• Peters J, Bühlmann P, Meinshausen N. Causal inference by using invariant prediction:
identification and confidence intervals[J]. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 2016, 78(5): 947-1012.
• Rojas-Carulla M, Schölkopf B, Turner R, et al. Invariant models for causal transfer learning[J].
The Journal of Machine Learning Research, 2018, 19(1): 1309-1342.
• Rothenhäusler D, Meinshausen N, Bühlmann P, et al. Anchor regression: heterogeneous data
meets causality[J]. arXiv preprint arXiv:1801.06229, 2018.
147
Acknowledgement
148
Thanks!
Peng Cui
cuip@tsinghua.edu.cn
http://pengcui.thumedialab.com
149
Kun Kuang
kunkuang@zju.edu.cn
https://kunkuang.github.io/
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency Dallas
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
 
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrStory Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
Story Board.pptxrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
 
Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...
Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...
Tadepalligudem Escorts Service Girl ^ 9332606886, WhatsApp Anytime Tadepallig...
 
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call GirlsMira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
 
call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7
call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7
call girls in Anand Vihar (delhi) call me [🔝9953056974🔝] escort service 24X7
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
 
Real Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirtReal Men Wear Diapers T Shirts sweatshirt
Real Men Wear Diapers T Shirts sweatshirt
 
Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.Meaning of On page SEO & its process in detail.
Meaning of On page SEO & its process in detail.
 

Causally regularized machine learning

  • 1. CAUSALLY REGULARIZED MACHINE LEARNING Peng Cui, Tsinghua University Kun Kuang, Zhejiang University Bo Li, Tsinghua University
  • 2. ML techniques are impacting our life 2 • A day in our life with ML techniques 8:30 am 8:00 am 10:00 am 4:00 pm 6:00 pm 8:00 pm
  • 3. Now we are stepping into risk-sensitive areas 3 Shifting from Performance Driven to Risk Sensitive
  • 4. Problems of today’s ML - Explainability 4 Most machine learning models are black-box models
  • 5. 5 Most ML methods are developed under the I.I.D. hypothesis Problems of today’s ML - Stability
  • 7. 7 • Cancer survival rate prediction Training Data (City Hospital): higher income, higher survival rate. Predictive Model. Testing Data (University Hospital): survival rate is not so correlated with income. Problems of today’s ML - Stability
  • 8. 8 A plausible reason: Correlation. Correlation is the very basis of machine learning.
  • 9. 9 Correlation is not explainable
  • 11. 11 It’s not the fault of correlation, but the way we use it • Three sources of correlation: • Causation (T → Y): the causal mechanism; stable and explainable (e.g., summer causes ice cream sales) • Confounding (T ← X → Y): ignoring the confounder X induces spurious correlation (e.g., income confounds a financial product offer and its acceptance) • Sample Selection Bias (T → S ← Y): conditioning on the selection variable S induces spurious correlation (e.g., grass and the dog label correlated through sample selection)
  • 12. A Practical Definition of Causality Definition: T causes Y if and only if changing T leads to a change in Y, while keeping everything else constant. The causal effect is defined as the magnitude by which Y is changed by a unit change in T. Called the “interventionist” interpretation of causality. 12 http://plato.stanford.edu/entries/causation-mani/
  • 13. 13 The benefits of bringing causality into learning • Causal framework: T = grass, X = dog nose, Y = label • Grass–Label: strong correlation, weak causation • Dog nose–Label: strong correlation, strong causation More Explainable and More Stable
  • 14. 14 The gap between causality and learning • How to evaluate the outcome? • Wild environments: high-dimensional, highly noisy, little prior knowledge (model specification, confounding structures) • Targeting problems: understanding vs. prediction; depth vs. scale and performance. How to bridge the gap between causality and (stable) learning?
  • 15. Outline • Correlation vs. Causality • Causal Inference • Stable Learning • NICO: An Image Dataset for Stable Learning • Conclusions 15
  • 16. 16 • Causal identification with the back-door criterion • Causal estimation with do-calculus Paradigms - Structural Causal Model: a graphical model describing the causal mechanisms of a system (a graph over variables such as T, Y, U, Z, W). How to discover the causal structure?
  • 17. 17 • Causal Discovery • Constraint-based: conditional independence • Functional causal model based Paradigms – Structural Causal Model A generative model with strong expressive power. But it induces high complexity.
  • 18. Paradigms - Potential Outcome Framework • A simpler setting • Suppose the confounders of T are known a priori • The computational complexity is affordable • Under stronger assumptions • E.g. all confounders need to be observed 18 More like a discriminative way to estimate treatment’s partial effect on outcome.
  • 19. Causal Effect Estimation • Treatment Variable: T = 1 or T = 0 • Treated Group (T = 1) and Control Group (T = 0) • Potential Outcomes: Y(T = 1) and Y(T = 0) • Average Causal Effect of Treatment (ATE): 19 ATE = E[Y(T = 1) − Y(T = 0)]
  • 20. Counterfactual Problem • Two key points for causal effect estimation • Changing T • Keeping everything else constant • For each person, we observe only one potential outcome: either Y(T = 1) or Y(T = 0) • Across the groups (T = 1 and T = 0), everything else is not constant 20 Table of observed outcomes: P1: T = 1, Y(T = 1) = 0.4; P2: T = 0, Y(T = 0) = 0.6; P3: T = 1, Y(T = 1) = 0.3; P4: T = 0, Y(T = 0) = 0.1; P5: T = 1, Y(T = 1) = 0.5; P6: T = 0, Y(T = 0) = 0.5; P7: T = 0, Y(T = 0) = 0.1 (the other potential outcome of each person is unobserved, “?”)
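As a numerical aside (not on the slides): when treatment is randomized, the two groups are comparable and the ATE can be estimated by a simple difference in group means. A minimal sketch using the observed outcomes from the table above (treated: 0.4, 0.3, 0.5; control: 0.6, 0.1, 0.5, 0.1):

```python
# Difference-in-means ATE estimate. This is only valid when T is
# randomized, because then "everything else" is balanced on average.
treated = [0.4, 0.3, 0.5]       # observed Y for units with T = 1 (P1, P3, P5)
control = [0.6, 0.1, 0.5, 0.1]  # observed Y for units with T = 0 (P2, P4, P6, P7)

ate_hat = sum(treated) / len(treated) - sum(control) / len(control)
print(round(ate_hat, 3))  # 0.4 - 0.325 = 0.075
```

In the observational setting this very slide describes, the same difference is biased, because the confounders are not constant across the two groups; the rest of the section is about fixing exactly that.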
  • 21. Ideal Solution: Counterfactual World • Reason about a world that does not exist • Everything in the counterfactual world is the same as the real world, except the treatment 21 (real world: T = 1; counterfactual world: T = 0)
  • 22. Randomized Experiments are the “Gold Standard” • Drawbacks of randomized experiments: • Cost • Unethical • Unrealistic 22 What can we do when an experiment is not possible? Observational Studies!
  • 23. Recap: Causal Effect and Potential Outcome • Two key points for causal effect estimation • Changing T • Keeping everything else (X) constant • Counterfactual Problem: we observe Y(T = 1) or Y(T = 0), never both • Ideal Solution: Counterfactual World • “Gold Standard”: Randomized Experiments • We will discuss other solutions in the next section. 23
  • 24. Outline • Correlation vs. Causality • Causal Inference • Methods for Causal Inference • Stable Learning • NICO: An Image Dataset for Stable Learning • Conclusions 24
  • 25. Causal Inference with Observational Data • Average Treatment Effect (ATE) represents the mean (average) difference between the potential outcomes of units under treated (T = 1) and control (T = 0) status. • Treated (T = 1): taking a particular medication • Control (T = 0): not taking any medications • ATE: the causal effect of the particular medication 25 ATE = E[Y(T = 1) − Y(T = 0)]
  • 26. Causal Inference with Observational Data • Counterfactual Problem: we observe Y(T = 1) or Y(T = 0), never both • Can we estimate ATE by directly comparing the average outcome between treated and control groups? • Yes with randomized experiments (X are the same) • No with observational data (X might be different) • Two key points: • Changing T (T = 1 and T = 0) • Keeping everything else (confounder X) constant 26 Balancing confounders’ distribution
  • 27. Methods for Causal Inference • Matching • Propensity Score Based Methods • Propensity Score Matching • Inverse of Propensity Weighting (IPW) • Doubly Robust • Data-Driven Variable Decomposition (D2VD) • Directly Confounder Balancing • Entropy Balancing • Approximate Residual Balancing • Differentiated Confounder Balancing (DCB) 27
  • 28. Assumptions of Causal Inference • A1: Stable Unit Treatment Value (SUTVA): the effect of treatment on a unit is independent of the treatment assignment of other units: P(Y_i | T_i, T_j, X_i) = P(Y_i | T_i, X_i) • A2: Unconfoundedness: the distribution of treatment is independent of the potential outcomes given the observed variables: T ⊥ (Y(0), Y(1)) | X (no unmeasured confounders) • A3: Overlap: each unit has a nonzero probability of receiving either treatment status given the observed variables: 0 < P(T = 1 | X = x) < 1 28
  • 29. Methods for Causal Inference • Matching • Propensity Score Based Methods • Propensity Score Matching • Inverse of Propensity Weighting (IPW) • Doubly Robust • Data-Driven Variable Decomposition (D2VD) • Directly Confounder Balancing • Entropy Balancing • Approximate Residual Balancing • Differentiated Confounder Balancing 29
  • 32. Matching • Identify pairs of treated (T = 1) and control (T = 0) units whose confounders X are similar or even identical to each other: Distance(X_i, X_j) ≤ ε • Paired units keep “everything else” (the confounders) approximately constant • Estimate the average causal effect by comparing average outcomes in the paired dataset • Smaller ε: less bias, but higher variance 32
  • 33. Matching • Exact Matching: Distance(X_i, X_j) = 0 if X_i = X_j, and ∞ otherwise • Easy to implement, but limited to low-dimensional settings • In high-dimensional settings, there will be few exact matches 33
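The matching idea can be sketched in a few lines; `att_by_matching`, the 1-nearest-neighbour rule and the toy numbers below are all illustrative (a caliper ε as on the slide could be added as a cutoff on the distance):

```python
# Sketch of 1-nearest-neighbour matching on a scalar confounder x.
# For each treated unit, find the closest control unit in confounder
# space and compare outcomes; averaging the differences over matched
# pairs estimates the effect on the treated (ATT).

def att_by_matching(treated, control):
    """treated/control: lists of (x, y) pairs with scalar confounder x."""
    diffs = []
    for x_t, y_t in treated:
        # nearest control unit in confounder space
        x_c, y_c = min(control, key=lambda u: abs(u[0] - x_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

treated = [(1.0, 2.0), (2.0, 3.0)]  # (confounder, outcome), hypothetical
control = [(1.1, 1.5), (2.2, 2.0)]
print(att_by_matching(treated, control))  # pairs (1.0, 1.1) and (2.0, 2.2)
```

With multi-dimensional X the same code works once `abs(u[0] - x_t)` is replaced by a vector distance, which is exactly where the curse of dimensionality noted on the slide bites.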
  • 34. Methods for Causal Inference • Matching • Propensity Score Based Methods • Propensity Score Matching • Inverse of Propensity Weighting (IPW) • Doubly Robust • Data-Driven Variable Decomposition (D2VD) • Directly Confounder Balancing • Entropy Balancing • Approximate Residual Balancing • Differentiated Confounder Balancing 34
  • 35. Propensity Score Based Methods • The propensity score e(X) = P(T = 1 | X) is the probability of a unit being treated • Rubin shows that the propensity score is sufficient to control or summarize the information of confounders: T ⫫ X | e(X) and T ⫫ (Y(1), Y(0)) | e(X) • The propensity score is rarely observed and needs to be estimated 35
  • 36. Propensity Score Matching • Estimating the propensity score ê(X) = P̂(T = 1 | X): • Supervised learning: predicting the known label T from the observed covariates X • Conventionally, logistic regression • Matching pairs by distance between propensity scores: Distance(X_i, X_j) = |ê(X_i) − ê(X_j)| ≤ ε • High-dimensional challenge: transferred from matching to propensity score estimation 36
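Since propensity-score estimation is ordinary supervised learning, it can be sketched from scratch; `fit_propensity_1d`, the single covariate and the toy data are hypothetical (in practice one would use an off-the-shelf logistic regression):

```python
import math

def fit_propensity_1d(X, T, lr=0.5, steps=5000):
    """Fit e(x) = sigmoid(a + b*x) = P(T = 1 | x) by gradient ascent
    on the logistic log-likelihood (toy, one covariate)."""
    a, b = 0.0, 0.0
    n = len(X)
    for _ in range(steps):
        ga = gb = 0.0
        for x, t in zip(X, T):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += (t - p) / n       # gradient w.r.t. intercept
            gb += (t - p) * x / n   # gradient w.r.t. slope
        a += lr * ga
        b += lr * gb
    return lambda x: 1.0 / (1.0 + math.exp(-(a + b * x)))

# Units with x = 1 are treated more often (2/3) than units with x = 0 (1/3).
X = [0, 0, 0, 1, 1, 1]
T = [0, 1, 0, 1, 1, 0]
e_hat = fit_propensity_1d(X, T)
```

Matching then reduces to the one-dimensional problem of pairing units with close `e_hat` values, which is the whole appeal of the propensity score.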
  • 37. Methods for Causal Inference • Matching • Propensity Score Based Methods • Propensity Score Matching • Inverse of Propensity Weighting (IPW) • Doubly Robust • Data-Driven Variable Decomposition (D2VD) • Directly Confounder Balancing • Entropy Balancing • Approximate Residual Balancing • Differentiated Confounder Balancing 37
  • 38. Inverse of Propensity Weighting (IPW) • Why is weighting by the inverse of the propensity score e(X) = P(T = 1 | X) helpful? • The propensity score induces a distribution bias on the confounders X. Example: stratum A: e = 0.7, 10 units (7 treated, 3 control); stratum B: e = 0.6, 50 units (30 treated, 20 control); stratum C: e = 0.2, 40 units (8 treated, 32 control). • Reweighting each unit by w_i = T_i/e_i + (1 − T_i)/(1 − e_i) gives every stratum the same effective size in both groups (A: 10 vs. 10; B: 50 vs. 50; C: 40 vs. 40): the confounders are the same! 38
  • 39. Inverse of Propensity Weighting (IPW) • Estimating ATE by IPW [1] with weights w_i = T_i/e_i + (1 − T_i)/(1 − e_i) • Interpretation: IPW creates a pseudo-population where the confounders are the same between treated and control groups. • Why does this work? Consider 39
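The IPW estimator can be sketched directly; the two-stratum toy dataset below is hypothetical and constructed so the true ATE is 1:

```python
def ipw_ate(T, Y, e):
    """IPW estimate of E[Y(1)] - E[Y(0)] with known propensity scores e_i:
    mean(T_i * Y_i / e_i) - mean((1 - T_i) * Y_i / (1 - e_i))."""
    n = len(T)
    mu1 = sum(t * y / p for t, y, p in zip(T, Y, e)) / n
    mu0 = sum((1 - t) * y / (1 - p) for t, y, p in zip(T, Y, e)) / n
    return mu1 - mu0

# Two strata: e = 0.8 with Y(1) = 2, Y(0) = 1; e = 0.2 with Y(1) = 3, Y(0) = 2.
T = [1, 1, 1, 1, 0,  1, 0, 0, 0, 0]
Y = [2, 2, 2, 2, 1,  3, 2, 2, 2, 2]
e = [0.8] * 5 + [0.2] * 5
print(ipw_ate(T, Y, e))  # recovers the true ATE (approx. 1.0)
```

Note the naive difference in means on the same data is badly biased, because treatment concentrates in the high-e stratum; the inverse weights undo exactly that imbalance.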
  • 40. Inverse of Propensity Weighting (IPW) • If e(X) = P(T = 1 | X) is the true propensity score, then with Y = T · Y(1) + (1 − T) · Y(0) and T ⊥ (Y(1), Y(0)) | X, the weighted treated mean recovers E[Y(1)] • Similarly for the control term, so ATE = E[Y(1) − Y(0)] is recovered 40
  • 41. Inverse of Propensity Weighting (IPW) • If e(X) is the true propensity score, the IPW estimator is unbiased • Widely used in many applications • But it requires the propensity score model to be correct • High variance when e(X) is close to 0 or 1 41
  • 42. Methods for Causal Inference • Matching • Propensity Score Based Methods • Propensity Score Matching • Inverse of Propensity Weighting (IPW) • Doubly Robust • Data-Driven Variable Decomposition (D2VD) • Directly Confounder Balancing • Entropy Balancing • Approximate Residual Balancing • Differentiated Confounder Balancing 42
  • 43. Doubly Robust • Recap: ATE = E[Y(T = 1) − Y(T = 0)] • Simple outcome regression: • Unbiased if the regression models are correct • IPW estimator: • Unbiased if the propensity score model is correct • Doubly Robust [2]: combines both approaches 43
  • 44. Doubly Robust • Estimating ATE with Doubly Robust estimator: • Unbiased if either propensity score or regression model is correct • This property is referred to as double robustness 44
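The doubly robust (augmented IPW) estimator combines outcome regressions m1(x) ≈ E[Y | T = 1, x] and m0(x) ≈ E[Y | T = 0, x] with inverse propensity weighting. A sketch on a hypothetical two-stratum dataset (true ATE = 1), with a deliberately wrong propensity model but correct outcome models, to illustrate the robustness property:

```python
def aipw_ate(T, Y, e, m1, m0):
    """Augmented IPW (doubly robust) ATE: unbiased if either the
    propensity scores e or the outcome predictions (m1, m0) are correct."""
    n = len(T)
    mu1 = sum(m + t * (y - m) / p
              for t, y, p, m in zip(T, Y, e, m1)) / n
    mu0 = sum(m + (1 - t) * (y - m) / (1 - p)
              for t, y, p, m in zip(T, Y, e, m0)) / n
    return mu1 - mu0

# Stratum 1: Y(1) = 2, Y(0) = 1; stratum 2: Y(1) = 3, Y(0) = 2.
T = [1, 1, 1, 1, 0,  1, 0, 0, 0, 0]
Y = [2, 2, 2, 2, 1,  3, 2, 2, 2, 2]
m1 = [2] * 5 + [3] * 5   # correct outcome predictions under treatment
m0 = [1] * 5 + [2] * 5   # correct outcome predictions under control
e_wrong = [0.5] * 10     # misspecified propensity scores
print(aipw_ate(T, Y, e_wrong, m1, m0))  # still recovers ATE = 1.0
```

Because the outcome model is exact here, the IPW correction terms vanish and the wrong propensity scores do no harm; swapping the roles (correct e, wrong m1/m0) works too, which is the double robustness the slide describes.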
  • 46. Doubly Robust • Estimating ATE with Doubly Robust estimator: • Unbiased if propensity score or regression model is correct • This property is referred to as double robustness • But may be very biased if both models are incorrect 46
  • 47. Propensity Score based Methods •Recap: • Propensity Score Matching • Inverse of Propensity Weighting • Doubly Robust •Need to estimate propensity score • Treat all observed variables as confounders • In Big Data Era, High dimensional data • But, not all variables are confounders 47
  • 48. Propensity Score based Methods •Recap: • Propensity Score Matching • Inverse of Propensity Weighting • Doubly Robust •Need to estimate propensity score • Treat all observed variables as confounders • In Big Data Era, High dimensional data • But, not all variables are confounders 48 How to automatically separate the confounders?
  • 49. Methods for Causal Inference • Matching • Propensity Score Based Methods • Propensity Score Matching • Inverse of Propensity Weighting (IPW) • Doubly Robust • Data-Driven Variable Decomposition (D2VD) • Directly Confounder Balancing • Entropy Balancing • Approximate Residual Balancing • Differentiated Confounder Balancing (DCB) 49
  • 50. 50 Inverse of Propensity Weighting (IPW) • Treat all observed variables U as confounders X • Propensity Score Estimation: • Adjusted Outcome: • IPW ATE Estimator:
  • 51. Data-Driven Variable Decomposition (D2VD) 51 • Separateness Assumption: • All observed variables U can be decomposed into three sets: Confounders X, Adjustment Variables Z, and Irrelevant variables I (Omitted). • Propensity Score Estimation: • Adjusted Outcome: • Our D2VD ATE Estimator:
  • 52. Data-Driven Variable Decomposition (D2VD) • Confounders Separation & ATE Estimation. • With our D2VD estimator: • By minimizing following objective function: • We can estimate the ATE as: 52
  • 53. Data-Driven Variable Decomposition (D2VD) 53 • Adjustment variables, confounders and the treatment effect are each identified from the learned coefficients, with X and Z replaced by the full observed variable set U
  • 54. Data-Driven Variable Decomposition (D2VD) 54 Bias Analysis: Our D2VD algorithm is unbiased to estimate causal effect Variance Analysis: The asymptotic variance of Our D2VD algorithm is smaller
  • 55. Data-Driven Variable Decomposition (D2VD) •OUR: Data-Driven Variable Decomposition (D2VD) •Baselines •Directly Estimator (dir): ignores confounding bias •IPW Estimator (IPW): treats all variables as confounders •Doubly Robust Estimator (DR): IPW+regression •Non-Separation Estimator (D2VD-): no variables separation 55
  • 56. Data-Driven Variable Decomposition (D2VD) • Dataset generation: • Sample size m={1000,5000} • Dimension of observed variables n={50,100,200} • Observed variables: • Treatment: logistic and misspecified • Outcome: 56
  • 57. Data-Driven Variable Decomposition (D2VD) • Dataset generation: • Sample size m={1000,5000} • Dimension of observed variables n={50,100,200} • Observed variables: • Treatment: logistic and misspecified • Outcome: 57 The true treatment effect in synthetic data is 1.
  • 58. Data-Driven Variable Decomposition (D2VD) • Experimental Results on Synthetic Data: 58
  • 59. Data-Driven Variable Decomposition (D2VD) • Experimental Results on Synthetic Data: 59 1. The direct estimator fails under all settings. 2. IPW and DR estimators are good when T = Tlogit, but poor when T = Tmissp. 3. D2VD(-) has no variable separation and gets results similar to the DR estimator. 4. D2VD improves accuracy and reduces variance for ATE estimation.
  • 60. Data-Driven Variable Decomposition (D2VD) • Experimental Results on Synthetic Data: 60 TPR: true positive rate TNR: true negative rate Our D2VD algorithm can precisely separate the confounders and adjustment variables.
  • 61. Experiments on Real World Data • Dataset Description: • Online advertising campaign (LONGCHAMP) • Users Feedback: 14,891 LIKE; 93,108 DISLIKE • 56 Features for each user • Age, gender, #friends, device, user setting on WeChat • Experimental Setting: • Outcome Y: users feedback • Treatment T: one feature • Observed Variables U: other features 61 2015 Y = 1, if LIKE Y = 0, if DISLIKE
  • 62. Experiments Results • ATE Estimation. 62 1. Our D2VD estimator estimates the ATE more accurately. 2. Our D2VD estimator reduces the variance of the estimated ATE. 3. Younger women are more likely to like the LONGCHAMP ads.
  • 63. Experiments Results • Variables Decomposition. 63 1. The confounders are mainly the other ways of adding friends on WeChat. 2. The adjustment variables have significant effects on the outcome. 3. Our D2VD algorithm can precisely separate the confounders and adjustment variables.
  • 64. Summary: Propensity Score based Methods • Propensity Score Matching (PSM): • Units matched by their propensity score • Inverse of Propensity Weighting (IPW): • Units reweighted by the inverse of the propensity score • Doubly Robust (DR): • Combining IPW and regression • Data-Driven Variable Decomposition (D2VD): • Automatically separates the confounders and adjustment variables • Confounders: estimate the propensity score for IPW • Adjustment variables: regression on the outcome for reducing variance • Improves accuracy and reduces variance of treatment effect estimation • But these methods require the propensity score model e(X) = P(T = 1 | X) to be correct 64 PSM, IPW and DR treat all observed variables as confounders, ignoring non-confounders
  • 65. Methods for Causal Inference • Matching • Propensity Score Based Methods • Propensity Score Matching • Inverse of Propensity Weighting (IPW) • Doubly Robust • Data-Driven Variable Decomposition (D2VD) • Directly Confounder Balancing • Entropy Balancing • Approximate Residual Balancing • Differentiated Confounder Balancing (DCB) 65
  • 66. Causal Inference with Observational Data • Average Treatment Effect (ATE): ATE = E[Y(T = 1) − Y(T = 0)] • Average Treatment effect on the Treated (ATT): ATT = E[Y(1) | T = 1] − E[Y(0) | T = 1] • Two key points: • Changing T (T = 1 and T = 0) • Keeping everything else (confounder X) constant 66
  • 67. Causal Inference with Observational Data • Average Treatment Effect (ATE): ATE = E[Y(T = 1) − Y(T = 0)] • Average Treatment effect on the Treated (ATT): ATT = E[Y(1) | T = 1] − E[Y(0) | T = 1] • Two key points: • Changing T (T = 1 and T = 0) • Keeping everything else (confounder X) constant 67 Balancing confounders’ distribution
  • 68. Directly Confounder Balancing • Recap: propensity score based methods • Sample reweighting with w_i = T_i/e_i + (1 − T_i)/(1 − e_i) for confounder balancing • But this requires the propensity score model to be correct • Weights become very large when the propensity score is close to 0 or 1 • Can we directly learn sample weights that balance the confounders’ distribution between treated and control? 68 Yes!
  • 69. Directly Confounder Balancing • Motivation: the collection of all the moments of a set of variables uniquely determines their distribution. • Method: learn sample weights that directly balance the confounders’ moments, matching the weighted first moments of X on the control group to the first moments of X on the treated group 69 With moments, the sample weights can be learned without any model specification.
  • 70. Directly Confounder Balancing • Motivation: the collection of all the moments of a set of variables uniquely determines their distribution. • Method: learn sample weights that match the weighted first moments of X on the control group to the first moments of X on the treated group • Estimating ATT by comparing the treated mean outcome with the weighted control mean outcome 70
  • 71. Methods for Causal Inference • Matching • Propensity Score Based Methods • Propensity Score Matching • Inverse of Propensity Weighting (IPW) • Doubly Robust • Data-Driven Variable Decomposition (D2VD) • Directly Confounder Balancing • Entropy Balancing • Approximate Residual Balancing • Differentiated Confounder Balancing (DCB) 71
  • 72. Entropy Balancing • Directly balances confounders with sample weights W • Maximizes the entropy of the sample weights W subject to the balancing constraints • But treats all variables as confounders and balances them equally 72
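In one dimension, entropy balancing can be sketched concretely: the maximum-entropy weights take the exponential-tilting form w_i ∝ exp(λ·x_i), and λ is chosen so the weighted control mean of the confounder matches the treated mean. The helper, the bisection solver and the toy data below are illustrative:

```python
import math

def entropy_balance_1d(x_control, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Entropy-balancing weights w_i ∝ exp(lam * x_i) on the control
    group, with lam chosen so that sum_i w_i * x_i = target_mean
    (the treated-group mean of the confounder)."""
    def tilted_mean(lam):
        ws = [math.exp(lam * x) for x in x_control]
        s = sum(ws)
        return sum(w * x for w, x in zip(ws, x_control)) / s

    for _ in range(iters):  # tilted_mean is increasing in lam -> bisection
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    ws = [math.exp(lam * x) for x in x_control]
    s = sum(ws)
    return [w / s for w in ws]

x_control = [0.0, 1.0, 2.0]
w = entropy_balance_1d(x_control, target_mean=1.5)  # treated mean, hypothetical
```

The ATT is then the treated mean outcome minus the w-weighted control mean outcome. With many confounders, λ becomes a vector solved jointly, and, as the slide notes, every coordinate is balanced with equal priority.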
  • 73. Approximate Residual Balancing • 1. Compute approximate balancing weights W • 2. Fit β in the linear outcome model using a lasso or elastic net • 3. Estimate the ATT by combining the weights with the fitted model • Double robustness: unbiased if either the confounder balancing is exact or the regression model is correct • But treats all variables as confounders and balances them equally 73
  • 74. Directly Confounder Balancing • Recap: • Entropy Balancing, Approximate Residual Balancing etc. • Moments uniquely determine variables’ distribution • Learning sample weights by balancing confounders’ moments • But, treat all variables as confounders, and balance them equally • Different confounders make different confounding bias 74 The first moments of X on the Control Group The first moments of X on the Treated Group
  • 75. Directly Confounder Balancing • Recap: • Entropy Balancing, Approximate Residual Balancing, etc. • Moments uniquely determine variables’ distributions • Learning sample weights by balancing confounders’ moments (matching the first moments of X on the control group to those on the treated group) • But they treat all variables as confounders and balance them equally • Different confounders induce different confounding bias 75 How to differentiate confounders and their bias?
  • 76. Methods for Causal Inference • Matching • Propensity Score Based Methods • Propensity Score Matching • Inverse of Propensity Weighting (IPW) • Doubly Robust • Data-Driven Variable Decomposition (D2VD) • Directly Confounder Balancing • Entropy Balancing • Approximate Residual Balancing • Differentiated Confounder Balancing (DCB) 76
  • 77. Differentiated Confounder Balancing • Idea: simultaneously learn confounder weights β and sample weights W. • The confounder weights determine which variables are confounders and their contribution to the confounding bias. • The sample weights are designed for confounder balancing. 77 How to learn the confounder weights?
  • 78. Confounder Weights Learning • General relationship among the treatment, the outcome and the observed variables: the confounder weights determine the confounding bias 78 If β_j = 0, then X_j is not a confounder and there is no need to balance it. Different confounders have different confounding weights.
  • 79. Confounder Weights Learning Propositions: • In observational studies, not all observed variables are confounders, and different confounders contribute unequal confounding bias on the ATT, each with its own weight. • The confounder weights can be learned by regressing the potential outcome Y(0) on the augmented variables. 79
  • 80. Differentiated Confounder Balancing • Objective Function 81 The ENT [3] and ARB [4] algorithms are special cases of our DCB algorithm, obtained by setting the confounder weights to the unit vector. Our DCB algorithm is more general for treatment effect estimation.
  • 81. Differentiated Confounder Balancing • Algorithm • Training complexity: O(n·d), where n is the sample size and d the dimension of the variables 82 In each iteration, we first update the sample weights with the confounder weights fixed, and then update the confounder weights with the sample weights fixed
  • 82. Experiments •Experimental Tasks: ØRobustness Test (high-dimensional and noisy) ØAccuracy Test (real world dataset) ØPredictive Power Test (real ad application) 83
  • 83. Experiments • Baselines: • Directly Estimator: comparing average outcome between treated and control units. • IPW Estimator [1]: reweighting via inverse of propensity score • Doubly Robust Estimator [2]: IPW + regression method • Entropy Balancing Estimator [3]: directly confounder balancing with entropy loss • Approximate Residual Balancing [4]: confounder balancing + regression • Evaluation Metric: 84
  • 84. Experiments - Robustness Test • Dataset • Sample size: n = {2000, 5000} • Variables’ dimensions: d = {50, 100} • Observed Variables • Treatment: from a logistic function (Tlogit) and a misspecified function (Tmissp) • Confounding rate: the ratio of confounders to all observed variables • Confounding strength: the bias strength of the confounders • Outcome: from a linear function (Ylinear) and a nonlinear function (Ynonlin) 85
  • 85. Experiments - Robustness Test • The direct estimator fails in all settings, since it ignores confounding bias. • IPW and DR estimators make huge errors when facing high-dimensional variables or when the model specifications are incorrect. • ENT and ARB estimators perform poorly since they balance all variables equally. 86 See our paper for more results!
  • 86. Experiments - Robustness Test 87 Our DCB estimator achieves significant improvements over the baselines in different settings. More results see our paper! Our DCB estimator is very robust!
  • 87. Experiments - Robustness Test 88 The MAE of our DCB estimator is consistently stable and small across: • Sample size • Dimension of variables • Confounding rate • Confounding strength
  • 88. Experiments - Robustness Test 89 Our DCB algorithm is very robust for treatment effect estimation.
  • 89. Experiments - Accuracy Test • LaLonde Dataset [5]: would the job training program increase people’s earnings in the year 1978? • Randomized experiments: provide the ground truth of the treatment effect • Observational studies: check the performance of all estimators • Experimental Setting: • V-RAW: 10 raw observed variables, including employment, education, age, ethnicity and marital status. • V-INTERACTION: the raw variables, their pairwise one-way interactions, and their squared terms. 90
  • 90. Experiments - Accuracy Test 91 Results of ATT estimation: our DCB estimator is more accurate than the baselines, and achieves better confounder balancing under the V-INTERACTION setting.
  • 91. Experiments - Predictive Power • Dataset Description: • Online advertising campaign (LONGCHAMP) • Users Feedback: 14,891 LIKE; 93,108 DISLIKE • 56 Features for each user • Age, gender, #friends, device, user settings on WeChat • Experimental Setting: • Outcome Y: users feedback • Treatment T: one feature • Observed Variables X: other features 93 2015 Y = 1, if LIKE Y = 0, if DISLIKE Select the top k features with high causal effect for prediction
  • 92. Experiments - Predictive Power • Two correlation-based feature selection baselines: • MRel [6]: maximum relevance • mRMR [7]: Maximum relevance and minimum redundancy. 94 ØOur DCB estimator achieves the best prediction accuracy. ØCorrelation based methods perform worse than causal methods.
  • 93. Summary: Directly Confounder Balancing • Motivation: moments can uniquely determine a distribution • Entropy Balancing • Confounder balancing while maximizing the entropy of the sample weights • Approximate Residual Balancing • Combines confounder balancing and regression for double robustness • Both treat all variables as confounders and balance them equally • But different confounders induce different bias • Differentiated Confounder Balancing (DCB) • Theoretical proof of the necessity of differentiating confounders • Improves the accuracy and robustness of treatment effect estimation 95
  • 94. Sectional Summary: Methods for Causal Inference • Matching (limited to low-dimensional settings) • Propensity Score Based Methods (treat all observed variables as confounders) • Propensity Score Matching • Inverse of Propensity Weighting (IPW) • Doubly Robust • Data-Driven Variable Decomposition (D2VD) (not all observed variables are confounders) • Directly Confounder Balancing (balances all confounders equally) • Entropy Balancing • Approximate Residual Balancing • Differentiated Confounder Balancing (DCB) (different confounders make different bias) 96
  • 95. Sectional Summary: Methods for Causal Inference 97 • Progress has been made to draw causality from big data. • From single to group • From binary to continuous • Weak assumptions Ready for Learning?
  • 96. Outline ØCorrelation v.s. Causality ØCausal Inference ØStable Learning ØNICO: An Image Dataset for Stable Learning ØFuture Directions and Conclusions 98
• 97. Stability and Prediction 99 [Diagram: True Model → Learning Process → Prediction Performance, contrasting Traditional Learning with Stable Learning] Bin Yu (2016), Three Principles of Data Science: predictability, computability, stability
• 98. Stable Learning 100 Training on Distribution 1 → Model → Testing on Distribution 1, Distribution 2, Distribution 3, …, Distribution n → Accuracy 1, Accuracy 2, Accuracy 3, …, Accuracy n • I.I.D. Learning: training and testing distributions match • Transfer Learning: testing distribution known in advance • Stable Learning: optimize both the mean and VAR(Acc) of accuracies across unknown testing distributions
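The VAR(Acc) criterion on this slide can be made concrete with a small sketch; the two models' per-distribution accuracies below are invented for illustration:

```python
import statistics

def stability_metrics(accuracies):
    """Mean accuracy (raw performance) and VAR(Acc) across test
    environments; stable learning aims to keep the mean high while
    driving the variance down."""
    return statistics.mean(accuracies), statistics.pvariance(accuracies)

# Hypothetical per-distribution accuracies of two models (Distributions 1..4).
iid_model = [0.92, 0.70, 0.55, 0.83]      # strong on its training distribution only
stable_model = [0.80, 0.78, 0.76, 0.79]   # slightly lower mean, far smaller VAR(Acc)

print(stability_metrics(iid_model))
print(stability_metrics(stable_model))
```

Under this criterion the second model is preferable even though its best-case accuracy is lower.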
• 99. Stability and Robustness • Robustness • More on prediction performance over data perturbations • Prediction performance-driven • Stability • More on the true model • Lays more emphasis on bias • Sufficient for robustness 101 Stable learning is an (intrinsic?) way to realize robust prediction
  • 100. Domain Generalization / Invariant Learning 105 • Given data from different observed environments : • The task is to predict Y given X such that the prediction works well (is “robust”) for “all possible” (including unseen) environments
• 101. Domain Generalization • Assumption: the conditional probability P(Y|X) is stable or invariant across different environments. • Idea: take knowledge acquired from a number of related domains and apply it to previously unseen domains • Theorem: under reasonable technical assumptions, the risk on unseen domains is bounded with high probability. 106 Muandet K, Balduzzi D, Schölkopf B. Domain generalization via invariant feature representation. ICML 2013.
• 102. Invariant Prediction • Invariant Assumption: there exists a subset S ⊆ X that is causal for the prediction of Y, and the conditional distribution P(Y|S) is stable across all environments. • Idea: linking to causality • Structural Causal Model (Pearl 2009): • The parent variables of Y in the SCM satisfy the Invariant Assumption • The causal variables lead to invariance w.r.t. "all" possible environments 107 Peters, J., Bühlmann, P., & Meinshausen, N. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
• 103. From Variable Selection to Sample Reweighting 108 X T Y Typical Causal Framework Sample reweighting can make a variable independent of the other variables. Given a feature T: assign different weights to samples so that the samples with T and the samples without T have similar distributions in X; then calculate the difference of the Y distribution between the treated and control groups (the correlation between T and Y).
• 104. Global Balancing: Decorrelating Variables 109 X T Y Typical Causal Framework The partial effect can then be regarded as a causal effect, and predicting with causal variables is stable across different environments. Given ANY feature T: assign different weights to samples so that the samples with T and the samples without T have similar distributions in X; then calculate the difference of the Y distribution between the treated and control groups (the correlation between T and Y). Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.
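To illustrate what global balancing aims at: for low-dimensional discrete features, the "oracle" weights are the density ratio between the product of the marginals and the joint distribution, under which all features become mutually independent. A toy sketch of that target (not the KDD 2018 learning algorithm, which optimizes a balancing loss instead):

```python
import numpy as np
from collections import Counter

def decorrelating_weights(X):
    """Weights w(x) = prod_j P(x_j) / P(x_1, ..., x_p): reweighting by this
    density ratio makes the columns of a discrete design matrix mutually
    independent, which is the goal of global balancing. Only feasible by
    direct counting in this low-dimensional binary toy setting."""
    n, p = X.shape
    joint = Counter(map(tuple, X))                  # joint cell frequencies
    marg = [Counter(X[:, j]) for j in range(p)]     # marginal frequencies
    w = np.empty(n)
    for i, row in enumerate(X):
        p_joint = joint[tuple(row)] / n
        p_prod = np.prod([marg[j][row[j]] / n for j in range(p)])
        w[i] = p_prod / p_joint
    return w * n / w.sum()                          # normalize to mean 1

def weighted_cov(x, y, w):
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    return np.average((x - mx) * (y - my), weights=w)

# Two binary features coupled through a shared latent variable.
rng = np.random.default_rng(0)
n = 5000
latent = rng.normal(size=n)
X = np.stack([(latent + rng.normal(size=n) > 0).astype(int) for _ in range(2)], axis=1)

w = decorrelating_weights(X)
print(weighted_cov(X[:, 0], X[:, 1], np.ones(n)))  # clearly positive
print(weighted_cov(X[:, 0], X[:, 1], w))           # ≈ 0 after balancing
```

After reweighting, the empirical joint distribution factorizes into its marginals, so the weighted covariance between the two features vanishes.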
• 105. Theoretical Guarantee 110 As the global balancing loss → 0, all variables become decorrelated under the learned weights. Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.
• 106. Causal Regularizer 111 Set feature j as the treatment variable. Zheyan Shen, Peng Cui, Kun Kuang, Bo Li. Causally Regularized Learning with Agnostic Data Selection Bias. ACM MM, 2018.
• 107. Causally Regularized Logistic Regression 112 Zheyan Shen, Peng Cui, Kun Kuang, Bo Li. Causally Regularized Learning with Agnostic Data Selection Bias. ACM MM, 2018.
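A minimal sketch of a CRLR-style objective: weighted logistic log-loss plus a balancing penalty that treats every feature in turn as the treatment. The function names and the exact penalty form are our simplification of the paper's objective, which alternates optimization over the coefficients and the sample weights:

```python
import numpy as np

def causal_regularizer(W, X):
    """Balancing penalty: for each binary feature j treated in turn as the
    treatment, the W-weighted means of the remaining features must match
    between the j = 1 and j = 0 groups."""
    reg = 0.0
    for j in range(X.shape[1]):
        t = X[:, j]
        rest = np.delete(X, j, axis=1)
        w1, w0 = W * t, W * (1 - t)
        diff = rest.T @ w1 / max(w1.sum(), 1e-8) - rest.T @ w0 / max(w0.sum(), 1e-8)
        reg += np.sum(diff ** 2)
    return reg

def crlr_objective(beta, W, X, y, lam):
    """Sample-weighted logistic log-loss plus the causal regularizer;
    the numerically stable form of log(1 + exp(z)) is used."""
    z = X @ beta
    log_loss = np.sum(W * (np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0) - y * z))
    return log_loss + lam * causal_regularizer(W, X)

# Toy binary design and labels, uniform sample weights.
rng = np.random.default_rng(0)
X = (rng.random((50, 4)) > 0.5).astype(float)
y = (rng.random(50) > 0.5).astype(float)
print(crlr_objective(np.zeros(4), np.ones(50), X, y, lam=1.0))
```

Both terms are non-negative, so the objective can be minimized jointly over beta (prediction) and W (balancing).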
• 108. From Shallow to Deep - DGBR 113 Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, Bo Li. Stable Prediction across Unknown Environments. KDD, 2018.
• 109. Experiment 1 – non-i.i.d. image classification 114
  • 111. Experimental Result - insights 116
  • 112. From Causal problem to Learning problem 118 • Previous logic: • More direct logic: Sample Reweighting Independent Variables Causal Variable Stable Prediction Sample Reweighting Independent Variables Stable Prediction
• 113. Thinking from the Learning end 119 Two estimated coefficient vectors are compared: one with small error and one with large error under distribution shift. Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
• 114. Stable Learning of Linear Models • Consider linear regression with misspecification bias: y = Xβ + b(X) + ε, where the bias term has bound |b(x)| ≤ δ. • By accurately estimating β with the property that b(x) is uniformly small for all x, we can achieve stable learning. • However, the estimation error caused by the misspecification term can be as bad as δ/√λmin, where λmin is the smallest eigenvalue of the centered covariance matrix; this goes to infinity when perfect collinearity exists! 120 Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
• 115. Toy Example • Assume the design matrix X consists of two variables X1, X2, generated from a multivariate normal distribution with correlation ρ. • By changing ρ, we can simulate different extents of collinearity. • To induce bias related to collinearity, we generate the bias term b(X) with b(X) = Xv, where v is the eigenvector of the centered covariance matrix corresponding to its smallest eigenvalue λ2. • The bias term is sensitive to collinearity. 121 Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
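The toy example can be reproduced in a few lines. One assumption of ours: the bias term is rescaled by √λmin so that its magnitude stays fixed at δ, matching the bound on the previous slide; the OLS parameter error then behaves like δ/√λmin and blows up as ρ → 1:

```python
import numpy as np

def ols_error(rho, n=2000, delta=0.5, seed=0):
    """Toy-example sketch: X1, X2 ~ N(0, [[1, rho], [rho, 1]]),
    y = X beta + b(X) + noise with b(X) = delta * Xv / sqrt(lambda_min),
    v the smallest-eigenvalue eigenvector of the covariance. Returns the
    L2 error of the OLS estimate against the true beta."""
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal([0, 0], [[1.0, rho], [rho, 1.0]], size=n)
    eigval, eigvec = np.linalg.eigh(np.cov(X.T))
    v = eigvec[:, 0]                              # smallest-eigenvalue direction
    beta = np.array([1.0, 1.0])
    y = X @ beta + delta * (X @ v) / np.sqrt(eigval[0]) + 0.1 * rng.normal(size=n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.linalg.norm(beta_hat - beta))

for rho in (0.0, 0.9, 0.99):
    print(rho, ols_error(rho))   # error grows sharply with collinearity
```

With ρ = 0 the error stays near δ; with ρ = 0.99 the smallest eigenvalue is tiny and the error is an order of magnitude larger.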
• 116. Simulation Results 122 Large error (estimation bias) and large variance across different distributions as collinearity increases. Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
• 117. Reducing collinearity by sample reweighting 123 Idea: learn a new set of sample weights w(x) to decorrelate the input variables and increase the smallest eigenvalue. • Weighted least squares estimation, which is equivalent to ordinary least squares on the reweighted distribution. So, how to find an "oracle" distribution which holds the desired property? Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
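The weighted least squares step can be written in closed form; a minimal sketch:

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """WLS from the slide: argmin_beta sum_i w_i * (y_i - x_i beta)^2,
    solved in closed form as (X^T W X)^{-1} X^T W y. With weights that
    decorrelate the design, the smallest eigenvalue of the weighted
    covariance grows and the misspecification bias shrinks."""
    Xw = X * w[:, None]                  # scale each row by its sample weight
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

# On noiseless linear data, any positive weighting recovers beta exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta
w = rng.random(100) + 0.5
print(weighted_least_squares(X, y, w))
```

The interesting case is the misspecified one from the previous slides, where a good choice of w changes the answer; here the example only checks the estimator itself.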
• 118. Sample Reweighted Decorrelation Operator (cont.) 124 Decorrelation: each column of the new design matrix is built from row indices drawn from 1 … n at random, independently per column. • By treating the different columns independently while performing random resampling, we can obtain a column-decorrelated design matrix with the same marginals as before. • Then we can use density ratio estimation to get w(x). Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
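The column-wise resampling step can be sketched directly; it preserves each marginal while removing cross-column dependence (the subsequent density ratio estimation between the original and resampled designs, which yields w(x), is omitted here):

```python
import numpy as np

def column_resample(X, seed=0):
    """The decorrelation step of SRDO (a sketch): resample each column
    independently with replacement. The result keeps every marginal
    distribution but breaks the dependence between columns."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    return np.column_stack([X[rng.integers(0, n, size=n), j] for j in range(p)])

# Two strongly collinear columns built from a shared latent variable.
rng = np.random.default_rng(1)
n = 10000
latent = rng.normal(size=n)
X = np.column_stack([latent + rng.normal(size=n), latent + rng.normal(size=n)])

X_dec = column_resample(X)
print(np.corrcoef(X.T)[0, 1])      # strongly correlated columns
print(np.corrcoef(X_dec.T)[0, 1])  # near zero after resampling
```

A density ratio estimator (e.g. a probabilistic classifier distinguishing X from X_dec) would then provide the sample weights for the WLS step.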
  • 119. Experimental Results • Simulation Study 125 Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
• 120. Experimental Results • Regression • Classification 126 Zheyan Shen, Peng Cui, Tong Zhang. Stable Learning of Linear Models via Sample Reweighting. (under review)
• 121. Disentanglement Representation Learning • Learning Multiple Levels of Abstraction • The big payoff of deep learning is to allow learning higher levels of abstraction • Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer 127 Yoshua Bengio, From Deep Learning of Disentangled Representations to Higher-level Cognition (2019). YouTube. Retrieved 22 February 2019. From decorrelating input variables to learning disentangled representations
• 122. Disentanglement for Causality • Causal / mechanism independence • Independently Controllable Factors (Thomas, Bengio et al., 2017) • Optimize both a policy πk and a representation fk to minimize a selectivity loss: the policy πk selectively changes one factor, and the representation fk corresponds to its value 128 Requires subtle design of the policy set to guarantee causality.
• 123. Sectional Summary 129 p Causal inference provides valuable insights for stable learning p A complete causal structure captures the data generation process, necessarily leading to stable prediction p Stable learning can also help to advance causal inference p Performance-driven and practical applications Benchmarks are important!
  • 124. Outline ØCorrelation v.s. Causality ØCausal Inference ØStable Learning ØNICO: An Image Dataset for Stable Learning ØFuture Directions and Conclusions 130
• 125. Non-I.I.D. Image Classification • Non-I.I.D. image classification: P(Dtrain = (Xtrain, Ytrain)) ≠ P(Dtest = (Xtest, Ytest)), where Dtrain is known and Dtest is unknown • Two tasks • Targeted Non-I.I.D. Image Classification • Prior knowledge on testing data is available • e.g. transfer learning, domain adaptation • General Non-I.I.D. Image Classification • Testing distribution is unknown, no prior • More practical & realistic 131
• 126. Existence of Non-I.I.D.-ness • One metric (NI) for Non-I.I.D.-ness: the normalized distance between training and testing feature distributions (with a normalization term to keep it comparable across classes) • Existence of Non-I.I.D.-ness on a dataset consisting of 10 subclasses from ImageNet • For each class: split into training data and testing data, train a CNN for prediction 132 Distribution shift is ubiquitous and strongly correlated with prediction error
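A sketch of an NI-style metric, assuming (our reading of the slide) that NI is the normalized distance between the first moments of the training and testing feature distributions; the pooled standard deviation is our choice for the normalization term:

```python
import numpy as np

def non_iid_index(f_train, f_test):
    """NI sketch: distance between the means of training and testing
    feature vectors, normalized per-dimension by the pooled standard
    deviation (our assumption for the normalization term)."""
    mu_tr, mu_te = f_train.mean(axis=0), f_test.mean(axis=0)
    sigma = np.concatenate([f_train, f_test]).std(axis=0) + 1e-8
    return float(np.linalg.norm((mu_tr - mu_te) / sigma))

# Identically distributed splits vs. a mean-shifted testing split.
rng = np.random.default_rng(0)
same = non_iid_index(rng.normal(0, 1, (500, 8)), rng.normal(0, 1, (500, 8)))
shifted = non_iid_index(rng.normal(0, 1, (500, 8)), rng.normal(1, 1, (500, 8)))
print(same, shifted)  # the shifted pair has a much larger NI
```

In the slides the feature vectors come from a CNN's representation layer rather than raw pixels.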
• 127. Related Datasets • ImageNet & PASCAL VOC & MSCOCO • NI is ubiquitous but small on these datasets (average NI around 2.7) • NI is uncontrollable, not friendly for the Non-I.I.D. setting 133 A dataset for Non-I.I.D. image classification is demanded.
• 128. NICO - Non-I.I.D. Image Dataset with Contexts • NICO dataset: • Object labels: e.g. dog • Contextual labels (contexts): the background or scene of an object, e.g. grass/water • Structure of NICO: 2 superclasses (Animal, Vehicle) → 10 classes per superclass (e.g. Dog, Train) → 10 contexts per class (e.g. on grass, on bridge); contexts are diverse, meaningful, and overlapping across classes 134
  • 129. NICO - Non-I.I.D. Image Dataset with Contexts • Data size of each class in NICO • Sample size: thousands for each class • Each superclass: 10,000 images • Sufficient for some basic neural networks (CNN) • Samples with contexts in NICO 135
• 130. Controlling NI on the NICO Dataset • Minimum bias (comparing with ImageNet) • Proportional bias (controllable) • Number of samples in each context • Compositional bias (controllable) • Number of contexts that are observed 136
• 131. Minimum Bias • In this setting, random sampling leads to the minimum distribution shift between training and testing distributions, simulating a nearly i.i.d. scenario. • 8000 samples for training and 2000 samples for testing in each superclass (ConvNet) 137 Animal: average NI 3.85, testing accuracy 49.6%; Vehicle: average NI 3.20, testing accuracy 63.0%. Average NI on ImageNet is 2.7, so even in this setting NICO is more non-i.i.d.: its images carry rich contextual information, making classification more challenging.
• 132. Proportional Bias • Given a class, when sampling positive samples, we use all contexts for both training and testing, but the percentage of each context differs between the training and testing datasets. 138 Training: one dominant context (e.g. 55%) with 5% for each of the other nine contexts; testing: 1:1 across contexts. NI increases monotonically as the dominant ratio in the training data grows from 1:1 to 6:1, so we can control NI by varying the dominant ratio.
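The proportional-bias sampling scheme can be sketched as follows; the function name and the ratio convention (dominant share versus each remaining context's share) are our assumptions:

```python
import numpy as np

def proportional_sample(context_labels, dominant, ratio, n_total, seed=0):
    """Proportional-bias sketch: draw a training set in which the dominant
    context receives a `ratio`-times larger share than each of the other
    contexts, as in the slide's dominant-ratio settings."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(context_labels)
    contexts = sorted(set(labels.tolist()))
    shares = {c: (ratio if c == dominant else 1) for c in contexts}
    total = sum(shares.values())
    idx = []
    for c in contexts:
        pool = np.flatnonzero(labels == c)          # images in context c
        k = round(n_total * shares[c] / total)      # this context's quota
        idx.extend(rng.choice(pool, size=k, replace=True))
    return np.array(idx)

# 5 contexts with 200 images each; make context 0 dominant at ratio 6:1.
labels = np.repeat(np.arange(5), 200)
idx = proportional_sample(labels, dominant=0, ratio=6, n_total=120)
print((labels[idx] == 0).mean())  # → 0.6
```

Sweeping `ratio` while keeping the testing split at 1:1 reproduces the controllable-NI experiment on this slide.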
• 133. Compositional Bias • Given a class, the observed contexts differ between training and testing data. 139 Moderate setting (overlap): training observes only a subset of the testing contexts, and NI rises (up to 4.34) as the number of contexts in the training data drops from 7 to 3. Radical setting (no overlap, plus dominant ratio): NI rises further (from 4.44) as the dominant ratio in the training data grows from 1:1 to 5:1. Testing: 1:1 across contexts.
• 134. NICO - Non-I.I.D. Image Dataset with Contexts • NI on NICO is both large and controllable, ranging from small-NI (nearly i.i.d.) to large-NI settings 140
• 135. NICO - Non-I.I.D. Image Dataset with Contexts • The dataset can be downloaded from (temporary address): • https://www.dropbox.com/sh/8mouawi5guaupyb/AAD4fdySrA6fn3PgSmhKwFgva?dl=0 • Please refer to the following paper for details: • Yue He, Zheyan Shen, Peng Cui. NICO: A Dataset Towards Non-I.I.D. Image Classification. https://arxiv.org/pdf/1906.02899.pdf 141
  • 136. Outline ØCorrelation v.s. Causality ØCausal Inference ØStable Learning ØNICO: An Image Dataset for Stable Learning ØConclusions 142
  • 137. Conclusions • Predictive modeling is not only about Accuracy. • Stability is critical for us to trust a predictive model. • Causality has been demonstrated to be useful in stable prediction. • How to marry causality with predictive modeling effectively and efficiently is still an open problem. 143
  • 138. Conclusions 144 Debiasing Prediction Causal Inference Stable Learning Propensity Score Direct Confounder Balancing Global Balancing Linear Stable Learning Disentangled Learning
• 139. Reference • Shen Z, Cui P, Kuang K, et al. Causally regularized learning with agnostic data selection bias[C]//Proceedings of the 2018 ACM Multimedia Conference. ACM, 2018: 411-419. • Kuang K, Cui P, Athey S, et al. Stable prediction across unknown environments[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018: 1617-1626. • Kuang K, Cui P, Li B, et al. Estimating treatment effect in the wild via differentiated confounder balancing[C]//Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017: 265-274. • Kuang K, Cui P, Li B, et al. Treatment effect estimation with data-driven variable decomposition[C]//Thirty-First AAAI Conference on Artificial Intelligence. 2017. • Kuang K, Jiang M, Cui P, et al. Steering social media promotions with effective strategies[C]//2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016: 985-990. 145
• 140. Reference • Pearl J. Causality[M]. Cambridge University Press, 2009. • Austin P C. An introduction to propensity score methods for reducing the effects of confounding in observational studies[J]. Multivariate Behavioral Research, 2011, 46(3): 399-424. • Johansson F, Shalit U, Sontag D. Learning representations for counterfactual inference[C]//International Conference on Machine Learning. 2016: 3020-3029. • Shalit U, Johansson F D, Sontag D. Estimating individual treatment effect: generalization bounds and algorithms[C]//Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017: 3076-3085. • Johansson F D, Kallus N, Shalit U, et al. Learning weighted representations for generalization across designs[J]. arXiv preprint arXiv:1802.08598, 2018. • Louizos C, Shalit U, Mooij J M, et al. Causal effect inference with deep latent-variable models[C]//Advances in Neural Information Processing Systems. 2017: 6446-6456. • Thomas V, Bengio E, Fedus W, et al. Disentangling the independently controllable factors of variation by interacting with the world[J]. arXiv preprint arXiv:1802.09484, 2018. • Bengio Y, Deleu T, Rahaman N, et al. A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms[J]. arXiv preprint arXiv:1901.10912, 2019. 146
  • 141. Reference • Yu B. Stability[J]. Bernoulli, 2013, 19(4): 1484-1500. • Szegedy C, Zaremba W, Sutskever I, et al. Intriguing properties of neural networks[J]. arXiv preprint arXiv:1312.6199, 2013. • Volpi R, Namkoong H, Sener O, et al. Generalizing to unseen domains via adversarial data augmentation[C]//Advances in Neural Information Processing Systems. 2018: 5334-5344. • Ye N, Zhu Z. Bayesian adversarial learning[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Curran Associates Inc., 2018: 6892- 6901. • Muandet K, Balduzzi D, Schölkopf B. Domain generalization via invariant feature representation[C]//International Conference on Machine Learning. 2013: 10-18. • Peters J, Bühlmann P, Meinshausen N. Causal inference by using invariant prediction: identification and confidence intervals[J]. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016, 78(5): 947-1012. • Rojas-Carulla M, Schölkopf B, Turner R, et al. Invariant models for causal transfer learning[J]. The Journal of Machine Learning Research, 2018, 19(1): 1309-1342. • Rothenhäusler D, Meinshausen N, Bühlmann P, et al. Anchor regression: heterogeneous data meets causality[J]. arXiv preprint arXiv:1801.06229, 2018. 147