3. Outline
What’s wrong with machine learning?
A causal proposal
Searching for causality I: observational data
Searching for causality II: multiple environments
Conclusion
4. What succeeds in machine learning?
The recent ImageNet winner (Hu et al., 2017) achieves a super-human top-5 error of 2.2%.
8. What are the reasons for these successes?
Machines deliver impressive performance at
− recognizing objects after training on more images than a human can see,
− translating natural languages after training on more bilingual text than a human can read,
− beating humans at Atari after playing more games than any teenager can endure,
− reigning at Go after playing more grandmaster-level games than mankind
Models consume too much data to solve a single task!
(From Léon Bottou)
14. What fails in machine learning?
(Jabri et al., 2016)
15. What fails in machine learning?
(Szegedy et al., 2013)
16. What fails in machine learning?
(IBM system at ICLR 2017)
17. What are the reasons for these failures?
The big lie in machine learning (as called by Zoubin Ghahramani):
$P_{\mathrm{train}}(X, Y) = P_{\mathrm{test}}(X, Y)$
− focus on interpolation
− out-of-distribution catastrophes
− over-justification of “minimizing the average error”
− emphasize the common, forget the rare
− reckless learning
Horses cheat our statistical estimation problems by using unexpected features
19. This talk in one slide
Predict Y from (X, Z). Process generating labeled training data:
X ← N(0, 1),
Y ← X + N(0, 1)
Z ← Y + N(0, 1).
Least-squares solution: $Y_{\mathrm{LS}} = \frac{X}{2} + \frac{Z}{2}$
Causal solution: $Y_{\mathrm{Cau}} = X$
Predict Y from (X, Z). Process generating unlabeled testing data:
X ← N(0, 1),
Y ← X + N(0, 1)
Z ← Y + N(0, 10).
Least-squares solution breaks at testing time!
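The example on this slide can be checked numerically. The sketch below is only an illustration: it reads the second argument of N(·, ·) as a variance (an assumption, consistent with the closed forms given later in the deck), and the helper names `sample` and `mse` are made up for this example.

```python
import numpy as np

def sample(n, z_noise_var, rng):
    """Draw (X, Y, Z) from the chain X -> Y -> Z.
    Assumption: z_noise_var is the variance of the noise added to Z."""
    x = rng.normal(0.0, 1.0, n)
    y = x + rng.normal(0.0, 1.0, n)
    z = y + rng.normal(0.0, np.sqrt(z_noise_var), n)
    return x, y, z

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

rng = np.random.default_rng(0)
x_tr, y_tr, z_tr = sample(100_000, 1.0, rng)    # training process
x_te, y_te, z_te = sample(100_000, 10.0, rng)   # testing process

# Least squares of Y on (X, Z), fitted on training data only.
# No intercept is needed because all variables are zero-mean.
A_tr = np.column_stack([x_tr, z_tr])
w_ls = np.linalg.lstsq(A_tr, y_tr, rcond=None)[0]

mse_ls_train = mse(A_tr @ w_ls, y_tr)
mse_ls_test = mse(np.column_stack([x_te, z_te]) @ w_ls, y_te)
mse_causal_train = mse(x_tr, y_tr)              # causal predictor: Y = X
mse_causal_test = mse(x_te, y_te)

print("least-squares weights:", w_ls)
print("least squares  train/test MSE:", mse_ls_train, mse_ls_test)
print("causal         train/test MSE:", mse_causal_train, mse_causal_test)
```

The least-squares weights land close to (1/2, 1/2). That predictor beats the causal one on the training process, but its error grows once the noise on Z changes, while the error of the causal predictor stays the same.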
20. Getting around the big lie in machine learning
Horses absorb all training correlations recklessly, incl. confounders and spurious patterns
∼
If $P_{\mathrm{train}} \neq P_{\mathrm{test}}$, what correlations should we learn and what correlations should we ignore?
21. Reichenbach’s Principle of Common Cause
Correlations between X and Y arise due to one of the three causal structures
X → Y, X ← Y, or X ← Z → Y
What happens to Y when someone manipulates X? Why is Y = 2?
(Reichenbach, 1956) formalizes the claim “dependence does not imply causation”
∼
We are interested in causal correlations (from features to target)
Predicting open umbrellas from rain is more stable than predicting rain from open umbrellas
22. Focus on causal correlations for invariance?
(Woodward, 2005)
23. Focus on causal correlations for truth?
(Pearl, 2018)
The causal explanation predicts the outcome of real experiments in the world
∼
We will now explore two ways to discover causality in data using data alone
27. What does causation look like?
[Figure: scatter plots of V against U and of U against V; both axes range from −1 to 1]
Effect = f(Cause) + Noise
Cause independent from Noise
(Peters et al., 2014)
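The additive-noise idea on this slide can be sketched as follows: regress each variable on the other and prefer the direction whose residual looks independent of the input. Everything specific below is an assumption made for illustration: the polynomial regressor, the biased HSIC estimate with a Gaussian kernel and median-heuristic bandwidth, and the synthetic data. It is not the exact procedure of Peters et al.

```python
import numpy as np

def gram(v):
    """Gaussian kernel matrix; bandwidth set by the median heuristic."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def hsic(a, b):
    """Biased HSIC estimate; close to zero when a and b are independent."""
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return float(np.trace(gram(a) @ H @ gram(b) @ H) / n**2)

def anm_score(cause, effect, degree=5):
    """Dependence between a candidate cause and the regression residual."""
    coeffs = np.polyfit(cause, effect, degree)   # hypothetical choice of regressor
    residual = effect - np.polyval(coeffs, cause)
    return hsic(cause, residual)

def standardize(v):
    return (v - v.mean()) / v.std()

# Synthetic pair with known direction: Effect = f(Cause) + Noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, 500)
y = x**3 + rng.normal(0.0, 1.0, 500)
x_s, y_s = standardize(x), standardize(y)

score_xy = anm_score(x_s, y_s)   # residual of y given x versus x
score_yx = anm_score(y_s, x_s)   # residual of x given y versus y
direction = "x -> y" if score_xy < score_yx else "y -> x"
print(score_xy, score_yx, direction)
```

In the causal direction the residual is essentially the noise term, so its dependence on the input is small. In the anti-causal direction the residual spread changes with the input, which the dependence score picks up.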
28. What does causation look like?
[Figure: Y plotted against X (X from 0 to 1, Y from −3 to 3), together with the marginal densities P(X) and P(Y)]
Effect = f(Cause)
p(Cause) independent from f′
(Daniusis et al., 2010)
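For the deterministic case, a slope-based score in the spirit of this slide can be sketched as below. The assumptions are a uniform reference measure (both variables rescaled to [0, 1]) and a toy pair y = x³. The function name `igci_slope` is made up for this example, and the estimator is a simplified reading of the idea rather than a reference implementation.

```python
import numpy as np

def igci_slope(a, b):
    """Average log-slope of the map a -> b after rescaling both to [0, 1].
    A lower value in one direction is taken as evidence for that direction."""
    a = (a - a.min()) / (a.max() - a.min())
    b = (b - b.min()) / (b.max() - b.min())
    order = np.argsort(a)
    da, db = np.diff(a[order]), np.diff(b[order])
    keep = (da != 0) & (db != 0)                 # skip ties: log-slope undefined
    return float(np.mean(np.log(np.abs(db[keep] / da[keep]))))

# Toy pair: the cause is uniform and the effect is a deterministic function of it.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 1000)
y = x**3

c_xy = igci_slope(x, y)
c_yx = igci_slope(y, x)
direction = "x -> y" if c_xy < c_yx else "y -> x"
print(c_xy, c_yx, direction)
```

When p(Cause) carries no information about f′, the average log-slope is negative in the causal direction and positive in the reverse one, so comparing the two scores gives a direction.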
29. What does causation look like?
[Figure: grid of 80 cause-effect scatter plots, each labeled x → y or x ← y]
(Mooij et al., 2014)
30. NCC: learning causation footprints
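[Figure: NCC architecture. Each point of the sample $\{(x_{ij}, y_{ij})\}_{j=1}^{m_i}$ is featurized separately by the embedding layers; the embeddings are averaged, $\frac{1}{m_i} \sum_{j=1}^{m_i} (\cdot)$; the classifier layers output $\hat{P}(X_i \to Y_i)$.]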
(Lopez-Paz et al., 2017)
Trained using synthetic data!
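The data flow of this architecture can be sketched as a forward pass. The layer sizes, the ReLU activations, and the random (untrained) weights below are all assumptions for illustration; only the structure (per-point embedding, averaging, classifier) comes from the slide.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def ncc_forward(sample, params):
    """sample: array of shape (m_i, 2) holding the points (x_ij, y_ij).
    Returns a number in (0, 1), read as an estimate of P(X_i -> Y_i)."""
    W1, b1, W2, b2, w_out, b_out = params
    h = relu(sample @ W1 + b1)       # embedding layers: each point featurized separately
    h = h.mean(axis=0)               # average over the m_i points
    h = relu(h @ W2 + b2)            # classifier layers
    logit = h @ w_out + b_out
    return float(1.0 / (1.0 + np.exp(-logit)))

rng = np.random.default_rng(0)
hidden = 16                          # hypothetical width
params = (
    rng.normal(0.0, 0.5, (2, hidden)), np.zeros(hidden),
    rng.normal(0.0, 0.5, (hidden, hidden)), np.zeros(hidden),
    rng.normal(0.0, 0.5, hidden), 0.0,
)

x = rng.normal(size=200)
y = np.tanh(x) + 0.1 * rng.normal(size=200)
sample = np.column_stack([x, y])

p = ncc_forward(sample, params)                              # untrained, so not meaningful
p_shuffled = ncc_forward(sample[rng.permutation(200)], params)
print(p, p_shuffled)
```

Because the points are averaged after the embedding, the output does not depend on the order of the points and the same network accepts samples of any size m_i.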
31. NCC is the state-of-the-art
[Figure: classification accuracy (0 to 100) against decision rate (0 to 100); legend: RCC, ANM, IGCI]
33. NCC discovers causation in images
Features inside bounding boxes are caused by the presence of objects (wheel)
Features outside bounding boxes cause the presence of objects (road)
[Figure: object-feature ratio]
(Lopez-Paz et al., 2017)
34. NCC discovers causation in language
Between the word2vec vectors of related concepts, such as “smoking → cancer”
[Figure: test accuracy (0.4 to 0.9) of baseline, distribution-based, and feature-based methods]
(Rojas-Carulla et al., 2017)
35. New hopes for unsupervised learning?
There are unexpected causal signals in unsupervised data!
These allow us to gain causal intuitions from data, reducing the need for experimentation
What metrics/divergences best extract these causal signals, while discarding the rest?
We want simple models for a complex world (IKEA instructions)
− Against the usual hope of consistency (P = Q as n → ∞)
36. First results
Cause-effect discovery ≈ choosing the simplest model (Stegle et al., 2010) using a divergence
− GAN divergences distinguish between cause and effect (Lopez-Paz and Oquab, 2016)
− Discriminator((Cause, Generator(Cause, Noise)), (Cause, Effect))
is harder than
Discriminator((Generator(Effect, Noise), Effect), (Cause, Effect))
− These ideas extend to multiple variables (Goudet et al., 2017; Kalainathan et al., 2018)
− Each divergence has important geometry implications (Bottou et al., 2018)
− Hyperbolic divergences recover complex causal hierarchies (Klimovskaia et al., 2018)
[Figure: points p1, ..., p5 embedded from Euclidean space into the Poincaré ball while preserving pairwise distances; panels a, b, c]
39. Moving beyond the big lie
$P_{\mathrm{train}}(X, Y) \neq P_{\mathrm{test}}(X, Y)$
Then, what remains invariant between train and test data?
∼
We assume that Ptrain and Ptest produce data about the same phenomena under different
experimental conditions, circumstances, or environments
∼
To succeed at the test environment, we observe multiple training environments and
− learn what is invariant across environments
− discard what is specific to each environment
∼
There is a causal justification for proceeding this way!
40. Functional causal models
A common tool to describe causal structures is the Functional Causal Model (FCM)
[Figure: causal graph with edges X1 → X2, X3 → X2, X1 → X3, X1 → X4, X2 → Y, X3 → Y]
X1 ← f1(N1)
X2 ← f2(X1, X3, N2)
X3 ← f3(X1, N3) // X1 causes X3
X4 ← f4(X1, N4)
Y ← fy(X2, X3, Ny)
Ni ∼ P(N)
FCMs are compositional and allow counterfactual reasoning
FCMs are generative: sampling from their equations produces the observational distribution P(X, Y)
We can also intervene on the FCM equations to produce interventional distributions $\tilde{P}(X, Y)$!
∼
Each intervention produces one environment (distribution) of the phenomenon (FCM) of interest!
41. Functional causal models
One FCM = multiple interventions/distributions/environments
$P^1_{\mathrm{train}}(X, Y) \sim$
[Figure: causal graph over X1, X2, X3, X4, and Y]
X1 = f1(N1)
X2 = f2(X1, X3, N2)
X3 = 1.5
X4 = f4(X1, N4)
Y = fy(X2, X3, Ny)
Ni ∼ P(N)
42. Functional causal models
One FCM = multiple interventions/distributions/environments
$P^2_{\mathrm{train}}(X, Y) \sim$
[Figure: causal graph over X1, X2, X3, X4, and Y]
X1 ∼ N(0, 1)
X2 = f2(X1, X3, N2)
X3 = f3(X1, N3)
X4 = f4(X1, N4)
Y = fy(X2, X3, Ny)
Ni ∼ P(N)
43. Functional causal models
One FCM = multiple interventions/distributions/environments
$P^3_{\mathrm{train}}(X, Y) \sim$
[Figure: causal graph over X1, X2, X3, X4, and Y]
X1 = f1(N1)
X2 = f2(X1, X3, N2) + U(−10, 10)
X3 = f3(X1, N3)
X4 = f4(X1, N4)
Y = fy(X2, X3, Ny)
Ni ∼ P(N)
44. Functional causal models
[Figure: causal graph over X1, X2, X3, X4, and Y]
X1 = f1(N1)
X2 = f2(X1, X3, N2)
X3 = f3(X1, N3)
X4 = f4(X1, N4)
Y = fy(X2, X3, Ny)
Ni ∼ P(N)
If mechanisms are autonomous, and
no intervention disturbs the conditional expectation of the target causal equation:
− the causal conditional expectation E(Y | X2, X3) remains invariant
− the non-causal conditional expectation E(Y | X) may vary wildly!
This reveals the link between invariances across environments and causal structures
∼
How can we find invariant causal predictors?
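The link between interventions and invariance can be simulated on the five-variable FCM from the previous slides. The slides leave the mechanisms f1, ..., fy unspecified, so the linear mechanisms and coefficients below are hypothetical, as are the keyword names of `sample_fcm`. The three interventions mirror the three training environments shown earlier, and the non-causal predictor is taken to be X4 alone.

```python
import numpy as np

def sample_fcm(n, rng, do_x3=None, x1_standard_normal=False, shift_x2=False):
    """Sample from a linear instance of the FCM, optionally intervened.
    All coefficients are made up for illustration."""
    N = rng.normal(0.0, 1.0, (5, n))                    # N_i ~ P(N), here standard normal
    x1 = rng.normal(0.0, 1.0, n) if x1_standard_normal else 1.0 + 2.0 * N[0]
    x3 = np.full(n, do_x3) if do_x3 is not None else 0.8 * x1 + N[2]
    x2 = x1 - 0.5 * x3 + N[1]
    if shift_x2:
        x2 = x2 + rng.uniform(-10.0, 10.0, n)
    x4 = -x1 + N[3]
    y = 2.0 * x2 + x3 + N[4]                            # target equation, never intervened on
    return x1, x2, x3, x4, y

def lstsq_coef(features, y):
    """Least-squares coefficients of y on the given feature columns (no intercept)."""
    return np.linalg.lstsq(np.column_stack(features), y, rcond=None)[0]

rng = np.random.default_rng(0)
envs = {
    "observational": {},
    "X3 = 1.5": {"do_x3": 1.5},
    "X1 ~ N(0, 1)": {"x1_standard_normal": True},
    "X2 + U(-10, 10)": {"shift_x2": True},
}

causal, noncausal = {}, {}
for name, kwargs in envs.items():
    x1, x2, x3, x4, y = sample_fcm(50_000, rng, **kwargs)
    causal[name] = lstsq_coef([x2, x3], y)      # regress Y on its parents
    noncausal[name] = lstsq_coef([x4], y)       # regress Y on a non-parent
    print(name, causal[name], noncausal[name])
```

Under these assumptions the coefficients on the parents (X2, X3) stay near (2, 1) in every environment, while the coefficient on X4 changes when the distribution of X1 is intervened on.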
45. A simple example: X → Y → Z
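For all environments $e \in \mathbb{R}$:
$$X^e \leftarrow \mathcal{N}(0, e), \qquad Y^e \leftarrow X^e + \mathcal{N}(0, e), \qquad Z^e \leftarrow Y^e + \mathcal{N}(0, 1).$$
The task is to predict $Y^e$ given $(X^e, Z^e)$ for unknown test $e$. We have three options:
$$E[Y^e \mid X^e = x] = x,$$
$$E[Y^e \mid Z^e = z] = \frac{2e}{2e + 1} z,$$
$$E[Y^e \mid X^e = x, Z^e = z] = \frac{1}{e + 1} x + \frac{e}{e + 1} z$$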
The causal predictor based on x is invariant!
The state-of-the-art (Ganin et al., 2016; Peters et al., 2016) fails at this simple example
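The three regressions above can be checked by simulation. This is a minimal sketch that treats e as a variance and uses three arbitrary environment values; the helper names are made up for the example.

```python
import numpy as np

def sample_env(e, n, rng):
    """Draw (X^e, Y^e, Z^e); e is treated as a variance, so it must be positive."""
    x = rng.normal(0.0, np.sqrt(e), n)
    y = x + rng.normal(0.0, np.sqrt(e), n)
    z = y + rng.normal(0.0, 1.0, n)
    return x, y, z

def lstsq_coef(features, y):
    """Least-squares coefficients of y on the given feature columns (no intercept)."""
    return np.linalg.lstsq(np.column_stack(features), y, rcond=None)[0]

rng = np.random.default_rng(0)
results = {}
for e in (0.5, 1.0, 4.0):                        # arbitrary example environments
    x, y, z = sample_env(e, 200_000, rng)
    results[e] = {
        "x": lstsq_coef([x], y),                 # expected: 1
        "z": lstsq_coef([z], y),                 # expected: 2e / (2e + 1)
        "xz": lstsq_coef([x, z], y),             # expected: 1 / (e + 1), e / (e + 1)
    }
    print(e, results[e])
```

Only the regression on x yields the same coefficient in every environment. The other two depend on e, so a predictor fitted in one environment is miscalibrated in another.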
46. Our proposal
Find a feature representation that leads to the same optimal classifier across environments.
∼
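Let $w^e_\phi$ be the optimal classifier for environment $e$, when using the featurizer $\phi$:
$$w^e_\phi = \arg\min_w R_{P^e}(w \circ \phi),$$
where $R_{P^e}(f) = E_{(x,y) \sim P^e}\left[\mathrm{Error}(f(x), y)\right]$. Measure classifier discrepancy:
$$\| w^e_\phi - w^{e'}_\phi \|_P = \int \left( w^e_\phi(\phi(x)) - w^{e'}_\phi(\phi(x)) \right)^2 dP(X)$$
Let $\bar{w} = \frac{1}{e} \sum_e w^e_\phi$. Then, our new learning objective is:
$$\arg\min_\phi \sum_e R_{P^e}(\bar{w} \circ \phi) + \lambda \sum_{e, e' \neq e} \| w^e_\phi - w^{e'}_\phi \|_{P^e}$$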
(Arjovsky et al., 2018)
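To make the objective concrete, the sketch below evaluates its two terms on the X → Y → Z example for three candidate featurizers. Several simplifications are assumptions of this illustration, not part of the proposal: the featurizer only selects columns of (X, Z), the classifiers are linear and fitted by least squares, the error is the squared error, and integrals are replaced by sample averages.

```python
import numpy as np

def sample_env(e, n, rng):
    """X -> Y -> Z example; e is treated as a variance."""
    x = rng.normal(0.0, np.sqrt(e), n)
    y = x + rng.normal(0.0, np.sqrt(e), n)
    z = y + rng.normal(0.0, 1.0, n)
    return x, y, z

def objective_terms(columns, envs):
    """Return (pooled risk, discrepancy penalty) for the featurizer that keeps
    the given columns of (X, Z). Hypothetical simplification: phi is a column selector."""
    feats = [np.column_stack([x, z])[:, columns] for x, _, z in envs]
    ys = [y for _, y, _ in envs]
    # Optimal linear classifier per environment for this featurizer.
    ws = [np.linalg.lstsq(f, y, rcond=None)[0] for f, y in zip(feats, ys)]
    w_bar = np.mean(ws, axis=0)
    risk = sum(np.mean((f @ w_bar - y) ** 2) for f, y in zip(feats, ys))
    k = len(envs)
    penalty = sum(
        np.mean((feats[i] @ ws[i] - feats[i] @ ws[j]) ** 2)   # discrepancy measured under P^e
        for i in range(k) for j in range(k) if i != j
    )
    return float(risk), float(penalty)

rng = np.random.default_rng(0)
envs = [sample_env(e, 100_000, rng) for e in (0.5, 1.0, 4.0)]

terms = {
    "x only": objective_terms([0], envs),
    "z only": objective_terms([1], envs),
    "x and z": objective_terms([0, 1], envs),
}
for name, (risk, penalty) in terms.items():
    print(name, "risk:", risk, "penalty:", penalty)
```

In this toy setting the discrepancy penalty is close to zero only for the featurizer that keeps x, since that is the only one whose optimal classifier is shared by all environments. How strongly the penalty outweighs the pooled risk depends on the choice of λ.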
47. An approximation to our proposal
$$C(\phi) = \sum_e R_{P^e}(\bar{w} \circ \phi) + \lambda \sum_{e, e' \neq e} \| w^e_\phi - w^{e'}_\phi \|_{P^e}$$
is an intractable bi-level optimization problem, since $w^e_\phi$ is itself the solution of an optimization problem
We approximate the interactions between the optimization problems using unrolled gradients
∼
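1. Initialize at random $\phi$ and $w^e_\phi$, for all $e$
1.1 Update $w^e_\phi \leftarrow \mathrm{Gradient}(R_{P^e}, w^e_\phi)$ using one step and fixed $\phi$, for all $e$
1.2 Update $m^e_\phi \leftarrow \mathrm{Gradient}(R_{P^e}, w^e_\phi)$ using $k$ steps and fixed $\phi$, for all $e$
1.3 Update $\phi \leftarrow \mathrm{Gradient}(C, m^e_\phi)$ using one step and fixed $m^e_\phi$
2. Return $\left( \frac{1}{e} \sum_e w^e_\phi \right) \circ \phi$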
(Arjovsky et al., 2018)
48. First results
Empirical risk minimization:
Causal risk minimization:
∼
Implications to fairness? Partitions of one dataset? Theory?
49. Multiple environments in the big picture
setup                        training                        test
generative learning          $U^1_1$                         $\emptyset$
unsupervised learning        $U^1_1$                         $U^1_2$
supervised learning          $L^1_1$                         $U^1_1$
semi-supervised learning     $L^1_1, U^1_1$                  $U^1_2$
transductive learning        $L^1_1, U^1_1$                  $U^1_1$
multitask learning           $L^1_1, L^2_1$                  $U^1_2, U^2_2$
domain adaptation            $L^1_1, U^2_1$                  $U^2_2$
transfer learning            $U^1_1, L^2_1$                  $U^2_1$
continual learning           $L^1_1, \ldots, L^\infty_1$     $U^1_1, \ldots, U^\infty_1$
multi-environment learning   $L^1_1, L^2_1$                  $U^3_1, U^4_1$
− $L^i_j$: labeled dataset number $j$ drawn from distribution $i$
− $U^i_j$: unlabeled dataset number $j$ drawn from distribution $i$
50. Second conclusion
Prediction rules based on stable correlations across environments are likely to be causal (I call this the principle of causal concentration).
52. Finally: from machine learning to artificial intelligence
AIs will be world simulators that will
− align with the causal outcomes in the world,
− perform robustly across diverse environments,
− interrogate composable autonomous mechanisms to extrapolate,
− allow imagining multiple futures given uncertainty about a situation,
− enable counterfactual reasoning for extreme generalization
These causal desiderata are out of reach for current machine learning systems. Let’s get to it!
∼
Thanks!
53. References I
Martin Arjovsky, Léon Bottou, and David Lopez-Paz. Learning invariant causal rules across environments. In preparation, 2018.
Léon Bottou, Martin Arjovsky, David Lopez-Paz, and Maxime Oquab. Geometrical insights for implicit generative modeling. In Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. Springer, 2018.
Povilas Daniusis, Dominik Janzing, Joris Mooij, Jakob Zscheischler, Bastian Steudel, Kun Zhang, and Bernhard Schölkopf. Inferring deterministic causal relations. In UAI, 2010.
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
O. Goudet, D. Kalainathan, P. Caillou, I. Guyon, D. Lopez-Paz, and M. Sebag. Causal Generative Neural Networks. arXiv, 2017.
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv, 2017.
Allan Jabri, Armand Joulin, and Laurens van der Maaten. Revisiting visual question answering baselines. In ECCV, 2016.
D. Kalainathan, O. Goudet, I. Guyon, D. Lopez-Paz, and M. Sebag. SAM: Structural Agnostic Model, Causal Discovery and Penalized Adversarial Learning. arXiv, 2018.
Anna Klimovskaia, Léon Bottou, David Lopez-Paz, and Maximilian Nickel. Poincaré maps recover continuous hierarchies in single-cell data. In preparation, 2018.
David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. ICLR, 2016.
David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Schölkopf, and Léon Bottou. Discovering causal signals in images. CVPR, 2017.
Franz H. Messerli. Chocolate consumption, cognitive function, and Nobel laureates. New England Journal of Medicine, 2012.
Joris M. Mooij, Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. JMLR, 2014.
Judea Pearl. Theoretical impediments to machine learning with seven sparks from the causal revolution. arXiv, 2018.
Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. JMLR, 2014.
Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, 2016.
Hans Reichenbach. The direction of time. Dover, 1956.
Mateo Rojas-Carulla, Marco Baroni, and David Lopez-Paz. Causal discovery using proxy variables. In preparation, 2017.
A. Rosenfeld, R. Zemel, and J. K. Tsotsos. The Elephant in the Room. arXiv, 2018.
54. References II
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
Oliver Stegle, Dominik Janzing, Kun Zhang, Joris M. Mooij, and Bernhard Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In NIPS, 2010.
Pierre Stock and Moustapha Cisse. Convnets and ImageNet beyond accuracy: Explanations, bias detection, adversarial examples and model criticism. arXiv, 2017.
B. L. Sturm. A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 2014.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ICLR, 2013.
James Woodward. Making things happen: A theory of causal explanation. Oxford University Press, 2005.