# 4 Classification & Neural Nets
Arthur Charpentier (Université du Québec à Montréal)
Machine Learning & Econometrics
SIDE Summer School - July 2019
Modeling a 0/1 random variable
Myocardial infarction of patients admitted in E.R.
myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",
                      header = TRUE, sep = ";")
The dataset is {(x_i, y_i)}, i = 1, . . . , n, with n = 73, and the variables are
◦ FRCAR : heart rate x1
◦ INCAR : heart index x2
◦ INSYS : stroke index x3
◦ PRDIA : diastolic pressure x4
◦ PAPUL : pulmonary arterial pressure x5
◦ PVENT : ventricular pressure x6
◦ REPUL : lung resistance x7
◦ PRONO : death or survival y
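As a quick point of comparison, a minimal sketch (assuming PRONO is a two-level factor coding death/survival) of a plain logistic regression on all covariates:

```r
# Baseline: logistic regression of the outcome on all seven covariates
# (assumption: PRONO is a binary factor; glm models its second level)
reg <- glm(PRONO ~ ., data = myocarde, family = binomial)
summary(reg)
```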
Classification : (Linear) Fisher’s Discrimination
Consider the following naive classification rule

m⋆(x) = argmax_y { P[Y = y | X = x] }

or, using Bayes' formula,

m⋆(x) = argmax_y { P[X = x | Y = y] · P[Y = y] / P[X = x] }

(where P[X = x] is the density in the continuous case). In the case where y takes two values, the standard {0, 1} here, one can rewrite the latter as

m⋆(x) = 1 if E(Y | X = x) > 1/2, and m⋆(x) = 0 otherwise.
The set

DS = { x : E(Y | X = x) = 1/2 }

is called the decision boundary.
Assume that X | Y = 0 ∼ N(µ_0, Σ_0) and X | Y = 1 ∼ N(µ_1, Σ_1); then explicit expressions can be derived:

m⋆(x) = 1 if r_1² < r_0² + 2 log [ P(Y = 1) / P(Y = 0) ] + log [ |Σ_0| / |Σ_1| ], and m⋆(x) = 0 otherwise,

where r_y² is the Mahalanobis distance,

r_y² = [x − µ_y]ᵀ Σ_y⁻¹ [x − µ_y]
Let δ_y(x) be defined as

δ_y(x) = − (1/2) log |Σ_y| − (1/2) [x − µ_y]ᵀ Σ_y⁻¹ [x − µ_y] + log P(Y = y)

The decision boundary of this classifier is

{ x such that δ_0(x) = δ_1(x) }

which is quadratic in x. This is quadratic discriminant analysis.
In Fisher (1936, The use of multiple measurements in taxonomic problems), it was assumed that Σ_0 = Σ_1 = Σ. In that case,

δ_y(x) = xᵀ Σ⁻¹ µ_y − (1/2) µ_yᵀ Σ⁻¹ µ_y + log P(Y = y)

and the decision boundary is now linear in x.
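Both classifiers are available in R through MASS; a minimal sketch on the myocarde data (lda assumes a common covariance matrix Σ, qda does not):

```r
# Linear and quadratic discriminant analysis on the myocarde data
library(MASS)
fit_lda <- lda(PRONO ~ ., data = myocarde)
fit_qda <- qda(PRONO ~ ., data = myocarde)
table(predict(fit_lda)$class, myocarde$PRONO)   # in-sample confusion matrix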
Neural Nets & Deep Learning in a Nutshell
Very popular for pictures...
A picture x_i is
• an n × n matrix in {0, 1}^{n²} for black & white
• an n × n matrix in [0, 1]^{n²} for grey-scale
• a 3 × n × n array in ([0, 1]³)^{n²} for color
• a T × 3 × n × n tensor in (([0, 1]³)^{n²})^T for video
and y here is the label (“8”, “9”, “6”, etc.)
Suppose we want to recognize a “6” on a picture,

m(x) = +1 if x is a “6”, and m(x) = −1 otherwise.
Consider some specific pixels, and associate weights ω such that

m(x) = sign( ∑_{i,j} ω_{i,j} x_{i,j} )

where x_{i,j} = +1 if pixel (i, j) is black and x_{i,j} = −1 if pixel (i, j) is white, for some weights ω_{i,j} (that can be negative...)
A deep network is a network with a lot of layers,

m(x) = sign( ∑_i ω_i m_i(x) )

where the m_i's are outputs of previous neural nets. Those layers can capture shapes in some areas, nonlinearities, cross-dependence, etc.
Classification : Neural Nets
A neuron is simply a cell that receives input signals and fires an output. The human brain has ∼ 100 billion neurons, and is good at vision, speech, driving, playing games, etc.
The most interesting part is not ‘neural’ but ‘network’.
Networks have nodes, and edges (possibly directed) that connect nodes; or, to be more specific, some sort of flow network.
In such a network, we usually have multiple sources (here {s1, s2, s3}) on the left, and a sink (here {t}) on the right. To continue with this metaphorical introduction, information from the sources should reach the sink. Usually, sources are explanatory variables, {x1, · · · , xp}, and the sink is our variable of interest y. The output here is a binary variable y ∈ {0, 1}.
Classification : the Binary Threshold Neuron & the Perceptron
If x ∈ {0, 1}^p, McCulloch & Pitts (1943, A logical calculus of the ideas immanent in nervous activity) suggested a simple model, with threshold b,

y_i = f( ∑_{j=1}^p x_{j,i} ) where f(x) = 1(x ≥ b)

where 1(x ≥ b) = +1 if x ≥ b and 0 otherwise, or (equivalently)

y_i = f( ω + ∑_{j=1}^p x_{j,i} ) where f(x) = 1(x ≥ 0)

with weight ω = −b. The trick of adding 1 as an input was very important!
ω = −1 gives the or logical operator: y_i = 1 if ∃j such that x_{j,i} = 1
ω = −p gives the and logical operator: y_i = 1 if ∀j, x_{j,i} = 1
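A minimal sketch of this threshold unit in R, with the two weights above:

```r
# McCulloch & Pitts binary threshold unit: f(omega + sum(x)) with f = 1(. >= 0)
mcp <- function(x, omega) as.numeric(omega + sum(x) >= 0)
x <- c(1, 0, 1)
mcp(x, omega = -1)           # "or"  : 1, since at least one x_j equals 1
mcp(x, omega = -length(x))   # "and" : 0, since not all x_j equal 1
```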
but it is not possible to obtain the xor logical operator: y_i = 1 if x_{1,i} ≠ x_{2,i}.

Rosenblatt (1961, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms) considered the extension where the x's are real-valued, with weights ω ∈ R^p,

y_i = f( ∑_{j=1}^p ω_j x_{j,i} ) where f(x) = 1(x ≥ b)

where 1(x ≥ b) = +1 if x ≥ b and 0 otherwise, or (equivalently)

y_i = f( ω_0 + ∑_{j=1}^p ω_j x_{j,i} ) where f(x) = 1(x ≥ 0)

with weights ω ∈ R^{p+1}.
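A small sketch of a perceptron-style learning rule (a hypothetical helper, not the historical algorithm verbatim): cycle over the observations and update the weights only on misclassified points.

```r
# Toy perceptron with learning rate eta; X is an n x p matrix, y a 0/1 vector
perceptron <- function(X, y, eta = 0.1, n_iter = 100) {
  X1 <- cbind(1, X)              # add the constant input (the "trick" above)
  w  <- rep(0, ncol(X1))         # weights omega_0, omega_1, ..., omega_p
  for (it in 1:n_iter) {
    for (i in 1:nrow(X1)) {
      yhat <- as.numeric(sum(w * X1[i, ]) >= 0)
      w <- w + eta * (y[i] - yhat) * X1[i, ]   # update only if misclassified
    }
  }
  w
}
```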
Minsky & Papert (1969, Perceptrons: an Introduction to Computational Geometry) proved that the perceptron is a linear separator, and therefore not very powerful.

Define the sigmoid function

f(x) = 1 / (1 + e^{−x}) = e^x / (e^x + 1)

(= logistic function). This function f is called the activation function.

If y ∈ {−1, +1}, one can consider the hyperbolic tangent

f(x) = (e^x − e^{−x}) / (e^x + e^{−x})

or the inverse tangent function (see wikipedia).
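These activation functions are easy to define and compare in R (the sigmoid defined here is reused in the neuralnet call further below):

```r
# Common activation functions
sigmoid <- function(x) 1 / (1 + exp(-x))              # logistic, values in (0, 1)
curve(sigmoid, from = -5, to = 5, ylim = c(-1.6, 1.6))
curve(tanh, from = -5, to = 5, add = TRUE, lty = 2)   # hyperbolic tangent, in (-1, 1)
curve(atan, from = -5, to = 5, add = TRUE, lty = 3)   # inverse tangent
```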
Classification : Neural Nets
So here, for a classification problem,

y_i = f( ω_0 + ∑_{j=1}^p ω_j x_{j,i} ) = p(x_i)

The error of that model can be computed using the quadratic loss function,

∑_{i=1}^n ( y_i − p(x_i) )²

or the cross-entropy,

− ∑_{i=1}^n ( y_i log p(x_i) + [1 − y_i] log[1 − p(x_i)] )
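Both quantities are easy to compute for the logistic regression fitted earlier (a sketch; reg is the glm object from the first slide, and reg$y its stored 0/1 response):

```r
# Quadratic loss and cross-entropy of the fitted logistic regression
p_hat <- predict(reg, type = "response")
y     <- reg$y                                     # 0/1 response stored by glm()
sum((y - p_hat)^2)                                 # quadratic loss
-sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))    # cross-entropy
```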
If

p(x) = f( ω_0 + ∑_{h=1}^3 ω_h f_h( ω_{h,0} + xᵀω_h ) )

we want to solve, with back propagation,

ω⋆ = argmin_ω { (1/n) ∑_{i=1}^n ℓ( y_i, p(x_i) ) }
To visualize a neural network, use

library(devtools)
source_url('https://freakonometrics.free.fr/nnet_plot.R')
Of course, one can consider a multiple layer network,

p(x) = f( ω_0 + ∑_{h=1}^3 ω_h f_h( ω_{h,0} + z_hᵀω_h ) )

where

z_h = f( ω_{h,0} + ∑_{j=1}^{k_h} ω_{h,j} f_{h,j}( ω_{h,j,0} + xᵀω_{h,j} ) )
library(neuralnet)
model_nnet = neuralnet(formula(glm(PRONO ~ ., data = myocarde)),
                       myocarde, hidden = 3, act.fct = sigmoid)
plot(model_nnet)
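Predictions can then be obtained for (new) covariates; a sketch on the training data, using neuralnet's compute function:

```r
# Predicted probabilities from the fitted network
pred <- compute(model_nnet, myocarde[, names(myocarde) != "PRONO"])
head(pred$net.result)
```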
Consider a single hidden layer, with 3 different neurons, so that

p(x) = f( ω_0 + ∑_{h=1}^3 ω_h f_h( ω_{h,0} + ∑_{j=1}^p ω_{h,j} x_j ) )

or

p(x) = f( ω_0 + ∑_{h=1}^3 ω_h f_h( ω_{h,0} + xᵀω_h ) )
Optimization Issues
While classical econometrics is based on convex optimization problems (most of the time), (deep) neural networks usually lead to non-convex optimization problems.
Consider the optimization problem

min_{x∈X} f(x)

solved by gradient descent,

x_{n+1} = x_n − γ_n ∇f(x_n), n ≥ 0

with starting point x_0. The step size γ_n is called the learning rate.

Sometimes the dataset is so large that we cannot pass all the data at once to compute ∇f(x): we need to divide the data into smaller pieces and feed them to the computer one by one.
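A minimal gradient-descent sketch, on a toy function f(x) = (x − 2)², with a constant learning rate:

```r
# x_{n+1} = x_n - gamma * grad f(x_n)
grad_descent <- function(grad, x0, gamma = 0.1, n_iter = 100) {
  x <- x0
  for (n in 1:n_iter) x <- x - gamma * grad(x)
  x
}
grad_descent(function(x) 2 * (x - 2), x0 = 10)   # converges towards 2
```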
Batches
The batch size is a hyperparameter that defines the number of sub-samples we use to compute quantities (e.g. the gradient).
Epoch
One epoch is when the dataset is passed forward and backward through the neural network (only once).
If we divide a dataset of 2,000 examples into batches of 400, it will take 5 iterations to complete 1 epoch. The number of epochs is the hyperparameter that defines the number of times the learning algorithm will work through the entire training dataset.
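The split of the numerical example above can be sketched directly:

```r
# 2,000 observations, batches of size 400: 5 iterations per epoch
n <- 2000; batch_size <- 400
batches <- split(sample(n), ceiling(seq_len(n) / batch_size))
length(batches)   # 5
```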
Classification : Image Recognition with Neural Networks
Classical {0, 1, · · · , 9} handwritten digit recognition
see tensorflow (Google) and keras.
Classification : Image Recognition with Logistic Regression?
See https://www.tensorflow.org/install
library(tensorflow)
install_tensorflow(method = "conda")
library(keras)
install_keras(method = "conda")
mnist <- dataset_mnist()
idx123 = which(mnist$train$y %in% c(1, 2, 3))
V <- mnist$train$x[idx123[1:800], , ]
MV = NULL
for(i in 1:800) MV = cbind(MV, as.vector(V[i, , ]))
MV = t(MV)
df = data.frame(y = mnist$train$y[idx123][1:800], x = MV)
reg = glm((y == 1) ~ ., data = df, family = binomial)
Here x_i ∈ R^784 (for a small greyscale picture).
Using Principal Components to Reduce Dimension
Instead of using X as regressors, use X̃ = XA with fewer covariates. Note that y = β_0 + X̃β + ε becomes y = β_0 + Xβ̃ + ε where β̃ = Aβ.
We can use either principal components or PLS (partial least squares), as introduced in Wold (1966, Estimation of principal components and related models by iterative least squares).
For both, we need to normalize the data (to have mean 0 and variance 1).
With PCA, the first component z_1 = Xω_1 is obtained as the solution of

ω_1 = argmax_{‖ω‖=1} { Var[Xω] } = argmax_{‖ω‖=1} { ⟨Xω, Xω⟩ }

With PLS, the first component z_1 = Xω_1 is obtained as the solution of

ω_1 = argmax_{‖ω‖=1} { Cov[y, Xω] } = argmax_{‖ω‖=1} { ⟨y, Xω⟩ }
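The two first components can be compared on the myocarde data (a sketch; the 0/1 coding of PRONO is an assumption):

```r
# First principal component vs first PLS component on the scaled covariates
X <- scale(myocarde[, 1:7])
y <- as.numeric(factor(myocarde$PRONO)) - 1   # assumed 0/1 coding of the outcome
z1_pca <- prcomp(X)$x[, 1]                    # ignores y
library(pls)
z1_pls <- plsr(y ~ X, ncomp = 1)$scores[, 1]  # uses the covariance with y
cor(z1_pca, z1_pls)
```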
Classification : Image Recognition and Convolutional Neural Network
Classical strategy: reduce dimension via layers z (e.g. PCA, principal component analysis), and then consider a classifier z → y.
CNN : Convolutional Neural Network
Unlike a fully connected network, where each neuron in one layer is connected to all neurons in the next layer, a convolutional neural network connects each neuron only to a small region of the previous layer, with shared weights (the convolution filters).
Non-supervised : PCA
Projection method, used to reduce dimensionality.
PCA : Principal Component Analysis
Find a small number of directions in input space that explain variation in the input data, and represent the data by projecting along those directions.
Consider (joint) n observations of d variables, stored in a matrix X ∈ M_{n×d}. We want to project (multiply by a projection matrix) into a (much) lower dimensional space, k < d: search for orthogonal directions in space with highest variance.
Information is stored in the covariance matrix Σ = XᵀX (variables are centered). The eigenvectors associated with the k largest eigenvalues of Σ are the principal components. Assemble these eigenvectors into a d × k matrix Σ̃, and express d-dimensional vectors x by projecting them onto the k-dimensional x̃ = Σ̃ᵀx.
• Heuristics for the first principal component
We have data X ∈ M_{n×d}, i.e. n vectors x_i ∈ R^d. Let ω_1 ∈ R^d denote weights. We want to maximize the variance of z_1 = ω_1ᵀx,

max (1/n) ∑_{i=1}^n ( ω_1ᵀx_i − ω_1ᵀx̄ )² = max ω_1ᵀΣω_1

subject to ‖ω_1‖ = 1.
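This can be checked numerically: the first component obtained "by hand" from the eigen decomposition coincides (up to sign) with stats::prcomp.

```r
# First principal component via the covariance matrix, on the myocarde covariates
Xc <- scale(myocarde[, 1:7], scale = FALSE)   # centered covariates
Sigma <- crossprod(Xc) / nrow(Xc)
w1 <- eigen(Sigma)$vectors[, 1]               # eigenvector of the largest eigenvalue
z1 <- Xc %*% w1
cor(z1, prcomp(Xc)$x[, 1])                    # 1 or -1
```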
We can use a Lagrange multiplier to solve this constrained optimization problem,

max_{ω_1, λ_1} ω_1ᵀΣω_1 + λ_1( 1 − ω_1ᵀω_1 )

If we differentiate,

Σω_1 − λ_1ω_1 = 0, i.e. Σω_1 = λ_1ω_1

i.e. ω_1 is an eigenvector of Σ, with eigenvalue the Lagrange multiplier. It should be the one with the largest eigenvalue.
A nonlinear version can be obtained using reproducing kernel Hilbert spaces and kernels, see Schölkopf et al. (1996, Nonlinear Component Analysis as a Kernel Eigenvalue Problem).
For the second one, we should solve a similar problem, with ω_2 ⊥ ω_1, i.e. ω_2ᵀω_1 = 0.
• Heuristics for the second principal component

max ω_2ᵀΣω_2 subject to ‖ω_2‖ = 1 and ω_2ᵀω_1 = 0

The Lagrangian is

ω_2ᵀΣω_2 + λ_2( 1 − ω_2ᵀω_2 ) − λ_1 ω_2ᵀω_1

and when differentiating with respect to ω_2, we get Σω_2 = λ_2ω_2; hence, λ_2 is the second largest eigenvalue, etc.
Non-supervised : PLS
Partial Least Squares
(i) set x_j^(0) = x_j
(ii) at step k,
• fit regressions marginally, y = θ_j x_j^(k−1), and set z_k = θ_1 x_1^(k−1) + · · · + θ_p x_p^(k−1)
• regress y on z_k and let ŷ^(k) be the prediction (of y)
• orthogonalize each x_j^(k−1) by regressing on z_k, i.e. x_j^(k) = x_j^(k−1) − z_k τ_j where τ_j = (z_kᵀz_k)⁻¹ z_kᵀ x_j^(k−1)
(iii) loop step (ii), and set finally ŷ = ŷ^(1) + ŷ^(2) + · · ·
In practice, PLS and PCA give similar results (especially with a low R²).
See the pls library (on partial least squares), with pls::plsr for PLS regression, or pls::pcr for principal components regression. One can use stats::prcomp for PCA.
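A short usage sketch of both regressions on the myocarde data (treating the binary outcome as a numeric 0/1 variable is an assumption here):

```r
# PLS regression and principal components regression with 3 components
library(pls)
df_pls <- data.frame(y = as.numeric(factor(myocarde$PRONO)) - 1,
                     scale(myocarde[, 1:7]))
fit_plsr <- plsr(y ~ ., data = df_pls, ncomp = 3)
fit_pcr  <- pcr(y ~ ., data = df_pls, ncomp = 3)
cor(drop(predict(fit_plsr, ncomp = 3)), drop(predict(fit_pcr, ncomp = 3)))
```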
Non-supervised : Principal Components on Handwritten Pictures
Consider some handwritten pictures, from the mnist dataset. Here {(y_i, x_i)} with y_i = “3” and x_i ∈ [0, 1]^{28×28}. A variable is a pixel here... The matrix X is an n × 784 matrix, where n is the number of pictures in the dataset.
Consider the PCA of the matrix X; then ω_j ∈ R^784 can be interpreted as a picture (after rescaling). We can also consider the approximation of x_i using only the first k principal components (here k = 3).
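A sketch of that approximation on the mnist “3”s, reusing the flattened matrix MV built earlier (one picture per row):

```r
# PCA on the "3"s and reconstruction of the first picture with k components
idx3 <- which(df$y == 3)
X3   <- MV[idx3, ]
pca  <- prcomp(X3)
k    <- 3
x1_approx <- pca$center + pca$rotation[, 1:k] %*% pca$x[1, 1:k]
image(matrix(x1_approx, 28, 28), col = grey(seq(1, 0, length = 256)))
```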
Non-supervised : PCA on Tensor Data
• 1D tensor (vector): A = [A_i] ∈ R^d, where i = individual
• 2D tensor (matrix): A = [A_{i,j}] ∈ M_{d1,d2}, where j = feature
• 3D tensor (array): A = [A_{i,j,k}], i.e. a collection of matrices A_i ∈ M_{d1,d2}, e.g. a (black & white) picture per individual
• 4D tensor (array): A = [A_{i,j,k,l}], e.g. a color picture per individual, with l ∈ {red, green, blue}
One can define a multilinear principal component analysis; see also the Tucker decomposition, and the polyadic decomposition of Hitchcock (1927, The expression of a tensor or a polyadic as a sum of products). We can use those decompositions to derive a simple classifier.
Non-supervised : Classification of Handwritten Pictures
Consider a dataset with only three labels, y_i ∈ {“1”, “2”, “3”}, and the k principal components obtained by PCA,

{x_1, · · · , x_784} → {x̃_1, · · · , x̃_k}

E.g. the scatterplot of {x̃_{1,i}, x̃_{2,i}}, with one colour per label y_i = “1”, “2” or “3”.
Let m_k denote the multinomial logistic regression on x̃_1, · · · , x̃_k, and visualize the misclassification rate.
See Nielsen (2018, Neural Networks and Deep Learning).
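A sketch of that classifier with nnet::multinom, reusing the digits 1, 2 and 3 extracted earlier in df and MV:

```r
# Multinomial logistic regression on the first k principal components
library(nnet)
pca_all <- prcomp(MV)
k <- 2
df_k <- data.frame(y = factor(df$y), pca_all$x[, 1:k])
m_k <- multinom(y ~ ., data = df_k, trace = FALSE)
mean(predict(m_k) != df_k$y)   # in-sample misclassification rate
```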
Non-supervised : Autoencoders
PCA is closely related to a neural network.
Autoencoder
An autoencoder is a neural network whose outputs are its own inputs.
Here z = f(Wx) and χ = g(Mz). Given f and g, solve

min_{W,M} ∑_{i=1}^n ℓ( x_i, χ_i ) = min_{W,M} ∑_{i=1}^n ℓ( x_i, MWx_i ) if f and g are linear
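A minimal keras sketch of a (linear) autoencoder on the flattened mnist pictures MV, compressing 784 pixels down to 2 and back (the architecture and training settings are illustrative assumptions):

```r
# 784 -> 2 -> 784 linear autoencoder, trained to reproduce its own input
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 2, activation = "linear", input_shape = 784) %>%  # encoder z = f(Wx)
  layer_dense(units = 784, activation = "linear")                       # decoder g(Mz)
model %>% compile(loss = "mse", optimizer = "adam")
model %>% fit(MV / 255, MV / 255, epochs = 10, batch_size = 32)
```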
Neural Networks for More Complex Pictures
The algorithm is trained on tagged pictures, i.e. (x_i, y_i) where here y_i is some class of fruit (banana, apple, kiwi, tomato, lime, cherry, pear, lychee, papaya, etc.). The challenge is to provide a new picture x_{n+1} and to see the prediction ŷ_{n+1}.
See fruits-360/Training
Note that x_{n+1} can be something that was not in the training dataset (here a yellow zucchini), or some invented object (here a blue banana).
See also pyimagesearch for a food/no-food classifier.
Neural Networks for Text and Go (Alpha Go)
Neural nets are also popular for text classification...
RNN : Recurrent Neural Networks
Introduced in Rumelhart, Hinton & Williams (1986, Learning representations by back-propagating errors).
RNN : Recurrent Neural Networks
A neural network where connections between nodes form a directed graph along a temporal sequence, which allows it to exhibit temporal dynamic behavior.
They will be discussed in the part on time series and dynamic modeling.