# 4 Classification & Neural Nets
Arthur Charpentier (Université du Québec à Montréal)
Machine Learning & Econometrics
SIDE Summer School - July 2019
@freakonometrics · freakonometrics.hypotheses.org
Modeling a 0/1 random variable
Myocardial infarction of patients admitted in E.R.
> myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",
                        header = TRUE, sep = ";")
dataset {(xi, yi)} where i = 1, . . . , n, n = 73.
1 FRCAR : heart rate x1
2 INCAR : heart index x2
3 INSYS : stroke index x3
4 PRDIA : diastolic pressure x4
5 PAPUL : pulmonary arterial pressure x5
6 PVENT : ventricular pressure x6
7 REPUL : lung resistance x7
8 PRONO : death or survival y
Classification : (Linear) Fisher’s Discrimination
Consider the following naive classification rule
m*(x) = argmax_y {P[Y = y | X = x]}
or, using Bayes' formula,
m*(x) = argmax_y {P[X = x | Y = y] · P[Y = y] / P[X = x]}
(where P[X = x] is the density in the continuous case).
In the case where y takes two values, {0, 1} here as is standard, one can rewrite the latter as
m*(x) = 1 if E(Y | X = x) > 1/2, and 0 otherwise.
The set
DS = {x : E(Y | X = x) = 1/2}
is called the decision boundary.
Assume that X | Y = 0 ∼ N(µ0, Σ0) and X | Y = 1 ∼ N(µ1, Σ1); then explicit expressions can be derived:
m*(x) = 1 if r1² < r0² + 2 log[P(Y = 1)/P(Y = 0)] + log[|Σ0|/|Σ1|], and 0 otherwise,
where ry² is the Mahalanobis distance,
ry² = [x − µy]ᵀ Σy⁻¹ [x − µy]
Let δy(x) be defined as
δy(x) = −(1/2) log |Σy| − (1/2) [x − µy]ᵀ Σy⁻¹ [x − µy] + log P(Y = y)
The decision boundary of this classifier is
{x such that δ0(x) = δ1(x)}
which is quadratic in x. This is quadratic discriminant analysis.
In Fisher (1936, The use of multiple measurements in taxonomic problems), it was assumed that Σ0 = Σ1.
In that case, actually,
δy(x) = xᵀ Σ⁻¹ µy − (1/2) µyᵀ Σ⁻¹ µy + log P(Y = y)
and the decision boundary is now linear in x.
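As an illustration (not on the original slides), here is a minimal R sketch fitting both discriminants on the myocarde data loaded earlier, using the MASS package:
library(MASS)
fit_lda <- lda(PRONO ~ ., data = myocarde)    # common covariance matrix, linear boundary
fit_qda <- qda(PRONO ~ ., data = myocarde)    # class-specific covariances, quadratic boundary
head(predict(fit_lda)$posterior)              # estimated P(Y = y | X = x)
table(predict(fit_lda)$class, myocarde$PRONO) # in-sample confusion table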
Neural Nets & Deep Learning in a Nutshell
Very popular for pictures...
A picture xi is
• an n × n matrix in {0, 1}^(n²) for black & white
• an n × n matrix in [0, 1]^(n²) for grey-scale
• a 3 × n × n array in ([0, 1]³)^(n²) for color
• a T × 3 × n × n tensor in (([0, 1]³)^T)^(n²) for video
y here is the label ("8", "9", "6", etc.)
Suppose we want to recognize a "6" on a picture:
m(x) = +1 if x is a "6", and −1 otherwise
Consider some specific pixels, and associate weights ω such that
m(x) = sign( Σ_{i,j} ωi,j xi,j )
where
xi,j = +1 if pixel (i, j) is black, and −1 if pixel (i, j) is white,
for some weights ωi,j (that can be negative...)
A deep network is a network with a lot of layers,
m(x) = sign( Σ_i ωi mi(x) )
where the mi's are outputs of previous neural nets.
Those layers can capture shapes in some areas, nonlinearities, cross-dependence, etc.
Classification : Neural Nets
A neuron is simply: (see the diagram on the slide)
The human brain has ∼ 100 billion neurons.
Good at: vision, speech, driving, playing games, etc.
The most interesting part is not 'neural' but 'network'.
Networks have nodes, and edges (possibly directed) that connect nodes; or, to be more specific, some sort of flow network.
In such a network, we usually have sources (here multiple, {s1, s2, s3}) on the left, and a sink (here {t}) on the right. To continue with this metaphorical introduction, information from the sources should reach the sink.
Usually, the sources are explanatory variables, {x1, · · · , xp}, and the sink is our variable of interest y. The output here is a binary variable y ∈ {0, 1}.
Classification : the Binary Threshold Neuron & the Perceptron
If x ∈ {0, 1}^p, McCulloch & Pitts (1943, A logical calculus of the ideas immanent in nervous activity) suggested a simple model, with threshold b,
yi = f( Σ_{j=1}^p xj,i ) where f(x) = 1(x ≥ b)
where 1(x ≥ b) = +1 if x ≥ b and 0 otherwise, or (equivalently)
yi = f( ω + Σ_{j=1}^p xj,i ) where f(x) = 1(x ≥ 0)
with weight ω = −b. The trick of adding 1 as an input was very important!
ω = −1 is the or logical operator: yi = 1 if ∃j such that xj,i = 1
ω = −p is the and logical operator: yi = 1 if ∀j, xj,i = 1
but this is not possible for the xor logical operator: yi = 1 if x1,i ≠ x2,i.
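A quick (hypothetical) R check of the or / and behaviour of this threshold neuron on binary inputs:
threshold_neuron <- function(x, omega) as.numeric(omega + sum(x) >= 0)
x <- c(0, 1, 0)
threshold_neuron(x, omega = -1)           # or  : returns 1, at least one xj equals 1
threshold_neuron(x, omega = -length(x))   # and : returns 0, not all xj equal 1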
Rosenblatt (1961, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms) considered the extension where the x's are real-valued, with weights ω ∈ R^p,
yi = f( Σ_{j=1}^p ωj xj,i ) where f(x) = 1(x ≥ b)
where 1(x ≥ b) = +1 if x ≥ b and 0 otherwise, or (equivalently)
yi = f( ω0 + Σ_{j=1}^p ωj xj,i ) where f(x) = 1(x ≥ 0)
with weights ω ∈ R^(p+1)
Minsky & Papert (1969, Perceptrons: an Introduction to Computational Geometry) proved that the perceptron is a linear separator, hence not very powerful.
Define the sigmoid function f(x) = 1 / (1 + e^(−x)) = e^x / (e^x + 1) (= logistic function).
This function f is called the activation function.
If y ∈ {−1, +1}, one can consider the hyperbolic tangent f(x) = (e^x − e^(−x)) / (e^x + e^(−x)) or the inverse tangent function (see wikipedia).
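As a small side illustration (not on the slides), the two activation functions can be compared in R:
sigmoid  <- function(x) 1 / (1 + exp(-x))                          # maps R to (0, 1)
tanh_act <- function(x) (exp(x) - exp(-x)) / (exp(x) + exp(-x))    # maps R to (-1, 1)
curve(sigmoid, from = -5, to = 5, ylim = c(-1, 1), ylab = "f(x)")
curve(tanh_act, from = -5, to = 5, add = TRUE, lty = 2)
legend("topleft", legend = c("sigmoid", "tanh"), lty = 1:2)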
Classification : Neural Nets
So here, for a classification problem,
yi = f( ω0 + Σ_{j=1}^p ωj xj,i ) = p(xi)
The error of that model can be computed using the quadratic loss function,
Σ_{i=1}^n (yi − p(xi))²
or the cross-entropy,
−Σ_{i=1}^n (yi log p(xi) + [1 − yi] log[1 − p(xi)])
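A minimal sketch (with made-up probabilities, not from the slides) of the two loss functions:
quadratic_loss <- function(y, p) sum((y - p)^2)
cross_entropy  <- function(y, p) -sum(y * log(p) + (1 - y) * log(1 - p))
y <- c(0, 1, 1, 0)            # observed 0/1 labels
p <- c(0.2, 0.8, 0.6, 0.4)    # fitted probabilities p(x_i)
quadratic_loss(y, p)
cross_entropy(y, p)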
If
p(x) = f( ω0 + Σ_{h=1}^3 ωh fh( ωh,0 + xᵀωh ) )
we want to solve, with back propagation,
ω* = argmin { (1/n) Σ_{i=1}^n ℓ(yi, p(xi)) }
To visualize a neural network, use
library(devtools)
source_url('https://freakonometrics.free.fr/nnet_plot.R')
library(nnet)
mix = function(z) (z - min(z)) / (max(z) - min(z))
myocarde[, 1:7] = apply(myocarde[, 1:7], 2, mix)   # rescale the 7 covariates column-wise
fit_nn = nnet(PRONO ~ ., data = myocarde, size = 3)
plot.nnet(fit_nn)
or
library(neuralnet)
sigmoid = function(x) 1 / (1 + exp(-x))            # activation passed to neuralnet below
fit_nn = neuralnet(formula(glm(PRONO ~ ., data = myocarde)),
                   myocarde, hidden = 3, act.fct = sigmoid)
plot(fit_nn)
Of course, one can consider a multiple layer network,
p(x) = f( ω0 + Σ_{h=1}^3 ωh fh( ωh,0 + zhᵀωh ) )
where
zh = f( ωh,0 + Σ_{j=1}^{kh} ωh,j fh,j( ωh,j,0 + xᵀωh,j ) )

library(neuralnet)
model_nnet = neuralnet(formula(glm(PRONO ~ ., data = myocarde)),
                       myocarde, hidden = 3, act.fct = sigmoid)
# for several hidden layers, pass a vector of sizes, e.g. hidden = c(3, 2)
plot(model_nnet)
Consider a single hidden layer, with 3 different neurons, so that
p(x) = f( ω0 + Σ_{h=1}^3 ωh fh( ωh,0 + Σ_{j=1}^p ωh,j xj ) )
or
p(x) = f( ω0 + Σ_{h=1}^3 ωh fh( ωh,0 + xᵀωh ) )
Optimization Issues
While classical econometrics is (most of the time) based on convex optimization problems, (deep) neural networks usually lead to non-convex ones.
Consider the optimization problem
min_{x ∈ X} f(x)
solved by gradient descent,
x_{n+1} = x_n − γn ∇f(x_n), n ≥ 0
with starting point x_0.
γn is called the learning rate.
Sometimes the dataset is really big: we cannot pass all the data to the computer at once to compute ∇f(x), so we need to divide the data into smaller pieces and feed them to the computer one by one.
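A minimal R sketch (toy objective, not from the slides) of the gradient-descent recursion:
f      <- function(x) (x - 2)^2        # toy objective, minimized at x* = 2
grad_f <- function(x) 2 * (x - 2)      # its gradient
x     <- 10                            # starting point x_0
gamma <- 0.1                           # learning rate gamma_n, kept constant here
for (n in 1:50) x <- x - gamma * grad_f(x)
x                                      # close to 2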
Batches
The batch size is a hyperparameter that defines the number of sub-samples we use to compute quantities (e.g. the gradient).
Epoch
One epoch is when the dataset is passed forward and backward through the neural network (only once).
If we divide a dataset of 2,000 examples into batches of 400, then it will take 5 iterations to complete 1 epoch.
The number of epochs is the hyperparameter that defines the number of times the learning algorithm will work through the entire training dataset.
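A hedged sketch of the batch / epoch bookkeeping (no actual model is fitted here; the sizes match the example above):
n_obs      <- 2000
batch_size <- 400
n_epochs   <- 3
for (epoch in 1:n_epochs) {
  shuffled <- sample(1:n_obs)                 # reshuffle the observations at each epoch
  batches  <- split(shuffled, ceiling(seq_along(shuffled) / batch_size))
  for (b in batches) {
    # compute the gradient on the observations indexed by b and update the weights
    # (5 updates per epoch with these sizes)
  }
}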
Classification : Image Recognition with Neural Networks
Classical {0, 1, · · · , 9} handwritten digit recognition
see tensorflow (Google) and keras.
Classification : Image Recognition with Logistic Regression?
See https://www.tensorflow.org/install
library(tensorflow)
install_tensorflow(method = "conda")
library(keras)
install_keras(method = "conda")

mnist <- dataset_mnist()
idx123 = which(mnist$train$y %in% c(1, 2, 3))
V <- mnist$train$x[idx123[1:800], , ]
MV = NULL
for(i in 1:800) MV = cbind(MV, as.vector(V[i, , ]))
MV = t(MV)
df = data.frame(y = mnist$train$y[idx123][1:800], x = MV)
reg = glm((y == 1) ~ ., data = df, family = binomial)
Here xi ∈ R^784 (for a small greyscale picture).
Using Principal Components to Reduce Dimension
Instead of using X as regressors, use X̃ = XA, with fewer covariates.
Note that y = β0 + X̃β + ε becomes y = β0 + Xβ̃ + ε where β̃ = Aβ.
We can use either principal components or PLS (partial least squares), as introduced in Wold (1966, Estimation of principal components and related models by iterative least squares).
For both, we need to normalize the data (to have mean 0 and variance 1).
With PCA, the first component z1 = Xω1 is obtained as the solution of
ω1 = argmax_{‖ω‖=1} {Var[Xω]} = argmax_{‖ω‖=1} {⟨Xω, Xω⟩}
With PLS, the first component z1 = Xω1 is obtained as the solution of
ω1 = argmax_{‖ω‖=1} {Cov[y, Xω]} = argmax_{‖ω‖=1} {⟨y, Xω⟩}
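A short R sketch (hypothetical toy data, not from the slides) computing those two first directions directly:
set.seed(1)
X <- scale(matrix(rnorm(200 * 5), 200, 5))              # normalized design matrix
y <- scale(X %*% c(1, -1, 0, 0, 2) + rnorm(200))
w_pca <- eigen(crossprod(X))$vectors[, 1]               # maximizes Var[X w] under ||w|| = 1
w_pls <- crossprod(X, y)
w_pls <- w_pls / sqrt(sum(w_pls^2))                     # maximizes Cov[y, X w] under ||w|| = 1
z1_pca <- X %*% w_pca                                   # first principal component
z1_pls <- X %*% w_pls                                   # first PLS component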
Classification : Image Recognition and Convolutional Neural Network
Classical strategy: reduce dimension via layers z (e.g. PCA, principal component analysis) and then consider a classifier z → y.
CNN : Convolutional Neural Network
A convolutional neural network is not a fully connected network (where each neuron in one layer is connected to all neurons in the next layer): each neuron is only connected to a small local region of the previous layer, with shared weights.
Non-supervised : PCA
Projection method, used to reduce dimensionality
PCA : Principal Component Analysis
find a small number of directions in input space that explain variation in input data
represent data by projecting along those directions
Consider (joint) n observations of d variables, stored in a matrix X ∈ M_{n×d}
We want to project (multiply by a projection matrix) into a (much) lower dimensional space, k < d
Search for orthogonal directions in space with highest variance
Information is stored in the covariance matrix Σ = XᵀX (variables are centered)
The eigenvectors associated with the k largest eigenvalues of Σ are the principal components
Assemble these eigenvectors into a d × k matrix Σ̃
Express d-dimensional vectors x by projecting them to k-dimensional x̃, x̃ = Σ̃ᵀx
• Heuristics for the first principal component
We have data X ∈ M_{n×d}, i.e. n vectors x ∈ R^d.
Let ω1 ∈ R^d denote weights.
We want to maximize the variance of z1 = ω1ᵀx,
max (1/n) Σ_{i=1}^n (ω1ᵀxi − ω1ᵀx̄)² = max ω1ᵀΣω1
subject to ‖ω1‖ = 1.
We can use a Lagrange multiplier to solve this constrained optimization problem,
max_{ω1, λ1} { ω1ᵀΣω1 + λ1(1 − ω1ᵀω1) }
If we differentiate,
Σω1 − λ1ω1 = 0, i.e. Σω1 = λ1ω1
i.e. ω1 is an eigenvector of Σ, with the Lagrange multiplier as eigenvalue. It should be the one associated with the largest eigenvalue.
A nonlinear version can be obtained using reproducing kernel Hilbert spaces and kernels, see Schölkopf et al. (1996, Nonlinear Component Analysis as a Kernel Eigenvalue Problem).
For the second component, we should solve a similar problem, with ω2 ⊥ ω1, i.e. ω2ᵀω1 = 0.
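A quick numerical check (hypothetical data, not from the slides) that this eigenvector is what stats::prcomp returns:
set.seed(1)
X  <- scale(matrix(rnorm(100 * 4), 100, 4), center = TRUE, scale = FALSE)
S  <- crossprod(X) / nrow(X)                  # covariance matrix of the centered data
w1 <- eigen(S)$vectors[, 1]                   # eigenvector with the largest eigenvalue
pc <- prcomp(X, center = FALSE)
max(abs(abs(w1) - abs(pc$rotation[, 1])))     # ~ 0 (the sign of a component is arbitrary)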
• Heuristics for the second principal component
max ω2ᵀΣω2 subject to ‖ω2‖ = 1 and ω2ᵀω1 = 0
The Lagrangian is
ω2ᵀΣω2 + λ2(1 − ω2ᵀω2) − λ1ω2ᵀω1
When differentiating with respect to ω2, we get Σω2 = λ2ω2,
hence λ2 is the second largest eigenvalue, etc.
Non-supervised : PLS
Partial Least Squares
(i) set xj^(0) = xj
(ii) at step k,
• fit regressions marginally, y = θj xj^(k−1), and set zk = θ1 x1^(k−1) + · · · + θp xp^(k−1)
• regress y on zk and let ŷ^(k) be the prediction (of y)
• orthogonalize each xj^(k−1) by regressing it on zk, i.e. xj^(k) = xj^(k−1) − zk τj, where τj = (zkᵀzk)⁻¹ zkᵀ xj^(k−1)
(iii) loop over step (ii) and set, finally, ŷ = ŷ^(1) + ŷ^(2) + · · ·
In practice, PLS and PCA give similar results (especially with low R²).
See the pls library (on partial least squares), with pls::plsr for PLS regression, or pls::pcr for principal components regression. One can use stats::prcomp for PCA.
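A minimal sketch (hypothetical data, not from the slides) running both with the pls package:
library(pls)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- drop(X %*% rnorm(10) + rnorm(100))
d <- data.frame(y = y, X)
fit_pls <- plsr(y ~ ., data = d, ncomp = 3, scale = TRUE)   # partial least squares
fit_pcr <- pcr(y ~ ., data = d, ncomp = 3, scale = TRUE)    # principal components regression
# fitted values of the 3-component models are usually very close
cor(drop(fitted(fit_pls)[, , 3]), drop(fitted(fit_pcr)[, , 3]))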
Non-supervised : Principal Components on Handwritten Pictures
Consider some handwritten pictures, from the mnist dataset. Here {(yi, xi)} with yi = "3" and xi ∈ [0, 1]^(28×28).
A variable is a pixel here...
Matrix X is an n × 784 matrix, where n is the number of pictures in the dataset.
Non-supervised : PCA on Handwritten Pictures
Consider the PCA of matrix X, then ωj ∈ R^784 can be interpreted as a picture (after rescaling).
We can also consider the approximation of xi using only the first k principal components (here k = 3)
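A hedged sketch of that approximation in R, assuming the 800 × 784 pixel matrix MV built on the earlier mnist slide:
pca <- prcomp(MV, center = TRUE)
k   <- 3
x1_hat <- pca$center + pca$x[1, 1:k] %*% t(pca$rotation[, 1:k])   # rank-k reconstruction
image(matrix(MV[1, ], 28, 28))       # original picture
image(matrix(x1_hat, 28, 28))        # approximation with the first k principal components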
Non-supervised : PCA on Tensor Data
1D Tensor (vector) : A = [Ai] ∈ R^d, i = individual
2D Tensor (matrix) : A = [Ai,j] ∈ M_{d1,d2}, j = feature
3D Tensor (array) : A = [Ai,j,k], i.e. a collection of Ai's with Ai ∈ M_{d1,d2}, e.g. a (black & white) picture
4D Tensor (array) : A = [Ai,j,k,l], e.g. a color picture, with l ∈ {red, green, blue}
One can define a multilinear principal component analysis
see also Tucker decomposition, from Hitchcock (1927, The expression of a tensor
or a polyadic as a sum of products)
We can use those decompositions to derive a simple classifier.
Non-supervised : Classification of Handwritten Pictures
Consider a dataset with only three labels, yi ∈ {"1", "2", "3"}, and the k principal components obtained by PCA,
{x1, · · · , x784} → {x̃1, · · · , x̃k}
E.g. the scatterplot {x̃1,i, x̃2,i}, with • when yi = "1", • yi = "2" and • yi = "3".
Let mk denote the multinomial logistic regression on x̃1, · · · , x̃k, and visualize the misclassification rate.
See Nielsen (2018, Neural Networks and Deep Learning)
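A minimal sketch (assuming MV and df from the earlier mnist slide) of that multinomial logistic regression on the first k components:
library(nnet)
pca <- prcomp(MV)
k   <- 5
dk  <- data.frame(y = factor(df$y), pca$x[, 1:k])
m_k <- multinom(y ~ ., data = dk, trace = FALSE)
mean(predict(m_k) != dk$y)    # in-sample misclassification rate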
Non-supervised : Autoencoders
PCA is closely related to a neural network.
Autoencoder
An autoencoder is a neural network trained so that its outputs reproduce its own inputs.
Here z = f(Wx) and χ = g(Mz).
Given f and g, solve
min_{W,M} Σ_{i=1}^n ℓ(xi, χi) = min_{W,M} Σ_{i=1}^n ℓ(xi, MWxi) if f and g are linear
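A hedged keras sketch (assuming keras is installed as earlier, and the pixel matrix MV): a tiny autoencoder with a 3-unit linear bottleneck, which plays the role of the first principal components:
library(keras)
MVs     <- MV / 255                                              # rescale pixels to [0, 1]
input   <- layer_input(shape = 784)
code    <- layer_dense(input, units = 3, activation = "linear")     # z = f(Wx)
output  <- layer_dense(code, units = 784, activation = "linear")    # chi = g(Mz)
autoenc <- keras_model(input, output)
autoenc %>% compile(optimizer = "adam", loss = "mse")
autoenc %>% fit(MVs, MVs, epochs = 20, batch_size = 32, verbose = 0)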
Neural Networks for More Complex Pictures
The algorithm is trained on tagged pictures, i.e. (xi, yi) where here yi is some class of fruits (banana, apple, kiwi, tomato, lime, cherry, pear, lychee, papaya, etc.)
The challenge is to provide a new picture xn+1 and to see the prediction yn+1.
See fruits-360/Training
Note that xn+1 can be something that was not in the training dataset (here a yellow zucchini) or some invented one (here a blue banana).
See also pyimagesearch for a food/no-food classifier.
Neural Networks for Text and Go (Alpha Go)
Neural nets are also popular for text classification...
RNN : Recurrent Neural Networks
Introduced in Rumelhart, Hinton & Williams (1986, Learning representations by back-propagating errors).
RNN : Recurrent Neural Networks
A neural network where connections between nodes form a directed graph along a temporal sequence.
This allows it to exhibit temporal dynamic behavior.
They will be discussed in the part on time series and dynamic modeling.
Neural Networks for Pictures Classification
library(keras)
library(lime)
library(magick)
model <- application_vgg16(weights = "imagenet", include_top = TRUE)
test_image_files_path <- "/fruits-360/Test"
img <- image_read('https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Banana-Single.jpg/272px-Banana-Single.jpg')
img_path <- file.path(test_image_files_path, "Banana", 'banana.jpg')
image_write(img, img_path)
plot(as.raster(img))
plot_superpixels(img_path, n_superpixels = 35, weight = 10)
image_prep <- function(x) {
  arrays <- lapply(x, function(path) {
    img <- image_load(path, target_size = c(224, 224))
    x <- image_to_array(img)
    x <- array_reshape(x, c(1, dim(x)))
    x <- imagenet_preprocess_input(x)
  })
  do.call(abind::abind, c(arrays, list(along = 1)))
}
res <- predict(model, image_prep(img_path))
> imagenet_decode_predictions(res)
[[1]]
  class_name class_description        score
1  n07753592            banana 0.9929747581
2  n03532672              hook 0.0013420789
3  n07747607            orange 0.0010816196
4  n07749582             lemon 0.0010625814
5  n07716906  spaghetti_squash 0.0009176208
model_labels <- readRDS(system.file('extdata', 'imagenet_labels.rds', package = 'lime'))
explainer <- lime(img_path, as_classifier(model, model_labels), image_prep)
explanation <- explain(img_path, explainer,
                       n_labels = 2, n_features = 35,
                       n_superpixels = 35, weight = 10,
                       background = "white")
plot_image_explanation(explanation)
See http://junma5.weebly.com/ for more advanced models