1. Use & Value of Unusual Data
Arthur Charpentier
UQAM
Actuarial Summer School 2019
2. The Team for the Week
Jean-Philippe Boucher, UQAM (Montréal, Canada)
Professor, holder of The Co-operators Chair in Actuarial Risk Analysis, author of several research articles in actuarial science.
@J P Boucher jpboucher.uqam.ca
Arthur Charpentier, UQAM (Montréal, Canada)
Professor, editor of Computational Actuarial Science with R. Former director of the Data Science for Actuaries program of the French Institute of Actuaries.
@freakonometrics freakonometrics freakonometrics.hypotheses.org
Ewen Gallic, AMSE (Aix-Marseille, France)
Assistant professor, teaching computational econometrics, data science and machine learning.
@3wen 3wen egallic.fr
4. General Context (in Insurance)
Networks and P2P Insurance (via https://sharable.life and source)
5. General Context (in Insurance)
Networks and P2P Insurance, via Insurance Business (2016)
6. General Context (in Insurance)
Telematics and Usage-Based Insurance (source, source, The Actuary)
7. General Context (in Insurance)
Satellite pictures (source) and Index-based Flood Insurance (IBFI, http://ibfi.iwmi.org/)
8. General Context (in Insurance)
Very small granularity (1 pixel is 15 × 15 cm², see detail)
9. More and More (exponentially) Data
In 1969, (legend says) 5 photographs were taken by Neil Armstrong.
More than 800,000,000 pictures are now uploaded on Facebook every day (2 billion photos per day across Instagram, Messenger and WhatsApp).
10. Data Complexity
Pictures, drawings, videos, handwriting, text, ECG, medical imagery, satellite pictures, tweets, phone calls, etc.
Hard to distinguish structured and unstructured data...
12. Insurance Data
From questionnaires to data frames $x_{i,j}$
row i: insured, policy, claim
column j: variables (features)
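As a minimal sketch in R (all values made up), such a table is simply a data frame, one row per policy and one column per feature:

```r
# Hypothetical toy portfolio (every value here is made up)
df <- data.frame(
  policy    = 1:5,                               # row i: one policy
  age       = c(23, 45, 37, 61, 29),             # column j: a feature
  region    = factor(c("A", "B", "A", "C", "B")),
  exposure  = c(1, .5, 1, 1, .25),               # fraction of the year covered
  nb_claims = c(0, 1, 0, 2, 0)                   # observed claim counts
)
str(df)   # x[i, j]: row i = policy, column j = variable
```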
13. From Pictures to Data
Consider pictures saa-iss.ch/testalbum/nggallery/all-old-galleries/
14. From Pictures to Data
Face detection face-api.js (justadudewhohacks.github.io)
15. From Pictures to Data
Expression recognition face-api.js (justadudewhohacks.github.io)
16. From Pictures to Data
Face recognition face-api.js (justadudewhohacks.github.io)
17. Pictures
Extracting information from a picture
What is on the picture?
What is the cost of such a claim?
High-dimensional data:
an h × w color picture, e.g. 850 × 1280, gives $x_i\in\mathbb R^d$ with d > 3 million
Need new statistical tools:
(deep) neural networks perform well,
or reduce the dimension
Pictures: Odermatt (2003, Karambolage)
18. Text
Kaufman et al. (2016, Natural Language Processing-Enabled and
Conventional Data Capture Methods for Input to Electronic Health
Records), Chen et al. (2018, A Natural Language Processing
System That Links Medical Terms in Electronic Health Record
Notes to Lay Definitions) or Simon (2018, Natural Language
Processing for Healthcare Customers)
19. Statistics & ML Toolbox : PCA
Projection method, used to reduce dimensionality
Principal Component Analysis (PCA)
find a small number of directions in input space that explain variation in input data
represent data by projecting along those directions
n observations in dimension d, stored in a matrix $X\in\mathcal M_{n\times d}$
Natural idea: linearly project (multiply by a projection matrix) to a much lower dimensional space, k < d
Search for orthogonal directions in space with highest variance
Information is stored in the covariance matrix $\Sigma=X^\top X$ when variables are centered
20. Statistics & ML Toolbox : PCA
Find the k largest eigenvectors of Σ: the principal components
Assemble these eigenvectors into a d × k matrix $\widetilde\Sigma$
Express d-dimensional vectors x by projecting them onto the k-dimensional $\widetilde x=\widetilde\Sigma^\top x$
We have data $X\in\mathcal M_{n\times d}$, i.e. n vectors $x_i\in\mathbb R^d$.
Let $\omega_1\in\mathbb R^d$ denote weights. We want to maximize the variance of $z=\omega_1^\top x$:
$$\max\;\frac{1}{n}\sum_{i=1}^n\big(\omega_1^\top x_i-\omega_1^\top\bar x\big)^2=\max\;\omega_1^\top\Sigma\,\omega_1\quad\text{s.t. }\|\omega_1\|=1$$
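A minimal R sketch of this construction on simulated data (all names are illustrative): center X, form Σ, and extract the leading eigenvector:

```r
set.seed(1)
n <- 200; d <- 5
X <- matrix(rnorm(n * d), n, d)
X[, 2] <- X[, 1] + rnorm(n, sd = .3)          # induce some correlation
Xc <- scale(X, center = TRUE, scale = FALSE)  # center the variables
Sigma <- crossprod(Xc) / n                    # covariance matrix t(Xc) %*% Xc / n
e <- eigen(Sigma)
omega1 <- e$vectors[, 1]                      # direction of largest variance
z <- Xc %*% omega1                            # first principal component scores
sum(z^2) / n                                  # equals e$values[1]
```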
21. Statistics & ML Toolbox : PCA
To solve
$$\max\;\frac{1}{n}\sum_{i=1}^n\big(\omega_1^\top x_i-\omega_1^\top\bar x\big)^2=\max\;\omega_1^\top\Sigma\,\omega_1\quad\text{s.t. }\|\omega_1\|=1$$
we can use a Lagrange multiplier, and solve the unconstrained optimization problem
$$\max_{\omega_1,\lambda_1}\;\omega_1^\top\Sigma\,\omega_1+\lambda_1\big(1-\omega_1^\top\omega_1\big)$$
Differentiating and setting the derivative to zero gives $\Sigma\omega_1-\lambda_1\omega_1=0$, i.e. $\Sigma\omega_1=\lambda_1\omega_1$: $\omega_1$ is an eigenvector of Σ, with the Lagrange multiplier as eigenvalue. It should be the one associated with the largest eigenvalue.
22. Statistics & ML Toolbox : PCA
$$\max\;\omega_2^\top\Sigma\,\omega_2\quad\text{subject to }\|\omega_2\|=1\text{ and }\omega_2^\top\omega_1=0$$
The Lagrangian is $\omega_2^\top\Sigma\,\omega_2+\lambda_2(1-\omega_2^\top\omega_2)-\lambda_1\omega_2^\top\omega_1$;
differentiating, we get $\Sigma\omega_2=\lambda_2\omega_2$, where $\lambda_2$ is the second largest eigenvalue.
Set $z=Wx$ and $\widetilde x=Mz$. We want to solve (where $\|\cdot\|$ is the standard $\ell_2$ norm)
$$\min_{W,M}\;\sum_{i=1}^n\big\|x_i-\widetilde x_i\big\|^2,\quad\text{i.e.}\quad\min_{W,M}\;\sum_{i=1}^n\big\|x_i-MWx_i\big\|^2$$
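The same computation with prcomp, checking the rank-k reconstruction error (a sketch, on simulated data):

```r
set.seed(1)
n <- 200; d <- 5; k <- 2
X <- matrix(rnorm(n * d), n, d)
Xc <- scale(X, center = TRUE, scale = FALSE)
pc <- prcomp(X, center = TRUE, scale. = FALSE)
W <- t(pc$rotation[, 1:k])       # k x d: rows are the k leading eigenvectors
Z <- Xc %*% t(W)                 # scores z = W x
Xtilde <- Z %*% W                # reconstruction x~ = M z, with M = t(W)
mean(rowSums((Xc - Xtilde)^2))   # average squared reconstruction error
```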
23. Statistics & ML Toolbox : GLM
Generalized Linear Model
Flexible generalization of linear regression that allows y to have a distribution other than the normal distribution.
The classical regression is (in matrix form)
$$\underset{n\times1}{y}=\underset{n\times k}{X}\;\underset{k\times1}{\beta}+\underset{n\times1}{\varepsilon}\qquad\text{or}\qquad y_i=x_i^\top\beta+\varepsilon_i$$
$$Y_i\sim\mathcal N(\mu_i,\sigma^2),\quad\text{with }\mu_i=\mathbb E[Y_i|x_i]=x_i^\top\beta$$
It is possible to consider a more general regression. For a classification, $y\in\{0,1\}^n$,
$$Y_i\sim\mathcal B(p_i),\quad\text{with }p_i=\mathbb E[Y_i|x_i]=\frac{e^{x_i^\top\beta}}{1+e^{x_i^\top\beta}}$$
Recall that we assumed that the $(Y_i,X_i)$ were i.i.d. with unknown distribution P.
24. Statistics & ML Toolbox : GLM
In the (standard) linear model, we assume that $(Y|X=x)\sim\mathcal N(x^\top\beta,\sigma^2)$.
The maximum-likelihood estimator of β is then
$$\widehat\beta=\underset{\beta\in\mathbb R^{p+1}}{\text{argmax}}\;\Big\{-\sum_{i=1}^n(y_i-x_i^\top\beta)^2\Big\}$$
For the logistic regression, $(Y|X=x)\sim\mathcal B\big(\text{logit}^{-1}(x^\top\beta)\big)$.
The maximum-likelihood estimator of β is then
$$\widehat\beta=\underset{\beta\in\mathbb R^{p+1}}{\text{argmax}}\;\sum_{i=1}^n y_i\log\big(\text{logit}^{-1}(x_i^\top\beta)\big)+(1-y_i)\log\big(1-\text{logit}^{-1}(x_i^\top\beta)\big)$$
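In R, both estimators are available through glm; the sketch below (on simulated data, all names illustrative) also re-derives the logistic MLE by direct maximization of the log-likelihood, as a check:

```r
set.seed(1)
n <- 500
x <- rnorm(n)
p <- exp(1 - 2 * x) / (1 + exp(1 - 2 * x))     # logit^{-1}(beta0 + beta1 x)
y <- rbinom(n, size = 1, prob = p)
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                                      # maximum-likelihood estimates

# the same estimator, maximizing the log-likelihood directly
loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  sum(y * eta - log(1 + exp(eta)))             # = y log(p) + (1 - y) log(1 - p)
}
optim(c(0, 0), loglik, control = list(fnscale = -1))$par
```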
25. Statistics & ML Toolbox : loss and risk
In a general setting, consider some loss function $\ell:\mathbb R\times\mathbb R\to\mathbb R_+$, e.g. $\ell(y,m)=(y-m)^2$ (quadratic, $\ell_2$ norm)
Average Risk
The average risk associated with $m_n$ is
$$R_P(m_n)=\mathbb E_P\big[\ell\big(Y,m_n(X)\big)\big]$$
26. Statistics & ML Toolbox : loss and risk
In a general setting, consider some loss function $\ell:\mathbb R\times\mathbb R\to\mathbb R_+$, e.g. $\ell(y,m)=(y-m)^2$ (quadratic, $\ell_2$ norm)
Empirical Risk
The empirical risk associated with $m_n$ is
$$\widehat R_n(m_n)=\frac1n\sum_{i=1}^n\ell\big(y_i,m_n(x_i)\big).$$
27. Statistics & ML Toolbox : loss and risk
If we minimize the empirical risk, we overfit...
Hold-Out Cross Validation
1. Split $\{1,2,\cdots,n\}$ into T (training) and V (validation)
2. Estimate m on the sample $(y_i,x_i)$, $i\in T$:
$$\widehat m_T=\underset{m\in\mathcal M}{\text{argmin}}\;\frac1{|T|}\sum_{i\in T}\ell\big(y_i,m(x_{1,i},\cdots,x_{p,i})\big)$$
3. Compute
$$\frac1{|V|}\sum_{i\in V}\ell\big(y_i,\widehat m_T(x_{1,i},\cdots,x_{p,i})\big)$$
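A sketch of hold-out validation in R, with quadratic loss (the 70/30 split and the simulated data are arbitrary choices):

```r
set.seed(1)
n <- 300
df <- data.frame(x = runif(n))
df$y <- sin(2 * pi * df$x) + rnorm(n, sd = .3)
T_idx <- sample(1:n, size = round(.7 * n))   # T: training indices
V_idx <- setdiff(1:n, T_idx)                 # V: validation indices
m_T <- lm(y ~ poly(x, 5), data = df[T_idx, ])
pred <- predict(m_T, newdata = df[V_idx, ])
mean((df$y[V_idx] - pred)^2)                 # empirical risk on V
```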
28. Statistics & ML Toolbox : loss and risk
source: Hastie et al. (2009, The Elements of Statistical Learning)
29. Statistics & ML Toolbox : loss and risk
Leave-One-Out Cross Validation
1. Estimate n models: estimate $\widehat m_{-j}$ on the sample $(y_i,x_i)$, $i\in\{1,\cdots,n\}\setminus\{j\}$:
$$\widehat m_{-j}=\underset{m\in\mathcal M}{\text{argmin}}\;\frac1{n-1}\sum_{i\neq j}\ell\big(y_i,m(x_{1,i},\cdots,x_{p,i})\big)$$
2. Compute
$$\frac1n\sum_{i=1}^n\ell\big(y_i,\widehat m_{-i}(x_i)\big)$$
30. Statistics & ML Toolbox : loss and risk
K-Fold Cross Validation
1. Split $\{1,2,\cdots,n\}$ into K groups $V_1,\cdots,V_K$
2. Estimate K models: estimate $\widehat m_k$ on the sample $(y_i,x_i)$, $i\in\{1,\cdots,n\}\setminus V_k$
3. Compute
$$\frac1K\sum_{k=1}^K\frac1{|V_k|}\sum_{i\in V_k}\ell\big(y_i,\widehat m_k(x_{1,i},\cdots,x_{p,i})\big)$$
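The same idea with K = 10 folds, as a sketch:

```r
set.seed(1)
n <- 300; K <- 10
df <- data.frame(x = runif(n))
df$y <- sin(2 * pi * df$x) + rnorm(n, sd = .3)
fold <- sample(rep(1:K, length.out = n))     # assign each i to a fold V_k
risks <- sapply(1:K, function(k) {
  m_k <- lm(y ~ poly(x, 5), data = df[fold != k, ])   # fit on {1,...,n} \ V_k
  mean((df$y[fold == k] - predict(m_k, newdata = df[fold == k, ]))^2)
})
mean(risks)                                  # K-fold cross-validated risk
```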
31. Statistics & ML Toolbox : loss and risk
Bootstrap Cross Validation
1. Generate B bootstrap samples from $\{1,2,\cdots,n\}$: $I_1,\cdots,I_B$
2. Estimate B models: $\widehat m_b$ on the sample $(y_i,x_i)$, $i\in I_b$
3. Compute
$$\frac1B\sum_{b=1}^B\frac1{n-|I_b|}\sum_{i\notin I_b}\ell\big(y_i,\widehat m_b(x_{1,i},\cdots,x_{p,i})\big)$$
The probability that $i\notin I_b$ is $\big(1-\frac1n\big)^n\sim e^{-1}$ ($\approx36.8\%$).
At stage b, we validate on ∼36.8% of the dataset.
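The ≈36.8% share of out-of-bootstrap observations is easy to verify by simulation:

```r
set.seed(1)
n <- 1000
oob_share <- replicate(500, {
  I_b <- sample(1:n, size = n, replace = TRUE)  # one bootstrap sample
  mean(!(1:n %in% I_b))                         # share of observations not drawn
})
mean(oob_share)   # close to exp(-1), i.e. about 36.8%
```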
32. Statistics & ML Toolbox : loss and risk
$$\bar y=\underset{m\in\mathbb R}{\text{argmin}}\;\sum_{i=1}^n\frac1n\big(\underbrace{y_i-m}_{\varepsilon_i}\big)^2$$
is the empirical version of
$$\mathbb E[Y]=\underset{m\in\mathbb R}{\text{argmin}}\;\int\big(\underbrace{y-m}_{\varepsilon}\big)^2dF(y)=\underset{m\in\mathbb R}{\text{argmin}}\;\mathbb E\big[\big(\underbrace{Y-m}_{\varepsilon}\big)^2\big]$$
where Y is a random variable.
Thus,
$$\underset{m:\mathbb R^k\to\mathbb R}{\text{argmin}}\;\sum_{i=1}^n\frac1n\big(\underbrace{y_i-m(x_i)}_{\varepsilon_i}\big)^2$$
is the empirical version of $\mathbb E[Y|X=x]$.
See Legendre (1805, Nouvelles méthodes pour la détermination des orbites des comètes) and Gauß (1809, Theoria motus corporum coelestium in sectionibus conicis solem ambientium).
33. Statistics & ML Toolbox : loss and risk
$$\text{med}[y]\in\underset{m\in\mathbb R}{\text{argmin}}\;\sum_{i=1}^n\frac1n\big|\underbrace{y_i-m}_{\varepsilon_i}\big|$$
is the empirical version of
$$\text{med}[Y]\in\underset{m\in\mathbb R}{\text{argmin}}\;\int\big|\underbrace{y-m}_{\varepsilon}\big|\,dF(y)=\underset{m\in\mathbb R}{\text{argmin}}\;\mathbb E\big[\big|\underbrace{Y-m}_{\varepsilon}\big|\big]$$
where $\mathbb P[Y\le\text{med}[Y]]\ge\frac12$ and $\mathbb P[Y\ge\text{med}[Y]]\ge\frac12$.
$$\underset{m:\mathbb R^k\to\mathbb R}{\text{argmin}}\;\sum_{i=1}^n\frac1n\big|\underbrace{y_i-m(x_i)}_{\varepsilon_i}\big|$$
is the empirical version of $\text{med}[Y|X=x]$.
See Boscovich (1757, De Litteraria expeditione per pontificiam ditionem ad dimetiendos duos meridiani) and Laplace (1793, Sur quelques points du système du monde).
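Both identities can be checked numerically in R: the quadratic loss is minimized at the sample mean, the absolute loss at the sample median (simulated data, for illustration):

```r
set.seed(1)
y <- rexp(200)
optimize(function(m) mean((y - m)^2), interval = range(y))$minimum  # ~ mean(y)
mean(y)
optimize(function(m) mean(abs(y - m)), interval = range(y))$minimum # ~ median(y)
median(y)
```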
34. Statistics & ML Toolbox : loss and risk
For least squares ($\ell_2$ norm), we have a strictly convex problem... classical optimization routines can be used (e.g. gradient descent).
More complicated with the $\ell_1$ norm.
Consider a sample $\{y_1,\cdots,y_n\}$. To compute the median, solve
$$\min_{\mu}\;\sum_{i=1}^n|y_i-\mu|$$
which can be solved using linear programming techniques, see Dantzig (1963, Linear Programming).
More precisely, this problem is equivalent to
$$\min_{\mu,a,b}\;\sum_{i=1}^n a_i+b_i$$
with $a_i,b_i\ge0$ and $y_i-\mu=a_i-b_i$, $\forall i=1,\cdots,n$.
35. Statistics & ML Toolbox : loss and risk
For a quantile of level τ, the linear program is
$$\min_{\mu,a,b}\;\sum_{i=1}^n\tau a_i+(1-\tau)b_i$$
with $a_i,b_i\ge0$ and $y_i-\mu=a_i-b_i$, $\forall i=1,\cdots,n$.
This is extended to the quantile regression,
$$\min_{\beta_\tau,a,b}\;\sum_{i=1}^n\tau a_i+(1-\tau)b_i$$
with $a_i,b_i\ge0$ and $y_i-x_i^\top\beta_\tau=a_i-b_i$, $\forall i=1,\cdots,n$.
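In R, this linear program is what the quantreg package solves (the package, and the simulated data below, are assumptions for illustration); with an intercept-only model, rq() returns the empirical quantile:

```r
library(quantreg)   # assumes the quantreg package is installed
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = .5 + x)   # heteroskedastic noise
rq(y ~ 1, tau = .5)                        # intercept only: the sample median
fit <- rq(y ~ x, tau = c(.1, .5, .9))      # three quantile regression lines
coef(fit)
```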
But is the following problem the one we should really care about?
$$\widehat m=\underset{m\in\mathcal M}{\text{argmin}}\;\sum_{i=1}^n\ell\big(y_i,m(x_i)\big)$$
36. Statistics & ML Toolbox : Penalization
Usually, there are tuning parameters p. To avoid overfitting,
1. use cross-validation:
$$\widehat p=\underset{p}{\text{argmin}}\;\sum_{i\in\text{validation}}\ell\big(y_i,\widehat m_p(x_i)\big),\quad\text{where}\quad\widehat m_p=\underset{m\in\mathcal M_p}{\text{argmin}}\;\sum_{i\in\text{training}}\ell\big(y_i,m(x_i)\big)$$
2. use penalization:
$$\widehat m=\underset{m\in\mathcal M}{\text{argmin}}\;\sum_{i=1}^n\ell\big(y_i,m(x_i)\big)+\text{penalization}(m)$$
37. Statistics & ML Toolbox : Penalization
Penalization often yields a bias... but it might be interesting if the risk is the mean squared error...
Consider a parametric model, with true (unknown) parameter θ; then
$$\text{mse}(\widehat\theta)=\mathbb E\big[(\widehat\theta-\theta)^2\big]=\underbrace{\mathbb E\big[(\widehat\theta-\mathbb E[\widehat\theta])^2\big]}_{\text{variance}}+\underbrace{\big(\mathbb E[\widehat\theta]-\theta\big)^2}_{\text{bias}^2}$$
One can think of a shrinkage of an unbiased estimator. Let $\widetilde\theta$ denote an unbiased estimator of θ; then
$$\widehat\theta=\frac{\theta^2}{\theta^2+\text{mse}(\widetilde\theta)}\cdot\widetilde\theta\;\le\;\widetilde\theta$$
satisfies $\text{mse}(\widehat\theta)\le\text{mse}(\widetilde\theta)$.
38. Statistics & ML Toolbox : Ridge
as in van Wieringen (2018, Lecture Notes on Ridge Regression)
Ridge Estimator (OLS)
$$\widehat\beta_\lambda^{\text{ridge}}=\text{argmin}\Big\{\sum_{i=1}^n(y_i-x_i^\top\beta)^2+\lambda\sum_{j=1}^p\beta_j^2\Big\}$$
$$\widehat\beta_\lambda^{\text{ridge}}=\big(X^\top X+\lambda\mathbb I\big)^{-1}X^\top y$$
Ridge Estimator (GLM)
$$\widehat\beta_\lambda^{\text{ridge}}=\text{argmin}\Big\{-\sum_{i=1}^n\log f\big(y_i\,|\,\mu_i=g^{-1}(x_i^\top\beta)\big)+\frac\lambda2\sum_{j=1}^p\beta_j^2\Big\}$$
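The closed-form estimator above is a one-liner in R; a sketch on simulated data (all names illustrative):

```r
set.seed(1)
n <- 100; p <- 5; lambda <- 2
X <- scale(matrix(rnorm(n * p), n, p))           # centered, scaled design
y <- X %*% c(1, -1, .5, 0, 0) + rnorm(n)
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)      # the lambda = 0 limit
cbind(beta_ols, beta_ridge)                      # ridge coefficients are shrunk
```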
39. Statistics & ML Toolbox : Ridge
Observe that if $x_{j_1}\perp x_{j_2}$, then
$$\widehat\beta_\lambda^{\text{ridge}}=[1+\lambda]^{-1}\widehat\beta^{\text{ols}}$$
which explains the relationship with shrinkage.
But not in the general case...
Smaller mse
There exists λ such that $\text{mse}\big[\widehat\beta_\lambda^{\text{ridge}}\big]\le\text{mse}\big[\widehat\beta^{\text{ols}}\big]$
40. Statistics & ML Toolbox : Ridge
From a Bayesian perspective,
$$\underbrace{\mathbb P[\theta|y]}_{\text{posterior}}\propto\underbrace{\mathbb P[y|\theta]}_{\text{likelihood}}\cdot\underbrace{\mathbb P[\theta]}_{\text{prior}}$$
i.e. (up to an additive constant)
$$\log\mathbb P[\theta|y]=\underbrace{\log\mathbb P[y|\theta]}_{\text{log-likelihood}}+\underbrace{\log\mathbb P[\theta]}_{\text{penalty}}$$
If β has a $\mathcal N(0,\tau^2\mathbb I)$ prior distribution, then its posterior distribution has mean
$$\mathbb E[\beta|y,X]=\Big(X^\top X+\frac{\sigma^2}{\tau^2}\mathbb I\Big)^{-1}X^\top y.$$
41. Statistics & ML Toolbox : lasso
lasso Estimator (OLS)
$$\widehat\beta_\lambda^{\text{lasso}}=\text{argmin}\Big\{\sum_{i=1}^n(y_i-x_i^\top\beta)^2+\lambda\sum_{j=1}^p|\beta_j|\Big\}$$
lasso Estimator (GLM)
$$\widehat\beta_\lambda^{\text{lasso}}=\text{argmin}\Big\{-\sum_{i=1}^n\log f\big(y_i\,|\,\mu_i=g^{-1}(x_i^\top\beta)\big)+\frac\lambda2\sum_{j=1}^p|\beta_j|\Big\}$$
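In practice the lasso is fitted by coordinate descent (see below); a sketch with the glmnet package (assumed installed), where alpha = 1 gives the lasso penalty and alpha = 0 the ridge one:

```r
library(glmnet)   # assumes the glmnet package is installed
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)   # only two relevant features
fit <- glmnet(X, y, alpha = 1)        # alpha = 1: lasso; alpha = 0: ridge
cv  <- cv.glmnet(X, y, alpha = 1)     # choose lambda by cross-validation
coef(fit, s = cv$lambda.min)          # sparse coefficient vector
```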
42. Statistics & ML Toolbox : sparsity
In several applications, k can be (very) large, but a lot of features are just noise: $\beta_j=0$ for many j's. Let s denote the number of relevant features, with $s\ll k$, cf. Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity),
$$s=\text{card}\{\mathcal S\}\quad\text{where}\quad\mathcal S=\{j;\beta_j\neq0\}$$
The model is now $y=X_{\mathcal S}\beta_{\mathcal S}+\varepsilon$, where $X_{\mathcal S}^\top X_{\mathcal S}$ is a full-rank matrix.
43. Statistics & ML Toolbox : sparsity
$\ell_0$-regularization Estimator (OLS)
$$\widehat\beta_\lambda=\text{argmin}\Big\{\sum_{i=1}^n(y_i-x_i^\top\beta)^2+\lambda\|\beta\|_{\ell_0}\Big\}$$
where $\|a\|_{\ell_0}=\sum_j\mathbf 1(|a_j|>0)$ counts the non-zero components.
More generally,
$\ell_p$-regularization Estimator (OLS)
$$\widehat\beta_\lambda=\text{argmin}\Big\{\sum_{i=1}^n(y_i-x_i^\top\beta)^2+\lambda\|\beta\|_{\ell_p}\Big\}$$
• sparsity is obtained when p ≤ 1
• convexity is obtained when p ≥ 1
44. Statistics & ML Toolbox : sparsity
source: Hastie et al. (2009, The Elements of Statistical Learning)
45. Statistics & ML Toolbox : lasso
In the orthogonal case, $X^\top X=\mathbb I$,
$$\widehat\beta_{k,\lambda}^{\text{lasso}}=\text{sign}\big(\widehat\beta_k^{\text{ols}}\big)\Big(\big|\widehat\beta_k^{\text{ols}}\big|-\frac\lambda2\Big)_+$$
i.e. the lasso estimate is related to the soft-threshold function...
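The soft-threshold function is one line of R:

```r
# soft-threshold: sign(b) * (|b| - lambda/2)_+
soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)
soft(c(-3, -.1, .2, 2), lambda = 1)   # small coefficients are set exactly to 0
```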
46. Numerical Toolbox
LASSO Coordinate Descent Algorithm
1. Set $\beta^{(0)}=\widehat\beta$
2. For $k=1,2,\cdots$, for $j=1,\cdots,p$:
(i) compute the partial residual $R_j=x_j^\top\big(y-X_{(-j)}\beta^{(k-1)}_{(-j)}\big)$
(ii) set $\beta^{(k)}_j=R_j\Big(1-\frac{\lambda}{2|R_j|}\Big)_+$
3. The final estimate $\beta^{(\kappa)}$ is $\widehat\beta_\lambda$
The softplus function $\log(1+e^{x-s})$ can be used as a smooth version of $(x-s)_+$
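A minimal R implementation of this algorithm (a sketch: it assumes the columns of X are scaled to unit $\ell_2$ norm, so the soft-threshold update above applies directly, and runs a fixed number of sweeps instead of checking convergence):

```r
# Coordinate descent for the lasso: min sum (y - X beta)^2 + lambda |beta|_1
lasso_cd <- function(X, y, lambda, n_sweeps = 100) {
  p <- ncol(X)
  beta <- rep(0, p)                    # start from beta = 0
  for (k in 1:n_sweeps) {
    for (j in 1:p) {
      R_j <- sum(X[, j] * (y - X[, -j, drop = FALSE] %*% beta[-j]))
      beta[j] <- sign(R_j) * max(abs(R_j) - lambda / 2, 0)   # soft threshold
    }
  }
  beta
}

set.seed(1)
n <- 100; p <- 5
X <- apply(matrix(rnorm(n * p), n, p), 2, function(x) x / sqrt(sum(x^2)))
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n, sd = .1)
lasso_cd(X, y, lambda = 1)   # typically only the first two coefficients survive
```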
47. Statistics & ML Toolbox : trees
Use recursive binary splitting to grow a tree
Trees (CART)
Each interior node corresponds to one of the input
variables; there are edges to children for each of
the possible values of that input variable. Each
leaf represents a value of the target variable given
the values of the input variables represented by
the path from the root to the leaf.
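A sketch of recursive binary splitting with the rpart package (assumed installed), on R's built-in iris data:

```r
library(rpart)    # assumes the rpart package is installed
tree <- rpart(Species ~ ., data = iris)   # grow a tree by recursive binary splitting
print(tree)   # interior nodes: splits on input variables; leaves: predicted class
predict(tree, newdata = iris[c(1, 51, 101), ], type = "class")
```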
48. Statistics & ML Toolbox : forests
Bagging (Bootstrap + Aggregation)
1. For $k=1,\cdots,\kappa$:
(i) draw a bootstrap sample from the $(y_i,x_i)$'s
(ii) estimate a model $\widehat m_k$ on that sample
2. The final model is
$$\widehat m(\cdot)=\frac1\kappa\sum_{k=1}^\kappa\widehat m_k(\cdot)$$
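A minimal sketch of bagging with regression trees (rpart assumed installed; κ = 100 and the simulated data are illustrative choices):

```r
library(rpart)    # assumes the rpart package is installed
set.seed(1)
n <- 300
df <- data.frame(x = runif(n))
df$y <- sin(2 * pi * df$x) + rnorm(n, sd = .3)
kappa <- 100
models <- lapply(1:kappa, function(k) {
  idx <- sample(1:n, replace = TRUE)    # (i) a bootstrap sample
  rpart(y ~ x, data = df[idx, ])        # (ii) one tree per sample
})
m_bag <- function(newdf)                # final model: average of the kappa trees
  rowMeans(sapply(models, predict, newdata = newdf))
m_bag(data.frame(x = c(.1, .5, .9)))
```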
50. Statistics & ML Toolbox : Neural Nets
Very popular for pictures...
A picture $x_i$ is
• an $n\times n$ matrix in $\{0,1\}^{n^2}$ for black & white
• an $n\times n$ matrix in $[0,1]^{n^2}$ for grey-scale
• a $3\times n\times n$ array in $([0,1]^3)^{n^2}$ for color
• a $T\times3\times n\times n$ tensor in $\big(([0,1]^3)^T\big)^{n^2}$ for video
y here is the label ("8", "9", "6", etc.)
Suppose we want to recognize a "6" on a picture:
$$m(x)=\begin{cases}+1&\text{if }x\text{ is a "6"}\\-1&\text{otherwise}\end{cases}$$
51. Statistics & ML Toolbox : Neural Nets
Consider some specific pixels, and associate weights ω such that
$$m(x)=\text{sign}\Big(\sum_{i,j}\omega_{i,j}x_{i,j}\Big)$$
where
$$x_{i,j}=\begin{cases}+1&\text{if pixel }(i,j)\text{ is black}\\-1&\text{if pixel }(i,j)\text{ is white}\end{cases}$$
for some weights $\omega_{i,j}$ (that can be negative...)
52. Statistics & ML Toolbox : Neural Nets
A deep network is a network with a lot of layers,
$$m(x)=\text{sign}\Big(\sum_i\omega_i m_i(x)\Big)$$
where the $m_i$'s are outputs of previous neural nets.
Those layers can capture shapes in some areas: nonlinearities, cross-dependence, etc.
53. Binary Threshold Neuron & Perceptron
If $x\in\{0,1\}^p$, McCulloch & Pitts (1943, A logical calculus of the ideas immanent in nervous activity) suggested a simple model, with threshold b,
$$y_i=f\Big(\sum_{j=1}^p x_{j,i}\Big)\quad\text{where }f(x)=\mathbf 1(x\ge b)$$
where $\mathbf 1(x\ge b)=+1$ if $x\ge b$ and 0 otherwise, or (equivalently)
$$y_i=f\Big(\omega+\sum_{j=1}^p x_{j,i}\Big)\quad\text{where }f(x)=\mathbf 1(x\ge0)$$
with weight $\omega=-b$. The trick of adding 1 as an input was very important!
$\omega=-1$ is the or logical operator: $y_i=1$ if $\exists j$ such that $x_{j,i}=1$
$\omega=-p$ is the and logical operator: $y_i=1$ if $\forall j$, $x_{j,i}=1$
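A two-line R version of this threshold unit, checking the or and and cases:

```r
neuron <- function(x, omega) as.numeric(omega + sum(x) >= 0)
x <- c(1, 0, 0)                 # p = 3 binary inputs
neuron(x, omega = -1)           # or : returns 1, at least one x_j equals 1
neuron(x, omega = -length(x))   # and: returns 0, not all x_j equal 1
```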
54. Binary Threshold Neuron & Perceptron
but not possible for the xor logical operator: $y_i=1$ if $x_{1,i}\neq x_{2,i}$
Rosenblatt (1961, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms) considered the extension where the x's are real-valued, with weights $\omega\in\mathbb R^p$,
$$y_i=f\Big(\sum_{j=1}^p\omega_j x_{j,i}\Big)\quad\text{where }f(x)=\mathbf 1(x\ge b)$$
where $\mathbf 1(x\ge b)=+1$ if $x\ge b$ and 0 otherwise, or (equivalently)
$$y_i=f\Big(\omega_0+\sum_{j=1}^p\omega_j x_{j,i}\Big)\quad\text{where }f(x)=\mathbf 1(x\ge0)$$
with weights $\omega\in\mathbb R^{p+1}$
55. Binary Threshold Neuron & Perceptron
Minsky & Papert (1969, Perceptrons: An Introduction to Computational Geometry) proved that the perceptron is a linear separator, and therefore not very powerful.
Define the sigmoid function
$$f(x)=\frac1{1+e^{-x}}=\frac{e^x}{e^x+1}$$
(= logistic function). This function f is called the activation function.
If $y\in\{-1,+1\}$, one can consider the hyperbolic tangent
$$f(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$$
or the inverse tangent function (see wikipedia).
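These activation functions in R (tanh is built in):

```r
sigmoid <- function(x) 1 / (1 + exp(-x))   # logistic: values in (0, 1)
curve(sigmoid, -5, 5)
curve(tanh, -5, 5, add = TRUE, lty = 2)    # hyperbolic tangent: values in (-1, 1)
```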
56. Statistics & ML Toolbox : Neural Nets
So here, for a classification problem,
$$y_i=f\Big(\omega_0+\sum_{j=1}^p\omega_j x_{j,i}\Big)=p(x_i)$$
The error of that model can be computed using the quadratic loss function,
$$\sum_{i=1}^n\big(y_i-p(x_i)\big)^2$$
or the cross-entropy,
$$-\sum_{i=1}^n\big(y_i\log p(x_i)+[1-y_i]\log[1-p(x_i)]\big)$$
57. Statistics & ML Toolbox : Neural Nets
Consider a single hidden layer, with 3 different neurons, so that
$$p(x)=f\Big(\omega_0+\sum_{h=1}^3\omega_h f_h\Big(\omega_{h,0}+\sum_{j=1}^p\omega_{h,j}x_j\Big)\Big)$$
or
$$p(x)=f\Big(\omega_0+\sum_{h=1}^3\omega_h f_h\big(\omega_{h,0}+x^\top\omega_h\big)\Big)$$
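The forward pass of this one-hidden-layer network, as a sketch with random weights (all values illustrative, taking $f_h=f=$ sigmoid):

```r
sigmoid <- function(x) 1 / (1 + exp(-x))
set.seed(1)
p <- 4                                 # input dimension
W  <- matrix(rnorm(3 * p), nrow = 3)   # omega_{h,j}: hidden-layer weights
b  <- rnorm(3)                         # omega_{h,0}: hidden-layer intercepts
w  <- rnorm(3); w0 <- rnorm(1)         # omega_h and omega_0: output layer
p_x <- function(x)                     # p(x) = f(w0 + sum_h w_h f(b_h + x' W_h))
  sigmoid(w0 + sum(w * sigmoid(b + W %*% x)))
p_x(rnorm(p))                          # a value in (0, 1)
```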