1. Use & Value of Unusual Data
Arthur Charpentier
UQAM
Actuarial Summer School 2019
2. The Team for the Week
Jean-Philippe Boucher, UQAM (Montréal, Canada)
Professor, holder of The Co-operators Chair in Actuarial Risk Analysis, author of several research articles in actuarial science.
@J P Boucher jpboucher.uqam.ca
Arthur Charpentier, UQAM (Montréal, Canada)
Professor, editor of Computational Actuarial Science with R. Former director of the Data Science for Actuaries program of the French Institute of Actuaries.
@freakonometrics freakonometrics freakonometrics.hypotheses.org
Ewen Gallic, AMSE (Aix-Marseille, France)
Assistant professor, teaching computational econometrics, data science and machine learning.
@3wen 3wen egallic.fr
4. General Context (in Insurance)
Networks and P2P Insurance (via https://sharable.life and source)
5. General Context (in Insurance)
Networks and P2P Insurance, via Insurance Business (2016)
6. General Context (in Insurance)
Telematics and Usage-Based Insurance (source, source, The Actuary)
7. General Context (in Insurance)
Satellite pictures (source) and Index-based Flood Insurance (IBFI, http://ibfi.iwmi.org/)
8. General Context (in Insurance)
Very small granularity (1 pixel is 15 × 15 cm², see detail)
9. More and More (exponentially) Data
In 1969, (legend says) 5 photographs were taken by Neil Armstrong.
More than 800,000,000 pictures are now uploaded on Facebook every day (2 billion photos per day across Instagram, Messenger and WhatsApp).
10. Data Complexity
Pictures, drawings, videos, handwriting, text, ECG, medical imagery, satellite pictures, tweets, phone calls, etc.
Hard to distinguish structured and unstructured data...
12. Insurance Data
From questionnaires to data frames $x_{i,j}$
row i: insured, policy, claim
column j: variables (features)
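As a minimal sketch in R (all values made up), such a table is simply a data frame, one row per policy and one column per feature:

```r
# Hypothetical toy portfolio (every value here is made up)
df <- data.frame(
  policy    = 1:5,                               # row i: one policy
  age       = c(23, 45, 37, 61, 29),             # column j: a feature
  region    = factor(c("A", "B", "A", "C", "B")),
  exposure  = c(1, .5, 1, 1, .25),               # fraction of the year covered
  nb_claims = c(0, 1, 0, 2, 0)                   # observed claim counts
)
str(df)   # x[i, j]: row i = policy, column j = variable
```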
13. From Pictures to Data
Consider pictures saa-iss.ch/testalbum/nggallery/all-old-galleries/
14. From Pictures to Data
Face detection face-api.js (justadudewhohacks.github.io)
15. From Pictures to Data
Expression recognition face-api.js (justadudewhohacks.github.io)
16. From Pictures to Data
Face recognition face-api.js (justadudewhohacks.github.io)
17. Pictures
Extracting information from a picture
What is on the picture?
What is the cost of such a claim?
High-dimensional data:
an h × w color picture, e.g. 850 × 1280, gives $x_i\in\mathbb R^d$ with d > 3 million
Need new statistical tools:
(deep) neural networks perform well,
or reduce the dimension
Pictures: Odermatt (2003, Karambolage)
18. Text
Kaufman et al. (2016, Natural Language Processing-Enabled and
Conventional Data Capture Methods for Input to Electronic Health
Records), Chen et al. (2018, A Natural Language Processing
System That Links Medical Terms in Electronic Health Record
Notes to Lay Definitions) or Simon (2018, Natural Language
Processing for Healthcare Customers)
19. Statistics & ML Toolbox : PCA
Projection method, used to reduce dimensionality
Principal Component Analysis (PCA)
find a small number of directions in input space that explain variation in input data
represent data by projecting along those directions
n observations in dimension d, stored in a matrix $X\in\mathcal M_{n\times d}$
Natural idea: linearly project (multiply by a projection matrix) to a much lower dimensional space, k < d
Search for orthogonal directions in space with highest variance
Information is stored in the covariance matrix $\Sigma=X^\top X$ when variables are centered
20. Statistics & ML Toolbox : PCA
Find the k largest eigenvectors of Σ: the principal components
Assemble these eigenvectors into a d × k matrix $\widetilde\Sigma$
Express d-dimensional vectors x by projecting them onto the k-dimensional $\widetilde x=\widetilde\Sigma^\top x$
We have data $X\in\mathcal M_{n\times d}$, i.e. n vectors $x_i\in\mathbb R^d$.
Let $\omega_1\in\mathbb R^d$ denote weights. We want to maximize the variance of $z=\omega_1^\top x$:
$$\max\;\frac{1}{n}\sum_{i=1}^n\big(\omega_1^\top x_i-\omega_1^\top\bar x\big)^2=\max\;\omega_1^\top\Sigma\,\omega_1\quad\text{s.t. }\|\omega_1\|=1$$
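A minimal R sketch of this construction on simulated data (all names are illustrative): center X, form Σ, and extract the leading eigenvector:

```r
set.seed(1)
n <- 200; d <- 5
X <- matrix(rnorm(n * d), n, d)
X[, 2] <- X[, 1] + rnorm(n, sd = .3)          # induce some correlation
Xc <- scale(X, center = TRUE, scale = FALSE)  # center the variables
Sigma <- crossprod(Xc) / n                    # covariance matrix t(Xc) %*% Xc / n
e <- eigen(Sigma)
omega1 <- e$vectors[, 1]                      # direction of largest variance
z <- Xc %*% omega1                            # first principal component scores
sum(z^2) / n                                  # equals e$values[1]
```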
21. Statistics & ML Toolbox : PCA
To solve
$$\max\;\frac{1}{n}\sum_{i=1}^n\big(\omega_1^\top x_i-\omega_1^\top\bar x\big)^2=\max\;\omega_1^\top\Sigma\,\omega_1\quad\text{s.t. }\|\omega_1\|=1$$
we can use a Lagrange multiplier, and solve the unconstrained optimization problem
$$\max_{\omega_1,\lambda_1}\;\omega_1^\top\Sigma\,\omega_1+\lambda_1\big(1-\omega_1^\top\omega_1\big)$$
Differentiating and setting the derivative to zero gives $\Sigma\omega_1-\lambda_1\omega_1=0$, i.e. $\Sigma\omega_1=\lambda_1\omega_1$: $\omega_1$ is an eigenvector of Σ, with the Lagrange multiplier as eigenvalue. It should be the one associated with the largest eigenvalue.
22. Statistics & ML Toolbox : PCA
$$\max\;\omega_2^\top\Sigma\,\omega_2\quad\text{subject to }\|\omega_2\|=1\text{ and }\omega_2^\top\omega_1=0$$
The Lagrangian is $\omega_2^\top\Sigma\,\omega_2+\lambda_2(1-\omega_2^\top\omega_2)-\lambda_1\omega_2^\top\omega_1$;
differentiating, we get $\Sigma\omega_2=\lambda_2\omega_2$, where $\lambda_2$ is the second largest eigenvalue.
Set $z=Wx$ and $\widetilde x=Mz$. We want to solve (where $\|\cdot\|$ is the standard $\ell_2$ norm)
$$\min_{W,M}\;\sum_{i=1}^n\big\|x_i-\widetilde x_i\big\|^2,\quad\text{i.e.}\quad\min_{W,M}\;\sum_{i=1}^n\big\|x_i-MWx_i\big\|^2$$
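The same computation with prcomp, checking the rank-k reconstruction error (a sketch, on simulated data):

```r
set.seed(1)
n <- 200; d <- 5; k <- 2
X <- matrix(rnorm(n * d), n, d)
Xc <- scale(X, center = TRUE, scale = FALSE)
pc <- prcomp(X, center = TRUE, scale. = FALSE)
W <- t(pc$rotation[, 1:k])       # k x d: rows are the k leading eigenvectors
Z <- Xc %*% t(W)                 # scores z = W x
Xtilde <- Z %*% W                # reconstruction x~ = M z, with M = t(W)
mean(rowSums((Xc - Xtilde)^2))   # average squared reconstruction error
```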
23. Statistics & ML Toolbox : GLM
Generalized Linear Model
Flexible generalization of linear regression that allows y to have a distribution other than the normal distribution.
The classical regression is (in matrix form)
$$\underset{n\times1}{y}=\underset{n\times k}{X}\;\underset{k\times1}{\beta}+\underset{n\times1}{\varepsilon}\qquad\text{or}\qquad y_i=x_i^\top\beta+\varepsilon_i$$
$$Y_i\sim\mathcal N(\mu_i,\sigma^2),\quad\text{with }\mu_i=\mathbb E[Y_i|x_i]=x_i^\top\beta$$
It is possible to consider a more general regression. For a classification, $y\in\{0,1\}^n$,
$$Y_i\sim\mathcal B(p_i),\quad\text{with }p_i=\mathbb E[Y_i|x_i]=\frac{e^{x_i^\top\beta}}{1+e^{x_i^\top\beta}}$$
Recall that we assumed that the $(Y_i,X_i)$ were i.i.d. with unknown distribution P.
24. Statistics & ML Toolbox : GLM
In the (standard) linear model, we assume that $(Y|X=x)\sim\mathcal N(x^\top\beta,\sigma^2)$.
The maximum-likelihood estimator of β is then
$$\widehat\beta=\underset{\beta\in\mathbb R^{p+1}}{\text{argmax}}\;\Big\{-\sum_{i=1}^n(y_i-x_i^\top\beta)^2\Big\}$$
For the logistic regression, $(Y|X=x)\sim\mathcal B\big(\text{logit}^{-1}(x^\top\beta)\big)$.
The maximum-likelihood estimator of β is then
$$\widehat\beta=\underset{\beta\in\mathbb R^{p+1}}{\text{argmax}}\;\sum_{i=1}^n y_i\log\big(\text{logit}^{-1}(x_i^\top\beta)\big)+(1-y_i)\log\big(1-\text{logit}^{-1}(x_i^\top\beta)\big)$$
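In R, both estimators are available through glm; the sketch below (on simulated data, all names illustrative) also re-derives the logistic MLE by direct maximization of the log-likelihood, as a check:

```r
set.seed(1)
n <- 500
x <- rnorm(n)
p <- exp(1 - 2 * x) / (1 + exp(1 - 2 * x))     # logit^{-1}(beta0 + beta1 x)
y <- rbinom(n, size = 1, prob = p)
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                                      # maximum-likelihood estimates

# the same estimator, maximizing the log-likelihood directly
loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  sum(y * eta - log(1 + exp(eta)))             # = y log(p) + (1 - y) log(1 - p)
}
optim(c(0, 0), loglik, control = list(fnscale = -1))$par
```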
25. Statistics & ML Toolbox : loss and risk
In a general setting, consider some loss function $\ell:\mathbb R\times\mathbb R\to\mathbb R_+$, e.g. $\ell(y,m)=(y-m)^2$ (quadratic, $\ell_2$ norm)
Average Risk
The average risk associated with $m_n$ is
$$R_P(m_n)=\mathbb E_P\big[\ell\big(Y,m_n(X)\big)\big]$$
26. Statistics & ML Toolbox : loss and risk
In a general setting, consider some loss function $\ell:\mathbb R\times\mathbb R\to\mathbb R_+$, e.g. $\ell(y,m)=(y-m)^2$ (quadratic, $\ell_2$ norm)
Empirical Risk
The empirical risk associated with $m_n$ is
$$\widehat R_n(m_n)=\frac1n\sum_{i=1}^n\ell\big(y_i,m_n(x_i)\big).$$
27. Statistics & ML Toolbox : loss and risk
If we minimize the empirical risk, we overfit...
Hold-Out Cross Validation
1. Split $\{1,2,\cdots,n\}$ into T (training) and V (validation)
2. Estimate m on the sample $(y_i,x_i)$, $i\in T$:
$$\widehat m_T=\underset{m\in\mathcal M}{\text{argmin}}\;\frac1{|T|}\sum_{i\in T}\ell\big(y_i,m(x_{1,i},\cdots,x_{p,i})\big)$$
3. Compute
$$\frac1{|V|}\sum_{i\in V}\ell\big(y_i,\widehat m_T(x_{1,i},\cdots,x_{p,i})\big)$$
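A sketch of hold-out validation in R, with quadratic loss (the 70/30 split and the simulated data are arbitrary choices):

```r
set.seed(1)
n <- 300
df <- data.frame(x = runif(n))
df$y <- sin(2 * pi * df$x) + rnorm(n, sd = .3)
T_idx <- sample(1:n, size = round(.7 * n))   # T: training indices
V_idx <- setdiff(1:n, T_idx)                 # V: validation indices
m_T <- lm(y ~ poly(x, 5), data = df[T_idx, ])
pred <- predict(m_T, newdata = df[V_idx, ])
mean((df$y[V_idx] - pred)^2)                 # empirical risk on V
```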
28. Statistics & ML Toolbox : loss and risk
source: Hastie et al. (2009, The Elements of Statistical Learning)
29. Statistics & ML Toolbox : loss and risk
Leave-One-Out Cross Validation
1. Estimate n models: estimate $\widehat m_{-j}$ on the sample $(y_i,x_i)$, $i\in\{1,\cdots,n\}\setminus\{j\}$:
$$\widehat m_{-j}=\underset{m\in\mathcal M}{\text{argmin}}\;\frac1{n-1}\sum_{i\neq j}\ell\big(y_i,m(x_{1,i},\cdots,x_{p,i})\big)$$
2. Compute
$$\frac1n\sum_{i=1}^n\ell\big(y_i,\widehat m_{-i}(x_i)\big)$$
30. Statistics & ML Toolbox : loss and risk
K-Fold Cross Validation
1. Split $\{1,2,\cdots,n\}$ into K groups $V_1,\cdots,V_K$
2. Estimate K models: estimate $\widehat m_k$ on the sample $(y_i,x_i)$, $i\in\{1,\cdots,n\}\setminus V_k$
3. Compute
$$\frac1K\sum_{k=1}^K\frac1{|V_k|}\sum_{i\in V_k}\ell\big(y_i,\widehat m_k(x_{1,i},\cdots,x_{p,i})\big)$$
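The same idea with K = 10 folds, as a sketch:

```r
set.seed(1)
n <- 300; K <- 10
df <- data.frame(x = runif(n))
df$y <- sin(2 * pi * df$x) + rnorm(n, sd = .3)
fold <- sample(rep(1:K, length.out = n))     # assign each i to a fold V_k
risks <- sapply(1:K, function(k) {
  m_k <- lm(y ~ poly(x, 5), data = df[fold != k, ])   # fit on {1,...,n} \ V_k
  mean((df$y[fold == k] - predict(m_k, newdata = df[fold == k, ]))^2)
})
mean(risks)                                  # K-fold cross-validated risk
```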
31. Statistics & ML Toolbox : loss and risk
Bootstrap Cross Validation
1. Generate B bootstrap samples from $\{1,2,\cdots,n\}$: $I_1,\cdots,I_B$
2. Estimate B models: $\widehat m_b$ on the sample $(y_i,x_i)$, $i\in I_b$
3. Compute
$$\frac1B\sum_{b=1}^B\frac1{n-|I_b|}\sum_{i\notin I_b}\ell\big(y_i,\widehat m_b(x_{1,i},\cdots,x_{p,i})\big)$$
The probability that $i\notin I_b$ is $\big(1-\frac1n\big)^n\sim e^{-1}$ ($\approx36.8\%$).
At stage b, we validate on ∼36.8% of the dataset.
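The ≈36.8% share of out-of-bootstrap observations is easy to verify by simulation:

```r
set.seed(1)
n <- 1000
oob_share <- replicate(500, {
  I_b <- sample(1:n, size = n, replace = TRUE)  # one bootstrap sample
  mean(!(1:n %in% I_b))                         # share of observations not drawn
})
mean(oob_share)   # close to exp(-1), i.e. about 36.8%
```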
32. Statistics & ML Toolbox : loss and risk
$$\bar y=\underset{m\in\mathbb R}{\text{argmin}}\;\sum_{i=1}^n\frac1n\big(\underbrace{y_i-m}_{\varepsilon_i}\big)^2$$
is the empirical version of
$$\mathbb E[Y]=\underset{m\in\mathbb R}{\text{argmin}}\;\int\big(\underbrace{y-m}_{\varepsilon}\big)^2dF(y)=\underset{m\in\mathbb R}{\text{argmin}}\;\mathbb E\big[\big(\underbrace{Y-m}_{\varepsilon}\big)^2\big]$$
where Y is a random variable.
Thus,
$$\underset{m:\mathbb R^k\to\mathbb R}{\text{argmin}}\;\sum_{i=1}^n\frac1n\big(\underbrace{y_i-m(x_i)}_{\varepsilon_i}\big)^2$$
is the empirical version of $\mathbb E[Y|X=x]$.
See Legendre (1805, Nouvelles méthodes pour la détermination des orbites des comètes) and Gauß (1809, Theoria motus corporum coelestium in sectionibus conicis solem ambientium).
33. Statistics & ML Toolbox : loss and risk
$$\text{med}[y]\in\underset{m\in\mathbb R}{\text{argmin}}\;\sum_{i=1}^n\frac1n\big|\underbrace{y_i-m}_{\varepsilon_i}\big|$$
is the empirical version of
$$\text{med}[Y]\in\underset{m\in\mathbb R}{\text{argmin}}\;\int\big|\underbrace{y-m}_{\varepsilon}\big|\,dF(y)=\underset{m\in\mathbb R}{\text{argmin}}\;\mathbb E\big[\big|\underbrace{Y-m}_{\varepsilon}\big|\big]$$
where $\mathbb P[Y\le\text{med}[Y]]\ge\frac12$ and $\mathbb P[Y\ge\text{med}[Y]]\ge\frac12$.
$$\underset{m:\mathbb R^k\to\mathbb R}{\text{argmin}}\;\sum_{i=1}^n\frac1n\big|\underbrace{y_i-m(x_i)}_{\varepsilon_i}\big|$$
is the empirical version of $\text{med}[Y|X=x]$.
See Boscovich (1757, De Litteraria expeditione per pontificiam ditionem ad dimetiendos duos meridiani) and Laplace (1793, Sur quelques points du système du monde).
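Both identities can be checked numerically in R: the quadratic loss is minimized at the sample mean, the absolute loss at the sample median (simulated data, for illustration):

```r
set.seed(1)
y <- rexp(200)
optimize(function(m) mean((y - m)^2), interval = range(y))$minimum  # ~ mean(y)
mean(y)
optimize(function(m) mean(abs(y - m)), interval = range(y))$minimum # ~ median(y)
median(y)
```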
34. Statistics & ML Toolbox : loss and risk
For least squares ($\ell_2$ norm), we have a strictly convex problem... classical optimization routines can be used (e.g. gradient descent).
More complicated with the $\ell_1$ norm.
Consider a sample $\{y_1,\cdots,y_n\}$. To compute the median, solve
$$\min_{\mu}\;\sum_{i=1}^n|y_i-\mu|$$
which can be solved using linear programming techniques, see Dantzig (1963, Linear Programming).
More precisely, this problem is equivalent to
$$\min_{\mu,a,b}\;\sum_{i=1}^n a_i+b_i$$
with $a_i,b_i\ge0$ and $y_i-\mu=a_i-b_i$, $\forall i=1,\cdots,n$.
35. Statistics & ML Toolbox : loss and risk
For a quantile of level τ, the linear program is
$$\min_{\mu,a,b}\;\sum_{i=1}^n\tau a_i+(1-\tau)b_i$$
with $a_i,b_i\ge0$ and $y_i-\mu=a_i-b_i$, $\forall i=1,\cdots,n$.
This is extended to the quantile regression,
$$\min_{\beta_\tau,a,b}\;\sum_{i=1}^n\tau a_i+(1-\tau)b_i$$
with $a_i,b_i\ge0$ and $y_i-x_i^\top\beta_\tau=a_i-b_i$, $\forall i=1,\cdots,n$.
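In R, this linear program is what the quantreg package solves (the package, and the simulated data below, are assumptions for illustration); with an intercept-only model, rq() returns the empirical quantile:

```r
library(quantreg)   # assumes the quantreg package is installed
set.seed(1)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = .5 + x)   # heteroskedastic noise
rq(y ~ 1, tau = .5)                        # intercept only: the sample median
fit <- rq(y ~ x, tau = c(.1, .5, .9))      # three quantile regression lines
coef(fit)
```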
But is the following problem the one we should really care about?
$$\widehat m=\underset{m\in\mathcal M}{\text{argmin}}\;\sum_{i=1}^n\ell\big(y_i,m(x_i)\big)$$
36. Statistics & ML Toolbox : Penalization
Usually, there are tuning parameters p. To avoid overfitting,
1. use cross-validation:
$$\widehat p=\underset{p}{\text{argmin}}\;\sum_{i\in\text{validation}}\ell\big(y_i,\widehat m_p(x_i)\big),\quad\text{where}\quad\widehat m_p=\underset{m\in\mathcal M_p}{\text{argmin}}\;\sum_{i\in\text{training}}\ell\big(y_i,m(x_i)\big)$$
2. use penalization:
$$\widehat m=\underset{m\in\mathcal M}{\text{argmin}}\;\sum_{i=1}^n\ell\big(y_i,m(x_i)\big)+\text{penalization}(m)$$
37. Statistics & ML Toolbox : Penalization
Penalization often yields a bias... but it might be interesting if the risk is the mean squared error...
Consider a parametric model, with true (unknown) parameter θ; then
$$\text{mse}(\widehat\theta)=\mathbb E\big[(\widehat\theta-\theta)^2\big]=\underbrace{\mathbb E\big[(\widehat\theta-\mathbb E[\widehat\theta])^2\big]}_{\text{variance}}+\underbrace{\big(\mathbb E[\widehat\theta]-\theta\big)^2}_{\text{bias}^2}$$
One can think of a shrinkage of an unbiased estimator. Let $\widetilde\theta$ denote an unbiased estimator of θ; then
$$\widehat\theta=\frac{\theta^2}{\theta^2+\text{mse}(\widetilde\theta)}\cdot\widetilde\theta\;\le\;\widetilde\theta$$
satisfies $\text{mse}(\widehat\theta)\le\text{mse}(\widetilde\theta)$.
38. Statistics & ML Toolbox : Ridge
as in van Wieringen (2018, Lecture Notes on Ridge Regression)
Ridge Estimator (OLS)
$$\widehat\beta_\lambda^{\text{ridge}}=\text{argmin}\Big\{\sum_{i=1}^n(y_i-x_i^\top\beta)^2+\lambda\sum_{j=1}^p\beta_j^2\Big\}$$
$$\widehat\beta_\lambda^{\text{ridge}}=\big(X^\top X+\lambda\mathbb I\big)^{-1}X^\top y$$
Ridge Estimator (GLM)
$$\widehat\beta_\lambda^{\text{ridge}}=\text{argmin}\Big\{-\sum_{i=1}^n\log f\big(y_i\,|\,\mu_i=g^{-1}(x_i^\top\beta)\big)+\frac\lambda2\sum_{j=1}^p\beta_j^2\Big\}$$
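The closed-form estimator above is a one-liner in R; a sketch on simulated data (all names illustrative):

```r
set.seed(1)
n <- 100; p <- 5; lambda <- 2
X <- scale(matrix(rnorm(n * p), n, p))           # centered, scaled design
y <- X %*% c(1, -1, .5, 0, 0) + rnorm(n)
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)      # the lambda = 0 limit
cbind(beta_ols, beta_ridge)                      # ridge coefficients are shrunk
```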
39. Statistics & ML Toolbox : Ridge
Observe that if $x_{j_1}\perp x_{j_2}$, then
$$\widehat\beta_\lambda^{\text{ridge}}=[1+\lambda]^{-1}\widehat\beta^{\text{ols}}$$
which explains the relationship with shrinkage.
But not in the general case...
Smaller mse
There exists λ such that $\text{mse}\big[\widehat\beta_\lambda^{\text{ridge}}\big]\le\text{mse}\big[\widehat\beta^{\text{ols}}\big]$
40. Statistics & ML Toolbox : Ridge
From a Bayesian perspective,
$$\underbrace{\mathbb P[\theta|y]}_{\text{posterior}}\propto\underbrace{\mathbb P[y|\theta]}_{\text{likelihood}}\cdot\underbrace{\mathbb P[\theta]}_{\text{prior}}$$
i.e. (up to an additive constant)
$$\log\mathbb P[\theta|y]=\underbrace{\log\mathbb P[y|\theta]}_{\text{log-likelihood}}+\underbrace{\log\mathbb P[\theta]}_{\text{penalty}}$$
If β has a $\mathcal N(0,\tau^2\mathbb I)$ prior distribution, then its posterior distribution has mean
$$\mathbb E[\beta|y,X]=\Big(X^\top X+\frac{\sigma^2}{\tau^2}\mathbb I\Big)^{-1}X^\top y.$$
41. Statistics & ML Toolbox : lasso
lasso Estimator (OLS)
$$\widehat\beta_\lambda^{\text{lasso}}=\text{argmin}\Big\{\sum_{i=1}^n(y_i-x_i^\top\beta)^2+\lambda\sum_{j=1}^p|\beta_j|\Big\}$$
lasso Estimator (GLM)
$$\widehat\beta_\lambda^{\text{lasso}}=\text{argmin}\Big\{-\sum_{i=1}^n\log f\big(y_i\,|\,\mu_i=g^{-1}(x_i^\top\beta)\big)+\frac\lambda2\sum_{j=1}^p|\beta_j|\Big\}$$
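In practice the lasso is fitted by coordinate descent (see below); a sketch with the glmnet package (assumed installed), where alpha = 1 gives the lasso penalty and alpha = 0 the ridge one:

```r
library(glmnet)   # assumes the glmnet package is installed
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(n)   # only two relevant features
fit <- glmnet(X, y, alpha = 1)        # alpha = 1: lasso; alpha = 0: ridge
cv  <- cv.glmnet(X, y, alpha = 1)     # choose lambda by cross-validation
coef(fit, s = cv$lambda.min)          # sparse coefficient vector
```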
42. Statistics & ML Toolbox : sparsity
In several applications, k can be (very) large, but a lot of features are just noise: $\beta_j=0$ for many j's. Let s denote the number of relevant features, with $s\ll k$, cf. Hastie, Tibshirani & Wainwright (2015, Statistical Learning with Sparsity),
$$s=\text{card}\{\mathcal S\}\quad\text{where}\quad\mathcal S=\{j;\beta_j\neq0\}$$
The model is now $y=X_{\mathcal S}\beta_{\mathcal S}+\varepsilon$, where $X_{\mathcal S}^\top X_{\mathcal S}$ is a full-rank matrix.
43. Statistics & ML Toolbox : sparsity
$\ell_0$-regularization Estimator (OLS)
$$\widehat\beta_\lambda=\text{argmin}\Big\{\sum_{i=1}^n(y_i-x_i^\top\beta)^2+\lambda\|\beta\|_{\ell_0}\Big\}$$
where $\|a\|_{\ell_0}=\sum_j\mathbf 1(|a_j|>0)$ counts the non-zero components.
More generally,
$\ell_p$-regularization Estimator (OLS)
$$\widehat\beta_\lambda=\text{argmin}\Big\{\sum_{i=1}^n(y_i-x_i^\top\beta)^2+\lambda\|\beta\|_{\ell_p}\Big\}$$
• sparsity is obtained when p ≤ 1
• convexity is obtained when p ≥ 1
44. Statistics & ML Toolbox : sparsity
source: Hastie et al. (2009, The Elements of Statistical Learning)
45. Statistics & ML Toolbox : lasso
In the orthogonal case, $X^\top X=\mathbb I$,
$$\widehat\beta_{k,\lambda}^{\text{lasso}}=\text{sign}\big(\widehat\beta_k^{\text{ols}}\big)\Big(\big|\widehat\beta_k^{\text{ols}}\big|-\frac\lambda2\Big)_+$$
i.e. the lasso estimate is related to the soft-threshold function...
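The soft-threshold function is one line of R:

```r
# soft-threshold: sign(b) * (|b| - lambda/2)_+
soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)
soft(c(-3, -.1, .2, 2), lambda = 1)   # small coefficients are set exactly to 0
```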
46. Numerical Toolbox
LASSO Coordinate Descent Algorithm
1. Set $\beta^{(0)}=\widehat\beta$
2. For $k=1,2,\cdots$, for $j=1,\cdots,p$:
(i) compute the partial residual $R_j=x_j^\top\big(y-X_{(-j)}\beta^{(k-1)}_{(-j)}\big)$
(ii) set $\beta^{(k)}_j=R_j\Big(1-\frac{\lambda}{2|R_j|}\Big)_+$
3. The final estimate $\beta^{(\kappa)}$ is $\widehat\beta_\lambda$
The softplus function $\log(1+e^{x-s})$ can be used as a smooth version of $(x-s)_+$
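A minimal R implementation of this algorithm (a sketch: it assumes the columns of X are scaled to unit $\ell_2$ norm, so the soft-threshold update above applies directly, and runs a fixed number of sweeps instead of checking convergence):

```r
# Coordinate descent for the lasso: min sum (y - X beta)^2 + lambda |beta|_1
lasso_cd <- function(X, y, lambda, n_sweeps = 100) {
  p <- ncol(X)
  beta <- rep(0, p)                    # start from beta = 0
  for (k in 1:n_sweeps) {
    for (j in 1:p) {
      R_j <- sum(X[, j] * (y - X[, -j, drop = FALSE] %*% beta[-j]))
      beta[j] <- sign(R_j) * max(abs(R_j) - lambda / 2, 0)   # soft threshold
    }
  }
  beta
}

set.seed(1)
n <- 100; p <- 5
X <- apply(matrix(rnorm(n * p), n, p), 2, function(x) x / sqrt(sum(x^2)))
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n, sd = .1)
lasso_cd(X, y, lambda = 1)   # typically only the first two coefficients survive
```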
47. Statistics & ML Toolbox : trees
Use recursive binary splitting to grow a tree
Trees (CART)
Each interior node corresponds to one of the input
variables; there are edges to children for each of
the possible values of that input variable. Each
leaf represents a value of the target variable given
the values of the input variables represented by
the path from the root to the leaf.
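A sketch of recursive binary splitting with the rpart package (assumed installed), on R's built-in iris data:

```r
library(rpart)    # assumes the rpart package is installed
tree <- rpart(Species ~ ., data = iris)   # grow a tree by recursive binary splitting
print(tree)   # interior nodes: splits on input variables; leaves: predicted class
predict(tree, newdata = iris[c(1, 51, 101), ], type = "class")
```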
48. Statistics & ML Toolbox : forests
Bagging (Bootstrap + Aggregation)
1. For $k=1,\cdots,\kappa$:
(i) draw a bootstrap sample from the $(y_i,x_i)$'s
(ii) estimate a model $\widehat m_k$ on that sample
2. The final model is
$$\widehat m(\cdot)=\frac1\kappa\sum_{k=1}^\kappa\widehat m_k(\cdot)$$
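A minimal sketch of bagging with regression trees (rpart assumed installed; κ = 100 and the simulated data are illustrative choices):

```r
library(rpart)    # assumes the rpart package is installed
set.seed(1)
n <- 300
df <- data.frame(x = runif(n))
df$y <- sin(2 * pi * df$x) + rnorm(n, sd = .3)
kappa <- 100
models <- lapply(1:kappa, function(k) {
  idx <- sample(1:n, replace = TRUE)    # (i) a bootstrap sample
  rpart(y ~ x, data = df[idx, ])        # (ii) one tree per sample
})
m_bag <- function(newdf)                # final model: average of the kappa trees
  rowMeans(sapply(models, predict, newdata = newdf))
m_bag(data.frame(x = c(.1, .5, .9)))
```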
50. Statistics & ML Toolbox : Neural Nets
Very popular for pictures...
A picture $x_i$ is
• an $n\times n$ matrix in $\{0,1\}^{n^2}$ for black & white
• an $n\times n$ matrix in $[0,1]^{n^2}$ for grey-scale
• a $3\times n\times n$ array in $([0,1]^3)^{n^2}$ for color
• a $T\times3\times n\times n$ tensor in $\big(([0,1]^3)^T\big)^{n^2}$ for video
y here is the label ("8", "9", "6", etc.)
Suppose we want to recognize a "6" on a picture:
$$m(x)=\begin{cases}+1&\text{if }x\text{ is a "6"}\\-1&\text{otherwise}\end{cases}$$
51. Statistics & ML Toolbox : Neural Nets
Consider some specific pixels, and associate weights ω such that
$$m(x)=\text{sign}\Big(\sum_{i,j}\omega_{i,j}x_{i,j}\Big)$$
where
$$x_{i,j}=\begin{cases}+1&\text{if pixel }(i,j)\text{ is black}\\-1&\text{if pixel }(i,j)\text{ is white}\end{cases}$$
for some weights $\omega_{i,j}$ (that can be negative...)
52. Statistics & ML Toolbox : Neural Nets
A deep network is a network with a lot of layers,
$$m(x)=\text{sign}\Big(\sum_i\omega_i m_i(x)\Big)$$
where the $m_i$'s are outputs of previous neural nets.
Those layers can capture shapes in some areas: nonlinearities, cross-dependence, etc.
53. Binary Threshold Neuron & Perceptron
If $x\in\{0,1\}^p$, McCulloch & Pitts (1943, A logical calculus of the ideas immanent in nervous activity) suggested a simple model, with threshold b,
$$y_i=f\Big(\sum_{j=1}^p x_{j,i}\Big)\quad\text{where }f(x)=\mathbf 1(x\ge b)$$
where $\mathbf 1(x\ge b)=+1$ if $x\ge b$ and 0 otherwise, or (equivalently)
$$y_i=f\Big(\omega+\sum_{j=1}^p x_{j,i}\Big)\quad\text{where }f(x)=\mathbf 1(x\ge0)$$
with weight $\omega=-b$. The trick of adding 1 as an input was very important!
$\omega=-1$ is the or logical operator: $y_i=1$ if $\exists j$ such that $x_{j,i}=1$
$\omega=-p$ is the and logical operator: $y_i=1$ if $\forall j$, $x_{j,i}=1$
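A two-line R version of this threshold unit, checking the or and and cases:

```r
neuron <- function(x, omega) as.numeric(omega + sum(x) >= 0)
x <- c(1, 0, 0)                 # p = 3 binary inputs
neuron(x, omega = -1)           # or : returns 1, at least one x_j equals 1
neuron(x, omega = -length(x))   # and: returns 0, not all x_j equal 1
```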
54. Binary Threshold Neuron & Perceptron
but not possible for the xor logical operator: $y_i=1$ if $x_{1,i}\neq x_{2,i}$
Rosenblatt (1961, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms) considered the extension where the x's are real-valued, with weights $\omega\in\mathbb R^p$,
$$y_i=f\Big(\sum_{j=1}^p\omega_j x_{j,i}\Big)\quad\text{where }f(x)=\mathbf 1(x\ge b)$$
where $\mathbf 1(x\ge b)=+1$ if $x\ge b$ and 0 otherwise, or (equivalently)
$$y_i=f\Big(\omega_0+\sum_{j=1}^p\omega_j x_{j,i}\Big)\quad\text{where }f(x)=\mathbf 1(x\ge0)$$
with weights $\omega\in\mathbb R^{p+1}$
55. Binary Threshold Neuron & Perceptron
Minsky & Papert (1969, Perceptrons: An Introduction to Computational Geometry) proved that the perceptron is a linear separator, and therefore not very powerful.
Define the sigmoid function
$$f(x)=\frac1{1+e^{-x}}=\frac{e^x}{e^x+1}$$
(= logistic function). This function f is called the activation function.
If $y\in\{-1,+1\}$, one can consider the hyperbolic tangent
$$f(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$$
or the inverse tangent function (see wikipedia).
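These activation functions in R (tanh is built in):

```r
sigmoid <- function(x) 1 / (1 + exp(-x))   # logistic: values in (0, 1)
curve(sigmoid, -5, 5)
curve(tanh, -5, 5, add = TRUE, lty = 2)    # hyperbolic tangent: values in (-1, 1)
```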
56. Statistics & ML Toolbox : Neural Nets
So here, for a classification problem,
$$y_i=f\Big(\omega_0+\sum_{j=1}^p\omega_j x_{j,i}\Big)=p(x_i)$$
The error of that model can be computed using the quadratic loss function,
$$\sum_{i=1}^n\big(y_i-p(x_i)\big)^2$$
or the cross-entropy,
$$-\sum_{i=1}^n\big(y_i\log p(x_i)+[1-y_i]\log[1-p(x_i)]\big)$$
57. Statistics & ML Toolbox : Neural Nets
Consider a single hidden layer, with 3 different neurons, so that
$$p(x)=f\Big(\omega_0+\sum_{h=1}^3\omega_h f_h\Big(\omega_{h,0}+\sum_{j=1}^p\omega_{h,j}x_j\Big)\Big)$$
or
$$p(x)=f\Big(\omega_0+\sum_{h=1}^3\omega_h f_h\big(\omega_{h,0}+x^\top\omega_h\big)\Big)$$
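The forward pass of this one-hidden-layer network, as a sketch with random weights (all values illustrative, taking $f_h=f=$ sigmoid):

```r
sigmoid <- function(x) 1 / (1 + exp(-x))
set.seed(1)
p <- 4                                 # input dimension
W  <- matrix(rnorm(3 * p), nrow = 3)   # omega_{h,j}: hidden-layer weights
b  <- rnorm(3)                         # omega_{h,0}: hidden-layer intercepts
w  <- rnorm(3); w0 <- rnorm(1)         # omega_h and omega_0: output layer
p_x <- function(x)                     # p(x) = f(w0 + sum_h w_h f(b_h + x' W_h))
  sigmoid(w0 + sum(w * sigmoid(b + W %*% x)))
p_x(rnorm(p))                          # a value in (0, 1)
```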