Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Advanced Econometrics #3: Model & Variable Selection*
A. Charpentier (Université de Rennes 1)
Université de Rennes 1,
Graduate Course, 2018.
@freakonometrics 1
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
“Great plot.
Now need to find the theory that explains it”
Deville (2017) http://twitter.com
@freakonometrics 2
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preliminary Results: Numerical Optimization
Problem: $x^\star \in \text{argmin}\{f(x);\ x \in \mathbb{R}^d\}$
Gradient descent: $x_{k+1} = x_k - \eta\,\nabla f(x_k)$, starting from some $x_0$
Problem: $x^\star \in \text{argmin}\{f(x);\ x \in \mathcal{X} \subset \mathbb{R}^d\}$
Projected descent: $x_{k+1} = \Pi_{\mathcal{X}}\big(x_k - \eta\,\nabla f(x_k)\big)$, starting from some $x_0$
A constrained problem is said to be convex if
$$\min\{f(x)\}\quad\text{with } f \text{ convex, s.t. } g_i(x)=0,\ \forall i=1,\cdots,n \text{ with } g_i \text{ linear, and } h_i(x)\leq 0,\ \forall i=1,\cdots,m \text{ with } h_i \text{ convex}$$
Lagrangian: $\mathcal{L}(x,\lambda,\mu) = f(x) + \sum_{i=1}^n \lambda_i g_i(x) + \sum_{i=1}^m \mu_i h_i(x)$, where $x$ are primal variables and $(\lambda,\mu)$ are dual variables.
Remark: $\mathcal{L}$ is an affine function in $(\lambda,\mu)$.
@freakonometrics 3
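A minimal R sketch of the two descent schemes above (not from the slides; the quadratic objective, the step size eta and the box set used for the projection are illustrative assumptions):
# gradient descent on f(x) = ||A x - b||^2 / 2, a simple smooth convex objective
f_grad <- function(x, A, b) t(A) %*% (A %*% x - b)
grad_descent <- function(x0, A, b, eta = 0.01, n_iter = 200) {
  x <- x0
  for (k in 1:n_iter) x <- x - eta * f_grad(x, A, b)
  x
}
# projected descent: project each iterate onto the box X = [-1, 1]^d
proj_box <- function(x) pmin(pmax(x, -1), 1)
proj_descent <- function(x0, A, b, eta = 0.01, n_iter = 200) {
  x <- x0
  for (k in 1:n_iter) x <- proj_box(x - eta * f_grad(x, A, b))
  x
}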
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preliminary Results: Numerical Optimization
Karush–Kuhn–Tucker conditions: a convex problem has a solution $x^\star$ if and only if there are $(\lambda^\star,\mu^\star)$ such that the following conditions hold
• stationarity: $\nabla_x \mathcal{L}(x,\lambda,\mu) = 0$ at $(x^\star,\lambda^\star,\mu^\star)$
• primal admissibility: $g_i(x^\star)=0$ and $h_i(x^\star)\leq 0$, $\forall i$
• dual admissibility: $\mu^\star \geq 0$
Let $\mathcal{L}$ denote the associated dual function $\mathcal{L}(\lambda,\mu) = \min_{x}\{\mathcal{L}(x,\lambda,\mu)\}$
$\mathcal{L}$ is a concave function in $(\lambda,\mu)$ and the dual problem is $\max_{\lambda,\mu}\{\mathcal{L}(\lambda,\mu)\}$.
@freakonometrics 4
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
References
Motivation
Banerjee, A., Chandrasekhar, A.G., Duflo, E. & Jackson, M.O. (2016). Gossip: Identifying Central Individuals in a Social Network.
References
Belloni, A. & Chernozhukov, V. (2009). Least Squares after Model Selection in High-Dimensional Sparse Models.
Hastie, T., Tibshirani, R. & Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press.
@freakonometrics 5
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preamble
Assume that $y = m(x) + \varepsilon$, where $\varepsilon$ is some idiosyncratic, unpredictable noise.
The error $\mathbb{E}[(y - \widehat{m}(x))^2]$ is the sum of three terms
• variance of the estimator: $\mathbb{E}\big[(\widehat{m}(x) - \mathbb{E}[\widehat{m}(x)])^2\big]$
• bias$^2$ of the estimator: $\big[m(x) - \mathbb{E}[\widehat{m}(x)]\big]^2$
• variance of the noise: $\mathbb{E}\big[(y - m(x))^2\big]$
(the latter exists, even with a ‘perfect’ model).
@freakonometrics 6
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Preamble
Consider a parametric model, with true (unknown) parameter $\theta$, then
$$\text{mse}(\widehat{\theta}) = \mathbb{E}\big[(\widehat{\theta}-\theta)^2\big] = \underbrace{\mathbb{E}\big[(\widehat{\theta}-\mathbb{E}[\widehat{\theta}])^2\big]}_{\text{variance}} + \underbrace{\big(\mathbb{E}[\widehat{\theta}]-\theta\big)^2}_{\text{bias}^2}$$
Let $\overline{\theta}$ denote an unbiased estimator of $\theta$. Then
$$\widehat{\theta} = \frac{\theta^2}{\theta^2 + \text{mse}(\overline{\theta})}\cdot\overline{\theta} = \overline{\theta} - \underbrace{\frac{\text{mse}(\overline{\theta})}{\theta^2 + \text{mse}(\overline{\theta})}\cdot\overline{\theta}}_{\text{penalty}}$$
satisfies $\text{mse}(\widehat{\theta}) \leq \text{mse}(\overline{\theta})$.
@freakonometrics 7
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Bayes vs. Frequentist, inference on heads/tails
Consider some Bernoulli sample $\mathbf{x} = \{x_1, x_2, \cdots, x_n\}$, where $x_i \in \{0,1\}$.
The $X_i$'s are i.i.d. $\mathcal{B}(p)$ variables, $f_X(x) = p^x[1-p]^{1-x}$, $x \in \{0,1\}$.
Standard frequentist approach:
$$\widehat{p} = \frac{1}{n}\sum_{i=1}^n x_i = \text{argmax}\Big\{\underbrace{\prod_{i=1}^n f_X(x_i)}_{\mathcal{L}(p;\mathbf{x})}\Big\}$$
From the central limit theorem
$$\sqrt{n}\,\frac{\widehat{p}-p}{\sqrt{p(1-p)}} \overset{\mathcal{L}}{\to} \mathcal{N}(0,1)\ \text{as } n\to\infty$$
we can derive an approximated 95% confidence interval
$$\widehat{p} \pm \frac{1.96}{\sqrt{n}}\sqrt{\widehat{p}(1-\widehat{p})}$$
@freakonometrics 8
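A short R illustration of this frequentist interval (the sample below is simulated, with an assumed true p = 0.15):
set.seed(1)
n <- 1047
x <- rbinom(n, size = 1, prob = 0.15)      # simulated Bernoulli sample
p_hat <- mean(x)
se <- sqrt(p_hat * (1 - p_hat) / n)
c(lower = p_hat - 1.96 * se, upper = p_hat + 1.96 * se)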
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Bayes vs. Frequentist, inference on heads/tails
Example: out of 1,047 contracts, 159 claimed a loss.
[Figure: probability of the number of insured claiming a loss, with the (true) binomial distribution, a Poisson approximation and a Gaussian approximation]
@freakonometrics 9
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Bayes’s theorem
Consider some hypothesis $H$ and some evidence $E$, then
$$\mathbb{P}_E(H) = \mathbb{P}(H|E) = \frac{\mathbb{P}(H\cap E)}{\mathbb{P}(E)} = \frac{\mathbb{P}(H)\cdot\mathbb{P}(E|H)}{\mathbb{P}(E)}$$
Bayes' rule: prior probability $\mathbb{P}(H)$ versus posterior probability after receiving evidence $E$, $\mathbb{P}_E(H) = \mathbb{P}(H|E)$.
In Bayesian (parametric) statistics, $H = \{\theta\in\Theta\}$ and $E = \{X=x\}$.
Bayes' Theorem,
$$\pi(\theta|x) = \frac{\pi(\theta)\cdot f(x|\theta)}{f(x)} = \frac{\pi(\theta)\cdot f(x|\theta)}{\int f(x|\theta)\pi(\theta)\,d\theta} \propto \pi(\theta)\cdot f(x|\theta)$$
@freakonometrics 10
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Small Data and Black Swans
Consider the sample $\mathbf{x} = \{0,0,0,0,0\}$.
Here the likelihood is
$$f(x_i|\theta) = \theta^{x_i}[1-\theta]^{1-x_i},\qquad f(\mathbf{x}|\theta) = \theta^{\mathbf{x}^\mathsf{T}\mathbf{1}}[1-\theta]^{n-\mathbf{x}^\mathsf{T}\mathbf{1}}$$
and we need an a priori distribution $\pi(\cdot)$, e.g. a beta distribution
$$\pi(\theta) = \frac{\theta^{\alpha}[1-\theta]^{\beta}}{B(\alpha,\beta)},\qquad \pi(\theta|\mathbf{x}) = \frac{\theta^{\alpha+\mathbf{x}^\mathsf{T}\mathbf{1}}[1-\theta]^{\beta+n-\mathbf{x}^\mathsf{T}\mathbf{1}}}{B(\alpha+\mathbf{x}^\mathsf{T}\mathbf{1},\ \beta+n-\mathbf{x}^\mathsf{T}\mathbf{1})}$$
@freakonometrics 11
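A quick R sketch of this beta-Bernoulli update, using R's standard Beta(shape1, shape2) parameterization (the prior values alpha = beta = 1, i.e. a flat prior, are an illustrative assumption):
x <- c(0, 0, 0, 0, 0)              # the all-zeros sample of the slide
alpha <- 1; beta <- 1              # assumed (flat) prior
a_post <- alpha + sum(x)
b_post <- beta + length(x) - sum(x)
# posterior density and a 95% credible interval for theta
curve(dbeta(x, a_post, b_post), from = 0, to = 1, xlab = "theta", ylab = "posterior")
qbeta(c(0.025, 0.975), a_post, b_post)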
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
On Bayesian Philosophy, Confidence vs. Credibility
for frequentists, a probability is a measure of the frequency of repeated events
→ parameters are fixed (but unknown), and data are random
for Bayesians, a probability is a measure of the degree of certainty about values
→ parameters are random and data are fixed
“Bayesians : Given our observed data, there is a 95% probability that the true value of θ
falls within the credible region
vs. Frequentists : There is a 95% probability that when I compute a confidence interval
from data of this sort, the true value of θ will fall within it.” in Vanderplas (2014)
Example see Jaynes (1976), e.g. the truncated exponential
@freakonometrics 12
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
On Bayesian Philosophy, Confidence vs. Credibility
Example: What is a 95% confidence interval of a proportion? Here $x = 159$ and $n = 1047$.
1. draw sets $(\tilde{x}_1,\cdots,\tilde{x}_n)_k$ with $X_i \sim \mathcal{B}(x/n)$
2. compute for each set of values a confidence interval
3. determine the fraction of these confidence intervals that contain $x$
→ the parameter is fixed, and we guarantee that 95% of the confidence intervals will contain it.
@freakonometrics 13
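A minimal R sketch of that coverage experiment (the number of replications is an arbitrary choice; coverage is checked on the proportion scale, which is equivalent):
x <- 159; n <- 1047; p0 <- x / n
contains <- replicate(1000, {
  xs <- rbinom(n, size = 1, prob = p0)        # step 1: simulate a sample
  p_hat <- mean(xs)
  se <- sqrt(p_hat * (1 - p_hat) / n)         # step 2: its confidence interval
  (p_hat - 1.96 * se <= p0) & (p0 <= p_hat + 1.96 * se)
})
mean(contains)                                # step 3: empirical coverage, close to 0.95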
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
On Bayesian Philosophy, Confidence vs. Credibility
Example: What is a 95% credible region of a proportion? Here $x = 159$ and $n = 1047$.
1. draw random parameters $p_k$ from the posterior distribution $\pi(\cdot|x)$
2. sample sets $(\tilde{x}_1,\cdots,\tilde{x}_n)_k$ with $X_{i,k}\sim\mathcal{B}(p_k)$
3. compute for each set of values the mean $\overline{x}_k$
4. look at the proportion of those $\overline{x}_k$ that are within the credible region $[\Pi^{-1}(.025|x);\ \Pi^{-1}(.975|x)]$
→ the credible region is fixed, and we guarantee that 95% of possible values of $\overline{x}$ will fall within it.
@freakonometrics 14
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Occam’s Razor
The “law of parsimony”, “pluralitas non est ponenda sine necessitate”
Penalize too complex models
@freakonometrics 15
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
James & Stein Estimator
Let $X \sim \mathcal{N}(\mu, \sigma^2 I)$. We want to estimate $\mu$.
$$\widehat{\mu}_{\text{mle}} = \overline{X}_n \sim \mathcal{N}\Big(\mu,\ \frac{\sigma^2}{n}I\Big).$$
From James & Stein (1961) Estimation with Quadratic Loss,
$$\widehat{\mu}_{\text{JS}} = \Big(1 - \frac{(d-2)\sigma^2}{n\|\mathbf{y}\|^2}\Big)\mathbf{y}$$
where $\|\cdot\|$ is the Euclidean norm.
One can prove that if $d\geq 3$,
$$\mathbb{E}\big[\|\widehat{\mu}_{\text{JS}}-\mu\|^2\big] < \mathbb{E}\big[\|\widehat{\mu}_{\text{mle}}-\mu\|^2\big]$$
Samworth (2015) Stein's Paradox: “one should use the price of tea in China to obtain a better estimate of the chance of rain in Melbourne”.
@freakonometrics 16
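A small R simulation of this effect (the dimension d, sample size n, sigma and the number of replications are illustrative choices; the estimator shrinks the sample mean toward 0 as in the formula above):
set.seed(1)
d <- 10; n <- 20; sigma <- 1
mu <- rnorm(d)                                   # true mean vector
mse_mle <- mse_js <- numeric(500)
for (s in 1:500) {
  X <- matrix(rnorm(n * d, mean = rep(mu, each = n), sd = sigma), n, d)
  y_bar <- colMeans(X)                           # MLE of mu
  shrink <- 1 - (d - 2) * sigma^2 / (n * sum(y_bar^2))
  mu_js <- shrink * y_bar                        # James-Stein estimate
  mse_mle[s] <- sum((y_bar - mu)^2)
  mse_js[s]  <- sum((mu_js - mu)^2)
}
c(mle = mean(mse_mle), js = mean(mse_js))        # JS risk is smaller when d >= 3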
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
James & Stein Estimator
Heuristics : consider a biased estimator, to decrease the variance.
See Efron (2010) Large-Scale Inference
@freakonometrics 17
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Motivation: Avoiding Overfit
Generalization : the model should perform well on new data (and not only on the
training ones).
@freakonometrics 18
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Reducing Dimension with PCA
Use principal components to reduce dimension (on centered and scaled variables): we want $d$ vectors $z_1,\cdots,z_d$ such that
the first component is $z_1 = X\omega_1$ where
$$\omega_1 = \underset{\|\omega\|=1}{\text{argmax}}\ \|X\cdot\omega\|^2 = \underset{\|\omega\|=1}{\text{argmax}}\ \omega^\mathsf{T}X^\mathsf{T}X\omega$$
the second component is $z_2 = X\omega_2$ where
$$\omega_2 = \underset{\|\omega\|=1}{\text{argmax}}\ \|X^{(1)}\cdot\omega\|^2$$
[Figures: log mortality rate against age, and PC score 1 against PC score 2, where the years 1914–1919 and 1940–1944 stand out]
with $X^{(1)} = X - \underbrace{X\omega_1}_{z_1}\omega_1^\mathsf{T}$.
@freakonometrics 19
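A quick R illustration with prcomp (the data here are simulated, as a stand-in for the mortality rates of the figure):
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)              # illustrative data matrix
pca <- prcomp(X, center = TRUE, scale. = TRUE)
w1 <- pca$rotation[, 1]                          # omega_1, the first loading vector
z1 <- pca$x[, 1]                                 # first principal component scores
summary(pca)                                     # share of variance of each component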
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Reducing Dimension with PCA
A regression on (the $d$) principal components, $y = z^\mathsf{T}\beta + \eta$, could be an interesting idea; unfortunately, principal components have no reason to be correlated with $y$. The first component was $z_1 = X\omega_1$ where
$$\omega_1 = \underset{\|\omega\|=1}{\text{argmax}}\ \|X\cdot\omega\|^2 = \underset{\|\omega\|=1}{\text{argmax}}\ \omega^\mathsf{T}X^\mathsf{T}X\omega$$
It is a non-supervised technique.
Instead, use partial least squares, introduced in Wold (1966) Estimation of Principal Components and Related Models by Iterative Least Squares. The first component is $z_1 = X\omega_1$ where
$$\omega_1 = \underset{\|\omega\|=1}{\text{argmax}}\ \{\langle y, X\cdot\omega\rangle\} = \underset{\|\omega\|=1}{\text{argmax}}\ \omega^\mathsf{T}X^\mathsf{T}\mathbf{y}\mathbf{y}^\mathsf{T}X\omega$$
@freakonometrics 20
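A minimal base-R sketch of this first PLS direction, following the formula above (data simulated for illustration; in practice one would rather use a dedicated package such as pls):
set.seed(1)
X <- scale(matrix(rnorm(100 * 5), 100, 5))
y <- X[, 1] + rnorm(100)
w_pls <- t(X) %*% y                      # direction maximizing <y, X w>, up to scaling
w_pls <- w_pls / sqrt(sum(w_pls^2))      # normalize so that ||w|| = 1
z1 <- X %*% w_pls                        # first PLS component
cor(z1, y)                               # correlated with y by construction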
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Terminology
Consider a dataset $\{y_i, \mathbf{x}_i\}$, assumed to be generated from $(Y,\mathbf{X})$, from an unknown distribution $\mathbb{P}$.
Let $m_0(\cdot)$ be the “true” model. Assume that $y_i = m_0(\mathbf{x}_i) + \varepsilon_i$.
In a regression context (quadratic loss function), the risk associated to $m$ is
$$\mathcal{R}(m) = \mathbb{E}_{\mathbb{P}}\big[(Y - m(\mathbf{X}))^2\big]$$
An optimal model $m^\star$ within a class $\mathcal{M}$ satisfies
$$\mathcal{R}(m^\star) = \inf_{m\in\mathcal{M}}\mathcal{R}(m)$$
Such a model $m^\star$ is usually called an oracle.
Observe that $m^\star(\mathbf{x}) = \mathbb{E}[Y|\mathbf{X}=\mathbf{x}]$ is the solution of
$$\mathcal{R}(m^\star) = \inf_{m\in\mathcal{M}}\mathcal{R}(m)\quad\text{where } \mathcal{M} \text{ is the set of measurable functions}$$
@freakonometrics 21
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The empirical risk is
$$\widehat{\mathcal{R}}_n(m) = \frac{1}{n}\sum_{i=1}^n \big(y_i - m(\mathbf{x}_i)\big)^2$$
For instance, $m$ can be a linear predictor, $m(\mathbf{x}) = \beta_0 + \mathbf{x}^\mathsf{T}\beta$, where $\theta = (\beta_0,\beta)$ should be estimated (trained).
$\mathbb{E}\big[\widehat{\mathcal{R}}_n(m)\big] = \mathbb{E}\big[(m(\mathbf{X})-Y)^2\big]$ can be expressed as
$$\underbrace{\mathbb{E}\big[(m(\mathbf{X}) - \mathbb{E}[m(\mathbf{X})|\mathbf{X}])^2\big]}_{\text{variance of } m} + \underbrace{\mathbb{E}\big[\big(\mathbb{E}[m(\mathbf{X})|\mathbf{X}] - \underbrace{\mathbb{E}[Y|\mathbf{X}]}_{m_0(\mathbf{X})}\big)^2\big]}_{\text{bias of } m} + \underbrace{\mathbb{E}\big[\big(Y - \underbrace{\mathbb{E}[Y|\mathbf{X}]}_{m_0(\mathbf{X})}\big)^2\big]}_{\text{variance of the noise}}$$
The third term is the risk of the “optimal” estimator $m$, that cannot be decreased.
@freakonometrics 22
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Mallows Penalty and Model Complexity
Consider a linear predictor (see #1), i.e. $\widehat{\mathbf{y}} = \widehat{m}(\mathbf{x}) = A\mathbf{y}$.
Assume that $\mathbf{y} = m_0(\mathbf{x}) + \boldsymbol{\varepsilon}$, with $\mathbb{E}[\boldsymbol{\varepsilon}]=\mathbf{0}$ and $\text{Var}[\boldsymbol{\varepsilon}] = \sigma^2 I$.
Let $\|\cdot\|$ denote the Euclidean norm.
Empirical risk: $\widehat{\mathcal{R}}_n(\widehat{m}) = \frac{1}{n}\|\mathbf{y} - \widehat{m}(\mathbf{x})\|^2$
Vapnik's risk: $\mathbb{E}[\widehat{\mathcal{R}}_n(\widehat{m})] = \frac{1}{n}\|m_0(\mathbf{x}) - \widehat{m}(\mathbf{x})\|^2 + \frac{1}{n}\mathbb{E}\big[\|\mathbf{y} - m_0(\mathbf{x})\|^2\big]$, with $m_0(\mathbf{x}) = \mathbb{E}[Y|\mathbf{X}=\mathbf{x}]$.
Observe that
$$n\,\mathbb{E}\big[\widehat{\mathcal{R}}_n(\widehat{m})\big] = \mathbb{E}\big[\|\mathbf{y}-\widehat{m}(\mathbf{x})\|^2\big] = \|(I-A)m_0\|^2 + \sigma^2\|I-A\|^2$$
while
$$\mathbb{E}\big[\|m_0(\mathbf{x}) - \widehat{m}(\mathbf{x})\|^2\big] = \underbrace{\|(I-A)m_0\|^2}_{\text{bias}} + \underbrace{\sigma^2\|A\|^2}_{\text{variance}}$$
@freakonometrics 23
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Mallows Penalty and Model Complexity
One can obtain
$$\mathbb{E}\big[\mathcal{R}_n(\widehat{m})\big] = \mathbb{E}\big[\widehat{\mathcal{R}}_n(\widehat{m})\big] + 2\,\frac{\sigma^2}{n}\,\text{trace}(A).$$
If $\text{trace}(A)\geq 0$, the empirical risk underestimates the true risk of the estimator.
The number of degrees of freedom of the (linear) predictor is related to $\text{trace}(A)$.
$2\,\frac{\sigma^2}{n}\,\text{trace}(A)$ is called Mallows' penalty $C_L$.
If $A$ is a projection matrix, $\text{trace}(A)$ is the dimension of the projection space, $p$, and we obtain Mallows' $C_P$, $2\,\frac{\sigma^2}{n}\,p$.
Remark: Mallows (1973) Some Comments on Cp introduced this penalty while focusing on the $R^2$.
@freakonometrics 24
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
$C_P$ is associated to a quadratic risk;
an alternative is to use a distance on the (conditional) distribution of $Y$, namely the Kullback-Leibler distance
discrete case: $D_{KL}(P\|Q) = \sum_i P(i)\log\dfrac{P(i)}{Q(i)}$
continuous case: $D_{KL}(P\|Q) = \displaystyle\int_{-\infty}^{\infty} p(x)\log\frac{p(x)}{q(x)}\,dx$
Let $f$ denote the true (unknown) density, and $f_\theta$ some parametric distribution,
$$D_{KL}(f\|f_\theta) = \int_{-\infty}^{\infty} f(x)\log\frac{f(x)}{f_\theta(x)}\,dx = \underbrace{\int f(x)\log[f(x)]\,dx}_{\text{relative information}} - \int f(x)\log[f_\theta(x)]\,dx$$
Hence
$$\text{minimize } \{D_{KL}(f\|f_\theta)\}\ \longleftrightarrow\ \text{maximize } \mathbb{E}\big[\log[f_\theta(X)]\big]$$
@freakonometrics 25
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
Akaike (1974) A New Look at the Statistical Model Identification observed that, for $n$ large enough,
$$\mathbb{E}\big[\log[f_\theta(X)]\big] \sim \log[\mathcal{L}(\widehat{\theta})] - \dim(\theta)$$
Thus
$$AIC = -2\log\mathcal{L}(\widehat{\theta}) + 2\dim(\theta)$$
Example: in a (Gaussian) linear model, $y_i = \beta_0 + \mathbf{x}_i^\mathsf{T}\beta + \varepsilon_i$,
$$AIC = n\log\Big(\frac{1}{n}\sum_{i=1}^n \widehat{\varepsilon}_i^{\,2}\Big) + 2[\dim(\beta)+2]$$
@freakonometrics 26
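In R this matches, up to the usual Gaussian constant, what AIC() returns for an lm fit; a small check on simulated data (the data-generating process is an assumption for illustration):
set.seed(1)
n <- 100
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
AIC(fit)                                   # built-in criterion
BIC(fit)                                   # Schwarz criterion, log(n) penalty instead of 2
n * log(mean(residuals(fit)^2)) + 2 * (length(coef(fit)) + 1)   # slide's formula, up to a constant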
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Penalty and Likelihood
Remark: this is valid for large samples (rule of thumb $n/\dim(\theta) > 40$); otherwise use a corrected AIC
$$AIC_c = AIC + \underbrace{\frac{2k(k+1)}{n-k-1}}_{\text{bias correction}}\quad\text{where } k = \dim(\theta)$$
see Sugiura (1978) Further Analysis of the Data by Akaike's Information Criterion and the Finite Corrections, the second-order AIC.
Using a Bayesian interpretation, Schwarz (1978) Estimating the Dimension of a Model obtained
$$BIC = -2\log\mathcal{L}(\widehat{\theta}) + \log(n)\dim(\theta).$$
Observe that the criterion considered is
$$\text{criterion} = -\underbrace{\text{fit}}_{\log\mathcal{L}(\widehat{\theta})} + \underbrace{\text{penalty}}_{\text{complexity}}$$
@freakonometrics 27
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Estimation of the Risk
Consider a naive bootstrap procedure, based on a bootstrap sample $\mathcal{S}_b = \{(y_i^{(b)}, \mathbf{x}_i^{(b)})\}$.
The plug-in estimator of the empirical risk is
$$\widehat{\mathcal{R}}_n(\widehat{m}^{(b)}) = \frac{1}{n}\sum_{i=1}^n \big(y_i - \widehat{m}^{(b)}(\mathbf{x}_i)\big)^2$$
and then
$$\widehat{\mathcal{R}}_n = \frac{1}{B}\sum_{b=1}^B \widehat{\mathcal{R}}_n(\widehat{m}^{(b)}) = \frac{1}{B}\sum_{b=1}^B \frac{1}{n}\sum_{i=1}^n \big(y_i - \widehat{m}^{(b)}(\mathbf{x}_i)\big)^2$$
@freakonometrics 28
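A compact R sketch of this naive bootstrap estimate of the risk, for a linear model on simulated data (B and the data-generating process are illustrative choices):
set.seed(1)
n <- 100; B <- 200
x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
risk_b <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)               # bootstrap sample S_b
  fit_b <- lm(y ~ x, data = data.frame(x = x[idx], y = y[idx]))
  pred  <- predict(fit_b, newdata = data.frame(x = x))
  risk_b[b] <- mean((y - pred)^2)                # plug-in risk on the full sample
}
mean(risk_b)                                     # naive bootstrap estimate of the risk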
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Estimation of the Risk
One might improve this estimate using an out-of-bag procedure
$$\widehat{\mathcal{R}}_n = \frac{1}{n}\sum_{i=1}^n \frac{1}{\#\mathcal{B}_i}\sum_{b\in\mathcal{B}_i} \big(y_i - \widehat{m}^{(b)}(\mathbf{x}_i)\big)^2$$
where $\mathcal{B}_i$ is the set of all bootstrap samples that do not contain $(y_i,\mathbf{x}_i)$.
Remark: $\mathbb{P}\big((y_i,\mathbf{x}_i)\notin\mathcal{S}_b\big) = \Big(1-\frac{1}{n}\Big)^n \sim e^{-1} \approx 36.78\%$.
@freakonometrics 29
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear Regression Shortcoming
Least Squares Estimator: $\widehat{\beta} = (X^\mathsf{T}X)^{-1}X^\mathsf{T}\mathbf{y}$
Unbiased Estimator: $\mathbb{E}[\widehat{\beta}] = \beta$
Variance: $\text{Var}[\widehat{\beta}] = \sigma^2(X^\mathsf{T}X)^{-1}$,
which can be (extremely) large when $\det[(X^\mathsf{T}X)] \sim 0$.
$$X = \begin{pmatrix} 1 & -1 & 2\\ 1 & 0 & 1\\ 1 & 2 & -1\\ 1 & 1 & 0\end{pmatrix}\quad\text{then}\quad X^\mathsf{T}X = \begin{pmatrix} 4 & 2 & 2\\ 2 & 6 & -4\\ 2 & -4 & 6\end{pmatrix}\quad\text{while}\quad X^\mathsf{T}X + I = \begin{pmatrix} 5 & 2 & 2\\ 2 & 7 & -4\\ 2 & -4 & 7\end{pmatrix}$$
eigenvalues: $\{10, 6, 0\}$ and $\{11, 7, 1\}$
Ad-hoc strategy: use $X^\mathsf{T}X + \lambda I$
@freakonometrics 30
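The example above can be checked directly in R:
X <- matrix(c(1, -1,  2,
              1,  0,  1,
              1,  2, -1,
              1,  1,  0), nrow = 4, byrow = TRUE)
XtX <- t(X) %*% X
eigen(XtX)$values            # 10, 6, 0 : singular, the OLS variance blows up
eigen(XtX + diag(3))$values  # 11, 7, 1 : adding lambda*I restores invertibility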
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear Regression Shortcoming
Evolution of $(\beta_1,\beta_2)\mapsto \sum_{i=1}^n [y_i - (\beta_1 x_{1,i} + \beta_2 x_{2,i})]^2$ when $\text{cor}(X_1,X_2) = r\in[0,1]$, on top.
Below, Ridge regression
$$(\beta_1,\beta_2)\mapsto \sum_{i=1}^n [y_i - (\beta_1 x_{1,i} + \beta_2 x_{2,i})]^2 + \lambda(\beta_1^2+\beta_2^2)$$
where $\lambda\in[0,\infty)$, when $\text{cor}(X_1,X_2)\sim 1$ (collinearity).
@freakonometrics 31
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Normalization: Euclidean $\ell_2$ vs. Mahalanobis
We want to penalize complicated models: if $\beta_k$ is “too small”, we prefer to have $\beta_k = 0$.
Instead of $d(x,y) = (x-y)^\mathsf{T}(x-y)$, use $d_\Sigma(x,y) = (x-y)^\mathsf{T}\Sigma^{-1}(x-y)$.
[Figures: contour plots of the criterion in $(\beta_1,\beta_2)$ under the Euclidean and Mahalanobis-type penalties]
@freakonometrics 32
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
... like least squares, but it shrinks estimated coefficients towards 0.
$$\widehat{\beta}^{\text{ridge}}_\lambda = \text{argmin}\Big\{\sum_{i=1}^n (y_i - \mathbf{x}_i^\mathsf{T}\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2\Big\}$$
$$\widehat{\beta}^{\text{ridge}}_\lambda = \text{argmin}\Big\{\underbrace{\|\mathbf{y} - X\beta\|_2^2}_{=\text{criteria}} + \underbrace{\lambda\|\beta\|_2^2}_{=\text{penalty}}\Big\}$$
$\lambda\geq 0$ is a tuning parameter.
The constant is usually unpenalized. The true equation is
$$\widehat{\beta}^{\text{ridge}}_\lambda = \text{argmin}\Big\{\underbrace{\|\mathbf{y} - (\beta_0 + X\beta)\|_2^2}_{=\text{criteria}} + \underbrace{\lambda\|\beta\|_2^2}_{=\text{penalty}}\Big\}$$
@freakonometrics 33
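A hedged R sketch comparing the closed-form ridge solution with OLS on simulated collinear data (lambda and the simulated design are illustrative; glmnet, used later in these slides, scales the problem differently, so coefficients are not numerically identical):
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.05)     # strongly collinear regressors
y  <- 1 + x1 + x2 + rnorm(n)
Xc <- scale(cbind(x1, x2), scale = FALSE)          # centered covariates, intercept left aside
yc <- y - mean(y)
lambda <- 1
beta_ols   <- solve(t(Xc) %*% Xc)                    %*% t(Xc) %*% yc
beta_ridge <- solve(t(Xc) %*% Xc + lambda * diag(2)) %*% t(Xc) %*% yc
cbind(ols = beta_ols, ridge = beta_ridge)          # ridge coefficients are shrunk / stabilized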
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
$$\widehat{\beta}^{\text{ridge}}_\lambda = \text{argmin}\Big\{\|\mathbf{y} - (\beta_0 + X\beta)\|_2^2 + \lambda\|\beta\|_2^2\Big\}$$
can be seen as a constrained optimization problem
$$\widehat{\beta}^{\text{ridge}}_\lambda = \underset{\|\beta\|_2^2 \leq h_\lambda}{\text{argmin}}\Big\{\|\mathbf{y} - (\beta_0 + X\beta)\|_2^2\Big\}$$
Explicit solution
$$\widehat{\beta}_\lambda = (X^\mathsf{T}X + \lambda I)^{-1}X^\mathsf{T}\mathbf{y}$$
If $\lambda\to 0$, $\widehat{\beta}^{\text{ridge}}_0 = \widehat{\beta}^{\text{ols}}$; if $\lambda\to\infty$, $\widehat{\beta}^{\text{ridge}}_\infty = \mathbf{0}$.
@freakonometrics 34
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
This penalty can be seen as rather unfair if components of $\mathbf{x}$ are not expressed on the same scale
• center: $\overline{x}_j = 0$, then $\widehat{\beta}_0 = \overline{y}$
• scale: $\mathbf{x}_j^\mathsf{T}\mathbf{x}_j = 1$
Then compute
$$\widehat{\beta}^{\text{ridge}}_\lambda = \text{argmin}\Big\{\underbrace{\|\mathbf{y} - X\beta\|_2^2}_{=\text{loss}} + \underbrace{\lambda\|\beta\|_2^2}_{=\text{penalty}}\Big\}$$
@freakonometrics 35
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
Observe that if $\mathbf{x}_{j_1} \perp \mathbf{x}_{j_2}$, then
$$\widehat{\beta}^{\text{ridge}}_\lambda = [1+\lambda]^{-1}\widehat{\beta}^{\text{ols}}$$
which explains the relationship with shrinkage.
But generally, it is not the case...
Theorem. There exists $\lambda$ such that $\text{mse}[\widehat{\beta}^{\text{ridge}}_\lambda] \leq \text{mse}[\widehat{\beta}^{\text{ols}}]$.
@freakonometrics 36
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Ridge Regression
$$\mathcal{L}_\lambda(\beta) = \sum_{i=1}^n (y_i - \beta_0 - \mathbf{x}_i^\mathsf{T}\beta)^2 + \lambda\sum_{j=1}^p \beta_j^2$$
$$\frac{\partial \mathcal{L}_\lambda(\beta)}{\partial\beta} = -2X^\mathsf{T}\mathbf{y} + 2(X^\mathsf{T}X + \lambda I)\beta\qquad\qquad \frac{\partial^2 \mathcal{L}_\lambda(\beta)}{\partial\beta\,\partial\beta^\mathsf{T}} = 2(X^\mathsf{T}X + \lambda I)$$
where $X^\mathsf{T}X$ is a semi-positive definite matrix, and $\lambda I$ is a positive definite matrix, and
$$\widehat{\beta}_\lambda = (X^\mathsf{T}X + \lambda I)^{-1}X^\mathsf{T}\mathbf{y}$$
@freakonometrics 37
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The Bayesian Interpretation
From a Bayesian perspective,
$$\underbrace{\mathbb{P}[\theta|\mathbf{y}]}_{\text{posterior}} \propto \underbrace{\mathbb{P}[\mathbf{y}|\theta]}_{\text{likelihood}}\cdot\underbrace{\mathbb{P}[\theta]}_{\text{prior}}\qquad\text{i.e.}\qquad \log\mathbb{P}[\theta|\mathbf{y}] = \underbrace{\log\mathbb{P}[\mathbf{y}|\theta]}_{\text{log-likelihood}} + \underbrace{\log\mathbb{P}[\theta]}_{\text{penalty}}$$
If $\beta$ has a prior $\mathcal{N}(\mathbf{0},\tau^2 I)$ distribution, then its posterior distribution has mean
$$\mathbb{E}[\beta|\mathbf{y},X] = \Big(X^\mathsf{T}X + \frac{\sigma^2}{\tau^2}I\Big)^{-1}X^\mathsf{T}\mathbf{y}.$$
@freakonometrics 38
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
$$\widehat{\beta}_\lambda = (X^\mathsf{T}X + \lambda I)^{-1}X^\mathsf{T}\mathbf{y}$$
$$\mathbb{E}[\widehat{\beta}_\lambda] = X^\mathsf{T}X(\lambda I + X^\mathsf{T}X)^{-1}\beta,$$
i.e. $\mathbb{E}[\widehat{\beta}_\lambda] \neq \beta$: the ridge estimator is biased.
Observe that $\mathbb{E}[\widehat{\beta}_\lambda] \to \mathbf{0}$ as $\lambda\to\infty$.
Assume that $X$ is an orthogonal design matrix, i.e. $X^\mathsf{T}X = I$; then
$$\widehat{\beta}_\lambda = (1+\lambda)^{-1}\widehat{\beta}^{\text{ols}}.$$
@freakonometrics 39
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
Set $W_\lambda = (I + \lambda[X^\mathsf{T}X]^{-1})^{-1}$. One can prove that
$$W_\lambda\widehat{\beta}^{\text{ols}} = \widehat{\beta}_\lambda.$$
Thus,
$$\text{Var}[\widehat{\beta}_\lambda] = W_\lambda\text{Var}[\widehat{\beta}^{\text{ols}}]W_\lambda^\mathsf{T}$$
and
$$\text{Var}[\widehat{\beta}_\lambda] = \sigma^2(X^\mathsf{T}X + \lambda I)^{-1}X^\mathsf{T}X[(X^\mathsf{T}X + \lambda I)^{-1}]^\mathsf{T}.$$
Observe that
$$\text{Var}[\widehat{\beta}^{\text{ols}}] - \text{Var}[\widehat{\beta}_\lambda] = \sigma^2 W_\lambda\big[2\lambda(X^\mathsf{T}X)^{-2} + \lambda^2(X^\mathsf{T}X)^{-3}\big]W_\lambda^\mathsf{T} \geq 0.$$
@freakonometrics 40
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
Hence, the confidence ellipsoid of the ridge estimator is indeed smaller than the OLS one.
If $X$ is an orthogonal design matrix,
$$\text{Var}[\widehat{\beta}_\lambda] = \sigma^2(1+\lambda)^{-2}I.$$
$$\text{mse}[\widehat{\beta}_\lambda] = \sigma^2\text{trace}\big(W_\lambda(X^\mathsf{T}X)^{-1}W_\lambda^\mathsf{T}\big) + \beta^\mathsf{T}(W_\lambda - I)^\mathsf{T}(W_\lambda - I)\beta.$$
If $X$ is an orthogonal design matrix,
$$\text{mse}[\widehat{\beta}_\lambda] = \frac{p\sigma^2}{(1+\lambda)^2} + \frac{\lambda^2}{(1+\lambda)^2}\beta^\mathsf{T}\beta$$
@freakonometrics 41
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of the Ridge Estimator
$$\text{mse}[\widehat{\beta}_\lambda] = \frac{p\sigma^2}{(1+\lambda)^2} + \frac{\lambda^2}{(1+\lambda)^2}\beta^\mathsf{T}\beta$$
is minimal for
$$\lambda^\star = \frac{p\sigma^2}{\beta^\mathsf{T}\beta}$$
Note that there exists $\lambda > 0$ such that $\text{mse}[\widehat{\beta}_\lambda] < \text{mse}[\widehat{\beta}_0] = \text{mse}[\widehat{\beta}^{\text{ols}}]$.
@freakonometrics 42
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
SVD decomposition
For any $m\times n$ matrix $A$, there are orthogonal matrices $U$ ($m\times m$), $V$ ($n\times n$) and a "diagonal" matrix $\Sigma$ ($m\times n$) such that $A = U\Sigma V^\mathsf{T}$, or $AV = U\Sigma$.
Hence, there exists a special orthonormal set of vectors (i.e. the columns of $V$) that is mapped by the matrix $A$ into an orthonormal set of vectors (i.e. the columns of $U$).
Let $r = \text{rank}(A)$; then $A = \sum_{i=1}^r \sigma_i u_i v_i^\mathsf{T}$ (called the dyadic decomposition of $A$).
Observe that it can be used to compute (e.g.) the Frobenius norm of $A$, $\|A\| = \sqrt{\sum a_{i,j}^2} = \sqrt{\sigma_1^2 + \cdots + \sigma_{\min\{m,n\}}^2}$.
Further, $A^\mathsf{T}A = V\Sigma^\mathsf{T}\Sigma V^\mathsf{T}$ while $AA^\mathsf{T} = U\Sigma\Sigma^\mathsf{T}U^\mathsf{T}$.
Hence, the $\sigma_i^2$'s are related to the eigenvalues of $A^\mathsf{T}A$ and $AA^\mathsf{T}$, and $u_i$, $v_i$ are the associated eigenvectors.
Golub & Reinsch (1970) Singular Value Decomposition and Least Squares Solutions
@freakonometrics 43
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
SVD decomposition
Consider the singular value decomposition of $X$, $X = UDV^\mathsf{T}$. Then
$$\widehat{\beta}^{\text{ols}} = VD^{-2}D\,U^\mathsf{T}\mathbf{y}\qquad\qquad \widehat{\beta}_\lambda = V(D^2+\lambda I)^{-1}D\,U^\mathsf{T}\mathbf{y}$$
Observe that
$$D_{i,i}^{-1} \geq \frac{D_{i,i}}{D_{i,i}^2 + \lambda}$$
hence, the ridge penalty shrinks singular values.
Set now $R = UD$ (an $n\times n$ matrix), so that $X = RV^\mathsf{T}$, and
$$\widehat{\beta}_\lambda = V(R^\mathsf{T}R + \lambda I)^{-1}R^\mathsf{T}\mathbf{y}$$
@freakonometrics 44
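A short R check of this shrinkage of the singular-value weights (simulated X; lambda is arbitrary):
set.seed(1)
X <- matrix(rnorm(50 * 3), 50, 3)
lambda <- 2
d <- svd(X)$d                                    # singular values D_{i,i}
cbind(ols = 1 / d, ridge = d / (d^2 + lambda))   # ridge weights are uniformly smaller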
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Hat matrix and Degrees of Freedom
Recall that $\widehat{Y} = HY$ with
$$H = X(X^\mathsf{T}X)^{-1}X^\mathsf{T}$$
Similarly
$$H_\lambda = X(X^\mathsf{T}X + \lambda I)^{-1}X^\mathsf{T}$$
$$\text{trace}[H_\lambda] = \sum_{j=1}^p \frac{d_{j,j}^2}{d_{j,j}^2 + \lambda} \to 0,\quad\text{as } \lambda\to\infty.$$
@freakonometrics 45
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Sparsity Issues
In several applications, $k$ can be (very) large, but a lot of features are just noise: $\beta_j = 0$ for many $j$'s. Let $s$ denote the number of relevant features, with $s \ll k$, cf. Hastie, Tibshirani & Wainwright (2015) Statistical Learning with Sparsity,
$$s = \text{card}\{\mathcal{S}\}\quad\text{where}\quad \mathcal{S} = \{j;\ \beta_j \neq 0\}$$
The model is now $\mathbf{y} = X_{\mathcal{S}}^\mathsf{T}\beta_{\mathcal{S}} + \boldsymbol{\varepsilon}$, where $X_{\mathcal{S}}^\mathsf{T}X_{\mathcal{S}}$ is a full-rank matrix.
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
The Ridge regression problem was to solve
$$\widehat{\beta} = \underset{\beta\in\{\|\beta\|_{\ell_2}\leq s\}}{\text{argmin}}\big\{\|Y - X^\mathsf{T}\beta\|_{\ell_2}^2\big\}$$
Define $\|a\|_{\ell_0} = \sum \mathbf{1}(|a_i| > 0)$. Here $\dim(\beta) = k$ but $\|\beta\|_{\ell_0} = s$.
We wish we could solve
$$\widehat{\beta} = \underset{\beta\in\{\|\beta\|_{\ell_0} = s\}}{\text{argmin}}\big\{\|Y - X^\mathsf{T}\beta\|_{\ell_2}^2\big\}$$
Problem: it is usually not possible to describe all possible constraints, since $\binom{k}{s}$ sets of coefficients should be considered here (with $k$ (very) large).
@freakonometrics 47
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
In a convex problem, solve the dual problem, e.g. in the Ridge regression: primal problem
$$\min_{\beta\in\{\|\beta\|_{\ell_2}\leq s\}}\big\{\|Y - X^\mathsf{T}\beta\|_{\ell_2}^2\big\}$$
and the dual problem
$$\min_{\beta\in\{\|Y - X^\mathsf{T}\beta\|_{\ell_2}\leq t\}}\big\{\|\beta\|_{\ell_2}^2\big\}$$
@freakonometrics 48
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
Idea: solve the dual problem
$$\widehat{\beta} = \underset{\beta\in\{\|Y - X^\mathsf{T}\beta\|_{\ell_2}\leq h\}}{\text{argmin}}\big\{\|\beta\|_{\ell_0}\big\}$$
where we might convexify the $\ell_0$ norm, $\|\cdot\|_{\ell_0}$.
@freakonometrics 49
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further on sparsity issues
On $[-1,+1]^k$, the convex hull of $\|\beta\|_{\ell_0}$ is $\|\beta\|_{\ell_1}$.
On $[-a,+a]^k$, the convex hull of $\|\beta\|_{\ell_0}$ is $a^{-1}\|\beta\|_{\ell_1}$.
Hence, why not solve
$$\widehat{\beta} = \underset{\beta;\ \|\beta\|_{\ell_1}\leq \tilde{s}}{\text{argmin}}\big\{\|Y - X^\mathsf{T}\beta\|_{\ell_2}^2\big\}$$
which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem
$$\widehat{\beta} = \text{argmin}\big\{\|Y - X^\mathsf{T}\beta\|_{\ell_2}^2 + \lambda\|\beta\|_{\ell_1}\big\}$$
@freakonometrics 50
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO: Least Absolute Shrinkage and Selection Operator
$$\widehat{\beta} \in \text{argmin}\big\{\|Y - X^\mathsf{T}\beta\|_{\ell_2}^2 + \lambda\|\beta\|_{\ell_1}\big\}$$
is a convex problem (several algorithms), but not strictly convex (no uniqueness of the minimum). Nevertheless, predictions $\widehat{y} = \mathbf{x}^\mathsf{T}\widehat{\beta}$ are unique.
MM (majorize-minimization) and coordinate descent algorithms can be used; see Hunter & Lange (2003) A Tutorial on MM Algorithms.
@freakonometrics 51
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
No explicit solution...
If $\lambda\to 0$, $\widehat{\beta}^{\text{lasso}}_0 = \widehat{\beta}^{\text{ols}}$; if $\lambda\to\infty$, $\widehat{\beta}^{\text{lasso}}_\infty = \mathbf{0}$.
@freakonometrics 52
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
For some $\lambda$, there are $k$'s such that $\widehat{\beta}^{\text{lasso}}_{k,\lambda} = 0$.
Further, $\lambda\mapsto\widehat{\beta}^{\text{lasso}}_{k,\lambda}$ is piecewise linear.
@freakonometrics 53
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression
In the orthogonal case, $X^\mathsf{T}X = I$,
$$\widehat{\beta}^{\text{lasso}}_{k,\lambda} = \text{sign}(\widehat{\beta}^{\text{ols}}_k)\Big(|\widehat{\beta}^{\text{ols}}_k| - \frac{\lambda}{2}\Big)_+$$
i.e. the LASSO estimate is related to the soft-thresholding function...
@freakonometrics 54
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimal LASSO Penalty
Use cross-validation, e.g. $K$-fold,
$$\widehat{\beta}_{(-k)}(\lambda) = \text{argmin}\Big\{\sum_{i\notin\mathcal{I}_k}[y_i - \mathbf{x}_i^\mathsf{T}\beta]^2 + \lambda\|\beta\|_{\ell_1}\Big\}$$
then compute the sum of the squared errors,
$$\mathcal{Q}_k(\lambda) = \sum_{i\in\mathcal{I}_k}[y_i - \mathbf{x}_i^\mathsf{T}\widehat{\beta}_{(-k)}(\lambda)]^2$$
and finally solve
$$\lambda^\star = \text{argmin}\Big\{\overline{\mathcal{Q}}(\lambda) = \frac{1}{K}\sum_k \mathcal{Q}_k(\lambda)\Big\}$$
Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) Elements of Statistical Learning suggest the largest $\lambda$ such that
$$\overline{\mathcal{Q}}(\lambda) \leq \overline{\mathcal{Q}}(\lambda^\star) + \text{se}[\lambda^\star]\quad\text{with}\quad \text{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K [\mathcal{Q}_k(\lambda) - \overline{\mathcal{Q}}(\lambda)]^2$$
@freakonometrics 55
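In R, glmnet implements exactly this kind of selection; a minimal sketch on simulated data (the one-standard-error rule of the slide corresponds to lambda.1se):
library(glmnet)
set.seed(1)
n <- 100; k <- 20
X <- matrix(rnorm(n * k), n, k)
beta <- c(2, -1.5, 1, rep(0, k - 3))              # sparse true coefficients
y <- X %*% beta + rnorm(n)
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # K-fold cross-validation for the LASSO
c(lambda.min = cvfit$lambda.min, lambda.1se = cvfit$lambda.1se)
coef(cvfit, s = "lambda.1se")                     # many coefficients set exactly to 0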
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO and Ridge, with R
> library(glmnet)
> chicago <- read.table("http://freakonometrics.free.fr/chicago.txt", header = TRUE, sep = ";")
> standardize <- function(x) {(x - mean(x)) / sd(x)}
> z0 <- standardize(chicago[, 1])
> z1 <- standardize(chicago[, 3])
> z2 <- standardize(chicago[, 4])
> ridge   <- glmnet(cbind(z1, z2), z0, alpha = 0,  intercept = FALSE, lambda = 1)
> lasso   <- glmnet(cbind(z1, z2), z0, alpha = 1,  intercept = FALSE, lambda = 1)
> elastic <- glmnet(cbind(z1, z2), z0, alpha = .5, intercept = FALSE, lambda = 1)
Elastic net: $\lambda_1\|\beta\|_{\ell_1} + \lambda_2\|\beta\|_{\ell_2}^2$
@freakonometrics 56
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
LASSO Regression, Smoothing and Overfit
LASSO can be used to avoid overfit.
@freakonometrics 57
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty
Define
$$\|a\|_{\ell_0} = \sum_{i=1}^d \mathbf{1}(a_i\neq 0),\qquad \|a\|_{\ell_1} = \sum_{i=1}^d |a_i|\qquad\text{and}\qquad \|a\|_{\ell_2} = \Big(\sum_{i=1}^d a_i^2\Big)^{1/2},\quad\text{for } a\in\mathbb{R}^d.$$
Constrained optimization vs. penalized optimization:
$(\ell_0)$: $\underset{\beta;\ \|\beta\|_{\ell_0}\leq s}{\text{argmin}}\ \sum_{i=1}^n \ell(y_i, \beta_0 + \mathbf{x}^\mathsf{T}\beta)$ vs. $\underset{\beta,\lambda}{\text{argmin}}\ \sum_{i=1}^n \ell(y_i, \beta_0 + \mathbf{x}^\mathsf{T}\beta) + \lambda\|\beta\|_{\ell_0}$
$(\ell_1)$: $\underset{\beta;\ \|\beta\|_{\ell_1}\leq s}{\text{argmin}}\ \sum_{i=1}^n \ell(y_i, \beta_0 + \mathbf{x}^\mathsf{T}\beta)$ vs. $\underset{\beta,\lambda}{\text{argmin}}\ \sum_{i=1}^n \ell(y_i, \beta_0 + \mathbf{x}^\mathsf{T}\beta) + \lambda\|\beta\|_{\ell_1}$
$(\ell_2)$: $\underset{\beta;\ \|\beta\|_{\ell_2}\leq s}{\text{argmin}}\ \sum_{i=1}^n \ell(y_i, \beta_0 + \mathbf{x}^\mathsf{T}\beta)$ vs. $\underset{\beta,\lambda}{\text{argmin}}\ \sum_{i=1}^n \ell(y_i, \beta_0 + \mathbf{x}^\mathsf{T}\beta) + \lambda\|\beta\|_{\ell_2}$
Assume that $\ell$ is the quadratic norm.
@freakonometrics 58
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty
The two problems $(\ell_2)$ are equivalent: $\forall(\beta^\star, s^\star)$ solution of the left problem, $\exists\lambda^\star$ such that $(\beta^\star,\lambda^\star)$ is solution of the right problem. And conversely.
The two problems $(\ell_1)$ are equivalent: $\forall(\beta^\star, s^\star)$ solution of the left problem, $\exists\lambda^\star$ such that $(\beta^\star,\lambda^\star)$ is solution of the right problem. And conversely. Nevertheless, if there is a theoretical equivalence, there might be numerical issues since there is not necessarily uniqueness of the solution.
The two problems $(\ell_0)$ are not equivalent: if $(\beta^\star,\lambda^\star)$ is solution of the right problem, $\exists s^\star$ such that $\beta^\star$ is a solution of the left problem. But the converse is not true.
More generally, consider an $\ell_p$ norm:
• sparsity is obtained when $p\leq 1$
• convexity is obtained when $p\geq 1$
@freakonometrics 59
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty
Foster & George (1994) The Risk Inflation Criterion for Multiple Regression tried to solve directly the penalized problem of $(\ell_0)$.
But it is a complex combinatorial problem in high dimension (Natarajan (1995) Sparse Approximate Solutions to Linear Systems proved that it is an NP-hard problem).
One can prove that if $\lambda\sim\sigma^2\log(p)$, then
$$\mathbb{E}\big([\mathbf{x}^\mathsf{T}\widehat{\beta} - \mathbf{x}^\mathsf{T}\beta_0]^2\big) \leq \underbrace{\mathbb{E}\big([\mathbf{x}_{\mathcal{S}}^\mathsf{T}\widehat{\beta}_{\mathcal{S}} - \mathbf{x}^\mathsf{T}\beta_0]^2\big)}_{=\sigma^2\#\mathcal{S}}\cdot\big(4\log p + 2 + o(1)\big).$$
In that case
$$\widehat{\beta}^{\text{sub}}_{\lambda,j} = \begin{cases}0 & \text{if } j\notin\mathcal{S}_\lambda(\widehat{\beta})\\ \widehat{\beta}^{\text{ols}}_j & \text{if } j\in\mathcal{S}_\lambda(\widehat{\beta}),\end{cases}$$
where $\mathcal{S}_\lambda(\widehat{\beta})$ is the set of non-null values in solutions of $(\ell_0)$.
@freakonometrics 60
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
If $\ell$ is no longer the quadratic norm but $\ell_1$, problem $(\ell_1)$ is not always strictly convex, and the optimum is not always unique (e.g. if $X^\mathsf{T}X$ is singular).
But in the quadratic case, $\ell$ is strictly convex, and at least $X\widehat{\beta}$ is unique.
Further, note that solutions are necessarily coherent (signs of coefficients): it is not possible to have $\widehat{\beta}_j < 0$ for one solution and $\widehat{\beta}_j > 0$ for another one.
In many cases, problem $(\ell_1)$ yields a corner-type solution, which can be seen as a "best subset" solution - like in $(\ell_0)$.
@freakonometrics 61
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty
Consider a simple regression $y_i = x_i\beta + \varepsilon$, with an $\ell_1$-penalty and an $\ell_2$-loss function. $(\ell_1)$ becomes
$$\min\big\{\mathbf{y}^\mathsf{T}\mathbf{y} - 2\mathbf{y}^\mathsf{T}\mathbf{x}\beta + \beta\mathbf{x}^\mathsf{T}\mathbf{x}\beta + 2\lambda|\beta|\big\}$$
The first-order condition can be written
$$-2\mathbf{y}^\mathsf{T}\mathbf{x} + 2\mathbf{x}^\mathsf{T}\mathbf{x}\beta \pm 2\lambda = 0$$
(the sign in $\pm$ being the sign of $\beta$). Assume that the least-squares estimate ($\lambda = 0$) is (strictly) positive, i.e. $\mathbf{y}^\mathsf{T}\mathbf{x} > 0$. If $\lambda$ is not too large, $\widehat{\beta}$ and $\widehat{\beta}^{\text{ols}}$ have the same sign, and
$$-2\mathbf{y}^\mathsf{T}\mathbf{x} + 2\mathbf{x}^\mathsf{T}\mathbf{x}\beta + 2\lambda = 0$$
with solution $\widehat{\beta}^{\text{lasso}}_\lambda = \dfrac{\mathbf{y}^\mathsf{T}\mathbf{x} - \lambda}{\mathbf{x}^\mathsf{T}\mathbf{x}}$.
@freakonometrics 62
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Going further, $\ell_0$, $\ell_1$ and $\ell_2$ penalty
Increase $\lambda$ so that $\widehat{\beta}_\lambda = 0$. Increase slightly more: $\widehat{\beta}_\lambda$ cannot become negative, because the sign of the first-order condition would change, and we should then solve
$$-2\mathbf{y}^\mathsf{T}\mathbf{x} + 2\mathbf{x}^\mathsf{T}\mathbf{x}\beta - 2\lambda = 0$$
with solution $\widehat{\beta}^{\text{lasso}}_\lambda = \dfrac{\mathbf{y}^\mathsf{T}\mathbf{x} + \lambda}{\mathbf{x}^\mathsf{T}\mathbf{x}}$. But that solution is positive (we assumed that $\mathbf{y}^\mathsf{T}\mathbf{x} > 0$), whereas we should have $\widehat{\beta}_\lambda < 0$.
Thus, at some point $\widehat{\beta}_\lambda = 0$, which is a corner solution.
In higher dimension, see Tibshirani & Wasserman (2016) A Closer Look at Sparse Regression or Candès & Plan (2009) Near-Ideal Model Selection by $\ell_1$ Minimization.
Under some additional technical assumptions, the LASSO estimator is "sparsistent", in the sense that the support of $\widehat{\beta}^{\text{lasso}}_\lambda$ is the same as that of $\beta$.
Thus, LASSO can be used for variable selection (see Hastie et al. (2001) The
@freakonometrics 63
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Elements of Statistical Learning).
Generally, $\widehat{\beta}^{\text{lasso}}_\lambda$ is a biased estimator, but its variance can be small enough to have a smaller mean squared error than the OLS estimate.
With orthonormal covariates, one can prove that
$$\widehat{\beta}^{\text{sub}}_{\lambda,j} = \widehat{\beta}^{\text{ols}}_j\,\mathbf{1}_{|\widehat{\beta}^{\text{ols}}_j|>b},\qquad \widehat{\beta}^{\text{ridge}}_{\lambda,j} = \frac{\widehat{\beta}^{\text{ols}}_j}{1+\lambda}\qquad\text{and}\qquad \widehat{\beta}^{\text{lasso}}_{\lambda,j} = \text{sign}[\widehat{\beta}^{\text{ols}}_j]\cdot(|\widehat{\beta}^{\text{ols}}_j| - \lambda)_+.$$
@freakonometrics 64
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
First idea: given some initial guess $\beta_{(0)}$, $|\beta| \sim |\beta_{(0)}| + \dfrac{1}{2|\beta_{(0)}|}\big(\beta^2 - \beta_{(0)}^2\big)$
The LASSO estimate can probably be derived from iterated Ridge estimates:
$$\|\mathbf{y} - X\beta_{(k+1)}\|_{\ell_2}^2 + \lambda\|\beta_{(k+1)}\|_{\ell_1} \sim \|\mathbf{y} - X\beta_{(k+1)}\|_{\ell_2}^2 + \frac{\lambda}{2}\sum_j \frac{1}{|\beta_{j,(k)}|}[\beta_{j,(k+1)}]^2$$
which is a weighted ridge penalty function. Thus,
$$\beta_{(k+1)} = \big(X^\mathsf{T}X + \lambda\Delta_{(k)}\big)^{-1}X^\mathsf{T}\mathbf{y}$$
where $\Delta_{(k)} = \text{diag}[|\beta_{j,(k)}|^{-1}]$. Then $\beta_{(k)}\to\widehat{\beta}^{\text{lasso}}$, as $k\to\infty$.
@freakonometrics 65
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Properties of LASSO Estimate
From this iterative technique
$$\widehat{\beta}^{\text{lasso}}_\lambda \sim \big(X^\mathsf{T}X + \lambda\Delta\big)^{-1}X^\mathsf{T}\mathbf{y}$$
where $\Delta = \text{diag}[|\widehat{\beta}^{\text{lasso}}_{j,\lambda}|^{-1}]$ if $\widehat{\beta}^{\text{lasso}}_{j,\lambda}\neq 0$, and $0$ otherwise.
Thus,
$$\mathbb{E}[\widehat{\beta}^{\text{lasso}}_\lambda] \sim \big(X^\mathsf{T}X + \lambda\Delta\big)^{-1}X^\mathsf{T}X\beta$$
and
$$\text{Var}[\widehat{\beta}^{\text{lasso}}_\lambda] \sim \sigma^2\big(X^\mathsf{T}X + \lambda\Delta\big)^{-1}X^\mathsf{T}X\big[\big(X^\mathsf{T}X + \lambda\Delta\big)^{-1}\big]^\mathsf{T}$$
@freakonometrics 66
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Consider here a simplified problem, $\min_{a\in\mathbb{R}}\big\{\underbrace{\tfrac{1}{2}(a-b)^2 + \lambda|a|}_{g(a)}\big\}$, with $\lambda > 0$.
Observe that $g'(0^{\pm}) = -b \pm\lambda$. Then
• if $|b|\leq\lambda$, then $a^\star = 0$
• if $b\geq\lambda$, then $a^\star = b - \lambda$
• if $b\leq-\lambda$, then $a^\star = b + \lambda$
$$a^\star = \underset{a\in\mathbb{R}}{\text{argmin}}\Big\{\frac{1}{2}(a-b)^2 + \lambda|a|\Big\} = S_\lambda(b) = \text{sign}(b)\cdot(|b|-\lambda)_+,$$
also called the soft-thresholding operator.
@freakonometrics 67
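This operator is a one-liner in R; a quick sketch and a sanity check against a grid search (the values of b and lambda are arbitrary):
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
b <- 1.3; lambda <- 0.5
soft_threshold(b, lambda)                              # closed-form minimizer: 0.8
a_grid <- seq(-3, 3, by = 1e-4)
a_grid[which.min(0.5 * (a_grid - b)^2 + lambda * abs(a_grid))]   # numerical check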
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
Definition. For any convex function $h$, define the proximal operator of $h$,
$$\text{proximal}_h(\mathbf{y}) = \underset{\mathbf{x}\in\mathbb{R}^d}{\text{argmin}}\Big\{\frac{1}{2}\|\mathbf{x}-\mathbf{y}\|_{\ell_2}^2 + h(\mathbf{x})\Big\}$$
Note that
$$\text{proximal}_{\lambda\|\cdot\|_{\ell_2}^2}(\mathbf{y}) = \frac{1}{1+\lambda}\,\mathbf{y}\ \text{(shrinkage operator)}\qquad\qquad \text{proximal}_{\lambda\|\cdot\|_{\ell_1}}(\mathbf{y}) = S_\lambda(\mathbf{y}) = \text{sign}(\mathbf{y})\cdot(|\mathbf{y}|-\lambda)_+$$
@freakonometrics 68
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
We want to solve here
$$\widehat{\theta} \in \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\Big\{\underbrace{\frac{1}{n}\|\mathbf{y} - m_\theta(\mathbf{x})\|_{\ell_2}^2}_{f(\theta)} + \underbrace{\lambda\cdot\text{penalty}(\theta)}_{g(\theta)}\Big\},$$
where $f$ is convex and smooth, and $g$ is convex, but not smooth...
1. Focus on $f$: descent lemma, $\forall\theta,\theta'$,
$$f(\theta) \leq f(\theta') + \langle\nabla f(\theta'),\ \theta-\theta'\rangle + \frac{t}{2}\|\theta-\theta'\|_{\ell_2}^2$$
Consider a gradient descent sequence $\theta_k$, i.e. $\theta_{k+1} = \theta_k - t^{-1}\nabla f(\theta_k)$; then
$$f(\theta) \leq \underbrace{f(\theta_k) + \langle\nabla f(\theta_k),\ \theta-\theta_k\rangle + \frac{t}{2}\|\theta-\theta_k\|_{\ell_2}^2}_{\varphi(\theta):\ \theta_{k+1}=\text{argmin}\{\varphi(\theta)\}}$$
@freakonometrics 69
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Optimization Heuristics
2. Add function $g$:
$$f(\theta)+g(\theta) \leq \underbrace{f(\theta_k) + \langle\nabla f(\theta_k),\ \theta-\theta_k\rangle + \frac{t}{2}\|\theta-\theta_k\|_{\ell_2}^2 + g(\theta)}_{\psi(\theta)}$$
And one can prove that
$$\theta_{k+1} = \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\{\psi(\theta)\} = \text{proximal}_{g/t}\big(\theta_k - t^{-1}\nabla f(\theta_k)\big)$$
the so-called proximal gradient descent algorithm, since
$$\text{argmin}\{\psi(\theta)\} = \text{argmin}\Big\{\frac{t}{2}\big\|\theta - \big(\theta_k - t^{-1}\nabla f(\theta_k)\big)\big\|_{\ell_2}^2 + g(\theta)\Big\}$$
@freakonometrics 70
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
Consider some convex differentiable function $f:\mathbb{R}^k\to\mathbb{R}$.
Consider $\mathbf{x}^\star\in\mathbb{R}^k$ obtained by minimizing along each coordinate axis, i.e.
$$f(x_1^\star,\cdots,x_{i-1}^\star,\ x_i,\ x_{i+1}^\star,\cdots,x_k^\star) \geq f(x_1^\star,\cdots,x_{i-1}^\star,\ x_i^\star,\ x_{i+1}^\star,\cdots,x_k^\star)$$
for all $i$. Is $\mathbf{x}^\star$ a global minimizer, i.e. $f(\mathbf{x}) \geq f(\mathbf{x}^\star)$, $\forall\mathbf{x}\in\mathbb{R}^k$?
Yes, if $f$ is convex and differentiable, since
$$\nabla f(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^\star} = \Big(\frac{\partial f(\mathbf{x})}{\partial x_1},\cdots,\frac{\partial f(\mathbf{x})}{\partial x_k}\Big) = \mathbf{0}$$
There might be a problem if $f$ is not differentiable (except in each axis direction).
If $f(\mathbf{x}) = g(\mathbf{x}) + \sum_{i=1}^k h_i(x_i)$ with $g$ convex and differentiable, yes, since
$$f(\mathbf{x}) - f(\mathbf{x}^\star) \geq \nabla g(\mathbf{x}^\star)^\mathsf{T}(\mathbf{x}-\mathbf{x}^\star) + \sum_i [h_i(x_i) - h_i(x_i^\star)]$$
@freakonometrics 71
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Coordinate-wise minimization
$$f(\mathbf{x}) - f(\mathbf{x}^\star) \geq \sum_i \underbrace{\big[\nabla_i g(\mathbf{x}^\star)^\mathsf{T}(x_i-x_i^\star) + h_i(x_i) - h_i(x_i^\star)\big]}_{\geq 0} \geq 0$$
Thus, for functions $f(\mathbf{x}) = g(\mathbf{x}) + \sum_{i=1}^k h_i(x_i)$ we can use coordinate descent to find a minimizer, i.e. at step $j$
$$x_1^{(j)} \in \underset{x_1}{\text{argmin}}\ f(x_1, x_2^{(j-1)}, x_3^{(j-1)},\cdots,x_k^{(j-1)})$$
$$x_2^{(j)} \in \underset{x_2}{\text{argmin}}\ f(x_1^{(j)}, x_2, x_3^{(j-1)},\cdots,x_k^{(j-1)})$$
$$x_3^{(j)} \in \underset{x_3}{\text{argmin}}\ f(x_1^{(j)}, x_2^{(j)}, x_3,\cdots,x_k^{(j-1)})$$
Tseng (2001) Convergence of a Block Coordinate Descent Method: if $f$ is continuous, then $\mathbf{x}^\infty$ is a minimizer of $f$.
@freakonometrics 72
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application in Linear Regression
Let $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{y} - A\mathbf{x}\|^2$, with $\mathbf{y}\in\mathbb{R}^n$ and $A\in\mathcal{M}_{n\times k}$. Let $A = [A_1,\cdots,A_k]$.
Let us minimize in direction $i$. Let $\mathbf{x}_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here
$$0 = \frac{\partial f(\mathbf{x})}{\partial x_i} = A_i^\mathsf{T}[A\mathbf{x} - \mathbf{y}] = A_i^\mathsf{T}[A_i x_i + A_{-i}\mathbf{x}_{-i} - \mathbf{y}]$$
thus, the optimal value is here
$$x_i^\star = \frac{A_i^\mathsf{T}[\mathbf{y} - A_{-i}\mathbf{x}_{-i}]}{A_i^\mathsf{T}A_i}$$
@freakonometrics 73
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to LASSO
Let $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{y} - A\mathbf{x}\|^2 + \lambda\|\mathbf{x}\|_{\ell_1}$, so that the non-differentiable part is separable, since $\|\mathbf{x}\|_{\ell_1} = \sum_{i=1}^k |x_i|$.
Let us minimize in direction $i$. Let $\mathbf{x}_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here
$$0 = \frac{\partial f(\mathbf{x})}{\partial x_i} = A_i^\mathsf{T}[A_i x_i + A_{-i}\mathbf{x}_{-i} - \mathbf{y}] + \lambda s_i$$
where $s_i\in\partial|x_i|$. Thus, the solution is obtained by soft-thresholding
$$x_i^\star = S_{\lambda/\|A_i\|^2}\Big(\frac{A_i^\mathsf{T}[\mathbf{y} - A_{-i}\mathbf{x}_{-i}]}{A_i^\mathsf{T}A_i}\Big)$$
@freakonometrics 74
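A compact R sketch of this coordinate-descent LASSO, directly following the update above (simulated data; the number of sweeps and lambda are arbitrary, and a production implementation such as glmnet is far more refined):
set.seed(1)
n <- 100; k <- 5
A <- scale(matrix(rnorm(n * k), n, k))
y <- A[, 1] - 2 * A[, 2] + rnorm(n)
soft <- function(b, t) sign(b) * pmax(abs(b) - t, 0)
lambda <- 20
x <- rep(0, k)
for (sweep in 1:100) {                       # cyclic coordinate descent
  for (i in 1:k) {
    r_i <- y - A[, -i] %*% x[-i]             # partial residual, y - A_{-i} x_{-i}
    x[i] <- soft(sum(A[, i] * r_i) / sum(A[, i]^2), lambda / sum(A[, i]^2))
  }
}
round(x, 3)                                  # sparse solution: some coordinates exactly 0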
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Convergence rate for LASSO
Let $f(\mathbf{x}) = g(\mathbf{x}) + \lambda\|\mathbf{x}\|_{\ell_1}$ with
• $g$ convex, $\nabla g$ Lipschitz with constant $L > 0$, and $\text{Id} - \nabla g/L$ monotone increasing in each component
• there exists $\mathbf{z}$ such that, componentwise, either $\mathbf{z} \geq S_\lambda(\mathbf{z} - \nabla g(\mathbf{z}))$ or $\mathbf{z} \leq S_\lambda(\mathbf{z} - \nabla g(\mathbf{z}))$
Saha & Tewari (2010) On the Finite Time Convergence of Cyclic Coordinate Descent Methods proved that a coordinate descent starting from $\mathbf{z}$ satisfies
$$f(\mathbf{x}^{(j)}) - f(\mathbf{x}^\star) \leq \frac{L\|\mathbf{z} - \mathbf{x}^\star\|^2}{2j}$$
@freakonometrics 75
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Lasso for Autoregressive Time Series
Consider some AR($p$) autoregressive time series,
$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_{p-1}X_{t-p+1} + \phi_p X_{t-p} + \varepsilon_t,$$
for some white noise $(\varepsilon_t)$, with a causal-type representation. Write $\mathbf{y} = \mathbf{x}^\mathsf{T}\boldsymbol{\phi} + \boldsymbol{\varepsilon}$.
The LASSO estimator $\widehat{\boldsymbol{\phi}}$ is a minimizer of
$$\frac{1}{2T}\|\mathbf{y} - \mathbf{x}^\mathsf{T}\boldsymbol{\phi}\|^2 + \lambda\sum_{i=1}^p \lambda_i|\phi_i|,$$
for some tuning parameters $(\lambda, \lambda_1,\cdots,\lambda_p)$.
See Nardi & Rinaldo (2011).
@freakonometrics 76
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Graphical Lasso and Covariance Estimation
We want to estimate an (unknown) covariance matrix $\Sigma$, or $\Sigma^{-1}$.
An estimate for $\Sigma^{-1}$ is $\widehat{\Theta}$, solution of
$$\widehat{\Theta} \in \underset{\Theta\in\mathcal{M}_{k\times k}}{\text{argmin}}\big\{-\log[\det(\Theta)] + \text{trace}[S\Theta] + \lambda\|\Theta\|_{\ell_1}\big\}\quad\text{where}\quad S = \frac{X^\mathsf{T}X}{n}$$
and where $\|\Theta\|_{\ell_1} = \sum|\Theta_{i,j}|$.
See van Wieringen (2016) Undirected Network Reconstruction from High-Dimensional Data and https://github.com/kaizhang/glasso
@freakonometrics 77
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Application to Network Simplification
Can be applied to networks, to spot ‘significant’ connections...
Source: http://khughitt.github.io/graphical-lasso/
@freakonometrics 78
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Extension of Penalization Techniques
In a more general context, we want to solve
$$\widehat{\theta} \in \underset{\theta\in\mathbb{R}^d}{\text{argmin}}\Big\{\frac{1}{n}\sum_{i=1}^n \ell\big(y_i, m_\theta(\mathbf{x}_i)\big) + \lambda\cdot\text{penalty}(\theta)\Big\}.$$
The quadratic loss function was related to the Gaussian case, but many more alternatives can be considered...
@freakonometrics 79
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear models, nonlinear models and GLMs
Linear model
• $(Y|X=x) \sim \mathcal{N}(\theta_x, \sigma^2)$
• $\mathbb{E}[Y|X=x] = \theta_x = \mathbf{x}^\mathsf{T}\beta$
> fit <- lm(y ~ x, data = df)
@freakonometrics 80
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear models, nonlinear models and GLMs
Nonlinear models
• $(Y|X=x) \sim \mathcal{N}(\theta_x, \sigma^2)$
• $\mathbb{E}[Y|X=x] = \theta_x = m(x)$
> library(splines)
> fit <- lm(y ~ poly(x, k), data = df)
> fit <- lm(y ~ bs(x), data = df)
@freakonometrics 81
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Linear models, nonlinear models and GLMs
Generalized Linear Models
• $(Y|X=x) \sim \mathcal{L}(\theta_x, \varphi)$
• $\mathbb{E}[Y|X=x] = h^{-1}(\theta_x) = h^{-1}(\mathbf{x}^\mathsf{T}\beta)$
> fit <- glm(y ~ x, data = df,
+            family = poisson(link = "log"))
@freakonometrics 82
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The Exponential Family
Consider distributions with parameter $\theta$ (and $\varphi$) with density (with respect to the appropriate measure, on $\mathbb{N}$ or on $\mathbb{R}$)
$$f(y|\theta,\varphi) = \exp\Big(\frac{y\theta - b(\theta)}{a(\varphi)} + c(y,\varphi)\Big),$$
where $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$ are functions, and where $\theta$ is called the canonical parameter.
$\theta$ is the quantity of interest, while $\varphi$ is a nuisance parameter.
@freakonometrics 83
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The Exponential Family
Example: The Gaussian distribution with mean $\mu$ and variance $\sigma^2$, $\mathcal{N}(\mu,\sigma^2)$, belongs to that family: $\theta = \mu$, $\varphi = \sigma^2$, $a(\varphi) = \varphi$, $b(\theta) = \theta^2/2$ and
$$c(y,\varphi) = -\frac{1}{2}\Big(\frac{y^2}{\sigma^2} + \log(2\pi\sigma^2)\Big),\quad y\in\mathbb{R}.$$
Example: The Bernoulli distribution with mean $\pi$, $\mathcal{B}(\pi)$, is obtained with $\theta = \log\{p/(1-p)\}$, $a(\varphi) = 1$, $b(\theta) = \log(1+\exp(\theta))$, $\varphi = 1$ and $c(y,\varphi) = 0$.
Example: The binomial distribution with mean $n\pi$, $\mathcal{B}(n,\pi)$, is obtained with $\theta = \log\{p/(1-p)\}$, $a(\varphi) = 1$, $b(\theta) = n\log(1+\exp(\theta))$, $\varphi = 1$ and $c(y,\varphi) = \log\binom{n}{y}$.
Example: The Poisson distribution with mean $\lambda$, $\mathcal{P}(\lambda)$, belongs to that family,
$$f(y|\lambda) = \exp(-\lambda)\frac{\lambda^y}{y!} = \exp\big(y\log\lambda - \lambda - \log y!\big),\quad y\in\mathbb{N},$$
with $\theta = \log\lambda$, $\varphi = 1$, $a(\varphi) = 1$, $b(\theta) = \exp\theta = \lambda$ and $c(y,\varphi) = -\log y!$.
@freakonometrics 84
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The Exponential Family
Example: The Negative Binomial distribution with parameters $r$ and $p$,
$$f(y|r,p) = \binom{y+r-1}{y}(1-p)^r p^y,\quad y\in\mathbb{N},$$
can be written, equivalently,
$$f(y|r,p) = \exp\Big(y\log p + r\log(1-p) + \log\binom{y+r-1}{y}\Big)$$
i.e. $\theta = \log p$, $b(\theta) = -r\log(1-e^{\theta})$ and $a(\varphi) = 1$.
@freakonometrics 85
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
The Exponential Family
Example: The Gamma distribution, with mean $\mu$ and variance $\mu^2\nu^{-1}$,
$$f(y|\mu,\nu) = \frac{1}{\Gamma(\nu)}\Big(\frac{\nu}{\mu}\Big)^{\nu}y^{\nu-1}\exp\Big(-\frac{\nu}{\mu}y\Big),\quad y\in\mathbb{R}_+,$$
is also in the exponential family: $\theta = -\dfrac{1}{\mu}$, $a(\varphi) = \varphi$, $b(\theta) = -\log(-\theta)$, $\varphi = \nu^{-1}$ and
$$c(y,\varphi) = \Big(\frac{1}{\varphi}-1\Big)\log(y) - \log\Gamma\Big(\frac{1}{\varphi}\Big)$$
@freakonometrics 86
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
Mean and Variance
Let $Y$ be a random variable in the exponential family. Then
$$\mathbb{E}(Y) = b'(\theta)\quad\text{and}\quad \text{Var}(Y) = b''(\theta)\varphi,$$
i.e. the variance of $Y$ is the product of two quantities:
• $b''(\theta)$ is a function of $\theta$ (only) and is called the variance function,
• a function of $\varphi$.
Observe that $\mu = \mathbb{E}(Y)$, hence parameter $\theta$ is related to the mean $\mu$. Hence, the variance function is a function of $\mu$, and can be denoted
$$V(\mu) = b''\big([b']^{-1}(\mu)\big)\varphi.$$
Example: In the Gaussian case, $V(\mu) = 1$, while for the Poisson distribution $V(\mu) = \mu$, and for the Gamma one $V(\mu) = \mu^2$.
@freakonometrics 87
Arthur CHARPENTIER, Advanced Econometrics Graduate Course
From the Exponential Family to GLMs
Consider independent random variables $Y_1, Y_2,\ldots,Y_n$ such that
$$f(y_i|\theta_i,\varphi) = \exp\Big(\frac{y_i\theta_i - b(\theta_i)}{a(\varphi)} + c(y_i,\varphi)\Big)$$
so that the likelihood can be written
$$\mathcal{L}(\theta,\varphi|\mathbf{y}) = \prod_{i=1}^n f(y_i|\theta_i,\varphi) = \exp\Big(\frac{\sum_{i=1}^n y_i\theta_i - \sum_{i=1}^n b(\theta_i)}{a(\varphi)} + \sum_{i=1}^n c(y_i,\varphi)\Big)$$
or
$$\log\mathcal{L}(\beta) = \sum_{i=1}^n \frac{y_i\theta_i - b(\theta_i)}{a(\varphi)}$$
(up to an additive constant...)
@freakonometrics 88

Contenu connexe

Tendances (20)

Side 2019 #10
Side 2019 #10Side 2019 #10
Side 2019 #10
 
Side 2019 #3
Side 2019 #3Side 2019 #3
Side 2019 #3
 
transformations and nonparametric inference
transformations and nonparametric inferencetransformations and nonparametric inference
transformations and nonparametric inference
 
Side 2019 #4
Side 2019 #4Side 2019 #4
Side 2019 #4
 
Varese italie #2
Varese italie #2Varese italie #2
Varese italie #2
 
Lausanne 2019 #4
Lausanne 2019 #4Lausanne 2019 #4
Lausanne 2019 #4
 
Hands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive ModelingHands-On Algorithms for Predictive Modeling
Hands-On Algorithms for Predictive Modeling
 
Reinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and FinanceReinforcement Learning in Economics and Finance
Reinforcement Learning in Economics and Finance
 
Machine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & InsuranceMachine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & Insurance
 
Side 2019 #12
Side 2019 #12Side 2019 #12
Side 2019 #12
 
Econometrics, PhD Course, #1 Nonlinearities
Econometrics, PhD Course, #1 NonlinearitiesEconometrics, PhD Course, #1 Nonlinearities
Econometrics, PhD Course, #1 Nonlinearities
 
Lausanne 2019 #1
Lausanne 2019 #1Lausanne 2019 #1
Lausanne 2019 #1
 
Side 2019 #8
Side 2019 #8Side 2019 #8
Side 2019 #8
 
Slides univ-van-amsterdam
Slides univ-van-amsterdamSlides univ-van-amsterdam
Slides univ-van-amsterdam
 
Slides econometrics-2017-graduate-2
Slides econometrics-2017-graduate-2Slides econometrics-2017-graduate-2
Slides econometrics-2017-graduate-2
 
Slides sales-forecasting-session2-web
Slides sales-forecasting-session2-webSlides sales-forecasting-session2-web
Slides sales-forecasting-session2-web
 
Slides econ-lm
Slides econ-lmSlides econ-lm
Slides econ-lm
 
Graduate Econometrics Course, part 4, 2017
Graduate Econometrics Course, part 4, 2017Graduate Econometrics Course, part 4, 2017
Graduate Econometrics Course, part 4, 2017
 
Slides lln-risques
Slides lln-risquesSlides lln-risques
Slides lln-risques
 
Varese italie #2
Varese italie #2Varese italie #2
Varese italie #2
 

Similaire à Slides econometrics-2018-graduate-3

Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...Asma Ben Slimene
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Asma Ben Slimene
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataArthur Charpentier
 
Diagnostic methods for Building the regression model
Diagnostic methods for Building the regression modelDiagnostic methods for Building the regression model
Diagnostic methods for Building the regression modelMehdi Shayegani
 
The axiomatic power of Kolmogorov complexity
The axiomatic power of Kolmogorov complexity The axiomatic power of Kolmogorov complexity
The axiomatic power of Kolmogorov complexity lbienven
 
Discrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec domsDiscrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec domsBabasab Patil
 
statistics linear progression
statistics linear progressionstatistics linear progression
statistics linear progressionatif muhammad
 
Bayes Classification
Bayes ClassificationBayes Classification
Bayes Classificationsathish sak
 

Similaire à Slides econometrics-2018-graduate-3 (20)

Econometrics 2017-graduate-3
Econometrics 2017-graduate-3Econometrics 2017-graduate-3
Econometrics 2017-graduate-3
 
Slides ACTINFO 2016
Slides ACTINFO 2016Slides ACTINFO 2016
Slides ACTINFO 2016
 
Slides econ-lm
Slides econ-lmSlides econ-lm
Slides econ-lm
 
Slides ub-3
Slides ub-3Slides ub-3
Slides ub-3
 
Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...Research internship on optimal stochastic theory with financial application u...
Research internship on optimal stochastic theory with financial application u...
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...
 
Lesson 26
Lesson 26Lesson 26
Lesson 26
 
AI Lesson 26
AI Lesson 26AI Lesson 26
AI Lesson 26
 
Predictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big dataPredictive Modeling in Insurance in the context of (possibly) big data
Predictive Modeling in Insurance in the context of (possibly) big data
 
Diagnostic methods for Building the regression model
Diagnostic methods for Building the regression modelDiagnostic methods for Building the regression model
Diagnostic methods for Building the regression model
 
Slides ensae-2016-9
Slides ensae-2016-9Slides ensae-2016-9
Slides ensae-2016-9
 
Slides ub-2
Slides ub-2Slides ub-2
Slides ub-2
 
Slides ensae 9
Slides ensae 9Slides ensae 9
Slides ensae 9
 
The axiomatic power of Kolmogorov complexity
The axiomatic power of Kolmogorov complexity The axiomatic power of Kolmogorov complexity
The axiomatic power of Kolmogorov complexity
 
Discrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec domsDiscrete and continuous probability distributions ppt @ bec doms
Discrete and continuous probability distributions ppt @ bec doms
 
statistics linear progression
statistics linear progressionstatistics linear progression
statistics linear progression
 
Slides Bank England
Slides Bank EnglandSlides Bank England
Slides Bank England
 
Bayes Classification
Bayes ClassificationBayes Classification
Bayes Classification
 
Regression
RegressionRegression
Regression
 
Lec13_Bayes.pptx
Lec13_Bayes.pptxLec13_Bayes.pptx
Lec13_Bayes.pptx
 

Plus de Arthur Charpentier

Plus de Arthur Charpentier (14)

Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
ACT6100 introduction
ACT6100 introductionACT6100 introduction
ACT6100 introduction
 
Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)Family History and Life Insurance (UConn actuarial seminar)
Family History and Life Insurance (UConn actuarial seminar)
 
Control epidemics
Control epidemics Control epidemics
Control epidemics
 
STT5100 Automne 2020, introduction
STT5100 Automne 2020, introductionSTT5100 Automne 2020, introduction
STT5100 Automne 2020, introduction
 
Family History and Life Insurance
Family History and Life InsuranceFamily History and Life Insurance
Family History and Life Insurance
 
Optimal Control and COVID-19
Optimal Control and COVID-19Optimal Control and COVID-19
Optimal Control and COVID-19
 
Slides OICA 2020
Slides OICA 2020Slides OICA 2020
Slides OICA 2020
 
Lausanne 2019 #3
Lausanne 2019 #3Lausanne 2019 #3
Lausanne 2019 #3
 
Lausanne 2019 #2
Lausanne 2019 #2Lausanne 2019 #2
Lausanne 2019 #2
 
Side 2019 #11
Side 2019 #11Side 2019 #11
Side 2019 #11
 
Pareto Models, Slides EQUINEQ
Pareto Models, Slides EQUINEQPareto Models, Slides EQUINEQ
Pareto Models, Slides EQUINEQ
 
Econ. Seminar Uqam
Econ. Seminar UqamEcon. Seminar Uqam
Econ. Seminar Uqam
 
Mutualisation et Segmentation
Mutualisation et SegmentationMutualisation et Segmentation
Mutualisation et Segmentation
 

Dernier

The Triple Threat | Article on Global Resession | Harsh Kumar
The Triple Threat | Article on Global Resession | Harsh KumarThe Triple Threat | Article on Global Resession | Harsh Kumar
The Triple Threat | Article on Global Resession | Harsh KumarHarsh Kumar
 
Economic Risk Factor Update: April 2024 [SlideShare]
Economic Risk Factor Update: April 2024 [SlideShare]Economic Risk Factor Update: April 2024 [SlideShare]
Economic Risk Factor Update: April 2024 [SlideShare]Commonwealth
 
Managing Finances in a Small Business (yes).pdf
Managing Finances  in a Small Business (yes).pdfManaging Finances  in a Small Business (yes).pdf
Managing Finances in a Small Business (yes).pdfmar yame
 
Kempen ' UK DB Endgame Paper Apr 24 final3.pdf
Kempen ' UK DB Endgame Paper Apr 24 final3.pdfKempen ' UK DB Endgame Paper Apr 24 final3.pdf
Kempen ' UK DB Endgame Paper Apr 24 final3.pdfHenry Tapper
 
Tenets of Physiocracy History of Economic
Tenets of Physiocracy History of EconomicTenets of Physiocracy History of Economic
Tenets of Physiocracy History of Economiccinemoviesu
 
Call Girls Near Golden Tulip Essential Hotel, New Delhi 9873777170
Slides econometrics-2018-graduate-3

  • 9. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Bayes vs. Frequentist, inference on heads/tails. Example: out of 1,047 contracts, 159 claimed a loss. [Figure: distribution of the number of insured claiming a loss, with the (true) Binomial distribution, its Poisson approximation and its Gaussian approximation.] @freakonometrics 9
  • 10. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Bayes's theorem. Consider some hypothesis H and some evidence E, then $P_E(H) = P(H|E) = \dfrac{P(H\cap E)}{P(E)} = \dfrac{P(H)\,P(E|H)}{P(E)}$. Bayes' rule contrasts the prior probability $P(H)$ with the posterior probability after receiving evidence E, $P_E(H) = P(H|E)$. In Bayesian (parametric) statistics, $H=\{\theta\in\Theta\}$ and $E=\{X=x\}$, and Bayes' theorem reads $\pi(\theta|x) = \dfrac{\pi(\theta)\,f(x|\theta)}{f(x)} = \dfrac{\pi(\theta)\,f(x|\theta)}{\int f(x|\theta)\,\pi(\theta)\,d\theta} \propto \pi(\theta)\,f(x|\theta)$. @freakonometrics 10
  • 11. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Small Data and Black Swans. Consider the sample $x=\{0,0,0,0,0\}$. Here the likelihood is $f(x_i|\theta)=\theta^{x_i}[1-\theta]^{1-x_i}$, so $f(x|\theta)=\theta^{x^T 1}[1-\theta]^{n-x^T 1}$, and we need an a priori distribution $\pi(\cdot)$, e.g. a beta distribution $\pi(\theta)=\dfrac{\theta^{\alpha}[1-\theta]^{\beta}}{B(\alpha,\beta)}$, leading to the posterior $\pi(\theta|x)=\dfrac{\theta^{\alpha+x^T 1}[1-\theta]^{\beta+n-x^T 1}}{B(\alpha+x^T 1,\beta+n-x^T 1)}$. [Figure: prior and posterior densities on [0, 1].] @freakonometrics 11
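A minimal R sketch of this Beta-Bernoulli update (the uniform prior alpha = beta = 1 and the all-zero sample are illustrative assumptions, not values from the slides):
xs <- c(0, 0, 0, 0, 0)
alpha <- 1; beta <- 1                  # assumed prior hyperparameters
a_post <- alpha + sum(xs)              # alpha + x'1
b_post <- beta + length(xs) - sum(xs)  # beta + n - x'1
curve(dbeta(x, a_post, b_post), from = 0, to = 1,
      xlab = expression(theta), ylab = "posterior density")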
  • 12. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. On Bayesian Philosophy, Confidence vs. Credibility. For frequentists, a probability is a measure of the frequency of repeated events, so parameters are fixed (but unknown) and data are random. For Bayesians, a probability is a measure of the degree of certainty about values, so parameters are random and data are fixed. “Bayesians: Given our observed data, there is a 95% probability that the true value of θ falls within the credible region. vs. Frequentists: There is a 95% probability that when I compute a confidence interval from data of this sort, the true value of θ will fall within it.” in Vanderplas (2014). Example: see Jaynes (1976), e.g. the truncated exponential. @freakonometrics 12
  • 13. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. On Bayesian Philosophy, Confidence vs. Credibility. Example: What is a 95% confidence interval of a proportion? Here x = 159 and n = 1047. 1. draw sets (˜x1, · · · , ˜xn)k with Xi ∼ B(x/n); 2. compute for each set of values a confidence interval; 3. determine the fraction of these confidence intervals that contain x. The parameter is fixed, and we guarantee that 95% of the confidence intervals will contain it. [Figure: simulated confidence intervals around x.] @freakonometrics 13
  • 14. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. On Bayesian Philosophy, Confidence vs. Credibility. Example: What is a 95% credible region of a proportion? Here x = 159 and n = 1047. 1. draw random parameters pk from the posterior distribution π(·|x); 2. sample sets (˜x1, · · · , ˜xn)k with Xi,k ∼ B(pk); 3. compute for each set of values the mean xk; 4. look at the proportion of those xk that are within the credible region [Π−1(.025|x); Π−1(.975|x)]. The credible region is fixed, and we guarantee that 95% of possible values of x will fall within it. [Figure: simulated means against the credible region.] @freakonometrics 14
  • 15. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Occam’s Razor The “law of parsimony”, “pluralitas non est ponenda sine necessitate” Penalize too complex models @freakonometrics 15
  • 16. Arthur CHARPENTIER, Advanced Econometrics Graduate Course James & Stein Estimator Let X ∼ N(µ, σ2 I). We want to estimate µ. µmle = Xn ∼ N µ, σ2 n I . From James & Stein (1961) Estimation with quadratic loss µJS = 1 − (d − 2)σ2 n y 2 y where · is the Euclidean norm. One can prove that if d ≥ 3, E µJS − µ 2 < E µmle − µ 2 Samworth (2015) Stein’s paradox, “one should use the price of tea in China to obtain a better estimate of the chance of rain in Melbourne”. @freakonometrics 16
  • 17. Arthur CHARPENTIER, Advanced Econometrics Graduate Course James & Stein Estimator Heuristics : consider a biased estimator, to decrease the variance. See Efron (2010) Large-Scale Inference @freakonometrics 17
  • 18. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Motivation: Avoiding Overfit. Generalization: the model should perform well on new data (and not only on the training ones). [Figures: in-sample fit versus out-of-sample performance for models of increasing complexity.] @freakonometrics 18
  • 19. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Reducing Dimension with PCA. Use principal components to reduce dimension (on centered and scaled variables): we want d vectors $z_1,\cdots,z_d$. The first component is $z_1=X\omega_1$ where $\omega_1=\underset{\|\omega\|=1}{\text{argmax}}\ \|X\cdot\omega\|^2=\underset{\|\omega\|=1}{\text{argmax}}\ \omega^T X^T X\omega$. The second component is $z_2=X\omega_2$ where $\omega_2=\underset{\|\omega\|=1}{\text{argmax}}\ \|X^{(1)}\cdot\omega\|^2$, with $X^{(1)}=X-\underbrace{X\omega_1}_{z_1}\omega_1^T$. [Figures: log mortality rates by age and the first two PC scores, with the years 1914-1919 and 1940-1944 standing out.] @freakonometrics 19
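A short R sketch of this dimension reduction with prcomp (simulated data, purely illustrative):
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
pca <- prcomp(X, center = TRUE, scale. = TRUE)
Z <- pca$x[, 1:2]          # first two principal components z_1, z_2
W <- pca$rotation[, 1:2]   # the weight vectors omega_1, omega_2
summary(pca)               # share of variance captured by each component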
  • 20. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Reducing Dimension with PCA. A regression on (the d) principal components, $y=z^T\beta+\eta$, could be an interesting idea; unfortunately, principal components have no reason to be correlated with y. The first component was $z_1=X\omega_1$ where $\omega_1=\underset{\|\omega\|=1}{\text{argmax}}\ \|X\cdot\omega\|^2=\underset{\|\omega\|=1}{\text{argmax}}\ \omega^T X^TX\omega$: it is a non-supervised technique. Instead, use partial least squares, introduced in Wold (1966) Estimation of Principal Components and Related Models by Iterative Least Squares. The first component is $z_1=X\omega_1$ where $\omega_1=\underset{\|\omega\|=1}{\text{argmax}}\ \{\langle y,X\cdot\omega\rangle\}=\underset{\|\omega\|=1}{\text{argmax}}\ \omega^TX^T y y^TX\omega$. @freakonometrics 20
  • 21. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Terminology. Consider a dataset {yi, xi}, assumed to be generated from Y, X, from an unknown distribution P. Let m0(·) be the “true” model, and assume that yi = m0(xi) + εi. In a regression context (quadratic loss function), the risk associated to m is $R(m)=\mathbb{E}_P\big[(Y-m(X))^2\big]$. An optimal model $m^\star$ within a class M satisfies $R(m^\star)=\inf_{m\in M} R(m)$; such a model $m^\star$ is usually called an oracle. Observe that $m^\star(x)=\mathbb{E}[Y|X=x]$ is the solution of $R(m^\star)=\inf_{m\in M} R(m)$ where M is the set of measurable functions. @freakonometrics 21
  • 22. Arthur CHARPENTIER, Advanced Econometrics Graduate Course The empirical risk is Rn(m) = 1 n n i=1 yi − m(xi) 2 For instance, m can be a linear predictor, m(x) = β0 + xT β, where θ = (β0, β) should estimated (trained). E Rn(m) = E (m(X) − Y )2 can be expressed as E (m(X) − E[m(X)|X])2 variance of m + E E[m(X)|X] − E[Y |X] m0(X) 2 bias of m + E Y − E[Y |X] m0(X) )2 variance of the noise The third term is the risk of the “optimal” estimator m, that cannot be decreased. @freakonometrics 22
  • 23. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Mallows Penalty and Model Complexity Consider a linear predictor (see #1), i.e. y = m(x) = Ay. Assume that y = m0(x) + ε, with E[ε] = 0 and Var[ε] = σ2 I. Let · denote the Euclidean norm Empirical risk : Rn(m) = 1 n y − m(x) 2 Vapnik’s risk : E[Rn(m)] = 1 n m0(x − m(x) 2 + 1 n E y − m0(x 2 with m0(x = E[Y |X = x]. Observe that nE Rn(m) = E y − m(x) 2 = (I − A)m0 2 + σ2 I − A 2 while = E m0(x) − m(x) 2 = (I − A)m0 bias 2 + σ2 A 2 variance @freakonometrics 23
  • 24. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Mallows Penalty and Model Complexity One can obtain E Rn(m) = E Rn(m) + 2 σ2 n trace(A). If trace(A) ≥ 0 the empirical risk underestimate the true risk of the estimator. The number of degrees of freedom of the (linear) predictor is related to trace(A) 2 σ2 n trace(A) is called Mallow’s penalty CL. If A is a projection matrix, trace(A) is the dimension of the projection space, p, then we obtain Mallow’s CP , 2 σ2 n p. Remark : Mallows (1973) Some Comments on Cp introduced this penalty while focusing on the R2 . @freakonometrics 24
  • 25. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Penalty and Likelihood CP is associated to a quadratic risk an alternative is to use a distance on the (conditional) distribution of Y , namely Kullback-Leibler distance discrete case: DKL(P Q) = i P(i) log P(i) Q(i) continuous case : DKL(P Q) = ∞ −∞ p(x) log p(x) q(x) dxDKL(P Q) = ∞ −∞ p(x) log p(x) q(x) dx Let f denote the true (unknown) density, and fθ some parametric distribution, DKL(f fθ) = ∞ −∞ f(x) log f(x) fθ(x) dx= f(x) log[f(x)] dx− f(x) log[fθ(x)] dx relative information Hence minimize {DKL(f fθ)} ←→ maximize E log[fθ(X)] @freakonometrics 25
  • 26. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Penalty and Likelihood. Akaike (1974) A New Look at the Statistical Model Identification observed that, for n large enough, $\mathbb{E}\big[\log f_{\hat\theta}(X)\big]\sim \log[L(\hat\theta)]-\dim(\theta)$. Thus $AIC=-2\log L(\hat\theta)+2\dim(\theta)$. Example: in a (Gaussian) linear model, $y_i=\beta_0+x_i^T\beta+\varepsilon_i$, $AIC=n\log\Big(\dfrac{1}{n}\sum_{i=1}^n \hat\varepsilon_i^2\Big)+2[\dim(\beta)+2]$. @freakonometrics 26
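An illustrative base-R check of AIC-based comparison of nested linear models (the simulated variables are assumptions for the example, not from the slides):
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)                 # x3 is pure noise
AIC(lm(y ~ x1 + x2), lm(y ~ x1 + x2 + x3))       # compare the two fits
step(lm(y ~ x1 + x2 + x3), direction = "backward", trace = 0)   # k = log(n) would give BIC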
  • 27. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Penalty and Likelihood. Remark: this is valid for large samples (rule of thumb n/dim(θ) > 40); otherwise use a corrected AIC, $AIC_c=AIC+\dfrac{2k(k+1)}{n-k-1}$ (bias correction), where $k=\dim(\theta)$; see Sugiura (1978) Further Analysis of the Data by Akaike's Information Criterion and the Finite Corrections, the second-order AIC. Using a Bayesian interpretation, Schwarz (1978) Estimating the Dimension of a Model obtained $BIC=-2\log L(\hat\theta)+\log(n)\dim(\theta)$. Observe that the criteria considered are of the form criterion = $-\underbrace{\text{function}}_{L(\hat\theta)}+\underbrace{\text{penalty}}_{\text{complexity}}$. @freakonometrics 27
  • 28. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Estimation of the Risk Consider a naive bootstrap procedure, based on a bootstrap sample Sb = {(y (b) i , x (b) i )}. The plug-in estimator of the empirical risk is Rn(m(b) ) = 1 n n i=1 yi − m(b) (xi) 2 and then Rn = 1 B B b=1 Rn(m(b) ) = 1 B B b=1 1 n n i=1 yi − m(b) (xi) 2 @freakonometrics 28
  • 29. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Estimation of the Risk. One might improve this estimate using an out-of-bag procedure, $\widehat{\mathcal{R}}_n=\dfrac{1}{n}\sum_{i=1}^n \dfrac{1}{\#B_i}\sum_{b\in B_i}\big(y_i-\widehat{m}^{(b)}(x_i)\big)^2$, where $B_i$ is the set of all bootstrap samples that do not contain $(y_i,x_i)$. Remark: $P\big((y_i,x_i)\notin S_b\big)=\Big(1-\dfrac{1}{n}\Big)^n\sim e^{-1}=36.78\%$. @freakonometrics 29
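A rough R sketch of this out-of-bag risk estimate (simulated data and a simple polynomial model; purely illustrative):
set.seed(1)
n <- 200; B <- 500
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = .3)
oob_err <- matrix(NA, B, n)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)                       # bootstrap sample S_b
  fit <- lm(y ~ poly(x, 3), data = data.frame(x = x[idx], y = y[idx]))
  out <- setdiff(1:n, idx)                               # observations not in S_b
  oob_err[b, out] <- (y[out] - predict(fit, newdata = data.frame(x = x[out])))^2
}
mean(colMeans(oob_err, na.rm = TRUE))                    # out-of-bag estimate of the risk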
  • 30. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Linear Regression Shortcoming. Least Squares Estimator $\hat\beta=(X^TX)^{-1}X^Ty$; Unbiased Estimator $\mathbb{E}[\hat\beta]=\beta$; Variance $\text{Var}[\hat\beta]=\sigma^2(X^TX)^{-1}$, which can be (extremely) large when $\det[(X^TX)]\sim 0$. With $X=\begin{pmatrix}1&-1&2\\ 1&0&1\\ 1&2&-1\\ 1&1&0\end{pmatrix}$, $X^TX=\begin{pmatrix}4&2&2\\ 2&6&-4\\ 2&-4&6\end{pmatrix}$ while $X^TX+I=\begin{pmatrix}5&2&2\\ 2&7&-4\\ 2&-4&7\end{pmatrix}$, with eigenvalues {10, 6, 0} and {11, 7, 1} respectively. Ad-hoc strategy: use $X^TX+\lambda I$. @freakonometrics 30
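A quick base-R check of this toy example:
X <- matrix(c(1, -1,  2,
              1,  0,  1,
              1,  2, -1,
              1,  1,  0), nrow = 4, byrow = TRUE)
eigen(crossprod(X))$values             # 10, 6, 0 : X'X is singular
eigen(crossprod(X) + diag(3))$values   # 11, 7, 1 : adding I restores invertibility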
  • 31. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Linear Regression Shortcoming Evolution of (β1, β2) → n i=1 [yi − (β1x1,i + β2x2,i)]2 when cor(X1, X2) = r ∈ [0, 1], on top. Below, Ridge regression (β1, β2) → n i=1 [yi − (β1x1,i + β2x2,i)]2 +λ(β2 1 + β2 2) where λ ∈ [0, ∞), below, when cor(X1, X2) ∼ 1 (colinearity). @freakonometrics 31 −2 −1 0 1 2 3 4 −3−2−10123 β1 β2 500 1000 1500 2000 2000 2500 2500 2500 2500 3000 3000 3000 3000 3500 q −2 −1 0 1 2 3 4 −3−2−10123 β1 β2 1000 1000 2000 2000 3000 3000 4000 4000 5000 5000 6000 6000 7000
  • 32. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Normalization : Euclidean 2 vs. Mahalonobis We want to penalize complicated models : if βk is “too small”, we prefer to have βk = 0. 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 Instead of d(x, y) = (x − y)T (x − y) use dΣ(x, y) = (x − y)TΣ−1 (x − y) beta1 beta2 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 32
  • 33. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression ... like the least square, but it shrinks estimated coefficients towards 0. β ridge λ = argmin    n i=1 (yi − xT i β)2 + λ p j=1 β2 j    β ridge λ = argmin    y − Xβ 2 2 =criteria + λ β 2 2 =penalty    λ ≥ 0 is a tuning parameter. The constant is usually unpenalized. The true equation is β ridge λ = argmin    y − (β0 + Xβ) 2 2 =criteria + λ β 2 2 =penalty    @freakonometrics 33
  • 34. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Ridge Regression. $\hat\beta_\lambda^{ridge}=\text{argmin}\big\{\|y-(\beta_0+X\beta)\|_2^2+\lambda\|\beta\|_2^2\big\}$ can be seen as a constrained optimization problem, $\hat\beta_\lambda^{ridge}=\underset{\|\beta\|_2^2\le h_\lambda}{\text{argmin}}\ \|y-(\beta_0+X\beta)\|_2^2$, with explicit solution $\hat\beta_\lambda=(X^TX+\lambda I)^{-1}X^Ty$. If $\lambda\to 0$, $\hat\beta_0^{ridge}=\hat\beta^{ols}$; if $\lambda\to\infty$, $\hat\beta_\infty^{ridge}=0$. [Figures: contours of the least-squares criterion with the l2 constraint.] @freakonometrics 34
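A minimal sketch of this explicit solution on centered and scaled data (simulated, illustrative only):
set.seed(1)
n <- 100; p <- 3
X <- scale(matrix(rnorm(n * p), n, p))
y <- scale(X %*% c(1, -2, 0) + rnorm(n))
ridge <- function(lambda) solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
cbind(ridge(0), ridge(1), ridge(100))   # lambda = 0 is OLS; estimates shrink toward 0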
  • 35. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression This penalty can be seen as rather unfair if compo- nents of x are not expressed on the same scale • center: xj = 0, then β0 = y • scale: xT j xj = 1 Then compute β ridge λ = argmin    y − Xβ 2 2 =loss + λ β 2 2 =penalty    beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 40 40 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 35
  • 36. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Ridge Regression. Observe that if $x_{j_1}\perp x_{j_2}$, then $\hat\beta_\lambda^{ridge}=[1+\lambda]^{-1}\hat\beta^{ols}$, which explains the relationship with shrinkage. But generally, it is not the case... Theorem: There exists $\lambda$ such that $\text{mse}[\hat\beta_\lambda^{ridge}]\le \text{mse}[\hat\beta^{ols}]$. @freakonometrics 36
  • 37. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Ridge Regression Lλ(β) = n i=1 (yi − β0 − xT i β)2 + λ p j=1 β2 j ∂Lλ(β) ∂β = −2XT y + 2(XT X + λI)β ∂2 Lλ(β) ∂β∂βT = 2(XT X + λI) where XT X is a semi-positive definite matrix, and λI is a positive definite matrix, and βλ = (XT X + λI)−1 XT y @freakonometrics 37
  • 38. Arthur CHARPENTIER, Advanced Econometrics Graduate Course The Bayesian Interpretation From a Bayesian perspective, P[θ|y] posterior ∝ P[y|θ] likelihood · P[θ] prior i.e. log P[θ|y] = log P[y|θ] log likelihood + log P[θ] penalty If β has a prior N(0, τ2 I) distribution, then its posterior distribution has mean E[β|y, X] = XT X + σ2 τ2 I −1 XT y. @freakonometrics 38
  • 39. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Properties of the Ridge Estimator. $\hat\beta_\lambda=(X^TX+\lambda I)^{-1}X^Ty$, so $\mathbb{E}[\hat\beta_\lambda]=X^TX(\lambda I+X^TX)^{-1}\beta$, i.e. $\mathbb{E}[\hat\beta_\lambda]\neq\beta$: the ridge estimator is biased. Observe that $\mathbb{E}[\hat\beta_\lambda]\to 0$ as $\lambda\to\infty$. Assume that X is an orthogonal design matrix, i.e. $X^TX=I$; then $\hat\beta_\lambda=(1+\lambda)^{-1}\hat\beta^{ols}$. @freakonometrics 39
  • 40. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Properties of the Ridge Estimator Set W λ = (I + λ[XT X]−1 )−1 . One can prove that W λβ ols = βλ. Thus, Var[βλ] = W λVar[β ols ]W T λ and Var[βλ] = σ2 (XT X + λI)−1 XT X[(XT X + λI)−1 ]T . Observe that Var[β ols ] − Var[βλ] = σ2 W λ[2λ(XT X)−2 + λ2 (XT X)−3 ]W T λ ≥ 0. @freakonometrics 40
  • 41. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Properties of the Ridge Estimator Hence, the confidence ellipsoid of ridge estimator is indeed smaller than the OLS, If X is an orthogonal design matrix, Var[βλ] = σ2 (1 + λ)−2 I. mse[βλ] = σ2 trace(W λ(XT X)−1 W T λ) + βT (W λ − I)T (W λ − I)β. If X is an orthogonal design matrix, mse[βλ] = pσ2 (1 + λ)2 + λ2 (1 + λ)2 βT β @freakonometrics 41 0.0 0.2 0.4 0.6 0.8 −1.0−0.8−0.6−0.4−0.2 β1 β2 1 2 3 4 5 6 7
  • 42. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Properties of the Ridge Estimator mse[βλ] = pσ2 (1 + λ)2 + λ2 (1 + λ)2 βT β is minimal for λ = pσ2 βT β Note that there exists λ > 0 such that mse[βλ] < mse[β0] = mse[β ols ]. @freakonometrics 42
  • 43. Arthur CHARPENTIER, Advanced Econometrics Graduate Course SVD decomposition For any matrix A, m × n, there are orthogonal matrices U (m × m), V (n × n) and a "diagonal" matrix Σ (m × n) such that A = UΣV T , or AV = UΣ. Hence, there exists a special orthonormal set of vectors (i.e. the columns of V ), that is mapped by the matrix A into an orthonormal set of vectors (i.e. the columns of U). Let r = rank(A), then A = r i=1 σiuivT i (called the dyadic decomposition of A). Observe that it can be used to compute (e.g.) the Frobenius norm of A, A = a2 i,j = σ2 1 + · · · + σ2 min{m,n}. Further AT A = V ΣT ΣV T while AAT = UΣΣT UT . Hence, σ2 i ’s are related to eigenvalues of AT A and AAT , and ui, vi are associated eigenvectors. Golub & Reinsh (1970) Singular Value Decomposition and Least Squares Solutions @freakonometrics 43
  • 44. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. SVD decomposition. Consider the singular value decomposition of X, $X=UDV^T$. Then $\hat\beta^{ols}=V D^{-2}D\,U^Ty$ and $\hat\beta_\lambda=V(D^2+\lambda I)^{-1}D\,U^Ty$. Observe that $D_{i,i}^{-1}\ge \dfrac{D_{i,i}}{D_{i,i}^2+\lambda}$, hence the ridge penalty shrinks singular values. Set now $R=UD$ (an n × n matrix), so that $X=RV^T$ and $\hat\beta_\lambda=V(R^TR+\lambda I)^{-1}R^Ty$. @freakonometrics 44
  • 45. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Hat matrix and Degrees of Freedom. Recall that $\hat Y=HY$ with $H=X(X^TX)^{-1}X^T$. Similarly $H_\lambda=X(X^TX+\lambda I)^{-1}X^T$ and $\text{trace}[H_\lambda]=\sum_{j=1}^p \dfrac{d_{j,j}^2}{d_{j,j}^2+\lambda}\to 0$, as $\lambda\to\infty$. @freakonometrics 45
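A short R sketch of this effective number of degrees of freedom, computed from the singular values of X (simulated data, illustrative):
set.seed(1)
X <- scale(matrix(rnorm(100 * 5), 100, 5))
d <- svd(X)$d
df_ridge <- function(lambda) sum(d^2 / (d^2 + lambda))
sapply(c(0, 1, 10, 100), df_ridge)   # decreases from p = 5 toward 0 as lambda grows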
  • 46. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Sparsity Issues. In several applications, k can be (very) large, but a lot of features are just noise: βj = 0 for many j's. Let s denote the number of relevant features, with s << k, cf Hastie, Tibshirani & Wainwright (2015) Statistical Learning with Sparsity: $s=\text{card}\{S\}$ where $S=\{j;\beta_j\neq 0\}$. The model is now $y=X_S^T\beta_S+\varepsilon$, where $X_S^TX_S$ is a full rank matrix. @freakonometrics 46
  • 47. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Going further on sparsity issues. The Ridge regression problem was to solve $\hat\beta=\underset{\beta\in\{\|\beta\|_2\le s\}}{\text{argmin}}\{\|Y-X^T\beta\|_2^2\}$. Define $\|a\|_0=\sum 1(|a_i|>0)$. Here $\dim(\beta)=k$ but $\|\beta\|_0=s$. We wish we could solve $\hat\beta=\underset{\beta\in\{\|\beta\|_0=s\}}{\text{argmin}}\{\|Y-X^T\beta\|_2^2\}$. Problem: it is usually not possible to enumerate all possible constraints, since s out of k coefficients should be chosen here (with k (very) large). [Figure: contours of the criterion with the l2 constraint.] @freakonometrics 47
  • 48. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further on sparcity issues In a convex problem, solve the dual problem, e.g. in the Ridge regression : primal problem min β∈{ β 2 ≤s} { Y − XT β 2 2 } and the dual problem min β∈{ Y −XTβ 2 ≤t} { β 2 2 } beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 26 27 30 32 35 40 40 50 60 70 80 90 100 110 120 120 130 130 140 140 X q −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 26 27 30 32 35 40 40 50 60 70 80 90 100 110 120 120 130 130 140 140 X q −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 48
  • 49. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further on sparcity issues Idea: solve the dual problem β = argmin β∈{ Y −XTβ 2 ≤h} { β 0 } where we might convexify the 0 norm, · 0 . @freakonometrics 49
  • 50. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Going further on sparsity issues. On $[-1,+1]^k$, the convex hull of $\|\beta\|_0$ is $\|\beta\|_1$; on $[-a,+a]^k$, the convex hull of $\|\beta\|_0$ is $a^{-1}\|\beta\|_1$. Hence, why not solve $\hat\beta=\underset{\beta;\|\beta\|_1\le\tilde s}{\text{argmin}}\{\|Y-X^T\beta\|^2\}$, which is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem $\hat\beta=\text{argmin}\{\|Y-X^T\beta\|_2^2+\lambda\|\beta\|_1\}$. @freakonometrics 50
  • 51. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. LASSO Least Absolute Shrinkage and Selection Operator. $\hat\beta\in\text{argmin}\{\|Y-X^T\beta\|_2^2+\lambda\|\beta\|_1\}$ is a convex problem (several algorithms), but not strictly convex (no uniqueness of the minimum). Nevertheless, predictions $\hat y=x^T\hat\beta$ are unique. MM, minimize majorization, coordinate descent; see Hunter & Lange (2003) A Tutorial on MM Algorithms. @freakonometrics 51
  • 52. Arthur CHARPENTIER, Advanced Econometrics Graduate Course LASSO Regression No explicit solution... If λ → 0, β lasso 0 = β ols If λ → ∞, β lasso ∞ = 0. beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 52
  • 53. Arthur CHARPENTIER, Advanced Econometrics Graduate Course LASSO Regression For some λ, there are k’s such that β lasso k,λ = 0. Further, λ → β lasso k,λ is piecewise linear beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 30 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 beta1 beta2 −1 −0.5 0.5 1 −1 −0.5 0.5 1 30 40 40 50 60 70 80 90 100 110 120 120 150 150 40 40 X −1.0 −0.5 0.0 0.5 1.0 −1.0−0.50.00.51.0 @freakonometrics 53
  • 54. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. LASSO Regression. In the orthogonal case, $X^TX=I$, $\hat\beta_{k,\lambda}^{lasso}=\text{sign}(\hat\beta_k^{ols})\Big(|\hat\beta_k^{ols}|-\dfrac{\lambda}{2}\Big)_+$, i.e. the LASSO estimate is related to the soft-threshold function... [Figure: soft-threshold function.] @freakonometrics 54
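A one-line R version of this soft-threshold operator, plotted against the identity (illustrative):
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
b <- seq(-3, 3, by = .1)
plot(b, soft_threshold(b, 1), type = "l", ylab = expression(S[lambda](b)))
abline(0, 1, lty = 2)   # the identity, i.e. the unpenalized estimate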
  • 55. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Optimal LASSO Penalty. Use cross validation, e.g. K-fold: $\hat\beta_{(-k)}(\lambda)=\text{argmin}\Big\{\sum_{i\notin I_k}[y_i-x_i^T\beta]^2+\lambda\|\beta\|_1\Big\}$, then compute the sum of the squared errors, $Q_k(\lambda)=\sum_{i\in I_k}[y_i-x_i^T\hat\beta_{(-k)}(\lambda)]^2$, and finally solve $\lambda^\star=\text{argmin}\Big\{\overline{Q}(\lambda)=\dfrac{1}{K}\sum_k Q_k(\lambda)\Big\}$. Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) Elements of Statistical Learning suggest the largest λ such that $\overline{Q}(\lambda)\le \overline{Q}(\lambda^\star)+\text{se}[\lambda^\star]$ with $\text{se}[\lambda]^2=\dfrac{1}{K^2}\sum_{k=1}^K[Q_k(\lambda)-\overline{Q}(\lambda)]^2$. @freakonometrics 55
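The same idea with glmnet's built-in K-fold cross-validation (simulated data; purely illustrative):
library(glmnet)
set.seed(1)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1:3] %*% c(2, -1, 1) + rnorm(n)
cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
c(cv$lambda.min, cv$lambda.1se)   # CV-error minimizer and the one-standard-error choice
plot(cv)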
  • 56. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. LASSO and Ridge, with R
> library(glmnet)
> chicago <- read.table("http://freakonometrics.free.fr/chicago.txt", header = TRUE, sep = ";")
> standardize <- function(x) {(x - mean(x)) / sd(x)}
> z0 <- standardize(chicago[, 1])
> z1 <- standardize(chicago[, 3])
> z2 <- standardize(chicago[, 4])
> ridge <- glmnet(cbind(z1, z2), z0, alpha = 0, intercept = FALSE, lambda = 1)
> lasso <- glmnet(cbind(z1, z2), z0, alpha = 1, intercept = FALSE, lambda = 1)
> elastic <- glmnet(cbind(z1, z2), z0, alpha = .5, intercept = FALSE, lambda = 1)
Elastic net: $\lambda_1\|\beta\|_1+\lambda_2\|\beta\|_2^2$. [Figures: ridge, LASSO and elastic net estimates.] @freakonometrics 56
  • 57. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. LASSO Regression, Smoothing and Overfit. LASSO can be used to avoid overfit. [Figure: fitted curves for different penalty levels.] @freakonometrics 57
  • 58. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further, 0, 1 and 2 penalty Define a 0 = d i=1 1(ai = 0), a 1 = d i=1 |ai| and a 2 = d i=1 a2 i 1/2 , for a ∈ Rd . constrained penalized optimization optimization argmin β; β 0 ≤s n i=1 (yi, β0 + xT β) argmin β,λ n i=1 (yi, β0 + xT β) + λ β 0 ( 0) argmin β; β 1 ≤s n i=1 (yi, β0 + xT β) argmin β,λ n i=1 (yi, β0 + xT β) + λ β 1 ( 1) argmin β; β 2 ≤s n i=1 (yi, β0 + xT β) argmin β,λ n i=1 (yi, β0 + xT β) + λ β 2 ( 2) Assume that is the quadratic norm. @freakonometrics 58
  • 59. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further, 0, 1 and 2 penalty The two problems ( 2) are equivalent : ∀(β , s ) solution of the left problem, ∃λ such that (β , λ ) is solution of the right problem. And conversely. The two problems ( 1) are equivalent : ∀(β , s ) solution of the left problem, ∃λ such that (β , λ ) is solution of the right problem. And conversely. Nevertheless, if there is a theoretical equivalence, there might be numerical issues since there is not necessarily unicity of the solution. The two problems ( 0) are not equivalent : if (β , λ ) is solution of the right problem, ∃s such that β is a solution of the left problem. But the converse is not true. More generally, consider a p norm, • sparsity is obtained when p ≤ 1 • convexity is obtained when p ≥ 1 @freakonometrics 59
  • 60. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Going further, l0, l1 and l2 penalty. Foster & George (1994) The Risk Inflation Criterion for Multiple Regression tried to solve directly the penalized problem of (l0). But it is a complex combinatorial problem in high dimension (Natarajan (1995) Sparse Approximate Solutions to Linear Systems proved that it is an NP-hard problem). One can prove that if $\lambda\sim\sigma^2\log(p)$, then $\mathbb{E}\big([x^T\hat\beta-x^T\beta_0]^2\big)\le \underbrace{\mathbb{E}\big([x_S^T\hat\beta_S-x^T\beta_0]^2\big)}_{=\sigma^2\#S}\cdot\big(4\log p+2+o(1)\big)$. In that case $\hat\beta_{\lambda,j}^{sub}=0$ if $j\notin S_\lambda(\hat\beta)$ and $\hat\beta_{\lambda,j}^{sub}=\hat\beta_j^{ols}$ if $j\in S_\lambda(\hat\beta)$, where $S_\lambda(\hat\beta)$ is the set of non-null values in the solutions of (l0). @freakonometrics 60
  • 61. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. With the l1 penalty instead of the quadratic one, problem (l1) is not always strictly convex, and the optimum is not always unique (e.g. if $X^TX$ is singular). But with a quadratic loss, the objective is strictly convex in $X\beta$, so at least $X\hat\beta$ is unique. Further, note that solutions are necessarily coherent (signs of coefficients): it is not possible to have $\hat\beta_j<0$ for one solution and $\hat\beta_j>0$ for another one. In many cases, problem (l1) yields a corner-type solution, which can be seen as a "best subset" solution, like in (l0). @freakonometrics 61
  • 62. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Going further, l0, l1 and l2 penalty. Consider a simple regression $y_i=x_i\beta+\varepsilon$, with an l1 penalty and an l2 loss function. (l1) becomes $\min\{y^Ty-2y^Tx\beta+\beta x^Tx\beta+2\lambda|\beta|\}$. The first order condition can be written $-2y^Tx+2x^Tx\beta\pm 2\lambda=0$ (the sign in ± being the sign of β). Assume that the least-squares estimate (λ = 0) is (strictly) positive, i.e. $y^Tx>0$. If λ is not too large, $\hat\beta$ and $\hat\beta^{ols}$ have the same sign, so $-2y^Tx+2x^Tx\beta+2\lambda=0$, with solution $\hat\beta_\lambda^{lasso}=\dfrac{y^Tx-\lambda}{x^Tx}$. @freakonometrics 62
  • 63. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Going further, 0, 1 and 2 penalty Increase λ so that βλ = 0. Increase slightly more, βλ cannot become negative, because the sign of the first order condition will change, and we should solve −2yT x + 2xT xβ − 2λ = 0. and solution would be βlasso λ = yT x + λ xTx . But that solution is positive (we assumed that yT x > 0), to we should have βλ < 0. Thus, at some point βλ = 0, which is a corner solution. In higher dimension, see Tibshirani & Wasserman (2016) a closer look at sparse regression or Candès & Plan (2009) Near-ideal model selection by 1 minimization., With some additional technical assumption, that LASSO estimator is "sparsistent" in the sense that the support of β lasso λ is the same as β, Thus, LASSO can be used for variable selection (see Hastie et al. (2001) The @freakonometrics 63
  • 64. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Elements of Statistical Learning). Generally, $\hat\beta_\lambda^{lasso}$ is a biased estimator, but its variance can be small enough to yield a smaller mean squared error than the OLS estimate. With orthonormal covariates, one can prove that $\hat\beta_{\lambda,j}^{sub}=\hat\beta_j^{ols}\,1_{|\hat\beta_{\lambda,j}^{sub}|>b}$, $\hat\beta_{\lambda,j}^{ridge}=\dfrac{\hat\beta_j^{ols}}{1+\lambda}$ and $\hat\beta_{\lambda,j}^{lasso}=\text{sign}[\hat\beta_j^{ols}]\cdot(|\hat\beta_j^{ols}|-\lambda)_+$. @freakonometrics 64
  • 65. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Optimization Heuristics. First idea: given some initial guess $\beta_{(0)}$, $|\beta|\sim|\beta_{(0)}|+\dfrac{1}{2|\beta_{(0)}|}(\beta^2-\beta_{(0)}^2)$. The LASSO estimate can then be derived from iterated Ridge estimates: $\|y-X\beta_{(k+1)}\|_2^2+\lambda\|\beta_{(k+1)}\|_1\sim\|y-X\beta_{(k+1)}\|_2^2+\dfrac{\lambda}{2}\sum_j\dfrac{1}{|\beta_{j,(k)}|}[\beta_{j,(k+1)}]^2$, which is a weighted ridge penalty function. Thus $\beta_{(k+1)}=\big(X^TX+\lambda\Delta_{(k)}\big)^{-1}X^Ty$ where $\Delta_{(k)}=\text{diag}[|\beta_{j,(k)}|^{-1}]$. Then $\beta_{(k)}\to\hat\beta^{lasso}$ as $k\to\infty$. @freakonometrics 65
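A rough R sketch of this iterated-ridge heuristic (simulated data; the small guard against division by zero is an added assumption):
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- X %*% c(2, -1, 0, 0, 0) + rnorm(n)
lambda <- 20
beta <- rep(1, p)                                        # initial guess beta_(0)
for (k in 1:100) {
  Delta <- diag(as.vector(1 / pmax(abs(beta), 1e-8)))    # diag(|beta_j,(k)|^{-1})
  beta  <- solve(crossprod(X) + lambda * Delta, crossprod(X, y))
}
round(as.vector(beta), 3)   # coefficients of the noise variables shrunk essentially to zero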
  • 66. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Properties of LASSO Estimate From this iterative technique β lasso λ ∼ XT X + λ∆ −1 XT y where ∆ = diag[|β lasso j,λ |−1 ] if β lasso j,λ = 0, 0 otherwise. Thus, E[β lasso λ ] ∼ XT X + λ∆ −1 XT Xβ and Var[β lasso λ ] ∼ σ2 XT X + λ∆ −1 XT XT X XT X + λ∆ −1 XT @freakonometrics 66
  • 67. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Optimization Heuristics Consider here a simplified problem, min a∈R 1 2 (a − b)2 + λ|a| g(a) with λ > 0. Observe that g (0) = −b ± λ. Then • if |b| ≤ λ, then a = 0 • if b ≥ λ, then a = b − λ • if b ≤ −λ, then a = b + λ a = argmin a∈R 1 2 (a − b)2 + λ|a| = Sλ(b) = sign(b) · (|b| − λ)+, also called soft-thresholding operator. @freakonometrics 67
  • 68. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Optimization Heuristics. Definition: for any convex function h, define the proximal operator of h, $\text{proximal}_h(y)=\underset{x\in\mathbb{R}^d}{\text{argmin}}\Big\{\dfrac{1}{2}\|x-y\|_2^2+h(x)\Big\}$. Note that $\text{proximal}_{\lambda\|\cdot\|_2^2}(y)=\dfrac{1}{1+\lambda}y$, the shrinkage operator, and $\text{proximal}_{\lambda\|\cdot\|_1}(y)=S_\lambda(y)=\text{sign}(y)\cdot(|y|-\lambda)_+$, the soft-thresholding operator. @freakonometrics 68
  • 69. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Optimization Heuristics We want to solve here θ ∈ argmin θ∈Rd 1 n y − mθ(x)) 2 2 f(θ) + λ · penalty(θ) g(θ) . where f is convex and smooth, and g is convex, but not smooth... 1. Focus on f : descent lemma, ∀θ, θ f(θ) ≤ f(θ ) + f(θ ), θ − θ + t 2 θ − θ 2 2 Consider a gradient descent sequence θk, i.e. θk+1 = θk − t−1 f(θk), then f(θ) ≤ ϕ(θ): θk+1=argmin{ϕ(θ)} f(θk) + f(θk), θ − θk + t 2 θ − θk 2 2 @freakonometrics 69
  • 70. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Optimization Heuristics. 2. Add function g: $f(\theta)+g(\theta)\le\psi(\theta)=f(\theta_k)+\langle\nabla f(\theta_k),\theta-\theta_k\rangle+\dfrac{t}{2}\|\theta-\theta_k\|_2^2+g(\theta)$. And one can prove that $\theta_{k+1}=\underset{\theta\in\mathbb{R}^d}{\text{argmin}}\{\psi(\theta)\}=\text{proximal}_{g/t}\big(\theta_k-t^{-1}\nabla f(\theta_k)\big)$, the so-called proximal gradient descent algorithm, since $\text{argmin}\{\psi(\theta)\}=\text{argmin}\Big\{\dfrac{t}{2}\big\|\theta-\big(\theta_k-t^{-1}\nabla f(\theta_k)\big)\big\|_2^2+g(\theta)\Big\}$. @freakonometrics 70
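A minimal proximal-gradient sketch for an objective of this form, with a squared loss and an l1 penalty (simulated data; the step size and the penalty level are illustrative assumptions):
set.seed(1)
n <- 100; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))
y <- X %*% c(3, -2, rep(0, 8)) + rnorm(n)
lambda <- 30
soft <- function(b, s) sign(b) * pmax(abs(b) - s, 0)   # proximal operator of s * ||.||_1
t <- max(eigen(crossprod(X))$values)                   # Lipschitz constant of the gradient
theta <- rep(0, p)
for (k in 1:500) {
  grad  <- -crossprod(X, y - X %*% theta)              # gradient of f(theta) = (1/2)||y - X theta||^2
  theta <- soft(theta - grad / t, lambda / t)          # proximal step on g(theta) = lambda ||theta||_1
}
round(as.vector(theta), 3)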
  • 71. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Coordinate-wise minimization Consider some convex differentiable f : Rk → R function. Consider x ∈ Rk obtained by minimizing along each coordinate axis, i.e. f(x1, xi−1, xi, xi+1, · · · , xk) ≥ f(x1, xi−1, xi , xi+1, · · · , xk) for all i. Is x a global minimizer? i.e. f(x) ≥ f(x ), ∀x ∈ Rk . Yes. If f is convex and differentiable. f(x)|x=x = ∂f(x) ∂x1 , · · · , ∂f(x) ∂xk = 0 There might be problem if f is not differentiable (except in each axis direction). If f(x) = g(x) + k i=1 hi(xi) with g convex and differentiable, yes, since f(x) − f(x ) ≥ g(x )T (x − x ) + i [hi(xi) − hi(xi )] @freakonometrics 71
  • 72. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Coordinate-wise minimization f(x) − f(x ) ≥ i [ ig(x )T (xi − xi )hi(xi) − hi(xi )] ≥0 ≥ 0 Thus, for functions f(x) = g(x) + k i=1 hi(xi) we can use coordinate descent to find a minimizer, i.e. at step j x (j) 1 ∈ argmin x1 f(x1, x (j−1) 2 , x (j−1) 3 , · · · x (j−1) k ) x (j) 2 ∈ argmin x2 f(x (j) 1 , x2, x (j−1) 3 , · · · x (j−1) k ) x (j) 3 ∈ argmin x3 f(x (j) 1 , x (j) 2 , x3, · · · x (j−1) k ) Tseng (2001) Convergence of Block Coordinate Descent Method: if f is continuous, then x∞ is a minimizer of f. @freakonometrics 72
  • 73. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Application in Linear Regression. Let $f(x)=\dfrac{1}{2}\|y-Ax\|^2$, with $y\in\mathbb{R}^n$ and $A\in\mathcal{M}_{n\times k}$, and write $A=[A_1,\cdots,A_k]$. Let us minimize in direction i, and let $x_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here $0=\dfrac{\partial f(x)}{\partial x_i}=A_i^T[Ax-y]=A_i^T[A_ix_i+A_{-i}x_{-i}-y]$, thus the optimal value is $x_i^\star=\dfrac{A_i^T[y-A_{-i}x_{-i}]}{A_i^TA_i}$. @freakonometrics 73
  • 74. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Application to LASSO. Let $f(x)=\dfrac{1}{2}\|y-Ax\|^2+\lambda\|x\|_1$, so that the non-differentiable part is separable, since $\|x\|_1=\sum_{i=1}^k|x_i|$. Let us minimize in direction i, and let $x_{-i}$ denote the vector in $\mathbb{R}^{k-1}$ without $x_i$. Here $0=\dfrac{\partial f(x)}{\partial x_i}=A_i^T[A_ix_i+A_{-i}x_{-i}-y]+\lambda s_i$ where $s_i\in\partial|x_i|$. Thus, the solution is obtained by soft-thresholding, $x_i^\star=S_{\lambda/\|A_i\|^2}\Big(\dfrac{A_i^T[y-A_{-i}x_{-i}]}{A_i^TA_i}\Big)$. @freakonometrics 74
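A compact coordinate-descent sketch following this update (columns standardized; data simulated, illustrative only):
set.seed(1)
n <- 100; p <- 10
A <- scale(matrix(rnorm(n * p), n, p))
y <- A %*% c(3, -2, rep(0, 8)) + rnorm(n)
lambda <- 30
soft <- function(b, s) sign(b) * pmax(abs(b) - s, 0)
x <- rep(0, p)
for (it in 1:100) {
  for (i in 1:p) {
    r    <- y - A[, -i] %*% x[-i]                        # partial residual
    x[i] <- soft(sum(A[, i] * r), lambda) / sum(A[, i]^2)
  }
}
round(x, 3)   # exact zeros on the noise coordinates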
  • 75. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Convergence rate for LASSO. Let $f(x)=g(x)+\lambda\|x\|_1$ with g convex, $\nabla g$ Lipschitz with constant L > 0, and $\text{Id}-\nabla g/L$ monotone increasing in each component, and suppose there exists z such that, componentwise, either $z\ge S_\lambda(z-\nabla g(z))$ or $z\le S_\lambda(z-\nabla g(z))$. Saha & Tewari (2010) On the Finite Time Convergence of Cyclic Coordinate Descent Methods proved that a coordinate descent starting from z satisfies $f(x^{(j)})-f(x^\star)\le\dfrac{L\|z-x^\star\|^2}{2j}$. @freakonometrics 75
  • 76. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Lasso for Autoregressive Time Series. Consider some AR(p) autoregressive time series, $X_t=\phi_1X_{t-1}+\phi_2X_{t-2}+\cdots+\phi_{p-1}X_{t-p+1}+\phi_pX_{t-p}+\varepsilon_t$, for some white noise $(\varepsilon_t)$, with a causal-type representation. Write $y=x^T\phi+\varepsilon$. The LASSO estimator $\hat\phi$ is a minimizer of $\dfrac{1}{2T}\|y-x^T\phi\|^2+\lambda\sum_{i=1}^p\lambda_i|\phi_i|$, for some tuning parameters $(\lambda,\lambda_1,\cdots,\lambda_p)$. See Nardi & Rinaldo (2011). @freakonometrics 76
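One possible R sketch: build the lagged design matrix and feed it to glmnet (the simulated AR(4) process and the single penalty level are assumptions made for the illustration):
library(glmnet)
set.seed(1)
Tn <- 500; p <- 12
x  <- arima.sim(list(ar = c(.5, 0, 0, .2)), n = Tn)   # true AR(4), fitted within an AR(12)
Z  <- embed(as.numeric(x), p + 1)                     # column 1 = X_t, columns 2..p+1 = lags
fit <- glmnet(Z[, -1], Z[, 1], alpha = 1, intercept = FALSE)
coef(fit, s = .05)[-1]                                # many lag coefficients set exactly to zero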
  • 77. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Graphical Lasso and Covariance Estimation. We want to estimate an (unknown) covariance matrix Σ, or $\Sigma^{-1}$. An estimate for $\Sigma^{-1}$ is $\hat\Theta$, solution of $\hat\Theta\in\underset{\Theta\in\mathcal{M}_{k\times k}}{\text{argmin}}\{-\log[\det(\Theta)]+\text{trace}[S\Theta]+\lambda\|\Theta\|_1\}$ where $S=\dfrac{X^TX}{n}$ and where $\|\Theta\|_1=\sum|\Theta_{i,j}|$. See van Wieringen (2016) Undirected Network Reconstruction from High-Dimensional Data and https://github.com/kaizhang/glasso @freakonometrics 77
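A small sketch with the glasso package (assumed installed from CRAN; data simulated, penalty level illustrative):
library(glasso)
set.seed(1)
n <- 200; k <- 5
X <- matrix(rnorm(n * k), n, k)
X[, 2] <- X[, 1] + rnorm(n, sd = .3)       # make two coordinates strongly dependent
S <- crossprod(scale(X, scale = FALSE)) / n
fit <- glasso(S, rho = .1)                 # rho plays the role of lambda
round(fit$wi, 2)                           # estimated sparse precision matrix Theta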
  • 78. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Application to Network Simplification Can be applied on networks, to spot ‘significant’ connexions... Source: http://khughitt.github.io/graphical-lasso/ @freakonometrics 78
  • 79. Arthur CHARPENTIER, Advanced Econometrics Graduate Course Extention of Penalization Techniques In a more general context, we want to solve θ ∈ argmin θ∈Rd 1 n n i=1 (yi, mθ(xi)) + λ · penalty(θ) . The quadratic loss function was related to the Gaussian case, but much more alternatives can be considered... @freakonometrics 79
  • 80. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Linear models, nonlinear models and GLMs. Linear model: (Y|X = x) ∼ N(θx, σ²), E[Y|X = x] = θx = x^Tβ.
> fit <- lm(y ~ x, data = df)
[Figure: fitted regression line.] @freakonometrics 80
  • 81. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Linear models, nonlinear models and GLMs. Nonlinear models: (Y|X = x) ∼ N(θx, σ²), E[Y|X = x] = θx = m(x).
> library(splines)
> fit <- lm(y ~ poly(x, k), data = df)
> fit <- lm(y ~ bs(x), data = df)
[Figure: fitted nonlinear regression curve.] @freakonometrics 81
  • 82. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Linear models, nonlinear models and GLMs. Generalized Linear Models: (Y|X = x) ∼ L(θx, ϕ), E[Y|X = x] = h⁻¹(θx) = h⁻¹(x^Tβ).
> fit <- glm(y ~ x, data = df,
+            family = poisson(link = "log"))
[Figure: fitted Poisson regression curve.] @freakonometrics 82
  • 83. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. The Exponential Family. Consider distributions with parameter θ (and ϕ) with density (with respect to the appropriate measure, on N or on R) $f(y|\theta,\varphi)=\exp\Big(\dfrac{y\theta-b(\theta)}{a(\varphi)}+c(y,\varphi)\Big)$, where a(·), b(·) and c(·) are functions, and where θ is called the canonical parameter. θ is the quantity of interest, while ϕ is a nuisance parameter. @freakonometrics 83
  • 84. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. The Exponential Family. Example: the Gaussian distribution with mean µ and variance σ², N(µ, σ²), belongs to that family with θ = µ, ϕ = σ², a(ϕ) = ϕ, b(θ) = θ²/2 and $c(y,\varphi)=-\dfrac{1}{2}\Big(\dfrac{y^2}{\sigma^2}+\log(2\pi\sigma^2)\Big)$, y ∈ R. Example: the Bernoulli distribution with mean π, B(π), is obtained with θ = log{p/(1 − p)}, a(ϕ) = 1, b(θ) = log(1 + exp(θ)), ϕ = 1 and c(y, ϕ) = 0. Example: the binomial distribution with mean nπ, B(n, π), is obtained with θ = log{p/(1 − p)}, a(ϕ) = 1, b(θ) = n log(1 + exp(θ)), ϕ = 1 and $c(y,\varphi)=\log\binom{n}{y}$. Example: the Poisson distribution with mean λ, P(λ), belongs to that family, $f(y|\lambda)=\exp(-\lambda)\dfrac{\lambda^y}{y!}=\exp\big(y\log\lambda-\lambda-\log y!\big)$, y ∈ N, with θ = log λ, ϕ = 1, a(ϕ) = 1, b(θ) = exp θ = λ and c(y, ϕ) = − log y!. @freakonometrics 84
  • 85. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. The Exponential Family. Example: the Negative Binomial distribution with parameters r and p, $f(y|r,p)=\binom{y+r-1}{y}(1-p)^r p^y$, y ∈ N, can be written, equivalently, $f(y|r,p)=\exp\Big(y\log p+r\log(1-p)+\log\binom{y+r-1}{y}\Big)$, i.e. θ = log p, b(θ) = −r log(1 − p) and a(ϕ) = 1. @freakonometrics 85
  • 86. Arthur CHARPENTIER, Advanced Econometrics Graduate Course The Exponential Family Example The Gamma distribution, with mean µ and variance ν−1 , f(y|µ, ν) = 1 Γ(ν) ν µ ν yν−1 exp − ν µ y , y ∈ R+, is also in the Exponential family θ = − 1 µ , a(ϕ) = ϕ, b(θ) = − log(−θ), ϕ = ν−1 and c(y, ϕ) = 1 ϕ − 1 log(y) − log Γ 1 ϕ @freakonometrics 86
  • 87. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. Mean and Variance. Let Y be a random variable in the Exponential family; then E(Y) = b′(θ) and Var(Y) = b″(θ)ϕ, i.e. the variance of Y is the product of two quantities: b″(θ), which is a function of θ (only) and is called the variance function, and a function of ϕ. Observe that µ = E(Y), hence the parameter θ is related to the mean µ. Hence, the variance function is a function of µ, and can be denoted V(µ) = b″([b′]⁻¹(µ))ϕ. Example: in the Gaussian case, V(µ) = 1, while for the Poisson distribution, V(µ) = µ and for the Gamma one, V(µ) = µ². @freakonometrics 87
  • 88. Arthur CHARPENTIER, Advanced Econometrics Graduate Course. From the Exponential Family to GLMs. Consider independent random variables Y1, Y2, . . . , Yn such that $f(y_i|\theta_i,\varphi)=\exp\Big(\dfrac{y_i\theta_i-b(\theta_i)}{a(\varphi)}+c(y_i,\varphi)\Big)$, so that the likelihood can be written $\mathcal{L}(\theta,\varphi|y)=\prod_{i=1}^n f(y_i|\theta_i,\varphi)=\exp\Big(\dfrac{\sum_{i=1}^n y_i\theta_i-\sum_{i=1}^n b(\theta_i)}{a(\varphi)}+\sum_{i=1}^n c(y_i,\varphi)\Big)$, or $\log\mathcal{L}(\beta)=\sum_{i=1}^n\dfrac{y_i\theta_i-b(\theta_i)}{a(\varphi)}$ (up to an additive constant...). @freakonometrics 88