Robust parametric classification and variable
     selection with minimum distance estimation

              Eric Chi (a,b,1) with David W. Scott (a,2)
                  (a) Department of Statistics,
                        Rice University
                  (b) Baylor College of Medicine


                          June 17, 2010


(1) DOE DE-FG02-97ER25308
(2) NSF DMS-09-07491
Outline


   The binary regression problem

   The L2 E Method

   Estimation

   Variable Selection

   Simulations

   Conclusion
Logistic Regression




   Suppose we wish to predict y ∈ {0,1}^n using X ∈ R^(n×p).
   The number of features p could be very large.
Univariate Logistic Regression: MLE



   [Figure: fitted logistic curve of Pr(Y = 1) against X ∈ (−6, 6), with the
   observed binary responses plotted at 0 and 1.]
MLE is sensitive to outliers



   [Figure: the same data after adding a cluster of outlying ones near
   X ∈ (−6, −4); the fitted MLE curve changes noticeably.]
MLE is sensitive to outliers




   Likelihood-based choice
       Outlier or not, the MLE puts mass wherever data lie.
       Cost: the MLE also puts mass over regions where there are no data.
MLE is sensitive to outliers
   [Figure: the fitted logistic curve with the outlying ones included,
   Pr(Y = 1) against X ∈ (−6, 6).]

   There are no 'ones' between −4 and −2,

                                      yet P(Y = 1 | X ∈ (−4, −2)) increases.

   There are no 'zeros' between 4 and 6,

                                      yet P(Y = 0 | X ∈ (4, 6)) increases.
The L2 distance as an alternative to the deviance loss.




       g : unknown true density.
       fθ : putative parametric density.
       Find θ that minimizes the ISE:

               θ̂ = argmin_θ ∫ (fθ(x) − g(x))² dx.
The L2 E Method



      The equivalent empirical criterion:

              θ̂ = argmin_θ [ ∫ fθ(x)² dx − (2/n) Σ_{i=1}^n fθ(X_i) ],

      where X_i ∈ R^p is the covariate vector of the i-th observation.
      The L2 Estimator, or L2 E [Scott, 2001].
      Familiar quantity: smoothing parameter selection in
      non-parametric density estimation.
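As a concrete illustration (not from the slides), the empirical criterion can be minimized directly for a univariate normal model, where ∫ fθ(x)² dx has the closed form 1/(2σ√π). Everything here is a sketch with my own variable names:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def l2e_gaussian(params, x):
    """Empirical L2E criterion for f = N(mu, sigma).

    The integral of f^2 is 1/(2*sigma*sqrt(pi)); the second term is twice
    the sample average of the density evaluated at the observations.
    """
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # parameterize on the log scale so sigma > 0
    return 1.0 / (2.0 * sigma * np.sqrt(np.pi)) - 2.0 * np.mean(
        norm.pdf(x, loc=mu, scale=sigma)
    )

# 95% of the data from N(0, 1), plus 5% gross outliers near 8
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(8.0, 0.5, 5)])
res = minimize(l2e_gaussian, x0=[np.median(x), 0.0], args=(x,),
               method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

The fitted (mu_hat, sigma_hat) stay near the main mode despite the outliers, which is the robustness the slides are after.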
Density-power divergence


   The L2 E and MLE are the empirical minimizers of two different
   points in a spectrum of divergence measures [Basu et al., 1998]:

       d_γ(g, fθ) = ∫ [ fθ^{1+γ}(z) − (1 + 1/γ) g(z) fθ^γ(z) + (1/γ) g^{1+γ}(z) ] dz,

       γ > 0 trades off efficiency for robustness.
       γ = 1 gives the L2 loss.
       γ → 0 gives the Kullback–Leibler divergence.
Robustness of the L2 distance


               θ̂ = argmin_θ ∫ (fθ(x) − g(x))² dx.

       The L2 distance is zero-forcing:

               g(x) = 0 forces fθ(x) = 0.

       It puts a premium on avoiding "false positives".
       L2 E balances

                       mass where data are present
                       vs.
                       no mass where data are absent.
Partial Densities: An extra degree of freedom


       Expand the search space [Scott, 2001]:

               ∫ (w fθ(x) − g(x))² dx.

       Fit a parametric model to only a fraction, w, of the data
       (hopefully the fraction that is well described by the
       parametric model!):

               (θ̂, ŵ) = argmin_{θ,w} [ w² ∫ fθ(x)² dx − (2w/n) Σ_{i=1}^n fθ(X_i) ].
Logistic L2 E loss



   Let F(u) = 1/(1 + exp(−u)) be the logistic function; then

       (β̂, ŵ) = argmin_{β, w∈[0,1]}  (w²/n) Σ_{i=1}^n [ F(x_iᵀβ)² + (1 − F(x_iᵀβ))² ]
                                      − (2w/n) Σ_{i=1}^n [ y_i F(x_iᵀβ) + (1 − y_i)(1 − F(x_iᵀβ)) ].
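In code, the criterion above is a direct translation (a sketch; the function and variable names are mine):

```python
import numpy as np

def logistic_l2e_loss(beta, w, X, y):
    """Logistic L2E loss:
    (w^2/n) * sum_i [F_i^2 + (1-F_i)^2] - (2w/n) * sum_i [y_i F_i + (1-y_i)(1-F_i)],
    where F_i = F(x_i' beta) is the logistic function of the linear predictor."""
    F = 1.0 / (1.0 + np.exp(-X @ beta))
    return w**2 * np.mean(F**2 + (1.0 - F) ** 2) - 2.0 * w * np.mean(
        y * F + (1.0 - y) * (1.0 - F)
    )
```

As a sanity check, with w = 1 a perfectly separating β drives the loss toward −1, while β = 0 gives −1/2, so better fits score lower.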
Two dimensional example


   [Figure: scatter of the data in the (X1, X2) plane, showing three
   clusters.]

       n = 300 and p = 2.
       Three clusters, each of size 100:
           Two are labelled 0.
           One is labelled 1.
   [Figure: decision boundaries in the (X1, X2) plane for four fits.
   (a) MLE; (b) L2E A: ŵ = 1.026; (c) L2E B: ŵ = 0.666; (d) L2E C: ŵ = 0.668.]
The optimization problem



   Challenges
       The L2 E loss is not convex.
              The Hessian of the L2 E loss is indefinite.
              Standard Newton-Raphson fails.
       Scalability and stability as p increases?

   Solution
       Majorization-Minimization
Majorization-Minimization




   Strategy
   Minimize a surrogate function, the majorization.
   Choose the surrogate such that
       ↓ surrogate =⇒ ↓ objective.
       the surrogate is easier to minimize than the objective.
Majorization-Minimization




   Definition
   Given f and g , real-valued functions on Rp , g majorizes f at x if
    1. g (x) = f (x) and
    2. g (u) ≥ f (u) for all u.
[Animated figure, repeated over several frames: a spectrum of logistic
models ordered by lack of fit (from more to less), labelled "very bad",
"optimal", and "less bad".]
Quadratic majorization of the logistic L2 E loss

   The loss has bounded curvature with respect to β. Fix w.
   Majorize the exact second-order Taylor expansion:

                  β^(m+1) = β^(m) − (1/K) (XᵀX)⁻¹ Xᵀ Z^(m),

   where

                  K ≥ (1/4) max_{z∈[−1,1]} [ (3/2) w z⁴ − z³ − 2wz² + z + w/2 ].

        K controls the step size; its lower bound is related to the
        maximum curvature of the loss.
        Z^(m) is a working response that depends on Y and Xβ^(m).
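A runnable sketch of this update follows. The slides do not spell out Z^(m) or K in full, so here I take Z_i to be the derivative of the per-observation loss with respect to the linear predictor (up to a constant) and use a crude curvature bound of my own; both are assumptions, not the slides' exact choices:

```python
import numpy as np

def mm_logistic_l2e(X, y, w=1.0, n_iter=1000):
    """MM iterations beta <- beta - (1/K) (X'X)^{-1} X' Z for the
    logistic L2E loss with w held fixed.

    Z_i = F_i(1-F_i) [w^2 (2F_i - 1) - w (2y_i - 1)] is d(loss_i)/du_i
    up to a constant factor, and K = 0.5*(1.5*w^2 + w) is a conservative
    upper bound on the per-observation curvature (my assumption), so each
    step decreases the surrogate and hence the loss.
    """
    n, p = X.shape
    beta = np.zeros(p)
    A = np.linalg.solve(X.T @ X, X.T)  # (X'X)^{-1} X', computed once
    K = 0.5 * (1.5 * w**2 + w)
    for _ in range(n_iter):
        F = 1.0 / (1.0 + np.exp(-X @ beta))
        Z = F * (1.0 - F) * (w**2 * (2.0 * F - 1.0) - w * (2.0 * y - 1.0))
        beta = beta - (1.0 / K) * (A @ Z)
    return beta
```

Because the objective is non-convex, this only guarantees monotone descent to a stationary point, which is exactly the role of MM here.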
Continuous variable selection with the LASSO



   Minimize

                  "L2 E loss" + λ Σ_{i=1}^p |β_i|.

   The penalized majorization of the loss majorizes the penalized loss.
   Minimize

                  "majorization of L2 E loss" + λ Σ_{i=1}^p |β_i|.
Coordinate Descent



   Suppose X is standardized; then

                  β_k^(m+1) = S( β_k^(m) − (1/K) x_(k)ᵀ Z^(m), λ ),

   where S is the soft-thresholding function

                  S(x, λ) = sign(x) max(|x| − λ, 0).

   Extension to the elastic net is straightforward.
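The soft-thresholding operator is one line, and a single coordinate update built on it looks like the following (a sketch with my own names; it assumes the columns of X are standardized, as the slide requires):

```python
import numpy as np

def soft_threshold(x, lam):
    """S(x, lambda) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def coordinate_update(beta, k, X, Z, K, lam):
    """One pass of the slide's update for coordinate k:
    beta_k <- S(beta_k - (1/K) x_(k)' Z, lambda)."""
    b = beta.copy()
    b[k] = soft_threshold(beta[k] - (X[:, k] @ Z) / K, lam)
    return b
```

Cycling `coordinate_update` over k = 1, ..., p, refreshing the working response Z between sweeps, gives the coordinate-descent scheme.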
Heuristic Model Selection



   Regularization Path
   Calculate the penalized regression coefficients over a range of λ values.

   Information Criterion
       For each λ, calculate the deviance loss at the L2 E coefficients
       and add a correction term (AIC or BIC).
       Select the model with the lowest AIC/BIC value.
       Use the number of non-zero penalized regression coefficients as
       the degrees of freedom [Zou et al., 2007].
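A sketch of the BIC score just described: the binomial deviance at the fitted coefficients plus a log(n) penalty per active coefficient, with the non-zero count as the degrees-of-freedom proxy. Function names are mine:

```python
import numpy as np

def binomial_deviance(beta, X, y, eps=1e-12):
    """Deviance loss (-2 * log-likelihood) at the fitted coefficients."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    p = np.clip(p, eps, 1.0 - eps)  # guard the logs at p near 0 or 1
    return -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def bic_score(beta, X, y):
    """Deviance plus log(n) times the number of non-zero coefficients,
    the degrees-of-freedom proxy of [Zou et al., 2007]."""
    return binomial_deviance(beta, X, y) + np.log(len(y)) * np.count_nonzero(beta)
```

Along the regularization path one evaluates `bic_score` at each λ's coefficient vector and keeps the minimizer.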
Heuristic Model Selection


   [Figure: left, the regularization path of the coefficients β_j against
   log10(λ); right, the L2E BIC against log10(λ), with the number of
   non-zero coefficients (151 down to 0) along the top axis.]
Simulations: Estimation




       n = 200, p = 4
       X_i | Group 1 ∼ i.i.d. N(µ, σ)
       X_i | Group 2 ∼ i.i.d. N(−µ, σ)
       β = (1, 1/2, 1, 2)
       Y_i | X_i ∼ independently Bern(F(X_iᵀβ))
       1,000 replicates.
Case 1




   Vary the position of a single outlier.
Distributions of fitted coefficients


   [Figure: boxplots of the fitted value of each coefficient (panels 1-4)
   against the outlier position (−0.25, 1.5, 3, 6, 12, 24), comparing
   three methods: MLE, L2E with w = 1, and L2E with w = w_opt.]
Estimation




   The MLE regression coefficients are driven to zero (implosion breakdown).
Case 2




   Vary the number of outliers at a fixed position.
Distributions of fitted coefficients


   [Figure: boxplots of the fitted value of each coefficient against the
   number of outliers at a fixed position (panels 1 and 2 shown).]
                                                                                                                                                                        q q
                          q q
                          q q
                        q q q     q   q
                                      q       q   q           q
                                                              q    q       q
                                                                           q    q       q q
                                                                                        q q             q q       q q         q q             q q          q q          q
                        q q q
                        q q q     q
                                  q   q       q
                                              q   q           q    q       q    q
                                                                                q       q q             q q       q q         q q             q            q            q
                    2   q q q
                        q
                        q q q
                                  q
                                q q
                                  q
                                  q
                                      q
                                      q
                                      q
                                      q
                                              q
                                              q
                                              q
                                              q
                                                  q
                                                  q
                                                  q
                                                  q
                                                              q
                                                              q
                                                              q
                                                              q
                                                                   q
                                                                   q
                                                                   q       q
                                                                           q
                                                                           q
                                                                           q
                                                                                q       q
                                                                                        q
                                                                                        q
                                                                                        q         2           q
                        q q q   q q
                                q             q               q            q            q
                                q
                                q
                                q
                                          q
                                          q
                                          q               q
                                                          q            q
                                                                       q            q
                                                                                    q             1                       q
                                                                                                                                          q            q            q
                    1                     q
                                          q
                                          q
                                          q
                                                          q
                                                          q
                                                          q
                                                                       q
                                                                       q
                                                                       q
                                                                       q
                                                                                    q
                                                                                    q
                                                                                    q
                                                                                    q
                                                          q            q
                                                                       q            q
                                                                                    q
                        q                   q
                                            q q             q q          q            q q
                                                                                                  0
                    0   q q q
                        q q q
                        q
                        q q q   q q q
                                q q q
                                q q q       q q
                                            q q             q q
                                                            q q          q
                                                                         q
                                                                                q
                                                                                q
                                                                                q     q
                                                                                      q q
                        q q q
                        q       q q q
                                q
                                q   q       q q             q q
                                                            q q          q
                                                                         q      q
                                                                                q     q q
                                                                                      q q
                                                                                        q
                        q q q   q q q     q q
                                          q                 q            q            q          −1   q q q   q q q           q q             q            q            q
                                          q q q           q q q                 q                                                                                                Method
    Fitted Value




                          q q     q q     q   q           q            q q
                                                                       q            q q q                                                          q            q
                                          q                   q                     q                                                                                        q
                   −1                     q               q
                                                          q
                                                          q
                                                                       q
                                                                       q
                                                                       q
                                                                       q
                                                                                q   q
                                                                                    q
                                                                                    q
                                                                                    q
                                                                                        q                                 q
                                                                                                                          q               q
                                                                                                                                          q            q            q
                                                                                                                                                                                               MLE
                                                      3                                                                               4
                                                                                                                                                                                          L2E: w = 1
                                                                                             q                                                                               q
                                                                                                                                                                q            q
                                                                                             q    8                                                q                                  L2E: w = w.opt
                                                                                             q                                    q
                    6                                              q
                                                                                q
                                                                                q                                     q                                                      q
                                                  q                                                     q q       q           q               q                 q            q
                                                                   q
                          q q     q q         q q                                                 6                                                q       q            q
                                                              q            q q          q q                                       q
                    4     q q     q q         q               q              q
                                                                           q q          q q
                                                                q
                        q q q     q q         q q             q            q            q
                                                                                                  4
                    2
                                                                                                  2
                    0
                                                                                                  0

                         0       1            5               10           15           20             0          1           5               10           15           20
                                                                                    Number of outliers
Simulations: Variable Selection


       n = 200, p = 1000
       Xi | Group 1 ∼ i.i.d. N(µ, σ)
       Xi | Group 2 ∼ i.i.d. N(−µ, σ)
       β = (1, 1, 1, 1, 0, . . . , 0)
       Yi | Xi ∼ i.d. Bern(F(XiT β))
       1,000 replicates.

   Single Outlier
   A single outlier is moved along the ray that starts at the centroid of
   one group and extends in the direction (1, 1, 1, 1, 0, . . . , 0).
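The design above can be sketched in code. The snippet below is a minimal simulation of one replicate; the values µ = 1 and σ = 1, the equal group sizes, and all function names are our assumptions, since the slide does not specify them:

```python
import numpy as np

def simulate_replicate(n=200, p=1000, mu=1.0, sigma=1.0, rng=None):
    """One replicate of the variable-selection design: two groups with
    mean +/-mu on every coordinate, and a Bernoulli response driven by
    the first four coefficients only."""
    rng = np.random.default_rng(rng)
    # Half the rows drawn from N(mu, sigma), half from N(-mu, sigma).
    signs = np.repeat([1.0, -1.0], n // 2)
    X = rng.normal(loc=signs[:, None] * mu, scale=sigma, size=(n, p))
    # True coefficient vector: beta = (1, 1, 1, 1, 0, ..., 0).
    beta = np.zeros(p)
    beta[:4] = 1.0
    # Y_i | X_i ~ Bernoulli(F(X_i^T beta)), F the logistic function.
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    y = rng.binomial(1, prob)
    return X, y, beta
```

The single-outlier contamination would then replace one row of X with a point moved a chosen distance from the group-1 centroid (µ, …, µ) along (1, 1, 1, 1, 0, …, 0).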
Average number of correct variables selected

[Figure: expected number of correctly selected variables (0–4) under AIC (left panel) and BIC (right panel) versus outlier relative position (0 to 9.5). Methods compared: MLE, L2 E with w = 1, and L2 E with w = wopt.]
Average number of incorrect variables selected

[Figure: expected number of incorrectly selected variables (0–140) under AIC (left panel) and BIC (right panel) versus outlier relative position (0 to 9.5). Methods compared: MLE, L2 E with w = 1, and L2 E with w = wopt.]
Variable Selection




   Implosion breakdown ⇒ reduced SNR ⇒ missed detections
Outline


   The binary regression problem

   The L2 E Method

   Estimation

   Variable Selection

   Simulations

   Conclusion
Summary




      MLE logistic regression is sensitive to implosion breakdown.
      Both estimation and variable selection are affected: contaminants
      reduce the SNR.
      L2 E is robust because it is zero-forcing.
      Majorization-Minimization plus coordinate descent enables fast,
      stable optimization.
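The zero-forcing behavior comes from the logistic L2 E criterion itself. As a minimal numerical sketch of the weighted criterion from the deck (function and variable names are ours, not from the slides):

```python
import numpy as np

def logistic_l2e_loss(beta, X, y, w=1.0):
    """Logistic L2E criterion:
    (w^2/n) * sum[F^2 + (1-F)^2] - (2w/n) * sum[y*F + (1-y)*(1-F)],
    where F = F(x_i^T beta) and F is the logistic function."""
    F = 1.0 / (1.0 + np.exp(-X @ beta))
    n = len(y)
    quad = (w**2 / n) * np.sum(F**2 + (1.0 - F)**2)
    lin = (2.0 * w / n) * np.sum(y * F + (1.0 - y) * (1.0 - F))
    return quad - lin
```

On well-separated data the loss rewards fitting the labels, while the quadratic term penalizes placing mass where the data put none; at β = 0 (all F = 1/2) the loss is exactly −0.5 when w = 1.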
Future work




      Is w worth optimizing over?
      What is the correct AIC or BIC formulation?
      What are the degrees of freedom in the L2 E loss model?
References


      D.W. Scott.
      Parametric statistical modeling by minimum integrated square
      error.
      Technometrics, 43(3):274–285, 2001.
      A. Basu et al.
      Robust and efficient estimation by minimising a density power
      divergence.
      Biometrika, 85(3):549–559, 1998.
      H. Zou et al.
      On the “degrees of freedom” of the lasso.
      Annals of Statistics, 35(5):2173–2192, 2007.

Robust parametric classification and variable selection with minimum distance estimation

  • 1. Robust parametric classification and variable selection with minimum distance estimation Eric Chia,b,1 with David W. Scotta,2 a Department of Statistics, Rice University b Baylor College of Medicine June 17, 2010 1 DOE DE-FG02-97ER25308 2 NSF DMS-09-07491
  • 2. Outline The binary regression problem The L2 E Method Estimation Variable Selection Simulations Conclusion
  • 3. Outline The binary regression problem The L2 E Method Estimation Variable Selection Simulations Conclusion
  • 4. Logistic Regression Suppose we wish to predict y ∈ {0, 1}n using X ∈ Rn×p . The number of features p could be very large.
  • 5. Univariate Logistic Regression: MLE 1.0 qq q q q q q qq q q qq q q q qq qq qq qqqq q qq q qq q q qq qq q qq q 0.8 0.6 Pr(Y=1) 0.4 0.2 0.0 qq q q q q q qq q q q q q q q q q q q q q qq qq q q q qq q q q q q q q qq q q q q q −6 −4 −2 0 2 4 6 X
  • 6. MLE is sensitive to outliers 1.0 q qq q q qq q q q q q qq q q qq q q q qq qq qq qqqq q qq q qq q q qq qq q qq q 0.8 0.6 Pr(Y=1) 0.4 0.2 0.0 qq q q q q q qq q q q q q q q q q q q q q qq qq q q q qq q q q q q q q qq q q q q q −6 −4 −2 0 2 4 6 X
  • 7. MLE is sensitive to outliers Likelihood based choice Outlier or not, MLE puts mass wherever data lies. Cost: MLE puts mass over regions where there is no data.
  • 8. MLE is sensitive to outliers 1.0 q qq q q qq q q q q q qq q q qq q q q qq qq qq qqqq q qq q qq q q qq qq q qq q 0.8 0.6 Pr(Y=1) 0.4 0.2 0.0 qq q q q q q qq q q q q q q q q q q q q q qq qq q q q qq q q q q q q q qq q q q q q −6 −4 −2 0 2 4 6 X There are no ’ones’ between -4 and -2. But P(Y = 1|X ∈ (−4, −2)) ↑. There are no ’zeros’ between 4 and 6. But P(Y = 0|X ∈ (4, 6)) ↑.
  • 9. Outline The binary regression problem The L2 E Method Estimation Variable Selection Simulations Conclusion
  • 10. The L2 distance as an alternative to the deviance loss. g : unknown true density. fθ : putative parametric density. Find θ that minimizes the ISE ˆ θ = argmin (fθ (x) − g (x))2 dx. θ
  • 11. The L2 E Method The equivalent empirical criterion: n ˆ 2 θ = argmin fθ (x)2 dx − fθ (Xi ) , θ n i=1 where Xi ∈ Rp is the covariate vector of the i th observation. The L2 Estimator or L2 E [Scott, 2001]. Familar quantity: Smoothing parameter selection in non-parametric density estimation.
  • 12. Density-power divergence The L2 E and MLE are empirical minimizers of two different points in a spectrum divergence measures [Basu et al, 1998]. 1 1 1+γ dγ (g , fθ ) = fθ1+γ (z) − 1 + g (z)fθγ (z) + g (z) dz, γ γ γ > 0 trades off efficiency for robustness. γ = 1 =⇒ L2 loss. γ → 0 =⇒ Kullback - Leibler divergence.
  • 13. Robustness of the L2 distance ˆ θ = argmin (fθ (x) − g (x))2 dx. θ The L2 distance is zero-forcing: g (x) = 0 forces fθ (x) = 0. Puts premium on avoiding “false positives”. L2 E balances: mass where data is v.s. no mass where data is absent.
  • 14. Partial Densities: An extra degree of freedom Expand the search space [Scott, 2001]: (wfθ (x) − g (x))2 dx. Fit a parametric model to only a fraction, w , of the data (Hopefully the fraction described well by the parametric model!) n ˆ ˆ 2 2 2w (θ, w ) = argmin w fθ (x) dx − fθ (Xi ) . θ,w n i=1
  • 15. Logistic L2 E loss Let F (u) = 1/(1 + exp(−u)), logistic function, then n ˆ ˆ w2 (β, w ) = argmin F (xiT β)2 + (1 − F (xiT β))2 β,w ∈[0,1] n i=1 n w −2 yi F (xiT β) + (1 − yi )(1 − F (xiT β)) . n i=1
  • 16. Two dimensional example 4 2 0 X2 −2 −4 −5 0 5 10 X1 n = 300 and p = 2. Three clusters each of size 100 Two are labelled 0 One is labelled 1
  • 17. 5 5 0 0 X2 X2 −5 −5 0 5 10 0 5 10 X1 X1 (a) MLE (b) L2 E A :w = 1.026 ˆ
  • 18. 5 5 0 0 X2 X2 −5 −5 0 5 10 0 5 10 X1 X1 (c) L2 E B: w = 0.666 ˆ (d) L2 E C: w = 0.668 ˆ
  • 19. Outline The binary regression problem The L2 E Method Estimation Variable Selection Simulations Conclusion
  • 20. The optimization problem Challenges L2 E loss is not convex. Hessian of the L2 E loss is non-definite. Standard Newton-Raphson fails. Scalability and stability as p increases? Solution Majorization-Minimization
  • 21. Majorization-Minimization Strategy Minimize a surrogate function, majorization. Choose surrogate such that ↓ surrogate =⇒ ↓ objective. surrogate is easier to minimize than objective.
  • 22. Majorization-Minimization Definition Given f and g , real-valued functions on Rp , g majorizes f at x if 1. g (x) = f (x) and 2. g (u) ≥ f (u) for all u.
  • 23. More Lack of fit Less very bad optimal less bad The spectrum of logistic models
  • 24. More Lack of fit Less very bad optimal less bad The spectrum of logistic models
  • 25. More Lack of fit Less very bad optimal less bad The spectrum of logistic models
  • 26. More Lack of fit Less very bad optimal less bad The spectrum of logistic models
  • 27. More Lack of fit Less very bad optimal less bad The spectrum of logistic models
  • 28. More Lack of fit Less very bad optimal less bad The spectrum of logistic models
  • 29. More Lack of fit Less very bad optimal less bad The spectrum of logistic models
  • 30. More Lack of fit Less very bad optimal less bad The spectrum of logistic models
  • 31. More Lack of fit Less very bad optimal less bad The spectrum of logistic models
  • 32. More Lack of fit Less very bad optimal less bad The spectrum of logistic models
  • 33. Quadratic majorization of the logistic L2 E loss The loss has bounded curvature with respect to β. Fix w . Majorize the exact second order Taylor expansion. 1 T −1 T (m) β (m+1) = β (m) − (X X ) X Z , K where 1 3 4 w K≥ max wz − z 3 − 2wz 2 + z + . 4 z∈[−1,1] 2 2 K controls the step size. Its lower bound is related to the maximum curvature of the loss. Z (m) is a working response that depends on Y and X β (m) .
  • 34. Outline The binary regression problem The L2 E Method Estimation Variable Selection Simulations Conclusion
  • 35. Continuous variable selection with the LASSO Minimize p “L2 E loss ”+λ |βi | i=1 Penalized majorization of loss majorizes the penalized loss. Minimize p “majorization of L2 E loss ”+λ |βi | i=1
  • 36. Coordinate Descent Suppose X is standardized, then (m+1) (m) 1 T (m) βk = S βk − X Z ,λ , K (k) where S is the soft threshold function S(x, λ) = sign(x) max(|x| − λ, 0). Extension to elastic net is straightforward.
  • 37. Heuristic Model Selection Regularization Path Calculate penalized regression coefficients for range of λ values. Information Criterion For each λ, calculate deviance loss using L2 E coefficients and add correction term (AIC and BIC). Select model with lowest AIC/BIC value. Use number of non-zero penalized regression coefficients for degrees of freedom [Zou et al, 2007].
  • 38. Heuristic Model Selection 151 127 111 97 69 38 7 4 4 4 4 3 2 0 800 1.5 700 1.0 600 0.5 L2E BIC βj 500 0.0 400 -0.5 300 -1.0 200 -3.0 -2.5 -2.0 -1.5 -3.0 -2.5 -2.0 -1.5 log10(λ) log10(λ)
  • 39. Outline The binary regression problem The L2 E Method Estimation Variable Selection Simulations Conclusion
  • 40. Simulations: Estimation n = 200, p = 4 Xi | Group 1 ∼ i.i.d. N(µ, σ) Xi | Group 2 ∼ i.i.d. N(−µ, σ) β = (1, 1/2, 1, 2) Yi |Xi ∼ i.d. Bern(F (XiT β)) 1,000 replicates.
  • 41. Case 1 Vary position of 1 outlier.
  • 42. Distributions of fitted coefficients 1 2 6 4 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 2 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Method Fitted Value q q q q q q q q q q q q q q q q q q q q q q q q q q MLE 3 4 L2E: w = 1 q q q q q q q q q q L2E: w = wopt 6 q q q q q q q q q q q q 4 q q q q q q q q q q q q q q q q q q q q q q q q q 2 0 −0.25 1.5 3 6 12 24 −0.25 1.5 3 6 12 24 Outlier position
• 43. Estimation The MLE regression coefficients are driven to zero (implosion breakdown).
  • 44. Case 2 Vary number of outliers at a fixed position.
• 45. Distributions of fitted coefficients [Figure: boxplots of the fitted coefficients β1–β4 for MLE, L2E with w = 1, and L2E with w = wopt, as the number of outliers at a fixed position varies from 0 to 20]
• 46. Simulations: Variable Selection n = 200, p = 1000. Xi | Group 1 ∼ i.i.d. N(µ, σ); Xi | Group 2 ∼ i.i.d. N(−µ, σ). β = (1, 1, 1, 1, 0, . . . , 0). Yi | Xi ∼ ind. Bern(F(XiT β)). 1,000 replicates. Single outlier: moved along the ray starting at the centroid of one group and extending in the direction (1, 1, 1, 1, 0, . . . , 0).
• 47. Average number of correct variables selected [Figure: mean number of correctly selected variables (out of 4) versus outlier relative position (0 to 9.5), under AIC and BIC, for MLE, L2E with w = 1, and L2E with w = wopt]
• 48. Average number of incorrect variables selected [Figure: mean number of incorrectly selected variables versus outlier relative position (0 to 9.5), under AIC and BIC, for MLE, L2E with w = 1, and L2E with w = wopt]
  • 49. Variable Selection Implosion breakdown =⇒ reduced SNR =⇒ missed detections
  • 50. Outline The binary regression problem The L2 E Method Estimation Variable Selection Simulations Conclusion
  • 51. Summary MLE logistic regression is sensitive to implosion breakdown. Estimation and variable selection are affected: contaminants reduce SNR. L2 E is robust because it is zero forcing. Majorization-Minimization + Coordinate Descent facilitate fast and stable optimization.
  • 52. Future work Is w worth optimizing over? What is the correct AIC or BIC formulation? What are the degrees of freedom in the L2 E loss model?
• 53. References D.W. Scott. Parametric statistical modeling by minimum integrated square error. Technometrics, 43(3):274–285, 2001. A. Basu et al. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559, 1998. H. Zou et al. On the “degrees of freedom” of the lasso. Annals of Statistics, 35(5):2173–2192, 2007.