Journal of the Indian Statistical Association
Vol.53 No. 1 & 2, 2015, 153-174
Variable selection using
Kullback-Leibler divergence loss
Shibasish Dasgupta
University of South Alabama, Mobile, U.S.A.
Abstract
The adaptive lasso is a recent technique for simultaneous estimation
and variable selection where adaptive weights are used for penalizing
different coefficients in the l1 penalty. In this paper, we propose an
alternative approach to the adaptive lasso through the Kullback-Leibler
(KL) divergence loss, called the KL adaptive lasso, where we replace the
squared error loss in the adaptive lasso set up by the KL divergence
loss which is also known as the entropy distance. There are various
theoretical reasons to defend the use of Kullback-Leibler distance,
ranging from information theory to the relevance of logarithmic scoring
rule and the location-scale invariance of the distance. We show
that the KL adaptive lasso enjoys the oracle properties; namely, it
performs as well as if the true underlying model were given in advance.
Furthermore, the KL adaptive lasso can be solved by the same efficient
algorithm for solving the lasso. We also discuss the extension of the
KL adaptive lasso in generalized linear models (GLMs) and show that
the oracle properties still hold under mild regularity conditions.
Key Words : Asymptotic normality, Adaptive lasso, Divergence loss,
Generalized linear models, Kullback-Leibler, Oracle property, Sparsity,
Variable selection
Received: April, 2014
1 Introduction
There are two fundamental goals in statistical learning: ensuring high
prediction accuracy and discovering relevant predictive variables. Variable
selection is particularly important when the true underlying model has a
sparse representation. Identifying significant predictors will enhance the
prediction performance of the fitted model. Variable selection is also
fundamental to high-dimensional statistical modeling. Many approaches
in use are stepwise selection procedures, which can be computationally
expensive and ignore stochastic errors in the variable selection process.
Regularization methods are characterized by loss functions measuring data
fits and penalty terms constraining model parameters. The ‘lasso’ is a
popular regularization technique for simultaneous estimation and variable
selection (Tibshirani, 1996). Fan and Li (2006) gave a comprehensive
overview of variable/feature selection and proposed a unified framework to
approach the problem of variable selection.
1.1 Variable selection and the lasso
Let us consider the usual linear model set up. Suppose we observe an
independent and identically distributed (iid) sample (xi,yi), i = 1,...,n,
where xi=(xi1,...,xip) is the vector of p covariates. The linear model is
given by $Y = X\beta + \epsilon$, where $\epsilon \sim N(0, \sigma^2 I)$. Our main interest is to
estimate the regression coefficient $\beta = (\beta_1,\ldots,\beta_p)$. We also know that
for $p < n$, the least squares estimate (LSE) of $\beta$ is unique and is given
by $\hat\beta_{LS} = (X^T X)^{-1} X^T Y$. Moreover, $\hat\beta_{LS}$ is the Best Linear Unbiased
Estimator (BLUE) in this scenario. But in the high-dimensional setting the
dimension of $X^T X$ is usually very large, which in turn leads to an unstable
estimator of $\beta$. So, if $p > n$, then the LSE is not unique and it will usually
overfit the data, i.e., all observations are predicted perfectly, but there are
many solutions for the coefficients of the fit and new observations are not
uniquely predictable. The classical solution to this problem was to try to
reduce the number of variables by processes such as forward and backward
regression with reduction in variables determined by hypothesis tests, see
Draper and Smith (1998), for example.
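For readers who want to compute these quantities directly, here is a minimal numpy sketch of the least squares fit on simulated data (the data-generating choices are purely illustrative and not from the paper):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5])
y = X @ beta_true + rng.normal(size=n)

# Least squares estimate; lstsq is the numerically stable way to
# compute (X'X)^{-1} X'y.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)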
Suppose the true (unknown) regression coefficients are $\beta^* = (\beta^*_1,\ldots,\beta^*_p)^T$. We denote the true non-null set as $A = \{j : \beta^*_j \neq 0\}$ and the true dimension of this set is given by $d = |A|$. So, we have the relationship $d < n < p$. Now, the parameter estimate is $\hat\beta$ and the estimated non-null set is denoted by $A_n = \{j : \hat\beta_j \neq 0\}$.
For variable selection, our goal will be 3-fold:
• Variable Selection Consistency: recover the true non-zero set $A$; we would like $A_n = A$.
• Estimation Consistency: $\hat\beta$ is close to $\beta^*$.
• Prediction Accuracy: $X\hat\beta$ is close to $X\beta^*$.
The classical methods for variable selection are the following:
• Best subset selection: consider all $2^p$ sub-models and choose the best one
• Forward Selection: starting from the null model, add the most significant variable sequentially
• Backward Selection: starting from the full model, delete the most insignificant variable sequentially
• Stepwise Regression: at every step, consider both adding and deleting a variable
But there are some problems in the above methods, namely, estimation
accuracy, computational expediency and algorithmic stability. An
alternative strategy that emerged was penalizing the squared error loss, i.e., adding to the residual sum of squares $\|Y - X\beta\|^2$ a penalty, $\mathrm{pen}(\beta;\lambda)$. So, the penalized loss function is given by:
$$L(\beta;\lambda) = \|Y - X\beta\|^2 + \mathrm{pen}(\beta;\lambda).$$
When $\mathrm{pen}(\beta;\lambda) = \lambda\|\beta\|^2$, this is called ridge regression (Hoerl and Kennard, 1970) and we have the (unique) minimizer
$$\hat\beta_{Ridge} = (X^T X + \lambda I)^{-1} X^T Y.$$
The motivation of the lasso is Breiman’s non-negative garotte (Breiman,
1995a):
$$\hat\beta = \arg\min \left\{ \sum_{i=1}^{n} \Big(y_i - \sum_{j} c_j \hat\beta^0_j x_{ij}\Big)^2 \right\} \quad \text{s.t. } c_j \geq 0,\ \sum_{j} c_j \leq t,$$
where $\hat\beta^0_j$ is the full LSE of $\beta_j$, $j = 1,\ldots,p$, and $t \geq 0$ is a tuning parameter.
There are some advantages of using non-negative garotte as it gives
lower prediction error than subset selection and it is competitive with ridge
regression except when the true model has many small non-zero coefficients.
But the drawback of this method is that it depends on both the sign and
the magnitude of LSE, thus it suffers when LSE behaves poorly.
The lasso is a shrinkage and selection method for linear regression. It
minimizes the usual sum of squared errors, with a bound on the sum of the
absolute values of the coefficients. Because of the nature of this constraint
it tends to produce some coefficients that are exactly 0 and hence gives
interpretable models. The Lasso has several advantages. It simultaneously
performs model selection and model fit. In addition, although it is a non-
linear method, it is the global minimum of a convex penalized loss function
and can be computed efficiently.
For the lasso, we standardize $x_{ij}$ so that $\sum_i x_{ij}/n = 0$ and $\sum_i x^2_{ij}/n = 1$. Now, denoting $\hat\beta = (\hat\beta_1,\ldots,\hat\beta_p)^T$, the lasso estimate $\hat\beta$ is given by:
$$\hat\beta = \arg\min \left\{ \sum_{i=1}^{n} \Big(y_i - \sum_{j} \beta_j x_{ij}\Big)^2 \right\} \quad \text{s.t. } \sum_{j} |\beta_j| \leq t.$$
Here, the tuning parameter $t$ controls the amount of shrinkage, i.e., when $t < t_0 = \sum_j |\hat\beta^0_j|$ (where $\hat\beta^0_j$ is the full LSE), the solutions are shrunk towards 0, and some coefficients may be exactly equal to 0.
The above optimization problem is a quadratic programming problem with linear inequality constraints; equivalently, it can be written in the penalized (Lagrangian) form:
$$\hat\beta_{lasso} = \arg\min \left\{ \sum_{i=1}^{n} \Big(y_i - \sum_{j} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j} |\beta_j| \right\}.$$
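As an illustration of the sparsity induced by the l1 penalty, here is a minimal sketch using scikit-learn's Lasso on simulated data; scikit-learn's alpha corresponds to $\lambda/(2n)$ in the display above, and the value alpha = 0.1 is an arbitrary choice:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.5]) + rng.normal(size=100)

# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1.
fit = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print(fit.coef_)   # some coefficients come out exactly zero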
1.2 Oracle property and the adaptive lasso
Let us consider model estimation and variable selection in linear regression
models. Suppose that $y = (y_1,\ldots,y_n)^T$ is the response vector and $x_j = (x_{1j},\ldots,x_{nj})^T$, $j = 1,\ldots,p$, are the linearly independent predictors. Let $X = [x_1,\ldots,x_p]$ be the predictor matrix. We assume that $E[y \mid x] = \beta^*_1 x_1 + \ldots + \beta^*_p x_p$. Without loss of generality, we assume that the data are centered, so the intercept is not included in the regression function. Let $A = \{j : \beta^*_j \neq 0\}$ and further assume that $|A| = p_0 < p$. Thus the true model depends only on a subset of the predictors. Denote by $\hat\beta(\delta)$ the coefficient estimator produced by a fitting procedure $\delta$. Using the language of Fan and Li (2001), we call $\delta$ an oracle procedure if $\hat\beta(\delta)$ (asymptotically) has the following oracle properties:
• Identifies the right subset model: $A_n = \{j : \hat\beta_j \neq 0\} = A$.
• Has the optimal estimation rate: $\sqrt{n}(\hat\beta(\delta)_A - \beta^*_A) \to_d N(0, \Sigma^*)$, where $\Sigma^*$ is the covariance matrix based on the true subset model.
It has been argued (Fan and Li 2001 and Fan and Peng 2004) that a good
procedure should have these oracle properties.
Knight and Fu (2000) studied the asymptotic behavior of lasso-type
estimators. Under some appropriate conditions, they showed that the
limiting distributions have positive probability mass at 0 when the true
value of the parameter is 0, and they established asymptotic normality for
large parameters in some sense. Fan and Li (2001) conjectured that the
oracle properties do not hold for the lasso. They also proposed a smoothly
clipped absolute deviation (SCAD) penalty for variable selection and proved
its oracle properties.
Zou (2006) proposed a new version of the lasso, called the adaptive lasso,
where adaptive weights are used for penalizing different coefficients in the
l1 penalty and showed that the adaptive lasso enjoys the oracle properties;
namely, it performs as well as if the true underlying model were given in
advance. The adaptive lasso is defined as follows:
Suppose that $\hat\beta$ is a root-$n$-consistent estimator of $\beta^*$; for example, we can use $\hat\beta(\mathrm{ols})$. Pick a $\gamma > 0$, and define the weight vector $\hat{w} = 1/|\hat\beta|^{\gamma}$. The adaptive lasso estimators $\hat\beta_{adalasso}$ are given by:
$$\hat\beta_{adalasso} = \arg\min \left\{ \sum_{i=1}^{n} \Big(y_i - \sum_{j} \beta_j x_{ij}\Big)^2 + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \right\},$$
where $\lambda_n$ varies with $n$.
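In practice the adaptive lasso is usually computed by folding the weights into the design matrix and calling an ordinary lasso solver. A minimal sketch with an arbitrary $(\gamma, \alpha)$ pair (alpha is on scikit-learn's scale, i.e., $\lambda_n/(2n)$):

import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.5]) + rng.normal(size=200)

gamma, alpha = 1.0, 0.1
beta_ols = LinearRegression(fit_intercept=False).fit(X, y).coef_
w = 1.0 / np.abs(beta_ols) ** gamma           # adaptive weights
fit = Lasso(alpha=alpha, fit_intercept=False).fit(X / w, y)   # weighted-design trick
beta_adalasso = fit.coef_ / w                 # transform back to the original scale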
The rest of the article is organized as follows. In Section 2 we propose an
alternative approach to the adaptive lasso through the KL divergence loss
and we show that our proposed methodology enjoys the oracle properties
for variable selection. In Section 3 we use the LARS algorithm (Efron
et al. 2004) to solve the entire solution path of this newly developed
methodology. In Section 4 we apply our variable selection method to the
diabetes data (Efron et al. 2004). We extend the variable selection theory
and methodology to the generalized linear models (GLMs) in Section 5, and
give concluding remarks in Section 6.
2 An alternative approach to the adaptive lasso
through the KL divergence loss
In this paper, we replace the squared error loss in the adaptive lasso set
up by the KL divergence loss, which is also known as the entropy distance.
There are various theoretical reasons to defend the use of Kullback-Leibler
distance, ranging from information theory to the relevance of logarithmic
scoring rule and the location-scale invariance of the distance, as detailed
in Bernardo and Smith (1994). We also show that the oracle properties
discussed above hold in this case under mild regularity conditions.
We adopt the setup of Knight and Fu (2000) for the asymptotic analysis.
We assume two conditions:
(a) $y_i = x_i^T\beta^* + \epsilon_i$, where $\epsilon_1,\ldots,\epsilon_n$ are iid random variables with mean 0 and variance $\sigma^2$. Here, for convenience, we assume that $\sigma^2$ is known and is equal to 1.
(b) $\frac{1}{n}X^T X \to C$, where $C$ is a positive definite matrix.
Without loss of generality, assume that $A = \{1,2,\ldots,p_0\}$. Let $C_{11}$ be the $p_0 \times p_0$ upper left-corner partitioned sub-matrix of $C$.
Now, suppose that $f(y_i \mid x_i, \beta^*)$ and $f(y_i \mid x_i, \beta)$ are the normal densities evaluated at $\beta^*$ and $\beta$ respectively. Then the “Adaptive Penalized KL Divergence” estimators (which we will refer to as the “KL adaptive lasso” estimators from now on), $\hat\beta^{KL}$, are given by:
$$\hat\beta^{KL} = \arg\min \left\{ \sum_{i=1}^{n} E_{\beta^*}\!\left[\log \frac{f(y_i \mid x_i, \beta^*)}{f(y_i \mid x_i, \beta)}\right] + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \right\}. \quad (2.1)$$
Now, since the vector of true regression coefficients $\beta^*$ is unknown, we replace it by $\hat\beta$, the ordinary least squares (ols) estimator of $\beta^*$. Hence, after a bit of algebra we get
$$\hat\beta^{KL} = \arg\min \left\{ [X\hat\beta - X\beta]^T [X\hat\beta - X\beta] + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \right\}, \quad (2.2)$$
where $\hat{w}_j = 1/|\hat\beta_j|^{\gamma}$. This is a convex optimization problem in $\beta$, since we have used a convex penalty here, and hence the local minimizer $\hat\beta^{KL}$ is the unique global KL adaptive lasso estimator (for non-convex penalties, however, the local minimizer may not be globally unique).
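For completeness, here is the short calculation behind (2.2) under the unit-variance normal model of assumption (a); the constant factor $\frac{1}{2}$ is absorbed into $\lambda_n$:
$$E_{\beta^*}\!\left[\log\frac{f(y_i \mid x_i,\beta^*)}{f(y_i \mid x_i,\beta)}\right] = \tfrac{1}{2}\,E_{\beta^*}\!\left[(y_i - x_i^T\beta)^2 - (y_i - x_i^T\beta^*)^2\right] = \tfrac{1}{2}\,(x_i^T\beta^* - x_i^T\beta)^2,$$
$$\text{so that} \quad \sum_{i=1}^{n} E_{\beta^*}\!\left[\log\frac{f(y_i \mid x_i,\beta^*)}{f(y_i \mid x_i,\beta)}\right] = \tfrac{1}{2}\,(X\beta^* - X\beta)^T(X\beta^* - X\beta),$$
and replacing $\beta^*$ by $\hat\beta$ gives (2.2).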
Let $A_n = \{j : \hat\beta^{KL}_j \neq 0\}$. We now show that, with a proper choice of $\lambda_n$, the KL adaptive lasso enjoys the oracle properties.
Theorem 2.1. Suppose that $\lambda_n/\sqrt{n} \to 0$ and $\lambda_n n^{(\gamma-1)/2} \to \infty$. Then, the KL adaptive lasso estimates must satisfy the following:
1. Consistency in variable selection: $\lim P(A_n = A) = 1$.
2. Asymptotic normality: $\sqrt{n}(\hat\beta^{KL}_A - \beta^*_A) \to_d N(0, C_{11}^{-1})$.
Proof. We first prove the asymptotic normality part. Let $\beta = \beta^* + \frac{u}{\sqrt{n}}$, and
$$\psi_n(u) = \left[X\hat\beta - X\Big(\beta^* + \frac{u}{\sqrt{n}}\Big)\right]^T \left[X\hat\beta - X\Big(\beta^* + \frac{u}{\sqrt{n}}\Big)\right] + \lambda_n \sum_{j=1}^{p} \hat{w}_j \left|\beta^*_j + \frac{u_j}{\sqrt{n}}\right|.$$
Let $\hat{u}^{(n)} = \arg\min \psi_n(u)$; then $\hat\beta^{KL} = \beta^* + \frac{\hat{u}^{(n)}}{\sqrt{n}}$, or $\hat{u}^{(n)} = \sqrt{n}(\hat\beta^{KL} - \beta^*)$. Define:
$$V^{(n)}(u) = \psi_n(u) - \psi_n(0) = u^T\Big(\frac{1}{n}X^T X\Big)u - 2\,\frac{u^T X^T X}{\sqrt{n}}(\hat\beta - \beta^*) + \frac{\lambda_n}{\sqrt{n}} \sum_{j=1}^{p} \hat{w}_j \sqrt{n}\left(\left|\beta^*_j + \frac{u_j}{\sqrt{n}}\right| - |\beta^*_j|\right). \quad (2.3)$$
Since $\hat\beta$ is the ols estimate of $\beta^*$, we have $\sqrt{n}(\hat\beta - \beta^*) \to_d N(0, \sigma^2 C^{-1})$; then by assumption (b) we get $\big(\frac{1}{n}X^T X\big)\sqrt{n}(\hat\beta - \beta^*) \to_d W$, where $W \sim N(0, \sigma^2 C)$. Now consider the limiting behavior of the third term in (2.3). If $\beta^*_j \neq 0$, then $\hat{w}_j \to_p |\beta^*_j|^{-\gamma}$ (using $\hat\beta_j \to_p \beta^*_j$ and the continuous mapping theorem) and $\sqrt{n}\big(|\beta^*_j + \frac{u_j}{\sqrt{n}}| - |\beta^*_j|\big) \to u_j\,\mathrm{sgn}(\beta^*_j)$. Then we have $\frac{\lambda_n}{\sqrt{n}}\hat{w}_j \sqrt{n}\big(|\beta^*_j + \frac{u_j}{\sqrt{n}}| - |\beta^*_j|\big) \to_p 0$, since by one of the assumptions of this theorem $\lambda_n/\sqrt{n} \to 0$.
If $\beta^*_j = 0$, then $\sqrt{n}\big(|\beta^*_j + \frac{u_j}{\sqrt{n}}| - |\beta^*_j|\big) = |u_j|$ and
$$\frac{\lambda_n}{\sqrt{n}}\hat{w}_j = \frac{\lambda_n}{\sqrt{n}}\, n^{\gamma/2}\,\big(\sqrt{n}\,|\hat\beta_j|\big)^{-\gamma} = \lambda_n n^{(\gamma-1)/2}\big(\sqrt{n}\,|\hat\beta_j|\big)^{-\gamma}, \quad \text{where } \sqrt{n}\,\hat\beta_j = O_p(1).$$
From the above and the assumption of the theorem that $\lambda_n n^{(\gamma-1)/2} \to \infty$, we have $\frac{\lambda_n}{\sqrt{n}}\hat{w}_j \sqrt{n}\big(|\beta^*_j + \frac{u_j}{\sqrt{n}}| - |\beta^*_j|\big) \to_p 0$ if $u_j = 0$, and $\to_p \infty$ if $u_j \neq 0$. Hence, we summarize the results as follows:
$$\frac{\lambda_n}{\sqrt{n}}\hat{w}_j \sqrt{n}\Big(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\Big) \to_p \begin{cases} 0, & \text{if } \beta^*_j \neq 0, \\ 0, & \text{if } \beta^*_j = 0 \text{ and } u_j = 0, \\ \infty, & \text{if } \beta^*_j = 0 \text{ and } u_j \neq 0. \end{cases}$$
Thus, by Slutsky's theorem, we see that $V^{(n)}(u) \to_d V(u)$ for every $u$, where $V(u) = u_A^T C_{11} u_A - 2u_A^T W_A$ if $u_j = 0\ \forall j \notin A$, and $V(u) = \infty$ otherwise. Now note that $V^{(n)}$ is convex and the unique minimum of $V$ is $(C_{11}^{-1}W_A, 0)^T$. Following the epi-convergence results of Geyer (1994) and Knight and Fu (2000), we have $\hat{u}^{(n)}_A \to_d C_{11}^{-1}W_A$ and $\hat{u}^{(n)}_{A^c} \to_d 0$. Finally, we observe that $W_A \sim N(0, \sigma^2 C_{11})$; this proves the asymptotic normality part.
Now we show the consistency part. $\forall j \in A$, the asymptotic normality result indicates that $\hat\beta^{(n)}_j \to_p \beta^*_j$; thus $P(j \in A_n) \to 1$. Then it suffices to show that $\forall j' \notin A$, $P(j' \in A_n) \to 0$. Consider the event $j' \in A_n$. Then, by the Karush-Kuhn-Tucker (KKT) optimality conditions (Hastie, Tibshirani, and Friedman 2009, p. 421), we know that $2x_{j'}^T(X\hat\beta - X\hat\beta^{KL}) = \lambda_n \hat{w}_{j'}$. Note that
$$\frac{\lambda_n \hat{w}_{j'}}{\sqrt{n}} = \frac{\lambda_n}{\sqrt{n}}\, n^{\gamma/2}\, \frac{1}{|\sqrt{n}\,\hat\beta_{j'}|^{\gamma}} \to_p \infty,$$
since $\lambda_n n^{(\gamma-1)/2} \to \infty$ and $\sqrt{n}\,\hat\beta_{j'} = O_p(1)$ (because $j' \notin A$, i.e., $\beta^*_{j'} = 0$), whereas
$$\frac{2x_{j'}^T(X\hat\beta - X\hat\beta^{KL})}{\sqrt{n}} = 2\,\frac{x_{j'}^T X}{n}\,\sqrt{n}(\hat\beta - \hat\beta^{KL}) = 2\,\frac{x_{j'}^T X}{n}\left\{\sqrt{n}(\hat\beta - \beta^*) - \sqrt{n}(\hat\beta^{KL} - \beta^*)\right\}.$$
Now observe that $\sqrt{n}(\hat\beta - \hat\beta^{KL}) = \sqrt{n}(\hat\beta - \beta^*) - \sqrt{n}(\hat\beta^{KL} - \beta^*) = O_p(1)$, since both $\sqrt{n}(\hat\beta - \beta^*)$ and $\sqrt{n}(\hat\beta^{KL} - \beta^*)$ converge in distribution to normal random variables. Thus $2\,\frac{x_{j'}^T X}{n}\,\sqrt{n}(\hat\beta - \hat\beta^{KL}) \to_p 0$. Hence,
$$P[j' \in A_n] \leq P\big[2x_{j'}^T(X\hat\beta - X\hat\beta^{KL}) = \lambda_n \hat{w}_{j'}\big] \to 0.$$
This completes the proof. ◻
3 Computations
In this section we discuss the computational issues. The KL adaptive lasso
estimates in (2.2) can be solved by the LARS algorithm (Efron et al. 2004).
The computational details are given in the following Algorithm, the proof of
which is very simple and so is omitted.
Algorithm (The LARS algorithm for the KL adaptive lasso).
1. Find $\hat\beta$, the ols estimate of $\beta^*$, by least squares estimation, and hence find $\hat{w}$ for some $\gamma > 0$.
2. Define $x^*_j = x_j/\hat{w}_j$, $j = 1,\ldots,p$.
3. Evaluate $\hat{y} = \sum_{j=1}^{p} x_j \hat\beta_j = X\hat\beta$.
4. Solve the following optimization problem for all $\lambda_n$:
$$\hat\beta^* = \arg\min_{\beta} \left\|\hat{y} - \sum_{j=1}^{p} x^*_j \beta_j\right\|^2 + \lambda_n \sum_{j=1}^{p} |\beta_j|.$$
5. Output $\hat\beta^{KL}_j = \hat\beta^*_j/\hat{w}_j$, $j = 1,\ldots,p$.
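A minimal Python sketch of this algorithm, using scikit-learn's LassoLars for step 4; the function name and the mapping of $\lambda_n$ to LassoLars's alpha (which rescales the squared error by $1/(2n)$) are ours, not prescribed by the paper:

import numpy as np
from sklearn.linear_model import LinearRegression, LassoLars

def kl_adaptive_lasso(X, y, lam, gamma=1.0):
    """Steps 1-5 of the LARS-based algorithm for the KL adaptive lasso.

    `lam` plays the role of lambda_n; LassoLars penalizes alpha*||beta||_1
    with the squared error divided by 2n, hence the alpha = lam/(2n) mapping.
    """
    n, p = X.shape
    # Step 1: OLS fit and adaptive weights w_j = 1 / |beta_ols_j|^gamma.
    beta_ols = LinearRegression(fit_intercept=False).fit(X, y).coef_
    w = 1.0 / np.abs(beta_ols) ** gamma
    # Step 2: rescale the columns, x*_j = x_j / w_j.
    X_star = X / w
    # Step 3: pseudo-response = OLS fitted values, y_hat = X beta_ols.
    y_hat = X @ beta_ols
    # Step 4: ordinary lasso (solved by LARS) on (X_star, y_hat).
    fit = LassoLars(alpha=lam / (2 * n), fit_intercept=False).fit(X_star, y_hat)
    # Step 5: transform back to the original scale.
    return fit.coef_ / w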
Tuning is an important issue in practice. Suppose that we use ˆβ (ols)
to construct the adaptive weights in the KL adaptive lasso; we then want to find an optimal pair $(\gamma, \lambda_n)$, i.e., the pair with the smallest cross-validation error. We can use two-dimensional cross-validation to tune the KL adaptive lasso. Note that for a given $\gamma$, we can use cross-validation along with the LARS algorithm to efficiently search for the optimal $\lambda_n$. In
principle, we can also replace ˆβ (ols) with other consistent estimators. Hence
we can treat it as the third tuning parameter and perform three-dimensional
cross-validation to find an optimal triple ( ˆβ,γ,λn). We suggest using ˆβ (ols)
unless collinearity is a concern, in which case we can try ˆβ (ridge) from the
best ridge regression fit, because it is more stable than ˆβ (ols).
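One simple way to operationalize the two-dimensional search, continuing the sketch above (the grid of gamma values and the use of LassoLarsCV's mse_path_ are our own choices, not the paper's):

import numpy as np
from sklearn.linear_model import LassoLarsCV

def tune_gamma(X, y_hat, beta_ols, gammas=(0.5, 1.0, 2.0)):
    """For each candidate gamma, cross-validate lambda along the LARS path
    and keep the (gamma, lambda) pair with the smallest mean CV error."""
    best = None
    for g in gammas:
        w = 1.0 / np.abs(beta_ols) ** g
        cv = LassoLarsCV(cv=5, fit_intercept=False).fit(X / w, y_hat)
        score = cv.mse_path_.mean(axis=-1).min()   # best mean CV error for this gamma
        if best is None or score < best[0]:
            best = (score, g, cv.alpha_)
    return best   # (cv error, gamma, lambda on LassoLars's alpha scale)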
4 Numerical example: diabetes data
From Efron et al.(2004), we have:
“Ten baseline variables, age, sex, body mass index (bmi), average blood
pressure (map), and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu)
were obtained for each of n = 442 diabetes patients, as well as the response
of interest (y), a quantitative measure of disease progression one year after
baseline.”
By applying our KL adaptive lasso variable selection methodology on this
data, we get the regression coefficient estimates for the predictor variables
as follows:
age 0.0000
sex −201.6889
bmi 540.5075
map 314.5000
tc −514.2194
ldl 268.7080
hdl 0.0000
tch 119.6211
ltg 682.5415
glu 0.0000
It is very clear from the above estimates that the variables “age”, “hdl” and “glu” do not have any significant influence on the response. Hence, we select only the remaining 7 variables to predict the response. This agrees with the results obtained from the lasso as well as the adaptive lasso applied to the same data.
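The same data ship with scikit-learn, so one can experiment with the method on the same 442 patients using the hypothetical kl_adaptive_lasso sketch from Section 3; note that scikit-learn standardizes the columns and labels the serum variables s1-s6, so the coefficient values will not numerically match the table above, and the value of lam below is an ad hoc choice:

import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()             # Efron et al. (2004) diabetes data
X, y = data.data, data.target
y = y - y.mean()                   # center the response; no intercept is fit
beta_kl = kl_adaptive_lasso(X, y, lam=50.0, gamma=1.0)
for name, b in zip(data.feature_names, beta_kl):
    print(f"{name:>4s} {b:10.4f}")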
5 Further extension
Having shown the oracle properties of the KL adaptive lasso in linear
regression models, we have further extended the theory and methodology to
generalized linear models (GLMs). We consider the penalized KL divergence
loss function using the adaptively weighted l1 penalty, where the density
belongs to the exponential family with canonical parameter θ. The generic
density form can be written as (McCullagh and Nelder, 1989)
$$f(y \mid x, \theta) = h(y)\exp(y\theta - \phi(\theta)).$$
Generalized linear models assume that $\theta = x^T \beta^*$. Suppose that $\hat\beta$ is the maximum likelihood estimate (mle) of $\beta^*$ in the GLM. We construct the weight vector $\hat{w} = 1/|\hat\beta|^{\gamma}$ for some $\gamma > 0$.
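For concreteness, the two members of this family that give the logistic and Poisson specializations used below are:
$$\text{Bernoulli: } f(y \mid \theta) = \exp\{y\theta - \log(1 + e^{\theta})\}, \quad \phi(\theta) = \log(1 + e^{\theta}), \quad \phi'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}};$$
$$\text{Poisson: } f(y \mid \theta) = \frac{1}{y!}\exp\{y\theta - e^{\theta}\}, \quad \phi(\theta) = e^{\theta}, \quad \phi'(\theta) = e^{\theta}.$$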
Suppose that $f(y_i \mid x_i, \beta^*)$ and $f(y_i \mid x_i, \beta)$ are the exponential family densities evaluated at $\beta^*$ and $\beta$ respectively. Note that $E_{\beta^*}(y_i \mid x_i) = \phi'(x_i^T \beta^*)$, where $\phi'(\cdot)$ is the first derivative of $\phi(\cdot)$. Then the KL adaptive lasso estimates $\hat\beta^{KL}(\mathrm{glm})$ are given by
$$\hat\beta^{KL}(\mathrm{glm}) = \arg\min \left\{ \sum_{i=1}^{n} E_{\beta^*}\!\left[\log \frac{f(y_i \mid x_i, \beta^*)}{f(y_i \mid x_i, \beta)}\right] + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \right\}$$
$$= \arg\min \left\{ \sum_{i=1}^{n} \left\{\phi'(x_i^T \beta^*)(x_i^T \beta^* - x_i^T \beta) - \phi(x_i^T \beta^*) + \phi(x_i^T \beta)\right\} + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \right\}. \quad (5.4)$$
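The second equality in (5.4) comes from the KL divergence between two members of this exponential family; writing $\theta_i^* = x_i^T\beta^*$ and $\theta_i = x_i^T\beta$,
$$E_{\beta^*}\!\left[\log\frac{f(y_i \mid x_i,\beta^*)}{f(y_i \mid x_i,\beta)}\right] = E_{\beta^*}\!\left[y_i(\theta_i^* - \theta_i) - \phi(\theta_i^*) + \phi(\theta_i)\right] = \phi'(\theta_i^*)(\theta_i^* - \theta_i) - \phi(\theta_i^*) + \phi(\theta_i),$$
using $E_{\beta^*}(y_i \mid x_i) = \phi'(\theta_i^*)$.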
We need to replace the true but unknown $\beta^*$ by a root-$n$-consistent estimator of $\beta^*$. Hence, we replace $\beta^*$ by $\hat\beta$ (mle) in the KL divergence loss function. Thus, equation (5.4) becomes
$$\hat\beta^{KL}(\mathrm{glm}) = \arg\min \left\{ \sum_{i=1}^{n} \left\{\phi'(x_i^T \hat\beta)(x_i^T \hat\beta - x_i^T \beta) - \phi(x_i^T \hat\beta) + \phi(x_i^T \beta)\right\} + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \right\}. \quad (5.5)$$
For logistic regression, (5.5) becomes
$$\hat\beta^{KL}(\mathrm{logistic}) = \arg\min \left\{ \sum_{i=1}^{n} \left\{\phi'(x_i^T \hat\beta)(x_i^T \hat\beta - x_i^T \beta) - \log\big(1 + \exp(x_i^T \hat\beta)\big) + \log\big(1 + \exp(x_i^T \beta)\big)\right\} + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \right\}.$$
For Poisson log-linear regression models, (5.5) becomes
$$\hat\beta^{KL}(\mathrm{poisson}) = \arg\min \left\{ \sum_{i=1}^{n} \left\{\phi'(x_i^T \hat\beta)(x_i^T \hat\beta - x_i^T \beta) - \exp(x_i^T \hat\beta) + \exp(x_i^T \beta)\right\} + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \right\}.$$
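As one way to compute these estimates in practice, here is a sketch for the logistic case that hands the convex objective (with the terms not involving $\beta$ dropped) to a generic convex solver; the use of cvxpy and of scikit-learn's unpenalized LogisticRegression (penalty=None requires a recent scikit-learn) is our own choice, not the paper's:

import numpy as np
import cvxpy as cp
from sklearn.linear_model import LogisticRegression

def kl_adaptive_lasso_logistic(X, y, lam, gamma=1.0):
    """Sketch of the logistic-regression KL adaptive lasso as a convex program."""
    n, p = X.shape
    # Unpenalized MLE and adaptive weights.
    beta_mle = LogisticRegression(penalty=None, fit_intercept=False).fit(X, y).coef_.ravel()
    w = 1.0 / np.abs(beta_mle) ** gamma
    # phi'(x_i' beta_mle) = fitted success probabilities under the MLE.
    p_hat = 1.0 / (1.0 + np.exp(-X @ beta_mle))
    # Minimize sum_i { log(1 + exp(x_i'b)) - p_hat_i * x_i'b } + lam * sum_j w_j |b_j|.
    b = cp.Variable(p)
    obj = cp.sum(cp.logistic(X @ b) - cp.multiply(p_hat, X @ b)) \
          + lam * cp.sum(cp.multiply(w, cp.abs(b)))
    cp.Problem(cp.Minimize(obj)).solve()
    return b.value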
Let $KL_n(\beta) = \sum_{i=1}^{n} \left\{\phi'(x_i^T \hat\beta)(x_i^T \hat\beta - x_i^T \beta) - \phi(x_i^T \hat\beta) + \phi(x_i^T \beta)\right\}$. Then
$$\frac{\partial^2 KL_n(\beta)}{\partial\beta\,\partial\beta^T} = \sum_{i=1}^{n} \phi''(x_i^T \beta)\, x_i x_i^T,$$
which is positive definite, since the variance function $\phi''(\cdot) > 0$. Now, let $kl_n(\beta) = KL_n(\beta) + \lambda_n \sum_j \hat{w}_j |\beta_j|$. Since the penalty is convex, $kl_n(\beta)$ is convex in $\beta$, and hence the local minimizer $\hat\beta^{KL}(\mathrm{glm})$ is the unique global KL adaptive lasso estimator.
Assume that the true model has a sparse representation. Without loss of generality, let $A = \{j : \beta^*_j \neq 0\} = \{1,2,\ldots,p_0\}$ and $p_0 < p$. Let $I_{11}$ be the $p_0 \times p_0$ upper left-corner partitioned sub-matrix of the Fisher information matrix $I(\beta^*)$. Then $I_{11}$ is the Fisher information with the true sub-model known. We show that, under some mild regularity conditions, the KL adaptive lasso estimates $\hat\beta^{KL}(\mathrm{glm})$ enjoy the oracle properties if $\lambda_n$ is chosen appropriately.
Theorem 5.1. Let $A_n = \{j : \hat\beta^{KL}_j(\mathrm{glm}) \neq 0\}$. Suppose that $\lambda_n/\sqrt{n} \to 0$ and $\lambda_n n^{(\gamma-1)/2} \to \infty$. Then, the KL adaptive lasso estimates $\hat\beta^{KL}(\mathrm{glm})$ must satisfy the following:
1. Consistency in variable selection: $\lim P(A_n = A) = 1$.
2. Asymptotic normality: $\sqrt{n}(\hat\beta^{KL}_A(\mathrm{glm}) - \beta^*_A) \to_d N(0, I_{11}^{-1})$.
Proof. We assume the following regularity conditions:
1. The Fisher information matrix $I(\beta^*) = E[\phi''(x^T\beta^*)\,xx^T]$ is finite and positive definite.
2. There is a sufficiently large open set $O$ containing $\beta^*$ such that $\forall \beta \in O$,
$$|\phi'''(x^T\beta)| \leq M(x) < \infty$$
and
$$E\big[M(x)\,|x_j x_k x_l|\big] < \infty \quad \forall\, 1 \leq j,k,l \leq p.$$
We first prove the asymptotic normality part. Recall that
$$\hat\beta^{KL}(\mathrm{glm}) = \arg\min \left\{ \sum_{i=1}^{n} \left\{\phi'(x_i^T \hat\beta)(x_i^T \hat\beta - x_i^T \beta) - \phi(x_i^T \hat\beta) + \phi(x_i^T \beta)\right\} + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \right\}.$$
Let $\beta = \beta^* + \frac{u}{\sqrt{n}}$, $u \in \mathbb{R}^p$. Define:
$$\Gamma_n(u) = \sum_{i=1}^{n}\left\{x_i^T\Big(\hat\beta - \beta^* - \frac{u}{\sqrt{n}}\Big)\phi'(x_i^T\hat\beta) - \phi(x_i^T\hat\beta) + \phi\Big(x_i^T\Big(\beta^* + \frac{u}{\sqrt{n}}\Big)\Big)\right\} + \lambda_n \sum_{j} \hat{w}_j \left|\beta^*_j + \frac{u_j}{\sqrt{n}}\right|$$
and
$$\Gamma_n(0) = \sum_{i=1}^{n}\left\{x_i^T(\hat\beta - \beta^*)\phi'(x_i^T\hat\beta) - \phi(x_i^T\hat\beta) + \phi(x_i^T\beta^*)\right\} + \lambda_n \sum_{j} \hat{w}_j |\beta^*_j|.$$
Let $\hat{u}_n = \arg\min \Gamma_n(u) = \arg\min\{\Gamma_n(u) - \Gamma_n(0)\}$. Then $\hat{u}_n = \sqrt{n}(\hat\beta^{KL}(\mathrm{glm}) - \beta^*)$. Let
$$H^{(n)}(u) = \Gamma_n(u) - \Gamma_n(0) = \sum_{i=1}^{n}\left\{x_i^T\Big(-\frac{u}{\sqrt{n}}\Big)\phi'(x_i^T\hat\beta) + \phi\Big(x_i^T\beta^* + \frac{x_i^T u}{\sqrt{n}}\Big) - \phi(x_i^T\beta^*)\right\} + \frac{\lambda_n}{\sqrt{n}}\sum_{j}\hat{w}_j\sqrt{n}\Big(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\Big).$$
Notice that:
$$\phi\Big(x_i^T\beta^* + \frac{x_i^T u}{\sqrt{n}}\Big) = \phi(x_i^T\beta^*) + \phi'(x_i^T\beta^*)\frac{x_i^T u}{\sqrt{n}} + \frac{1}{2}\phi''(x_i^T\beta^*)\frac{u^T(x_i x_i^T)u}{n} + \frac{n^{-3/2}}{6}\phi'''(x_i^T\beta^{**})(x_i^T u)^3,$$
where $\beta^{**}$ is between $\beta^*$ and $\beta^* + \frac{u}{\sqrt{n}}$. Hence,
$$H^{(n)}(u) = \sum_{i=1}^{n}\left\{\big(\phi'(x_i^T\beta^*) - \phi'(x_i^T\hat\beta)\big)\frac{x_i^T u}{\sqrt{n}} + \frac{1}{2}\phi''(x_i^T\beta^*)\frac{u^T(x_i x_i^T)u}{n} + \frac{n^{-3/2}}{6}\phi'''(x_i^T\beta^{**})(x_i^T u)^3\right\}$$
$$\qquad + \frac{\lambda_n}{\sqrt{n}}\sum_{j}\hat{w}_j\sqrt{n}\Big(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\Big) = A_1^{(n)} + A_2^{(n)} + A_3^{(n)} + A_4^{(n)}, \ \text{(say)}.$$
Now,
$$A_1^{(n)} = -\sum_{i=1}^{n}\big[\phi'(x_i^T\hat\beta) - \phi'(x_i^T\beta^*)\big]\frac{x_i^T u}{\sqrt{n}} = -\sum_{i=1}^{n}\big[\phi''(x_i^T\beta^*)(x_i^T\hat\beta - x_i^T\beta^*)\big]\frac{x_i^T u}{\sqrt{n}} - \frac{1}{2}\sum_{i=1}^{n}\big[\phi'''(x_i^T\hat\beta^{**})(x_i^T\hat\beta - x_i^T\beta^*)^2\big]\frac{x_i^T u}{\sqrt{n}} = -A_{11}^{(n)} - A_{12}^{(n)}, \ \text{(say)},$$
where $\hat\beta^{**}$ is in between $\hat\beta$ and $\beta^*$. Notice that
$$A_{11}^{(n)} = \sum_{i=1}^{n}\big[\phi''(x_i^T\beta^*)(x_i^T\hat\beta - x_i^T\beta^*)\big]\frac{x_i^T u}{\sqrt{n}} = u^T\left[\frac{1}{n}\sum_{i=1}^{n}\phi''(x_i^T\beta^*)\,x_i x_i^T\right]W'_n,$$
where $W'_n = \sqrt{n}(\hat\beta - \beta^*) \to_d W'$, such that $W' \sim N(0, [I(\beta^*)]^{-1})$. Also, $\frac{1}{n}\sum_{i=1}^{n}\phi''(x_i^T\beta^*)\,x_i x_i^T \to_p I(\beta^*)$ (by the WLLN). Thus, by Slutsky's theorem,
$$A_{11}^{(n)} \to_d u^T I(\beta^*)W' = u^T W,$$
where $W \sim N(0, I(\beta^*))$. Consider:
$$A_{12}^{(n)} = \frac{1}{2}\sum_{i=1}^{n}\big[\phi'''(x_i^T\hat\beta^{**})(x_i^T\hat\beta - x_i^T\beta^*)^2\big]\frac{x_i^T u}{\sqrt{n}} = \frac{1}{2\sqrt{n}}\,W_n'^T\left[\frac{1}{n}\sum_{i=1}^{n}\phi'''(x_i^T\hat\beta^{**})\,x_i x_i^T\, u^T x_i\right]W'_n \leq \frac{1}{2\sqrt{n}}\left|W_n'^T\left[\frac{1}{n}\sum_{i=1}^{n}M(x_i)\,x_i x_i^T\,|x_i^T u|\right]W'_n\right|,$$
by regularity condition 2, since $\hat\beta^{**} \in O$. Also $W'_n = O_p(1)$ since $W'_n = \sqrt{n}(\hat\beta - \beta^*) \to_d W'$. By the WLLN,
$$\frac{1}{n}\sum_{i=1}^{n}M(x_i)\,x_i x_i^T\,|x_i^T u| \to_p E\big[M(x)\,xx^T\,|x^T u|\big].$$
Notice that the $(i,j)$th element of the $p \times p$ matrix $E[M(x)\,xx^T\,|x^T u|]$, $\forall\, 1 \leq i,j \leq p$, is given by
$$\big(E\big[M(x)\,xx^T\,|x^T u|\big]\big)_{i,j} = E\big[M(x)\,x_i x_j\,|x^T u|\big] \leq \sum_{k=1}^{p} E\big[M(x)\,|x_i|\,|x_j|\,|x_k|\,|u_k|\big] = \sum_{k=1}^{p} |u_k|\,E\big[M(x)\,|x_i x_j x_k|\big] < \infty,$$
again by regularity condition 2. Hence,
$$\frac{1}{n}\sum_{i=1}^{n}M(x_i)\,x_i x_i^T\,|x_i^T u| = O_p(1).$$
Thus, $A_{12}^{(n)} \to_p 0$, which implies that $A_1^{(n)} \to_d u^T W$.
Now for the second term $A_2^{(n)}$, we observe that $\frac{1}{n}\sum_{i=1}^{n}\phi''(x_i^T\beta^*)\,x_i x_i^T \to_p I(\beta^*)$. Thus, by Slutsky's theorem, $A_2^{(n)} \to_p \frac{1}{2}u^T I(\beta^*)u$.
Since $\beta^{**} \in O$ for $n$ large, by regularity condition 2 the third term $A_3^{(n)}$ can be bounded as:
$$6\sqrt{n}\,|A_3^{(n)}| \leq \frac{1}{n}\sum_{i=1}^{n}M(x_i)\,|x_i^T u|^3 \to_p E\big[M(x)\,|x^T u|^3\big] < \infty.$$
Hence, $A_3^{(n)} \to_p 0$.
The limiting behavior of the fourth term $A_4^{(n)}$ has already been discussed in the proof of Theorem 2.1. We summarize the results as follows:
$$\frac{\lambda_n}{\sqrt{n}}\hat{w}_j \sqrt{n}\Big(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\Big) \to_p \begin{cases} 0, & \text{if } \beta^*_j \neq 0, \\ 0, & \text{if } \beta^*_j = 0 \text{ and } u_j = 0, \\ \infty, & \text{if } \beta^*_j = 0 \text{ and } u_j \neq 0. \end{cases}$$
Thus, by Slutsky's theorem, we see that $H^{(n)}(u) \to_d H(u)$ for every $u$, where
$$H(u) = u_A^T I_{11} u_A - 2u_A^T W_A \ \text{ if } u_j = 0\ \forall j \notin A, \quad \text{and} \quad H(u) = \infty \ \text{otherwise},$$
where $W \sim N(0, I(\beta^*))$. $H^{(n)}$ is convex and the unique minimum of $H$ is $(I_{11}^{-1}W_A, 0)^T$. Then, we have:
$$\hat{u}^{(n)}_A \to_d I_{11}^{-1}W_A \quad \text{and} \quad \hat{u}^{(n)}_{A^c} \to_d 0.$$
Because $W_A \sim N(0, I_{11})$, the asymptotic normality part is proven. Now we show the consistency part. $\forall j \in A$, the asymptotic normality indicates that $P(j \in A_n) \to 1$. Then it suffices to show that $\forall j' \notin A$, $P(j' \in A_n) \to 0$. Consider the event $j' \in A_n$. By the KKT optimality conditions, we must have
$$\sum_{i=1}^{n} x_{ij'}\big(\phi'(x_i^T\hat\beta) - \phi'(x_i^T\hat\beta^{KL}(\mathrm{glm}))\big) = \lambda_n \hat{w}_{j'};$$
thus $P(j' \in A_n) \leq P\big(\sum_{i=1}^{n} x_{ij'}\big(\phi'(x_i^T\hat\beta) - \phi'(x_i^T\hat\beta^{KL}(\mathrm{glm}))\big) = \lambda_n \hat{w}_{j'}\big)$. Note that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_{ij'}\big(\phi'(x_i^T\hat\beta) - \phi'(x_i^T\hat\beta^{KL}(\mathrm{glm}))\big) = B_1^{(n)} + B_2^{(n)} + B_3^{(n)},$$
with
$$B_1^{(n)} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_{ij'}\big(\phi'(x_i^T\hat\beta) - \phi'(x_i^T\beta^*)\big),$$
$$B_2^{(n)} = \left(\frac{1}{n}\sum_{i=1}^{n} x_{ij'}\,\phi''(x_i^T\beta^*)\,x_i^T\right)\sqrt{n}\big(\beta^* - \hat\beta^{KL}(\mathrm{glm})\big)$$
and
$$B_3^{(n)} = \frac{1}{2n}\sum_{i=1}^{n} x_{ij'}\,\phi'''(x_i^T\hat\beta^{***})\Big(x_i^T\sqrt{n}\big(\beta^* - \hat\beta^{KL}(\mathrm{glm})\big)\Big)^2\Big/\sqrt{n},$$
where $\hat\beta^{***}$ is in between $\hat\beta^{KL}(\mathrm{glm})$ and $\beta^*$.
Now,
$$B_1^{(n)} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_{ij'}\big(\phi'(x_i^T\hat\beta) - \phi'(x_i^T\beta^*)\big) = \sum_{i=1}^{n}\big[\phi''(x_i^T\beta^*)(x_i^T\hat\beta - x_i^T\beta^*)\big]\frac{x_{ij'}}{\sqrt{n}} + \frac{1}{2}\sum_{i=1}^{n}\big[\phi'''(x_i^T\hat\beta^{***})(x_i^T\hat\beta - x_i^T\beta^*)^2\big]\frac{x_{ij'}}{\sqrt{n}} = B_{11}^{(n)} + B_{12}^{(n)}, \ \text{(say)}.$$
Notice that
$$B_{11}^{(n)} = \sum_{i=1}^{n}\big[\phi''(x_i^T\beta^*)(x_i^T\hat\beta - x_i^T\beta^*)\big]\frac{x_{ij'}}{\sqrt{n}} = \left[\frac{1}{n}\sum_{i=1}^{n} x_{ij'}\,\phi''(x_i^T\beta^*)\,x_i^T\right]W'_n.$$
Now, $\frac{1}{n}\sum_{i=1}^{n} x_{ij'}\,\phi''(x_i^T\beta^*)\,x_i^T \to_p I_{j'}$, where $I_{j'}$ is the $j'$th row of $I(\beta^*)$, and $W'_n \to_d W'$. Thus, by Slutsky's theorem, we have:
$$B_{11}^{(n)} \to_d I_{j'}W' \sim N\big(0,\, I_{j'}[I(\beta^*)]^{-1}I_{j'}^T\big).$$
Also,
$$B_{12}^{(n)} = \frac{1}{2}\sum_{i=1}^{n}\big[\phi'''(x_i^T\hat\beta^{***})(x_i^T\hat\beta - x_i^T\beta^*)^2\big]\frac{x_{ij'}}{\sqrt{n}} = \frac{1}{2\sqrt{n}}\,W_n'^T\left[\frac{1}{n}\sum_{i=1}^{n}\phi'''(x_i^T\hat\beta^{***})\,x_i x_i^T\, x_{ij'}\right]W'_n \leq \frac{1}{2\sqrt{n}}\left|W_n'^T\left[\frac{1}{n}\sum_{i=1}^{n}M(x_i)\,x_i x_i^T\,|x_{ij'}|\right]W'_n\right|,$$
by regularity condition 2, since $\hat\beta^{***} \in O$. We know that $W'_n = O_p(1)$ and, by the WLLN,
$$\frac{1}{n}\sum_{i=1}^{n}M(x_i)\,x_i x_i^T\,|x_{ij'}| \to_p E\big[M(x)\,xx^T\,|x_{j'}|\big],$$
which implies $\frac{1}{n}\sum_{i=1}^{n}M(x_i)\,x_i x_i^T\,|x_{ij'}| = O_p(1)$. Hence, $B_{12}^{(n)} \to_p 0$. An argument parallel to the one for $B_{12}^{(n)}$, using $\sqrt{n}(\beta^* - \hat\beta^{KL}(\mathrm{glm})) = O_p(1)$ together with regularity condition 2, shows that $B_3^{(n)} \to_p 0$ as well.
Since $\frac{1}{n}\sum_{i=1}^{n} x_{ij'}\,\phi''(x_i^T\beta^*)\,x_i^T \to_p I_{j'}$, the asymptotic normality part implies that $B_2^{(n)} \to_d$ some normal random variable. Meanwhile, we have
$$\frac{\lambda_n \hat{w}_{j'}}{\sqrt{n}} = \frac{\lambda_n}{\sqrt{n}}\, n^{\gamma/2}\, \frac{1}{|\sqrt{n}\,\hat\beta_{j'}|^{\gamma}} \to_p \infty,$$
since $\lambda_n n^{(\gamma-1)/2} \to \infty$ by assumption and, as $j' \notin A$ implies $\beta^*_{j'} = 0$, $\sqrt{n}(\hat\beta_{j'} - 0) \to_d$ some normal random variable, so that $|\sqrt{n}\,\hat\beta_{j'}|^{\gamma} = O_p(1)$.
Hence, $P(j' \in A_n) \to 0$. This completes the proof. ◻
6 Conclusion
In this article we have proposed the KL adaptive lasso for simultaneous
estimation and variable selection. We have shown that the KL adaptive
lasso also enjoys the oracle properties by utilizing the adaptively weighted l1
penalty. Owing to the efficient path algorithm, the KL adaptive lasso also
enjoys the computational advantage of the lasso. Our numerical example
has shown that the KL adaptive lasso performs similarly to the lasso and the adaptive lasso. In the future, we want to compare the prediction accuracy of the KL adaptive lasso with other existing sparse modeling techniques such as the lasso, the adaptive lasso, and the nonnegative garotte. For comparison, we are going to report the prediction error $E[(\hat{y} - y_{test})^2]$. We would also like to extend our KL divergence based variable selection methodology to the high-dimensional regime (i.e., when the number of regressors $p$ grows to infinity at a certain rate relative to the growth of the sample size $n$) as well as to a survival analysis context, where we have to modify the data to
account for censoring, and investigate the oracle properties specified above
in this scenario.
Acknowledgments
The author would like to thank Prof. Malay Ghosh and Prof. Kshitij
Khare for their help with the paper.
References
[1] Breiman, L. (1995a). Better subset regression using the
nonnegative garrote, Technometrics, 37, 373–384.
[2] Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian
Theory, John Wiley, New York.
[3] Draper, N. R. and Smith, H. (1998). Applied Regression
Analysis, third edition, John Wiley, New York.
[4] Efron, B., Hastie, T., Johnstone, I., and Tibshirani,
R. (2004). Least angle regression, The Annals of Statistics,
32, 407–499.
[5] Fan, J. and Li, R. (2001). Variable selection via nonconcave
penalized likelihood and its oracle properties, Journal of the
American Statistical Association, 96, 1348–1360.
[6] Fan, J. and Li, R. (2006). Statistical challenges
with high dimensionality: Feature selection in knowledge
discovery, Proceedings of the Madrid International Congress
of Mathematicians.
[7] Fan, J. and Peng, H. (2004). On nonconcave penalized
likelihood with diverging number of parameters, The Annals
of Statistics, 32, 928–961.
[8] Geyer, C. (1994). On the asymptotics of constrained M-
estimation, The Annals of Statistics, 22, 1993–2010.
[9] Hastie, T., Tibshirani, R. and Friedman, J. H.
(2009). The Elements of Statistical Learning, second edition,
Springer-Verlag, New York.
[10] Hoerl, A. E. and Kennard, R. W. (1970). Ridge
regression: Biased estimation for nonorthogonal problems,
Technometrics, 12(1), 55–67.
[11] Knight, K. and Fu, W. (2000). Asymptotics for lasso-type
estimators, The Annals of Statistics, 28, 1356–1378.
[12] McCullagh, P. and Nelder, J. (1989). Generalized Linear
Models, Second edition, Chapman & Hall, New York.
[13] Tibshirani, R. (1996). Regression shrinkage and selection
via the lasso, Journal of the Royal Statistical Society, Ser. B,
58, 267–288.
[14] Zou, H. (2006). The adaptive lasso and its oracle
properties, Journal of the American Statistical Association,
101(476), 1418–1429.
Shibasish Dasgupta
Department of Mathematics and Statistics
University of South Alabama
411 University Boulevard North
Mobile, AL 36688-0002, U.S.A.
E-mail: sdasgupta@southalabama.edu
Contenu connexe

Tendances

MetiTarski: An Automatic Prover for Real-Valued Special Functions
MetiTarski: An Automatic Prover for Real-Valued Special FunctionsMetiTarski: An Automatic Prover for Real-Valued Special Functions
MetiTarski: An Automatic Prover for Real-Valued Special Functions
Lawrence Paulson
 
tensor-decomposition
tensor-decompositiontensor-decomposition
tensor-decomposition
Kenta Oono
 
Ck31369376
Ck31369376Ck31369376
Ck31369376
IJMER
 

Tendances (17)

Approximate Solution of a Linear Descriptor Dynamic Control System via a non-...
Approximate Solution of a Linear Descriptor Dynamic Control System via a non-...Approximate Solution of a Linear Descriptor Dynamic Control System via a non-...
Approximate Solution of a Linear Descriptor Dynamic Control System via a non-...
 
MetiTarski: An Automatic Prover for Real-Valued Special Functions
MetiTarski: An Automatic Prover for Real-Valued Special FunctionsMetiTarski: An Automatic Prover for Real-Valued Special Functions
MetiTarski: An Automatic Prover for Real-Valued Special Functions
 
Discrete time prey predator model with generalized holling type interaction
Discrete time prey predator model with generalized holling type interactionDiscrete time prey predator model with generalized holling type interaction
Discrete time prey predator model with generalized holling type interaction
 
tensor-decomposition
tensor-decompositiontensor-decomposition
tensor-decomposition
 
Basic calculus (ii) recap
Basic calculus (ii) recapBasic calculus (ii) recap
Basic calculus (ii) recap
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Regression
RegressionRegression
Regression
 
Ck31369376
Ck31369376Ck31369376
Ck31369376
 
5 regression
5 regression5 regression
5 regression
 
Power method
Power methodPower method
Power method
 
Dynstoch (presented)
Dynstoch (presented)Dynstoch (presented)
Dynstoch (presented)
 
Sampling based approximation of confidence intervals for functions of genetic...
Sampling based approximation of confidence intervals for functions of genetic...Sampling based approximation of confidence intervals for functions of genetic...
Sampling based approximation of confidence intervals for functions of genetic...
 
Numerical Methods - Power Method for Eigen values
Numerical Methods - Power Method for Eigen valuesNumerical Methods - Power Method for Eigen values
Numerical Methods - Power Method for Eigen values
 
Pmath 351 note
Pmath 351 notePmath 351 note
Pmath 351 note
 
Ullmayer_Rodriguez_Presentation
Ullmayer_Rodriguez_PresentationUllmayer_Rodriguez_Presentation
Ullmayer_Rodriguez_Presentation
 
Regression analysis.
Regression analysis.Regression analysis.
Regression analysis.
 
Chap11 simple regression
Chap11 simple regressionChap11 simple regression
Chap11 simple regression
 

En vedette

Querubín garcía
Querubín garcíaQuerubín garcía
Querubín garcía
KEFON
 
Hazme mas pequeña mi cruz
Hazme mas pequeña mi cruzHazme mas pequeña mi cruz
Hazme mas pequeña mi cruz
alexitojs
 
Formato planificacion novoembre
Formato planificacion novoembreFormato planificacion novoembre
Formato planificacion novoembre
Carolina Martini
 
Rc marly angulo
Rc marly anguloRc marly angulo
Rc marly angulo
mangulom
 
Introducción de las tic
Introducción de las ticIntroducción de las tic
Introducción de las tic
jimevisconti
 

En vedette (20)

Equipo 1 glosario
Equipo 1 glosarioEquipo 1 glosario
Equipo 1 glosario
 
Инвест. презентация Иркутские Берега
Инвест. презентация Иркутские БерегаИнвест. презентация Иркутские Берега
Инвест. презентация Иркутские Берега
 
Digipack drafts
Digipack draftsDigipack drafts
Digipack drafts
 
Presentación1
Presentación1Presentación1
Presentación1
 
Presentacion informatica
Presentacion informaticaPresentacion informatica
Presentacion informatica
 
Elementos que componen una Pequeña y Mediana Empresa
Elementos que componen una Pequeña y Mediana EmpresaElementos que componen una Pequeña y Mediana Empresa
Elementos que componen una Pequeña y Mediana Empresa
 
ADA 1 bloque 3
ADA 1 bloque 3ADA 1 bloque 3
ADA 1 bloque 3
 
Querubín garcía
Querubín garcíaQuerubín garcía
Querubín garcía
 
Hazme mas pequeña mi cruz
Hazme mas pequeña mi cruzHazme mas pequeña mi cruz
Hazme mas pequeña mi cruz
 
Php
PhpPhp
Php
 
Formato planificacion novoembre
Formato planificacion novoembreFormato planificacion novoembre
Formato planificacion novoembre
 
Распоряжение об изменении состава Губернаторского ИТ-Совета
Распоряжение об изменении состава Губернаторского ИТ-СоветаРаспоряжение об изменении состава Губернаторского ИТ-Совета
Распоряжение об изменении состава Губернаторского ИТ-Совета
 
Corporate Drama
Corporate DramaCorporate Drama
Corporate Drama
 
Job search for the currently employed
Job search for the currently employedJob search for the currently employed
Job search for the currently employed
 
Caligrama Oda a la Tipografía
Caligrama Oda a la TipografíaCaligrama Oda a la Tipografía
Caligrama Oda a la Tipografía
 
Rc marly angulo
Rc marly anguloRc marly angulo
Rc marly angulo
 
Hsa
HsaHsa
Hsa
 
Μεσοποταμια
ΜεσοποταμιαΜεσοποταμια
Μεσοποταμια
 
Borja
BorjaBorja
Borja
 
Introducción de las tic
Introducción de las ticIntroducción de las tic
Introducción de las tic
 

Similaire à JISA_Paper

Application of Graphic LASSO in Portfolio Optimization_Yixuan Chen & Mengxi J...
Application of Graphic LASSO in Portfolio Optimization_Yixuan Chen & Mengxi J...Application of Graphic LASSO in Portfolio Optimization_Yixuan Chen & Mengxi J...
Application of Graphic LASSO in Portfolio Optimization_Yixuan Chen & Mengxi J...
Mengxi Jiang
 
Factor analysis
Factor analysis Factor analysis
Factor analysis
Mintu246
 
A comparative analysis of predictve data mining techniques3
A comparative analysis of predictve data mining techniques3A comparative analysis of predictve data mining techniques3
A comparative analysis of predictve data mining techniques3
Mintu246
 
Extreme bound analysis based on correlation coefficient for optimal regressio...
Extreme bound analysis based on correlation coefficient for optimal regressio...Extreme bound analysis based on correlation coefficient for optimal regressio...
Extreme bound analysis based on correlation coefficient for optimal regressio...
Loc Nguyen
 
Bag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse CodinBag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse Codin
Karlos Svoboda
 

Similaire à JISA_Paper (20)

Regularization and variable selection via elastic net
Regularization and variable selection via elastic netRegularization and variable selection via elastic net
Regularization and variable selection via elastic net
 
Application of Graphic LASSO in Portfolio Optimization_Yixuan Chen & Mengxi J...
Application of Graphic LASSO in Portfolio Optimization_Yixuan Chen & Mengxi J...Application of Graphic LASSO in Portfolio Optimization_Yixuan Chen & Mengxi J...
Application of Graphic LASSO in Portfolio Optimization_Yixuan Chen & Mengxi J...
 
nber_slides.pdf
nber_slides.pdfnber_slides.pdf
nber_slides.pdf
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
 
Heteroscedasticity Remedial Measures.pptx
Heteroscedasticity Remedial Measures.pptxHeteroscedasticity Remedial Measures.pptx
Heteroscedasticity Remedial Measures.pptx
 
Factor analysis
Factor analysis Factor analysis
Factor analysis
 
A comparative analysis of predictve data mining techniques3
A comparative analysis of predictve data mining techniques3A comparative analysis of predictve data mining techniques3
A comparative analysis of predictve data mining techniques3
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
 
Heteroscedasticity Remedial Measures.pptx
Heteroscedasticity Remedial Measures.pptxHeteroscedasticity Remedial Measures.pptx
Heteroscedasticity Remedial Measures.pptx
 
Lecture 1 maximum likelihood
Lecture 1 maximum likelihoodLecture 1 maximum likelihood
Lecture 1 maximum likelihood
 
2. diagnostics, collinearity, transformation, and missing data
2. diagnostics, collinearity, transformation, and missing data 2. diagnostics, collinearity, transformation, and missing data
2. diagnostics, collinearity, transformation, and missing data
 
Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help Get Multiple Regression Assignment Help
Get Multiple Regression Assignment Help
 
Extreme bound analysis based on correlation coefficient for optimal regressio...
Extreme bound analysis based on correlation coefficient for optimal regressio...Extreme bound analysis based on correlation coefficient for optimal regressio...
Extreme bound analysis based on correlation coefficient for optimal regressio...
 
Bag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse CodinBag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse Codin
 
Regression ppt.pptx
Regression ppt.pptxRegression ppt.pptx
Regression ppt.pptx
 
REGRESSION ANALYSIS THEORY EXPLAINED HERE
REGRESSION ANALYSIS THEORY EXPLAINED HEREREGRESSION ANALYSIS THEORY EXPLAINED HERE
REGRESSION ANALYSIS THEORY EXPLAINED HERE
 
Bayesian Variable Selection in Linear Regression and A Comparison
Bayesian Variable Selection in Linear Regression and A ComparisonBayesian Variable Selection in Linear Regression and A Comparison
Bayesian Variable Selection in Linear Regression and A Comparison
 
Sparsenet
SparsenetSparsenet
Sparsenet
 
Data classification sammer
Data classification sammer Data classification sammer
Data classification sammer
 
Stochastic Approximation and Simulated Annealing
Stochastic Approximation and Simulated AnnealingStochastic Approximation and Simulated Annealing
Stochastic Approximation and Simulated Annealing
 

JISA_Paper

  • 1. Journal of the Indian Statistical Association Vol.53 No. 1 & 2, 2015, 153-174 Variable selection using Kullback-Leibler divergence loss Shibasish Dasgupta University of South Alabama, Mobile, U.S.A. Abstract The adaptive lasso is a recent technique for simultaneous estimation and variable selection where adaptive weights are used for penalizing different coefficients in the l1 penalty. In this paper, we propose an alternative approach to the adaptive lasso through the Kullback-Leibler (KL) divergence loss, called the KL adaptive lasso, where we replace the squared error loss in the adaptive lasso set up by the KL divergence loss which is also known as the entropy distance. There are various theoretical reasons to defend the use of Kullback-Leibler distance, ranging from information theory to the relevance of logarithmic scoring rule and the location-scale invariance of the distance. We show that the KL adaptive lasso enjoys the oracle properties; namely, it performs as well as if the true underlying model were given in advance. Furthermore, the KL adaptive lasso can be solved by the same efficient algorithm for solving the lasso. We also discuss the extension of the KL adaptive lasso in generalized linear models (GLMs) and show that the oracle properties still hold under mild regularity conditions. Key Words : Asymptotic normality, Adaptive lasso, Divergence loss, Generalized linear models, Kullback-Leibler, Oracle property, Sparsity, Variable selection Received: April, 2014
  • 2. 154 Journal, Indian Statistical Association 1 Introduction There are two fundamental goals in statistical learning: ensuring high prediction accuracy and discovering relevant predictive variables. Variable selection is particularly important when the true underlying model has a sparse representation. Identifying significant predictors will enhance the prediction performance of the fitted model. Variable selection is also fundamental to high-dimensional statistical modeling. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. Regularization methods are characterized by loss functions measuring data fits and penalty terms constraining model parameters. The ‘lasso’ is a popular regularization technique for simultaneous estimation and variable selection (Tibshirani, 1996). Fan and Li (2006) gave a comprehensive overview of variable/feature selection and proposed a unified framework to approach the problem of variable selection. 1.1 Variable selection and the lasso Let us consider the usual linear model set up. Suppose we observe an independent and identically distributed (iid) sample (xi,yi), i = 1,...,n, where xi=(xi1,...,xip) is the vector of p covariates. The linear model is given by: Y = Xβ + , where ∼ N(0, σ2 I). Our main interest is to estimate the regression coefficient β = (β1,...,βp). We also know that for p < n, the least squares estimate (LSE) of β is unique and is given by: ˆβLS = (XT X)−1 XT Y . Moreover, ˆβLS is the Best Linear Unbiased Estimator (BLUE) in this scenario. But under high-dimensional setting the dimension of XT X is usually very big, which in turns leads to an unstable estimator of β. So, if p > n, then the LSE is not unique and it will usually overfit the data, i.e; all observations are predicted perfectly, but there are many solutions to the coefficients of the fit and new observations are not
  • 3. Variable selection using KL divergence loss 155 uniquely predictable. The classical solution to this problem was to try to reduce the number of variables by processes such as forward and backward regression with reduction in variables determined by hypothesis tests, see Draper and Smith (1998), for example. Suppose, the true (unknown) regression coefficients are: β∗ = (β∗ 1 ,...,β∗ p )T . We denote the true non-null set as A = {j β∗ j ≠ 0} and the true dimension of this set is given by d = A . So, we have the relationship: d < n < p. Now, the parameter estimate is ˆβ and the estimated non-null set is denoted by An = {j ˆβj ≠ 0}. For variable selection, our goal will be 3-fold: • Variable Selection Consistency: Recover the true non-zero set A. We would like An = A. • Estimation Consistency: ˆβ are close to β∗ . • Prediction Accuracy: X ˆβ are close to Xβ∗ . The classical methods for variable selection are the following: • Best subset selection: consider all 2p sub-models and choose the best one • Forward Selection: starting from the null model, retrieve the most significant variable sequentially • Backward Selection: starting from the full model, delete the most insignificant variable sequentially • Stepwise Regression: at every step, do both the retrieving and deleting But there are some problems in the above methods, namely, estimation accuracy, computational expediency and algorithmic stability. An alternative strategy that emerged was penalizing the squared error loss, i.e;
  • 4. 156 Journal, Indian Statistical Association adding to the residual sum of squares Y − Xβ 2 a penalty, pen(β;λ). So, the penalized likelihood function is given by: L(β;λ) = Y − Xβ 2 + pen(β;λ). When pen(β;λ) = λ β 2 , this is called ridge regression (Hoerl and Kennard, 1970) and we have the (unique) minimizer as: ˆβRidge = (XT X + λI)−1 XT Y . The motivation of the lasso is Breiman’s non-negative garotte (Breiman, 1995a): ˆβ = arg min ⎧⎪⎪ ⎨ ⎪⎪⎩ n ∑ i=1 (yi − ∑ j cj ˆβ0 j xij)2 ⎫⎪⎪ ⎬ ⎪⎪⎭ s.t.cj ≥ 0,∑ j cj ≤ t, where ˆβ0 j is the full LSE of βj, j = 1...p and t ≥ 0 is a tuning parameter. There are some advantages of using non-negative garotte as it gives lower prediction error than subset selection and it is competitive with ridge regression except when the true model has many small non-zero coefficients. But the drawback of this method is that it depends on both the sign and the magnitude of LSE, thus it suffers when LSE behaves poorly. The lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. The Lasso has several advantages. It simultaneously performs model selection and model fit. In addition, although it is a non- linear method, it is the global minimum of a convex penalized loss function and can be computed efficiently. For Lasso, we standardize xij as ∑i xij/n = 0, ∑i x2 ij/n = 1. Now, denote
  • 5. Variable selection using KL divergence loss 157 ˆβ = ( ˆβ1,..., ˆβp)T , the lasso estimate ˆβ is given by: ˆβ = arg min ⎧⎪⎪ ⎨ ⎪⎪⎩ n ∑ i=1 (yi − ∑ j βjxij)2 ⎫⎪⎪ ⎬ ⎪⎪⎭ s.t.∑ j βj ≤ t. Here, the tuning parameter t controls the amount of shrinkage, i.e; when t < t0 = ∑ ˆβ0 j (where ˆβ0 j is the full LSE), then that will cause shrinkage of the solutions towards 0, and some coefficients may be exactly equal to 0. The above optimization problem can be seen as a quadratic programming problem with linear inequality constraints as follows: ˆβlasso = arg min ⎧⎪⎪ ⎨ ⎪⎪⎩ n ∑ i=1 (yi − ∑ j βjxij)2 + λ∑ j βj ⎫⎪⎪ ⎬ ⎪⎪⎭ . 1.2 Oracle property and the adaptive lasso Let us consider model estimation and variable selection in linear regression models. Suppose that y = (y1,...,yn)T is the response vector and xj = (x1j,...,xnj)T , j = 1,...,p, are the linearly independent predictors. Let X = [x1,...,xp] be the predictor matrix. We assume that E[y x] = β∗ 1 x1 + ... + β∗ p xp. Without loss of generality, we assume that the data are centered, so the intercept is not included in the regression function. Let A = {j β∗ j ≠ 0} and further assume that A = p0 < p. Thus the true model depends only on a subset of the predictors. Denote by ˆβ(δ) the coefficient estimator produced by a fitting procedure δ. Using the language of Fan and Li (2001), we call δ an oracle procedure if ˆβ(δ) (asymptotically) has the following oracle properties: • Identifies the right subset model: An = {j ˆβj ≠ 0} = A. • Has the optimal estimation rate: √ n( ˆβ(δ)A −β∗ A) →d N(0,Σ∗ ), where Σ∗ is the covariance matrix based on the true subset model.
  • 6. 158 Journal, Indian Statistical Association It has been argued (Fan and Li 2001 and Fan and Peng 2004) that a good procedure should have these oracle properties. Knight and Fu (2000) studied asymptotic behavior of Lasso type estimators. Under some appropriate conditions, they showed that the limiting distributions have positive probability mass at 0 when the true value of the parameter is 0, and they established asymptotic normality for large parameters in some sense. Fan and Li (2001) conjectured that the oracle properties do not hold for the lasso. They also proposed a smoothly clipped absolute deviation (SCAD) penalty for variable selection and proved its oracle properties. Zou (2006) proposed a new version of the lasso, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the l1 penalty and showed that the adaptive lasso enjoys the oracle properties; namely, it performs as well as if the true underlying model were given in advance. The adaptive lasso is defined as follows: Suppose that ˆβ is a root-n-consistent estimator to β∗ ; for example, we can use ˆβ(ols). Pick a γ > 0, and define the weight vector ˆw = 1/ ˆβ γ . The adaptive lasso estimators ˆβadalasso are given by: ˆβadalasso = arg min ⎧⎪⎪ ⎨ ⎪⎪⎩ n ∑ i=1 (yi − ∑ j βjxij)2 + λn ∑ j ˆwj βj ⎫⎪⎪ ⎬ ⎪⎪⎭ , where λn varies with n. The rest of the article is organized as follows. In Section 2 we propose an alternative approach to the adaptive lasso through the KL divergence loss and we show that our proposed methodology enjoys the oracle properties for variable selection. In Section 3 we use the LARS algorithm (Efron et al. 2004) to solve the entire solution path of this newly developed methodology. In Section 4 we apply our variable selection method to the diabetes data (Efron et al. 2004). We extend the variable selection theory and methodology to the generalized linear models (GLMs) in Section 5, and give concluding remarks in Section 6.
  • 7. Variable selection using KL divergence loss 159 2 An alternative approach to the adaptive lasso through the KL divergence loss In this paper, we replace the squared error loss in the adaptive lasso set up by the KL divergence loss, which is also known as the entropy distance. There are various theoretical reasons to defend the use of Kullback-Leibler distance, ranging from information theory to the relevance of logarithmic scoring rule and the location-scale invariance of the distance, as detailed in Bernardo and Smith (1994). We also show that the oracle properties discussed above hold in this case under mild regularity conditions. We adopt the setup of Knight and Fu (2000) for the asymptotic analysis. We assume two conditions: (a) yi = xT i β∗ + i, where 1,..., n are iid random variables with mean 0 and variance σ2 . Here, for convenience, we assume that σ2 is known and is equal to 1. (b) 1 n XT X → C, where C is a positive definite matrix. Without loss of generality, assume that A = {1,2,...,p0}. Let C11 is the p0 × p0 upper left-corner partitioned sub-matrix of C. Now, suppose that f(yi xi,β∗ ) and f(yi xi,β) are the normal densities evaluated at β∗ and β respectively. Then the “Adaptive Penalized KL Divergence” estimators (which we are going call as the “KL adaptive lasso” estimators now onwards) ˆβKL are given by: ˆβKL = arg min ⎧⎪⎪ ⎨ ⎪⎪⎩ n ∑ i=1 Eβ∗ [log f(yi xi,β∗ ) f(yi xi,β) ] + λn ∑ j ˆwj βj ⎫⎪⎪ ⎬ ⎪⎪⎭ . (2.1) Now, since the vector of true regression coefficients β∗ is unknown, so we replace it by ˆβ, which is the ordinary least squares (ols) estimator of β∗ . Hence, after a bit algebraic calculation we get, ˆβKL = arg min ⎧⎪⎪ ⎨ ⎪⎪⎩ [X ˆβ − Xβ] T [X ˆβ − Xβ] + λn ∑ j ˆwj βj ⎫⎪⎪ ⎬ ⎪⎪⎭ , (2.2)
  • 8. 160 Journal, Indian Statistical Association where ˆwj = 1/ ˆβj γ . This is a convex optimization problem in β as we have used convex penalty here and hence the local minimizer ˆβKL is the unique global KL adaptive lasso estimator (for non-convex penalties, however, the local minimizer may not be globally unique). Let An = {j ˆβKL j ≠ 0}. We have shown that with a proper choice of λn, the KL adaptive lasso enjoys the oracle properties. Theorem 2.1. Suppose that λn/ √ n → 0 and λnn(γ−1)/2 → ∞. Then, the KL adaptive lasso estimates must satisfy the following: 1. Consistency in variable selection: limP (An = A) = 1. 2. Asymptotic normality: √ n( ˆβKL A − β∗ A) →d N (0,C−1 11 ). Proof. We first prove the asymptotic normality part. Let β = β∗ + u√ n , and ψn(u) = ⎡ ⎢ ⎢ ⎢ ⎣ X ˆβ − X (β∗ + u √ n ) ⎤ ⎥ ⎥ ⎥ ⎦ T ⎡ ⎢ ⎢ ⎢ ⎣ X ˆβ − X (β∗ + u √ n ) ⎤ ⎥ ⎥ ⎥ ⎦ +λn p ∑ j=1 ˆwj β∗ j + uj √ n . Let ˆu(n) = arg minψn(u); then ˆβKL = β∗ + ˆu(n) √ n , or, ˆu(n) = √ n( ˆβKL − β∗ ). Define: V (n) (u) = ψn(u) − ψn(0) = uT ( 1 n XT X)u − 2 uT XT X √ n ( ˆβ − β∗ ) + λn √ n p ∑ j=1 ˆwj √ n ⎛ ⎝ β∗ j + uj √ n − β∗ j ⎞ ⎠ . (2.3) Since ˆβ is the ols estimate of β∗ , hence √ n( ˆβ − β∗ ) →d N(0,σ2 C−1 ) then by assumption (b) we get: ( 1 nXT X) √ n( ˆβ − β∗ ) →d W , where W ∼ N(0,σ2 C). Now consider the limiting behavior of the third term in (2.3). If β∗ j ≠ 0, then ˆwj →p β∗ j −γ (using ˆβj →p β∗ j and the continuous mapping theorem) and √ n( β∗ j + uj √ n − β∗ j ) → ujsgn(β∗ j ). Then, we have λn√ n ˆwj √ n( β∗ j + uj √ n − β∗ j ) →p 0, since by one of the assumptions of this theorem λn/ √ n → 0.
  • 9. Variable selection using KL divergence loss 161 If β∗ j = 0, then √ n( β∗ j + uj √ n − β∗ j ) = uj and λn√ n ˆwj = λn√ n nγ/2 ( √ nˆβj ) −γ = λnn(γ−1)/2 ( √ nˆβj ) −γ , where √ nˆβj = Op(1). From the above and the assumption of the theorem that λnn(γ−1)/2 → ∞, we have λn√ n ˆwj √ n( β∗ j + uj √ n − β∗ j ) →p 0, if uj = 0 and λn√ n ˆwj √ n( β∗ j + uj √ n − β∗ j ) →p ∞, if uj ≠ 0. Hence, we summarize the results as follows: λn√ n ˆwj √ n( β∗ j + uj √ n − β∗ j ) →p 0, if β∗ j ≠ 0, λn√ n ˆwj √ n( β∗ j + uj √ n − β∗ j ) →p 0, if β∗ j = 0 & uj = 0 and λn√ n ˆwj √ n( β∗ j + uj √ n − β∗ j ) →p ∞, if β∗ j = 0 & uj ≠ 0. Thus, by Slutsky’s theorem, we see that V (n) (u) →d V (u) for every u, where, V (u) = uT AC11uA − 2uT AWA; if uj = 0∀j ∉ A and V (u) = ∞, otherwise. Now note that V (n) is convex and the unique minimum of V is (C−1 11 WA,0)T . Following the epi-convergence results of Geyer (1994) and Knight and Fu (2000), we have: ˆu (n) A →d C−1 11 WA and ˆu (n) Ac →d 0. Finally, we observe that WA = N(0,σ2 C11); then we prove the asymptotic normality part. Now, we show the consistency part. ∀j ∈ A, the asymptotic normality result indicates that ˆβ (n) j →p β∗ j ; thus P(j ∈ A∗ n) → 1. Then it suffices to show that ∀j′ ∉ A, P(j′ ∈ A∗ n) → 0. Consider the event j′ ∈ A∗ n. Then, by the Karush-Kuhn-Tucker (KKT) optimality conditions (Hastie, Tibshirani, and Friedman 2009, p. 421), we know that 2xT j′ (X ˆβ − X ˆβKL ) = λn ˆwj′ . Note that λn ˆwj′ √ n = λn√ n nγ/2 1 √ nˆβj′ γ →p ∞ ( λnn(γ−1)/2 → ∞ and √ nˆβj′ = Op(1), since j′ ∉ A, i.e; β∗ j′ = 0), whereas 2xT j′ (X ˆβ−X ˆβKL) √ n = 2 xT j′ X √ n( ˆβ− ˆβKL) n = 2 xT j′ X n { √ n( ˆβ − β∗ ) − √ n( ˆβKL − β∗ )}. Now observe that √ n( ˆβ − ˆβKL ) = √ n( ˆβ − β∗ ) − √ n( ˆβKL − β∗ ) = Op(1) since √ n( ˆβ − β∗ ) →d some normal r.v as well as √ n( ˆβKL − β∗ ) →d some normal r.v. i.e;
  • 10. 162 Journal, Indian Statistical Association 2xT j′ X √ n( ˆβ − ˆβKL ) = Op(1). Thus, 2 xT j′ X √ n( ˆβ− ˆβKL) n →p 0. Hence, P [j′ ∈ A∗ n] ≤ P [2xT j′ (X ˆβ − X ˆβKL ) = λn ˆwj′ ] → 0. This completes the proof. ◻ 3 Computations In this section we discuss the computational issues. The KL adaptive lasso estimates in (2.2) can be solved by the LARS algorithm (Efron et al. 2004). The computational details are given in the following Algorithm, the proof of which is very simple and so is omitted. Algorithm (The LARS algorithm for the KL adaptive lasso). 1. Find ˆβ, the ols estimate of β∗ , by the least squares estimation and hence find ˆw for some γ > 0. 2. Define x∗ j = xj ˆwj , j = 1,...,p. 3. Evaluate ˆy = ∑ p j=1 x∗ j ˆβj. 4. Solve the following optimization problem for all λn, ˆβ∗ = arg min β ˆy − p ∑ j=1 x∗ j βj 2 + λn p ∑ j=1 βj . 5. Output ˆβKL j = ˆβ∗ j / ˆwj, j = 1,...,p. Tuning is an important issue in practice. Suppose that we use ˆβ (ols) to construct the adaptive weights in the KL adaptive lasso; we then want to find an optimal pair of (γ,λn) which minimizes the objective function among all other pairs. We can use two-dimensional cross-validation to tune the KL adaptive lasso. Note that for a given γ, we can use cross-validation along with the LARS algorithm to exclusively search for the optimal λn. In principle, we can also replace ˆβ (ols) with other consistent estimators. Hence we can treat it as the third tuning parameter and perform three-dimensional cross-validation to find an optimal triple ( ˆβ,γ,λn). We suggest using ˆβ (ols)
Tuning is an important issue in practice. Suppose we use $\hat{\beta}(\mathrm{ols})$ to construct the adaptive weights in the KL adaptive lasso; we then want the pair $(\gamma, \lambda_n)$ that minimizes the cross-validated prediction error over all candidate pairs. Two-dimensional cross-validation can be used to tune the KL adaptive lasso: for a given $\gamma$, cross-validation combined with the LARS algorithm searches the whole solution path for the optimal $\lambda_n$. In principle, we can also replace $\hat{\beta}(\mathrm{ols})$ with any other consistent estimator, treat that choice as a third tuning parameter, and perform three-dimensional cross-validation to find an optimal triple $(\hat{\beta}, \gamma, \lambda_n)$. We suggest using $\hat{\beta}(\mathrm{ols})$ unless collinearity is a concern, in which case $\hat{\beta}(\mathrm{ridge})$ from the best ridge regression fit can be used instead, because it is more stable than $\hat{\beta}(\mathrm{ols})$.

4 Numerical example: diabetes data

From Efron et al. (2004): "Ten baseline variables, age, sex, body mass index (bmi), average blood pressure (map), and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu) were obtained for each of n = 442 diabetes patients, as well as the response of interest (y), a quantitative measure of disease progression one year after baseline." Applying the KL adaptive lasso variable selection methodology to these data gives the following regression coefficient estimates for the ten predictors:

    age      0.0000
    sex   −201.6889
    bmi    540.5075
    map    314.5000
    tc    −514.2194
    ldl    268.7080
    hdl      0.0000
    tch    119.6211
    ltg    682.5415
    glu      0.0000

The estimates show that the variables "age", "hdl" and "glu" have no influence on the response in the fitted model, so only the remaining seven variables are selected to predict the response. This result agrees with the results obtained by applying the lasso and the adaptive lasso to the same data.
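The following sketch (again ours, not the paper's code) shows how this analysis could be reproduced in outline, reusing the kl_adaptive_lasso function sketched in Section 3. scikit-learn ships a standardized copy of the Efron et al. (2004) diabetes data (columns centered and scaled), and the grid of (γ, λn) values below is arbitrary, so the fitted coefficients will be on a different scale than the table above; the code only illustrates the two-dimensional cross-validation tuning described in Section 3.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold

# Standardized copy of the diabetes data of Efron et al. (2004).
X, y = load_diabetes(return_X_y=True)
y = y - y.mean()  # work without an intercept, as in the sketch of Section 3

# Crude two-dimensional grid search by 5-fold cross-validation over (gamma, lambda).
best = (None, None, np.inf)
for gamma in (0.5, 1.0, 2.0):
    for lam in np.logspace(-4, 0, 20):
        cv_err = 0.0
        for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
            beta = kl_adaptive_lasso(X[tr], y[tr], lam, gamma)
            cv_err += np.mean((y[te] - X[te] @ beta) ** 2)
        if cv_err < best[2]:
            best = (gamma, lam, cv_err)

gamma, lam, _ = best
beta_kl = kl_adaptive_lasso(X, y, lam, gamma)
print("selected predictors:", np.nonzero(beta_kl)[0])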
5 Further extension

Having established the oracle properties of the KL adaptive lasso in linear regression models, we now extend the theory and methodology to generalized linear models (GLMs). We consider the penalized KL divergence loss function with the adaptively weighted $\ell_1$ penalty, where the density belongs to the exponential family with canonical parameter $\theta$. The generic density can be written as (McCullagh and Nelder, 1989)
\[
f(y \mid x, \theta) = h(y) \exp\big(y\theta - \phi(\theta)\big).
\]
Generalized linear models assume that $\theta = x^T\beta^*$. Suppose that $\hat{\beta}$ is the maximum likelihood estimate (MLE) of $\beta^*$ in the GLM. We construct the weight vector $\hat{w} = 1/|\hat{\beta}|^{\gamma}$ for some $\gamma > 0$. Let $f(y_i \mid x_i, \beta^*)$ and $f(y_i \mid x_i, \beta)$ denote the exponential family densities evaluated at $\beta^*$ and $\beta$, respectively, and note that $E_{\beta^*}(y_i \mid x_i) = \phi'(x_i^T\beta^*)$, where $\phi'(\cdot)$ is the first derivative of $\phi(\cdot)$. The KL adaptive lasso estimates $\hat{\beta}^{KL}(\mathrm{glm})$ are then given by
\[
\hat{\beta}^{KL}(\mathrm{glm}) = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} E_{\beta^*}\Big[\log \frac{f(y_i \mid x_i, \beta^*)}{f(y_i \mid x_i, \beta)}\Big] + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \Big\}
= \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \big\{\phi'(x_i^T\beta^*)(x_i^T\beta^* - x_i^T\beta) - \phi(x_i^T\beta^*) + \phi(x_i^T\beta)\big\} + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \Big\}. \tag{5.4}
\]
The true but unknown $\beta^*$ must be replaced by a root-$n$-consistent estimator; we therefore substitute $\hat{\beta}(\mathrm{mle})$ for $\beta^*$ in the KL divergence loss, and (5.4) becomes
\[
\hat{\beta}^{KL}(\mathrm{glm}) = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \big\{\phi'(x_i^T\hat{\beta})(x_i^T\hat{\beta} - x_i^T\beta) - \phi(x_i^T\hat{\beta}) + \phi(x_i^T\beta)\big\} + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \Big\}. \tag{5.5}
\]
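As a quick check (our addition; it is not written out in the paper but follows directly from (5.5)), the Gaussian family with unit dispersion has $\phi(\theta) = \theta^2/2$ and $\phi'(\theta) = \theta$, and in that case the KL loss in (5.5) reduces to the squared-error loss used in the linear-model formulation:
\[
\sum_{i=1}^{n}\Big\{ x_i^T\hat{\beta}\,\big(x_i^T\hat{\beta} - x_i^T\beta\big) - \tfrac{1}{2}\big(x_i^T\hat{\beta}\big)^2 + \tfrac{1}{2}\big(x_i^T\beta\big)^2 \Big\}
= \tfrac{1}{2}\sum_{i=1}^{n}\big(x_i^T\hat{\beta} - x_i^T\beta\big)^2
= \tfrac{1}{2}\,\big\|X\hat{\beta} - X\beta\big\|^2,
\]
with the factor $\tfrac{1}{2}$ absorbed into $\lambda_n$.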
For logistic regression, (5.5) becomes
\[
\hat{\beta}^{KL}(\mathrm{logistic}) = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \big\{\phi'(x_i^T\hat{\beta})(x_i^T\hat{\beta} - x_i^T\beta) - \log\big(1 + \exp(x_i^T\hat{\beta})\big) + \log\big(1 + \exp(x_i^T\beta)\big)\big\} + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \Big\}.
\]
For Poisson log-linear regression models, (5.5) becomes
\[
\hat{\beta}^{KL}(\mathrm{poisson}) = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \big\{\phi'(x_i^T\hat{\beta})(x_i^T\hat{\beta} - x_i^T\beta) - \exp(x_i^T\hat{\beta}) + \exp(x_i^T\beta)\big\} + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \Big\}.
\]
Let $KL_n(\beta) = \sum_{i=1}^{n} \{\phi'(x_i^T\hat{\beta})(x_i^T\hat{\beta} - x_i^T\beta) - \phi(x_i^T\hat{\beta}) + \phi(x_i^T\beta)\}$. Then
\[
\frac{\partial^2 KL_n(\beta)}{\partial\beta\,\partial\beta^T} = \sum_{i=1}^{n} \phi''(x_i^T\beta)\,x_i x_i^T,
\]
which is positive definite, since the variance function $\phi''(\cdot) > 0$. Now let $kl_n(\beta) = KL_n(\beta) + \lambda_n \sum_{j} \hat{w}_j |\beta_j|$. Since the penalty is convex, $kl_n(\beta)$ is convex in $\beta$, and hence the local minimizer $\hat{\beta}^{KL}(\mathrm{glm})$ is the unique global KL adaptive lasso estimator.

Assume that the true model has a sparse representation. Without loss of generality, let $\mathcal{A} = \{j : \beta^*_j \neq 0\} = \{1, 2, \ldots, p_0\}$ with $p_0 < p$. Let $I_{11}$ be the $p_0 \times p_0$ upper-left block of the Fisher information matrix $I(\beta^*)$; then $I_{11}$ is the Fisher information when the true sub-model is known. We show that, under some mild regularity conditions, the KL adaptive lasso estimates $\hat{\beta}^{KL}(\mathrm{glm})$ enjoy the oracle properties if $\lambda_n$ is chosen appropriately.
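To make the GLM objective (5.5) concrete before turning to the theory, here is a minimal Python sketch of its logistic-regression case (our illustration, not code from the paper). The function name, the use of an essentially unpenalized sklearn LogisticRegression fit as a stand-in for the MLE, and the generic derivative-free optimizer are all our choices; a dedicated coordinate-descent or LARS-type solver would be used in practice.

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

def kl_adaptive_lasso_logistic(X, y, lam, gamma=1.0):
    """Direct (inefficient) minimization of the logistic case of (5.5)."""
    # MLE of beta*: a very large C makes sklearn's ridge penalty negligible.
    beta_hat = LogisticRegression(C=1e6, fit_intercept=False).fit(X, y).coef_.ravel()
    w = 1.0 / np.abs(beta_hat) ** gamma      # adaptive weights
    eta_hat = X @ beta_hat                   # x_i' beta_hat
    p_hat = expit(eta_hat)                   # phi'(x_i' beta_hat)

    def objective(beta):
        eta = X @ beta
        kl = np.sum(p_hat * (eta_hat - eta)
                    - np.logaddexp(0.0, eta_hat)   # log(1 + exp(x_i' beta_hat))
                    + np.logaddexp(0.0, eta))      # log(1 + exp(x_i' beta))
        return kl + lam * np.sum(w * np.abs(beta))

    # A derivative-free method tolerates the kink of the l1 term, but it will
    # not return exact zeros or scale to large p; it is used only to make the
    # objective concrete.  Near-zero coefficients would be thresholded.
    return minimize(objective, x0=beta_hat, method="Powell").x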
Theorem 5.1. Let $\mathcal{A}_n = \{j : \hat{\beta}^{KL}_j(\mathrm{glm}) \neq 0\}$. Suppose that $\lambda_n/\sqrt{n} \to 0$ and $\lambda_n n^{(\gamma-1)/2} \to \infty$. Then the KL adaptive lasso estimates $\hat{\beta}^{KL}(\mathrm{glm})$ satisfy the following:
1. Consistency in variable selection: $\lim_n P(\mathcal{A}_n = \mathcal{A}) = 1$.
2. Asymptotic normality: $\sqrt{n}\big(\hat{\beta}^{KL}_{\mathcal{A}}(\mathrm{glm}) - \beta^*_{\mathcal{A}}\big) \to_d N(0, I_{11}^{-1})$.

Proof. We assume the following regularity conditions:
1. The Fisher information matrix $I(\beta^*) = E\big[\phi''(x^T\beta^*)\,x x^T\big]$ is finite and positive definite.
2. There is a sufficiently large open set $\mathcal{O}$ containing $\beta^*$ such that, for all $\beta \in \mathcal{O}$, $|\phi'''(x^T\beta)| \le M(x) < \infty$ and $E\big[M(x)\,|x_j x_k x_l|\big] < \infty$ for all $1 \le j, k, l \le p$.

We first prove the asymptotic normality part. Recall that
\[
\hat{\beta}^{KL}(\mathrm{glm}) = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \big\{\phi'(x_i^T\hat{\beta})(x_i^T\hat{\beta} - x_i^T\beta) - \phi(x_i^T\hat{\beta}) + \phi(x_i^T\beta)\big\} + \lambda_n \sum_{j} \hat{w}_j |\beta_j| \Big\}.
\]
Let $\beta = \beta^* + u/\sqrt{n}$, $u \in \mathbb{R}^p$, and define
\[
\Gamma_n(u) = \sum_{i=1}^{n} \Big\{ x_i^T\Big(\hat{\beta} - \beta^* - \frac{u}{\sqrt{n}}\Big)\phi'(x_i^T\hat{\beta}) - \phi(x_i^T\hat{\beta}) + \phi\Big(x_i^T\Big(\beta^* + \frac{u}{\sqrt{n}}\Big)\Big) \Big\} + \lambda_n \sum_{j} \hat{w}_j \Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big|
\]
and
\[
\Gamma_n(0) = \sum_{i=1}^{n} \Big\{ x_i^T(\hat{\beta} - \beta^*)\phi'(x_i^T\hat{\beta}) - \phi(x_i^T\hat{\beta}) + \phi(x_i^T\beta^*) \Big\} + \lambda_n \sum_{j} \hat{w}_j |\beta^*_j|.
\]
Let $\hat{u}_n = \arg\min_u \Gamma_n(u) = \arg\min_u \{\Gamma_n(u) - \Gamma_n(0)\}$; then $\hat{u}_n = \sqrt{n}(\hat{\beta}^{KL}(\mathrm{glm}) - \beta^*)$. Let
\[
H^{(n)}(u) = \Gamma_n(u) - \Gamma_n(0) = \sum_{i=1}^{n} \Big\{ -\frac{x_i^T u}{\sqrt{n}}\,\phi'(x_i^T\hat{\beta}) + \phi\Big(x_i^T\beta^* + \frac{x_i^T u}{\sqrt{n}}\Big) - \phi(x_i^T\beta^*) \Big\} + \frac{\lambda_n}{\sqrt{n}} \sum_{j} \hat{w}_j \sqrt{n}\Big(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\Big).
\]
By a Taylor expansion,
\[
\phi\Big(x_i^T\beta^* + \frac{x_i^T u}{\sqrt{n}}\Big) = \phi(x_i^T\beta^*) + \phi'(x_i^T\beta^*)\,\frac{x_i^T u}{\sqrt{n}} + \frac{1}{2}\,\phi''(x_i^T\beta^*)\,\frac{u^T x_i x_i^T u}{n} + \frac{n^{-3/2}}{6}\,\phi'''(x_i^T\beta^{**})\,(x_i^T u)^3,
\]
where $\beta^{**}$ lies between $\beta^*$ and $\beta^* + u/\sqrt{n}$. Hence
\[
H^{(n)}(u) = \sum_{i=1}^{n} \Big\{ \big(\phi'(x_i^T\beta^*) - \phi'(x_i^T\hat{\beta})\big)\frac{x_i^T u}{\sqrt{n}} + \frac{1}{2}\,\phi''(x_i^T\beta^*)\,\frac{u^T x_i x_i^T u}{n} + \frac{n^{-3/2}}{6}\,\phi'''(x_i^T\beta^{**})(x_i^T u)^3 \Big\} + \frac{\lambda_n}{\sqrt{n}} \sum_{j} \hat{w}_j \sqrt{n}\Big(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\Big)
= A^{(n)}_1 + A^{(n)}_2 + A^{(n)}_3 + A^{(n)}_4, \ \text{say}.
\]
For the first term,
\[
A^{(n)}_1 = -\sum_{i=1}^{n} \big[\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\beta^*)\big]\frac{x_i^T u}{\sqrt{n}}
= -\sum_{i=1}^{n} \big[\phi''(x_i^T\beta^*)(x_i^T\hat{\beta} - x_i^T\beta^*)\big]\frac{x_i^T u}{\sqrt{n}} - \frac{1}{2}\sum_{i=1}^{n} \big[\phi'''(x_i^T\hat{\beta}^{**})(x_i^T\hat{\beta} - x_i^T\beta^*)^2\big]\frac{x_i^T u}{\sqrt{n}}
= -A^{(n)}_{11} - A^{(n)}_{12}, \ \text{say},
\]
where $\hat{\beta}^{**}$ lies between $\hat{\beta}$ and $\beta^*$. Notice that
\[
A^{(n)}_{11} = \sum_{i=1}^{n} \big[\phi''(x_i^T\beta^*)(x_i^T\hat{\beta} - x_i^T\beta^*)\big]\frac{x_i^T u}{\sqrt{n}} = u^T\Big[\frac{1}{n}\sum_{i=1}^{n}\phi''(x_i^T\beta^*)\,x_i x_i^T\Big] W'_n,
\]
where $W'_n = \sqrt{n}(\hat{\beta} - \beta^*) \to_d W'$, with $W' \sim N(0, [I(\beta^*)]^{-1})$. Also, $\frac{1}{n}\sum_{i=1}^{n}\phi''(x_i^T\beta^*)\,x_i x_i^T \to_p I(\beta^*)$ by the weak law of large numbers (WLLN). Thus, by Slutsky's theorem, $A^{(n)}_{11} \to_d u^T I(\beta^*) W' = u^T W$, where $W \sim N(0, I(\beta^*))$. Next consider
\[
A^{(n)}_{12} = \frac{1}{2}\sum_{i=1}^{n} \big[\phi'''(x_i^T\hat{\beta}^{**})(x_i^T\hat{\beta} - x_i^T\beta^*)^2\big]\frac{x_i^T u}{\sqrt{n}}
= \frac{1}{2\sqrt{n}}\, W_n'^T \Big[\frac{1}{n}\sum_{i=1}^{n}\phi'''(x_i^T\hat{\beta}^{**})\,x_i x_i^T\,(x_i^T u)\Big] W'_n,
\]
so that
\[
|A^{(n)}_{12}| \le \frac{1}{2\sqrt{n}}\, \Big| W_n'^T \Big[\frac{1}{n}\sum_{i=1}^{n} M(x_i)\,x_i x_i^T\,|x_i^T u|\Big] W'_n \Big|,
\]
by regularity condition 2, since $\hat{\beta}^{**} \in \mathcal{O}$ (with probability tending to one). Also $W'_n = O_p(1)$, since $W'_n \to_d W'$. By the WLLN,
\[
\frac{1}{n}\sum_{i=1}^{n} M(x_i)\,x_i x_i^T\,|x_i^T u| \to_p E\big[M(x)\,x x^T\,|x^T u|\big].
\]
The $(j,k)$-th element of the $p \times p$ matrix $E[M(x)\,x x^T\,|x^T u|]$ satisfies, for all $1 \le j, k \le p$,
\[
E\big[M(x)\,x_j x_k\,|x^T u|\big] \le \sum_{l=1}^{p} |u_l|\, E\big[M(x)\,|x_j x_k x_l|\big] < \infty,
\]
again by regularity condition 2. Hence $\frac{1}{n}\sum_{i=1}^{n} M(x_i)\,x_i x_i^T\,|x_i^T u| = O_p(1)$, and therefore $A^{(n)}_{12} \to_p 0$, which implies that $A^{(n)}_1 \to_d -u^T W$.
For the second term, since $\frac{1}{n}\sum_{i=1}^{n}\phi''(x_i^T\beta^*)\,x_i x_i^T \to_p I(\beta^*)$, Slutsky's theorem gives $A^{(n)}_2 \to_p \frac{1}{2}\,u^T I(\beta^*)\, u$. Since $\beta^{**} \in \mathcal{O}$, regularity condition 2 allows the third term $A^{(n)}_3$ to be bounded as
\[
6\sqrt{n}\,|A^{(n)}_3| \le \frac{1}{n}\sum_{i=1}^{n} M(x_i)\,|x_i^T u|^3 \to_p E\big[M(x)\,|x^T u|^3\big] < \infty,
\]
so $A^{(n)}_3 \to_p 0$. The limiting behavior of the fourth term $A^{(n)}_4$ was already obtained in the proof of Theorem 2.1:
\[
\frac{\lambda_n}{\sqrt{n}}\,\hat{w}_j\,\sqrt{n}\Big(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\Big) \to_p
\begin{cases}
0, & \text{if } \beta^*_j \neq 0,\\
0, & \text{if } \beta^*_j = 0 \text{ and } u_j = 0,\\
\infty, & \text{if } \beta^*_j = 0 \text{ and } u_j \neq 0.
\end{cases}
\]
Thus, by Slutsky's theorem, $H^{(n)}(u) \to_d H(u)$ for every $u$, where
\[
H(u) = \tfrac{1}{2}\, u_{\mathcal{A}}^T I_{11}\, u_{\mathcal{A}} - u_{\mathcal{A}}^T W_{\mathcal{A}} \ \ \text{if } u_j = 0\ \forall\, j \notin \mathcal{A}, \qquad H(u) = \infty \ \text{otherwise},
\]
with $W \sim N(0, I(\beta^*))$. $H^{(n)}$ is convex and the unique minimizer of $H$ is $(I_{11}^{-1} W_{\mathcal{A}}, 0)^T$. By the epi-convergence results of Geyer (1994) and Knight and Fu (2000), $\hat{u}^{(n)}_{\mathcal{A}} \to_d I_{11}^{-1} W_{\mathcal{A}}$ and $\hat{u}^{(n)}_{\mathcal{A}^c} \to_d 0$. Because $W_{\mathcal{A}} \sim N(0, I_{11})$, we have $I_{11}^{-1} W_{\mathcal{A}} \sim N(0, I_{11}^{-1})$, and the asymptotic normality part is proven.

Now we show the consistency part. For every $j \in \mathcal{A}$, asymptotic normality implies $P(j \in \mathcal{A}_n) \to 1$. It therefore suffices to show that, for every $j' \notin \mathcal{A}$, $P(j' \in \mathcal{A}_n) \to 0$. Consider the event $\{j' \in \mathcal{A}_n\}$. By the KKT optimality conditions, we must have
\[
\sum_{i=1}^{n} x_{ij'}\big(\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\hat{\beta}^{KL}(\mathrm{glm}))\big) = \lambda_n \hat{w}_{j'};
\]
thus $P(j' \in \mathcal{A}_n) \le P\big(\sum_{i=1}^{n} x_{ij'}(\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\hat{\beta}^{KL}(\mathrm{glm}))) = \lambda_n \hat{w}_{j'}\big)$.
Note that
\[
\frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_{ij'}\big(\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\hat{\beta}^{KL}(\mathrm{glm}))\big) = B^{(n)}_1 + B^{(n)}_2 + B^{(n)}_3,
\]
with
\[
B^{(n)}_1 = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_{ij'}\big(\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\beta^*)\big), \qquad
B^{(n)}_2 = \Big(\frac{1}{n}\sum_{i=1}^{n} x_{ij'}\,\phi''(x_i^T\beta^*)\,x_i^T\Big)\sqrt{n}\big(\beta^* - \hat{\beta}^{KL}(\mathrm{glm})\big)
\]
and
\[
B^{(n)}_3 = -\frac{1}{2n\sqrt{n}}\sum_{i=1}^{n} x_{ij'}\,\phi'''(x_i^T\hat{\beta}^{***})\Big(x_i^T\sqrt{n}\big(\beta^* - \hat{\beta}^{KL}(\mathrm{glm})\big)\Big)^2,
\]
where $\hat{\beta}^{***}$ lies between $\hat{\beta}^{KL}(\mathrm{glm})$ and $\beta^*$. Now,
\[
B^{(n)}_1 = \sum_{i=1}^{n} \big[\phi''(x_i^T\beta^*)(x_i^T\hat{\beta} - x_i^T\beta^*)\big]\frac{x_{ij'}}{\sqrt{n}} + \frac{1}{2}\sum_{i=1}^{n} \big[\phi'''(x_i^T\hat{\beta}^{**})(x_i^T\hat{\beta} - x_i^T\beta^*)^2\big]\frac{x_{ij'}}{\sqrt{n}}
= B^{(n)}_{11} + B^{(n)}_{12}, \ \text{say},
\]
where $\hat{\beta}^{**}$ lies between $\hat{\beta}$ and $\beta^*$. Notice that
\[
B^{(n)}_{11} = \sum_{i=1}^{n} \big[\phi''(x_i^T\beta^*)(x_i^T\hat{\beta} - x_i^T\beta^*)\big]\frac{x_{ij'}}{\sqrt{n}}
= \Big[\frac{1}{n}\sum_{i=1}^{n} x_{ij'}\,\phi''(x_i^T\beta^*)\,x_i^T\Big] W'_n.
\]
Since $\frac{1}{n}\sum_{i=1}^{n} x_{ij'}\,\phi''(x_i^T\beta^*)\,x_i^T \to_p I_{j'}$, the $j'$-th row of $I(\beta^*)$, and $W'_n \to_d W'$, Slutsky's theorem gives $B^{(n)}_{11} \to_d I_{j'} W' \sim N\big(0,\, I_{j'} [I(\beta^*)]^{-1} I_{j'}^T\big)$.
Also,
\[
B^{(n)}_{12} = \frac{1}{2}\sum_{i=1}^{n} \big[\phi'''(x_i^T\hat{\beta}^{**})(x_i^T\hat{\beta} - x_i^T\beta^*)^2\big]\frac{x_{ij'}}{\sqrt{n}}
= \frac{1}{2\sqrt{n}}\, W_n'^T \Big[\frac{1}{n}\sum_{i=1}^{n} \phi'''(x_i^T\hat{\beta}^{**})\,x_i x_i^T\, x_{ij'}\Big] W'_n,
\]
so that
\[
|B^{(n)}_{12}| \le \frac{1}{2\sqrt{n}}\,\Big| W_n'^T \Big[\frac{1}{n}\sum_{i=1}^{n} M(x_i)\,x_i x_i^T\,|x_{ij'}|\Big] W'_n \Big|,
\]
by regularity condition 2, since $\hat{\beta}^{**} \in \mathcal{O}$ (with probability tending to one). We know that $W'_n = O_p(1)$ and, by the WLLN, $\frac{1}{n}\sum_{i=1}^{n} M(x_i)\,x_i x_i^T\,|x_{ij'}| \to_p E[M(x)\,x x^T\,|x_{j'}|]$, which implies $\frac{1}{n}\sum_{i=1}^{n} M(x_i)\,x_i x_i^T\,|x_{ij'}| = O_p(1)$. Hence $B^{(n)}_{12} \to_p 0$, and an identical argument shows that $B^{(n)}_3 \to_p 0$. Since $\frac{1}{n}\sum_{i=1}^{n} x_{ij'}\,\phi''(x_i^T\beta^*)\,x_i^T \to_p I_{j'}$, the asymptotic normality part implies that $B^{(n)}_2$ converges in distribution to a normal random variable. Consequently,
\[
\frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_{ij'}\big(\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\hat{\beta}^{KL}(\mathrm{glm}))\big) = B^{(n)}_1 + B^{(n)}_2 + B^{(n)}_3 = O_p(1).
\]
Meanwhile,
\[
\frac{\lambda_n \hat{w}_{j'}}{\sqrt{n}} = \frac{\lambda_n}{\sqrt{n}}\,\frac{n^{\gamma/2}}{|\sqrt{n}\,\hat{\beta}_{j'}|^{\gamma}} \to_p \infty,
\]
since $\lambda_n n^{(\gamma-1)/2} \to \infty$ by assumption and, because $j' \notin \mathcal{A}$ implies $\beta^*_{j'} = 0$, $\sqrt{n}\,\hat{\beta}_{j'} = \sqrt{n}(\hat{\beta}_{j'} - \beta^*_{j'})$ converges in distribution to a normal random variable, so that $|\sqrt{n}\,\hat{\beta}_{j'}|^{\gamma} = O_p(1)$. Hence $P(j' \in \mathcal{A}_n) \to 0$. This completes the proof. ◻

6 Conclusion

In this article we have proposed the KL adaptive lasso for simultaneous estimation and variable selection. We have shown that, by utilizing the adaptively weighted $\ell_1$ penalty, the KL adaptive lasso enjoys the oracle properties. Owing to the efficient path algorithm, it also enjoys the computational advantage of the lasso.
Our numerical example has shown that the KL adaptive lasso performs similarly to the lasso and the adaptive lasso.

In future work, we plan to compare the prediction accuracy of the KL adaptive lasso with other existing sparse modeling techniques such as the lasso, the adaptive lasso and the nonnegative garrote, reporting the prediction error $E[(\hat{y} - y_{\mathrm{test}})^2]$ for each method. We would also like to extend the KL divergence based variable selection methodology to the high-dimensional regime (i.e., when the number of regressors $p$ grows to infinity at a certain rate relative to the sample size $n$), as well as to the survival analysis setting, where the data must be modified to account for censoring, and to investigate whether the oracle properties established above continue to hold in these scenarios.

Acknowledgments

The author would like to thank Prof. Malay Ghosh and Prof. Kshitij Khare for their help with the paper.

References

[1] Breiman, L. (1995a). Better subset regression using the nonnegative garrote, Technometrics, 37, 373–384.

[2] Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory, John Wiley, New York.

[3] Draper, N. R. and Smith, H. (1998). Applied Regression Analysis, third edition, John Wiley, New York.

[4] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression, The Annals of Statistics, 32, 407–499.
[5] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties, Journal of the American Statistical Association, 96, 1348–1360.

[6] Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery, Proceedings of the Madrid International Congress of Mathematicians.

[7] Fan, J. and Peng, H. (2004). On nonconcave penalized likelihood with diverging number of parameters, The Annals of Statistics, 32, 928–961.

[8] Geyer, C. (1994). On the asymptotics of constrained M-estimation, The Annals of Statistics, 22, 1993–2010.

[9] Hastie, T., Tibshirani, R. and Friedman, J. H. (2009). The Elements of Statistical Learning, second edition, Springer-Verlag, New York.

[10] Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems, Technometrics, 12(1), 55–67.

[11] Knight, K. and Fu, W. (2000). Asymptotics for lasso-type estimators, The Annals of Statistics, 28, 1356–1378.

[12] McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, second edition, Chapman & Hall, New York.

[13] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Ser. B, 58, 267–288.

[14] Zou, H. (2006). The adaptive lasso and its oracle properties, Journal of the American Statistical Association, 101(476), 1418–1429.
Shibasish Dasgupta
Department of Mathematics and Statistics
University of South Alabama
411 University Boulevard North
Mobile, AL 36688-0002, U.S.A.
E-mail: sdasgupta@southalabama.edu