Journal of the Indian Statistical Association
Vol. 53, No. 1 & 2, 2015, 153-174
Variable selection using
Kullback-Leibler divergence loss
Shibasish Dasgupta
University of South Alabama, Mobile, U.S.A.
Abstract
The adaptive lasso is a recent technique for simultaneous estimation
and variable selection where adaptive weights are used for penalizing
different coefficients in the l1 penalty. In this paper, we propose an
alternative approach to the adaptive lasso through the Kullback-Leibler
(KL) divergence loss, called the KL adaptive lasso, in which the squared
error loss in the adaptive lasso setup is replaced by the KL divergence
loss, also known as the entropy distance. There are various theoretical
reasons for using the Kullback-Leibler distance, ranging from information
theory to the relevance of the logarithmic scoring rule and the
location-scale invariance of the distance. We show
that the KL adaptive lasso enjoys the oracle properties; namely, it
performs as well as if the true underlying model were given in advance.
Furthermore, the KL adaptive lasso can be solved by the same efficient
algorithm for solving the lasso. We also discuss the extension of the
KL adaptive lasso in generalized linear models (GLMs) and show that
the oracle properties still hold under mild regularity conditions.
Key Words : Asymptotic normality, Adaptive lasso, Divergence loss,
Generalized linear models, Kullback-Leibler, Oracle property, Sparsity,
Variable selection
Received: April, 2014
1 Introduction
There are two fundamental goals in statistical learning: ensuring high
prediction accuracy and discovering relevant predictive variables. Variable
selection is particularly important when the true underlying model has a
sparse representation. Identifying significant predictors will enhance the
prediction performance of the fitted model. Variable selection is also
fundamental to high-dimensional statistical modeling. Many approaches
in use are stepwise selection procedures, which can be computationally
expensive and ignore stochastic errors in the variable selection process.
Regularization methods are characterized by loss functions measuring data
fits and penalty terms constraining model parameters. The ‘lasso’ is a
popular regularization technique for simultaneous estimation and variable
selection (Tibshirani, 1996). Fan and Li (2006) gave a comprehensive
overview of variable/feature selection and proposed a unified framework to
approach the problem of variable selection.
1.1 Variable selection and the lasso
Let us consider the usual linear model setup. Suppose we observe an independent and identically distributed (iid) sample $(x_i, y_i)$, $i = 1, \ldots, n$, where $x_i = (x_{i1}, \ldots, x_{ip})$ is the vector of $p$ covariates. The linear model is given by $Y = X\beta + \epsilon$, where $\epsilon \sim N(0, \sigma^2 I)$. Our main interest is to estimate the regression coefficient $\beta = (\beta_1, \ldots, \beta_p)$. We also know that for $p < n$, the least squares estimate (LSE) of $\beta$ is unique and is given by $\hat{\beta}_{LS} = (X^T X)^{-1} X^T Y$. Moreover, $\hat{\beta}_{LS}$ is the Best Linear Unbiased Estimator (BLUE) in this scenario. But in the high-dimensional setting the dimension of $X^T X$ is usually very large, which in turn leads to an unstable estimator of $\beta$. If $p > n$, the LSE is not unique and will usually overfit the data, i.e., all observations are predicted perfectly, but there are many solutions for the coefficients of the fit and new observations are not uniquely predictable. The classical solution to this problem was to reduce the number of variables by procedures such as forward and backward regression, with the reduction in variables determined by hypothesis tests; see Draper and Smith (1998), for example.
Suppose the true (unknown) regression coefficients are $\beta^* = (\beta^*_1, \ldots, \beta^*_p)^T$. We denote the true non-null set by $A = \{j : \beta^*_j \neq 0\}$ and the true dimension of this set by $d = |A|$, so that $d < n < p$. The parameter estimate is $\hat{\beta}$ and the estimated non-null set is denoted by $A_n = \{j : \hat{\beta}_j \neq 0\}$.
For variable selection, our goal is threefold:
• Variable selection consistency: recover the true non-zero set $A$; we would like $A_n = A$.
• Estimation consistency: $\hat{\beta}$ is close to $\beta^*$.
• Prediction accuracy: $X\hat{\beta}$ is close to $X\beta^*$.
The classical methods for variable selection are the following:
• Best subset selection: consider all $2^p$ sub-models and choose the best one.
• Forward selection: starting from the null model, add the most significant variable sequentially.
• Backward selection: starting from the full model, delete the most insignificant variable sequentially.
• Stepwise regression: at every step, consider both adding and deleting a variable.
But these methods suffer from problems of estimation accuracy, computational cost and algorithmic stability. An alternative strategy that emerged was penalizing the squared error loss, i.e., adding to the residual sum of squares $\|Y - X\beta\|^2$ a penalty $\mathrm{pen}(\beta; \lambda)$. So, the penalized criterion is given by:
$$L(\beta; \lambda) = \|Y - X\beta\|^2 + \mathrm{pen}(\beta; \lambda).$$
When $\mathrm{pen}(\beta; \lambda) = \lambda\|\beta\|^2$, this is called ridge regression (Hoerl and Kennard, 1970) and we have the (unique) minimizer
$$\hat{\beta}_{Ridge} = (X^T X + \lambda I)^{-1} X^T Y.$$
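As a quick numerical illustration of the closed form above, here is a minimal numpy sketch; the simulated data and the value of $\lambda$ are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
beta_true = np.r_[np.ones(3), np.zeros(p - 3)]
y = X @ beta_true + rng.standard_normal(n)

lam = 1.0  # ridge penalty; in practice chosen by cross-validation
# closed-form ridge estimate: (X'X + lam*I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)  # all coefficients shrunk towards 0, none exactly 0
```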
The motivation of the lasso is Breiman's non-negative garotte (Breiman, 1995a):
$$\hat{\beta} = \arg\min \left\{ \sum_{i=1}^n \Big(y_i - \sum_j c_j \hat{\beta}^0_j x_{ij}\Big)^2 \right\} \quad \text{s.t. } c_j \geq 0, \ \sum_j c_j \leq t,$$
where $\hat{\beta}^0_j$ is the full LSE of $\beta_j$, $j = 1, \ldots, p$, and $t \geq 0$ is a tuning parameter.
The non-negative garotte has some advantages: it gives lower prediction error than subset selection and is competitive with ridge regression except when the true model has many small non-zero coefficients. But the drawback of this method is that it depends on both the sign and the magnitude of the LSE, and thus it suffers when the LSE behaves poorly.
The lasso is a shrinkage and selection method for linear regression. It
minimizes the usual sum of squared errors, with a bound on the sum of the
absolute values of the coefficients. Because of the nature of this constraint
it tends to produce some coefficients that are exactly 0 and hence gives
interpretable models. The lasso has several advantages: it simultaneously performs model selection and model fitting, and, although it is a nonlinear method, its solution is the global minimum of a convex penalized loss function and can be computed efficiently.
For the lasso, we standardize $x_{ij}$ so that $\sum_i x_{ij}/n = 0$ and $\sum_i x_{ij}^2/n = 1$. Now, denoting $\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_p)^T$, the lasso estimate $\hat{\beta}$ is given by:
$$\hat{\beta} = \arg\min \left\{ \sum_{i=1}^n \Big(y_i - \sum_j \beta_j x_{ij}\Big)^2 \right\} \quad \text{s.t. } \sum_j |\beta_j| \leq t.$$
Here, the tuning parameter $t$ controls the amount of shrinkage: when $t < t_0 = \sum_j |\hat{\beta}^0_j|$ (where $\hat{\beta}^0_j$ is the full LSE), the solutions are shrunk towards 0, and some coefficients may be exactly equal to 0. The above optimization problem is a quadratic programming problem with linear inequality constraints; equivalently, it can be written in the penalized (Lagrangian) form
$$\hat{\beta}^{lasso} = \arg\min \left\{ \sum_{i=1}^n \Big(y_i - \sum_j \beta_j x_{ij}\Big)^2 + \lambda \sum_j |\beta_j| \right\}.$$
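To illustrate the exact zeros produced by the l1 penalty, here is a minimal sketch using scikit-learn's Lasso; this tooling choice is an assumption for illustration, and note that scikit-learn scales the squared-error term by $1/(2n)$, so its alpha corresponds to $\lambda/(2n)$ in the display above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)            # standardize the predictors
beta_true = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0])
y = X @ beta_true + rng.standard_normal(n)

# larger alpha (i.e. larger lambda) => more coefficients set exactly to 0
fit = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)
print(fit.coef_)                           # several entries are exactly zero
```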
1.2 Oracle property and the adaptive lasso
Let us consider model estimation and variable selection in linear regression models. Suppose that $y = (y_1, \ldots, y_n)^T$ is the response vector and $x_j = (x_{1j}, \ldots, x_{nj})^T$, $j = 1, \ldots, p$, are the linearly independent predictors. Let $X = [x_1, \ldots, x_p]$ be the predictor matrix. We assume that $E[y \mid x] = \beta^*_1 x_1 + \cdots + \beta^*_p x_p$. Without loss of generality, we assume that the data are centered, so the intercept is not included in the regression function. Let $A = \{j : \beta^*_j \neq 0\}$ and further assume that $|A| = p_0 < p$. Thus the true model depends only on a subset of the predictors. Denote by $\hat{\beta}(\delta)$ the coefficient estimator produced by a fitting procedure $\delta$. Using the language of Fan and Li (2001), we call $\delta$ an oracle procedure if $\hat{\beta}(\delta)$ (asymptotically) has the following oracle properties:
• Identifies the right subset model: $A_n = \{j : \hat{\beta}_j \neq 0\} = A$.
• Has the optimal estimation rate: $\sqrt{n}(\hat{\beta}(\delta)_A - \beta^*_A) \to_d N(0, \Sigma^*)$, where $\Sigma^*$ is the covariance matrix based on the true subset model.
It has been argued (Fan and Li 2001 and Fan and Peng 2004) that a good
procedure should have these oracle properties.
Knight and Fu (2000) studied asymptotic behavior of Lasso type
estimators. Under some appropriate conditions, they showed that the
limiting distributions have positive probability mass at 0 when the true
value of the parameter is 0, and they established asymptotic normality for
large parameters in some sense. Fan and Li (2001) conjectured that the
oracle properties do not hold for the lasso. They also proposed a smoothly
clipped absolute deviation (SCAD) penalty for variable selection and proved
its oracle properties.
Zou (2006) proposed a new version of the lasso, called the adaptive lasso,
where adaptive weights are used for penalizing different coefficients in the
l1 penalty and showed that the adaptive lasso enjoys the oracle properties;
namely, it performs as well as if the true underlying model were given in
advance. The adaptive lasso is defined as follows:
Suppose that $\hat{\beta}$ is a root-$n$-consistent estimator of $\beta^*$; for example, we can use $\hat{\beta}(\text{ols})$. Pick a $\gamma > 0$, and define the weight vector $\hat{w} = 1/|\hat{\beta}|^\gamma$. The adaptive lasso estimator $\hat{\beta}^{adalasso}$ is given by:
$$\hat{\beta}^{adalasso} = \arg\min \left\{ \sum_{i=1}^n \Big(y_i - \sum_j \beta_j x_{ij}\Big)^2 + \lambda_n \sum_j \hat{w}_j |\beta_j| \right\},$$
where $\lambda_n$ varies with $n$.
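For concreteness, here is a minimal sketch of the adaptive lasso computed by the usual reweighting trick: divide each column by its weight, solve an ordinary lasso, and rescale the coefficients back. The use of numpy/scikit-learn, the LARS-based solver and the choice $\gamma = 1$ are illustrative assumptions; scikit-learn's LassoLars scales the squared-error term by $1/(2n)$, so its alpha plays the role of $\lambda_n/(2n)$ above.

```python
import numpy as np
from sklearn.linear_model import LassoLars

def adaptive_lasso(X, y, lam, gamma=1.0):
    """Adaptive lasso via the rescaled-covariates reduction to an ordinary lasso."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # root-n-consistent pilot estimate
    w = 1.0 / np.abs(beta_ols) ** gamma                # adaptive weights
    X_star = X / w                                     # rescale column j by 1/w_j
    fit = LassoLars(alpha=lam / (2 * n), fit_intercept=False).fit(X_star, y)
    return fit.coef_ / w                               # undo the rescaling

# toy usage
rng = np.random.default_rng(2)
X = rng.standard_normal((200, 6))
beta_true = np.array([4.0, 0, 0, -2.0, 0, 0])
y = X @ beta_true + rng.standard_normal(200)
print(adaptive_lasso(X, y, lam=20.0))   # zero-coefficient variables are dropped
```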
The rest of the article is organized as follows. In Section 2 we propose an
alternative approach to the adaptive lasso through the KL divergence loss
and we show that our proposed methodology enjoys the oracle properties
for variable selection. In Section 3 we use the LARS algorithm (Efron
et al. 2004) to solve the entire solution path of this newly developed
methodology. In Section 4 we apply our variable selection method to the
diabetes data (Efron et al. 2004). We extend the variable selection theory
and methodology to the generalized linear models (GLMs) in Section 5, and
give concluding remarks in Section 6.
2 An alternative approach to the adaptive lasso
through the KL divergence loss
In this paper, we replace the squared error loss in the adaptive lasso setup by the KL divergence loss, which is also known as the entropy distance.
There are various theoretical reasons to defend the use of Kullback-Leibler
distance, ranging from information theory to the relevance of logarithmic
scoring rule and the location-scale invariance of the distance, as detailed
in Bernardo and Smith (1994). We also show that the oracle properties
discussed above hold in this case under mild regularity conditions.
We adopt the setup of Knight and Fu (2000) for the asymptotic analysis. We assume two conditions:
(a) $y_i = x_i^T \beta^* + \epsilon_i$, where $\epsilon_1, \ldots, \epsilon_n$ are iid random variables with mean 0 and variance $\sigma^2$. Here, for convenience, we assume that $\sigma^2$ is known and equal to 1.
(b) $\frac{1}{n} X^T X \to C$, where $C$ is a positive definite matrix.
Without loss of generality, assume that $A = \{1, 2, \ldots, p_0\}$. Let $C_{11}$ be the $p_0 \times p_0$ upper-left partitioned sub-matrix of $C$.
Now, suppose that $f(y_i \mid x_i, \beta^*)$ and $f(y_i \mid x_i, \beta)$ are the normal densities evaluated at $\beta^*$ and $\beta$ respectively. Then the "Adaptive Penalized KL Divergence" estimator (which we call the "KL adaptive lasso" estimator from now on) $\hat{\beta}^{KL}$ is given by:
$$\hat{\beta}^{KL} = \arg\min \left\{ \sum_{i=1}^n E_{\beta^*}\left[\log\frac{f(y_i \mid x_i, \beta^*)}{f(y_i \mid x_i, \beta)}\right] + \lambda_n \sum_j \hat{w}_j|\beta_j| \right\}. \qquad (2.1)$$
Now, since the vector of true regression coefficients $\beta^*$ is unknown, we replace it by $\hat{\beta}$, the ordinary least squares (ols) estimator of $\beta^*$. Hence, after some algebra we get
$$\hat{\beta}^{KL} = \arg\min \left\{ [X\hat{\beta} - X\beta]^T [X\hat{\beta} - X\beta] + \lambda_n \sum_j \hat{w}_j|\beta_j| \right\}, \qquad (2.2)$$
where $\hat{w}_j = 1/|\hat{\beta}_j|^\gamma$. This is a convex optimization problem in $\beta$, since we have used a convex penalty, and hence the local minimizer $\hat{\beta}^{KL}$ is the unique global KL adaptive lasso estimator (for non-convex penalties, however, the local minimizer may not be globally unique).
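For completeness, the algebra behind (2.2) is a one-line calculation: for $N(x_i^T\beta, 1)$ densities the KL divergence reduces to half the squared difference of the means,
$$E_{\beta^*}\left[\log\frac{f(y_i \mid x_i, \beta^*)}{f(y_i \mid x_i, \beta)}\right] = E_{\beta^*}\left[-\tfrac{1}{2}(y_i - x_i^T\beta^*)^2 + \tfrac{1}{2}(y_i - x_i^T\beta)^2\right] = \tfrac{1}{2}(x_i^T\beta^* - x_i^T\beta)^2,$$
and summing over $i$ gives $\tfrac{1}{2}[X\beta^* - X\beta]^T[X\beta^* - X\beta]$; the factor $1/2$ is absorbed into $\lambda_n$, and $\beta^*$ is then replaced by $\hat{\beta}$ to obtain (2.2).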
Let $A_n = \{j : \hat{\beta}^{KL}_j \neq 0\}$. We now show that with a proper choice of $\lambda_n$, the KL adaptive lasso enjoys the oracle properties.
Theorem 2.1. Suppose that $\lambda_n/\sqrt{n} \to 0$ and $\lambda_n n^{(\gamma-1)/2} \to \infty$. Then, the KL adaptive lasso estimates must satisfy the following:
1. Consistency in variable selection: $\lim_n P(A_n = A) = 1$.
2. Asymptotic normality: $\sqrt{n}(\hat{\beta}^{KL}_A - \beta^*_A) \to_d N(0, C_{11}^{-1})$.
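As a concrete illustration of the two conditions (an example, not part of the original statement): with $\gamma = 1$ one may take $\lambda_n = \log n$ or $\lambda_n = n^{1/4}$, since then $\lambda_n/\sqrt{n} \to 0$ while $\lambda_n n^{(\gamma-1)/2} = \lambda_n \to \infty$; with $\gamma = 2$, even a fixed $\lambda_n \equiv \lambda > 0$ satisfies $\lambda_n n^{1/2} \to \infty$. In practice $\lambda_n$ is still chosen by cross-validation, as discussed in Section 3.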
Proof. We first prove the asymptotic normality part. Let $\beta = \beta^* + \frac{u}{\sqrt{n}}$, and
$$\psi_n(u) = \left[X\hat{\beta} - X\Big(\beta^* + \frac{u}{\sqrt{n}}\Big)\right]^T \left[X\hat{\beta} - X\Big(\beta^* + \frac{u}{\sqrt{n}}\Big)\right] + \lambda_n \sum_{j=1}^p \hat{w}_j \left|\beta^*_j + \frac{u_j}{\sqrt{n}}\right|.$$
Let $\hat{u}^{(n)} = \arg\min \psi_n(u)$; then $\hat{\beta}^{KL} = \beta^* + \frac{\hat{u}^{(n)}}{\sqrt{n}}$, or $\hat{u}^{(n)} = \sqrt{n}(\hat{\beta}^{KL} - \beta^*)$. Define:
$$V^{(n)}(u) = \psi_n(u) - \psi_n(0) = u^T\Big(\frac{1}{n}X^TX\Big)u - 2\,\frac{u^T X^T X}{\sqrt{n}}(\hat{\beta} - \beta^*) + \frac{\lambda_n}{\sqrt{n}} \sum_{j=1}^p \hat{w}_j \sqrt{n}\left(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\right). \qquad (2.3)$$
Since $\hat{\beta}$ is the ols estimate of $\beta^*$, we have $\sqrt{n}(\hat{\beta} - \beta^*) \to_d N(0, \sigma^2 C^{-1})$; then by assumption (b) we get $\big(\tfrac{1}{n}X^T X\big)\sqrt{n}(\hat{\beta} - \beta^*) \to_d W$, where $W \sim N(0, \sigma^2 C)$. Now consider the limiting behavior of the third term in (2.3). If $\beta^*_j \neq 0$, then $\hat{w}_j \to_p |\beta^*_j|^{-\gamma}$ (using $\hat{\beta}_j \to_p \beta^*_j$ and the continuous mapping theorem) and $\sqrt{n}\big(\big|\beta^*_j + \tfrac{u_j}{\sqrt{n}}\big| - |\beta^*_j|\big) \to u_j\,\mathrm{sgn}(\beta^*_j)$. Then we have $\tfrac{\lambda_n}{\sqrt{n}}\hat{w}_j\sqrt{n}\big(\big|\beta^*_j + \tfrac{u_j}{\sqrt{n}}\big| - |\beta^*_j|\big) \to_p 0$, since by one of the assumptions of this theorem $\lambda_n/\sqrt{n} \to 0$.
If $\beta^*_j = 0$, then $\sqrt{n}\big(\big|\beta^*_j + \tfrac{u_j}{\sqrt{n}}\big| - |\beta^*_j|\big) = |u_j|$ and
$$\frac{\lambda_n}{\sqrt{n}}\hat{w}_j = \frac{\lambda_n}{\sqrt{n}}\, n^{\gamma/2}\big(|\sqrt{n}\hat{\beta}_j|\big)^{-\gamma} = \lambda_n n^{(\gamma-1)/2}\big(|\sqrt{n}\hat{\beta}_j|\big)^{-\gamma}, \quad \text{where } \sqrt{n}\hat{\beta}_j = O_p(1).$$
From the above and the assumption of the theorem that $\lambda_n n^{(\gamma-1)/2} \to \infty$, we have $\tfrac{\lambda_n}{\sqrt{n}}\hat{w}_j\sqrt{n}\big(\big|\beta^*_j + \tfrac{u_j}{\sqrt{n}}\big| - |\beta^*_j|\big) \to_p 0$ if $u_j = 0$, and $\to_p \infty$ if $u_j \neq 0$.
Hence, we summarize the results as follows:
$$\frac{\lambda_n}{\sqrt{n}}\hat{w}_j\sqrt{n}\left(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\right) \to_p \begin{cases} 0, & \text{if } \beta^*_j \neq 0, \\ 0, & \text{if } \beta^*_j = 0 \ \text{and} \ u_j = 0, \\ \infty, & \text{if } \beta^*_j = 0 \ \text{and} \ u_j \neq 0. \end{cases}$$
Thus, by Slutsky's theorem, we see that $V^{(n)}(u) \to_d V(u)$ for every $u$, where $V(u) = u_A^T C_{11} u_A - 2u_A^T W_A$ if $u_j = 0 \ \forall j \notin A$, and $V(u) = \infty$ otherwise. Now note that $V^{(n)}$ is convex and the unique minimum of $V$ is $(C_{11}^{-1}W_A, 0)^T$. Following the epi-convergence results of Geyer (1994) and Knight and Fu (2000), we have $\hat{u}^{(n)}_A \to_d C_{11}^{-1}W_A$ and $\hat{u}^{(n)}_{A^c} \to_d 0$. Finally, we observe that $W_A \sim N(0, \sigma^2 C_{11})$, with $\sigma^2 = 1$ by assumption (a); this proves the asymptotic normality part.
part. Now, we show the consistency part. ∀j ∈ A, the asymptotic normality
result indicates that ˆβ
(n)
j →p β∗
j ; thus P(j ∈ A∗
n) → 1. Then it suffices to
show that ∀j′
∉ A, P(j′
∈ A∗
n) → 0. Consider the event j′
∈ A∗
n. Then, by the
Karush-Kuhn-Tucker (KKT) optimality conditions (Hastie, Tibshirani, and
Friedman 2009, p. 421), we know that 2xT
j′ (X ˆβ − X ˆβKL
) = λn ˆwj′ . Note
that
λn ˆwj′
√
n
= λn√
n
nγ/2 1
√
nˆβj′ γ
→p ∞ ( λnn(γ−1)/2
→ ∞ and
√
nˆβj′ = Op(1),
since j′
∉ A, i.e; β∗
j′ = 0), whereas
2xT
j′ (X ˆβ−X ˆβKL)
√
n
= 2
xT
j′ X
√
n( ˆβ− ˆβKL)
n =
2
xT
j′ X
n {
√
n( ˆβ − β∗
) −
√
n( ˆβKL
− β∗
)}. Now observe that
√
n( ˆβ − ˆβKL
) =
√
n( ˆβ − β∗
) −
√
n( ˆβKL
− β∗
) = Op(1) since
√
n( ˆβ − β∗
) →d some normal
r.v as well as
√
n( ˆβKL
− β∗
) →d some normal r.v. i.e;
10. 162 Journal, Indian Statistical Association
2xT
j′ X
√
n( ˆβ − ˆβKL
) = Op(1). Thus, 2
xT
j′ X
√
n( ˆβ− ˆβKL)
n →p 0. Hence,
P [j′
∈ A∗
n] ≤ P [2xT
j′ (X ˆβ − X ˆβKL
) = λn ˆwj′ ] → 0.
This completes the proof. ◻
3 Computations
In this section we discuss the computational issues. The KL adaptive lasso
estimates in (2.2) can be solved by the LARS algorithm (Efron et al. 2004).
The computational details are given in the following Algorithm, the proof of
which is very simple and so is omitted.
Algorithm (The LARS algorithm for the KL adaptive lasso).
1. Find $\hat{\beta}$, the ols estimate of $\beta^*$, by least squares estimation, and hence find $\hat{w}$ for some $\gamma > 0$.
2. Define $x^*_j = x_j / \hat{w}_j$, $j = 1, \ldots, p$.
3. Evaluate $\hat{y} = X\hat{\beta} = \sum_{j=1}^p x_j \hat{\beta}_j$.
4. Solve the following lasso problem for all $\lambda_n$:
$$\hat{\beta}^* = \arg\min_\beta \left\| \hat{y} - \sum_{j=1}^p x^*_j \beta_j \right\|^2 + \lambda_n \sum_{j=1}^p |\beta_j|.$$
5. Output $\hat{\beta}^{KL}_j = \hat{\beta}^*_j / \hat{w}_j$, $j = 1, \ldots, p$.
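A minimal Python sketch of the algorithm above; the use of numpy and scikit-learn's LassoLars solver is an illustrative assumption, and since LassoLars scales the squared-error term by $1/(2n)$, its alpha plays the role of $\lambda_n/(2n)$ in Step 4. In practice $\lambda_n$ (and $\gamma$) would be chosen by cross-validation, as discussed next.

```python
import numpy as np
from sklearn.linear_model import LassoLars

def kl_adaptive_lasso(X, y, lam, gamma=1.0):
    """Sketch of the LARS-based algorithm for the KL adaptive lasso."""
    n, p = X.shape
    # Step 1: OLS pilot estimate and adaptive weights
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / np.abs(beta_ols) ** gamma
    # Step 2: rescaled covariates x*_j = x_j / w_j
    X_star = X / w
    # Step 3: the "response" is the OLS fit X beta_hat, not y itself
    y_hat = X @ beta_ols
    # Step 4: ordinary lasso (solved by LARS) on (X*, y_hat)
    fit = LassoLars(alpha=lam / (2 * n), fit_intercept=False).fit(X_star, y_hat)
    # Step 5: transform back to the original scale
    return fit.coef_ / w
```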
Tuning is an important issue in practice. Suppose that we use $\hat{\beta}(\text{ols})$ to construct the adaptive weights in the KL adaptive lasso; we then want to find an optimal pair $(\gamma, \lambda_n)$ that minimizes the cross-validated prediction error among all candidate pairs. We can use two-dimensional cross-validation to tune the KL adaptive lasso. Note that for a given $\gamma$, we can use cross-validation along with the LARS algorithm to search for the optimal $\lambda_n$. In principle, we can also replace $\hat{\beta}(\text{ols})$ with other consistent estimators; we can then treat the choice of pilot estimator as a third tuning parameter and perform three-dimensional cross-validation to find an optimal triple $(\hat{\beta}, \gamma, \lambda_n)$. We suggest using $\hat{\beta}(\text{ols})$ unless collinearity is a concern, in which case we can try $\hat{\beta}(\text{ridge})$ from the best ridge regression fit, because it is more stable than $\hat{\beta}(\text{ols})$.
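A sketch of the two-dimensional cross-validation just described, assuming the kl_adaptive_lasso helper from the previous sketch is in scope; the grid values and the use of scikit-learn's KFold are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

def tune_kl_adaptive_lasso(X, y, gammas=(0.5, 1.0, 2.0),
                           lams=(0.1, 1.0, 10.0, 100.0), n_splits=5):
    """Pick (gamma, lambda) minimizing cross-validated prediction error."""
    best = (None, None, np.inf)
    for gamma in gammas:
        for lam in lams:
            errs = []
            for tr, te in KFold(n_splits=n_splits, shuffle=True,
                                random_state=0).split(X):
                beta = kl_adaptive_lasso(X[tr], y[tr], lam=lam, gamma=gamma)
                errs.append(np.mean((y[te] - X[te] @ beta) ** 2))
            if np.mean(errs) < best[2]:
                best = (gamma, lam, np.mean(errs))
    return best  # (best gamma, best lambda, its CV error)
```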
4 Numerical example: diabetes data
From Efron et al. (2004), we have:
“Ten baseline variables, age, sex, body mass index (bmi), average blood
pressure (map), and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu)
were obtained for each of n = 442 diabetes patients, as well as the response
of interest (y), a quantitative measure of disease progression one year after
baseline.”
By applying our KL adaptive lasso variable selection methodology on this
data, we get the regression coefficient estimates for the predictor variables
as follows:
age 0.0000
sex −201.6889
bmi 540.5075
map 314.5000
tc −514.2194
ldl 268.7080
hdl 0.0000
tch 119.6211
ltg 682.5415
glu 0.0000
From these estimates it is clear that the variables "age", "hdl" and "glu" do not have any significant influence on the response. Hence, we select only the remaining seven variables to predict the response. This result agrees with the results obtained by applying the lasso and the adaptive lasso to the same data.
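A hypothetical reproduction of this analysis, assuming the kl_adaptive_lasso sketch from Section 3 is in scope: scikit-learn ships a copy of the same diabetes data (labelling blood pressure bp and the serum measurements s1-s6), and the exact estimates depend on the standardization and on the tuned $(\gamma, \lambda_n)$, so the numbers need not match the table above.

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()   # 442 patients, 10 baseline variables
X = (data.data - data.data.mean(0)) / data.data.std(0)   # standardized predictors
y = data.target - data.target.mean()                     # centered response

beta = kl_adaptive_lasso(X, y, lam=500.0, gamma=1.0)      # lam chosen for illustration
for name, b in zip(data.feature_names, beta):
    print(f"{name:>4s} {b:10.4f}")   # names: age, sex, bmi, bp, s1-s6
```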
5 Further extension
Having shown the oracle properties of the KL adaptive lasso in linear
regression models, we have further extended the theory and methodology to
generalized linear models (GLMs). We consider the penalized KL divergence loss function using the adaptively weighted l1 penalty, where the density belongs to the exponential family with canonical parameter $\theta$. The generic density form can be written as (McCullagh and Nelder, 1989)
$$f(y \mid x, \theta) = h(y)\exp(y\theta - \phi(\theta)).$$
Generalized linear models assume that $\theta = x^T\beta^*$. Suppose that $\hat{\beta}$ is the maximum likelihood estimate (mle) of $\beta^*$ in the GLM. We construct the weight vector $\hat{w} = 1/|\hat{\beta}|^\gamma$ for some $\gamma > 0$.
Suppose that $f(y_i \mid x_i, \beta^*)$ and $f(y_i \mid x_i, \beta)$ are the exponential family densities evaluated at $\beta^*$ and $\beta$ respectively. Note that $E_{\beta^*}(y_i \mid x_i) = \phi'(x_i^T\beta^*)$, where $\phi'(\cdot)$ is the first derivative of $\phi(\cdot)$. Then the KL adaptive lasso estimates $\hat{\beta}^{KL}(\mathrm{glm})$ are given by
$$\hat{\beta}^{KL}(\mathrm{glm}) = \arg\min \left\{ \sum_{i=1}^n E_{\beta^*}\left[\log\frac{f(y_i \mid x_i, \beta^*)}{f(y_i \mid x_i, \beta)}\right] + \lambda_n \sum_j \hat{w}_j|\beta_j| \right\}$$
$$= \arg\min \left\{ \sum_{i=1}^n \big\{\phi'(x_i^T\beta^*)(x_i^T\beta^* - x_i^T\beta) - \phi(x_i^T\beta^*) + \phi(x_i^T\beta)\big\} + \lambda_n \sum_j \hat{w}_j|\beta_j| \right\}. \qquad (5.4)$$
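The second equality in (5.4) follows from a short calculation worth recording: with canonical parameters $\theta_i^* = x_i^T\beta^*$ and $\theta_i = x_i^T\beta$,
$$E_{\beta^*}\left[\log\frac{f(y_i \mid x_i, \beta^*)}{f(y_i \mid x_i, \beta)}\right] = E_{\beta^*}\big[y_i(\theta_i^* - \theta_i) - \phi(\theta_i^*) + \phi(\theta_i)\big] = \phi'(\theta_i^*)(\theta_i^* - \theta_i) - \phi(\theta_i^*) + \phi(\theta_i),$$
using $E_{\beta^*}(y_i \mid x_i) = \phi'(\theta_i^*)$.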
We need to replace the true but unknown $\beta^*$ by a root-$n$-consistent estimator of $\beta^*$. Hence, we replace $\beta^*$ by $\hat{\beta}$ (mle) in the KL divergence loss function. Thus, equation (5.4) becomes
$$\hat{\beta}^{KL}(\mathrm{glm}) = \arg\min \left\{ \sum_{i=1}^n \big\{\phi'(x_i^T\hat{\beta})(x_i^T\hat{\beta} - x_i^T\beta) - \phi(x_i^T\hat{\beta}) + \phi(x_i^T\beta)\big\} + \lambda_n \sum_j \hat{w}_j|\beta_j| \right\}. \qquad (5.5)$$
For logistic regression, $\phi(\theta) = \log(1 + e^\theta)$, so (5.5) becomes
$$\hat{\beta}^{KL}(\mathrm{logistic}) = \arg\min \left\{ \sum_{i=1}^n \big\{\phi'(x_i^T\hat{\beta})(x_i^T\hat{\beta} - x_i^T\beta) - \log(1 + \exp(x_i^T\hat{\beta})) + \log(1 + \exp(x_i^T\beta))\big\} + \lambda_n \sum_j \hat{w}_j|\beta_j| \right\}.$$
For Poisson log-linear regression models, $\phi(\theta) = e^\theta$, and (5.5) becomes
$$\hat{\beta}^{KL}(\mathrm{poisson}) = \arg\min \left\{ \sum_{i=1}^n \big\{\phi'(x_i^T\hat{\beta})(x_i^T\hat{\beta} - x_i^T\beta) - \exp(x_i^T\hat{\beta}) + \exp(x_i^T\beta)\big\} + \lambda_n \sum_j \hat{w}_j|\beta_j| \right\}.$$
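Equation (5.5) is a convex problem (its Hessian is computed just below), so any convex solver can be used. Here is a minimal proximal-gradient (ISTA) sketch; this solver choice is mine for illustration and is not prescribed by the paper, the fixed step size is a simplification (it must be small enough, or be replaced by a line search), and the mean function phi_prime, the MLE beta_mle and the penalty level lam are supplied by the user.

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding: the proximal operator of t * |.|_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def kl_adaptive_lasso_glm(X, beta_mle, lam, phi_prime, gamma=1.0,
                          step=1e-3, n_iter=5000):
    """Proximal-gradient sketch for the penalized KL loss (5.5).

    phi_prime is the mean function phi', e.g.
      logistic: phi_prime = lambda t: 1.0 / (1.0 + np.exp(-t))
      Poisson:  phi_prime = np.exp
    """
    w = 1.0 / np.abs(beta_mle) ** gamma        # adaptive weights from the MLE
    mu_hat = phi_prime(X @ beta_mle)           # fitted means phi'(x_i' beta_hat)
    beta = beta_mle.copy()
    for _ in range(n_iter):
        # gradient of KL_n(beta): X' (phi'(X beta) - phi'(X beta_hat))
        grad = X.T @ (phi_prime(X @ beta) - mu_hat)
        beta = soft_threshold(beta - step * grad, step * lam * w)
    return beta
```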
Let $KL_n(\beta) = \sum_{i=1}^n \big\{\phi'(x_i^T\hat{\beta})(x_i^T\hat{\beta} - x_i^T\beta) - \phi(x_i^T\hat{\beta}) + \phi(x_i^T\beta)\big\}$. Then
$$\frac{\partial^2 KL_n(\beta)}{\partial\beta\,\partial\beta^T} = \sum_{i=1}^n \phi''(x_i^T\beta)\, x_i x_i^T,$$
which is positive definite, since the variance function $\phi''(x_i^T\beta) > 0$.
Now, let $kl_n(\beta) = KL_n(\beta) + \lambda_n \sum_j \hat{w}_j|\beta_j|$. Since we have used a convex penalty, $kl_n(\beta)$ is convex in $\beta$, and hence the local minimizer $\hat{\beta}^{KL}(\mathrm{glm})$ is the unique global KL adaptive lasso estimator.
Assume that the true model has a sparse representation. Without loss of generality, let $A = \{j : \beta^*_j \neq 0\} = \{1, 2, \ldots, p_0\}$ with $p_0 < p$. Let $I_{11}$ be the $p_0 \times p_0$ upper-left partitioned sub-matrix of the Fisher information matrix $I(\beta^*)$; then $I_{11}$ is the Fisher information with the true sub-model known.
We show that under some mild regularity conditions, the KL adaptive lasso estimator $\hat{\beta}^{KL}(\mathrm{glm})$ enjoys the oracle properties if $\lambda_n$ is chosen appropriately.
Theorem 5.1. Let $A_n = \{j : \hat{\beta}^{KL}_j(\mathrm{glm}) \neq 0\}$. Suppose that $\lambda_n/\sqrt{n} \to 0$ and $\lambda_n n^{(\gamma-1)/2} \to \infty$. Then, the KL adaptive lasso estimates $\hat{\beta}^{KL}(\mathrm{glm})$ must satisfy the following:
1. Consistency in variable selection: $\lim_n P(A_n = A) = 1$.
2. Asymptotic normality: $\sqrt{n}(\hat{\beta}^{KL}_A(\mathrm{glm}) - \beta^*_A) \to_d N(0, I_{11}^{-1})$.
Proof. We assume the following regularity conditions:
1. The Fisher information matrix $I(\beta^*) = E[\phi''(x^T\beta^*)\, x x^T]$ is finite and positive definite.
2. There is a sufficiently large open set $O$ containing $\beta^*$ such that $\forall \beta \in O$,
$$|\phi'''(x^T\beta)| \leq M(x) < \infty$$
and
$$E\big[M(x)\,|x_j x_k x_l|\big] < \infty \quad \forall\, 1 \leq j, k, l \leq p.$$
We first prove the asymptotic normality part. Recall that
$$\hat{\beta}^{KL}(\mathrm{glm}) = \arg\min \left\{ \sum_{i=1}^n \big\{\phi'(x_i^T\hat{\beta})(x_i^T\hat{\beta} - x_i^T\beta) - \phi(x_i^T\hat{\beta}) + \phi(x_i^T\beta)\big\} + \lambda_n \sum_j \hat{w}_j|\beta_j| \right\}.$$
Let $\beta = \beta^* + \frac{u}{\sqrt{n}}$, $u \in \mathbb{R}^p$. Define:
$$\Gamma_n(u) = \sum_{i=1}^n \left\{ x_i^T\Big(\hat{\beta} - \beta^* - \frac{u}{\sqrt{n}}\Big)\phi'(x_i^T\hat{\beta}) - \phi(x_i^T\hat{\beta}) + \phi\Big(x_i^T\Big(\beta^* + \frac{u}{\sqrt{n}}\Big)\Big) \right\} + \lambda_n \sum_j \hat{w}_j \left|\beta^*_j + \frac{u_j}{\sqrt{n}}\right|$$
and
$$\Gamma_n(0) = \sum_{i=1}^n \left\{ x_i^T(\hat{\beta} - \beta^*)\phi'(x_i^T\hat{\beta}) - \phi(x_i^T\hat{\beta}) + \phi(x_i^T\beta^*) \right\} + \lambda_n \sum_j \hat{w}_j |\beta^*_j|.$$
Let $\hat{u}^{(n)} = \arg\min \Gamma_n(u) = \arg\min\{\Gamma_n(u) - \Gamma_n(0)\}$. Then $\hat{u}^{(n)} = \sqrt{n}(\hat{\beta}^{KL}(\mathrm{glm}) - \beta^*)$. Let
$$H^{(n)}(u) = \Gamma_n(u) - \Gamma_n(0) = \sum_{i=1}^n \left\{ x_i^T\Big(-\frac{u}{\sqrt{n}}\Big)\phi'(x_i^T\hat{\beta}) + \phi\Big(x_i^T\beta^* + \frac{x_i^T u}{\sqrt{n}}\Big) - \phi(x_i^T\beta^*) \right\} + \frac{\lambda_n}{\sqrt{n}} \sum_j \hat{w}_j \sqrt{n}\left(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\right).$$
Notice that, by a Taylor expansion,
$$\phi\Big(x_i^T\beta^* + \frac{x_i^T u}{\sqrt{n}}\Big) = \phi(x_i^T\beta^*) + \phi'(x_i^T\beta^*)\frac{x_i^T u}{\sqrt{n}} + \frac{1}{2}\phi''(x_i^T\beta^*)\frac{u^T(x_i x_i^T)u}{n} + \frac{n^{-3/2}}{6}\phi'''(x_i^T\beta^{**})(x_i^T u)^3,$$
where $\beta^{**}$ is between $\beta^*$ and $\beta^* + \frac{u}{\sqrt{n}}$.
Hence,
$$H^{(n)}(u) = \sum_{i=1}^n \left\{ \big(\phi'(x_i^T\beta^*) - \phi'(x_i^T\hat{\beta})\big)\frac{x_i^T u}{\sqrt{n}} + \frac{1}{2}\phi''(x_i^T\beta^*)\frac{u^T(x_i x_i^T)u}{n} + \frac{n^{-3/2}}{6}\phi'''(x_i^T\beta^{**})(x_i^T u)^3 \right\} + \frac{\lambda_n}{\sqrt{n}}\sum_j \hat{w}_j\sqrt{n}\left(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\right)$$
$$= A^{(n)}_1 + A^{(n)}_2 + A^{(n)}_3 + A^{(n)}_4, \quad \text{say.}$$
Now,
$$A^{(n)}_1 = -\sum_{i=1}^n \big[\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\beta^*)\big]\frac{x_i^T u}{\sqrt{n}} = -\sum_{i=1}^n \big[\phi''(x_i^T\beta^*)(x_i^T\hat{\beta} - x_i^T\beta^*)\big]\frac{x_i^T u}{\sqrt{n}} - \frac{1}{2}\sum_{i=1}^n \big[\phi'''(x_i^T\hat{\beta}^{**})(x_i^T\hat{\beta} - x_i^T\beta^*)^2\big]\frac{x_i^T u}{\sqrt{n}} = -A^{(n)}_{11} - A^{(n)}_{12}, \quad \text{say,}$$
where $\hat{\beta}^{**}$ is between $\hat{\beta}$ and $\beta^*$. Notice that
$$A^{(n)}_{11} = \sum_{i=1}^n \big[\phi''(x_i^T\beta^*)(x_i^T\hat{\beta} - x_i^T\beta^*)\big]\frac{x_i^T u}{\sqrt{n}} = u^T\left[\frac{1}{n}\sum_{i=1}^n \phi''(x_i^T\beta^*)\, x_i x_i^T\right]W'_n,$$
where $W'_n = \sqrt{n}(\hat{\beta} - \beta^*) \to_d W'$, with $W' \sim N(0, [I(\beta^*)]^{-1})$. Also, $\frac{1}{n}\sum_{i=1}^n \phi''(x_i^T\beta^*)x_i x_i^T \to_p I(\beta^*)$ (by the WLLN). Thus, by Slutsky's theorem,
$$A^{(n)}_{11} \to_d u^T I(\beta^*)W' = u^T W,$$
where $W \sim N(0, I(\beta^*))$.
Consider:
$$A^{(n)}_{12} = \frac{1}{2}\sum_{i=1}^n \big[\phi'''(x_i^T\hat{\beta}^{**})(x_i^T\hat{\beta} - x_i^T\beta^*)^2\big]\frac{x_i^T u}{\sqrt{n}} = \frac{1}{2\sqrt{n}}\, W_n'^T\left[\frac{1}{n}\sum_{i=1}^n \phi'''(x_i^T\hat{\beta}^{**})\, x_i x_i^T\, (x_i^T u)\right]W'_n,$$
so that
$$\big|A^{(n)}_{12}\big| \leq \frac{1}{2\sqrt{n}}\, W_n'^T\left[\frac{1}{n}\sum_{i=1}^n M(x_i)\, x_i x_i^T\, |x_i^T u|\right]W'_n,$$
by regularity condition 2, since $\hat{\beta}^{**} \in O$. Also $W'_n = O_p(1)$ since $W'_n = \sqrt{n}(\hat{\beta} - \beta^*) \to_d W'$. By the WLLN,
$$\frac{1}{n}\sum_{i=1}^n M(x_i)\, x_i x_i^T\, |x_i^T u| \to_p E\big[M(x)\, x x^T\, |x^T u|\big].$$
Notice that the $(i,j)$th element of the $p \times p$ matrix $E[M(x)\, x x^T\, |x^T u|]$, for $1 \leq i, j \leq p$, satisfies
$$\big(\big(E[M(x)\, x x^T\, |x^T u|]\big)\big)_{i,j} = E\big[M(x)\, x_i x_j\, |x^T u|\big] \leq \sum_{k=1}^p E\big[M(x)\,|x_i|\,|x_j|\,|x_k|\,|u_k|\big] = \sum_{k=1}^p |u_k|\, E\big[M(x)\,|x_i x_j x_k|\big] < \infty,$$
again by regularity condition 2. Hence,
$$\frac{1}{n}\sum_{i=1}^n M(x_i)\, x_i x_i^T\, |x_i^T u| = O_p(1).$$
Thus $A^{(n)}_{12} \to_p 0$, which implies that $A^{(n)}_1 = -A^{(n)}_{11} - A^{(n)}_{12} \to_d -u^T W$.
Now for the second term $A^{(n)}_2$, we observe that $\frac{1}{n}\sum_{i=1}^n \phi''(x_i^T\beta^*)x_i x_i^T \to_p I(\beta^*)$. Thus, by Slutsky's theorem, $A^{(n)}_2 \to_p \frac{1}{2}u^T I(\beta^*)u$.
Since $\hat{\beta}^{**} \in O$, by regularity condition 2 the third term $A^{(n)}_3$ can be bounded as
$$6\sqrt{n}\,\big|A^{(n)}_3\big| \leq \frac{1}{n}\sum_{i=1}^n M(x_i)\,|x_i^T u|^3 \to_p E\big[M(x)\,|x^T u|^3\big] < \infty.$$
Hence, $A^{(n)}_3 \to_p 0$.
The limiting behavior of the fourth term $A^{(n)}_4$ was already discussed in the proof of Theorem 2.1. We summarize the results as follows:
$$\frac{\lambda_n}{\sqrt{n}}\hat{w}_j\sqrt{n}\left(\Big|\beta^*_j + \frac{u_j}{\sqrt{n}}\Big| - |\beta^*_j|\right) \to_p \begin{cases} 0, & \text{if } \beta^*_j \neq 0, \\ 0, & \text{if } \beta^*_j = 0 \ \text{and} \ u_j = 0, \\ \infty, & \text{if } \beta^*_j = 0 \ \text{and} \ u_j \neq 0. \end{cases}$$
Thus, by Slutsky's theorem, we see that $H^{(n)}(u) \to_d H(u)$ for every $u$, where
$$H(u) = \tfrac{1}{2}\, u_A^T I_{11} u_A - u_A^T W_A \quad \text{if } u_j = 0 \ \forall j \notin A, \qquad H(u) = \infty \quad \text{otherwise},$$
with $W \sim N(0, I(\beta^*))$. $H^{(n)}$ is convex and the unique minimum of $H$ is $(I_{11}^{-1}W_A, 0)^T$. Then, we have
$$\hat{u}^{(n)}_A \to_d I_{11}^{-1}W_A \quad \text{and} \quad \hat{u}^{(n)}_{A^c} \to_d 0.$$
Because $W_A \sim N(0, I_{11})$, the asymptotic normality part is proven. Now we show the consistency part. $\forall j \in A$, the asymptotic normality indicates that $P(j \in A_n) \to 1$. Then it suffices to show that $\forall j' \notin A$, $P(j' \in A_n) \to 0$. Consider the event $j' \in A_n$. By the KKT optimality conditions, we must have
$$\sum_{i=1}^n x_{ij'}\big(\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\hat{\beta}^{KL}(\mathrm{glm}))\big) = \lambda_n \hat{w}_{j'};$$
thus $P(j' \in A_n) \leq P\big(\sum_{i=1}^n x_{ij'}\big(\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\hat{\beta}^{KL}(\mathrm{glm}))\big) = \lambda_n \hat{w}_{j'}\big)$.
Note that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n x_{ij'}\big(\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\hat{\beta}^{KL}(\mathrm{glm}))\big) = B^{(n)}_1 + B^{(n)}_2 + B^{(n)}_3,$$
with
$$B^{(n)}_1 = \frac{1}{\sqrt{n}}\sum_{i=1}^n x_{ij'}\big(\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\beta^*)\big),$$
$$B^{(n)}_2 = \left(\frac{1}{n}\sum_{i=1}^n x_{ij'}\,\phi''(x_i^T\beta^*)\, x_i^T\right)\sqrt{n}\big(\beta^* - \hat{\beta}^{KL}(\mathrm{glm})\big)$$
and
$$B^{(n)}_3 = \frac{1}{2n\sqrt{n}}\sum_{i=1}^n x_{ij'}\,\phi'''(x_i^T\hat{\beta}^{***})\Big(x_i^T\sqrt{n}\big(\beta^* - \hat{\beta}^{KL}(\mathrm{glm})\big)\Big)^2,$$
where $\hat{\beta}^{***}$ is between $\hat{\beta}^{KL}(\mathrm{glm})$ and $\beta^*$.
Now,
$$B^{(n)}_1 = \frac{1}{\sqrt{n}}\sum_{i=1}^n x_{ij'}\big(\phi'(x_i^T\hat{\beta}) - \phi'(x_i^T\beta^*)\big) = \sum_{i=1}^n \big[\phi''(x_i^T\beta^*)(x_i^T\hat{\beta} - x_i^T\beta^*)\big]\frac{x_{ij'}}{\sqrt{n}} + \frac{1}{2}\sum_{i=1}^n \big[\phi'''(x_i^T\hat{\beta}^{***})(x_i^T\hat{\beta} - x_i^T\beta^*)^2\big]\frac{x_{ij'}}{\sqrt{n}} = B^{(n)}_{11} + B^{(n)}_{12}, \quad \text{say.}$$
Notice that
$$B^{(n)}_{11} = \sum_{i=1}^n \big[\phi''(x_i^T\beta^*)(x_i^T\hat{\beta} - x_i^T\beta^*)\big]\frac{x_{ij'}}{\sqrt{n}} = \left[\frac{1}{n}\sum_{i=1}^n x_{ij'}\,\phi''(x_i^T\beta^*)\, x_i^T\right]W'_n.$$
Now, $\frac{1}{n}\sum_{i=1}^n x_{ij'}\,\phi''(x_i^T\beta^*)\, x_i^T \to_p I_{j'}$, where $I_{j'}$ is the $j'$th row of $I(\beta^*)$, and $W'_n \to_d W'$. Thus, by Slutsky's theorem, we have
$$B^{(n)}_{11} \to_d I_{j'}W' \sim N\big(0,\ I_{j'}[I(\beta^*)]^{-1}I_{j'}^T\big).$$
Also,
$$B^{(n)}_{12} = \frac{1}{2}\sum_{i=1}^n \big[\phi'''(x_i^T\hat{\beta}^{***})(x_i^T\hat{\beta} - x_i^T\beta^*)^2\big]\frac{x_{ij'}}{\sqrt{n}} = \frac{1}{2\sqrt{n}}\, W_n'^T\left[\frac{1}{n}\sum_{i=1}^n \phi'''(x_i^T\hat{\beta}^{***})\, x_i x_i^T\, x_{ij'}\right]W'_n,$$
so that
$$\big|B^{(n)}_{12}\big| \leq \frac{1}{2\sqrt{n}}\, W_n'^T\left[\frac{1}{n}\sum_{i=1}^n M(x_i)\, x_i x_i^T\, |x_{ij'}|\right]W'_n,$$
by regularity condition 2, since $\hat{\beta}^{***} \in O$. We know that $W'_n = O_p(1)$, and by the WLLN,
$$\frac{1}{n}\sum_{i=1}^n M(x_i)\, x_i x_i^T\, |x_{ij'}| \to_p E\big[M(x)\, x x^T\, |x_{j'}|\big],$$
which implies $\frac{1}{n}\sum_{i=1}^n M(x_i)\, x_i x_i^T\, |x_{ij'}| = O_p(1)$. Hence, $B^{(n)}_{12} \to_p 0$.
Since $\frac{1}{n}\sum_{i=1}^n x_{ij'}\,\phi''(x_i^T\beta^*)\, x_i^T \to_p I_{j'}$, the asymptotic normality part implies that $B^{(n)}_2$ converges in distribution to some normal random variable. Meanwhile, we have
$$\frac{\lambda_n \hat{w}_{j'}}{\sqrt{n}} = \frac{\lambda_n}{\sqrt{n}}\, n^{\gamma/2}\, \frac{1}{|\sqrt{n}\hat{\beta}_{j'}|^\gamma} \to_p \infty,$$
since $\lambda_n n^{(\gamma-1)/2} \to \infty$ for $\gamma > 0$ by assumption, and since $j' \notin A$ implies $\beta^*_{j'} = 0$, so that $\sqrt{n}(\hat{\beta}_{j'} - 0) \to_d$ some normal random variable and hence $|\sqrt{n}\hat{\beta}_{j'}|^\gamma = O_p(1)$. Similarly, $B^{(n)}_3 \to_p 0$, so the left-hand side of the KKT equation, divided by $\sqrt{n}$, is $O_p(1)$ while the right-hand side diverges. Hence, $P(j' \in A_n) \to 0$. This completes the proof. ◻
6 Conclusion
In this article we have proposed the KL adaptive lasso for simultaneous estimation and variable selection. We have shown that the KL adaptive lasso enjoys the oracle properties by utilizing the adaptively weighted l1 penalty. Owing to the efficient path algorithm, the KL adaptive lasso also enjoys the computational advantage of the lasso. Our numerical example has shown that the KL adaptive lasso performs similarly to the lasso and the adaptive lasso. In future work, we plan to compare the prediction accuracy of the KL adaptive lasso with other existing sparse modeling techniques such as the lasso, the adaptive lasso and the non-negative garotte, reporting the prediction error $E[(\hat{y} - y_{test})^2]$. We would also like to extend our KL divergence based variable selection methodology to the high-dimensional regime (i.e., when the number of regressors $p$ grows to infinity at a certain rate relative to the sample size $n$) as well as to a survival analysis context, where the data must be modified to account for censoring, and to investigate the oracle properties specified above in those settings.
Acknowledgments
The author would like to thank Prof. Malay Ghosh and Prof. Kshitij
Khare for their help with the paper.
References
[1] Breiman, L. (1995a). Better subset regression using the
nonnegative garrote, Technometrics, 37, 373–384.
[2] Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian
Theory, John Wiley, New York.
[3] Draper, N. R. and Smith, H. (1998). Applied Regression
Analysis, third edition, John Wiley, New York.
[4] Efron, B., Hastie, T., Johnstone, I., and Tibshirani,
R. (2004). Least angle regression, The Annals of Statistics,
32, 407–499.
[5] Fan, J. and Li, R. (2001). Variable selection via nonconcave
penalized likelihood and its oracle properties, Journal of the
American Statistical Association, 96, 1348–1360.
[6] Fan, J. and Li, R. (2006). Statistical challenges
with high dimensionality: Feature selection in knowledge
discovery, Proceedings of the Madrid International Congress
of Mathematicians.
[7] Fan, J. and Peng, H. (2004). On nonconcave penalized
likelihood with diverging number of parameters, The Annals
of Statistics, 32, 928–961.
[8] Geyer, C. (1994). On the asymptotics of constrained M-
estimation, The Annals of Statistics, 22, 1993–2010.
[9] Hastie, T., Tibshirani, R. and Friedman, J. H.
(2009). The Elements of Statistical Learning, second edition,
Springer-Verlag, New York.
[10] Hoerl, A. E. and Kennard, R. W. (1970). Ridge
regression: Biased estimation for nonorthogonal problems,
Technometrics, 12(1), 55–67.
[11] Knight, K. and Fu, W. (2000). Asymptotics for lasso-type
estimators, The Annals of Statistics, 28, 1356–1378.
[12] McCullagh, P. and Nelder, J. (1989). Generalized Linear
Models, Second edition, Chapman & Hall, New York.
[13] Tibshirani, R. (1996). Regression shrinkage and selection
via the lasso, Journal of the Royal Statistical Society, Ser. B,
58, 267–288.
[14] Zou, H. (2006). The adaptive lasso and its oracle
properties, Journal of the American Statistical Association,
101(476), 1418–1429.
Shibasish Dasgupta
Department of Mathematics and Statistics
University of South Alabama
411 University Boulevard North
Mobile, AL 36688-0002, U.S.A.
E-mail: sdasgupta@southalabama.edu