Minghui Conference Cross-Validation Talk

Challenges with the Use of Cross-validation for
Comparing Structured Models
Wei Wang
joint work with Andrew Gelman
Department of Statistics, Columbia University
April 13, 2013

Overview
1. Multilevel Models
2. Decision-Theoretic Model Assessment Framework
3. Data and Model
4. Results

Bayesian Interpretation of Multilevel Models
Multilevel Models have long been proposed to handle data with
group structures, e.g., a longitudinal study with multiple observations
for each participant, or a national survey with various demographic and
geographic variables.
From a Bayesian point of view, Multilevel Modeling partially pools
the estimates through a prior, as opposed to analyzing each group
separately (no pooling) or analyzing the data as if there were no
group structure (complete pooling).
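The three pooling regimes can be illustrated with a minimal Python sketch for group means. This is not the model from the talk: the function name `pooled_estimates`, the toy data, and the `prior_strength` constant (the number of pseudo-observations a prior centered at the grand mean is assumed to contribute) are all hypothetical.

```python
from statistics import mean

def pooled_estimates(groups, prior_strength):
    """Return (no_pooling, complete_pooling, partial_pooling) group means.

    prior_strength is a hypothetical constant: the weight, in
    pseudo-observations, of a prior centered at the grand mean.
    """
    all_obs = [y for ys in groups.values() for y in ys]
    grand_mean = mean(all_obs)
    # No pooling: each group is analyzed on its own.
    no_pool = {g: mean(ys) for g, ys in groups.items()}
    # Complete pooling: the group structure is ignored.
    complete = {g: grand_mean for g in groups}
    # Partial pooling: a precision-weighted compromise between the two.
    partial = {
        g: (len(ys) * no_pool[g] + prior_strength * grand_mean)
        / (len(ys) + prior_strength)
        for g, ys in groups.items()
    }
    return no_pool, complete, partial

# Toy data (made up): group "C" has a single observation.
groups = {"A": [1.0, 3.0], "B": [4.0, 6.0, 5.0, 5.0], "C": [10.0]}
no_pool, complete, partial = pooled_estimates(groups, prior_strength=2.0)
for g in groups:
    print(g, no_pool[g], round(partial[g], 3), round(complete[g], 3))
```

Small groups such as "C" are pulled most strongly toward the grand mean, which is the behavior partial pooling is meant to capture.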

Multilevel Models for Deeply Nested Data Structure
Our substantive interest is survey data with deeply nested
structures resulting from various categorical
demographic-geographic variables, e.g., state, income, education,
ethnicity, etc.
One typical conundrum is how many interactions among those
demographic-geographic variables to include in the model.

Three Prototypes of Models
In the simple case of two predictors, the three prototypes of models are
shown below. The response $y_{ij}$ is binary.
Complete Pooling model
$$\mathrm{E}(y_{ij}) = g^{-1}(\mu_{ij}), \qquad \mu_{ij} = \mu_0 + a_i + b_j$$
No Pooling model
$$\mathrm{E}(y_{ij}) = g^{-1}(\mu_{ij}), \qquad \mu_{ij} = \mu_0 + a_i + b_j + r_{ij}$$
Partial Pooling model
$$\mathrm{E}(y_{ij}) = g^{-1}(\mu_{ij}), \qquad \mu_{ij} = \mu_0 + a_i + b_j + \gamma_{ij}, \qquad \gamma_{ij} \sim \Phi(\cdot)$$

True model, Pseudo-true model and Actual Belief model
We assume there is a true underlying model $p_t(\cdot)$, from which the
observations (both available and future) come.
While acknowledging that the true distribution is never
accessible, some researchers propose basing the discussion on a
rich enough Actual Belief Model, which supposedly fully reflects
the uncertainty of future data (Bernardo and Smith 1994).

M-closed, M-completed and M-open views
In the M-closed view, it is assumed that the true model is included
in an enumerable collection of models, and the Actual Belief
Model is the Bayesian Model Averaging predictive distribution.
In the M-completed view, the Actual Belief Model $p(\tilde{y} \mid D, M)$ is
considered to be the best available description of the uncertainty
of future data.
In the M-open view, the correct specification of the Actual Belief
Model is avoided, and the strategy is to generate Monte Carlo
samples from it, for example through sample re-use methods.

A Decision-Theoretical Framework
We define a loss function $l(\tilde{y}, a_M)$, which is the loss incurred
from our inferential action $a_M$, based on a model $M$, in the face of
future observation $\tilde{y}$.
Then the predictive loss from our inferential action $a_M$ is
$$L_p(p_t, M, D, l) = \mathrm{E}_{p_t(\tilde{y})}\, l(\tilde{y}, a_M) = \int l(\tilde{y}, a_M)\, p_t(\tilde{y})\, d\tilde{y}$$
It is often convenient and theoretically desirable to use the whole
posterior predictive distribution as $a_M$ and the log loss as $l(\cdot, \cdot)$.
$$L_{\mathrm{pred}}(p_t, M, D) = \mathrm{E}_{p_t}\left[-\log p(\tilde{y} \mid D, M)\right] = -\int p_t(\tilde{y}) \log p(\tilde{y} \mid D, M)\, d\tilde{y}$$
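For a binary outcome the log-loss integral reduces to a two-term sum, so it can be evaluated exactly and compared with a Monte Carlo average over draws from the true distribution. The sketch below assumes a Bernoulli truth and a fixed predictive probability. The function names and the values 0.4 and 0.41 are illustrative choices, not part of the framework itself.

```python
import math
import random

def expected_log_loss(p_true, p_pred):
    """Exact E_{p_t}[-log p(y)] when y is binary with P(y = 1) = p_true."""
    return -(p_true * math.log(p_pred) + (1 - p_true) * math.log(1 - p_pred))

def monte_carlo_log_loss(p_true, p_pred, n_draws, seed=0):
    """Approximate the same expectation by averaging over draws from p_t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        y = 1 if rng.random() < p_true else 0
        total -= math.log(p_pred if y == 1 else 1 - p_pred)
    return total / n_draws

exact = expected_log_loss(0.4, 0.41)
approx = monte_carlo_log_loss(0.4, 0.41, n_draws=100_000)
print(f"exact={exact:.5f}  monte_carlo={approx:.5f}")
```

In practice the true distribution is unknown, so neither quantity is available directly. The sketch only makes the definition concrete.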

Decision-eoretic Framework Cont'd
For Model Selection task, from a pool of candidate models
{Mk : k ∈ K}, we should select the model that minimizes the
expected predictive loss.
min
Mk:k∈K
−
∫
pt
(˜y) log p(˜y|D, M)d˜y
For Model Assessment task of a particular model M, we look at
the Kullback-Leibler divergence between the true model and the
posterior predictive distribution. We call it the predictive error.
Err(pt
, M, D) = −
∫
pt
(˜y) log p(˜y|D, M)d˜y +
∫
pt
(˜y) log pt
(˜y)d˜y
= KL(p(·|D, M); pt
(·))

Estimating Expected Predictive Loss
The central obstacle to obtaining the Expected Predictive Loss is
that we don't know the true distribution $p_t(\cdot)$.
An M-closed or M-completed view will substitute the true
distribution with a reference distribution.

From an M-open view, plugging in the available sample gives us the
Training Loss, which has a downward bias, since we use the
sample twice.
$$L_{\mathrm{training}}(M, D) = -\frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid D, M)$$
There exist two approaches to get an unbiased estimate of
Predictive Loss: Bias Correction, which leads to various
Information Criteria, and Held-out Practices, which lead to
Leave-one-out Cross Validation and k-fold Cross Validation.
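The downward bias of the Training Loss can be seen in a small simulation. The following Python sketch assumes a single Bernoulli cell with a known true proportion and a smoothed plug-in estimate. The settings (`p_true=0.4`, `n=20`, the +0.5 smoothing) are arbitrary illustrative choices.

```python
import math
import random

def bernoulli_log_loss(p_pred, p_weight):
    """-[w log p + (1 - w) log(1 - p)]: log loss of p_pred under weight w."""
    return -(p_weight * math.log(p_pred)
             + (1 - p_weight) * math.log(1 - p_pred))

def average_losses(p_true=0.4, n=20, n_reps=2000, seed=1):
    """Average training loss and true predictive loss over repeated samples."""
    rng = random.Random(seed)
    train_total = 0.0
    pred_total = 0.0
    for _ in range(n_reps):
        successes = sum(rng.random() < p_true for _ in range(n))
        p_hat = (successes + 0.5) / (n + 1)  # smoothed to avoid log(0)
        # Training loss: evaluate p_hat on the same sample that produced it.
        train_total += bernoulli_log_loss(p_hat, successes / n)
        # Predictive loss: evaluate p_hat against the true distribution.
        pred_total += bernoulli_log_loss(p_hat, p_true)
    return train_total / n_reps, pred_total / n_reps

train_loss, pred_loss = average_losses()
print(f"training={train_loss:.4f}  predictive={pred_loss:.4f}")
```

On average the training loss sits below the predictive loss, which is the gap that information criteria and held-out methods try to correct.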

Estimation Methods
There is a long list of variants of Information Criteria:
AIC, BIC, DIC, TIC, NIC, WAIC, etc.

LOO Cross Validation has been shown to be asymptotically
equivalent to AIC/WAIC, but its computational burden is huge.
The Importance Sampling method introduces a new problem:
the reliability of the importance weights.
We use the computationally convenient k-fold cross
validation, in which the data set is randomly partitioned into k
parts, and in each fold, one part is used as the testing set while
the rest serve as the training set.
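The random partition described here can be sketched in a few lines of Python using only the standard library. The helper name `k_fold_splits` and its arguments are hypothetical. The talk does not say how the folds were actually generated.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k random folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k parts of nearly equal size.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

splits = list(k_fold_splits(n=10, k=3))
for train, test in splits:
    print(sorted(test), len(train))
```

Every observation appears in exactly one testing set, so each contributes exactly one held-out log predictive density to the estimate.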

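k-fold Cross Validation
Then the k-fold Cross Validation estimate of the Predictive Loss
is given by
$$L_{CV}(M, D) = -\frac{1}{N} \sum_{k=1}^{K} \sum_{i \in \mathrm{test}_k} \log p(y_i \mid D_k, M) = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid D^{(i)}, M)$$
where $D_k$ is the training data for fold $k$ and $D^{(i)}$ is the training data of the fold in which observation $i$ is held out.
To estimate the Predictive Error, we still need an estimate of the
Entropy of the true distribution. We can use the training loss of
the saturated model as a surrogate.
$$-\int p_t(\tilde{y}) \log p_t(\tilde{y})\, d\tilde{y} \approx -\frac{1}{n} \sum_{i=1}^{n} \log p(y_i \mid D, M_{\mathrm{saturated}})$$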
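For binary outcomes grouped into cells, the training loss of a saturated model has a closed form, sketched below in Python. The sketch assumes that "saturated" means each cell is predicted by its own observed proportion. The function names and the example counts are hypothetical.

```python
import math

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def saturated_training_loss(cells):
    """Training loss of the saturated model.

    cells: list of (successes, trials) pairs, one per cell, trials > 0.
    Each cell is predicted by its own observed proportion (an assumption
    about what the saturated model is).
    """
    total_n = sum(n for _, n in cells)
    loss = 0.0
    for y, n in cells:
        p = y / n
        loss -= xlogy(y, p) + xlogy(n - y, 1 - p)
    return loss / total_n

print(round(saturated_training_loss([(4, 10)]), 5))
print(saturated_training_loss([(0, 5), (5, 5)]))
```

Subtracting this quantity from the cross-validated loss gives the estimated Predictive Error. Cells with all-zero or all-one outcomes contribute zero, which shows why the surrogate can understate the entropy when cells are small.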

Data Set
Cooperative Congressional Election Survey (CCES) 2006
N = 30,000
71 social and political response outcomes
Deeply nested demographic variables, e.g., state, income, education,
ethnicity, gender, etc.

Data Set Cont'd
Figure: A sample of the questions in CCES 2006 survey.

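Model Setup
For demonstration, we only consider two demographic variables,
state and income, together with their interaction. The responses
are all yes-no binary outcomes.
Complete Pooling
$$\pi_{j_1 j_2} = \mathrm{logit}^{-1}\left(\beta^{\mathrm{stt}}_{j_1} + \beta^{\mathrm{inc}}_{j_2}\right)$$
No Pooling
$$\pi_{j_1 j_2} = \mathrm{logit}^{-1}\left(\beta^{\mathrm{stt}}_{j_1} + \beta^{\mathrm{inc}}_{j_2} + \beta^{\mathrm{stt*inc}}_{j_1 j_2}\right)$$
Partial Pooling
$$\pi_{j_1 j_2} = \mathrm{logit}^{-1}\left(\beta^{\mathrm{stt}}_{j_1} + \beta^{\mathrm{inc}}_{j_2} + \beta^{\mathrm{stt*inc}}_{j_1 j_2}\right), \qquad \beta^{\mathrm{stt*inc}}_{j_1 j_2} \sim \Phi(\cdot)$$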
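The difference between the three setups comes down to how the interaction term enters the linear predictor. The Python sketch below uses made-up coefficients for a single state-by-income cell. The `shrinkage` factor is a stylized, hypothetical stand-in for the pull toward zero that a prior on the interaction would produce. It is not the estimation procedure used in the talk.

```python
import math

def inv_logit(x):
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical coefficients for one state-by-income cell (not from CCES).
beta_state = 0.2
beta_income = -0.5
beta_interaction = 0.8
shrinkage = 0.25  # hypothetical: fraction of the interaction that survives

# Complete pooling: no interaction term at all.
pi_complete = inv_logit(beta_state + beta_income)
# No pooling: the interaction is estimated freely.
pi_no_pool = inv_logit(beta_state + beta_income + beta_interaction)
# Partial pooling: the interaction is shrunk toward zero.
pi_partial = inv_logit(beta_state + beta_income + shrinkage * beta_interaction)

print(round(pi_complete, 3), round(pi_partial, 3), round(pi_no_pool, 3))
```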

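k-fold Cross Validation Estimate
Due to computational constraints, we use the Maximum A
Posteriori plug-in estimate instead of the full Bayesian estimate.
$$p(\tilde{y} \mid D, M) \approx p(\tilde{y} \mid \hat{\pi}_{ij}(D), M)$$
Then under the aforementioned setup, the Cross Validation
estimate of the Predictive Loss is
$$
\begin{aligned}
L_{CV}(M, D) &= -\frac{1}{N} \sum_{k=1}^{K} \sum_{l \in \mathrm{test}_k} \log p(y_l \mid D_k, M) \\
&= -\frac{1}{N} \sum_{k=1}^{K} \sum_{i,j} \left[ y_{ij}^{\mathrm{test}_k} \log \hat{\pi}_{ij}(D_{\mathrm{train}_k}) + \left(n_{ij}^{\mathrm{test}_k} - y_{ij}^{\mathrm{test}_k}\right) \log\left(1 - \hat{\pi}_{ij}(D_{\mathrm{train}_k})\right) \right] \\
&= -\frac{1}{N} \sum_{i,j} \sum_{k=1}^{K} \left[ \log \hat{\pi}_{ij}(D_{\mathrm{train}_k})\, y_{ij}^{\mathrm{test}_k} + \log\left(1 - \hat{\pi}_{ij}(D_{\mathrm{train}_k})\right) \left(n_{ij}^{\mathrm{test}_k} - y_{ij}^{\mathrm{test}_k}\right) \right] \\
&= -\frac{1}{N} \sum_{i,j} \left[ \log \hat{\pi}_{ij}(D_{\mathrm{train}})\, y_{ij} + \log\left(1 - \hat{\pi}_{ij}(D_{\mathrm{train}})\right) (n_{ij} - y_{ij}) \right] \\
&= -\sum_{i,j} \frac{n_{ij}}{N} \left[ \log \hat{\pi}_{ij}(D_{\mathrm{train}})\, \tilde{\pi}_{ij} + \log\left(1 - \hat{\pi}_{ij}(D_{\mathrm{train}})\right) (1 - \tilde{\pi}_{ij}) \right]
\end{aligned}
$$
where $\tilde{\pi}_{ij} = y_{ij} / n_{ij}$ is the observed proportion in cell $(i, j)$.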
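The fold-by-cell form of this estimate (the second line of the derivation) can be sketched in Python as follows. The input layout, a list of per-fold dictionaries mapping a cell to its test successes, test count, and training-set plug-in estimate, is a hypothetical choice for illustration. The estimates are assumed to lie strictly between 0 and 1.

```python
import math

def cv_loss_by_cell(folds):
    """Cross-validated log loss from per-fold, per-cell summaries.

    folds: list of dicts mapping cell -> (y_test, n_test, pi_hat_train),
    where pi_hat_train is the plug-in estimate fitted on the training part
    of that fold and must satisfy 0 < pi_hat_train < 1.
    """
    total_n = sum(n for fold in folds for _, n, _ in fold.values())
    loss = 0.0
    for fold in folds:
        for y, n, pi_hat in fold.values():
            loss -= y * math.log(pi_hat) + (n - y) * math.log(1 - pi_hat)
    return loss / total_n

# One cell, two folds, identical training estimates (made-up numbers).
two_folds = [{("NY", "low"): (2, 5, 0.41)}, {("NY", "low"): (2, 5, 0.41)}]
one_fold = [{("NY", "low"): (4, 10, 0.41)}]
print(round(cv_loss_by_cell(two_folds), 5), round(cv_loss_by_cell(one_fold), 5))
```

When the training estimate for a cell is the same in every fold, the fold-level sums collapse to the aggregated cell counts, which mirrors the step that drops the fold index.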

Calibration of Improvement
Suppose we only have one cell, with true proportion 0.4.
The good model gives a posterior estimate of the log proportion at
roughly log(0.41), and the lesser model gives an estimate of
log(0.38) or log(0.44).
Then the Predictive Loss under the good model is
$-[0.4 \log(0.41) + 0.6 \log(0.59)] = 0.67322$, and under the two
lesser models it is $-[0.4 \log(0.38) + 0.6 \log(0.62)] = 0.67386$ and
$-[0.4 \log(0.44) + 0.6 \log(0.56)] = 0.67628$. The
improvement in the Predictive Loss is therefore between 0.0006 and 0.003.
Also, the lower bound is given by
$-[0.4 \log(0.4) + 0.6 \log(0.6)] = 0.67301$, so the Predictive Error
of the good model is about 0.0002.
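The arithmetic on this slide is easy to reproduce. The short Python check below evaluates the log loss for each candidate estimate against the true proportion 0.4 and subtracts the lower bound. The function name `log_loss` is just a local helper.

```python
import math

def log_loss(p_true, p_pred):
    """Expected log loss of predicting p_pred when the truth is p_true."""
    return -(p_true * math.log(p_pred) + (1 - p_true) * math.log(1 - p_pred))

p_true = 0.4
lower_bound = log_loss(p_true, 0.40)  # entropy of the true distribution
for p_pred in (0.41, 0.38, 0.44):
    loss = log_loss(p_true, p_pred)
    print(f"{p_pred:.2f}  loss={loss:.5f}  error={loss - lower_bound:.5f}")
```

The differences appear only in the fourth decimal place, which sets the scale on which gaps between models should be read.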

Cross Validation Results on All Outcomes
[Plot: Estimated Predictive Error (y-axis) against Responses, ordered by the lower bound (x-axis), for three models: complete pooling, partial pooling, no pooling.]
Figure: Measure of fit (Estimated Predictive Error) for all response outcomes
in CCES 2006 survey data. Responses are ordered by the lower bound
(training loss of the saturated model). The No Pooling model gives a very bad fit,
while the Predictive Error of Partial Pooling is dominated by that of Complete Pooling,
but the differences seem small.

Compare Partial Pooling and Complete Pooling
In the previous figure, No Pooling is apparently doing very badly,
but the differences between Partial Pooling and Complete
Pooling seem small. We need to further calibrate them.

The summary of the differences between Partial Pooling and
Complete Pooling for all the outcomes is
      Min.     1st Qu.      Median        Mean     3rd Qu.        Max.
-0.0003405   0.0001821   0.0003827   0.0006041   0.0005630   0.0053770
We can see that the improvement in terms of the Predictive Loss
indeed corresponds to some meaningful improvement in
prediction accuracy.

Simulations Based on Real Data
We want to explore how the structure of the multilevel models
affects the dynamics of the performance of different models.
Specifically, we are interested in total sample size and how
balanced the cells are in terms of cell size.
We generated simulated data sets based on the real data set, i.e.,
we use the estimates from the Multilevel model fit to the real data
set and enlarge the total sample size by 2, 3 and 4 times, either
keeping the original (highly unequal) relative proportions of
different cells or making the proportions roughly equal.
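A simulation of this kind might look like the Python sketch below. It draws binary outcomes for each cell from fitted cell probabilities, either scaling the original cell sizes or equalizing them. The function name, the cell labels, the probabilities, and the cell sizes are all hypothetical. The talk does not give the actual simulation code.

```python
import random

def simulate_cells(pi_hat, cell_sizes, factor, balanced, seed=0):
    """Simulate (successes, trials) per cell from fitted probabilities.

    pi_hat: dict cell -> fitted probability (assumed to come from a
        multilevel model fit).
    cell_sizes: dict cell -> original cell size.
    factor: integer multiplier applied to the total sample size.
    balanced: if True, make cell sizes roughly equal; otherwise keep the
        original relative proportions.
    """
    rng = random.Random(seed)
    total = factor * sum(cell_sizes.values())
    simulated = {}
    for cell, p in pi_hat.items():
        if balanced:
            n = total // len(pi_hat)
        else:
            n = factor * cell_sizes[cell]
        y = sum(rng.random() < p for _ in range(n))
        simulated[cell] = (y, n)
    return simulated

# Made-up inputs: three cells with highly unequal sizes.
pi_hat = {("NY", "low"): 0.4, ("NY", "high"): 0.6, ("TX", "low"): 0.3}
cell_sizes = {("NY", "low"): 50, ("NY", "high"): 5, ("TX", "low"): 5}
unbalanced = simulate_cells(pi_hat, cell_sizes, factor=2, balanced=False)
balanced = simulate_cells(pi_hat, cell_sizes, factor=2, balanced=True)
print(unbalanced)
print(balanced)
```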

Simulation Results: Total Sample Size
[Three plots, each showing complete pooling, partial pooling, and no pooling across the response outcomes.]
Figure: Estimated Predictive Error of all response outcomes for
"augmented" data sets.

Simulation Results: Total Sample Size on House Rep Vote
[Plot: Predictive Error against sample size (x-axis, 50,000 to 200,000) for complete pooling, partial pooling, and no pooling.]
Figure: Predictive Error of the three models as sample size grows. The
outcome under consideration is the Republican vote in the House election.

Simulation Results: Balancedness of the Structure
[Plot: Predictive Error by response outcome for complete pooling, partial pooling, and no pooling.]
Figure: Measure of fit (Predictive Error) for all responses, ordered by lower
bound. The data set is simulated from the real data set and has the same total
sample size as the real data set, but with all demographic-geographic cells kept
balanced.

Conclusions
Cross-validation is not a very sensitive instrument for comparing
multilevel models.
Careful calibrations are needed for better understanding of the
results.
We also explored how different aspects of the data set structure
affect the margin of improvement.
