From L to N: Nonlinear Predictors in Generalized Models

Heather Turner

Independent Statistical/R Consultant

owing much to
David Firth, University of Warwick

Heather Turner (Independent Consultant), From L to N, GLM 40 Years On 2012

From L to N

In a GLM we have

g(µ) = β0 + β1 x1 + ... + βp xp

and
Var(Y ) = φV (µ)
A generalized nonlinear model (GNM) is the same as a GLM
except that we have
g(µ) = η(x; β)
where η(x; β) is nonlinear in the parameters β.


Motivation

GNMs may be thought of as...
... an extension of Nonlinear Least Squares

using a nonlinear function of a continuous variable to model a
non-Gaussian response

... an extension of GLMs

using nonlinear functions of parameters to produce a more
parsimonious and interpretable model.


Example: Mental Health Status

The following contingency table cross-classifies a sample of 1660
residents of Manhattan by child's mental impairment and parents'
socioeconomic status (Agresti, 2002):

##     MHS
## SES well mild moderate impaired
##   A   64   94       58       46
##   B   57   94       54       40
##   C   57  105       65       60
##   D   72  141       77       94
##   E   36   97       54       78
##   F   21   71       54       71


Independence

A simple analysis of these data might be to test for independence of
MHS and SES using a chi-squared test.
This is equivalent to testing the goodness-of-fit of the independence
model
log(µrc ) = αr + βc

Such a test compares the independence model to the saturated model

log(µrc ) = αr + βc + γrc

which may be over-complex.
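As a rough check on the independence model, the sketch below computes the fitted values from the table margins and then the Pearson and likelihood-ratio (deviance) statistics. It is plain Python rather than the R code behind the slides, and the variable names are illustrative only. The deviance should agree with the residual deviance reported for the independence model in the analysis of deviance table in the next section.

```python
import math

# Counts from the table above: rows are SES A-F, columns are MHS
# (well, mild, moderate, impaired).
counts = [
    [64, 94, 58, 46],
    [57, 94, 54, 40],
    [57, 105, 65, 60],
    [72, 141, 77, 94],
    [36, 97, 54, 78],
    [21, 71, 54, 71],
]

n = sum(map(sum, counts))
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Fitted means under independence: mu_rc = (row total * column total) / n
fitted = [[r * c / n for c in col_totals] for r in row_totals]

pairs = [(o, e) for obs_row, fit_row in zip(counts, fitted)
         for o, e in zip(obs_row, fit_row)]
pearson = sum((o - e) ** 2 / e for o, e in pairs)
deviance = 2 * sum(o * math.log(o / e) for o, e in pairs)

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(f"n = {n}, df = {df}")
print(f"Pearson X^2 = {pearson:.2f}, deviance G^2 = {deviance:.2f}")
```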


Row-column Association
One intermediate model is the Row-Column association model:

log(µrc ) = αr + βc + φr ψc

(Goodman, 1979), an example of a multiplicative interaction model.
For the Mental Health data:
## Analysis of Deviance Table
##
## Model 1: Freq ~ SES + MHS
## Model 2: Freq ~ SES + MHS + Mult(SES, MHS)
## Model 3: Freq ~ SES + MHS + SES:MHS
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        15       47.4
## 2         8        3.6  7     43.8  2.3e-07
## 3         0        0.0  8      3.6     0.89


Parameterisation

The independence model was defined earlier in an over-parameterised
form:

\begin{align*}
\log(\mu_{rc}) &= \alpha_r + \beta_c \\
               &= (\alpha_r + 1) + (\beta_c - 1) \\
               &= \alpha_r^* + \beta_c^*
\end{align*}

Identifiability constraints may be imposed to fix a one-to-one
mapping between parameter values and distributions, and to enable
interpretation of parameters.
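The over-parameterisation can be seen numerically. The sketch below uses made-up row and column effects (the values and names are hypothetical, not estimates from the talk) to show that shifting the effects in opposite directions leaves every fitted mean unchanged, and that a corner-point constraint simply picks one member of that family.

```python
import math

# Hypothetical effects for a 2 x 2 table, for illustration only.
alpha = {"A": 0.3, "B": -0.2}        # row effects
beta = {"well": 1.0, "mild": 1.5}    # column effects


def fitted_means(alpha, beta):
    """Means of the independence model: log(mu_rc) = alpha_r + beta_c."""
    return {(r, c): math.exp(a + b)
            for r, a in alpha.items() for c, b in beta.items()}


# Shift the parameters as on the slide: alpha* = alpha + 1, beta* = beta - 1.
alpha_star = {r: a + 1 for r, a in alpha.items()}
beta_star = {c: b - 1 for c, b in beta.items()}

mu = fitted_means(alpha, beta)
mu_star = fitted_means(alpha_star, beta_star)
same_fit = all(math.isclose(mu[k], mu_star[k]) for k in mu)
print("identical fitted means:", same_fit)

# One possible identifiability constraint: set the first row effect to zero.
shift = alpha["A"]
alpha_c = {r: a - shift for r, a in alpha.items()}
beta_c = {c: b + shift for c, b in beta.items()}
mu_c = fitted_means(alpha_c, beta_c)
```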


Standard Implementation

The standard approach of all major statistical software packages is to
apply the identifiability constraints in the construction of the model

g(µ) = Xβ

so that rank(X) is equal to the number of parameters p.
Then the inverse in the score equations of the IWLS algorithm

\[
\beta^{(r+1)} = \left( X^T W^{(r)} X \right)^{-1} X^T W^{(r)} z^{(r)}
\]

exists.
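To make the IWLS update concrete, here is a minimal sketch for a Poisson model with log link and a full-rank design with two columns (intercept and one covariate), so the 2 x 2 inverse can be written out by hand. The data are toy values invented for illustration and the function name is hypothetical; this is not the implementation used by any particular package.

```python
import math


def iwls_poisson(x, y, iterations=25):
    """Fit log(mu) = b0 + b1 * x by iteratively weighted least squares."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start from the intercept-only fit
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Poisson with log link: working weights W = mu,
        # working response z = eta + (y - mu) / mu.
        w = mu
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Entries of X^T W X and X^T W z for X = (1, x).
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        # beta = (X^T W X)^{-1} X^T W z, using the explicit 2 x 2 inverse.
        det = s_w * s_wxx - s_wx ** 2
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1


# Toy data, invented for illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 3, 4, 9, 14]

b0, b1 = iwls_poisson(x, y)
mu = [math.exp(b0 + b1 * xi) for xi in x]
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

At convergence the score equations X^T (y − µ) = 0 hold, which gives a simple way to check the fit.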


Alternative Implementation

An alternative is to keep models in their over-parameterised form, so
that rank(X) < p, and use the generalised inverse in the IWLS
updates:

\[
\beta^{(r+1)} = \left( X^T W^{(r)} X \right)^{-} X^T W^{(r)} z^{(r)}
\]

This approach is more useful for GNMs, since in this case it is much
harder to define standard rules for specifying identifiability
constraints.
Rather, identifiability constraints can be applied post-fitting for
inference and interpretation.


Estimation of GNMs
GNMs present further technical difficulties compared with GLMs:

- automatic generation of starting values is hard
- the likelihood may have multiple optima

The default approach used in the gnm package for R is as follows:

- generate starting values randomly for nonlinear parameters and
  using a GLM fit for linear parameters
- use a one-parameter-at-a-time Newton method to update
  nonlinear parameters
- use the generalized IWLS to update all parameters

Consequently, the parameterisation returned is random.


Parameterisation of RC Model
The RC model is invariant to changes in scale or location of the
interaction parameters:
log(µrc ) = αr + βc + φr ψc
= αr + βc + (2φr )(0.5ψc )
= αr + (βc − ψc ) + (φr + 1)(ψc )
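One way to constrain these parameters is as follows

\[
\phi_r^* = \frac{\phi_r - \dfrac{\sum_r w_r \phi_r}{\sum_r w_r}}
                {\sqrt{\sum_r w_r \left( \phi_r - \dfrac{\sum_r w_r \phi_r}{\sum_r w_r} \right)^2}}
\]

where w_r is the row probability, say, so that

\[
\sum_r w_r \phi_r^* = 0, \qquad \sum_r w_r (\phi_r^*)^2 = 1
\]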
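The standardisation can be sketched in a few lines. The weights below are the row proportions of the mental health table; the raw scores are hypothetical values standing in for whatever random parameterisation a fit happens to return, and the function name is illustrative.

```python
import math


def standardise(scores, weights):
    """Centre scores at their weighted mean and scale to unit weighted sum of squares."""
    mean = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    centred = [s - mean for s in scores]
    scale = math.sqrt(sum(w * c * c for w, c in zip(weights, centred)))
    return [c / scale for c in centred]


# Row totals of the SES x MHS table (A-F), converted to row probabilities.
row_totals = [262, 245, 287, 384, 265, 217]
weights = [t / sum(row_totals) for t in row_totals]

# Hypothetical unconstrained row scores, for illustration only.
raw_scores = [0.9, 0.8, 0.2, -0.1, -0.6, -1.3]

phi_star = standardise(raw_scores, weights)
weighted_mean = sum(w * p for w, p in zip(weights, phi_star))
weighted_ss = sum(w * p * p for w, p in zip(weights, phi_star))
print(f"weighted mean = {weighted_mean:.2e}, weighted sum of squares = {weighted_ss:.6f}")
```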


Row and Column Scores
The row and column scores for the RC model are

##                   Estimate Std. Error
## Mult(., MHS).SESA     1.11       0.30
## Mult(., MHS).SESB     1.12       0.31
## Mult(., MHS).SESC     0.37       0.32
## Mult(., MHS).SESD    -0.03       0.27
## Mult(., MHS).SESE    -1.01       0.31
## Mult(., MHS).SESF    -1.82       0.28
##                          Estimate Std. Error
## Mult(SES, .).MHSwell         1.68       0.19
## Mult(SES, .).MHSmild         0.14       0.20
## Mult(SES, .).MHSmoderate    -0.14       0.28
## Mult(SES, .).MHSimpaired    -1.41       0.17

As one might expect, the scores are ordered for both factors,
suggesting the model for the dependence structure might be
simplified further.


Biplot Model

Biplots are graphical displays of data arrays which represent the
objects that index all dimensions of the array on the same plot.
So for a two-way table, a biplot represents both the rows and
columns at the same time.
The biplot is constructed from a rank-2 representation of the data.
Here we consider the generalized bilinear model

g(µij ) = α1i β1j + α2i β2j


Example: Leaf Blotch Data

The proportion of leaf area affected by leaf blotch was recorded for
10 varieties of barley grown at 9 sites (Gabriel, 1998).
Thus the response is a continuous variable in [0, 1].
Wedderburn (1974) suggested modelling these data using a logit link
and a variance proportional to the square of that of the binomial, i.e.
V(µ) = µ²(1 − µ)², giving a quasi-likelihood model.


Geometrical Interpretation
Given the bilinear model

logit(µij ) = α1i β1j + α2i β2j

the effect of site i can be represented by the point

(α1i , α2i )

in the space spanned by the linearly independent basis vectors

a1 = (α11 , α12 , ..., α19 )^T
a2 = (α21 , α22 , ..., α29 )^T


Visualising Sites and Varieties
Thus we can represent the sites and varieties separately as follows:

[Figure: two scatter plots, "Site Effects" and "Variety Effects", each showing Component 2 against Component 1 on axes from −4 to 4.]


Obtaining Orthogonal Bases

Given the SVD of the matrix of predictors

\[
\eta = U D V^T
\]

matrices of orthogonal basis vectors on the same scale are given by

\[
A = U D^{1/2}, \qquad B = D^{1/2} V^T
\]

The model stays the same, but the parameterisation changes.
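The following short derivation spells out why the predictor is unchanged and in what sense the two bases are on the same scale. It assumes the usual SVD conventions (U and V have orthonormal columns and D is diagonal with non-negative entries), which the slide does not state explicitly.

```latex
\begin{align*}
A B     &= U D^{1/2} D^{1/2} V^T = U D V^T = \eta, \\
A^T A   &= D^{1/2} U^T U D^{1/2} = D, \\
B B^T   &= D^{1/2} V^T V D^{1/2} = D .
\end{align*}
```

So the columns of A and the rows of B are mutually orthogonal, and corresponding columns and rows have the same squared length, namely the matching diagonal element of D.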


Biplot
[Figure: "Biplot for barley data", shown in two panels; sites A−I and varieties 1−9, X are plotted together with Component 2 against Component 1 on axes from −4 to 4. The second panel adds an h-axis and a v-axis.]


Model Refinement
The biplot suggests that the sites could be represented by points
along a line, with co-ordinates

(γi , δ0 )

and the varieties by points on two lines perpendicular to the site line:

(ν0 + ν1 I(j ∈ {2, 3, 6}), ωj )

This corresponds to the following simplification of the bilinear model:

α1i β1j + α2i β2j
≈ γi (ν0 + ν1 I(j ∈ {2, 3, 6})) + δ0 ωj

or equivalently, absorbing δ0 into ωj,

γi (ν0 + ν1 I(j ∈ {2, 3, 6})) + ωj .

Double Additive Model

Gabriel (1998) described the model derived from the biplot as the
double additive model.
An analysis of deviance confirms that this model is adequate for the
leaf blotch data
##
## Model 1: y ~ 0 + Mult(site, variety, inst = 1) + Mult(site,
##     variety, inst = 2)
## Model 2: y ~ variety + Mult(site, variety.binary) - 1
## 1 56 41
## 2 71 51 -15 -9.94 0.8


Stereotype Model

The stereotype model (Anderson, 1984) is suitable for ordered
categorical data. It is a special case of the multinomial logistic model:

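\[
\mathrm{pr}(y_i = c \mid x_i) =
  \frac{\exp(\beta_{0c} + \beta_c^T x_i)}{\sum_r \exp(\beta_{0r} + \beta_r^T x_i)}
\]

in which only the scale of the relationship with the covariates changes
between categories:

\[
\mathrm{pr}(y_i = c \mid x_i) =
  \frac{\exp(\beta_{0c} + \gamma_c \beta^T x_i)}{\sum_r \exp(\beta_{0r} + \gamma_r \beta^T x_i)}
\]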
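As an illustration of the stereotype form, the sketch below evaluates the category probabilities for one observation. All parameter values are hypothetical and chosen only to make the arithmetic easy to follow; the function name is not taken from any package.

```python
import math


def stereotype_probs(x, beta0, gamma, beta):
    """Category probabilities with predictor beta0_c + gamma_c * (beta^T x)."""
    linear_score = sum(b * xi for b, xi in zip(beta, x))   # beta^T x, shared by all categories
    eta = [b0 + g * linear_score for b0, g in zip(beta0, gamma)]
    top = max(eta)                                         # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical parameters for three categories and two covariates.
beta0 = [0.0, 0.5, -0.3]   # category intercepts
gamma = [0.0, 0.6, 1.0]    # category-specific multipliers
beta = [0.8, -0.4]         # common covariate coefficients
x = [1.0, 0.5]

p = stereotype_probs(x, beta0, gamma, beta)
print([round(v, 4) for v in p])

# Log odds of category 2 versus category 0:
# (beta0_2 - beta0_0) + (gamma_2 - gamma_0) * beta^T x
log_odds = math.log(p[2] / p[0])
```

Only the multipliers gamma_c distinguish how strongly each category responds to the covariates, which is why the model needs far fewer parameters than the full multinomial logistic model.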


Poisson Trick
The stereotype model can be fitted as a GNM by re-expressing the
categorical data as category counts Yi = (Yi1 , ..., Yik ).
Assuming a Poisson distribution for Yic , the joint distribution of Yi is
Multinomial(Ni , pi1 , ..., pik ) conditional on the total count Ni .
The expected counts are then µic = Ni pic and the parameters of the
stereotype model can be estimated through fitting

\[
\log \mu_{ic} = \log(N_i) + \log(p_{ic})
              = \alpha_i + \beta_{0c} + \gamma_c \sum_r \beta_r x_{ir}
\]

where the “nuisance” parameters αi ensure that the multinomial
denominators are reproduced exactly, as required.
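The step from the multinomial probabilities to the log-linear form can be written out explicitly. The following is a sketch of that step, using c' as a summation index over categories (a symbol introduced here to avoid clashing with the covariate index r):

```latex
\begin{align*}
\log \mu_{ic}
  &= \log N_i + \log p_{ic} \\
  &= \log N_i - \log \sum_{c'} \exp\!\left(\beta_{0c'} + \gamma_{c'} \beta^T x_i\right)
     + \beta_{0c} + \gamma_c \beta^T x_i \\
  &= \alpha_i + \beta_{0c} + \gamma_c \beta^T x_i ,
\qquad
\alpha_i = \log N_i - \log \sum_{c'} \exp\!\left(\beta_{0c'} + \gamma_{c'} \beta^T x_i\right).
\end{align*}
```

Everything that does not depend on the category c is collected into α_i, so there is one nuisance parameter per individual.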


Augmented Least Squares
A disadvantage of using the Poisson trick is that the number of
nuisance parameters can be large, making computation slow.
The algorithm can be adapted using augmented least squares.
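For an ordinary least squares model,

\[
\left( (y|X)^T (y|X) \right)^{-1}
  = \begin{pmatrix} y^T y & y^T X \\ X^T y & X^T X \end{pmatrix}^{-1}
  = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\]

where A11 , A12 and A22 are functions of y^T y, X^T y and X^T X.
Then it can be shown that

\[
\hat{\beta} = (X^T X)^{-1} X^T y = -\frac{A_{21}}{A_{11}}
\]

requiring only the first row (column) of the inverse to be found.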
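The identity can be checked numerically on a small problem. The sketch below uses toy data (invented for illustration) and a plain Gauss-Jordan inverse written for the purpose; the helper names `invert` and `cross` are hypothetical. It compares the estimate read off the first column of the augmented inverse with the usual normal-equations solution.

```python
def invert(m):
    """Gauss-Jordan inverse with partial pivoting, for small dense matrices."""
    n = len(m)
    a = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * c for v, c in zip(a[r], a[col])]
    return [row[n:] for row in a]


def cross(m):
    """Return M^T M for a matrix stored as a list of rows."""
    cols = list(zip(*m))
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]


# Toy data: intercept plus one covariate.
y = [1.0, 2.9, 5.2, 7.1, 8.8]
X = [[1.0, x] for x in [0.0, 1.0, 2.0, 3.0, 4.0]]

# Augmented approach: invert (y|X)^T (y|X) and use its first column.
augmented = [[yi] + xi for yi, xi in zip(y, X)]
A = invert(cross(augmented))
beta_aug = [-A[j][0] / A[0][0] for j in range(1, len(A))]

# Direct approach: (X^T X)^{-1} X^T y.
XtX_inv = invert(cross(X))
Xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(2)]
beta_direct = [sum(XtX_inv[j][k] * Xty[k] for k in range(2)) for j in range(2)]

print(beta_aug, beta_direct)
```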

Application to Nuisance Parameters I
The same approach can be applied to the IWLS algorithm, letting

\[
\tilde{X} = W^{1/2} (z|X)
\]

Now let

\[
\tilde{X} = (U|V)
\]

where V is the part of the design matrix corresponding to the
nuisance factor.
U is an nk × p matrix, where n is the number of nuisance parameters,
k is the number of categories and p is the number of model
parameters, typically with n >> p.
V is an nk × n matrix of dummy variables identifying each individual.


Application to Nuisance Parameters II

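Then

\[
(\tilde{X}^T \tilde{X})^{-}
  = \begin{pmatrix} U^T U & U^T V \\ V^T U & V^T V \end{pmatrix}^{-}
  = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
\]

Again, only the first row (column) of this generalised inverse is
required to estimate \hat{\beta}, so we are only interested in B11 and B12 .

\begin{align*}
B_{11} &= \left( U^T U - U^T V (V^T V)^{-1} V^T U \right)^{-} \\
B_{12} &= -(V^T V)^{-1} V^T U B_{11}
\end{align*}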


Elimination of the Nuisance Factor

U^T U is p × p, therefore not expensive to compute.
V^T V and V^T U can be computed without constructing the large
nk × n matrix V, due to the structure of V:

- V^T V is diagonal and the non-zero elements can be computed
  directly
- V^T U is equivalent to aggregating the rows of U by levels of the
  nuisance factor

Thus we only need to construct the U matrix, saving memory and
reducing the computational burden.
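The aggregation idea can be sketched as follows. For simplicity the example assumes V holds plain 0/1 dummy variables (that is, unit working weights), which is a simplification of the weighted case on the slides; the matrix U, the `ids` vector and the function name are made up for illustration. The explicit V is built only to confirm that the shortcut gives the same answer.

```python
def aggregate(U, ids, n_levels):
    """Return diag(V^T V) and V^T U without constructing the dummy matrix V."""
    counts = [0] * n_levels
    sums = [[0.0] * len(U[0]) for _ in range(n_levels)]
    for row, level in zip(U, ids):
        counts[level] += 1                 # diagonal element of V^T V
        for j, value in enumerate(row):
            sums[level][j] += value        # row of V^T U: column sums within the level
    return counts, sums


# Six rows of U belonging to three individuals (two rows each).
U = [[1.0, 0.5], [0.0, 1.5], [2.0, 1.0], [1.0, 1.0], [0.5, 0.0], [3.0, 2.0]]
ids = [0, 0, 1, 1, 2, 2]
n_levels = 3

counts, sums = aggregate(U, ids, n_levels)

# Explicit construction of V, for comparison only.
V = [[1.0 if level == k else 0.0 for k in range(n_levels)] for level in ids]
rows = range(len(V))
VtV = [[sum(V[r][a] * V[r][b] for r in rows) for b in range(n_levels)]
       for a in range(n_levels)]
VtU = [[sum(V[r][a] * U[r][j] for r in rows) for j in range(len(U[0]))]
       for a in range(n_levels)]

print(counts, sums)
```

The shortcut touches each row of U once, whereas the explicit products grow with the number of individuals squared.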


Example: Back Pain Data

For 101 patients, 3 prognostic variables were recorded at baseline,
then after 3 weeks the level of back pain was recorded (Anderson,
1984).
These data were converted to counts, for example for the first record:

##     x1 x2 x3                 pain count id
## 1    1  1  1                worse     0  1
## 1.1  1  1  1                 same     1  1
## 1.2  1  1  1   slight.improvement     0  1
## 1.3  1  1  1 moderate.improvement     0  1
## 1.4  1  1  1   marked.improvement     0  1
## 1.5  1  1  1      complete.relief     0  1


Back Pain Model
In this example, the expanded data is not that long (606 records) and
the total number of parameters is only 115 (9 nonlinear), so the
model does not take long to fit (< 1s!).
However, eliminating the linear parameters reduces the computation
time by almost two-thirds, showing the potential of this technique.
Compare the stereotype model to the multinomial logistic model:
##
## Model 1: count ~ pain + Mult(pain, x1 + x2 + x3) - 1
## Model 2: count ~ pain + pain:x1 + pain:x2 + pain:x3 - 1
## 1 493 303
## 2 485 299 8 4.08 0.85


Identifiability Constraints

In order to make the category-specific multipliers identifiable, we
must constrain both the location and scale.
A simple way to do this is to set the first multiplier to zero and fix
the coefficient of the first covariate to one.
##                      estimate    SE quasiSE quasiVar
## worse                   0.000 0.000  1.7797  3.16745
## same                   -3.710 1.826  0.4281  0.18330
## slight.improvement     -3.510 1.792  0.4025  0.16198
## moderate.improvement   -2.633 1.669  0.5519  0.30454
## marked.improvement     -4.612 1.895  0.3133  0.09817
## complete.relief        -5.372 2.000  0.4920  0.24202

Quasi standard errors (Firth and de Menezes, 2004) are invariant to
the choice of reference class.
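Quasi-variances are constructed so that the variance of a difference between two categories is approximately the sum of their quasi-variances. The sketch below applies that idea to the table above to get an approximate standard error for a contrast that the SE column does not give directly, and to form rough comparison intervals. The multiplier of 2 for the intervals is an assumption made here, not something stated on the slides, and the approximation is not exact.

```python
import math

# Estimates and quasi-variances copied from the table above.
estimate = {
    "worse": 0.000, "same": -3.710, "slight.improvement": -3.510,
    "moderate.improvement": -2.633, "marked.improvement": -4.612,
    "complete.relief": -5.372,
}
quasi_var = {
    "worse": 3.16745, "same": 0.18330, "slight.improvement": 0.16198,
    "moderate.improvement": 0.30454, "marked.improvement": 0.09817,
    "complete.relief": 0.24202,
}


def se_difference(a, b):
    """Approximate SE of (estimate[a] - estimate[b]) from quasi-variances."""
    return math.sqrt(quasi_var[a] + quasi_var[b])


def comparison_interval(category, factor=2.0):
    """Interval estimate +/- factor * quasi SE (factor is an assumed multiplier)."""
    half_width = factor * math.sqrt(quasi_var[category])
    return estimate[category] - half_width, estimate[category] + half_width


diff = estimate["same"] - estimate["slight.improvement"]
se = se_difference("same", "slight.improvement")
print(f"same - slight.improvement = {diff:.3f}, approx SE = {se:.3f}")
for category in estimate:
    low, high = comparison_interval(category)
    print(f"{category:22s} [{low:7.3f}, {high:7.3f}]")
```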


Comparison Intervals
Intervals based on quasi standard errors

[Figure: point estimates with comparison intervals for each pain category (worse, same, slight improvement, moderate improvement, marked improvement, complete relief); y-axis "estimate" from −6 to 4, x-axis "pain".]


Summary

Moving from GLMs to GNMs presents some technical difficulties, but
provides a framework that covers several useful models.
Further examples can be found in the help files and manual
accompanying the gnm package, which is available on CRAN.


References
Agresti, A. (2002). Categorical Data Analysis (2nd ed.). New York: Wiley.
Anderson, J. A. (1984). Regression and Ordered Categorical Variables. J.
R. Statist. Soc. B 46 (1), 1–30.
Firth, D. and R. X. de Menezes (2004). Quasi-variances. Biometrika 91,
65–80.
Gabriel, K. R. (1998). Generalised bilinear regression. Biometrika 85,
689–700.
Goodman, L. A. (1979). Simple models for the analysis of association in
cross-classifications having ordered categories. J. Amer. Statist.
Assoc. 74, 537–552.
Wedderburn, R. W. M. (1974). Quasi-likelihood Functions, Generalized
Linear Models, and the Gauss-Newton Method. Biometrika 61,
439–447.

