Basics of Statistical Inference
Lu Mao
Department of Epidemiology and Biostatistics, Drexel University of Public Health
Fall 2009
Contents
1 BEST UNBIASED ESTIMATORS
2 SUFFICIENCY AND UNBIASED ESTIMATION
3 CRAMER-RAO LOWER BOUND
4 COMPREHENSIVE EXAMPLES ON ESTIMATION
5 T TEST
6 WALD, SCORE AND LIKELIHOOD RATIO TESTS
1 BEST UNBIASED ESTIMATORS
Definition 1 (Point Estimator) A point estimator is any function W(X1, ...Xn) of the sample,
that is, any statistic is a point estimator.
This definition may seem unnecessarily vague, but at the moment we must be cautious not to preclude any potential candidates. There is one restriction, however: an estimator cannot contain an unknown parameter; it must be a function of the sample only. For example, $\frac{1}{2}(\bar X + \mu)$ is not an estimator for the unknown µ.
A common way of evaluating the performance of an estimator is through MSE.
Definition 2 (MSE) The mean squared error (MSE) of an estimator W for a parameter θ is defined by $E_\theta(W - \theta)^2$.
MSE is thus the squared difference between the estimator and the parameter, averaged over the sample space. Obviously we have $\mathrm{MSE}_\theta(W) = \mathrm{Var}_\theta(W) + (E_\theta(W) - \theta)^2$. Note that the MSE generally depends on the unknown parameter θ. Of course, we would like to find an estimator with minimum MSE for all $\theta \in \Theta$. However, this is not possible unless we place restrictions on the pool of potential estimators.
Example 1 Consider an iid sample from $N(\theta, 5)$ with sample size n = 10. A reasonable estimator for θ is the sample mean $\bar X$. Since $\bar X$ is unbiased, $\mathrm{MSE}(\bar X) = \mathrm{Var}(\bar X) = 1/2$, a constant function of θ. Now consider the constant estimator 0. $\mathrm{MSE}_\theta(0) = \theta^2$, so $\mathrm{MSE}_\theta(0) < \mathrm{MSE}_\theta(\bar X)$ when θ is in a small neighborhood of 0.
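As a quick numerical illustration of this comparison, here is a minimal Python sketch (assuming NumPy; the replication count and the grid of θ values are illustrative choices, not part of the example):

import numpy as np

rng = np.random.default_rng(0)
n, var, reps = 10, 5.0, 100_000          # n = 10 and variance 5, as in Example 1

for theta in [0.0, 0.5, 1.0, 2.0]:
    X = rng.normal(theta, np.sqrt(var), size=(reps, n))
    mse_xbar = np.mean((X.mean(axis=1) - theta) ** 2)   # approximately 5/10 = 1/2
    mse_zero = theta ** 2                                # MSE of the constant estimator 0
    print(f"theta={theta}: MSE(xbar)={mse_xbar:.3f}  MSE(0)={mse_zero:.3f}")

The output shows $\mathrm{MSE}_\theta(0)$ beating $\mathrm{MSE}_\theta(\bar X)$ only when $\theta^2 < 1/2$, i.e. when θ lies near 0.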
The constant 0, though generally a bad estimator, does have a smaller MSE when θ is indeed
around 0. To eliminate such trivial cases, we restrict our consideration to the class of unbiased
estimators.
Definition 3 (Unbiasedness) An estimator W for a parameter θ is unbiased if Eθ(W) = θ for
all θ ∈ Θ.
Example 2 Let $X_1, \dots, X_n$ be an iid sample with $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$ for $i = 1, \dots, n$. Show that $E(S^2) = \sigma^2$, where $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$. That is, the sample variance is an unbiased estimator of the population variance.
Proof First notice that
$$\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n (X_i^2 - 2X_i\bar X + \bar X^2) = \sum_{i=1}^n X_i^2 - 2\bar X\sum_{i=1}^n X_i + n\bar X^2 = \sum_{i=1}^n X_i^2 - 2n\bar X^2 + n\bar X^2 = \sum_{i=1}^n X_i^2 - n\bar X^2.$$
Therefore,
$$E S^2 = \frac{1}{n-1}E\Big(\sum_{i=1}^n X_i^2 - n\bar X^2\Big) = \frac{1}{n-1}\Big(\sum_{i=1}^n E X_i^2 - nE\bar X^2\Big) = \frac{1}{n-1}\Big(n(\sigma^2 + \mu^2) - n\big(\tfrac{\sigma^2}{n} + \mu^2\big)\Big) = \sigma^2.$$
Note that in the above example, no assumption about the family of distributions has been made except for finite variance. Thus for a Poisson random sample, for instance, both $\bar X$ and $S^2$ are unbiased estimators of λ. Furthermore, for any $t \in \mathbb{R}$, $t\bar X + (1 - t)S^2$ is an unbiased estimator of λ.
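As a sanity check for the Poisson case, the following sketch (assuming NumPy; λ, n, and the replication count are illustrative) verifies numerically that $\bar X$, $S^2$, and a convex combination of the two all average out to λ:

import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 3.0, 20, 200_000

Y = rng.poisson(lam, size=(reps, n))
xbar = Y.mean(axis=1)
s2 = Y.var(axis=1, ddof=1)               # sample variance with the 1/(n-1) divisor

print(xbar.mean())                        # approximately 3.0
print(s2.mean())                          # approximately 3.0
print((0.5 * xbar + 0.5 * s2).mean())     # t = 1/2: still approximately 3.0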
Definition 4 (UMVUE) An estimator W of θ is a Uniform Minimum Variance Unbiased Estimator (UMVUE) if $E_\theta(W) = \theta$ and, for any other estimator $W'$ with $E_\theta(W') = \theta$, we have $\mathrm{Var}_\theta(W) \le \mathrm{Var}_\theta(W')$ for all $\theta \in \Theta$. It is also said to be a best unbiased estimator.
While there can be a large number of unbiased estimators for a particular parameter, the best unbiased estimator is unique.
Theorem 1 If W is a best unbiased estimator for θ, then W is unique.
Proof Suppose $W'$ is another best unbiased estimator; then $\mathrm{Var}_\theta W = \mathrm{Var}_\theta W'$, and this common variance is no greater than that of any other unbiased estimator, for all θ.
Consider the unbiased estimator $T = \frac{1}{2}(W + W')$. We have
$$\mathrm{Var}_\theta(T) = \frac{1}{4}\mathrm{Var}_\theta(W + W') = \frac{1}{4}\big(\mathrm{Var}_\theta W + \mathrm{Var}_\theta W' + 2\,\mathrm{Cov}_\theta(W, W')\big) \le \frac{1}{4}\big(\mathrm{Var}_\theta W + \mathrm{Var}_\theta W' + 2\sqrt{\mathrm{Var}_\theta W \cdot \mathrm{Var}_\theta W'}\big) = \mathrm{Var}_\theta W. \quad (1)$$
But since W has uniform minimum variance, we must have $\mathrm{Var}_\theta(T) \ge \mathrm{Var}_\theta(W)$. Therefore $\mathrm{Var}_\theta(T) = \mathrm{Var}_\theta(W)$. This implies $\mathrm{Cov}_\theta(W, W') = \sqrt{\mathrm{Var}_\theta W \cdot \mathrm{Var}_\theta W'}$. Attainment of equality in the Cauchy-Schwarz inequality requires that
$$W' = aW + b \quad (2)$$
for some constants a and b. Equal variances and equal expectations give a = 1 and b = 0, so $W' = W$.
This completes the proof.
2 SUFFICIENCY AND UNBIASED ESTIMATION
Recall that T is a sufficient statistic for θ if the conditional distribution of the sample X given T = t does not depend on θ. Informally, we say that T summarizes all the information about θ contained in the sample. This notion is elegantly illustrated by the Rao-Blackwell Theorem.
Theorem 2 (Rao-Blackwell) Let W be an unbiased estimator of θ and let T be a sufficient statistic for θ. Define $\varphi(T) = E(W \mid T)$. Then $\varphi(T)$ is an unbiased estimator and $\mathrm{Var}_\theta\,\varphi(T) \le \mathrm{Var}_\theta W$ for all θ. That is, $\varphi(T)$ is a uniformly better estimator than W.
Proof First notice that, since the distribution of $W(X_1, \dots, X_n) \mid T = t$ does not depend on θ, $\varphi(t) = E(W \mid T = t)$ is not a function of θ. Therefore, $\varphi(T)$ is indeed an estimator.
Now,
$$E_\theta\,\varphi(T) = E_\theta\big(E(W \mid T)\big) = E_\theta(W) = \theta;$$
also,
$$\mathrm{Var}_\theta(W) = E_\theta\big(\mathrm{Var}(W \mid T)\big) + \mathrm{Var}_\theta\big(E(W \mid T)\big) \ge \mathrm{Var}_\theta\big(E(W \mid T)\big) = \mathrm{Var}_\theta\,\varphi(T).$$
This completes the proof.
The Rao-Blackwell Theorem tells us that for any unbiased estimator W we can always find $\varphi(T)$, an estimator based on a sufficient statistic, that is uniformly better than W.
Corollary The best unbiased estimator for θ, if one exists, must be a function of a sufficient
statistic.
Intuitively, since the best unbiased estimator has the highest precision, it must have captured
all the information available in the sample.
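To make the Rao-Blackwell step concrete, here is a small simulation sketch (assuming NumPy; the Bernoulli setup and all constants are illustrative, not from the text). For an iid Bernoulli(p) sample, $W = X_1$ is unbiased for p, $T = \sum_i X_i$ is sufficient, and $\varphi(T) = E(X_1 \mid T) = T/n = \bar X$; Rao-Blackwellization leaves the mean unchanged but shrinks the variance by a factor of n:

import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 0.3, 15, 200_000

X = rng.binomial(1, p, size=(reps, n))
W = X[:, 0]                    # crude unbiased estimator: the first observation
phi = X.mean(axis=1)           # Rao-Blackwellized version: E(X1 | sum) = Xbar

print(W.mean(), phi.mean())    # both approximately p = 0.3 (unbiased)
print(W.var(), phi.var())      # approximately p(1-p) = 0.21 vs p(1-p)/n = 0.014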
3 CRAMER-RAO LOWER BOUND
Theorem 3 (Cramér-Rao Inequality) Let $X_1, \dots, X_n$ be an iid sample with density $f(x \mid \theta)$, and let $W(X_1, \dots, X_n)$ be an estimator such that $\mathrm{Var}_\theta(W) < \infty$ and
$$\frac{d}{d\theta}E_\theta W = \int_{\mathcal{X}} \frac{\partial}{\partial\theta}\big[W(x)f(x \mid \theta)\big]\,dx. \quad (3)$$
Then
$$\mathrm{Var}_\theta(W) \ge \frac{\big(\frac{d}{d\theta}E_\theta W(X)\big)^2}{nI(\theta)},$$
where $I(\theta) = E_\theta\Big[\Big(\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big)^2\Big]$ is the Fisher information.
Corollary In addition to the conditions specified in Theorem 3, if W is an unbiased estimator of θ, then
$$\mathrm{Var}_\theta(W) \ge \frac{1}{nI(\theta)}.$$
Condition (3) involves an interchange of differentiation and integration. A common situation in which (3) fails to hold is when the support of the density $f(x \mid \theta)$ depends on θ.
Example 3 (A counterexample to the CRLB) Let $X_1, \dots, X_n$ be iid Uniform$[0, \theta]$. Calculate the "Fisher information" $I(\theta)$ and find an unbiased estimator W of θ such that $\mathrm{Var}_\theta W < \frac{1}{nI(\theta)}$.
Solution Since $\log f(X \mid \theta) = \log\frac{1}{\theta} = -\log\theta$, we have
$$I(\theta) = E_\theta\Big[\Big(\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big)^2\Big] = E_\theta\Big[\Big(\frac{\partial}{\partial\theta}(-\log\theta)\Big)^2\Big] = E_\theta\,\frac{1}{\theta^2} = \frac{1}{\theta^2}.$$
So $\frac{1}{nI(\theta)} = \frac{\theta^2}{n}$. Now, since obviously $E\bar X = \frac{\theta}{2}$, we have that $W = 2\bar X$ is an unbiased estimator of θ. But
$$\mathrm{Var}_\theta W = 4\,\mathrm{Var}_\theta \bar X = \frac{4}{n}\mathrm{Var}_\theta X_i = \frac{4}{n}\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3n} < \frac{\theta^2}{n}.$$
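A numerical check of this counterexample, as a sketch (assuming NumPy; θ, n, and the replication count are illustrative):

import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 2.0, 25, 200_000

X = rng.uniform(0, theta, size=(reps, n))
W = 2 * X.mean(axis=1)

print(W.mean())          # approximately theta = 2 (unbiased)
print(W.var())           # approximately theta^2/(3n) = 0.053
print(theta ** 2 / n)    # the inapplicable "bound" theta^2/n = 0.16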
Fortunately, condition (3) is satisfied for "reasonable" estimators in the exponential family. Further,
$$I(\theta) = -E_\theta\Big(\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\Big) \quad (4)$$
holds for the exponential family.
In light of previous discussions, if the variance of an unbiased estimator attains the CRLB, it
is the best unbiased estimator.
Example 4 Show that ¯X is the best unbiased estimator for λ in a Poisson(λ) random sample.
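A sketch of the argument: the Poisson family is an exponential family, so the corollary to Theorem 3 and identity (4) both apply. Since $\log f(x \mid \lambda) = x\log\lambda - \lambda - \log x!$,
$$I(\lambda) = -E_\lambda\Big(\frac{\partial^2}{\partial\lambda^2}\log f(X \mid \lambda)\Big) = E_\lambda\Big(\frac{X}{\lambda^2}\Big) = \frac{1}{\lambda}, \qquad \mathrm{Var}_\lambda(\bar X) = \frac{\lambda}{n} = \frac{1}{nI(\lambda)}.$$
Since $\bar X$ is unbiased for λ and its variance attains the Cramér-Rao lower bound, it is the best unbiased estimator of λ.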
4 COMPREHENSIVE EXAMPLES ON ESTIMATION
The following problem appeared in the Duke University first-year qualifying exam in 2009 (http://www.stat.duke.edu/programs/grad/fye/fye2009.pdf).
Example 5 Suppose $Y_i$ follows a Poisson distribution with mean $\beta x_i$, where $x_i$ is a fixed known positive constant:
$$P(Y_i = y_i) = (\beta x_i)^{y_i}e^{-\beta x_i}/y_i!, \qquad y_i = 0, 1, 2, \dots$$
1. Find a minimal sufficient statistic for this family of distributions. (Disregard "minimal".)
2. Find the Cramér-Rao bound for the variance of an unbiased estimator of β.
3. Find the MLE for β.
4. Find the variance of the MLE for β.
5. Give an approximate 95% confidence interval for β when n is large. Carefully describe any theorem that you use. (Disregard this last instruction.)
Solution
1. The likelihood function is
$$f(\mathbf{y} \mid \beta) = \beta^{\sum y_i}\Big(\prod_i x_i^{y_i}\Big)e^{-\beta\sum x_i}\Big/\prod_i y_i! = \Big(\beta^{\sum y_i}e^{-\beta\sum x_i}\Big)\cdot\Big(\prod_i x_i^{y_i}\Big/\prod_i y_i!\Big).$$
So $\sum Y_i$ is a sufficient statistic.
2. Since
$$\log f(\mathbf{y} \mid \beta) = \sum y_i\log\beta - \beta\sum x_i + c,$$
we have
$$\frac{\partial}{\partial\beta}\log f(\mathbf{y} \mid \beta) = \frac{\sum y_i}{\beta} - \sum x_i. \quad (5)$$
Therefore,
$$\frac{\partial^2\log f(\mathbf{y} \mid \beta)}{\partial\beta^2} = -\frac{\sum y_i}{\beta^2}.$$
Finally,
$$\frac{1}{I(\beta)} = -\frac{1}{E\Big(\frac{\partial^2\log f(\mathbf{Y} \mid \beta)}{\partial\beta^2}\Big)} = \frac{\beta^2}{\sum EY_i} = \frac{\beta}{\sum x_i}.$$
3. From (5), setting the score equal to zero gives
$$\hat\beta = \frac{\sum Y_i}{\sum x_i}.$$
Furthermore,
$$E(\hat\beta) = \frac{\sum EY_i}{\sum x_i} = \beta.$$
4. The variance of the MLE is
$$\mathrm{Var}(\hat\beta) = \frac{1}{(\sum x_i)^2}\mathrm{Var}\Big(\sum Y_i\Big) = \frac{1}{(\sum x_i)^2}\sum\beta x_i = \frac{\beta}{\sum x_i}.$$
5. By the asymptotic normality of the MLE,
$$\hat\beta - \beta \xrightarrow{D} N\Big(0, \frac{1}{I(\beta)}\Big).$$
That is,
$$\sqrt{I(\beta)}\,(\hat\beta - \beta) = \sqrt{\frac{\sum x_i}{\beta}}\,(\hat\beta - \beta) \xrightarrow{D} N(0, 1).$$
By consistency of the MLE,
$$\hat\beta \xrightarrow{P} \beta.$$
By the Continuous Mapping Theorem (footnote 3),
$$\hat\beta/\beta \xrightarrow{P} 1, \quad\text{and hence}\quad \sqrt{\beta/\hat\beta} \xrightarrow{P} 1.$$
By Slutsky's Theorem (footnote 4),
$$\sqrt{\frac{\sum x_i}{\hat\beta}}\,(\hat\beta - \beta) = \sqrt{\frac{\beta}{\hat\beta}}\cdot\sqrt{\frac{\sum x_i}{\beta}}\,(\hat\beta - \beta) \xrightarrow{D} N(0, 1).$$
Therefore,
$$P\Big(-z_{0.025} \le \sqrt{\frac{\sum x_i}{\hat\beta}}\,(\hat\beta - \beta) \le z_{0.025}\Big) \doteq 0.95.$$
So,
$$P\Big(\hat\beta - 1.96\sqrt{\frac{\hat\beta}{\sum x_i}} \le \beta \le \hat\beta + 1.96\sqrt{\frac{\hat\beta}{\sum x_i}}\Big) \doteq 0.95.$$
Substituting $\hat\beta = \frac{\sum Y_i}{\sum x_i}$, an approximate 95% CI for β is
$$\frac{\sum Y_i}{\sum x_i} - 1.96\frac{\sqrt{\sum Y_i}}{\sum x_i} \le \beta \le \frac{\sum Y_i}{\sum x_i} + 1.96\frac{\sqrt{\sum Y_i}}{\sum x_i}.$$
Footnote 3: Let g be a continuous function and $X_n \xrightarrow{P} X$; then $g(X_n) \xrightarrow{P} g(X)$. That is, continuous functions preserve convergence in probability.
Footnote 4: If $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c$, then $X_nY_n \xrightarrow{D} cX$.
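A numerical sketch of parts 3-5 (assuming NumPy; the $x_i$ values, the true β, and the seed are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.5, 3.0, 50)          # fixed known positive constants x_i
beta_true = 1.7

y = rng.poisson(beta_true * x)         # Y_i ~ Poisson(beta * x_i)

beta_hat = y.sum() / x.sum()           # MLE from part 3
se_hat = np.sqrt(beta_hat / x.sum())   # plug-in standard error from part 5
ci = (beta_hat - 1.96 * se_hat, beta_hat + 1.96 * se_hat)
print(beta_hat, ci)                    # the interval covers 1.7 in about 95% of repetitions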
The following is the last problem of Chapter 7 (Point Estimation) in Casella & Berger (2nd edition) [1].
Example 6 (Jackknife) The jackknife is a general technique for reducing bias in an estimator. A one-step jackknife estimator is defined as follows. Let $X_1, \dots, X_n$ be a random sample, and let $T_n(X_1, \dots, X_n)$ be some estimator of a parameter θ. In order to "jackknife" $T_n$ we calculate the n statistics $T_n^{(i)}$, $i = 1, \dots, n$, where $T_n^{(i)}$ is calculated just as $T_n$ but using the n − 1 observations with $X_i$ removed from the sample. The jackknife estimator of θ, denoted by $JK(T_n)$, is given by
$$JK(T_n) = nT_n - \frac{n-1}{n}\sum_{i=1}^n T_n^{(i)}. \quad (6)$$
(In general, $JK(T_n)$ will have a smaller bias than $T_n$. See Miller 1974 for a good review of the properties of the jackknife.)
1. Show that the MLE of $\theta^2$, $\big(\sum_{i=1}^n X_i/n\big)^2$, is a biased estimator of $\theta^2$.
2. Derive the one-step jackknife estimator based on the MLE.
3. Show that the one-step jackknife estimator is an unbiased estimator of $\theta^2$. (In general, jackknifing only reduces bias. In this special case, however, it removes it entirely.)
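A generic implementation of the one-step jackknife in (6) can be sketched as follows (plain Python with NumPy; the function name jackknife, the helper argument estimator, and the normal sample used for illustration are my own choices, not part of the exercise):

import numpy as np

def jackknife(sample, estimator):
    """One-step jackknife JK(T_n) = n*T_n - ((n-1)/n) * sum_i T_n^(i), as in (6)."""
    sample = np.asarray(sample)
    n = len(sample)
    t_n = estimator(sample)
    # T_n^(i): recompute the estimator with the i-th observation deleted
    t_loo = np.array([estimator(np.delete(sample, i)) for i in range(n)])
    return n * t_n - (n - 1) / n * t_loo.sum()

# Illustration: jackknifing the biased plug-in estimator (Xbar)^2 of theta^2,
# here with an N(theta, 1) sample, for which (Xbar)^2 is the MLE of theta^2.
rng = np.random.default_rng(5)
data = rng.normal(2.0, 1.0, size=30)
print(data.mean() ** 2, jackknife(data, lambda s: s.mean() ** 2))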
5 T TEST
Definition 5 (t Distribution) If $Z \sim N(0, 1)$ and $W \sim \chi^2_p$, with Z and W independent, then $\frac{Z}{\sqrt{W/p}}$ is said to follow the t distribution with p degrees of freedom.
A direct consequence is that for an iid sample from $N(\mu, \sigma^2)$,
$$\frac{\sqrt{n}(\bar X - \mu)}{S} \sim t_{n-1}. \quad (7)$$
To see this, recall that $\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \sim N(0, 1)$, that $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, and that $\bar X$ and $S^2$ are independent. Therefore
$$\frac{\sqrt{n}(\bar X - \mu)}{S} = \frac{N(0, 1)}{\sqrt{\chi^2_{n-1}/(n-1)}}.$$
In fact, finding the distribution of $\frac{\sqrt{n}(\bar X - \mu)}{S}$ was originally the motivation for Definition 5.
Example 7 (One-Sample t Test) Let $X_1, \dots, X_n$ be iid $N(\theta, \sigma^2)$, where $\theta_0$ is a specified value of θ and $\sigma^2$ is unknown. We are interested in testing
$$H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta \ne \theta_0.$$
Show that the LRT rejects $H_0$ when
$$\frac{\sqrt{n}|\bar X - \theta_0|}{S} > c.$$
Therefore, the level α LRT rejects $H_0$ when
$$\frac{\sqrt{n}|\bar X - \theta_0|}{S} > t_{n-1,\alpha/2}. \quad (8)$$
Solution The unrestricted MLEs are $\hat\theta = \bar X$ and $\hat\sigma^2 = \frac{1}{n}\sum_i(X_i - \bar X)^2$. The restricted (under $H_0$) MLEs are $\hat\theta_0 = \theta_0$ and $\hat\sigma_0^2 = \frac{1}{n}\sum_i(X_i - \theta_0)^2$. So the likelihood ratio is
$$\lambda(X) = \frac{(2\pi\hat\sigma_0^2)^{-n/2}\exp\Big(-\frac{\sum_i(X_i - \theta_0)^2}{2\hat\sigma_0^2}\Big)}{(2\pi\hat\sigma^2)^{-n/2}\exp\Big(-\frac{\sum_i(X_i - \bar X)^2}{2\hat\sigma^2}\Big)} = \Big(\frac{\hat\sigma^2}{\hat\sigma_0^2}\Big)^{n/2}e^{-\frac{n}{2} + \frac{n}{2}} = \Big(\frac{\sum_i(X_i - \bar X)^2}{\sum_i(X_i - \theta_0)^2}\Big)^{n/2}$$
$$= \Big(\frac{\sum_i(X_i - \bar X)^2}{\sum_i(X_i - \bar X)^2 + n(\bar X - \theta_0)^2}\Big)^{n/2} = \Bigg(\frac{1}{1 + \frac{n(\bar X - \theta_0)^2}{\sum_i(X_i - \bar X)^2}}\Bigg)^{n/2} = \Bigg(\frac{1}{1 + \frac{1}{n-1}\frac{n(\bar X - \theta_0)^2}{S^2}}\Bigg)^{n/2}.$$
Obviously,
$$\lambda(X) < c \iff \frac{\sqrt{n}|\bar X - \theta_0|}{S} > c' \quad\text{for some } c'.$$
For a test with significance level α, we want to choose $c'$ such that the probability under $H_0$ satisfies
$$P_{\theta_0}\Big(\frac{\sqrt{n}|\bar X - \theta_0|}{S} > c'\Big) = \alpha.$$
That is,
$$P_{\theta_0}\Big(\frac{\sqrt{n}(\bar X - \theta_0)}{S} < -c' \ \text{ or }\ \frac{\sqrt{n}(\bar X - \theta_0)}{S} > c'\Big) = \alpha.$$
Since $\frac{\sqrt{n}(\bar X - \theta_0)}{S} \sim t_{n-1}$ under $H_0$, we have $c' = t_{n-1,\alpha/2}$.
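In practice the test in (8) is a routine computation; a minimal sketch (assuming NumPy and SciPy; the data and $\theta_0$ are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(1.3, 2.0, size=12)      # illustrative data
theta0 = 1.0                           # H0: theta = 1

n = len(x)
t_stat = np.sqrt(n) * (x.mean() - theta0) / x.std(ddof=1)
crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)       # t_{n-1, alpha/2} for alpha = 0.05
print(abs(t_stat) > crit)                        # reject H0?

print(stats.ttest_1samp(x, popmean=theta0))      # the same test, done by SciPy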
The two-sample case is similar in rationale, although slightly more involved.
Example 8 (Two-Sample t Test) Let $X_1, \dots, X_n$ be an iid sample from $N(\mu_X, \sigma^2)$, and let $Y_1, \dots, Y_m$ be an independent iid sample from $N(\mu_Y, \sigma^2)$. Again, $\sigma^2$ is unknown. We are interested in testing the means
$$H_0: \mu_X = \mu_Y \quad\text{versus}\quad H_1: \mu_X \ne \mu_Y.$$
Show that the level α LRT rejects $H_0$ if
$$\frac{|\bar X - \bar Y|}{\sqrt{S_p^2\big(\frac{1}{n} + \frac{1}{m}\big)}} > t_{n+m-2,\alpha/2},$$
where $S_p^2$ is the pooled sample variance
$$S_p^2 = \frac{1}{n + m - 2}\Big(\sum_{i=1}^n (X_i - \bar X)^2 + \sum_{i=1}^m (Y_i - \bar Y)^2\Big).$$
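The corresponding computation for the two-sample test, as a sketch (assuming NumPy and SciPy; the samples are illustrative, and equal_var=True requests the pooled-variance test described above):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.5, size=20)
y = rng.normal(0.8, 1.5, size=25)

n, m = len(x), len(y)
sp2 = (((x - x.mean()) ** 2).sum() + ((y - y.mean()) ** 2).sum()) / (n + m - 2)
t_stat = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n + 1 / m))
crit = stats.t.ppf(1 - 0.05 / 2, df=n + m - 2)
print(abs(t_stat) > crit)                            # reject H0?

print(stats.ttest_ind(x, y, equal_var=True))         # should give the same t statistic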
6 WALD, SCORE AND LIKELIHOOD RATIO TESTS
There are three common methods of constructing a large-sample test: the (asymptotic) likelihood ratio test (LRT), the Wald test, and the score test.
6.1 LRT
In the previous section, the LR statistic was eventually simplified to a statistic whose exact distribution could be found (in that case, the t distribution). In most cases, however, an exact analytical treatment of the LRT is difficult or impossible. Instead, a large-sample approximation to the distribution of the LR statistic itself can be applied.
Theorem 4 (Asymptotic distribution of the LR; assuming regularity conditions on the pdf/pmf and an iid sample) For testing $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$, let the likelihood ratio statistic be
$$\lambda(X) = \frac{L(\theta_0 \mid x)}{L(\hat\theta \mid x)},$$
where $\hat\theta$ is the MLE. Then under $H_0$, as $n \to \infty$, we have
$$-2\log\lambda(X) \xrightarrow{D} \chi^2_1. \quad (9)$$
To see this, let $l(\theta) = \log L(\theta \mid x)$, so that $\log\lambda(X) = l(\theta_0) - l(\hat\theta)$. Now Taylor expand $l(\theta)$ around $\hat\theta$:
$$l(\theta) = l(\hat\theta) + l'(\hat\theta)(\theta - \hat\theta) + l''(\hat\theta)\frac{(\theta - \hat\theta)^2}{2!} + \cdots$$
Recall that the MLE $\hat\theta$ is a root of $l'(\theta)$, that is, $l'(\hat\theta) = 0$. So when $\theta = \theta_0$ we have
$$-2\log\lambda(X) = -2\big(l(\theta_0) - l(\hat\theta)\big) \doteq -l''(\hat\theta)(\theta_0 - \hat\theta)^2.$$
Since $\hat\theta \xrightarrow{P} \theta_0$, $l''(\hat\theta)$ behaves like $l''(\theta_0)$ (provided that $l''$ is continuous), and $l''(\theta_0)$ in turn behaves like its expectation $-I(\theta_0)$, where $I(\theta)$ here denotes the Fisher information of the whole sample. From the asymptotic normality of $\hat\theta$, we know that $\sqrt{I(\theta_0)}(\hat\theta - \theta_0) \xrightarrow{D} N(0, 1)$. Therefore
$$-2\log\lambda(X) \doteq I(\theta_0)(\hat\theta - \theta_0)^2 \xrightarrow{D} \chi^2_1.$$
This establishes the asymptotic distribution of the LR statistic.
We will state without proof the generalized version where the parameter is vector-valued.
Theorem 5 (Multi-parameter version of the LRT) Let $X_1, \dots, X_n$ be a random sample from $f(x \mid \theta)$. Under $H_0$, the distribution of $-2\log\lambda(X)$ converges to a chi-squared distribution as the sample size $n \to \infty$. The degrees of freedom of the limiting distribution is the difference between the number of free parameters specified by $\theta \in \Theta$ and the number of free parameters specified by $\theta \in \Theta_0$.
6.2 Wald Test
Another asymptotic test based on the MLE is the Wald test. Since for the MLE
$$\hat\theta - \theta \xrightarrow{D} N\Big(0, \frac{1}{I(\theta)}\Big),$$
we have
$$\sqrt{I(\theta)}\,(\hat\theta - \theta) \xrightarrow{D} N(0, 1),$$
or equivalently, under $H_0: \theta = \theta_0$,
$$I(\theta_0)(\hat\theta - \theta_0)^2 \xrightarrow{D} \chi^2_1.$$
In practice, the observed information number $\hat I(\hat\theta)$ is used. Therefore a level α Wald test rejects $H_0: \theta = \theta_0$ when
$$\hat I(\hat\theta)(\hat\theta - \theta_0)^2 > \chi^2_1(\alpha),$$
where $\chi^2_1(\alpha)$ denotes the upper α quantile of the $\chi^2_1$ distribution.
6.3 Score Test
The score statistic is defined to be
$$S_\theta(X) = \frac{\partial}{\partial\theta}\log L(\theta \mid X). \quad (10)$$
It is easily seen that
$$E_\theta S_\theta(X) = 0 \quad\text{and}\quad \mathrm{Var}_\theta S_\theta(X) = I(\theta).$$
The test statistic is
$$Z_s = \frac{S_\theta(X)}{\sqrt{I(\theta)}}.$$
It can be shown that under $H_0: \theta = \theta_0$,
$$\frac{S_{\theta_0}(X)}{\sqrt{I(\theta_0)}} \xrightarrow{D} N(0, 1).$$
Therefore a level α score test rejects $H_0: \theta = \theta_0$ when
$$\frac{S_{\theta_0}^2(X)}{I(\theta_0)} > \chi^2_1(\alpha).$$
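To tie the three tests together, here is a sketch (assuming NumPy and SciPy) for the simplest textbook case, an iid Poisson(λ) sample with $H_0: \lambda = \lambda_0$; the data, $\lambda_0$, and sample size are illustrative choices, not from the text. Here $\hat\lambda = \bar X$, the score is $S_\lambda(X) = \sum_i X_i/\lambda - n$, and the (total-sample) information is $I(\lambda) = n/\lambda$:

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
lam0, n = 2.0, 40
x = rng.poisson(2.4, size=n)           # data generated away from H0, for illustration
lam_hat = x.mean()                     # MLE

wald  = (n / lam_hat) * (lam_hat - lam0) ** 2                           # I_hat(lam_hat)(lam_hat - lam0)^2
score = (x.sum() / lam0 - n) ** 2 / (n / lam0)                          # S_{lam0}(X)^2 / I(lam0)
lrt   = 2 * (x.sum() * np.log(lam_hat / lam0) - n * (lam_hat - lam0))   # -2 log lambda(X)

crit = stats.chi2.ppf(1 - 0.05, df=1)          # chi^2_1(alpha) with alpha = 0.05
print(wald > crit, score > crit, lrt > crit)   # the three rejection decisions

For large n the three statistics are asymptotically equivalent under $H_0$, which is why they share the $\chi^2_1$ reference distribution.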
References
[1] George Casella and Roger L. Berger, Statistical Inference, Duxbury Press, 2nd ed, 2001.
[2] E. L. Lehmann and George Casella, Theory of Point Estimation, Springer Texts in Statistics,
2nd ed, 1998.