PHIL 6334 - Probability/Statistics Lecture Notes 3:
Estimation (Point and Interval)
Aris Spanos [Spring 2014]
1 Introduction
In this lecture we will consider point estimation in its simplest form by focusing the discussion on simple statistical models, whose generic form is given in table 1.
Table 1 — Simple (generic) Statistical Model
[i] Probability model: Φ = {f(x; θ), θ∈Θ, x∈R_X},
[ii] Sampling model: X:=(X₁, X₂, ..., Xₙ) is a random (IID) sample.
What makes this type of statistical model ‘simple’ is the notion of a random (IID) sample.
1.1 Random sample (IID)
The notion of a random sample is defined in terms of the joint distribution of the sample X:=(X₁, X₂, ..., Xₙ), say f(x₁, x₂, ..., xₙ; θ), for all x:=(x₁, x₂, ..., xₙ)∈Rⁿ_X, by imposing two probabilistic assumptions:
(I) Independence: the sample X is said to be Independent (I) if, for all x∈Rⁿ_X, the joint distribution splits up into a product of marginal distributions:
f(x; θ) = f₁(x₁; θ₁)·f₂(x₂; θ₂)···fₙ(xₙ; θₙ) := ∏ₖ₌₁ⁿ fₖ(xₖ; θₖ).
(ID) Identically Distributed: the sample X is said to be Identically Distributed (ID) if the marginal distributions are identical:
fₖ(xₖ; θₖ) = f(xₖ; θ), for all k = 1, 2, ..., n.
Note that this means two things: the density functions have the same form and the unknown parameters are common to all of them.
For a better understanding of these two crucial probabilistic assumptions we need to simplify the discussion by focusing first on the two random variable case, denoting the variables by X and Y to avoid subscripts.
First, let us revisit the notion of a random variable in order to motivate the notions of marginal and joint distributions.
Example 5. Tossing a coin twice and noting the outcome. In this case S={(HH), (HT), (TH), (TT)}, and let us assume that the events of interest are A={(HH), (HT), (TH)} and B={(HT), (TH), (TT)}. Using these two events we can generate the event space of interest F by applying the set-theoretic operations of union (∪), intersection (∩), and complementation (−). That is, F = {S, ∅, A, B, A∩B, ...}; convince yourself that this will give rise to the set of all subsets of S. Let us define the real-valued functions X(·) and Y(·) on S as follows:
X(HH) = X(HT) = X(TH) = 1, X(TT) = 0,
Y(HT) = Y(TH) = Y(TT) = 1, Y(HH) = 0.
Do these two functions define proper r.v.'s with respect to F? To check that, we define all possible events generated by these functions and check whether they belong to F:
{s: X(s)=0} = {(TT)} = Ā ∈ F, {s: X(s)=1} = A ∈ F,
{s: Y(s)=0} = {(HH)} = B̄ ∈ F, {s: Y(s)=1} = B ∈ F.
Hence, both functions do define proper r.v.'s with respect to F. To derive their distributions we assume that we have a fair coin, i.e. each outcome in S has probability .25 of occurring.
P({s: X(s)=0}) = P(X=0) = .25, P({s: Y(s)=0}) = P(Y=0) = .25,
P({s: X(s)=1}) = P(X=1) = .75, P({s: Y(s)=1}) = P(Y=1) = .75.
Hence, their ‘marginal’ density functions take the form:

x:       0    1          y:       0    1
f_x(x): .25  .75         f_y(y): .25  .75        (1)
How can one define the joint distribution of these two r.v.'s? To define the joint density function we need to specify all the events:
{X=x, Y=y}, x∈R_X, y∈R_Y,
denoting ‘their joint occurrence’, and then attach probabilities to these events. These events belong to F by definition, because F, as a field, is closed under the set-theoretic operations ∪, ∩, −, so that:
{X=0, Y=0} = {(TT)}∩{(HH)} = ∅,       P(X=0, Y=0) = 0,
{X=0, Y=1} = {(TT)},                  P(X=0, Y=1) = .25,
{X=1, Y=0} = {(HH)},                  P(X=1, Y=0) = .25,
{X=1, Y=1} = {(HT), (TH)},            P(X=1, Y=1) = .50.
Hence, the joint density is defined by:

x\y     0     1
0       0    .25
1      .25   .50        (2)
How is the joint density (2) connected to the individual (marginal) densities given in (1)? It turns out that if we sum over the rows of the above table for each value of x, i.e. use Σ_{y∈R_Y} f(x, y) = f_x(x), we will get the marginal distribution of X: f_x(x), x∈R_X, and if we sum over the columns for each value of y, i.e. use Σ_{x∈R_X} f(x, y) = f_y(y), we will get the marginal distribution of Y: f_y(y), y∈R_Y:
x\y      0     1    f_x(x)
0        0    .25    .25
1       .25   .50    .75
f_y(y)  .25   .75     1        (3)
Note: E(X) = 0(.25) + 1(.75) = .75 = E(Y),
Var(X) = (0−.75)²(.25) + (1−.75)²(.75) = .1875 = Var(Y).
Armed with the joint distribution we can proceed to define the notions of Independence and Identically Distributed for the r.v.'s X and Y.
Independence. Two r.v.'s X and Y are said to be Independent iff:
f(x, y) = f_x(x)·f_y(y) for all values (x, y)∈R_X × R_Y.  (4)
That is, to verify that these two r.v.'s are independent we need to confirm that the probability of all possible pairs of values (x, y) satisfies (4).
Example. In the case of the joint distribution in (3) we can show that the r.v.'s X and Y are not independent because for (x, y)=(0, 0):
f(0, 0) = 0 ≠ f_x(0)·f_y(0) = (.25)(.25).
It is important to emphasize that the above condition of Independence is not equivalent to the two random variables being uncorrelated:
Corr(X, Y) = 0 ⇏ f(x, y) = f_x(x)·f_y(y) for all (x, y)∈R_X × R_Y,
where ‘⇏’ denotes ‘does not imply’. This is because Corr(X, Y) is a measure of linear dependence between X and Y, since it is based on the covariance, defined by:
Cov(X, Y) = E[(X−E(X))(Y−E(Y))] = (0)(0−.75)(0−.75) + (.25)(0−.75)(1−.75) +
+ (.25)(1−.75)(0−.75) + (.5)(1−.75)(1−.75) = −.0625.
A standardized covariance yields the correlation:
Corr(X, Y) = Cov(X, Y)/√(Var(X)·Var(Y)) = −.0625/.1875 = −1/3.
The intuition underlying this result is that the correlation involves only the first two moments [mean, variance, covariance]
of and but independence is defined in terms of the density functions; the latter, in principle, involves all moments,
not just the first two!
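For the sceptical reader, these numbers are easy to verify. The following sketch (Python with numpy, our choice of tool; the array merely transcribes the joint density (3)) recomputes the moments, the covariance and the correlation:

```python
import numpy as np

# Joint density (3): rows are x in {0,1}, columns are y in {0,1}.
joint = np.array([[0.00, 0.25],
                  [0.25, 0.50]])
vals = np.array([0, 1])

fx = joint.sum(axis=1)               # marginal of X: [.25, .75]
fy = joint.sum(axis=0)               # marginal of Y: [.25, .75]
EX, EY = vals @ fx, vals @ fy        # both equal .75
VX = ((vals - EX) ** 2) @ fx         # .1875
VY = ((vals - EY) ** 2) @ fy         # .1875

# Cov(X,Y) = sum over all cells of (x - E(X))(y - E(Y)) f(x, y)
cov = sum((x - EX) * (y - EY) * joint[i, j]
          for i, x in enumerate(vals) for j, y in enumerate(vals))
print(cov, cov / np.sqrt(VX * VY))   # -0.0625, -0.333... = -1/3
```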
Identically Distributed. Two r.v.'s X and Y are said to be Identically Distributed iff:
f_x(x; θ) = f_y(y; θ) for all values (x, y)∈R_X × R_Y.  (5)
Example. In the case of the joint distribution in (3) we can show that the r.v.'s X and Y are identically distributed because (5) holds. In particular, both r.v.'s take the same values with the same probabilities.
To shed further light on the notion of IID, consider the
three bivariate distributions given below.
Table (A):
y\x      1     2    f_y(y)
0       .18   .42    .6
2       .12   .28    .4
f_x(x)  .3    .7      1

Table (B):
y\x      0     1    f_y(y)
0       .18   .42    .6
1       .12   .28    .4
f_x(x)  .3    .7      1

Table (C):
y\x      0     1    f_y(y)
0       .36   .24    .6
1       .24   .16    .4
f_x(x)  .6    .4      1
(I) X and Y are Independent iff:
f(x, y) = f_x(x)·f_y(y) for all (x, y)∈R_X × R_Y.  (6)
(ID) X and Y are Identically Distributed iff:
f_x(u) = f_y(u) for all u∈R_X = R_Y.
The random variables X and Y are independent in all three cases since they satisfy (6) (verify!).
The random variables in (A) are not Identically Distributed because R_X ≠ R_Y and f_x(u) ≠ f_y(u) for some u.
The random variables in (B) are not Identically Distributed because, even though R_X = R_Y, f_x(u) ≠ f_y(u) for some u∈R_X.
Finally, the random variables in (C) are Identically Distributed because R_X = R_Y and f_x(u) = f_y(u) for all u∈R_X.
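These claims can also be checked mechanically. The sketch below (Python with numpy; the arrays transcribe tables (A)-(C), and the helper names are ours) verifies Independence cell by cell and compares the two marginal densities for the ID check:

```python
import numpy as np

def marginals(joint):
    # Column sums give f_x(x); row sums give f_y(y).
    return joint.sum(axis=0), joint.sum(axis=1)

def is_independent(joint):
    # Independence: f(x, y) = f_x(x) * f_y(y) in every cell.
    fx, fy = marginals(joint)
    return np.allclose(joint, np.outer(fy, fx))

# Rows are y-values, columns are x-values.
A = np.array([[0.18, 0.42], [0.12, 0.28]])   # R_X = {1,2}, R_Y = {0,2}
B = np.array([[0.18, 0.42], [0.12, 0.28]])   # R_X = R_Y = {0,1}
C = np.array([[0.36, 0.24], [0.24, 0.16]])   # R_X = R_Y = {0,1}

for name, tbl in [("A", A), ("B", B), ("C", C)]:
    fx, fy = marginals(tbl)
    print(name, is_independent(tbl), fx, fy)
# All three print True for independence; only (C) has f_x = f_y
# (with equal supports), so only (C) is also Identically Distributed.
```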
2 Point Estimation: an overview
It turns out that all forms of frequentist inference, which include point and interval estimation, hypothesis testing and prediction, are defined in terms of two sets:
X — sample space: the set of all possible values of the sample X;
Θ — parameter space: the set of all possible values of θ.
Note that the sample space X is always a subset of Rⁿ, denoted by Rⁿ_X.
In estimation the objective is to use the statistical information to infer the ‘true’ value θ* of the unknown parameter θ, whatever that happens to be, as long as it belongs to Θ.
In general, an estimator θ̂(X) of θ is a mapping (function) from the sample space to the parameter space:
θ̂(·): X → Θ.  (7)
Example 1. Let the statistical model of interest be the simple Bernoulli model (table 2) and consider the question of estimating the unknown parameter θ, whose parameter space is Θ:=[0, 1]. Note that the sample space is X:={0, 1}ⁿ.
Table 2 - Simple Bernoulli Model
Statistical GM: Xₖ = θ + uₖ, k∈N.
[1] Bernoulli: Xₖ ∼ Ber(θ), xₖ = 0, 1,
[2] constant mean: E(Xₖ) = θ, k∈N,
[3] constant variance: Var(Xₖ) = θ(1−θ), k∈N,
[4] Independence: {Xₖ, k∈N} is an independent process.
The notation θ̂(X) is used to denote an estimator in order to bring out the fact that it is a function of the sample X, and for different values x∈X it generates the sampling distribution f(θ̂(x); θ), x∈X. Post-data, θ̂(X) yields an estimate θ̂(x₀), which constitutes a particular value of θ̂(X) corresponding to data x₀. Crucial distinction: θ̂(X) - estimator (Plato’s world), θ̂(x₀) - estimate (real world), and θ - unknown constant (Plato’s world); Fisher (1922).
In light of the definition in (7), which of the following mappings constitute potential estimators of θ?
Table 3: Estimators of θ?
[a] θ̂₁(X) = Xₙ
[b] θ̂₂(X) = X₁ − Xₙ
[c] θ̂₃(X) = (X₁ + Xₙ)/2
[d] θ̂ₙ(X) = (1/n)Σₖ₌₁ⁿ Xₖ, for n ≥ 3
[e] θ̂ₙ₊₁(X) = (1/(n+1))Σₖ₌₁ⁿ Xₖ
Do the mappings [a]-[e] in table 3 constitute estimators of θ? All five functions [a]-[e] have X as their domain, but is the range of each mapping a subset of Θ:=[0, 1]? Mappings [a], [c]-[e] can be possible estimators of θ because their ranges are subsets of [0, 1], but [b] cannot, because it can take the value −1 [ensure you understand why!], which lies outside the parameter space of θ.
One can easily think of many more functions from X to Θ that will qualify as possible estimators of θ. Given the plethora of such possible estimators, how does one decide which one is the most appropriate?
To answer that question let us think about the possibility of an ideal estimator, θ*(·): X → {θ*}, i.e., θ*(x) = θ* for all values x∈X. That is, θ*(X) pinpoints the true value θ* of θ whatever the data. A moment’s reflection reveals that no such estimator could exist, because X is a random vector with its own distribution f(x; θ), x∈X. Moreover, in view of the randomness of X, any mapping of the form (7) will be a random variable with its own sampling distribution, f(θ̂(x); θ), which is directly derivable from f(x; θ).
Let us keep track of these distributions and where they come from. The distribution of the sample, f(x; θ), x∈X, is given by the assumptions of the statistical model in question.
In the above case of the simple Bernoulli model, we can combine assumptions [2]-[4] to give us:
f(x; θ) = ∏ₖ₌₁ⁿ f(xₖ; θ)    [by [2]-[4]]
and then use [1]: f(xₖ; θ) = θ^xₖ(1−θ)^(1−xₖ), xₖ = 0, 1, k = 1, 2, ..., n, to determine f(x; θ):
f(x; θ) = ∏ₖ₌₁ⁿ θ^xₖ(1−θ)^(1−xₖ) = θ^(Σₖ₌₁ⁿ xₖ)(1−θ)^(n−Σₖ₌₁ⁿ xₖ) = θ^y(1−θ)^(n−y)    [by [1]-[4]]
where y = Σₖ₌₁ⁿ xₖ, and one can show that:
Y := Σₖ₌₁ⁿ Xₖ ∼ Bin(nθ, nθ(1−θ)),  (8)
i.e. Y is Binomially distributed with mean nθ and variance nθ(1−θ). Note that the means and variances are derived using the two formulae:
(i) E(X₁ + X₂ + ··· + Xₙ) = E(X₁) + E(X₂) + ··· + E(Xₙ),
(ii) Var(X₁ + X₂ + ··· + Xₙ) = Var(X₁) + Var(X₂) + ··· + Var(Xₙ),  (9)
the latter holding for independent r.v.'s, as in [4].
To derive the mean and variance of Y:
(i) E(Y) = E(Σₖ₌₁ⁿ Xₖ) = Σₖ₌₁ⁿ E(Xₖ) = Σₖ₌₁ⁿ θ = nθ,
(ii) Var(Y) = Var(Σₖ₌₁ⁿ Xₖ) = Σₖ₌₁ⁿ Var(Xₖ) = Σₖ₌₁ⁿ θ(1−θ) = nθ(1−θ).
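A quick Monte Carlo check of (i)-(ii) (a sketch assuming numpy; the particular n, θ and number of replications are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, reps = 20, 0.3, 100_000

# Y = sum of n IID Bernoulli(theta) draws, replicated many times.
Y = rng.binomial(1, theta, size=(reps, n)).sum(axis=1)

print(Y.mean(), n * theta)               # both close to 6.0
print(Y.var(), n * theta * (1 - theta))  # both close to 4.2
```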
The result in (8) is a special case of a general result.
The sampling distribution of any (well-behaved) function of the sample, say Z = h(X₁, X₂, ..., Xₙ), can be derived from f(x; θ), x∈X, using the formula:
F_Z(z) = P(Z ≤ z) = ∫···∫_{x: h(x)≤z} f(x; θ)dx, z∈R.  (10)
In the Bernoulli case, all the estimators [a], [c]-[e] are linear functions of (X₁, X₂, ..., Xₙ) and thus, by (8), their distribution is Binomial. In particular,
Table 4: Estimators and their sampling distributions
[a] θ̂₁(X) = Xₙ ∼ Ber(θ, θ(1−θ))
[c] θ̂₃(X) = (X₁+Xₙ)/2 ∼ Bin(θ, θ(1−θ)/2)
[d] θ̂ₙ(X) = (1/n)Σₖ₌₁ⁿ Xₖ ∼ Bin(θ, θ(1−θ)/n), for n ≥ 3
[e] θ̂ₙ₊₁(X) = (1/(n+1))Σₖ₌₁ⁿ Xₖ ∼ Bin(nθ/(n+1), nθ(1−θ)/(n+1)²)        (11)
where Bin(m, v) denotes a (scaled) Binomial distribution with mean m and variance v. It is important to emphasize at the outset that the sampling distributions in table 4 are evaluated under θ=θ*, where θ* is the true value of θ.
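The entries of table 4 can be checked by simulating the sampling distributions of the estimators (a sketch assuming numpy; n=10 and θ=.5 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, reps = 10, 0.5, 200_000
X = rng.binomial(1, theta, size=(reps, n))   # reps samples of size n

estimators = {
    "[a] X_n":          X[:, -1],
    "[c] (X_1+X_n)/2":  (X[:, 0] + X[:, -1]) / 2,
    "[d] mean":         X.mean(axis=1),
    "[e] sum/(n+1)":    X.sum(axis=1) / (n + 1),
}
for name, t in estimators.items():
    print(f"{name}: mean={t.mean():.3f}, var={t.var():.4f}")
# Table 4 predicts means .5, .5, .5, n*theta/(n+1) = .4545 and
# variances .25, .125, .025, n*theta*(1-theta)/(n+1)^2 = .0207.
```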
It is clear that none of the sampling distributions of the estimators in table 4 resembles that of the ideal estimator, θ*(X), whose sampling distribution, if it exists, would be of the form:
[i] P(θ*(X) = θ*) = 1.  (12)
In terms of its first two moments, the ideal estimator satisfies [ii] E(θ*(X)) = θ* and [iii] Var(θ*(X)) = 0. In contrast to the (infeasible) ideal estimator in (12), when the estimators in table 4 infer θ using an outcome x, the inference is always subject to some error because the variance is not zero. The sampling distributions of these estimators provide the basis for evaluating such errors.
In the statistics literature the evaluation of inferential
errors in estimation is accomplished in two interconnected
stages.
The objective of the first stage is to narrow down the set of all possible estimators of θ to an optimal subset, where optimality is assessed by how closely the sampling distribution of an estimator approximates that of the ideal estimator in (12); the subject matter of section 3.
The second stage is concerned with using optimal estimators to construct the shortest Confidence Intervals (CIs) for the unknown parameter θ, based on prespecifying the probability of covering (encompassing) θ* with a random interval of the form (L(X), U(X)); the subject matter of section 4.
3 Properties of point estimators
As mentioned above, the notion of an optimal estimator can be motivated by how well the sampling distribution of an estimator θ̂(X) approximates that of the ideal estimator in (12). In particular, the three features [i]-[iii] of the ideal estimator motivate the following optimal properties of feasible estimators.
Condition [ii] motivates the property known as:
[I] Unbiasedness: An estimator θ̂(X) is said to be unbiased for θ if:
E(θ̂(X)) = θ*.  (13)
That is, the mean of the sampling distribution of θ̂(X) coincides with the true value of the unknown parameter θ.
Example. In the case of the simple Bernoulli model, we can see from table 4 that the estimators θ̂₁(X), θ̂₃(X) and θ̂ₙ(X) are unbiased, since in all three cases (13) is satisfied. In contrast, the estimator θ̂ₙ₊₁(X) is not unbiased because E(θ̂ₙ₊₁(X)) = nθ/(n+1) ≠ θ.
Condition [iii] motivates the property known as:
[II] Full Efficiency: An unbiased estimator θ̂(X) is said to be a fully efficient estimator of θ if its variance is as small as it can be, where the latter is expressed by:
Var(θ̂(X)) = CR(θ) := [−E(∂² ln f(x; θ)/∂θ²)]⁻¹,
where ‘CR(θ)’ stands for the Cramer-Rao lower bound; note that f(x; θ) is given by the assumed model.
Example (the derivations are not important!). In the case of the simple Bernoulli model:
ln f(x; θ) = y ln θ + (n−y) ln(1−θ), where y = Σₖ₌₁ⁿ xₖ, E(Y) = nθ,
∂ ln f(x; θ)/∂θ = y(1/θ) − (n−y)(1/(1−θ)),
∂² ln f(x; θ)/∂θ² = −y(1/θ²) − (n−y)(1/(1−θ))²,
−E(∂² ln f(x; θ)/∂θ²) = (1/θ²)E(Y) + [n − E(Y)](1/(1−θ))² = n/θ + n/(1−θ) = n/[θ(1−θ)],
and thus the Cramer-Rao lower bound is:
CR(θ) := θ(1−θ)/n.
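Readers who want to retrace the derivation can let a computer algebra system do the bookkeeping (a sketch assuming sympy; y stands in for Σₖ₌₁ⁿ xₖ):

```python
import sympy as sp

theta, n, y = sp.symbols('theta n y', positive=True)

# Log-likelihood of the Bernoulli sample, with y = sum of the x_k.
logf = y * sp.log(theta) + (n - y) * sp.log(1 - theta)

second = sp.diff(logf, theta, 2)            # second derivative in theta
# Fisher information: -E[second derivative], using E(Y) = n*theta.
info = sp.simplify(-second.subs(y, n * theta))
print(sp.simplify(1 / info))                # theta*(1 - theta)/n
```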
Looking at the estimators of θ in table 4, it is clear that only one unbiased estimator achieves that bound, θ̂ₙ(X), since Var(θ̂ₙ(X)) = θ(1−θ)/n = CR(θ). Hence, θ̂ₙ(X) is the only estimator of θ which is both unbiased and fully efficient.
Comparisons between unbiased estimators can be made in terms of relative efficiency:
Var(θ̂₃(X)) < Var(θ̂₁(X)), for n ≥ 2,
asserting that θ̂₃(X) is relatively more efficient than θ̂₁(X). But one needs to be careful with such comparisons because they can be very misleading when both estimators are bad, as in the case above; the fact that θ̂₃(X) is relatively more efficient than θ̂₁(X) does not mean that the former is even an adequate estimator. Hence, relative efficiency is not something to write home about!
What renders these two estimators practically useless? An asymptotic property motivated by condition [i] of the ideal estimator, known as consistency.
Intuitively, an estimator θ̂ₙ(X) is consistent when its precision (how close it is to θ*) improves as the sample size n increases. Condition [i] of the ideal estimator motivates the property known as:
[III] Consistency: an estimator θ̂ₙ(X) is consistent if:
Strong: P(limₙ→∞ θ̂ₙ(X) = θ*) = 1,
Weak: limₙ→∞ P(|θ̂ₙ(X) − θ*| ≤ ε) = 1, for any ε > 0.  (14)
That is, an estimator θ̂ₙ(X) is consistent if it approximates (probabilistically) the sampling distribution of the ideal estimator asymptotically, as n → ∞. The difference between strong and weak consistency stems from the form of probabilistic convergence they involve, with the former being stronger than the latter. Both of these properties constitute an extension of the Strong and Weak Law of Large Numbers (LLN), which hold for the sample mean X̄ₙ = (1/n)Σₖ₌₁ⁿ Xₖ of a process {Xₖ, k = 1, 2, ...} under certain probabilistic assumptions, the most restrictive being that the process is IID; see Spanos (1999), ch. 8.
[Figs. 1-2: t-plots of the sample average X̄ₙ (vertical axis) against the observation index (horizontal axis) for Bernoulli IID realizations with n=200 (Fig. 1) and n=1000 (Fig. 2).]
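Plots like figures 1-2 are easy to regenerate (a sketch assuming numpy and matplotlib; the seed and θ=.5 are arbitrary). The running sample average settles down near the true θ as n grows, which is the LLN at work:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
for ax, n in zip(axes, (200, 1000)):
    x = rng.binomial(1, 0.5, size=n)                # Bernoulli IID draws
    running = np.cumsum(x) / np.arange(1, n + 1)    # X-bar after t draws
    ax.plot(running)
    ax.axhline(0.5, linestyle='--')                 # true theta
    ax.set(title=f'n={n}', xlabel='Index', ylabel='Sample average')
plt.show()
```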
In practice, it is non-trivial to prove that a particular estimator is consistent or not by verifying directly the conditions in (14). However, there is often a short-cut for verifying consistency in the case of unbiased estimators, using the sufficient condition:
limₙ→∞ Var(θ̂ₙ(X)) = 0.  (15)
Example. In the case of the simple Bernoulli model, one can verify that the estimators θ̂₁(X) and θ̂₃(X) are inconsistent because:
limₙ→∞ Var(θ̂₁(X)) = θ(1−θ) ≠ 0, limₙ→∞ Var(θ̂₃(X)) = θ(1−θ)/2 ≠ 0,
i.e. their variances do not decrease to zero as the sample size goes to infinity.
In contrast, the estimators θ̂ₙ(X) and θ̂ₙ₊₁(X) are consistent because:
limₙ→∞ Var(θ̂ₙ(X)) = limₙ→∞ θ(1−θ)/n = 0, limₙ→∞ MSE(θ̂ₙ₊₁(X); θ*) = 0.
Note that ‘MSE’ denotes the ‘Mean Square Error’, defined by:
MSE(θ̂; θ*) = Var(θ̂) + [B(θ̂; θ*)]²,
where B(θ̂; θ*) = E(θ̂) − θ* is the bias. Hence:
limₙ→∞ MSE(θ̂; θ*) = 0 if (a) limₙ→∞ Var(θ̂) = 0 and (b) limₙ→∞ E(θ̂) = θ*,
where (b) is equivalent to limₙ→∞ B(θ̂; θ*) = 0. For θ̂ₙ₊₁(X) both conditions hold, since B(θ̂ₙ₊₁; θ*) = −θ*/(n+1) → 0 and Var(θ̂ₙ₊₁(X)) = nθ*(1−θ*)/(n+1)² → 0.
Let us take stock of the above properties and how they can be used by the practitioner in deciding which estimator is optimal. The property which defines minimal reliability for an estimator is that of consistency. Intuitively, consistency indicates that as the sample size increases [as n → ∞] the estimator θ̂ₙ(X) approaches θ*, the true value of θ, in some probabilistic sense: convergence almost surely or convergence in probability. Hence, if an estimator θ̂ₙ(X) is not consistent, it is automatically excluded from the subset of potentially optimal estimators, irrespective of any other properties this estimator might enjoy. In particular, an unbiased estimator which is inconsistent is practically useless. On the other hand, just because an estimator θ̂ₙ(X) is consistent does not imply that it’s a ‘good’ estimator; it only implies that it’s minimally acceptable.
It is important to emphasize that the properties of unbiasedness and full efficiency hold for any sample size n ≥ 1, and thus we call them finite sample properties, but consistency is an asymptotic property because it holds as n → ∞.
Example. In the case of the simple Bernoulli model, if the choice between estimators is confined (artificially) to the estimators θ̂₁(X), θ̂₃(X) and θ̂ₙ₊₁(X), the latter estimator should be chosen, despite being biased, because it is a consistent estimator of θ. On the other hand, among the estimators given in table 4, θ̂ₙ(X) is clearly the best (most optimal) because it satisfies all three properties. In particular, θ̂ₙ(X) not only satisfies the minimal property of consistency, but it also has the smallest variance possible, which means that it comes closer to the ideal estimator than any of the others, for any sample size n ≥ 2. The sampling distribution of θ̂ₙ(X), when evaluated under θ=θ*, takes the form:
[d] θ̂ₙ(X) = (1/n)Σₖ₌₁ⁿ Xₖ ∼ Bin(θ*, θ*(1−θ*)/n),  (16)
whatever the ‘true’ value θ* happens to be.
Additional asymptotic properties
In addition to the properties of estimators mentioned above, there are certain other properties which are often used in practice to decide on the optimality of an estimator. The most important is given below for completeness.
[V] Asymptotic Normality: an estimator θ̂ₙ(X) is said to be asymptotically Normal if:
√n(θ̂ₙ(X) − θ) ∼ N(0, V∞(θ)), V∞(θ) ≠ 0,  (17)
where ‘∼’ here stands for ‘can be asymptotically approximated by’.
This property is an extension of a well-known result in probability theory: the Central Limit Theorem (CLT). The CLT asserts that, under certain probabilistic assumptions on the process {Xₖ, k = 1, 2, ...}, the most restrictive being that the process is IID, the sampling distribution of X̄ₙ = (1/n)Σₖ₌₁ⁿ Xₖ for a ‘large enough’ n can be approximated by the Normal distribution (Spanos, 1999, ch. 8):
√n(X̄ₙ − E(Xₖ))/√Var(Xₖ) ∼ N(0, 1).  (18)
Note that the important difference between (17) and (18) is that θ̂ₙ(X) in the former does not have to coincide with X̄ₙ; it can be any well-behaved function h(X) of the sample X.
Example. In the case of the simple Bernoulli model, the sampling distribution of θ̂ₙ(X), which we know is Binomial (see (16)), can also be approximated using (18). In the graphs below we compare the Normal approximation to the Binomial for n=10 and n=20 in the case where θ=.5, and the improvement is clearly noticeable.
[Normal approx. of Bin.: density plots of f(y; θ=.5, n=10) (left) and f(y; θ=.5, n=20) (right), each overlaid with the Normal density matching its mean and standard deviation.]
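The quality of the approximation can be quantified rather than eyeballed (a sketch assuming scipy; it compares the Binomial probabilities with the Normal density matched to the same mean and variance):

```python
import numpy as np
from scipy import stats

theta = 0.5
for n in (10, 20):
    k = np.arange(n + 1)
    binom = stats.binom.pmf(k, n, theta)
    # Normal density with the Binomial's mean and standard deviation.
    normal = stats.norm.pdf(k, n * theta, np.sqrt(n * theta * (1 - theta)))
    print(f"n={n}: max discrepancy = {np.abs(binom - normal).max():.4f}")
# The maximal discrepancy shrinks as n grows, matching the graphs above.
```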
4 Confidence Intervals (CIs): an overview
4.1 An optimal CI begins with an optimal point estimator
Example 2. Let us summarize the discussion concerning point estimation by briefly discussing the simple (one parameter) Normal model, where σ²=1 (table 5).
Table 5 - Simple Normal Model (one unknown parameter)
Statistical GM: Xₖ = μ + uₖ, k∈N={1, 2, ...}
[1] Normality: Xₖ ∼ N(·, ·), xₖ∈R,
[2] Constant mean: E(Xₖ) = μ, k∈N,
[3] Constant variance: Var(Xₖ) = σ² (known), k∈N,
[4] Independence: {Xₖ, k∈N} independent process.
In section 3 we discussed the question of choosing among numerous possible estimators of μ, such as [a]-[e] (table 6), using their sampling distributions. These results stem from the following theorem. If X:=(X₁, X₂, ..., Xₙ) is a random (IID) sample from the Normal distribution, i.e.
Xₖ ∼ NIID(μ, σ²), k∈N:=(1, 2, ...),
then the sampling distribution of Σₖ₌₁ⁿ Xₖ is:
Σₖ₌₁ⁿ Xₖ ∼ N(nμ, nσ²).  (19)
Among the above estimators the sample mean, for σ²=1:
μ̂ₙ(X) := X̄ₙ = (1/n)Σₖ₌₁ⁿ Xₖ ∼ N(μ, 1/n),
constitutes the optimal point estimator of μ because it is:
[U] Unbiased (E(X̄ₙ) = μ*),
[FE] Fully Efficient (Var(X̄ₙ) = CR(μ)), and
[SC] Strongly Consistent (P(limₙ→∞ X̄ₙ = μ*) = 1).
Table 6: Estimators
                                                           UN   FE   SC
[a] μ̂₁(X) = Xₙ ∼ N(μ, 1)                                   ✓    ×    ×
[b] μ̂₂(X) = X₁ − Xₙ ∼ N(0, 2)                              ×    ×    ×
[c] μ̂₃(X) = (X₁+Xₙ)/2 ∼ N(μ, 1/2)                          ✓    ×    ×
[d] μ̂ₙ(X) = (1/n)Σₖ₌₁ⁿ Xₖ ∼ N(μ, 1/n)                      ✓    ✓    ✓
[e] μ̂ₙ₊₁(X) = (1/(n+1))Σₖ₌₁ⁿ Xₖ ∼ N(nμ/(n+1), n/(n+1)²)    ×    ×    ✓
Given that any ‘decent’ estimator μ̂(X) of μ is likely to yield any value in the interval (−∞, ∞), can one say something more about its reliability than just that "on average" its values μ̂(x), for x∈X, are more likely to occur around μ* (the true value) than those further away?
4.2 What is a Confidence Interval?
This is what a Confidence Interval (CI) proposes to address. In general, a (1−α) CI for μ takes the generic form:
P(L(X) ≤ μ* ≤ U(X)) = 1−α,
where L(X) and U(X) denote the lower and upper (random) bounds of this CI. The (1−α) is referred to as the confidence level and represents the coverage probability of the CI:
CI(X; α) = (L(X), U(X)),
in the sense that the probability that the random interval CI(X; α) covers (overlays) the true μ* is equal to (1−α).
This is often envisioned in terms of a long-run metaphor of repeating the experiment underlying the statistical model in question in order to get a sequence of outcomes (realizations of X), xᵢ, i = 1, 2, ..., N, each of which will yield an observed CI(xᵢ; α). In the context of this metaphor, (1−α) denotes the relative frequency of the observed CIs that will include (overlay) μ*.
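The long-run metaphor can be made concrete by simulation (a sketch assuming numpy; μ*=0, n=100 and N=10,000 replications are arbitrary choices). Roughly 95% of the realized intervals should cover μ*:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_star, n, reps, z = 0.0, 100, 10_000, 1.96

covered = 0
for _ in range(reps):
    x = rng.normal(mu_star, 1.0, size=n)   # sigma = 1 is known here
    xbar = x.mean()
    covered += (xbar - z / np.sqrt(n) <= mu_star <= xbar + z / np.sqrt(n))

print(covered / reps)   # relative frequency of coverage, close to .95
```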
Example 2 (continued). In the case of the simple (one parameter) Normal model (table 5), let us consider the question of constructing .95 CIs using the different unbiased estimators of μ in table 6:
[a] P(μ̂₁(X) − 1.96 ≤ μ* ≤ μ̂₁(X) + 1.96) = .95,
[c] P(μ̂₃(X) − 1.96(1/√2) ≤ μ* ≤ μ̂₃(X) + 1.96(1/√2)) = .95,
[d] P(X̄ₙ − 1.96(1/√n) ≤ μ* ≤ X̄ₙ + 1.96(1/√n)) = .95.  (20)
How do these CIs differ? The answer is in terms of their precision (accuracy). One way to measure precision for CIs is to evaluate their length:
[a]: 2(1.96) = 3.92, [c]: 2(1.96/√2) = 2.772, [d]: 2(1.96/√n) = 3.92/√n.
It is clear from this evaluation that the CI associated with X̄ₙ = (1/n)Σₖ₌₁ⁿ Xₖ is the shortest for any n > 2; e.g. for n=100 the length of this CI is 3.92/√100 = .392.
4.3 Constructing Confidence Intervals (CIs)
More generally, the sampling distribution of an optimal estimator gives rise to a pivot (a function of the sample and μ whose distribution is known):
√n(X̄ₙ − μ) ∼ N(0, 1) under μ=μ*,  (21)
which can be used to construct the shortest CI among all (1−α) CIs for μ:
P(X̄ₙ − c_{α/2}(1/√n) ≤ μ* ≤ X̄ₙ + c_{α/2}(1/√n)) = 1−α,  (22)
where P(|Z| ≤ c_{α/2}) = 1−α for Z ∼ N(0, 1) (figures 1-2).
Example 3. In the case where σ² is unknown, and we use s² = [1/(n−1)]Σₖ₌₁ⁿ(Xₖ − X̄ₙ)² to estimate it, the pivot in (21) takes the form:
√n(X̄ₙ − μ)/s ∼ St(n−1) under μ=μ*,  (23)
where St(n−1) denotes the Student’s t distribution with (n−1) degrees of freedom.
Step 1. Attach a (1−α) coverage probability using (23):
P(−c_{α/2} ≤ √n(X̄ₙ − μ)/s ≤ c_{α/2}) = 1−α,
where P(|τ| ≤ c_{α/2}) = 1−α for τ ∼ St(n−1).
Step 2. Re-arrange √n(X̄ₙ − μ)/s to isolate μ and derive the CI:
P(−c_{α/2} ≤ √n(X̄ₙ − μ)/s ≤ c_{α/2}) = P(−c_{α/2}(s/√n) ≤ X̄ₙ − μ ≤ c_{α/2}(s/√n)) =
= P(−X̄ₙ − c_{α/2}(s/√n) ≤ −μ ≤ −X̄ₙ + c_{α/2}(s/√n)) =
= P(X̄ₙ − c_{α/2}(s/√n) ≤ μ* ≤ X̄ₙ + c_{α/2}(s/√n)) = 1−α.
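Post-data, the Student’s t CI of Step 2 is a short computation (a sketch assuming numpy and scipy; the simulated data merely stand in for some observed x₀):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=2.0, size=20)    # stand-in data, n = 20

n, xbar, s = len(x), x.mean(), x.std(ddof=1)   # s^2 uses 1/(n-1)
c = stats.t.ppf(0.975, df=n - 1)               # c_{alpha/2} for 1-alpha=.95
print(xbar - c * s / np.sqrt(n), xbar + c * s / np.sqrt(n))
```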
In figures 1-2 the underlying distribution is Normal and in figures 3-4 it is Student’s t with 19 degrees of freedom. One can see that, while the tail areas are the same for each α, the threshold values c_{α/2} for the Normal are smaller than the corresponding values for the Student’s t, because the latter has heavier tails due to the randomness of s².
[Fig. 1: P(|Z| ≤ 1.96) = .95 for Z ∼ N(0, 1). Fig. 2: P(|Z| ≤ 1.64) = .90 for Z ∼ N(0, 1). Fig. 3: P(|τ| ≤ 2.09) = .95 for τ ∼ St(19). Fig. 4: P(|τ| ≤ 1.72) = .90 for τ ∼ St(19).]
5 Summary and conclusions
The primary objective in frequentist estimation is to learn about θ*, the true value of the unknown parameter of interest, using the sampling distribution f(θ̂; θ*) associated with a particular sample size n. The finite sample properties are defined directly in terms of f(θ̂; θ*), and the asymptotic properties are defined in terms of the asymptotic sampling distribution f∞(θ̂; θ*), aiming to approximate f(θ̂; θ*) at the limit as n → ∞.
The question that needs to be considered at this stage is: what combination of the above mentioned properties specifies an ‘optimal’ estimator?
A necessary but minimal property for an estimator is consistency (preferably strong). By itself, however, consistency does not secure learning from data for a given n; it is a promissory note for potential learning. Hence, for actual learning one needs to supplement consistency with certain finite sample properties, like unbiasedness and efficiency, to ensure that learning can take place with the particular data x₀:=(x₁, x₂, ..., xₙ) of sample size n.
Among finite sample properties, full efficiency is clearly the most important, because it secures the highest degree of learning for a given n, since it offers the best possible precision. Relative efficiency, although desirable, needs to be investigated further to find out how large the class of estimators being compared is before passing judgement. Being the best econometrician in my family, although worthy of something, does not make me a good econometrician!!
Unbiasedness, although desirable, is not considered indispensable by itself. Indeed, as shown above, an unbiased
but inconsistent estimator is practically useless, and a consistent but biased estimator is always preferable.
Hence, a consistent, unbiased and fully efficient estimator sets the gold standard in estimation.
In conclusion, it is important to emphasize that point estimation is often considered inadequate for the purposes of scientific inquiry because a ‘good’ point estimator θ̂(X), by itself, does not provide any measure of the reliability and precision associated with the estimate θ̂(x₀); one would be wrong to assume that θ̂(x₀) ≃ θ*. This is the reason why θ̂(x₀) is often accompanied by its standard error [the estimated standard deviation √Var(θ̂(X))] or the p-value of some test of significance associated with the generic hypothesis θ = θ₀.
Interval estimation rectifies this weakness of point estimation by providing the relevant error probabilities associated with inferences pertaining to ‘covering’ the true value θ* of θ.