Point Estimate, Confidence Interval, Hypothesis tests
1. Statistics Lab
Rodolfo Metulini
IMT Institute for Advanced Studies, Lucca, Italy
Lesson 3 - Point Estimate, Confidence Interval and Hypothesis
Tests - 16.01.2014
2. Introduction
Let’s start with empirical data (one variable of length N)
read from an external file, and suppose we treat it as the
population. We define a sample of size n.
Suppose we have no information on the population (or, better, we
want to check whether and how well the sample represents the
population).
In other words, we want to make inference using the
information contained in the sample, in order to obtain an
estimate for the population.
That sample is one of several samples we can randomly draw from
the population (the sample space).
What are the instruments for obtaining information about the population?
(1) Sample mean (point estimation) (2) Confidence interval (3)
Hypothesis tests
3. Sample space
In probability theory, the sample space of an experiment or random
trial is the set of all possible outcomes or results of that
experiment.
It is common to refer to a sample space by the labels S, Ω, or
U.
For example, for tossing two coins, the corresponding sample space
would be {(head,head), (head,tail), (tail,head), (tail,tail)}, so that
its size is 4: dim(Ω) = 4. It means that we can obtain 4
different samples, each with its corresponding sample
mean.
In practice, we face only one sample, taken at random from
the sample space.
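As a small illustration, the two-coin sample space can be enumerated in R. This is only a sketch: the names `coin`, `omega` and `sample_means`, and the coding head = 1, tail = 0, are assumptions made for the example.

```r
# Enumerate every outcome of tossing two coins.
coin  <- c("head", "tail")
omega <- expand.grid(first = coin, second = coin, stringsAsFactors = FALSE)
print(omega)
nrow(omega)    # number of elements in the sample space

# With head coded as 1 and tail as 0 (an assumption for this sketch),
# each outcome has its own sample mean: the share of heads in the row.
sample_means <- rowMeans(omega == "head")
sample_means
```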
4. Point estimate
A point estimate permits us to summarize the information contained
in the population (dimension N) through only 1 value
constructed using n values.
The most used unbiased point estimator is the sample mean.
$\hat{X}_n = \frac{\sum_{i=1}^{n} x_i}{n}$
Other point estimators are: (1) Sample Median (2) Sample Mode
(3) Geometric mean.
Geometric Mean = $M_g = \left( \prod_{i=1}^{n} x_i \right)^{1/n} = \exp\left[ \frac{1}{n} \sum_{i=1}^{n} \ln x_i \right]$
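The point estimators above can be computed directly in R. The sketch below uses a hypothetical positive-valued sample `x` and checks that the product form and the log form of the geometric mean agree.

```r
# Hypothetical sample (values must be positive for the geometric mean).
x <- c(1, 2, 4, 8)
n <- length(x)

sample_mean   <- sum(x) / n            # same as mean(x)
sample_median <- median(x)

# Geometric mean, computed in the two equivalent ways.
geo_product <- prod(x)^(1 / n)
geo_logs    <- exp(mean(log(x)))

c(mean = sample_mean, median = sample_median,
  geometric = geo_product, geometric_via_logs = geo_logs)
```

The log form is usually preferable for long samples, because the raw product can overflow or underflow.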
An example of what is not an estimator is the sample mean
computed after subsetting the sample by truncating it at a certain
value.
P.S. A naive definition of estimator: a quantity
computed using all the n observations in the sample.
5. Efficient estimators
The BLUE (Best Linear Unbiased Estimator) is defined as
follows:
1. is a linear function of all the sample values
2. is unbiased ($E(\hat{X}_n) = \theta$)
3. has the smallest variance among all linear unbiased
estimators.
The sample mean is BLUE for the parameter µ
Some estimators are biased but consistent: an estimator is
consistent when it converges to the true parameter value as n −→ ∞,
so any bias it has for finite n vanishes as the sample grows.
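A standard textbook case of a biased but consistent estimator, not taken from the slides, is the variance estimator that divides by n instead of n − 1. The sketch below uses a hypothetical sample and a helper `biased_var` introduced only for this example.

```r
# Variance estimator with divisor n (biased) versus var(), which uses n - 1.
biased_var <- function(x) mean((x - mean(x))^2)

x <- c(2, 4, 4, 5, 7, 9)     # hypothetical sample
n <- length(x)
c(divisor_n = biased_var(x), divisor_n_minus_1 = var(x))

# The two estimators differ exactly by the factor (n - 1) / n,
# which approaches 1 as the sample size grows.
sizes <- c(5, 50, 500, 5000)
data.frame(n = sizes, shrink_factor = (sizes - 1) / sizes)
```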
6. Point estimators - cases
Normal samples: $\hat{X}_n$ is the BLUE estimator for the µ parameter
(mean)
Bernoulli samples, $f(x) = \rho^{x} (1 - \rho)^{1-x}$: $\hat{X}_n$ is an unbiased
estimator for the ρ parameter (frequency)
Poisson samples, $f(x) = \frac{e^{-k} k^{x}}{x!}$: $\hat{X}_n$ is an unbiased estimator
for the k parameter (which represents both mean and variance of
the distribution)
Exponential samples, $f(x) = \lambda e^{-\lambda x}$: $\frac{1}{\hat{X}_n}$ is an
estimator for the λ parameter (density at value 0); it is biased in
small samples but consistent
7. Confidence interval theory
With point estimators we make use of only one value to infer
about the population.
With a confidence interval we define a minimum and a maximum
value between which we expect the population parameter to lie.
Formally, we need to calculate:
$\mu_1 = \hat{X}_n - z \cdot \frac{\sigma}{\sqrt{n}}$
$\mu_2 = \hat{X}_n + z \cdot \frac{\sigma}{\sqrt{n}}$
and we end up with the interval $\mu = \{\mu_1 ; \mu_2\}$
Here: $\hat{X}_n$ is the sample mean; z is the upper (or lower) critical
value of the theoretical distribution; σ is the standard deviation of
the theoretical distribution; n is the sample size.
(See the graph)
8. Confidence interval theory - Gaussian
We will make some assumptions for what we might find in an
experiment and find the resulting confidence interval using a
normal distribution.
Let us assume that the sample mean is 5, the standard deviation in
the population is known and equal to 2, and the sample size is
n = 20. In the example below we will use a 95 per cent confidence
level and wish to find the confidence interval.
N.B. Here, since the confidence level is 95 per cent, the z (the critical
value) to consider is the one at which the CDF (i.e. pnorm)
equals 0.975; in R it is returned by the quantile function qnorm(0.975).
We also can speak of α = 0.05, or 1 − α = 0.95, or
1 − α/2 = 0.975
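With the numbers given in this slide, the interval can be computed as in the sketch below. The variable names are chosen for the example; `qnorm` is the normal quantile function in base R.

```r
# Confidence interval for the mean with known population sd.
x_bar <- 5       # sample mean
sigma <- 2       # known population standard deviation
n     <- 20      # sample size
alpha <- 0.05    # 1 - alpha = 0.95 confidence level

z      <- qnorm(1 - alpha / 2)     # critical value, about 1.96
margin <- z * sigma / sqrt(n)      # half-width of the interval
ci     <- c(lower = x_bar - margin, upper = x_bar + margin)
ci
```

The interval comes out at roughly 4.12 to 5.88.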
9. Confidence interval theory - T-student
We use the Student t distribution when n is small and the sd is
unknown in the population. We need to use a sample estimate of the
standard deviation: $\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \hat{X}_n)^2}{n - 1}}$
The Student t distribution is more spread out than the normal.
In simple words, since we do not know the population sd, we need
wider intervals (a cautious approach).
The only difference from the normal case is that we use the
command associated with the t-distribution rather than the normal
distribution. Here we repeat the procedures above, but we will
assume that we are working with a sample standard deviation
rather than an exact standard deviation.
N.B. The t distribution is characterized by its degrees of freedom. In
this test the degrees of freedom are equal to n − 1, because we use 1
estimate (1 constraint)
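A sketch of the same interval as in the Gaussian case, now treating the value 2 as a sample standard deviation (an assumption carried over from the previous example). `qt` is the Student t quantile function in base R.

```r
# Confidence interval for the mean with estimated sd (Student t).
x_bar <- 5       # sample mean
s     <- 2       # sample standard deviation (assumed value)
n     <- 20      # sample size
alpha <- 0.05

t_crit <- qt(1 - alpha / 2, df = n - 1)   # t critical value, n - 1 df
z_crit <- qnorm(1 - alpha / 2)            # normal critical value, for comparison

margin_t <- t_crit * s / sqrt(n)
ci_t     <- c(lower = x_bar - margin_t, upper = x_bar + margin_t)

c(t_critical = t_crit, z_critical = z_crit)
ci_t
```

The t critical value is larger than the normal one, so the resulting interval is wider.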
10. Confidence interval theory - comparison of two means
In some cases we have an experiment of the (for example)
case-control type.
Let’s imagine the population split in 2: one is the
treated group, the second is the untreated group.
Suppose we extract two samples from them with the aim of testing whether the
two samples come from populations with the same mean
parameter (is the treatment effective?)
The output of this test will be a confidence interval for the
difference between the two means.
N.B. Here, the degrees of freedom of the t-distribution are equal to
min(n1 , n2 ) − 1
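A minimal sketch of such an interval, under assumptions not stated in the slides: the data are hypothetical, and the standard error uses the unequal-variance form sqrt(s1²/n1 + s2²/n2). Only the degrees of freedom, min(n1, n2) − 1, come from the text.

```r
# Hypothetical treated and untreated samples.
treated <- c(5.1, 4.8, 6.0, 5.5, 5.9, 6.2, 5.4, 5.7)
control <- c(4.9, 4.4, 5.2, 5.0, 4.7, 5.3)

n1 <- length(treated)
n2 <- length(control)
diff_means <- mean(treated) - mean(control)

# Assumed standard error of the difference (unequal variances form).
se <- sqrt(var(treated) / n1 + var(control) / n2)

# Degrees of freedom as stated in the slide.
df     <- min(n1, n2) - 1
t_crit <- qt(0.975, df = df)

ci_diff <- c(lower = diff_means - t_crit * se,
             upper = diff_means + t_crit * se)
ci_diff
```

If the interval excludes 0, the two means differ at the 95 per cent level. R's built-in `t.test(treated, control)` uses the Welch approximation for the degrees of freedom, so its interval is generally a little narrower than this conservative one.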
12. Hypothesis testing
Researchers retain or reject hypotheses based on measurements of
observed samples.
The decision is often based on a statistical mechanism called
hypothesis testing.
A type I error is the mishap of falsely rejecting a null hypothesis
when the null hypothesis is true.
The probability of committing a type I error is called the
significance level of the hypothesis testing, and is denoted by the
Greek letter α (the same used in the confidence intervals).
We demonstrate the procedure of hypothesis testing in R first with
the intuitive critical value approach.
Then we discuss the popular (and very quick) p-value approach
as an alternative.
13. Hypothesis testing - lower tail
The null hypothesis of the lower tail test of the population mean
can be expressed as follows:
µ ≥ µ0 ; where µ0 is a hypothesized lower bound of the true
population mean µ.
Let us define the test statistic z in terms of the sample mean, the
sample size and the population standard deviation σ:
$z = \frac{\hat{X}_n - \mu_0}{\sigma / \sqrt{n}}$
Then the null hypothesis of the lower tail test is to be rejected if
z ≤ zα , where zα is the 100(α) percentile of the standard normal
distribution.
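The critical value approach for the lower tail test can be sketched as follows. All numbers are hypothetical and chosen only to illustrate the rule.

```r
# Lower tail z test, H0: mu >= mu0 (hypothetical values).
x_bar <- 9900     # sample mean
mu0   <- 10000    # hypothesized lower bound
sigma <- 120      # known population standard deviation
n     <- 30       # sample size
alpha <- 0.05     # significance level

z       <- (x_bar - mu0) / (sigma / sqrt(n))   # test statistic
z_alpha <- qnorm(alpha)                        # 100(alpha) percentile

reject_h0 <- z <= z_alpha
c(z = z, critical = z_alpha)
reject_h0
```

Here the statistic falls well below the critical value, so the null hypothesis is rejected.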
14. Hypothesis testing - upper tail
The null hypothesis of the upper tail test of the population mean
can be expressed as follows:
µ ≤ µ0 ; where µ0 is a hypothesized upper bound of the true
population mean µ.
Let us define the test statistic z in terms of the sample mean, the
sample size and the population standard deviation σ:
$z = \frac{\hat{X}_n - \mu_0}{\sigma / \sqrt{n}}$
Then the null hypothesis of the upper tail test is to be rejected if
z ≥ z1−α , where z1−α is the 100(1 − α) percentile of the
standard normal distribution.
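A sketch of the upper tail test with hypothetical numbers, showing both the critical value rule and the equivalent p-value rule (reject when the p-value is below α).

```r
# Upper tail z test, H0: mu <= mu0 (hypothetical values).
x_bar <- 2.1
mu0   <- 2
sigma <- 0.25
n     <- 35
alpha <- 0.05

z          <- (x_bar - mu0) / (sigma / sqrt(n))
z_critical <- qnorm(1 - alpha)              # 100(1 - alpha) percentile
p_value    <- pnorm(z, lower.tail = FALSE)  # upper tail probability of z

reject_by_critical <- z >= z_critical
reject_by_p_value  <- p_value <= alpha
c(z = z, critical = z_critical, p_value = p_value)
```

The two rules always give the same decision; the p-value simply expresses it on the probability scale.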
15. Hypothesis testing - two tailed
The null hypothesis of the two-tailed test of the population mean
can be expressed as follows:
µ = µ0 ; where µ0 is a hypothesized value of the true population
mean µ. Let us define the test statistic z in terms of the sample
mean, the sample size and the population standard deviation
σ:
$z = \frac{\hat{X}_n - \mu_0}{\sigma / \sqrt{n}}$
Then the null hypothesis of the two-tailed test is to be rejected if
z ≤ zα/2 or z ≥ z1−α/2 , where zα/2 is the 100(α/2) percentile of
the standard normal distribution.
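For the two-tailed case, a sketch with hypothetical numbers. The rejection region is split between the two tails, so each critical value uses α/2.

```r
# Two-tailed z test, H0: mu = mu0 (hypothetical values).
x_bar <- 14.6
mu0   <- 15.4
sigma <- 2.5
n     <- 35
alpha <- 0.05

z     <- (x_bar - mu0) / (sigma / sqrt(n))
lower <- qnorm(alpha / 2)        # 100(alpha/2) percentile
upper <- qnorm(1 - alpha / 2)    # 100(1 - alpha/2) percentile

reject_h0 <- (z <= lower) || (z >= upper)
p_value   <- 2 * pnorm(-abs(z))  # two-sided p-value

c(z = z, lower = lower, upper = upper, p_value = p_value)
reject_h0
```

With these numbers the statistic lies between the two critical values, so the null hypothesis is not rejected.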
16. Hypothesis testing - lower tail with Unknown variance
The null hypothesis of the lower tail test of the population mean
can be expressed as follows:
µ ≥ µ0 ; where µ0 is a hypothesized lower bound of the true
population mean µ.
Let us define the test statistic t in terms of the sample mean, the
sample size and the sample standard deviation $\hat{\sigma}$:
$t = \frac{\hat{X}_n - \mu_0}{\hat{\sigma} / \sqrt{n}}$
Then the null hypothesis of the lower tail test is to be rejected if
t ≤ tα , where tα is the 100(α) percentile of the Student t
distribution with n − 1 degrees of freedom.
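A sketch of the lower tail t test from summary statistics. The numbers are hypothetical; the only change from the known-variance case is the use of `qt` with n − 1 degrees of freedom.

```r
# Lower tail t test, H0: mu >= mu0 (hypothetical values).
x_bar <- 9900
mu0   <- 10000
s     <- 125      # sample standard deviation
n     <- 30
alpha <- 0.05

t_stat  <- (x_bar - mu0) / (s / sqrt(n))
t_alpha <- qt(alpha, df = n - 1)     # 100(alpha) percentile, n - 1 df

reject_h0 <- t_stat <= t_alpha
c(t = t_stat, critical = t_alpha)
reject_h0
```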
17. Hypothesis testing - upper tail with Unknown variance
The null hypothesis of the upper tail test of the population mean
can be expressed as follows:
µ ≤ µ0 ; where µ0 is a hypothesized upper bound of the true
population mean µ.
Let us define the test statistic t in terms of the sample mean, the
sample size and the sample standard deviation $\hat{\sigma}$:
$t = \frac{\hat{X}_n - \mu_0}{\hat{\sigma} / \sqrt{n}}$
Then the null hypothesis of the upper tail test is to be rejected if
t ≥ t1−α , where t1−α is the 100(1 − α) percentile of the Student
t distribution with n − 1 degrees of freedom.
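A sketch of the upper tail t test with hypothetical summary statistics, using the p-value from `pt` alongside the critical value.

```r
# Upper tail t test, H0: mu <= mu0 (hypothetical values).
x_bar <- 2.1
mu0   <- 2
s     <- 0.3      # sample standard deviation
n     <- 35
alpha <- 0.05

t_stat     <- (x_bar - mu0) / (s / sqrt(n))
t_critical <- qt(1 - alpha, df = n - 1)
p_value    <- pt(t_stat, df = n - 1, lower.tail = FALSE)

reject_h0 <- t_stat >= t_critical
c(t = t_stat, critical = t_critical, p_value = p_value)
reject_h0
```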
18. Hypothesis testing - two tailed with Unknown variance
The null hypothesis of the two-tailed test of the population mean
can be expressed as follows:
µ = µ0 ; where µ0 is a hypothesized value of the true population
mean µ. Let us define the test statistic t in terms of the sample
mean, the sample size and the sample standard deviation $\hat{\sigma}$:
$t = \frac{\hat{X}_n - \mu_0}{\hat{\sigma} / \sqrt{n}}$
Then the null hypothesis of the two-tailed test is to be rejected if
t ≤ tα/2 or t ≥ t1−α/2 , where tα/2 is the 100(α/2) percentile of
the Student t distribution with n − 1 degrees of freedom.
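When the raw observations are available, the two-tailed test can be run by hand or with base R's `t.test`. The sketch below uses a hypothetical sample `x` and checks that both routes give the same statistic and p-value.

```r
# Two-tailed t test, H0: mu = mu0, on a hypothetical raw sample.
x     <- c(14.2, 15.1, 13.8, 16.0, 14.9, 15.5, 14.4, 15.8)
mu0   <- 15.4
alpha <- 0.05
n     <- length(x)

# Manual computation following the formula above.
t_stat    <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_manual  <- 2 * pt(-abs(t_stat), df = n - 1)
reject_h0 <- abs(t_stat) >= qt(1 - alpha / 2, df = n - 1)

# Same test with the built-in function.
tt <- t.test(x, mu = mu0, alternative = "two.sided")

c(manual_t = t_stat, builtin_t = unname(tt$statistic),
  manual_p = p_manual, builtin_p = tt$p.value)
```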
19. Lower Tail Test of Population Proportion
The null hypothesis of the lower tail test about population
proportion can be expressed as follows:
ρ ≥ ρ0 ; where ρ0 is a hypothesized lower bound of the true
population proportion ρ.
Let us define the test statistic z in terms of the sample proportion
and the sample size:
$z = \frac{\hat{\rho} - \rho_0}{\sqrt{\rho_0 (1 - \rho_0) / n}}$
Then the null hypothesis of the lower tail test is to be rejected if
z ≤ zα , where zα is the 100(α) percentile of the standard normal
distribution.
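A sketch of the lower tail proportion test. The counts and the hypothesized proportion are hypothetical.

```r
# Lower tail test of a proportion, H0: rho >= rho0 (hypothetical values).
successes <- 85
n         <- 148
rho0      <- 0.6
alpha     <- 0.05

rho_hat <- successes / n
z       <- (rho_hat - rho0) / sqrt(rho0 * (1 - rho0) / n)
z_alpha <- qnorm(alpha)

reject_h0 <- z <= z_alpha
c(rho_hat = rho_hat, z = z, critical = z_alpha)
reject_h0
```

Note that the standard error uses the hypothesized value ρ0, not the sample proportion. With these numbers the null hypothesis is not rejected.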
20. Upper Tail Test of Population Proportion
The null hypothesis of the upper tail test about population
proportion can be expressed as follows:
ρ ≤ ρ0 ; where ρ0 is a hypothesized upper bound of the true
population proportion ρ.
Let us define the test statistic z in terms of the sample proportion
and the sample size:
$z = \frac{\hat{\rho} - \rho_0}{\sqrt{\rho_0 (1 - \rho_0) / n}}$
Then the null hypothesis of the upper tail test is to be rejected if
z ≥ z1−α , where z1−α is the 100(1 − α) percentile of the standard
normal distribution.
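The upper tail proportion test, sketched with hypothetical counts and reported through its p-value.

```r
# Upper tail test of a proportion, H0: rho <= rho0 (hypothetical values).
successes <- 30
n         <- 214
rho0      <- 0.12
alpha     <- 0.05

rho_hat <- successes / n
z       <- (rho_hat - rho0) / sqrt(rho0 * (1 - rho0) / n)
p_value <- pnorm(z, lower.tail = FALSE)   # upper tail probability

reject_h0 <- z >= qnorm(1 - alpha)
c(rho_hat = rho_hat, z = z, p_value = p_value)
reject_h0
```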
21. Two Tailed Test of Population Proportion
The null hypothesis of the two-tailed test about population
proportion can be expressed as follows:
ρ = ρ0 ; where ρ0 is a hypothesized value of the true population
proportion ρ.
Let us define the test statistic z in terms of the sample proportion
and the sample size:
$z = \frac{\hat{\rho} - \rho_0}{\sqrt{\rho_0 (1 - \rho_0) / n}}$
Then the null hypothesis of the two-tailed test is to be rejected if
z ≤ zα/2 or z ≥ z1−α/2
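A sketch of the two-tailed proportion test, using a hypothetical coin-fairness example (12 heads in 20 tosses, ρ0 = 0.5).

```r
# Two-tailed test of a proportion, H0: rho = rho0 (hypothetical values).
successes <- 12
n         <- 20
rho0      <- 0.5
alpha     <- 0.05

rho_hat <- successes / n
z       <- (rho_hat - rho0) / sqrt(rho0 * (1 - rho0) / n)

lower <- qnorm(alpha / 2)
upper <- qnorm(1 - alpha / 2)

reject_h0 <- (z <= lower) || (z >= upper)
p_value   <- 2 * pnorm(-abs(z))
c(z = z, lower = lower, upper = upper, p_value = p_value)
reject_h0
```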
22. Sample size definition
The quality of a sample survey can be improved (worsened) by
increasing (decreasing) the sample size.
The formulas below provide the sample size needed for an
interval estimate of a mean or a proportion at (1 − α)
confidence level, with margin of error E and a planned parameter
estimate.
Here, z1−α/2 is the 100(1 − α/2) percentile of the standard normal
distribution.
For mean: $n = \frac{z_{1-\alpha/2}^{2} \cdot \sigma^{2}}{E^{2}}$
For proportion: $n = \frac{z_{1-\alpha/2}^{2} \cdot \rho (1 - \rho)}{E^{2}}$
23. Sample size definition - Exercises
Mean: Assume the population standard deviation σ of the
student heights in the survey is 9.48. Find the sample size needed
to achieve a 1.2 centimeter margin of error at 95 per cent
confidence level.
Since there are two tails of the normal distribution, the 95 per
cent confidence level would imply the 97.5th percentile of the
normal distribution at the upper tail. Therefore, z1−α/2 is
given by qnorm(.975).
Proportion: Using a 50 per cent planned proportion estimate,
find the sample size needed to achieve a 5 per cent margin of
error for the female student survey at 95 per cent confidence
level.
Since there are two tails of the normal distribution, the 95 per
cent confidence level would imply the 97.5th percentile of the
normal distribution at the upper tail. Therefore, z1−α/2 is
given by qnorm(.975).
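Both exercises can be worked through with the sample size formulas and `qnorm(.975)`. In the sketch below the variable names are chosen for the example, and rounding up to the next whole unit with `ceiling` is an added assumption, since a sample size must be an integer.

```r
alpha <- 0.05
z     <- qnorm(1 - alpha / 2)    # 97.5th percentile of the standard normal

# Mean: sigma = 9.48, margin of error E = 1.2
sigma  <- 9.48
E_mean <- 1.2
n_mean <- z^2 * sigma^2 / E_mean^2

# Proportion: planned proportion 0.5, margin of error E = 0.05
rho    <- 0.5
E_prop <- 0.05
n_prop <- z^2 * rho * (1 - rho) / E_prop^2

# Round up so the margin of error is not exceeded.
ceiling(c(mean = n_mean, proportion = n_prop))
```

This gives 240 students for the mean and 385 for the proportion.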
24. Homeworks
1: Confidence interval for the proportion. Suppose we have a
sample of n = 25 births, 15 of which are female. Define the
interval (at 99 per cent) for the proportion of females in the
population. HINT: Apply, with the proper functions in R, the
formula in slide 11.
2: Hypothesis test to compare two proportions. Suppose we have
two schools. Sampling from the first, n = 20 and the Hispanic
students are 8. Sampling from the second, n = 18 and the Hispanic
students are 4. Can we state (at 95 per cent) that the frequency of
Hispanics is the same in the two schools? N.B.: the test here is
two tailed.
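The hypothesis test here is:
$z = \frac{\hat{\rho}_1 - \hat{\rho}_2}{sd}$; where
$\rho = \frac{\hat{\rho}_1 \cdot n_1 + \hat{\rho}_2 \cdot n_2}{n_1 + n_2}$
$sd = \sqrt{\rho (1 - \rho) \left[ \frac{1}{n_1} + \frac{1}{n_2} \right]}$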
25. Charts - 1
Figure: Representation of the critical point for the upper tail hypothesis
test
26. Charts - 2
Figure: Representation of the critical point for the lower tail hypothesis
test
27. Charts - 3
Figure: Representation of the critical point for the two-tailed hypothesis
test