The document discusses confidence intervals and different methods for computing them: exact, asymptotic, jackknife, and bootstrap. The exact method computes an interval from the known distribution of the estimator, but this distribution is often impossible to obtain. The asymptotic method uses the asymptotic normality of maximum likelihood estimators, but requires large sample sizes. The jackknife method uses leave-one-out resampling to estimate bias (correcting it up to O(1/n²)) and variance, while the bootstrap resamples with replacement to estimate the full distribution of the estimator and compute confidence intervals.
2. Why do we need Confidence Intervals?
• Very common use case
we have a few samples x1, ..., xn from an unknown distribution F
we need to estimate some parameter θ of the underlying distribution
• A single value or an interval of values?
use x1, ..., xn to compute our best guess for the parameter θ
but this single value does not take into account the intrinsic uncertainty
due to our limited information on F (finite number of samples n)
so use x1, ..., xn to also compute an interval that likely contains the true θ
• The frequentist solution
there are several ways to compute an interval estimate for θ
we follow the frequentist approach: computing a confidence interval
3. Point Estimates and Interval Estimates
• Given x1, ..., xn from a distribution F, estimate the unknown parameter θ of F.
Given x1, ..., xn drawn from N(µ, σ²), we want to estimate the variance σ².
Given x1, ..., xn and y1, ..., yn, we want to estimate the correlation ρ(X, Y).
• A point estimate is a statistic ˆθ = T(X1, ..., Xn) estimating the unknown θ.
A classical example is the maximum likelihood estimator (MLE).
• Important properties of an estimator are bias and variance.
$$\mathrm{Bias}(\hat\theta) = E[\hat\theta] - \theta \qquad \mathrm{Var}(\hat\theta) = E\big[(\hat\theta - E[\hat\theta])^2\big]$$
Given x1, ..., xn drawn from N(µ, σ²), the MLE for σ² is $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$.
This estimator has bias $-\sigma^2/n$ and variance $2\sigma^4/n$.
• An interval estimate is an interval statistic
I(X1, ..., Xn) = [L(X1, ..., Xn), U(X1, ..., Xn)]
containing possible values for the unknown θ.
Two classical examples are the confidence interval and the credible interval.
4. Confidence Intervals
• I(X1, ..., Xn) is a confidence interval for θ with confidence level α if, for any fixed value of the unknown parameter θ,
$$P(\theta \in I(X_1, ..., X_n)) = \alpha$$
• If α = 0.95, this means that for any fixed θ, if we repeat the sampling of n values X1, ..., Xn ∼ Fθ 100 times and compute a confidence interval each time, on average 95 of those intervals contain the true value of θ.
• This does not mean that, given the samples x1, ..., xn, the probability that θ ∈ I(x1, ..., xn) is α! This is a common misunderstanding: the probability in our definition is over the samples X1, ..., Xn, not over θ.
Indeed, in the frequentist approach θ is a fixed (albeit unknown) value, not a random variable with an associated probability distribution.
For a Bayesian approach, fix x1, ..., xn instead and assign a posterior distribution to θ. This yields a credible interval, which contains θ with probability α.
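As a sanity check, here is a minimal Python sketch of the frequentist interpretation above (the distribution, its parameters, and the trial count are illustrative assumptions, not from the slides): fix θ, resample many times, and count how often the interval covers it.

```python
import numpy as np
from scipy import stats

# Coverage check for a 95% CI for the mean of N(theta, sigma^2), sigma known.
# theta, sigma, n, and the number of trials are made up for illustration.
rng = np.random.default_rng(0)
theta, sigma, n, alpha = 2.0, 1.0, 30, 0.95
z = stats.norm.ppf((1 + alpha) / 2)   # standard normal quantile

covered = 0
trials = 1000
for _ in range(trials):
    x = rng.normal(theta, sigma, size=n)
    half = z * sigma / np.sqrt(n)     # half-width of the exact interval
    covered += (x.mean() - half <= theta <= x.mean() + half)

print(covered / trials)  # close to 0.95: about 95% of intervals cover theta
```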
5. Confidence Intervals: Example
• 90% Confidence Interval (frequentist approach)
θ is fixed, but unknown
X1, ..., Xn are drawn from Fθ 10 times
build an interval for each sample (X1, ..., Xn)
9/10 of the intervals contain the true θ
• 90% Credible Interval (Bayesian approach)
associate to θ a probability measuring our belief
x1, ..., xn are fixed observations
update the posterior belief on θ
build an interval containing θ with probability 90%
6. How do we compute Confidence Intervals?
• Depending on the situation, we have to use a different approach
Exact method: based on a known distribution of ˆθ
Asymptotic method: based on asymptotic normality of the MLE
Jackknife method: simple resampling technique
Bootstrap method: more elaborate resampling technique
7. Exact Method
• The value of θ is fixed, but ˆθ = T(X1, ..., Xn) is a random variable. If we
know the exact distribution of ˆθ we can compute an exact confidence interval.
Example: Normal distribution N(µ, σ²)
The MLE for the mean µ is the sample mean $\hat\mu = \bar x$ and we have
$$\frac{\hat\mu - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
so if σ² is known we can compute an exact confidence interval for µ.
The MLE for the variance is the sample variance $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i - \bar x)^2$. This estimator is biased, so we consider instead the Bessel correction, yielding $s^2 = \frac{1}{n-1}\sum_i (x_i - \bar x)^2$, and we have
$$\frac{(n-1)\,s^2}{\sigma^2} \sim \chi^2_{n-1}.$$
If σ² is unknown, we can use s² to compute an exact confidence interval for µ by using
$$\frac{\hat\mu - \mu}{s/\sqrt{n}} \sim t_{n-1}.$$
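A minimal Python sketch of this exact t-interval (the sample below is synthetic; scipy is assumed available):

```python
import numpy as np
from scipy import stats

# Exact t-interval for mu with sigma^2 unknown, as derived above.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=20)   # synthetic sample

n = len(x)
mu_hat = x.mean()                  # MLE of mu (sample mean)
s = x.std(ddof=1)                  # Bessel-corrected standard deviation

alpha = 0.95                       # confidence level, as defined on slide 4
t_crit = stats.t.ppf((1 + alpha) / 2, df=n - 1)
ci = (mu_hat - t_crit * s / np.sqrt(n), mu_hat + t_crit * s / np.sqrt(n))
print(ci)
```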
8. Exact Method: Pros and Cons
Pros
+ confidence level is exactly α
+ closed-form expression allows fast computation
+ works for any sample size n
Cons
– if we do not know the distribution F or the family of distributions it
belongs to (non-parametric statistics), we cannot compute the exact
distribution of ˆθ
– even if F is known, the exact distribution of ˆθ is often impossible to
compute: θ = ρ(X, Y ), θ = Median(X), ...
9. Asymptotic Method
• In many cases we choose $\hat\theta$ to be the MLE for θ. This estimator has (under reasonable assumptions) the key property of asymptotic normality:
$$\sqrt{n}\,(\hat\theta - \theta) \xrightarrow{d} N(0, I(\theta)^{-1})$$
where $I(\theta) = -E_X[\ell''(\theta; x)]$ is the Fisher information and $\ell(\theta; x) = \log p(x; \theta)$.
Example: Exponential distribution Exp(λ)
The p.d.f. is $p(x; \lambda) = \lambda e^{-\lambda x}$, so we have
$$\ell(\lambda; x) = \log(\lambda) - \lambda x \quad\text{and}\quad \ell(\lambda; x_1, ..., x_n) = n\log(\lambda) - \lambda \sum_i x_i$$
so that the MLE is $\hat\lambda = 1/\bar x$. The Fisher information is
$$-E_X[\ell''(\lambda)] = -E_X[-1/\lambda^2] = 1/\lambda^2$$
so we can use the asymptotic approximation
$$\sqrt{n}\,(\hat\lambda - \lambda) \approx N(0, \lambda^2) \;\Rightarrow\; \frac{\hat\lambda}{\lambda} \approx N(1, 1/n)$$
Example: Bernoulli distribution Bernoulli(p)
The p.m.f. is $p(x; p) = p^x (1-p)^{1-x}$, so we have
$$\ell(p; x) = x\log(p) + (1-x)\log(1-p)$$
so that the MLE is $\hat p = \bar x$. The Fisher information is
$$-E_X[\ell''(p)] = -E_X\Big[-\frac{x}{p^2} - \frac{1-x}{(1-p)^2}\Big] = \frac{p}{p^2} + \frac{1-p}{(1-p)^2} = \frac{1}{p(1-p)}$$
so we can use the asymptotic approximation
$$\sqrt{n}\,(\hat p - p) \approx N(0, p(1-p)) \;\Rightarrow\; \hat p - p \approx N(0, \hat p(1-\hat p)/n)$$
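A minimal sketch of the resulting Wald-type asymptotic interval for the Bernoulli parameter (the data and sample size below are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Asymptotic (Wald) interval for p, using p_hat - p ~ N(0, p_hat(1-p_hat)/n).
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=200)       # synthetic 0/1 data

n = len(x)
p_hat = x.mean()                         # MLE of p
se = np.sqrt(p_hat * (1 - p_hat) / n)    # plug-in standard error from I(p)^{-1}

alpha = 0.95
z = stats.norm.ppf((1 + alpha) / 2)
print((p_hat - z * se, p_hat + z * se))
```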
10. Asymptotic Method: Example
Consider Exp(λ) and the estimator ˆk = i Xi /n for k = 1/λ = 1/3.
It can be shown that the exact distribution is ˆk ∼ 2nkχ2
2n
We have seen that the asymptotic distribution is ˆk ≈ N(k, k2
/n)
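A short sketch comparing the two distributions numerically (the quantile levels are my choice; the exact form $2n\hat k/k \sim \chi^2_{2n}$ is taken from the reconstruction above):

```python
import numpy as np
from scipy import stats

# Compare exact and asymptotic quantiles of k_hat for Exp(lambda), k = 1/lambda.
k, n = 1 / 3, 30
q = np.array([0.025, 0.5, 0.975])

# Exact: 2*n*k_hat/k ~ chi2(2n)  =>  k_hat ~ (k/(2n)) * chi2(2n)
exact = k / (2 * n) * stats.chi2.ppf(q, df=2 * n)

# Asymptotic: k_hat ~ N(k, k^2/n)
asymp = stats.norm.ppf(q, loc=k, scale=k / np.sqrt(n))

print("exact:     ", exact)   # slightly right-skewed around k
print("asymptotic:", asymp)   # symmetric around k
```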
11. Asymptotic Method: Pros and Cons
Pros
+ easier computation if the sampling distribution F is known
+ the expected information I(θ) may be replaced by the observed information I(θ̂)
Cons
– works well only for n sufficiently large (typically at least n > 50)
– neglects the skewness of the distribution of θ̂
– requires knowing F or the family of distributions it belongs to
– can be applied only if θ̂ is asymptotically normal (typically the MLE)
12. Jackknife Method
• Given any estimator $\hat\theta$, the jackknife is based on the n leave-1-out estimators
$$\hat\theta_{(i)} = T(X_1, ..., X_{i-1}, X_{i+1}, ..., X_n), \quad\text{with}\quad \hat\theta_{(\cdot)} = \frac{1}{n}\sum_i \hat\theta_{(i)}.$$
We also consider the n pseudo-values
$$\tilde\theta_i = n\hat\theta - (n-1)\hat\theta_{(i)}$$
• A first use of the jackknife is bias correction. Indeed,
$$\mathrm{bias}_{\mathrm{jack}} = (n-1)\big(\hat\theta_{(\cdot)} - \hat\theta\big)$$
is a linear estimator of Bias($\hat\theta$) (i.e. its error is O(1/n²)). Then, a bias-corrected estimator is given by the mean of the pseudo-values
$$\hat\theta_{\mathrm{jack}} = n\hat\theta - (n-1)\hat\theta_{(\cdot)} = \frac{1}{n}\sum_i \tilde\theta_i = \bar{\tilde\theta}$$
• Similarly, the jackknife is used for the estimation of other properties of $\hat\theta$.
E.g. the variance estimator for Var($\hat\theta$) given by
$$\mathrm{var}_{\mathrm{jack}} = \frac{1}{n}\,\tilde s^2 = \frac{1}{n}\cdot\frac{1}{n-1}\sum_i \big(\tilde\theta_i - \bar{\tilde\theta}\big)^2 = \frac{n-1}{n}\sum_i \big(\hat\theta_{(i)} - \hat\theta_{(\cdot)}\big)^2$$
is a linear estimator, assuming that $\hat\theta = T(X_1, ..., X_n)$ is smooth.
• We may use var_jack and $\hat\theta_{\mathrm{jack}}$ to compute a jackknife approximate confidence interval using the asymptotic approximation
$$\frac{\bar{\tilde\theta} - \theta}{\sqrt{\tilde s^2/n}} \approx t_{n-1}$$
but in practice this approximation is often too crude.
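A minimal leave-one-out jackknife sketch implementing the formulas above (the helper name `jackknife` and the example statistic are my own):

```python
import numpy as np

def jackknife(x, statistic):
    """Return (bias-corrected estimate, bias_jack, var_jack) for `statistic`."""
    x = np.asarray(x)
    n = len(x)
    theta_hat = statistic(x)
    # n leave-one-out estimates theta_(i), and their mean theta_(.)
    loo = np.array([statistic(np.delete(x, i)) for i in range(n)])
    theta_dot = loo.mean()

    bias_jack = (n - 1) * (theta_dot - theta_hat)
    theta_jack = n * theta_hat - (n - 1) * theta_dot      # mean of pseudo-values
    var_jack = (n - 1) / n * np.sum((loo - theta_dot) ** 2)
    return theta_jack, bias_jack, var_jack

# Example: the biased MLE of the variance (ddof=0), whose bias is -sigma^2/n.
rng = np.random.default_rng(2)
x = rng.normal(0.0, 2.0, size=25)
print(jackknife(x, lambda s: s.var()))
```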
13. Jackknife Method: Limitations
• If $\hat\theta$ is non-smooth, the jackknife variance estimator may be inconsistent.
If $\hat\theta$ is the sample median, it can be proved that
$$\frac{\mathrm{var}_{\mathrm{jack}}}{\mathrm{Var}(\hat\theta)} \xrightarrow{d} \left(\frac{\chi^2_2}{2}\right)^2$$
• To fix this we introduce an extension of the jackknife. This time we consider the $\binom{n}{d}$ leave-d-out estimators, obtained by computing the statistic T on every possible subset of X1, ..., Xn obtained by removing d elements.
For $\hat\theta$ = sample median, choosing $\sqrt{n} < d < n-1$ yields a consistent variance estimator
$$\mathrm{var}_{\mathrm{jack}} = \frac{n-d}{d\,\binom{n}{d}} \sum_i \big(\hat\theta_{(i)} - \hat\theta_{(\cdot)}\big)^2$$
where the sum runs over all $\binom{n}{d}$ subsets.
14. Jackknife Method: Example
Consider (X, Y) ∼ F for some F and the Pearson correlation coefficient ρ.
It can be shown that the estimator $\hat\rho^2$, with
$$\hat\rho = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_i (x_i - \bar x)^2}\,\sqrt{\sum_i (y_i - \bar y)^2}},$$
is biased.
F ⟹ (x1, y1), (x2, y2), (x3, y3), ..., (x10, y10) → $\hat\rho^2$
Leave-one-out resamples:
(x2, y2), (x3, y3), (x4, y4), ..., (x10, y10) → $\hat\rho^2_{(1)}$
(x1, y1), (x3, y3), (x4, y4), ..., (x10, y10) → $\hat\rho^2_{(2)}$
...
(x1, y1), (x2, y2), (x3, y3), ..., (x9, y9) → $\hat\rho^2_{(10)}$
The jackknife estimator $\hat\rho^2_{\mathrm{jack}} = 10\hat\rho^2 - 9\hat\rho^2_{(\cdot)}$ has its bias corrected up to O(1/n²).
15. Jackknife Method: Pros and Cons
Pros
+ can be used for non-parametric statistics
+ fast computation
+ bias correction up to O(1/n²)
+ leave-d-out provides consistent variance estimator
Cons
– leave-1-out may be non-consistent
– leave-d-out is more expensive
– confidence interval is based on crude approximations, bootstrap is better
16. Bootstrap Method
• The bootstrap consists of B resamplings with replacement from x1, ..., xn.
This is equivalent to sampling from the empirical CDF $\hat F$.
• For each of the B resamples $(x_1^{(b)}, ..., x_n^{(b)})$, compute the estimator $\hat\theta^*_{(b)}$. We use the values $\hat\theta^*_{(b)}$ to estimate the distribution of $\hat\theta^* = \hat\theta^*(\hat F)$, which in turn is an approximation of the distribution of interest $\hat\theta = \hat\theta(F)$.
• To compute point estimates for the properties of $\hat\theta$ we use the pair $(\hat\theta^*, \hat\theta)$ to approximate the pair $(\hat\theta, \theta)$.
The bootstrap bias estimator for Bias($\hat\theta$) = E[$\hat\theta$] − θ is given by
$$\mathrm{bias}_{\mathrm{boot}} = E[\hat\theta^*] - \hat\theta = \frac{1}{B}\sum_b \hat\theta^*_b - \hat\theta$$
so that the bias-corrected bootstrap estimator reads
$$\hat\theta_{\mathrm{boot}} = \hat\theta - \mathrm{bias}_{\mathrm{boot}} = 2\hat\theta - \frac{1}{B}\sum_b \hat\theta^*_b$$
Similarly, the bootstrap variance estimator for Var($\hat\theta$) = E[($\hat\theta$ − E[$\hat\theta$])²] is
$$\mathrm{var}_{\mathrm{boot}} = \frac{1}{B-1}\sum_b \Big(\hat\theta^*_b - \frac{1}{B}\sum_{b'} \hat\theta^*_{b'}\Big)^2$$
• Notice that the bootstrap is more general than the jackknife, since it estimates the whole distribution of $\hat\theta$ and not only its bias and variance.
Actually, one can prove that the jackknife is a first-order approximation of the bootstrap.
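A minimal nonparametric bootstrap sketch implementing these estimators (the median is my example statistic; the data are synthetic):

```python
import numpy as np

# B resamples with replacement, i.e. B samples from the empirical CDF F_hat.
rng = np.random.default_rng(4)
x = rng.exponential(scale=3.0, size=40)   # synthetic sample

theta_hat = np.median(x)                  # a statistic with no easy exact CI
B = 2000
reps = np.array([np.median(x[rng.integers(0, len(x), size=len(x))])
                 for _ in range(B)])      # bootstrap replicates theta*_b

bias_boot = reps.mean() - theta_hat            # bias estimator
theta_boot = 2 * theta_hat - reps.mean()       # bias-corrected estimator
var_boot = reps.var(ddof=1)                    # variance estimator
print(theta_boot, bias_boot, var_boot)
```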
18. Bootstrap Method: Confidence Intervals
• Different techniques are available to compute bootstrap interval estimates.
Here p[α] denotes the α-quantile of the distribution p, with z[α] for the standard normal.
• The pivotal interval comes from $P(l < \hat\theta - \hat\theta^* < u) \approx P(l < \theta - \hat\theta < u)$:
$$CI = \big(2\hat\theta - \hat\theta^*[1-\alpha/2],\; 2\hat\theta - \hat\theta^*[\alpha/2]\big)$$
• The studentized interval takes an approach similar to the jackknife's:
$$CI = \big(\hat\theta_{\mathrm{jack}} - t_{n-1}[\alpha/2]\sqrt{\mathrm{var}_{\mathrm{jack}}},\; \hat\theta_{\mathrm{jack}} + t_{n-1}[\alpha/2]\sqrt{\mathrm{var}_{\mathrm{jack}}}\big)$$
• The BCa interval (bias-corrected and accelerated):
$$CI = \big(\hat\theta^*[g(\alpha)],\; \hat\theta^*[g(1-\alpha)]\big), \quad\text{with}\quad g(\alpha) = \Phi\Big(z_0 + \frac{z_0 + z[\alpha]}{1 - a(z_0 + z[\alpha])}\Big)$$
where $z_0 = \Phi^{-1}\big(\#\{\hat\theta^*_b < \hat\theta\}/B\big)$ is the bias correction, and the acceleration a, one sixth of a skewness term,
$$a = \frac{1}{6}\,\frac{\sum_i \big(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\big)^3}{\big[\sum_i \big(\hat\theta_{(\cdot)} - \hat\theta_{(i)}\big)^2\big]^{3/2}}$$
is approximated using the jackknife. The BCa interval has an excellent O(1/n) coverage error, so it is preferred to the other bootstrap methods.
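A minimal sketch of the pivotal interval (reusing the median bootstrap from the previous slide; here α is interpreted as the confidence level, as defined on slide 4):

```python
import numpy as np

# Pivotal bootstrap CI: CI = (2*theta_hat - q_hi, 2*theta_hat - q_lo),
# where q_lo, q_hi are bootstrap quantiles of theta*_b.
rng = np.random.default_rng(5)
x = rng.exponential(scale=3.0, size=40)   # synthetic sample

theta_hat = np.median(x)
B = 4000
reps = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                 for _ in range(B)])

alpha = 0.95
q_lo, q_hi = np.quantile(reps, [(1 - alpha) / 2, (1 + alpha) / 2])
print((2 * theta_hat - q_hi, 2 * theta_hat - q_lo))
```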
19. Bootstrap Method: Pros and Cons
Pros
+ can be used for non-parametric statistics
+ more powerful than the jackknife, since it approximates the whole distribution of θ̂
+ more accurate than the jackknife for computing the variance of θ̂
+ the BCa interval has O(1/n) coverage error
Cons
– more expensive than the jackknife (B should be large enough)
– if n is very small the bootstrap may fail
– if the family of F is known, much better results are obtained with exact methods
20. Conclusions
• We want to estimate a parameter θ, using samples X1, ..., Xn ∼ Fθ.
• Confidence intervals are needed to express uncertainty of estimator ˆθ.
• If distribution F is known, preferably use exact or asymptotic methods.
• If F is unknown or distribution of ˆθ is complex, use jackknife or bootstrap.
• Use the jackknife to estimate properties of θ̂; it is not so good for confidence intervals.
• Use the bootstrap to estimate the distribution of θ̂; it is good for confidence intervals (BCa).