Basics of Statistical Inference

Lu Mao (Department of Epidemiology and Biostatistics, Drexel University School of Public Health)

Fall 2009

Contents

1 BEST UNBIASED ESTIMATORS
2 SUFFICIENCY AND UNBIASED ESTIMATION
3 CRAMER-RAO LOWER BOUND
4 COMPREHENSIVE EXAMPLES ON ESTIMATION
5 T TEST
6 WALD, SCORE AND LIKELIHOOD RATIO TESTS

1 BEST UNBIASED ESTIMATORS

Definition 1 (Point Estimator) A point estimator is any function $W(X_1, \ldots, X_n)$ of the sample; that is, any statistic is a point estimator.

This definition may seem unnecessarily vague, but at the moment we must be cautious not to preclude any potential candidates. There is one restriction, however: an estimator cannot contain an unknown parameter; it must be a function of the sample only. For example, $\frac{1}{2}(\bar{X} + \mu)$ is not an estimator for the unknown $\mu$.

A common way of evaluating the performance of an estimator is through its mean squared error.

Definition 2 (MSE) The mean squared error (MSE) of an estimator $W$ for a parameter $\theta$ is defined by $E_\theta(W - \theta)^2$.

The MSE is thus the squared difference between the estimator and the parameter, averaged over the sample space. We have the decomposition
$$\mathrm{MSE}_\theta(W) = \mathrm{Var}_\theta(W) + (E_\theta(W) - \theta)^2.$$
Note that the MSE generally depends on the unknown parameter $\theta$. Of course, we would like to find an estimator with minimum MSE for all $\theta \in \Theta$. However, this is not possible unless we place restrictions on the pool of potential estimators.

Example 1 Consider an iid sample from $N(\theta, 5)$ with sample size $n = 10$. A reasonable estimator for $\theta$ is the sample mean $\bar{X}$. Since $\bar{X}$ is unbiased, $\mathrm{MSE}(\bar{X}) = \mathrm{Var}(\bar{X}) = 5/10 = 1/2$, a constant function of $\theta$. Now consider the constant estimator $0$, for which $\mathrm{MSE}(0) = \theta^2$. So $\mathrm{MSE}_\theta(0) < \mathrm{MSE}_\theta(\bar{X})$ when $\theta$ lies in a small neighborhood of $0$, precisely when $\theta^2 < 1/2$.
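To see Example 1 numerically, here is a minimal simulation sketch (my own illustration, not part of the original notes; the grid of $\theta$ values and the seed are arbitrary) that estimates the MSE of $\bar{X}$ by Monte Carlo and compares it with $\mathrm{MSE}(0) = \theta^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 5.0, 20000

for theta in [0.0, 0.5, 1.0, 2.0]:
    X = rng.normal(theta, np.sqrt(sigma2), size=(reps, n))
    mse_xbar = np.mean((X.mean(axis=1) - theta) ** 2)  # MSE of the sample mean, ~0.5 for every theta
    mse_zero = theta ** 2                              # MSE of the constant estimator 0
    print(f"theta={theta:4.1f}  MSE(xbar)~{mse_xbar:.3f}  MSE(0)={mse_zero:.3f}")
```

The constant estimator wins only on the small window $|\theta| < \sqrt{1/2} \approx 0.71$, in line with the discussion below.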
The constant $0$, though generally a bad estimator, does have a smaller MSE when $\theta$ is indeed around $0$. To eliminate such trivial cases, we restrict our consideration to the class of unbiased estimators.

Definition 3 (Unbiasedness) An estimator $W$ for a parameter $\theta$ is unbiased if $E_\theta(W) = \theta$ for all $\theta \in \Theta$.

Example 2 Let $X_1, \ldots, X_n$ be an iid sample with $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$ for $i = 1, \ldots, n$. Show that $E(S^2) = \sigma^2$, where $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$. That is, the sample variance is an unbiased estimator of the population variance.

Proof First notice that
$$\sum_{i=1}^n (X_i - \bar{X})^2 = \sum_{i=1}^n (X_i^2 - 2X_i\bar{X} + \bar{X}^2) = \sum_{i=1}^n X_i^2 - 2\bar{X}\sum_{i=1}^n X_i + n\bar{X}^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2.$$
Therefore,
$$E S^2 = \frac{1}{n-1} E\left(\sum_{i=1}^n X_i^2 - n\bar{X}^2\right) = \frac{1}{n-1}\left(\sum_{i=1}^n E X_i^2 - n E\bar{X}^2\right) = \frac{1}{n-1}\left(n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right)\right) = \sigma^2.$$

Note that in the above example no assumption about the family of distributions has been made except for finite variance. Thus for a Poisson random sample, for instance, both $\bar{X}$ and $S^2$ are unbiased estimators of $\lambda$. Furthermore, for any $t \in \mathbb{R}$, $t\bar{X} + (1-t)S^2$ is an unbiased estimator of $\lambda$.

Definition 4 (UMVUE) An estimator $W$ of $\theta$ is a uniform minimum variance unbiased estimator (UMVUE) if $E_\theta(W) = \theta$ and, for any estimator $W'$ with $E_\theta(W') = \theta$, we have $\mathrm{Var}_\theta(W) \le \mathrm{Var}_\theta(W')$ for all $\theta \in \Theta$. It is also said to be a best unbiased estimator.

While there can be a large number of unbiased estimators for a particular parameter, the best unbiased estimator is unique.

Theorem 1 If $W$ is a best unbiased estimator for $\theta$, then $W$ is unique.

Proof Suppose $W'$ is another best unbiased estimator. Then $\mathrm{Var}_\theta W = \mathrm{Var}_\theta W'$, and this common variance is no larger than that of any other unbiased estimator, for all $\theta$. Consider the unbiased estimator $T = \frac{1}{2}(W + W')$. We have
$$\mathrm{Var}_\theta(T) = \tfrac{1}{4}\,\mathrm{Var}_\theta(W + W') = \tfrac{1}{4}\left(\mathrm{Var}_\theta W + \mathrm{Var}_\theta W' + 2\,\mathrm{Cov}_\theta(W, W')\right) \le \tfrac{1}{4}\left(\mathrm{Var}_\theta W + \mathrm{Var}_\theta W' + 2\sqrt{\mathrm{Var}_\theta W \cdot \mathrm{Var}_\theta W'}\right) = \mathrm{Var}_\theta W. \tag{1}$$
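As a quick numerical check of Example 2 (again an illustrative sketch of mine, with arbitrary $n$, $\mu$, $\sigma^2$), averaging $S^2$ over many replications recovers $\sigma^2$, while the divisor-$n$ version is biased downward by the factor $(n-1)/n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, sigma2, reps = 8, 3.0, 4.0, 100000

X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = X.var(axis=1, ddof=1)  # divisor n-1: the sample variance S^2
s2_divn = X.var(axis=1, ddof=0)      # divisor n: biased by the factor (n-1)/n

print(np.mean(s2_unbiased))  # ~ 4.0
print(np.mean(s2_divn))      # ~ 4.0 * (n-1)/n = 3.5
```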
But since $W$ has uniform minimum variance, we must have $\mathrm{Var}_\theta(T) \ge \mathrm{Var}_\theta(W)$. Therefore $\mathrm{Var}_\theta(T) = \mathrm{Var}_\theta(W)$, which forces equality in (1), i.e. $\mathrm{Cov}_\theta(W, W') = \sqrt{\mathrm{Var}_\theta W \cdot \mathrm{Var}_\theta W'}$. The attainment of equality in the Cauchy-Schwarz inequality requires that
$$W' = aW + b \tag{2}$$
for some constants $a$ and $b$. Equal variances give $a = 1$, and equal expectations then give $b = 0$, so $W' = W$. This completes the proof.

2 SUFFICIENCY AND UNBIASED ESTIMATION

Recall that $T$ is a sufficient statistic for $\theta$ if the distribution of $X \mid T = t$ does not depend on $\theta$. Informally, we say that $T$ summarizes all the information about $\theta$ contained in the sample. This notion is elegantly illustrated by the Rao-Blackwell theorem.

Theorem 2 (Rao-Blackwell) Let $W$ be an unbiased estimator of $\theta$ and let $T$ be a sufficient statistic for $\theta$. Define $\phi(T) = E(W \mid T)$. Then $\phi(T)$ is an unbiased estimator and $\mathrm{Var}_\theta\,\phi(T) \le \mathrm{Var}_\theta\,W$ for all $\theta$. That is, $\phi(T)$ is a uniformly better estimator than $W$.

Proof First notice that, since the distribution of $W(X_1, \ldots, X_n) \mid T = t$ does not depend on $\theta$, $\phi(t) = E(W \mid T = t)$ is not a function of $\theta$. Therefore $\phi(T)$ is indeed an estimator. Now,
$$E\,\phi(T) = E(E(W \mid T)) = E(W) = \theta;$$
also,
$$\mathrm{Var}_\theta(W) = E_\theta(\mathrm{Var}(W \mid T)) + \mathrm{Var}_\theta(E(W \mid T)) \ge \mathrm{Var}_\theta(E(W \mid T)) = \mathrm{Var}_\theta\,\phi(T).$$
This completes the proof.

The Rao-Blackwell theorem tells us that for any unbiased estimator $W$ we can always find $\phi(T)$, an estimator based on a sufficient statistic, that is uniformly better than $W$.

Corollary The best unbiased estimator for $\theta$, if one exists, must be a function of a sufficient statistic.

Intuitively, since the best unbiased estimator has the highest precision, it must have captured all the information available in the sample.

3 CRAMER-RAO LOWER BOUND

Theorem 3 (Cramér-Rao Inequality) Let $X_1, \ldots, X_n$ be an iid sample with density $f(x \mid \theta)$, and let $W(X_1, \ldots, X_n)$ be an estimator such that $\mathrm{Var}(W) < \infty$ and
$$\frac{d}{d\theta} E_\theta W = \int_{\mathcal{X}} \frac{\partial}{\partial\theta}\left[W(x)\, f(x \mid \theta)\right] dx. \tag{3}$$
Then
$$\mathrm{Var}_\theta(W) \ge \frac{\left(\frac{d}{d\theta} E_\theta W(X)\right)^2}{n I(\theta)},$$
where $I(\theta) = E_\theta\left(\left(\frac{\partial}{\partial\theta} \log f(X \mid \theta)\right)^2\right)$ is the Fisher information.
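A classic concrete instance of Rao-Blackwellization, added here as a sketch (the Poisson model and the estimand $e^{-\lambda}$ are my choice, not from the notes): for an iid Poisson($\lambda$) sample, $W = \mathbf{1}\{X_1 = 0\}$ is unbiased for $e^{-\lambda}$, and conditioning on the sufficient statistic $T = \sum_i X_i$ gives $\phi(T) = ((n-1)/n)^T$, since $X_1 \mid T = t \sim \mathrm{Binomial}(t, 1/n)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam, reps = 10, 1.5, 200000

X = rng.poisson(lam, size=(reps, n))
W = (X[:, 0] == 0).astype(float)   # crude unbiased estimator of exp(-lam)
T = X.sum(axis=1)                  # sufficient statistic
phi = ((n - 1) / n) ** T           # Rao-Blackwellized estimator E(W | T)

print(np.exp(-lam))                # target, ~0.223
print(W.mean(), W.var())           # unbiased, large variance
print(phi.mean(), phi.var())       # unbiased, much smaller variance
```

The simulation shows both estimators centered at $e^{-\lambda}$, with $\phi(T)$ far less variable, exactly as Theorem 2 guarantees.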
Corollary In addition to the conditions specified in Theorem 3, if $W$ is an unbiased estimator of $\theta$, then
$$\mathrm{Var}_\theta(W) \ge \frac{1}{n I(\theta)}.$$

Condition (3) involves an interchange of differentiation and integration. A common situation in which (3) fails to hold is when the support of the density $f(x \mid \theta)$ depends on $\theta$.

Example 3 (A counterexample to the CRLB) Let $X_1, \ldots, X_n$ be iid Uniform$[0, \theta]$. Calculate the "Fisher information" $I(\theta)$ and find an unbiased estimator $W$ of $\theta$ such that $\mathrm{Var}_\theta W < \frac{1}{n I(\theta)}$.

Solution Since $\log f(X \mid \theta) = \log\frac{1}{\theta} = -\log\theta$, we have
$$I(\theta) = E_\theta\left(\left(\frac{\partial}{\partial\theta}\log f(X \mid \theta)\right)^2\right) = E_\theta\left(\left(\frac{\partial}{\partial\theta}(-\log\theta)\right)^2\right) = E_\theta\,\frac{1}{\theta^2} = \frac{1}{\theta^2},$$
so $\frac{1}{nI(\theta)} = \frac{\theta^2}{n}$. Now, since clearly $E\bar{X} = \frac{\theta}{2}$, the statistic $W = 2\bar{X}$ is an unbiased estimator of $\theta$. But
$$\mathrm{Var}_\theta W = 4\,\mathrm{Var}_\theta\bar{X} = \frac{4}{n}\,\mathrm{Var}_\theta X_i = \frac{4}{n}\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3n} < \frac{\theta^2}{n}.$$

Fortunately, condition (3) is satisfied for "reasonable" estimators in the exponential family. Further, the identity
$$I(\theta) = -E_\theta\left(\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\right) \tag{4}$$
holds for the exponential family.

In light of the previous discussion, if the variance of an unbiased estimator attains the CRLB, it is the best unbiased estimator.

Example 4 Show that $\bar{X}$ is the best unbiased estimator for $\lambda$ in a Poisson($\lambda$) random sample.
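Example 4 is left as an exercise in the notes; the following short worked sketch (my own filling-in) uses identity (4) and the corollary above:

```latex
% Solution sketch for Example 4 (Poisson(lambda) sample):
% log f(x | lambda) = x log(lambda) - lambda - log(x!), so
\frac{\partial}{\partial\lambda}\log f(X \mid \lambda) = \frac{X}{\lambda} - 1,
\qquad
I(\lambda)
  = -E_\lambda\!\left(\frac{\partial^2}{\partial\lambda^2}\log f(X \mid \lambda)\right)
  = E_\lambda\!\left(\frac{X}{\lambda^2}\right)
  = \frac{1}{\lambda}.
% The CRLB for unbiased estimators of lambda is therefore lambda/n.
% The sample mean is unbiased and attains the bound:
E_\lambda(\bar X) = \lambda,
\qquad
\mathrm{Var}_\lambda(\bar X) = \frac{\lambda}{n} = \frac{1}{nI(\lambda)},
% hence X-bar is the best unbiased estimator of lambda.
```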
4 COMPREHENSIVE EXAMPLES ON ESTIMATION

The following appeared in the Duke University first-year qualifying exam in 2009 (http://www.stat.duke.edu/programs/grad/fye/fye2009.pdf).

Example 5 Suppose $Y_i$ follows a Poisson distribution with mean $\beta x_i$, where $x_i$ is a fixed, known positive constant:
$$P(Y_i = y_i) = \frac{(\beta x_i)^{y_i} e^{-\beta x_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots$$

1. Find a minimal sufficient statistic for this family of distributions. (Disregard "minimal.")
2. Find the Cramér-Rao bound for the variance of an unbiased estimator of $\beta$.
3. Find the MLE for $\beta$.
4. Find the variance of the MLE for $\beta$.
5. Give an approximate 95% confidence interval for $\beta$ when $n$ is large. Carefully describe any theorem that you use. (Disregard this last request.)

Solution

1. The likelihood function is
$$f(y \mid \beta) = \prod_i \frac{(\beta x_i)^{y_i} e^{-\beta x_i}}{y_i!} = \left(\beta^{\sum_i y_i}\, e^{-\beta\sum_i x_i}\right)\cdot\left(\prod_i x_i^{y_i} \Big/ \prod_i y_i!\right),$$
so by the factorization theorem $\sum_i Y_i$ is a sufficient statistic.

2. Since $\log f(y \mid \beta) = \sum_i y_i \log\beta - \beta\sum_i x_i + c$, we have
$$\frac{\partial}{\partial\beta}\log f(y \mid \beta) = \frac{\sum_i y_i}{\beta} - \sum_i x_i. \tag{5}$$
Therefore
$$\frac{\partial^2\log f(y \mid \beta)}{\partial\beta^2} = -\frac{\sum_i y_i}{\beta^2}.$$
Finally,
$$\frac{1}{I(\beta)} = -\frac{1}{E\left(\frac{\partial^2\log f(Y \mid \beta)}{\partial\beta^2}\right)} = \frac{\beta^2}{E\sum_i Y_i} = \frac{\beta}{\sum_i x_i},$$
where $I(\beta)$ here denotes the Fisher information of the whole sample.

3. Setting (5) to zero gives
$$\hat\beta = \frac{\sum_i Y_i}{\sum_i x_i}.$$
Furthermore, $E(\hat\beta) = \frac{E\sum_i Y_i}{\sum_i x_i} = \beta$.
4. The variance of the MLE is
$$\mathrm{Var}(\hat\beta) = \frac{1}{\left(\sum_i x_i\right)^2}\,\mathrm{Var}\left(\sum_i Y_i\right) = \frac{1}{\left(\sum_i x_i\right)^2}\,\beta\sum_i x_i = \frac{\beta}{\sum_i x_i}.$$

5. By the asymptotic normality of the MLE,
$$\sqrt{I(\beta)}\,(\hat\beta - \beta) = \sqrt{\frac{\sum_i x_i}{\beta}}\,(\hat\beta - \beta) \xrightarrow{D} N(0, 1).$$
By the consistency of the MLE, $\hat\beta \xrightarrow{P} \beta$. By the continuous mapping theorem (if $g$ is continuous and $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$; that is, continuous functions preserve convergence in probability), $\sqrt{\hat\beta} \xrightarrow{P} \sqrt{\beta}$, and hence $\sqrt{\beta/\hat\beta} \xrightarrow{P} 1$. By Slutsky's theorem (if $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c$, then $X_n Y_n \xrightarrow{D} cX$),
$$\sqrt{\frac{\sum_i x_i}{\hat\beta}}\,(\hat\beta - \beta) = \sqrt{\frac{\beta}{\hat\beta}}\cdot\sqrt{\frac{\sum_i x_i}{\beta}}\,(\hat\beta - \beta) \xrightarrow{D} N(0, 1).$$
Therefore
$$P\left(-z_{0.025} \le \sqrt{\frac{\sum_i x_i}{\hat\beta}}\,(\hat\beta - \beta) \le z_{0.025}\right) \doteq 0.95,$$
so
$$P\left(\hat\beta - \frac{1.96\sqrt{\hat\beta}}{\sqrt{\sum_i x_i}} \le \beta \le \hat\beta + \frac{1.96\sqrt{\hat\beta}}{\sqrt{\sum_i x_i}}\right) \doteq 0.95.$$
Substituting $\hat\beta = \sum_i Y_i / \sum_i x_i$, an approximate 95% CI for $\beta$ is
$$\frac{\sum_i Y_i}{\sum_i x_i} - \frac{1.96\sqrt{\sum_i Y_i}}{\sum_i x_i} \le \beta \le \frac{\sum_i Y_i}{\sum_i x_i} + \frac{1.96\sqrt{\sum_i Y_i}}{\sum_i x_i}.$$
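The closed-form MLE and interval translate directly into code. Below is a small sketch (simulated data; the $x$-values, true $\beta$, and seed are arbitrary choices of mine) that computes $\hat\beta = \sum_i Y_i / \sum_i x_i$ and the large-sample 95% CI from part 5:

```python
import numpy as np

rng = np.random.default_rng(3)
beta_true = 2.0
x = rng.uniform(0.5, 3.0, size=200)  # fixed, known positive constants
y = rng.poisson(beta_true * x)       # Y_i ~ Poisson(beta * x_i)

beta_hat = y.sum() / x.sum()                    # MLE from part 3
half_width = 1.96 * np.sqrt(y.sum()) / x.sum()  # CI half-width from part 5

print(f"beta_hat = {beta_hat:.3f}")
print(f"95% CI: ({beta_hat - half_width:.3f}, {beta_hat + half_width:.3f})")
```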
The following is the last problem of Chapter 7 (Point Estimation) in Casella & Berger, 2nd edition [1].

Example 6 (Jackknife) The jackknife is a general technique for reducing bias in an estimator. A one-step jackknife estimator is defined as follows. Let $X_1, \ldots, X_n$ be a random sample, and let $T_n = T_n(X_1, \ldots, X_n)$ be some estimator of a parameter $\theta$. In order to "jackknife" $T_n$ we calculate the $n$ statistics $T_n^{(i)}$, $i = 1, \ldots, n$, where $T_n^{(i)}$ is calculated just as $T_n$ but using the $n - 1$ observations with $X_i$ removed from the sample. The jackknife estimator of $\theta$, denoted by $JK(T_n)$, is given by
$$JK(T_n) = n T_n - \frac{n-1}{n}\sum_{i=1}^n T_n^{(i)}. \tag{6}$$
(In general, $JK(T_n)$ will have a smaller bias than $T_n$. See Miller 1974 for a good review of the properties of the jackknife.) Now take $X_1, \ldots, X_n$ to be iid Bernoulli($\theta$), as in the original exercise.

1. Show that the MLE of $\theta^2$, $\left(\sum_{i=1}^n X_i / n\right)^2$, is a biased estimator of $\theta^2$.
2. Derive the one-step jackknife estimator based on this MLE.
3. Show that the one-step jackknife estimator is an unbiased estimator of $\theta^2$. (In general, jackknifing only reduces bias; in this special case, however, it removes it entirely.)
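A numerical sketch of Example 6 (my own illustration, with arbitrary $\theta$, $n$, and seed): jackknifing the Bernoulli MLE $\bar{X}^2$ per equation (6) and checking by simulation that the bias disappears.

```python
import numpy as np

def jackknife(T, X):
    """One-step jackknife of estimator T per equation (6)."""
    n = len(X)
    leave_one_out = np.array([T(np.delete(X, i)) for i in range(n)])
    return n * T(X) - (n - 1) / n * leave_one_out.sum()

rng = np.random.default_rng(4)
theta, n, reps = 0.3, 12, 50000
T = lambda x: x.mean() ** 2  # MLE of theta^2

mle_vals, jk_vals = [], []
for _ in range(reps):
    X = rng.binomial(1, theta, size=n)
    mle_vals.append(T(X))
    jk_vals.append(jackknife(T, X))

print(theta ** 2)          # target 0.09
print(np.mean(mle_vals))   # ~ theta^2 + theta*(1-theta)/n, biased upward
print(np.mean(jk_vals))    # ~ theta^2, unbiased
```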
5 T TEST

Definition 5 (t distribution) If $Z \sim N(0, 1)$ and $W \sim \chi^2_p$, with $Z$ and $W$ independent, then
$$\frac{Z}{\sqrt{W/p}}$$
is said to follow the $t$ distribution with $p$ degrees of freedom.

A direct consequence is that for an iid $N(\mu, \sigma^2)$ sample,
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} \sim t_{n-1}. \tag{7}$$
To see this, recall that $\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0, 1)$, that $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, and that $\bar{X}$ and $S^2$ are independent. Therefore
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{N(0, 1)}{\sqrt{\chi^2_{n-1}/(n-1)}}.$$
In fact, finding the distribution of $\frac{\sqrt{n}(\bar{X} - \mu)}{S}$ was originally the motivation for Definition 5.

Example 7 (One-sample t test) Let $X_1, \ldots, X_n$ be iid $N(\theta, \sigma^2)$, where $\theta_0$ is a specified value of $\theta$ and $\sigma^2$ is unknown. We are interested in testing
$$H_0: \theta = \theta_0 \quad\text{versus}\quad H_1: \theta \ne \theta_0.$$
Show that the LRT rejects $H_0$ when
$$\frac{\sqrt{n}\,|\bar{X} - \theta_0|}{S} > c,$$
and therefore the level-$\alpha$ LRT rejects when
$$\frac{\sqrt{n}\,|\bar{X} - \theta_0|}{S} > t_{n-1,\alpha/2}. \tag{8}$$

Solution The unrestricted MLEs are $\hat\theta = \bar{X}$ and $\hat\sigma^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2$. The restricted (under $H_0$) MLEs are $\hat\theta_0 = \theta_0$ and $\hat\sigma_0^2 = \frac{1}{n}\sum_i (X_i - \theta_0)^2$. So the likelihood ratio is
$$\begin{aligned}
\lambda(X) &= \frac{(2\pi\hat\sigma_0^2)^{-n/2}\exp\left(-\frac{\sum_i (X_i - \theta_0)^2}{2\hat\sigma_0^2}\right)}{(2\pi\hat\sigma^2)^{-n/2}\exp\left(-\frac{\sum_i (X_i - \bar{X})^2}{2\hat\sigma^2}\right)} = \left(\frac{\hat\sigma^2}{\hat\sigma_0^2}\right)^{n/2} e^{-n/2 + n/2} \\
&= \left(\frac{\sum_i (X_i - \bar{X})^2}{\sum_i (X_i - \theta_0)^2}\right)^{n/2} = \left(\frac{\sum_i (X_i - \bar{X})^2}{\sum_i (X_i - \bar{X})^2 + n(\bar{X} - \theta_0)^2}\right)^{n/2} \\
&= \left(\frac{1}{1 + \frac{n(\bar{X} - \theta_0)^2}{\sum_i (X_i - \bar{X})^2}}\right)^{n/2} = \left(\frac{1}{1 + \frac{1}{n-1}\,\frac{n(\bar{X} - \theta_0)^2}{S^2}}\right)^{n/2}.
\end{aligned}$$
Clearly,
$$\lambda(X) < c \iff \frac{\sqrt{n}\,|\bar{X} - \theta_0|}{S} > c' \quad\text{for some } c'.$$
For a test with significance level $\alpha$, we choose $c'$ such that, under $H_0$,
$$P_{\theta_0}\left(\frac{\sqrt{n}\,|\bar{X} - \theta_0|}{S} > c'\right) = \alpha,$$
that is,
$$P_{\theta_0}\left(\frac{\sqrt{n}(\bar{X} - \theta_0)}{S} < -c' \ \text{ or } \ \frac{\sqrt{n}(\bar{X} - \theta_0)}{S} > c'\right) = \alpha.$$
Since $\frac{\sqrt{n}(\bar{X} - \theta_0)}{S} \sim t_{n-1}$ under $H_0$, we have $c' = t_{n-1,\alpha/2}$.

The two-sample case is similar in rationale, although slightly more involved.

Example 8 (Two-sample t test) Let $X_1, \ldots, X_n$ be an iid sample from $N(\mu_X, \sigma^2)$, and let $Y_1, \ldots, Y_m$ be an independent iid sample from $N(\mu_Y, \sigma^2)$. Again, $\sigma^2$ is unknown. We are interested in testing the means:
$$H_0: \mu_X = \mu_Y \quad\text{versus}\quad H_1: \mu_X \ne \mu_Y.$$
Show that the level-$\alpha$ LRT rejects $H_0$ if
$$\frac{|\bar{X} - \bar{Y}|}{\sqrt{S_p^2\left(\frac{1}{n} + \frac{1}{m}\right)}} > t_{n+m-2,\alpha/2},$$
where $S_p^2$ is the pooled sample variance
$$S_p^2 = \frac{1}{n+m-2}\left(\sum_{i=1}^n (X_i - \bar{X})^2 + \sum_{i=1}^m (Y_i - \bar{Y})^2\right).$$
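The rejection rule (8) and its two-sample analogue are easy to verify in code. A short sketch (simulated data; sample sizes, means, and seed are my choices) computes both statistics by hand and compares them against scipy's built-in tests:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(0.5, 2.0, size=25)
y = rng.normal(0.0, 2.0, size=30)

# One-sample t test of H0: theta = 0, per equation (8)
t1 = np.sqrt(len(x)) * (x.mean() - 0.0) / x.std(ddof=1)
print(t1, stats.ttest_1samp(x, 0.0).statistic)  # identical statistics

# Two-sample pooled t test of H0: mu_X = mu_Y, per Example 8
n, m = len(x), len(y)
sp2 = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / (n + m - 2)
t2 = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n + 1 / m))
print(t2, stats.ttest_ind(x, y, equal_var=True).statistic)  # identical statistics
```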
6 WALD, SCORE AND LIKELIHOOD RATIO TESTS

There are three common methods of constructing a large-sample test: the (asymptotic) likelihood ratio test (LRT), the Wald test, and the score test.

6.1 LRT

In the previous section, the LR statistic was eventually simplified to a statistic whose exact distribution could be found (in that case, the $t$ distribution). In most cases, however, an exact analytical treatment of the LRT is difficult or impossible, and a large-sample approximation to the distribution of the LR statistic itself can be applied.

Theorem 4 (Asymptotic distribution of the LR; assumes regularity conditions on the pdf/pmf and an iid sample) For testing $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$, let the likelihood ratio statistic be
$$\lambda(X) = \frac{L(\theta_0 \mid x)}{L(\hat\theta \mid x)},$$
where $\hat\theta$ is the MLE. Then under $H_0$, as $n \to \infty$,
$$-2\log\lambda(X) \xrightarrow{D} \chi^2_1. \tag{9}$$

To see this, let $l(\theta) = \log L(\theta \mid x)$, so that $\log\lambda(X) = l(\theta_0) - l(\hat\theta)$. Taylor expanding $l(\theta)$ around $\hat\theta$ gives
$$l(\theta) = l(\hat\theta) + l'(\hat\theta)(\theta - \hat\theta) + l''(\hat\theta)\frac{(\theta - \hat\theta)^2}{2!} + \cdots$$
Recall that the MLE $\hat\theta$ is a root of $l'(\theta)$, that is, $l'(\hat\theta) = 0$. So at $\theta = \theta_0$ we have
$$-2\log\lambda(X) = -2(l(\theta_0) - l(\hat\theta)) \doteq -l''(\hat\theta)(\theta_0 - \hat\theta)^2.$$
Under $H_0$, $\hat\theta \xrightarrow{P} \theta_0$, so $l''(\hat\theta) \approx l''(\theta_0)$ (provided $l''$ is continuous), and by the law of large numbers
$$-\frac{l''(\theta_0)}{n} \xrightarrow{P} I(\theta_0),$$
the per-observation Fisher information. From the asymptotic normality of $\hat\theta$ we know that $\sqrt{nI(\theta_0)}(\hat\theta - \theta_0) \xrightarrow{D} N(0, 1)$. Therefore
$$-2\log\lambda(X) \doteq nI(\theta_0)(\hat\theta - \theta_0)^2 \xrightarrow{D} \chi^2_1.$$
This sketches the asymptotic distribution of the LR. We state without proof the generalized version, where the parameters are vector-valued.

Theorem 5 (Multi-parameter version of the LRT) Let $X_1, \ldots, X_n$ be a random sample from $f(x \mid \theta)$. Under regularity conditions, the distribution of $-2\log\lambda(X)$ converges to a $\chi^2$ distribution as the sample size $n \to \infty$, with degrees of freedom equal to the number of free parameters under $\theta \in \Theta$ minus the number of free parameters under $\theta \in \Theta_0$.

6.2 Wald Test

Another asymptotic test based on the MLE is the Wald test. Since
$$\sqrt{nI(\theta)}\,(\hat\theta - \theta) \xrightarrow{D} N(0, 1),$$
we have, equivalently,
$$nI(\theta)(\hat\theta - \theta)^2 \xrightarrow{D} \chi^2_1.$$
In practice, the observed information number $\hat{I}(\hat\theta) = -l''(\hat\theta)$ is used in place of $nI(\theta)$. Therefore a level-$\alpha$ Wald test rejects $H_0: \theta = \theta_0$ when
$$\hat{I}(\hat\theta)(\hat\theta - \theta_0)^2 > \chi^2_1(\alpha).$$
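As a concrete sketch of the Wald and LR statistics (my example, not from the notes: a Poisson($\theta$) sample with $H_0: \theta = \theta_0$), note that $l(\theta) = \sum_i X_i \log\theta - n\theta + \text{const}$, $\hat\theta = \bar{X}$, and $-l''(\hat\theta) = \sum_i X_i/\hat\theta^2 = n/\bar{X}$, so both statistics reduce to closed forms:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
theta0, n = 2.0, 100
X = rng.poisson(theta0, size=n)  # data generated under H0
theta_hat = X.mean()             # MLE of the Poisson mean

# Wald statistic with observed information -l''(theta_hat) = n / theta_hat
wald = (n / theta_hat) * (theta_hat - theta0) ** 2

# LR statistic: -2 log lambda = 2n [ xbar log(xbar/theta0) - (xbar - theta0) ]
lrt = 2 * n * (theta_hat * np.log(theta_hat / theta0) - (theta_hat - theta0))

crit = stats.chi2.ppf(0.95, df=1)  # chi^2_1(0.05) ~ 3.84
print(wald, lrt, crit)             # both statistics typically below crit under H0
```

The two statistics agree closely here, as the quadratic approximation behind Theorem 4 suggests they should for large $n$.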
6.3 Score Test

The score statistic is defined to be
$$S_\theta(X) = \frac{\partial}{\partial\theta}\log L(\theta \mid X). \tag{10}$$
It is easily seen that $E\,S_\theta(X) = 0$ and $\mathrm{Var}\,S_\theta(X) = nI(\theta)$. The test statistic is
$$Z_s = \frac{S_\theta(X)}{\sqrt{nI(\theta)}}.$$
It can be shown that under $H_0$,
$$\frac{S_{\theta_0}(X)}{\sqrt{nI(\theta_0)}} \xrightarrow{D} N(0, 1).$$
Therefore a level-$\alpha$ score test rejects $H_0: \theta = \theta_0$ when
$$\frac{S_{\theta_0}^2(X)}{nI(\theta_0)} > \chi^2_1(\alpha).$$

References

[1] George Casella and Roger L. Berger, Statistical Inference, 2nd ed., Duxbury Press, 2001.

[2] E. L. Lehmann and George Casella, Theory of Point Estimation, 2nd ed., Springer Texts in Statistics, 1998.