Biostatics 8.pptx

Outline
Statistical inference
• Sampling distribution and its properties
• Estimation
• Hypothesis testing
• Paired and independent sample t-tests
• Chi-square test
24 January 2022

Objective
At the end of this session you should be able to:
• Estimate parameters
• Conduct hypothesis testing
• Test the associations between variables

Inferential Statistics
• It is the process of generalizing or drawing conclusions about the target population based on the information obtained from a sample.

Inferential Statistics
Notations
• Population values (parameters) are denoted using Greek letters (e.g. μ, σ, π).
• Sample values (statistics) are denoted by Roman letters (e.g. x̄, s, p).

Inferential Statistics process
• How is information from the sample linked to the population?

Sampling Distributions
• The probability distribution of a sample statistic that is formed when repeated samples of the same size are taken from the population.
• If we take many, many samples and compute the statistic for each of those samples, the distribution of all those statistics is the sampling distribution.
• Equivalently, the frequency distribution of the statistic across all these samples forms the sampling distribution of that sample statistic.

• In practice, repeated samples are not taken from the population.
• We do not encounter sampling distributions empirically, but it is necessary to know their properties in order to draw statistical inferences.
• Three things characterize a sampling distribution:
  - Its mean
  - Its variance
  - Its shape

Properties of Sampling Distributions
• The mean of the sample means will be the same as the population mean.
• The standard deviation of the sample means will be equal to the population standard deviation divided by the square root of the sample size.
• Hence the standard deviation of the sample means will be smaller than the standard deviation of the population (for n > 1).
• The standard deviation of the sampling distribution of a sample statistic is called the standard error.
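
The two properties above can be checked with a small simulation. This is only an illustrative sketch: the population values (mean 50, SD 10), the sample size 25, and the function name `simulate_sample_means` are assumptions chosen for the demonstration, not values from the lecture.

```python
import random
import statistics

def simulate_sample_means(pop_mean, pop_sd, n, reps, seed=0):
    """Draw `reps` samples of size n from a normal population; return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.gauss(pop_mean, pop_sd) for _ in range(n))
        for _ in range(reps)
    ]

# Hypothetical population: mean 50, SD 10; samples of size 25.
means = simulate_sample_means(pop_mean=50.0, pop_sd=10.0, n=25, reps=5000)

# The mean of the sample means should sit near the population mean (50),
# and their SD (the standard error) near 10 / sqrt(25) = 2.
print("mean of sample means:", round(statistics.fmean(means), 2))
print("SD of sample means:  ", round(statistics.stdev(means), 2))
```

Increasing `n` in the call shrinks the printed standard error, which is the behaviour the second bullet describes.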

Standard deviation vs Standard error
Standard deviation
• Is a measure of variability between individual observations
• Descriptive index relevant to the mean
Standard error
• The variability of summary statistics, e.g. the variability of the sample mean or a sample proportion
• It is a measure of uncertainty in a sample statistic, i.e. the precision of the estimate produced by the estimator

• Based on the nature of the summary statistic, we consider:
  - Sampling distribution of the mean
  - Sampling distribution of the proportion

Properties of sampling distribution of the means
• The mean of the sampling distribution of the means is the same as the population mean ($\mu$).
• The SD of the sampling distribution of the means is $\sigma/\sqrt{n}$.
• The shape of the sampling distribution of means is approximately a normal curve, regardless of the population distribution, when n is large enough (Central limit theorem).

Properties of sampling distribution of the proportions
• The sample proportion p will be an estimate of the population proportion $\pi$.
• The SD of the sampling distribution of the proportion is $\sqrt{\pi(1-\pi)/n}$, estimated in practice by $\sqrt{p(1-p)/n}$.
• The shape of the sampling distribution of the proportion is approximately a normal curve when n is large enough (Central limit theorem).

Central Limit Theorem
• States that, regardless of the shape of the parent population distribution, the sampling distribution of the sample mean (or sample proportion) will be normal or nearly normal if the sample size is large enough.
• But the question is "how large is large enough?"
• As a rough rule of thumb:
  - A sample size of 30 is large enough for continuous data, and
  - np ≥ 5 and nq ≥ 5 for categorical data summarized by a proportion.

Assumptions of statistical inference
• To make valid inferences or conclusions, the following assumptions must be satisfied:
  - Samples must be randomly selected.
  - Sample size must be large enough.
  - The population must be normally or approximately normally distributed if the sample size is less than 30.
  - For z-based procedures, the population variance should be known.
• What if n is not large enough and the population variance is unknown?

Student’s t-distribution
• We use the Student t distribution in statistical inference; it depends on the degrees of freedom.
• The t-distribution is a theoretical probability distribution which is symmetrical, bell-shaped, and similar to the normal but more spread out.
• The conditions for using the Student t distribution:
  - The sample is from a normally distributed population,
  - The population variance is unknown, and
  - The sample size is small, i.e. less than 30 (or np < 5 or nq < 5).

Student’s t-distributions
• The t distribution and the standard normal distribution are similar in that:
  - Both are bell shaped.
  - Both are symmetrical about the mean.
  - The mean, median, and mode are equal to 0 and located at the center.
  - The curve never touches the x axis.
• The t distribution differs from the standard normal distribution:
  - Its variance is greater than one.
  - The t distribution is based on degrees of freedom (df), which are related to sample size.
  - As sample size increases, the t distribution approaches the standard normal distribution (Z).

Parameter Estimations
• We generally assume the underlying distribution of the variable of interest is adequately described by one or more unknown parameters.
• Because it is usually not possible to make measurements on every individual in a population, parameters cannot usually be determined exactly.
• Instead, we estimate parameters by calculating the corresponding characteristics from a random sample; these are called estimates.

Estimation
• It is a procedure in which the information obtained from a sample is used to approximate the true population parameter.
• That is, it is the process of estimating population parameters by using sample statistics.
• An estimator is any statistic that is used to estimate an unknown population parameter.
• The value or values that the estimator assumes are called estimates.

Characteristics of good estimator
An estimator should be:
• Unbiased: the expected value of the estimator must be equal to the parameter to be estimated.
• Consistent: as the sample size increases, the value of the estimator should approach the value of the estimated parameter.
• Efficient: the variance of the estimator should be the smallest among competing estimators.
• Sufficient: the estimator should use the maximum possible information about the parameter that the sample contains.

Estimation
There are two types of estimation:
• Point estimation and
• Interval estimation

Point Estimation
• A single numerical value is used to estimate the corresponding population parameter.
• The corresponding point estimators for the parameters:
  - Population mean μ: sample mean x̄
  - Population proportion π: sample proportion p
  - Difference in means μ1 - μ2: x̄1 - x̄2
  - Difference in proportions π1 - π2: p1 - p2

Point Estimation
However, there are pitfalls of point estimation.
• Different samples yield different estimates for a single unknown population parameter.
• A point estimate does not take this sample-to-sample variability into account.
• A point estimate does not give the precision of the estimate, and hence we need another method of estimation which handles these problems.

Interval Estimation
• It is an interval computed from sample data that is expected to contain the true population parameter with a certain level of confidence.
• CI = point estimate ± margin of error (reliability coefficient × standard error)
• A CI consists of three parts:
  - The statistic (point estimate),
  - A confidence level (which fixes the reliability coefficient), and
  - The standard error.
• Interval estimates are commonly called confidence intervals.

Interval Estimation
Level of confidence
• Is the probability, over repeated sampling, that the interval (point estimate ± margin of error) captures the population parameter.
• The level of confidence is denoted as (1-α)100%.
• The confidence level can never be 100%!
• Most commonly, 95% confidence intervals are calculated.
• However, 90% and 99% confidence intervals are sometimes used.

Interval Estimation
A CI in general:
• Considers variation in sample statistics from sample to sample
• Is based on observations from one sample
• Gives information about closeness to unknown population parameters
• Is stated in terms of a level of confidence
• Interpretation of a confidence interval (e.g. a 95% CI): if we take 100 repeated samples of size n and construct a confidence interval from each, we expect about 95 of them to contain the true population parameter.
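
The repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a normal population with known σ and uses made-up values (μ = 1.5, σ = 1.5, n = 20); the function name `coverage` is hypothetical. It counts how often a z-based 95% interval captures the true mean.

```python
import random
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, reps, level=0.95, seed=1):
    """Fraction of simulated z-intervals (known sigma) that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # reliability coefficient
    half_width = z * sigma / n ** 0.5               # margin of error
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

# Assumed illustrative population; the printed fraction should be near 0.95.
print(coverage(mu=1.5, sigma=1.5, n=20, reps=2000))
```

Any single interval either contains μ or it does not; the 95% describes the long-run behaviour of the procedure.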

Interval Estimation
The general formula for all CIs is:
point estimate ± (reliability coefficient) × (standard error)
• Point estimate: the value of the statistic in my sample (e.g., mean, proportion, mean difference, proportion difference, etc.)
• Reliability coefficient: the measure of how confident we want to be, taken from a Z table or a t table, depending on the sampling distribution of the statistic.
• Standard error: the standard error of the statistic.

Margin of Error
• It is the amount added to and subtracted from the point estimate in confidence interval estimation.
• It is a measure of precision.
• The margin of error is the product of:
  - The reliability coefficient corresponding to the confidence level, and
  - The standard error of the estimator.

Interval Estimation
The width of the confidence interval depends on:
• Sample size
  - The larger the sample size, the narrower the confidence interval and the more precise our estimate, because the standard error decreases as sample size increases.
  - That is, the sample statistic will tend to be closer to the population parameter.
• Standard deviation
  - The more variation among the individual values, the wider the confidence interval and the less precise the estimate.

Interval Estimation
• Confidence level
  - The larger the confidence level, the wider the confidence interval.
  - A 90% CI is narrower than a 95% CI, since we are only 90% certain that the interval includes the population parameter.
  - A 99% CI is wider than a 95% CI; the extra width means that we can be more certain that the interval will contain the population parameter.

Interval Estimation
Confidence intervals can be estimated for:
• A single population
  - One population mean
  - One population proportion
• Two populations
  - Difference between two population means
  - Difference between two population proportions

Estimation for Single Population

CI for a Single Population Mean
• When the following assumptions are fulfilled:
  - The population standard deviation ($\sigma$) is known
  - The population is normally distributed
  - If the population is not normal, use a large sample
• A 100(1-α)% C.I. for μ is calculated by:
$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
• α is to be chosen by the researcher; the most common values of α are 0.05, 0.01 and 0.1.

Confidence interval
• The point estimate of μ is the sample mean $\bar{x}$.
• The standard error of $\bar{x}$ is $\sigma/\sqrt{n}$.
• Commonly used CLs are 90%, 95%, and 99%.

Example:
1. Waiting times (in hours) at a particular hospital are believed to be approximately normally distributed with a variance of 2.25 hr².
   a. A sample of 20 outpatients revealed a mean waiting time of 1.52 hours. Calculate the point estimate and construct the 95% CI.
   b. Suppose that the mean of 1.52 hours had resulted from a sample of 32 patients. Calculate the point estimate and find the 95% CI.
   c. What effect does a larger sample size have on the CI?

Solutions
A. The point estimate of μ is $\bar{x}$ = 1.52 hours, and σ = √2.25 = 1.5 hours.
$$1.52 \pm 1.96\,\frac{\sqrt{2.25}}{\sqrt{20}} = 1.52 \pm 1.96(0.33) = 1.52 \pm 0.65 = (0.87,\ 2.17)$$
• We are 95% confident that the true mean waiting time is between 0.87 and 2.17 hrs.
• Although the true mean may or may not be in this interval, 95% of the intervals formed in this manner will contain the true mean.
• An incorrect interpretation is that there is 95% probability that this interval contains the true population mean.

Solutions
B. The point estimate is again $\bar{x}$ = 1.52 hours.
$$1.52 \pm 1.96\,\frac{\sqrt{2.25}}{\sqrt{32}} = 1.52 \pm 1.96(0.27) = 1.52 \pm 0.53 = (0.99,\ 2.05)$$
C. The larger sample size makes the CI narrower (more precision).
• When constructing CIs, it has been assumed that the standard deviation of the underlying population, σ, is known.
• What if σ is not known?

Unknown variance (small sample size, n < 30)
• If σ for the underlying population is unknown and the sample size is small,
• as an alternative we use Student’s t distribution:
$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$

Degrees of Freedom (df)
• df = number of observations that are allowed to vary freely after the estimator has been calculated. df = n-1

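Example
• Compute a 95% CI for the mean birth weight based on n = 10, sample mean = 116.9 oz and s = 21.70 oz.
• From the t table, t(9, 0.975) = 2.262
$$116.9 \pm 2.262\,\frac{21.70}{\sqrt{10}} = 116.9 \pm 15.5 = (101.4,\ 132.4)$$
• Answer: (101.4, 132.4)
• Interpretation: we are 95% confident that the true mean birth weight lies between 101.4 and 132.4 oz.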
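
As a check on the birth-weight answer, here is a minimal sketch of the t-based interval. The Python standard library has no t quantile function, so the critical value 2.262 is passed in from the t table exactly as in the slide; `t_interval` is a hypothetical helper name.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """CI for a mean with unknown sigma; t_crit comes from a t table (df = n - 1)."""
    margin = t_crit * s / sqrt(n)
    return xbar - margin, xbar + margin

# n = 10, sample mean 116.9 oz, s = 21.70 oz, t(9, 0.975) = 2.262 from the table.
lo, hi = t_interval(116.9, 21.70, 10, t_crit=2.262)
print(f"95% CI: ({lo:.1f}, {hi:.1f})")
```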

CIs for single population proportion, p
• An interval estimate for the population proportion (π) can be calculated by adding an allowance for uncertainty to the sample proportion (p).
• It is based on the three elements of a CI:
  - Point estimate
  - SE of the point estimate
  - Reliability coefficient

CIs for single population proportion, p
$$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$

Example
 A random sample of 100 people shows that 25 are left-handed.
Calculate the point estimate and form a 95% CI for the true
proportion of left-handers.

Example
• It was found that 28.1% of 153 cervical-cancer cases had never had a Pap smear prior to the time of the case’s diagnosis. Calculate a 95% CI for the percentage of cervical-cancer cases who never had a Pap smear.
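
Both proportion examples can be worked with the large-sample (normal approximation) interval p ± z·√(p(1-p)/n). The sketch below is an illustration under that assumption; `proportion_interval` is a hypothetical helper name, and the slides themselves do not give numerical answers for these two exercises.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(p, n, level=0.95):
    """Large-sample CI for a proportion; assumes n*p >= 5 and n*(1-p) >= 5."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Left-handedness: 25 of 100 people, so the point estimate is p = 0.25.
left = proportion_interval(25 / 100, 100)
# Pap smear: 28.1% of 153 cervical-cancer cases.
pap = proportion_interval(0.281, 153)

print("left-handed:", tuple(round(x, 3) for x in left))
print("no Pap smear:", tuple(round(x, 3) for x in pap))
```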

Estimation for two Populations

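CI for the difference between population means
• Known variances and large sample size
• When σ1 and σ2 are known and both populations are normal, or both sample sizes are at least 30:
  - The reliability coefficient is a z-value
  - The point estimate of (μ1 - μ2) is $(\bar{x}_1 - \bar{x}_2)$
  - The standard error of $(\bar{x}_1 - \bar{x}_2)$ is $\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$
  - Finally, the (1-α)100% CI is
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
• If the population variances are unknown, they can be approximated by the sample variances $s_1^2$ and $s_2^2$ when the samples are large (n ≥ 30).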

Example 1
• Researchers wish to know if the data they have collected provide sufficient evidence to indicate a difference in mean serum uric acid levels between normal individuals and individuals with Down’s syndrome. The data consist of serum uric acid readings on 12 individuals with Down’s syndrome and 15 normal individuals. The means are $\bar{x}_1$ = 4.5 mg/100 ml and $\bar{x}_2$ = 3.4 mg/100 ml. The data constitute two independent simple random samples, each drawn from a normally distributed population with a variance equal to 1 (mg/100 ml)².
• Compute the point estimate and construct a 95% CI for the difference in mean serum uric acid levels between the two populations.

Example 2
• Researchers are interested in the difference between serum uric acid levels in patients with and without Down’s syndrome.
• Patients with Down’s syndrome: n = 12, sample mean = 4.5 mg/100 ml, σ² = 1.0
• Patients without Down’s syndrome: n = 15, sample mean = 3.4 mg/100 ml, σ² = 1.5
• Calculate the 95% CI.
  SE = √(1.0/12 + 1.5/15) = 0.43, 95% CI = 1.1 ± 1.96(0.43) = (0.26, 1.94)
• We are 95% confident that the true difference between the two population means is between 0.26 and 1.94 mg/100 ml.
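
The two uric acid examples can be computed with one small function. This is a sketch assuming known population variances and independent samples; `two_mean_z_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def two_mean_z_interval(x1, var1, n1, x2, var2, n2, level=0.95):
    """CI for mu1 - mu2 with known population variances."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sqrt(var1 / n1 + var2 / n2)     # standard error of x1 - x2
    diff = x1 - x2                       # point estimate
    return diff - z * se, diff + z * se

# Example 1: both variances equal to 1.
ex1 = two_mean_z_interval(4.5, 1.0, 12, 3.4, 1.0, 15)
# Example 2: variances 1.0 and 1.5.
ex2 = two_mean_z_interval(4.5, 1.0, 12, 3.4, 1.5, 15)

print("Example 1:", tuple(round(v, 2) for v in ex1))
print("Example 2:", tuple(round(v, 2) for v in ex2))
```

Neither interval contains 0, which is consistent with a real difference between the two population means.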

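Unknown variances ($\sigma_1^2$ and $\sigma_2^2$) and small sample size (n < 30)
• If the following assumptions are satisfied:
  - The two random samples are independent.
  - Both samples are drawn from populations with normal distributions.
  - The population variances are unknown but are assumed to be equal.
• The reliability coefficient is a t-value with degrees of freedom = $n_1 + n_2 - 2$
• The point estimate of (μ1 - μ2) is $(\bar{x}_1 - \bar{x}_2)$
• The pooled sample variance is
$$S_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$$
• The standard error of $(\bar{x}_1 - \bar{x}_2)$ is
$$\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
• Finally, the (1-α)100% confidence interval for (μ1 - μ2) is
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$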

Example
• A research team collected serum amylase data from a sample of healthy subjects and from a sample of hospitalized subjects. They wish to know if they would be justified in concluding that the population means are different. The data consist of serum amylase determinations on $n_2$ = 15 healthy subjects and $n_1$ = 22 hospitalized subjects. The sample means and standard deviations are as follows:
  - $\bar{x}_1$ = 120 units/ml, $s_1$ = 40 units/ml
  - $\bar{x}_2$ = 96 units/ml, $s_2$ = 35 units/ml
• Construct a 95% CI for the difference between the two population mean serum amylase levels.

Example
• Calculate the pooled variance
$$S_p^2 = \frac{21(40)^2 + 14(35)^2}{22+15-2} = \frac{50750}{35} = 1450$$
• Calculate the 95% confidence interval, using t(35, 0.975) = 2.0301
$$(120-96) \pm 2.0301\sqrt{1450\left(\frac{1}{22}+\frac{1}{15}\right)} = 24 \pm 2.0301(12.75)$$
• 95% CI = 24 ± 25.9 = (-1.9, 49.9)

CI for the difference between population proportions
• Suppose that n1 and n2 are large enough so that
  - $n_1p_1 \ge 5$, $n_1(1-p_1) \ge 5$, $n_2p_2 \ge 5$, and $n_2(1-p_2) \ge 5$
• The point estimate for the difference of two population proportions, $\pi_1 - \pi_2$, is $p_1 - p_2$.
• The standard error of $p_1 - p_2$ is
$$\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$
• A (1-α)100% confidence interval estimate for the difference of population proportions, $\pi_1 - \pi_2$, is
$$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

Example
• Each of two groups consists of 100 patients who have leukemia. A new drug is given to the first group but not to the second (the control group). It is found that in the first group 75 people have remission for 2 years, but only 60 in the second group. Find 95% confidence limits for the difference in the proportion of all patients with leukemia who have remission for 2 years.

Example
• $p_1$ = 0.75, $q_1$ = 0.25, $n_1$ = 100; $p_2$ = 0.60, $q_2$ = 0.40, $n_2$ = 100
• $n_1p_1$ = 75 > 5 and $n_1q_1$ = 25 > 5
• $n_2p_2$ = 60 > 5 and $n_2q_2$ = 40 > 5
• $\sigma_1^2 = p_1q_1/n_1$ = 0.001875 and $\sigma_2^2 = p_2q_2/n_2$ = 0.0024
• Hence, the $\sigma^2$ for $p_1 - p_2$ = 0.001875 + 0.0024 = 0.004275
• $\sigma$ for $p_1 - p_2$ = √0.004275 = 0.0654
• At a 95% confidence level, Z = ±1.96; $p_1 - p_2$ = 0.75 - 0.60 = 0.15
• Therefore, 95% C.I. = 0.15 ± 1.96(0.0654) = 0.15 ± 0.13 = (0.02, 0.28).
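
The leukemia calculation can be sketched in Python as follows. It assumes the large-sample conditions checked above hold; `two_proportion_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_interval(p1, n1, p2, n2, level=0.95):
    """Large-sample CI for pi1 - pi2 from two independent samples."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# 75 of 100 in remission on the new drug vs 60 of 100 in the control group.
lo, hi = two_proportion_interval(0.75, 100, 0.60, 100)
print(f"95% CI for the difference: ({lo:.2f}, {hi:.2f})")
```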

Summary
• When to use $t_{\alpha/2}$ or $z_{\alpha/2}$ for finding a confidence interval:
  - Is σ known?
    - Yes: use $z_{\alpha/2}$ values, no matter what the sample size is.
    - No: is n ≥ 30 (or np and nq ≥ 5)?
      - Yes: use $z_{\alpha/2}$ values and s in place of σ in the formula.
      - No: use $t_{\alpha/2}$ values and s in the formula.

Hypothesis Testing
• Researchers conduct studies to answer research questions or test hypotheses.
• The best way to determine whether their hypothesis is true would be to examine the entire population.
• Because that is often impractical, researchers typically examine a random sample from the population.
• The purpose of any study is to collect data which will allow the researcher to test the hypothesis or answer their question.
• Statistical tests can provide evidence (with a certain degree of confidence) for or against a hypothesis.

Hypothesis Testing
• In hypothesis testing, the researcher must:
  - define the population under study,
  - state the particular hypothesis that will be investigated,
  - determine the significance level,
  - select a sample from the population and collect the data, and
  - perform the appropriate statistical test and reach a conclusion.

Hypothesis Testing
• A hypothesis is a testable statement that describes the nature of the proposed relationship between two or more variables of interest.
• Hypotheses are formulated, experiments are performed, and results are evaluated for their consistency with the hypothesis.
• Hypothesis testing (HT) provides an objective framework for making decisions using probabilistic methods.
• The purpose of HT is to aid the clinician, researcher or administrator in reaching a decision (conclusion).

Types of Hypothesis
• The Null Hypothesis, H0
  - Is a statement claiming that there is no difference between the hypothesized value and the population value (parameter = hypothesized value)
  - It is a statement of agreement (no difference between groups, or the intervention is not effective)
  - States the assumption (hypothesis) to be tested
  - It is always about a population parameter (mean, proportion, OR, RR, etc.), not about a sample statistic
  - Always contains the "=", "≤" or "≥" sign
  - May or may not be rejected

Types of Hypothesis
The Alternative Hypothesis, HA
• It is a statement we will believe as true if we reject H0.
• It is generally the hypothesis that is believed (or needs to be supported) by the researcher.
• Is a statement that disagrees with (opposes) H0 (there is a difference between groups, or the intervention is effective)
• Never contains the "=", "≤" or "≥" sign; it contains "≠", ">" or "<"
• May or may not be accepted

Rules for Stating Statistical Hypotheses
• An indication of equality (either =, ≤ or ≥) must appear in H0.
  - H0: μ = μ0, HA: μ ≠ μ0, when our hypothesis is expressed in terms of a population mean
  - H0: P = P0, HA: P ≠ P0, when our hypothesis is expressed in terms of a population proportion
• Can we conclude that a certain population mean is
  - not 50? H0: μ = 50 and HA: μ ≠ 50
  - greater than 50? H0: μ ≤ 50 and HA: μ > 50
• Can we conclude that the proportion of patients with leukemia who survive more than six years is not 60%?
  - H0: P = 0.6 and HA: P ≠ 0.6
• Can we conclude that smoking is significantly associated with lung cancer?
  - H0: there is no association between smoking and lung cancer.
  - HA: there is an association between smoking and lung cancer.

Hypothesis testing process
• Now think about how the hypothesis test should be carried out:
  - We draw a random sample of size n from the underlying population and calculate its sample mean ($\bar{x}$).
  - We compare $\bar{x}$ to the postulated mean μ0.
  - Is the difference between $\bar{x}$ and μ0 too large to be attributed to chance alone?

Steps in Hypothesis Testing
1. Formulate the appropriate statistical hypotheses clearly: specify H0 and HA
   H0: μ = μ0      H0: μ ≤ μ0      H0: μ ≥ μ0
   HA: μ ≠ μ0      HA: μ > μ0      HA: μ < μ0
   two-tailed      one-tailed      one-tailed
2. Decide on the appropriate test statistic for the hypothesis. E.g., for one population:
$$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} \quad\text{or}\quad t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}$$

Steps in Hypothesis Testing
3. Specify the desired level of significance (α = 0.05, 0.01, etc.)
4. Determine the critical value.
5. Compute the test statistic or the p-value.
6. Reach a decision and draw the conclusion:
   • If H0 is rejected, we conclude that HA is supported (accepted).
   • If H0 is not rejected, we conclude that H0 may be true.

One tail and two tail tests
• Depending on the way the hypotheses are written, hypothesis testing can be:
• Two tail test
  - The rejection region is split into the two tails.
  - The alternative hypothesis takes the form "different from".
• One tail test
  - The rejection region is at one end of the distribution or the other.
  - The alternative hypothesis takes the form "less than" or "greater than".

Level of Significance, α
• Is the probability of rejecting a true H0
• Defines the rejection region of the sampling distribution
• The decision is made on the basis of the level of significance, designated by α.
• The most frequently used values of α are 0.01, 0.05 and 0.10.
• α is selected by the researcher at the beginning.
Test statistic
 Anyobserveddifferences or associationsmayhaveoccurred bychance.
 Becausethere israndomvariation, evenanunbiasedsamplemaynot accurately
represent the population asa whole.
 Atest statisticsisavaluewecancompare with knowndistribution ofwhatwe
expect when the null hypothesisis true.
 The general formula of any test statisticsis:
𝒐
𝒃
𝒔
𝒆
𝒓
𝒃
𝒆
𝒅𝒗
𝒂
𝒍
𝒖
𝒆
−
𝒉
𝒚
𝒑
𝒐
𝒕
𝒆
𝒔
𝒊
𝒛
𝒆
𝒅
𝒗
𝒂
𝒍
𝒖
𝒓
𝒔
𝒕
𝒂
𝒏
𝒅
𝒂
𝒓
𝒅𝒆
𝒓
𝒓
𝒐
𝒓
 Anexampleofatest statistic isz-test , t-test, X2-test

Critical value
• The value that separates the rejection region from the non-rejection (acceptance) region for a given level of significance.
• The values of the test statistic lie along the horizontal axis of its sampling distribution (e.g. the standard normal), which the critical value separates into two regions:
  - Rejection region, and
  - Non-rejection region.
• The values of the test statistic forming the rejection region are less likely to occur if H0 is true.
• The values making up the non-rejection (acceptance) region are more likely to occur if H0 is true.

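Rejection and Non-Rejection Regions
[Figure: standard normal curve for a two-tailed test at α = 0.05. The rejection regions, each of area α/2 = 0.025, lie below -1.96 and above 1.96; the non-rejection region between them has area 0.95.]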
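
The critical values ±1.96 in the figure come from the standard normal quantile function. A minimal sketch using Python's standard library (`z_critical` is a hypothetical helper name):

```python
from statistics import NormalDist

def z_critical(alpha, two_tailed=True):
    """Positive critical value of the standard normal for significance level alpha."""
    tail_area = alpha / 2 if two_tailed else alpha  # area in one rejection tail
    return NormalDist().inv_cdf(1 - tail_area)

print(round(z_critical(0.05), 3))                    # two-tailed: 1.96
print(round(z_critical(0.05, two_tailed=False), 3))  # one-tailed: 1.645
```

For a lower-tail test the rejection region lies below the negative of the returned value.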

P-value
• In most applications, the outcome of performing a hypothesis test is a p-value.
• The P-value is the probability of obtaining a test statistic as extreme as or more extreme than the one actually observed, if H0 is true.
• It is not the probability that H0 is true; it measures how likely a difference at least as large as the observed one would be if chance alone were operating.
• The larger the test statistic (in absolute value), the smaller the P-value, i.e. the less likely it is that the observed value occurred just by chance.
• The smaller the P-value, the stronger the evidence for rejecting H0.
  - Reject H0 if P-value < α
  - Do not reject H0 if P-value > α
  - What if P-value = α?

How to calculate P-value
• Use statistical software like SPSS, SAS, STATA, or R, etc.
• Manual calculations
  - Obtain the test statistic (Z calculated).
  - Find the probability for the absolute value of the test statistic from the standard normal table (area between 0 and |Z|).
  - Subtract this probability from 0.5 to get the tail area.
  - If the test is two-tailed, multiply the result by 2.
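
The manual steps above translate directly into code. In this sketch the normal CDF replaces the table lookup; `z_p_value` is a hypothetical helper name, and the one-tailed branch assumes the observed statistic falls in the direction stated by HA.

```python
from statistics import NormalDist

def z_p_value(z, two_tailed=True):
    """P-value for a z statistic, following the table-based steps."""
    area_0_to_z = NormalDist().cdf(abs(z)) - 0.5  # table value: area between 0 and |z|
    tail = 0.5 - area_0_to_z                      # subtract from 0.5 to get the tail area
    return 2 * tail if two_tailed else tail       # double it for a two-tailed test

print(round(z_p_value(-2.12), 4))                    # two-tailed
print(round(z_p_value(-2.12, two_tailed=False), 4))  # one-tailed
```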

Statistical Decision
• Based on computations from the sample data
• The decision to reject or not to reject H0 is based on:
  - The magnitude of the test statistic
  - The CI
  - The P-value
• Reject H0 if the value of the test statistic falls in the rejection region.
• Don’t reject H0 if the computed value of the test statistic is one of the values in the non-rejection region.

Errors in hypothesis testing
• Whenever we reject or fail to reject H0, we risk committing an error.
• Two types of errors can be committed:
  - Type I Error
  - Type II Error

Type I Error
• The error committed when a true H0 is rejected
• Considered a serious type of error
• The probability of a type I error is the probability of rejecting H0 when it is true
• The probability of a type I error is α, called the level of significance of the test
• Set by the researcher in advance

Type II Error
• The error committed when a false H0 is not rejected
• The probability of a Type II Error is β
• Usually unknown but larger than α

Power
• The probability of rejecting H0 when it is false.
• Power = 1 - β = 1 - probability of type II error
• We would like to maintain a low probability of a type I error (α) and a low probability of a type II error (β) [high power = 1 - β].

Summary
Decision (Conclusion)    Reality: H0 True                          Reality: H0 False
Do not reject H0         Correct action (Prob. = 1-α)              Type II error (Prob. = β = 1-Power)
Reject H0                Type I error (Prob. = α = Sign. level)    Correct action (Prob. = Power = 1-β)

Summary
Example
H0: there is no pregnancy; HA: there is pregnancy

Factors Affecting the Power of the Test
• The power depends on:
  1. As n ↑, power ↑
  2. As |µ1-µ0| ↑, power ↑
  3. As σ ↑, power ↓
  4. As α ↓, power ↓
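
These four relationships can be seen numerically. The sketch below assumes the simplest case, a one-sided z test with known σ, where power = Φ(|µ1-µ0|·√n/σ - z_α); the baseline numbers are made up for illustration and `power_one_sided_z` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z test; delta = |mu1 - mu0|, sigma known."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(abs(delta) * sqrt(n) / sigma - z_alpha)

base = power_one_sided_z(delta=3, sigma=4.5, n=10)
print("baseline        :", round(base, 3))
print("larger n        :", round(power_one_sided_z(3, 4.5, 20), 3))       # goes up
print("larger |mu1-mu0|:", round(power_one_sided_z(4, 4.5, 10), 3))       # goes up
print("larger sigma    :", round(power_one_sided_z(3, 6.0, 10), 3))       # goes down
print("smaller alpha   :", round(power_one_sided_z(3, 4.5, 10, 0.01), 3)) # goes down
```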

Hypothesis Test for One Sample
• Test for single mean
• Test for single proportion

Hypothesis Testing of a Single Mean
(Normally Distributed)

Hypothesis Testing for Known Variance
• Two tailed test
$$H_0: \mu = \mu_0, \qquad H_A: \mu \neq \mu_0$$
$$z_{cal} = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}, \qquad z_{tab} = z_{\alpha/2}\ \text{for a two tailed test}$$
Decision:
  - if $|z_{cal}| < z_{tab}$, do not reject H0
  - if $|z_{cal}| \ge z_{tab}$, reject H0

Example
• A simple random sample of 10 people from a certain population has a mean age of 27. Can we conclude that the mean age of the population is not 30? The variance is known to be 20. Let α = .05.
Data
n = 10, sample mean = 27, σ² = 20, α = 0.05
Assumptions
• Simple random sample
• Normally distributed population

Example
A. Hypothesis
H0: µ = 30
HA: µ ≠ 30
B. Test statistic
As the population variance is known, we use Z as the test statistic.
C. Determine the level of significance: α = 0.05

Example
D. Determine the critical value
• Reject H0 if the Z value falls in the rejection region.
• Don’t reject H0 if the Z value falls in the non-rejection region.
• Because of the structure of H0 it is a two tail test. Therefore, reject H0 if Z ≤ -1.96 or Z ≥ 1.96.

Example
E. Calculation of test statistic or compute CI
$$z_{cal} = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}} = \frac{27-30}{\sqrt{20}/\sqrt{10}} = \frac{-3}{1.4142} = -2.12$$
F. Statistical decision
We reject H0 at α of 5% because Z = -2.12 is in the rejection region.
Conclusion
We conclude that µ is not 30. P-value = 0.0340
A Z value of -2.12 corresponds to a tail area of 0.0170. Since there are two parts to the rejection region in a two tail test, the P-value is twice this, which is 0.0340.
Hypothesis testing using confidence interval
• A problem like the above example can also be solved using a confidence interval.
• A confidence interval will show that the hypothesized value of µ (30) does not fall within the boundaries of the interval. However, it will not give a probability.
• Confidence interval
$$CI = \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 27 \pm 1.96(1.4142) = (24.228,\ 29.772)$$

Hypothesis Testing for Known Variance
• One tailed test
$$H_0: \mu \ge \mu_0, \qquad H_A: \mu < \mu_0$$

Example
• A simple random sample of 10 people from a certain population has a mean age of 27. Can we conclude that the mean age of the population is less than 30? The variance is known to be 20. Let α = 0.05.
• Data
  n = 10, sample mean = 27, σ² = 20, α = 0.05
• Hypotheses
  H0: µ ≥ 30, HA: µ < 30

Example
• Test statistic
$$z = \frac{27-30}{\sqrt{20}/\sqrt{10}} = -2.12$$
• Rejection region (lower tail test)
  With α = 0.05 and the inequality, we have the entire rejection region at the left. The critical value will be Z = -1.645. Reject H0 if Z < -1.645.
Example
• Statistical decision
– We reject the Ho because -2.12 < -1.645.
• Conclusion
– We conclude that µ < 30.
– p = .0170 this time because it is only a one tail test and not a
two tail test.

Hypothesis testing for unknown variance (n small)
• In most practical applications the standard deviation of the underlying population is not known.
• In this case, σ can be estimated by the sample standard deviation s.
• If the underlying population is normally distributed, then the test statistic is:
$$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}, \qquad df = n-1$$

Example
• A simple random sample of 14 people from a certain population gives a sample mean body mass index (BMI) of 30.5 and SD of 10.64. Can we conclude that the mean BMI is not 35 at α = 5%?
• H0: µ = 35, HA: µ ≠ 35
• Test statistic
$$t = \frac{30.5-35}{10.64/\sqrt{14}} = -1.58$$
• If the assumptions are correct and H0 is true, the test statistic follows Student's t distribution with 13 degrees of freedom.

Example
• Decision rule
• We have a two tailed test. With α = 0.05 it means that each tail is 0.025. The critical t values with 13 df are -2.1604 and 2.1604.
• We reject H0 if t ≤ -2.1604 or t ≥ 2.1604.
• Do not reject H0 because -1.58 is not in the rejection region. Based on the data of the sample, it is possible that µ = 35. P-value = 0.137
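
A minimal sketch of the BMI test follows. The standard library has no t distribution, so the critical value 2.1604 (13 df) is taken from the t table as in the slide and no P-value is computed; `one_sample_t` is a hypothetical helper name.

```python
from math import sqrt

def one_sample_t(xbar, mu0, s, n):
    """t statistic for a single mean when sigma is unknown (df = n - 1)."""
    return (xbar - mu0) / (s / sqrt(n))

t_cal = one_sample_t(30.5, 35, 10.64, 14)
t_tab = 2.1604                  # from the t table: two-tailed, alpha = 0.05, df = 13
reject = abs(t_cal) >= t_tab    # two-tailed decision rule

print(f"t = {t_cal:.2f}, critical = ±{t_tab}, reject H0: {reject}")
```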

Hypothesis testing for proportions
• Involves categorical values
• Two possible outcomes
  - "Success" (possesses a certain characteristic)
  - "Failure" (does not possess that characteristic)
• The fraction or proportion of the population in the "success" category is denoted by p

Hypothesis testing for proportions
• For a large sample (np0 ≥ 5 and nq0 ≥ 5), the test statistic is
$$z = \frac{p - p_0}{\sqrt{p_0(1-p_0)/n}}$$
• Otherwise: t-test at n-1 df

Example
• We are interested in the probability of developing asthma over a given one-year period for children 0 to 4 years of age whose mothers smoke in the home. In the general population of 0 to 4-year-olds, the annual incidence of asthma is 1.4%. If 10 cases of asthma are observed over a single year in a sample of 500 children whose mothers smoke, can we conclude that this is different from the underlying probability of p0 = 0.014? α = 5%
H0: p = 0.014
HA: p ≠ 0.014

Example
• The test statistic is given by:
$$z = \frac{p - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.02-0.014}{\sqrt{0.014(0.986)/500}} = 1.14$$

Example
• The critical value of $z_{\alpha/2}$ at α = 5% is ±1.96.
• Don’t reject H0 since Z (= 1.14) is in the non-rejection region between ±1.96.
• P-value ≈ 0.25
• We do not have sufficient evidence to conclude that the probability of developing asthma for children whose mothers smoke in the home is different from the probability in the general population.
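
The asthma example can be reproduced with a short large-sample z test for a proportion. This is a sketch; `proportion_z_test` is a hypothetical helper name, and the standard error is computed under H0 (using p0), as in the formula above.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Two-tailed large-sample z test of H0: p = p0."""
    p_hat = successes / n                       # sample proportion
    se0 = sqrt(p0 * (1 - p0) / n)               # standard error under H0
    z = (p_hat - p0) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = proportion_z_test(10, 500, 0.014)
print(f"z = {z:.2f}, P-value = {p_value:.3f}")
```

Because |z| is below 1.96 and the P-value is above 0.05, the code leads to the same decision as the slide: do not reject H0.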

Hypothesis testing for two samples
• Comparing Two Population Means
  - Independent samples: variances known
  - Independent samples: variances unknown
• Paired Difference Experiments
  - Paired/matched/repeated sampling
• Comparing Two Population Proportions
  - Large, independent samples case

Hypothesis testing for two population means
• Independent samples with known variances, or both groups have large sample sizes
The steps to test the hypothesis for a difference of means are the same as for a single mean.
Step 1: State the hypothesis
H0: µ1-µ2 = 0 vs HA: µ1-µ2 ≠ 0, or HA: µ1-µ2 < 0, or HA: µ1-µ2 > 0
Step 2: Significance level (α)
Step 3: Test statistic
$$z_{cal} = \frac{(\bar{x}-\bar{y}) - (\mu_1-\mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
Decision, with $z_{tab} = z_{\alpha/2}$ for a two tailed test and $z_{tab} = z_{\alpha}$ for a one tailed test:
• For HA: µ1-µ2 ≠ 0: if $|z_{cal}| < z_{tab}$ do not reject H0; if $|z_{cal}| \ge z_{tab}$ reject H0
• For HA: µ1-µ2 < 0: if $z_{cal} > -z_{tab}$ do not reject H0; if $z_{cal} \le -z_{tab}$ reject H0
• For HA: µ1-µ2 > 0: if $z_{cal} < z_{tab}$ do not reject H0; if $z_{cal} \ge z_{tab}$ reject H0

Example
• Researchers wish to know if the data they have collected provide sufficient evidence to indicate a difference in mean serum uric acid levels between normal individuals and individuals with Down’s syndrome. The data consist of serum uric acid readings on 12 individuals with Down’s syndrome and 15 normal individuals. The means are 4.5 mg/100 ml and 3.4 mg/100 ml, with population standard deviations of 2.9 and 3.5 mg/100 ml respectively.
$$H_0: \mu_1 - \mu_2 = 0, \qquad H_A: \mu_1 - \mu_2 \neq 0$$

SOLUTION
• The test statistic, computed from the data stated above:
$$z_{cal} = \frac{(\bar{x}-\bar{y}) - (\mu_1-\mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{(4.5-3.4)-0}{\sqrt{\dfrac{2.9^2}{12} + \dfrac{3.5^2}{15}}} = \frac{1.1}{\sqrt{1.5175}} = \frac{1.1}{1.23} = 0.89$$
• The tabulated value is $z_{0.025} = 1.96$. Since $|z_{cal}| < 1.96$, we do not reject H0.
114
Independent Samples,variancesunknown
 Generally, in most ofthe real lifesituations, the true valuesofthe population
variances 𝜎1
2 and 𝜎2
2are notknown.
 Theyhave to be estimated from samplevariance 𝑆1
2
and 𝑆2
2
,respectively.
 Alsoneed to estimate the standard deviation ofthe samplingdistributions ofthe
differencein means (
𝑋
ത
1
-
𝑋
ത
2
)
 Twoapproach's
1.The varianceofthe two populationsare assumedto be equal
2.The varianceofthe two groupsare assumedto be not equal

 Assumed that the unknownvariances are equal; 𝝈𝟏
𝟐=𝝈𝟐
𝟐=𝝈𝟐
 Thepooled estimate of𝜎2isthe weightedaverageofthe two sample
variances,𝑆1
2
𝑎𝑛𝑑𝑆2
2
 Thepooled estimate ofisdenoted by𝑆𝑝
2
 Standarddeviationofthe samplingdistribution is;
𝑠
𝑥
1
ҧ−𝑥2ҧ = 𝑝
𝑛1 𝑛2
𝑆 2
(
1
+
1
)
115

• The t-statistic will be used:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
• The df = $n_1 + n_2 - 2$

Assume that the unknown variances are not equal: $\sigma_1^2 \neq \sigma_2^2$
• $\sigma_1^2$ and $\sigma_2^2$ will be estimated by $S_1^2$ and $S_2^2$, respectively
• The standard deviation of the sampling distribution is:
$$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$
• How is the df computed when the unknown variances are assumed not to be equal? (reading assignment)
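One commonly used answer to the reading-assignment question, which the slides do not give, is the Welch-Satterthwaite approximation. The sketch below assumes that approximation; the function name `welch_df` is hypothetical and the inputs in the call are illustrative values only.

```python
def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite approximate degrees of freedom (unequal variances).

    Assumption: this formula is not taken from the slides; it is the usual
    textbook approximation for the unequal-variance t-test.
    """
    a = var1 / n1
    b = var2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))


# Illustrative call: sample variances 28.67 and 31.11 with 10 subjects per group
df_approx = welch_df(28.67, 10, 31.11, 10)
```

When the two sample variances and sample sizes are similar, the approximate df is close to $n_1 + n_2 - 2$; it becomes smaller as the variances diverge.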

Example
• We have 20 subjects, all males between the ages of 25 and 35, who volunteer for our experiment. One half of the group is given coffee containing caffeine; the other half is given decaffeinated coffee as the placebo control. We measure the pulse rate after the subjects drink their coffee.
[Results table not reproduced; the solution uses group means of 75 and 68 and sample variances of 28.67 and 31.11, with 10 subjects per group.]
A) Test the hypothesis that caffeine has no effect on the pulse rates of young men, assuming both groups have equal variance (α = 0.05).
B) Find the 95% C.I. for the population mean difference.

SOLUTION
• Hypotheses: Ho: μt = μc
HA: μt ≠ μc
• where μt = population mean of the treatment group and μc = population mean of the control (placebo) group.
Compute the pooled (combined) variance of both groups:
S² = { (10 - 1) × 28.67 + (10 - 1) × 31.11 } / 18
= (258.03 + 279.99) / 18 = 538.02 / 18 = 29.89
Therefore, t calc = (75 - 68) / √[29.89 (1/10 + 1/10)] = 7 / √5.978
= 2.86 (this corresponds to a P-value of less than 0.02)
t tab (α = 0.05, df = 18) = 2.10; t calc > t tab ⇒ reject Ho
• Hence, caffeinated coffee has an effect on the pulse rates of young men.
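The pooled-variance calculation can be checked with a short Python sketch, which also covers part B (the 95% C.I.), a part the solution does not work out. The helper name `pooled_t` is hypothetical, and the critical value 2.10 is the tabulated value quoted in the solution rather than something computed here.

```python
from math import sqrt


def pooled_t(mean1, mean2, var1, var2, n1, n2):
    """Pooled variance, standard error and t statistic for H0: mu1 = mu2."""
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    se = sqrt(sp2 * (1 / n1 + 1 / n2))
    return sp2, se, (mean1 - mean2) / se


# Caffeine example: means 75 and 68, sample variances 28.67 and 31.11, n = 10 each
sp2, se, t_calc = pooled_t(75, 68, 28.67, 31.11, 10, 10)

# Part B: 95% CI for the mean difference, using the tabulated t value (df = 18)
t_tab = 2.10
diff = 75 - 68
ci_low, ci_high = diff - t_tab * se, diff + t_tab * se
```

Under these assumptions the interval runs from roughly 1.9 to 12.1 beats per minute; it excludes zero, which agrees with rejecting Ho at α = 0.05.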

Hypothesis testing for two population means
Dependent/paired/matched/repeated sampling
• Arises from two different processes applied to the same study units (e.g. "before" and "after" treatments)
• Using the same (or matched) individuals eliminates differences between the individuals themselves (confounding factors).
• Inference about the difference between two population means is similar to inference about one population mean, except that we work with the differences.

• If the population of differences is normally distributed with mean $\mu_d$:
• The test statistic is
$$\text{test statistic} = \frac{\bar{d} - d_0}{S_d / \sqrt{n}}$$
• The test statistic is a Z-test if the sample size is large (n1 and n2 > 30) or the variance is known.
• The test statistic is a t-test if the sample size is not large enough and the variance is unknown.
• A (1-α)100% confidence interval for $\mu_d = \mu_1 - \mu_2$ is:
$$\bar{d} \pm z_{\alpha/2}\ \left(\text{or } t_{\alpha/2,\,df}\right) \frac{S_d}{\sqrt{n}}$$

Example
• Serum cholesterol levels for 12 subjects before and after a diet-exercise program

Subject   Before (x1)   After (x2)   Difference (after - before)
1         201           200          -1
2         231           236          +5
3         221           216          -5
4         260           233          -27
5         228           224          -4
6         237           216          -21
7         326           296          -30
8         235           195          -40
9         240           207          -33
10        267           247          -20
11        284           210          -74
12        201           209          +8

Solution
$$\bar{d} = \frac{\sum d_i}{n} = \frac{(-1) + 5 + \dots + 8}{12} = \frac{-242}{12} = -20.17$$
$$s_d^2 = \frac{\sum (d_i - \bar{d})^2}{n - 1} = \frac{n\sum d_i^2 - \left(\sum d_i\right)^2}{n(n - 1)} = \frac{12(10766) - (-242)^2}{12(11)} = 535.06$$
1. State the hypotheses
Ho: The mean difference between before and after the diet-exercise program is ≥ zero
HA: The mean difference between before and after the diet-exercise program is < zero

Solution
2. Select the appropriate test statistic: paired t-test with df = n - 1 = 11
3. Select the level of significance: α = 0.05
4. Determine the critical value of the t-test: -1.7959
5. Perform the calculation for the test statistic
$$t = \frac{-20.17 - 0}{\sqrt{535.06 / 12}} = \frac{-20.17}{6.68} = -3.02$$
6. Draw and state the conclusion
• Reject Ho since -3.02 < -1.7959
• Conclude that the diet-exercise program is effective.
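The paired calculation can be reproduced with a minimal Python sketch using only the standard library. The variable names are illustrative, and the critical value is the tabulated one quoted in step 4, not computed here.

```python
from math import sqrt
from statistics import mean, variance

# Serum cholesterol data from the example (12 subjects)
before = [201, 231, 221, 260, 228, 237, 326, 235, 240, 267, 284, 201]
after = [200, 236, 216, 233, 224, 216, 296, 195, 207, 247, 210, 209]

d = [a - b for a, b in zip(after, before)]  # differences, after - before
n = len(d)
d_bar = mean(d)        # mean difference
s2_d = variance(d)     # sample variance of the differences (n - 1 denominator)
t_calc = (d_bar - 0) / sqrt(s2_d / n)

t_crit = -1.7959       # tabulated one-sided value for alpha = 0.05, df = 11
reject_h0 = t_calc < t_crit
```

Because HA is one-sided (mean difference below zero), Ho is rejected only when `t_calc` falls below the negative critical value.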

Hypothesis testing for two population proportions
• Suppose that n1 and n2 are large enough so that:
– $n_1 p_1 \ge 5$, $n_1(1 - p_1) \ge 5$, $n_2 p_2 \ge 5$, and $n_2(1 - p_2) \ge 5$
• The standard deviation of $P_1 - P_2$ is
$$\sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}$$
• The test statistic could be
$$Z_{cal} = \frac{(P_1 - P_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{P_1(1 - P_1)}{n_1} + \dfrac{P_2(1 - P_2)}{n_2}}}$$
• What if the sample size is small?
• We use the t-statistic with df of $n_1 + n_2 - 2$

Example
• A researcher is studying the malaria situation in Ethiopia. From the records of seasonal blood survey (SBS) results, he found that the proportion of people having malaria in Ethiopia was 3.8% in 2019 (Eth. Cal.). The sample size was 15,000. During the year that followed (2020), blood samples were taken from 10,000 randomly selected persons, and the 2020 seasonal blood survey showed that 200 persons were positive for malaria.
• Can the researcher conclude that the malaria situation of 2020 did not differ significantly from that of 2019 (take the level of significance α = 0.01)?

Solution
• Ho: P2019 = P2020 (or P2019 - P2020 = 0); HA: P2019 ≠ P2020 (or P2019 - P2020 ≠ 0)
• P2019 = 0.038, n2019 = 15,000
• P2020 = 0.02, n2020 = 10,000
• Z tab (α = 0.01) = 2.58 (two-tailed)
$$z_{cal} = \frac{(0.038 - 0.02) - 0}{\sqrt{\dfrac{0.038(1 - 0.038)}{15{,}000} + \dfrac{0.02(1 - 0.02)}{10{,}000}}} \approx 8.58$$
• Z calc ≈ 8.58, which corresponds to a P-value of less than 0.003.
• Decision: reject Ho (because Z cal > Z tab); in other words, the p-value is less than the level of significance (i.e., α = 0.01)
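A short Python sketch of the two-proportion z statistic, evaluated on the malaria survey figures. The function name `two_proportion_z` is hypothetical; it uses the unpooled standard error, matching the formula on the slides, and takes the critical value from `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(p1, n1, p2, n2, delta0=0.0):
    """Unpooled z statistic for H0: pi1 - pi2 = delta0."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return ((p1 - p2) - delta0) / se


# 2019: 3.8% of 15,000; 2020: 200 positives out of 10,000
z_calc = two_proportion_z(0.038, 15_000, 200 / 10_000, 10_000)
z_tab = NormalDist().inv_cdf(1 - 0.01 / 2)  # two-tailed critical value at alpha = 0.01
reject_h0 = abs(z_calc) > z_tab
```

A pooled standard error, which some texts prefer under Ho: π1 = π2, would give a slightly different statistic but the same decision here.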

Example
• A study was conducted to look at the effects of oral contraceptives (OC) on heart disease in women 40–44 years of age. It was found that among n1 = 500 current OC users, 13 developed a myocardial infarction (MI) over a three-year period, while among n2 = 1000 non-OC users, seven developed an MI over a three-year period. Can you conclude that the rate of MI is significantly greater among OC users? (Report the P-value for your test.)

Measures of Association
• While a test of hypothesis can be used to determine whether an association exists between two random variables, it cannot provide a measure of the strength of the association
• Several methods are available for estimating the magnitude of the effect given the categorical data in a 2 × 2 contingency table
1. Chi-Square Test
2. Relative Risk (RR)
3. Odds Ratio (OR)

Chi-Square Test
• The chi-square (χ²) distribution is a probability distribution used to make statistical inferences about categorical data (proportions) in which the number of categories is two or more.
• Widely used in the analysis of contingency tables.
• The chi-square test allows us to test for association between two categorical variables.
• Ho: No association between the variables; HA: There is an association
• Consequently, a significant p-value implies association.

χ² Distribution
• Indexed by the degrees of freedom (n)
• Unlike the z and t distributions, which are always symmetric about 0, the χ² distribution only takes on positive values and is always skewed to the right.
• The skewness diminishes as n increases
[Figure: χ² distribution with 10 df; the acceptance region has area 0.95 and the rejection region, to the right of the critical value 18.307, has area 0.05]

χ² Distribution
• As with t distributions, there is a different χ² distribution for each possible value of df.
• χ² distributions with a small number of df are highly skewed; however, this skewness is attenuated as the number of df increases.
• The χ² distribution is concentrated over nonnegative values.
• It has mean equal to its degrees of freedom (df), and its standard deviation equals √(2df).
• As df increases, the distribution concentrates around larger values and is more spread out.
• The distribution is skewed to the right, but it becomes more bell-shaped (normal) as df increases

χ² Distribution
• As df increases it becomes more bell-shaped (normal)

Chi-Square test
• It is a statistic which measures the discrepancy between k observed frequencies O1, O2, …, Ok and the corresponding expected frequencies E1, E2, …, Ek.
• If the value of χ² is zero, there is no discrepancy between the observed and the expected frequencies.
• The greater the discrepancy, the larger the value of χ² will be.
• The calculated value of χ² is compared with the tabulated value for the given df.
• The chi-square test is based on the table of χ² for the given df. For an R × C table the df is given by (rows - 1)(columns - 1), or (R - 1)(C - 1)

Chi-Square Test
• Counts in the chi-square test of a 2x2 table are represented as "a", "b", "c" and "d".
• The general formula:
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
• For a 2x2 table we can also use:
$$\chi^2 = \frac{n(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}$$
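The agreement between the general sum and the 2x2 shortcut can be checked numerically. In the sketch below the function names are hypothetical and the four counts are made-up values chosen only to illustrate the identity; they do not come from the slides.

```python
def chi2_general(a, b, c, d):
    """Sum of (O - E)^2 / E over the four cells of a 2x2 table."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # row total x column total / grand total
            total += (observed[i][j] - expected) ** 2 / expected
    return total


def chi2_shortcut(a, b, c, d):
    """Shortcut formula n(ad - bc)^2 / [(a+c)(b+d)(a+b)(c+d)]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


# Hypothetical counts, used only to show that the two formulas agree
g = chi2_general(20, 30, 10, 40)
s = chi2_shortcut(20, 30, 10, 40)
```

The shortcut avoids computing the four expected counts, which is convenient for hand calculation.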


Chi-Square Test
Expected Value
• It is the product of the row total and the column total, divided by the grand total:
$$E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$$
• The expected numbers must be computed for each cell.

Chi-Square Test
• Assumptions
– Data must be categorical
– The data should be frequency data
– The numbers in each cell are 'not too small'. No expected frequency should be zero
– No more than 20% of the expected frequencies should be less than 5.
• If this does not hold:
– combine (re-categorize) row or column categories to make the expected frequencies larger, or
– use Yates' continuity correction

Example
• A study was conducted to investigate the possible cause of a gastroenteritis outbreak following a lunch served in a high school cafeteria. Among the 225 students who ate the sandwiches, 109 became ill, while among the 38 students who did not eat the sandwiches, 4 became ill.
• Present the data in a 2x2 contingency table

Example
• With this method, data are arranged in the form of a contingency table
• This is a 2 × 2 table for two dichotomous random variables

                     Ill     Not ill   Total
Ate sandwich         109     116       225
Did not eat          4       34        38
Total                113     150       263

Solution
• We again wish to know whether the proportions of students who became ill in each of the groups are identical
• To carry out the test, we first calculate the expected counts for the table assuming that Ho is true:
H0: p1 = p2
HA: p1 ≠ p2

Example
• The chi-square test compares the observed frequencies in each category with the expected frequencies given that H0 is true
• Are the deviations between observed and expected too large to be attributed to chance?
• To determine this, deviations from all 4 cells must be combined
• Calculate the sum:
$$\chi^2 = \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i}$$

Example
• Ho is rejected at level α if χ² is too large, in particular if $\chi^2 > \chi^2_{1,\alpha}$
• If α = 0.05, we would reject H0 for χ² greater than $\chi^2_{1,\alpha} = 3.84$
• Therefore, we reject Ho
• The p-value is given by the area under the χ² distribution to the right of the calculated χ²
• P-value < 0.001
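The sandwich example can be worked through in a few lines of Python. This is a sketch: the counts are those given in the example (109 of 225 and 4 of 38 became ill), the variable names are illustrative, and the critical value 3.84 is the tabulated one quoted above.

```python
observed = [[109, 116],   # ate sandwich: ill, not ill
            [4, 34]]      # did not eat:  ill, not ill

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = row total x column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Combine the deviations from all four cells
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

critical_value = 3.84     # tabulated chi-square value for df = 1, alpha = 0.05
reject_h0 = chi2 > critical_value
```

With these counts the statistic is about 19.07, far above 3.84, which is consistent with the reported P-value < 0.001.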

Example
• A study was conducted to look at the effects of oral contraceptives (OC) on heart disease in women 40 to 44 years of age. It was found that among 5000 current OC users at baseline, 13 women developed a myocardial infarction (MI) over a 3-year period, whereas among 10,000 non-OC users, 7 developed an MI over a 3-year period. Compare the results of the chi-square test and the z-test.
– P1 = 0.0026, P2 = 0.0007
– Z-test = 2.77, P-value = 0.006
– There is a highly significant association between MI and OC use

Solution
• Display the above data in the form of a 2x2 contingency table

OC-use group     MI status over 3 years     Total
                 Yes        No
OC users         13         4987            5000
Non-OC users     7          9993            10,000
Total            20         14,980          15,000

Is the proportion of MI the same in OC users and non-OC users?
What can be said about the relationship between MI status and OC use?

Solutions
• Compute the expected frequency for the OC-MI data
• χ² ≈ 8, 0.001 < p-value < 0.005
• The relationship between χ² and the Z-test is χ² = Z²
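The relationship χ² = Z² can be checked on the OC-MI table. The sketch below rests on an assumption the slides do not state: that the reported Z = 2.77 and χ² ≈ 8 both include a continuity correction (Yates' correction for χ², and the matching correction for the pooled two-proportion z). Variable names are illustrative.

```python
from math import sqrt

a, b = 13, 4987      # OC users: MI yes, MI no
c, d = 7, 9993       # non-OC users: MI yes, MI no
n1, n2 = a + b, c + d
n = n1 + n2

# Assumed: continuity-corrected (Yates) chi-square for a 2x2 table
chi2_corrected = n * (abs(a * d - b * c) - n / 2) ** 2 / ((a + c) * (b + d) * n1 * n2)

# Assumed: continuity-corrected two-proportion z with a pooled proportion
p1, p2 = a / n1, c / n2
p_pooled = (a + c) / n
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z_corrected = (abs(p1 - p2) - (1 / (2 * n1) + 1 / (2 * n2))) / se
```

Under these assumptions `z_corrected` is about 2.77 and `chi2_corrected` equals its square, about 7.67, so the two tests lead to the same conclusion for a 2x2 table.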

Summary
1. Every χ² distribution extends indefinitely to the right from 0.
2. Every χ² distribution has only one (right) tail.
3. As df increases, the χ² curves get more bell-shaped and approach the normal curve in appearance (but remember that a chi-square curve starts at 0, not at -∞)
4. If the value of χ² is zero, then there is perfect agreement between the observed and the expected frequencies. The greater the discrepancy between the observed and expected frequencies, the larger the value of χ² will be.
