This document discusses hypothesis testing, which involves drawing inferences about a population based on a sample from that population. It outlines the key elements of a hypothesis test, including the null and alternative hypotheses, test statistics, critical regions, significance levels, critical values, and p-values. Type I and Type II errors are explained, where a Type I error involves rejecting the null hypothesis when it is true, and a Type II error involves failing to reject the null when it is false. The power of a hypothesis test is defined as the probability of correctly rejecting the null hypothesis when it is false. Controlling type I and II errors involves considering the significance level, sample size, and population parameters in the null and alternative hypotheses.
7. Statistical Inference
∗ Inferences about a population are made on the basis of results obtained from a sample drawn from that population
∗ We want to make statements about the larger population from which the subjects are drawn, not just about the particular subjects!
8. What Do We Test?
∗ Effect or difference we are interested in
∗ Difference in means or proportions
∗ Odds ratio (OR)
∗ Relative risk (RR)
∗ Correlation coefficient
∗ Clinically important difference
∗ Smallest difference considered biologically or clinically relevant
12. Hypothesis Testing
Goal: Make statement(s) regarding unknown population parameter values based on sample data
∗ Elements of a hypothesis test:
∗ Null hypothesis - a statement regarding the value(s) of unknown parameter(s). In our applications it typically implies no association between the explanatory and response variables (and will always contain an equality)
∗ Alternative hypothesis - a statement contradictory to the null hypothesis (will always contain an inequality)
13. Null Hypothesis
∗ Usually that there is no effect
∗ Mean = 0
∗ OR = 1
∗ RR = 1
∗ Correlation Coefficient = 0
15. Null Hypothesis expresses no difference
Example:
H0: µ = 0
(H0 is often read "H naught"; the 0 could be any hypothesized value)
Later:
H0: µ1 = µ2
16. Alternative Hypothesis
H0: µ = 0; Null hypothesis
HA: µ ≠ 0; Alternative hypothesis
The researcher's predictions should be made a priori, i.e. before looking at the data
17. Estimation: From the Sample
∗ Point estimation
∗ Mean
∗ Median
∗ Change in mean/median
∗ Interval estimation
∗ 95% Confidence interval
∗ Variation
18. Parameters and Reference Distributions
Continuous outcome data
∗ Normal distribution: N(μ, σ²)
∗ t distribution: tω (ω = degrees of freedom)
∗ Mean = X̄ (sample mean)
∗ Variance = s² (sample variance)
Binary outcome data
∗ Binomial distribution: B(n, p)
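As a small illustration of these reference distributions (a sketch with assumed numbers, not taken from the slides), the t distribution's heavier tails give a larger critical value than the standard normal for the same α:

    # Sketch: compare critical values of N(0, 1) and a t distribution (df assumed = 9)
    from scipy import stats

    alpha = 0.05
    z_crit = stats.norm.ppf(1 - alpha / 2)     # ~1.96 for the standard normal
    t_crit = stats.t.ppf(1 - alpha / 2, df=9)  # ~2.26; larger because of heavier tails
    print(z_crit, t_crit)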
22. Hypothesis Testing
Goal: Make statement(s) regarding unknown population parameter values based on sample data
Elements of a hypothesis test:
∗ Test statistic - a quantity based on the sample data and the null hypothesis, used to decide between the null and alternative hypotheses.
∗ The test statistic is found by converting the sample statistic (proportion, mean, or standard deviation) to a score (z, t, or χ²), as in the sketch below
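For instance (a minimal sketch with made-up data, testing H0: µ = 0), converting a sample mean to a t score:

    # Sketch: one-sample t statistic, t = (x̄ - µ0) / (s / √n), for hypothetical data
    import numpy as np

    x = np.array([1.2, -0.4, 0.8, 1.5, 0.3, 0.9, -0.1, 1.1])  # assumed sample
    mu0 = 0.0                                                  # value under H0
    t_stat = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
    print(t_stat)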
23. ∗ Critical region (rejection region): the values of the test statistic for which we reject the null hypothesis in favor of the alternative hypothesis
Critical Region, Significance Level, Critical Value and P-value
24. ∗ Significance level (α): the probability that the test statistic will fall in the critical region when the null hypothesis is actually true.
Critical Region, Significance Level, Critical Value and P-value
25. ∗ Critical value: any value that separates the critical region from the values of the test statistic that do not lead to rejection of the null hypothesis.
Critical Region, Significance Level, Critical Value and P-value
26. ∗ Two-tailed: the critical region is in the two extreme regions (tails) under the curve
Two-Tailed, Left-Tailed, Right-Tailed
27. ∗ Left-tailed: the critical region is in the extreme left region (tail) under the curve
Two-Tailed, Left-Tailed, Right-Tailed
28. ∗ Right-tailed: the critical region is in the extreme right region (tail) under the curve
Two-Tailed, Left-Tailed, Right-Tailed
29. ∗ P-value (probability value): the probability of getting a value of the test statistic that is at least as extreme as the one representing the sample data, assuming the null hypothesis is true.
∗ The null hypothesis is rejected if the p-value is very small, such as 0.05 or less.
Critical Region, Significance Level, Critical Value and P-value
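As a minimal sketch (assumed test statistic and degrees of freedom, not values from the slides), the two-tailed p-value for a t statistic and its comparison with α = 0.05:

    # Sketch: two-tailed p-value for an assumed t statistic
    from scipy import stats

    t_stat, df = 2.4, 7                        # hypothetical values
    p_value = 2 * stats.t.sf(abs(t_stat), df)  # P(|T| >= |t_stat|) under H0
    print(p_value, p_value < 0.05)             # True would mean: reject H0 at α = 0.05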
30. Statistically correct wording:
∗ Reject the null hypothesis
∗ Fail to reject the null hypothesis
OK but misleading wording:
∗ Prove the null hypothesis to be true
∗ Accept the null hypothesis
∗ Support the null hypothesis
31. ∗ Traditional method: reject the null hypothesis if the test statistic falls within the critical region; fail to reject the null hypothesis if the test statistic does not fall within the critical region
∗ P-value method: reject H0 if the p-value < α (where α is the significance level, such as 0.05)
Decision Criterion
32. ∗ Another option: instead of using a significance level such as α = 0.05, simply report the P-value and leave the decision to the reader
∗ Confidence intervals: because a confidence interval estimate of the population parameter contains the likely values of that parameter, reject a claim that the population parameter has a value that is not included in the confidence interval (see the sketch below)
Decision Criterion
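A minimal sketch of the confidence-interval criterion (hypothetical data; testing the claim µ = 0 by checking whether 0 falls inside the 95% confidence interval):

    # Sketch: reject the claim µ = 0 if 0 lies outside the 95% CI for the mean
    import numpy as np
    from scipy import stats

    x = np.array([1.2, -0.4, 0.8, 1.5, 0.3, 0.9, -0.1, 1.1])   # assumed sample
    mean, se = x.mean(), x.std(ddof=1) / np.sqrt(len(x))
    lo, hi = stats.t.interval(0.95, df=len(x) - 1, loc=mean, scale=se)
    print((lo, hi), not (lo <= 0 <= hi))   # True would mean 0 is excluded: reject the claim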
33. Statistical Error
Sometimes H0 will be rejected (based on a large test statistic and small p-value) even though H0 is really true,
i.e., if you had been able to measure the entire population, not a sample, you would have found no difference between µ and the hypothesized value, but based on X̄ you see a difference.
The mistake of rejecting a true H0 will happen with frequency α.
So, if H0 is true, it will be rejected about 5% of the time, since α is frequently set to 0.05.
34. Type I Error
H0: mean = 0
Population = "truth": population mean = 0 (H0 is true)
Sample = what you see: sample mean = 20
You conclude, based on the sample mean, that the population mean ≠ 0, but it really is 0 (H0 is true); therefore you have falsely rejected H0.
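A minimal simulation sketch (assumed settings: normal data, one-sample t test, α = 0.05) showing that a true H0 is falsely rejected in roughly α of repeated samples:

    # Sketch: estimate the Type I error rate by simulating samples where H0 is true
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha, n, reps = 0.05, 30, 10_000
    rejections = 0
    for _ in range(reps):
        sample = rng.normal(loc=0.0, scale=1.0, size=n)   # H0 is true: µ = 0
        _, p = stats.ttest_1samp(sample, popmean=0.0)
        rejections += (p < alpha)
    print(rejections / reps)                              # close to 0.05 = α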
35. Statistical Error
Sometimes H0 will be accepted (based on a small test statistic and large p-value) even though H0 is really false,
i.e., if you had been able to measure the entire population, not a sample, you would have found a difference between µ and the hypothesized value, but based on X̄ you do not see a difference.
The mistake of accepting a false H0 will happen with frequency β.
36. Type II Error
H0: mean = 0
Population = "truth": population mean = 20 (H0 is really false)
Sample = what you see: sample mean = 0
You conclude, based on the sample mean, that the population mean = 0, but it really does not equal 0 (H0 is false); therefore you have falsely failed to reject H0.
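A minimal simulation sketch (assumed settings: true mean 0.3 rather than 0, one-sample t test, α = 0.05) estimating β, the relative frequency with which a false H0 is not rejected:

    # Sketch: estimate β by simulating samples where H0 is false
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    alpha, n, reps, true_mu = 0.05, 30, 10_000, 0.3
    misses = 0
    for _ in range(reps):
        sample = rng.normal(loc=true_mu, scale=1.0, size=n)   # H0 (µ = 0) is false
        _, p = stats.ttest_1samp(sample, popmean=0.0)
        misses += (p >= alpha)                                # failed to reject a false H0
    print(misses / reps)                                      # estimate of β; power = 1 - β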
37. 1. The treatments do not differ, and we correctly conclude that they do not differ.
2. The treatments do not differ, but we conclude that they do differ.
3. The treatments differ, but we conclude that they do not differ.
4. The treatments do differ, and we correctly conclude that they do differ.
Four Possibilities in Testing Whether the Treatments Differ
38. (2 × 2 table of the four possibilities: reality in the columns, our decision in the rows)
39. Type I error
∗ Concluded that there is a difference while in reality there is no difference
∗ Probability: α
Type II error
∗ Concluded that there is no difference while in reality there is a difference
∗ Probability: β
43. ∗ The power of a hypothesis test is the probability (1 - β) of rejecting a false null hypothesis; as sketched below, it is computed using:
∗ A particular significance level α
∗ The sample size n
∗ A particular assumed value of the population parameter in the null hypothesis
∗ A particular assumed value of the population parameter that is an alternative to the value in the null hypothesis
Power of the Test
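A minimal sketch (assumed two-sided z-test with known σ; all numbers are hypothetical) computing power from exactly these four inputs:

    # Sketch: power of a two-sided z-test from α, n, the null value µ0, and the alternative value µ1
    from math import sqrt
    from scipy import stats

    alpha, n, mu0, mu1, sigma = 0.05, 30, 0.0, 0.3, 1.0     # assumed inputs
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = abs(mu1 - mu0) / (sigma / sqrt(n))              # standardized difference
    power = stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)
    print(power)   # probability of rejecting H0 when µ really equals µ1 (i.e., 1 - β)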
44.
45. Term Definitions
α = Probability of making a Type I error
= Probability of concluding the treatments differ when in reality they do not differ
β = Probability of making a Type II error
= Probability of concluding that the treatments do not differ when in reality they do differ
Power = 1 - Probability of making a Type II error
= 1 - β
= Probability of correctly concluding that the treatments differ
= Probability of detecting a difference between the treatments if the treatments do in fact differ
Editor's Notes
Randomized trials can be used for many purposes. They can be used for evaluating new drugs and other treatments of disease, including tests of new health and medical care technology. Such trials can be used to assess new programs for screening and early detection, or new ways of organizing and delivering health services.
A planned trial was described by the Scottish surgeon James Lind in 1747. Lind became interested in scurvy, which killed thousands of British seamen each year. He was intrigued by the story of a sailor who had developed scurvy and had been put ashore on an isolated island where he subsisted on a diet of grasses and then recovered from the scurvy. Lind conducted an experiment which he described as follows: I took 12 patients in the scurvy on board the Salisbury at sea. The cases were as similar as I could have them … they lay together in one place and had one diet common to them all. Two of these were ordered a quart of cider per day.… Two others took 25 gutts of elixir vitriol.… Two others took two spoonfuls of vinegar.… Two were put under a course of sea water.… Two others had two oranges and one lemon given them each day.… Two others took the bigness of nutmeg. The most sudden and visible good effects were perceived from the use of oranges and lemons, one of those who had taken them being at the end of 6 days fit for duty.… The other … was appointed nurse to the rest of the sick. Interestingly, the idea of a dietary cause of scurvy proved unacceptable in Lind's day. Only 47 years later did the British Admiralty permit him to repeat his experiment-this time on an entire fleet of ships. The results were so dramatic that, in 1795, the Admiralty made lemon juice a required part of the standard diet of British seamen and later changed this to lime juice. Scurvy essentially disappeared from British sailors, who, even today, are referred to as "limeys."
When we carry out a study we are only looking at the sample of subjects in our study, such as a sample of patients with a certain illness who are being treated with treatment A or with treatment B. From the study results, we want to draw a conclusion that goes beyond the study population-is treatment A more effective than treatment B in the total universe of all patients with this disease who might be treated with treatment A or treatment B?
Under the assumption that the normal distribution approximates the binomial distribution, the probability of at least 52 successes in 100 attempts is about 0.3821 (a high probability, so the result could easily be due to chance).
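A minimal sketch reproducing that figure under the stated normal approximation (with a continuity correction; the 0.3821 value is from the note above):

    # Sketch: P(X ≥ 52) for X ~ Binomial(100, 0.5) via the normal approximation
    from math import sqrt
    from scipy import stats

    n, p, k = 100, 0.5, 52
    mu, sd = n * p, sqrt(n * p * (1 - p))           # mean 50, standard deviation 5
    approx = stats.norm.sf((k - 0.5 - mu) / sd)     # continuity-corrected z = 0.3
    print(approx)                                   # about 0.3821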
In probability and statistics, Student’s t-distribution (or simply the t-distribution) is a continuous probability distribution that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown. It plays a role in a number of widely-used statistical analyses, including the Student’s t-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and in linear regression analysis. The Student’s t-distribution also arises in the Bayesian analysis of data from a normal family. The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is more prone to producing values that fall far from its mean. This makes it useful for understanding the statistical behavior of certain types of ratios of random quantities, in which variation in the denominator is amplified and may produce outlying values when the denominator of the ratio falls close to zero. The Student’s t-distribution is a special case of the generalized hyperbolic distribution.
Binomial Experiment A binomial experiment (also known as a Bernoulli trial) is a statistical experiment that has the following properties: ■ The experiment consists of n repeated trials. ■ Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure. ■ The probability of success, denoted by P, is the same on every trial. ■ The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials. Consider the following statistical experiment. You flip a coin 2 times and count the number of times the coin lands on heads. This is a binomial experiment because: ■ The experiment consists of repeated trials. We flip a coin 2 times. ■ Each trial can result in just two possible outcomes - heads or tails. ■ The probability of success is constant - 0.5 on every trial. ■ The trials are independent; that is, getting heads on one trial does not affect whether we get heads on other trials. Notation The following notation is helpful, when we talk about binomial probability. ■ x: The number of successes that result from the binomial experiment. ■ n: The number of trials in the binomial experiment. ■ P: The probability of success on an individual trial. ■ Q: The probability of failure on an individual trial. (This is equal to 1 - P.) ■ b(x; n, P): Binomial probability - the probability that an n-trial binomial experiment results in exactly x successes, when the probability of success on an individual trial is P. ■ nCr: The number of combinations of n things, taken r at a time. Binomial Distribution A binomial random variable is the number of successes x in n repeated trials of a binomial experiment. The probability distribution of a binomial random variable is called a binomial distribution (also known as a Bernoulli distribution). Suppose we flip a coin two times and count the number of heads (successes). The binomial random variable is the number of heads, which can take on values of 0, 1, or 2. The binomial distribution is presented below.
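A minimal sketch of the coin-flip example in the note (the distribution of the number of heads in n = 2 flips with P = 0.5, i.e., the table the note refers to):

    # Sketch: binomial probabilities b(x; n = 2, P = 0.5) for x = 0, 1, 2 heads
    from scipy import stats

    n, P = 2, 0.5
    for x in range(n + 1):
        print(x, stats.binom.pmf(x, n, P))   # 0.25, 0.50, 0.25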
A large z value would be one above 2 or below -2.
z = ±1.96 corresponds to an alpha level of .05.
Alpha (0.05) is divided between the two tails, .025 in each.
For a left-tailed test, alpha (0.05) is all in one tail.
For a right-tailed test, alpha (0.05) is all in one tail.
z = ±1.96 corresponds to an alpha level of .05.
Given this background, let us now consider a trial in which groups receiving one of two therapies, therapy A and therapy B, are being compared. (Keep in mind the sampling of beads just discussed.) Before beginning our study, we can list the four possible study outcomes: It is possible that in reality there is no difference in efficacy between therapy A and therapy B (i.e., therapy A is no better and no worse than therapy B), and when we do our study we correctly conclude on the basis of our samples that the two groups do not differ. It is possible that in reality there is no difference in efficacy between therapy A and therapy B (i.e., therapy A is no better and no worse than therapy B), but in our study we found a difference between the groups and therefore concluded, on the basis of our samples, that there is a difference between the therapies. This conclusion, based on our samples, is in error. It is possible that in reality there is a difference between therapy A and therapy B, but when we examine the groups in our study we find no difference between them. We therefore conclude, on the basis of our samples, that there is no difference between therapy A and therapy B. This conclusion is in error. It is possible that in reality there is a difference between therapy A and therapy B, and when we examine the groups in our study we find that they differ. On the basis of these samples, we correctly conclude that therapy A differs from therapy B.
These four possibilities constitute the universe of outcomes after we complete our study. Let us look at these four possibilities as presented in a 2 × 2 table: Two columns represent reality-either therapy A differs from therapy B or therapy A does not differ from therapy B. The two rows represent our decision: We conclude either that they differ or that they do not differ. In this figure, the four possibilities that were just listed are represented as four cells in the 2 × 2 table. If there is no difference, and on the basis of the samples included in our study we conclude there is no difference, this is a correct decision (cell a). If there is a difference, and on the basis of our study we conclude that there is a difference (cell d), this too is a correct decision. In the best of all worlds, all of the possibilities would fall into one of these two cells. Unfortunately, this is rarely, if ever, the case. There are times when there is no difference between the therapies, but on the basis of the samples of subjects included in our study, we erroneously conclude that they differ (cell c). This is called a type I error. It is also possible that there really is a difference between the therapies, but on the basis of the samples included in our study we erroneously conclude that there is no difference (cell b); this is called a type II error. (In this situation, the therapies differ, but we fail to detect the difference in our study samples.) The probability that we will make a type I error is designated α, and the probability that we will make a type II error is designated β.
α is the so-called P value, which is seen in many published papers and has been sanctified by many years of use. When you see " P < .05," the reference is to α. What does P < .05 mean? It tells us that we have concluded that therapy A differs from therapy B on the basis of the sample of subjects included in our study, which we found to differ. The probability that such a difference could have arisen by chance alone, and that this difference between our groups does not reflect any true difference between therapies A and B, is only .05 (or 1 in 20).
How do these concepts help us to arrive at an estimate of the sample size that we need? If we ask the question, "How many people do we have to study in a clinical trial?" we must be able to specify a number of items
Use the same method as described in Figure 7-6. Use the standard normal distribution (Table A-2).