C2 st lecture 10 basic statistics and the z test handout
1. Lecture 10 - Basic Statistics and the Z-test
C2 Foundation Mathematics (Standard Track)
Dr Linda Stringer Dr Simon Craik
l.stringer@uea.ac.uk s.craik@uea.ac.uk
INTO City/UEA London
2. Lecture 9 skills
Calculate the following measures of location (AVERAGES)
Mode
Median
Mean
Calculate the following measures of dispersion
(MEASURES OF SPREAD)
Interquartile range
Standard deviation
Absolute deviation
Perform a Z-test
Write the null and alternative hypothesis
Look up the critical value
Calculate the test statistic
Make the decision
Write a conclusion
3. A data set
A data set is usually a list of values (numbers) that has
been gathered in a survey.
We will use the following data set to demonstrate the ideas
in the first part of this lecture.
A statistician wants to find how many pets the average
person has. He interviews 10 people and gets the following
values
0 2 0 1 0 8 2 1 0 0
4. Bar charts
A bar chart showing how many pets 10 people have:
0 2 0 1 0 8 2 1 0 0
[Bar chart: one bar per person (1 to 10) on the horizontal axis; the vertical axis shows the number of pets, from 0 to 8.]
5. Pie charts
A pie chart of the data
0 2 0 1 0 8 2 1 0 0
[Pie chart: 0 pets 50%, 1 pet 20%, 2 pets 20%, 8 pets 10%.]
6. Histogram
A histogram of the data showing how many people have each
number of pets.
0 2 0 1 0 8 2 1 0 0
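[Histogram: number of pets (0, 1, 2, 8) on the horizontal axis; the vertical axis shows the number of people, from 1 to 5.]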
7. Mode
In a data set the mode is the most frequent value (the value
which occurs most often). The mode is a type of average.
Example: Find the mode of the following data set
0 2 0 1 0 8 2 1 0 0
In this data set the mode is 0.
8. Mode
There can be more than one mode in a data set
Example:
0 5 5 0 1 5 0 1 6
There are two modes, they are 0 and 5.
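The definition of the mode can be checked with a short Python sketch. The function name `modes` is an illustrative choice, not part of the lecture; it returns a list so that data sets with more than one mode are handled.

```python
from collections import Counter


def modes(data):
    """Return every value that occurs most often, in increasing order."""
    counts = Counter(data)            # value -> number of occurrences
    highest = max(counts.values())    # the largest frequency
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([0, 2, 0, 1, 0, 8, 2, 1, 0, 0]))  # [0]
print(modes([0, 5, 5, 0, 1, 5, 0, 1, 6]))     # [0, 5]
```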
9. Median
The median is the middle value in an ordered data set. It is
another type of average.
First order the data, with values increasing from left to right.
Let n be the size of the data set (the number of values).
If $\frac{n}{2}$ is an integer (whole number) then the median is the
midpoint of the $\frac{n}{2}$th value and the $(\frac{n}{2} + 1)$th value (to find the
midpoint, add the values together and divide by 2).
If $\frac{n}{2}$ is not an integer (whole number) then round it up to the
nearest integer, $\frac{n+1}{2}$. The median is the $\frac{n+1}{2}$th value.
OR find the median by crossing off pairs of values, starting
from the ends of the data set.
10. Example
Order the data:
0 0 0 0 0 1 1 2 2 8
n = 10 (the number of values)
$\frac{n}{2} = \frac{10}{2} = 5$, which is an integer.
The median is the midpoint of the 5th and 6th value = $\frac{0+1}{2} = 0.5$.
11. Example 2
Order the data:
0 0 0 1 1 5 5 5 6
n = 9 (the number of values)
$\frac{n}{2} = \frac{9}{2} = 4.5$, which is not an integer.
Round up to 5. The median is the 5th value = 1.
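The median rule used in these two examples can be written as a short Python sketch. The function name `median` is an illustrative choice; the code follows the rule stated in this lecture, using 1-based positions translated to Python's 0-based list indices.

```python
def median(data):
    """Median by the lecture's rule: order the data, then look at n/2."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 0:
        k = n // 2
        # midpoint of the k-th and (k+1)-th values (indices k-1 and k)
        return (ordered[k - 1] + ordered[k]) / 2
    k = (n + 1) // 2
    # n/2 rounded up gives the (n+1)/2-th value (index k-1)
    return ordered[k - 1]


print(median([0, 2, 0, 1, 0, 8, 2, 1, 0, 0]))  # 0.5
print(median([0, 5, 5, 0, 1, 5, 0, 1, 6]))     # 1
```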
12. Interquartile range
First order the data, with values increasing from left to right.
We want to find two values: the first quartile Q1 and the
third quartile Q3.
Let n be the size of the data set (the number of values).
To find Q1 we multiply n by $\frac{1}{4}$. If $\frac{n}{4}$ is an integer (whole
number) then Q1 is the midpoint of the $\frac{n}{4}$th value and the
$(\frac{n}{4} + 1)$th value.
If $\frac{n}{4}$ is not an integer then round it up to the nearest integer.
Q1 is the corresponding value.
To find Q3 we multiply n by $\frac{3}{4}$. If $\frac{3n}{4}$ is an integer then Q3 is
the midpoint of the $\frac{3n}{4}$th value and the $(\frac{3n}{4} + 1)$th value.
If $\frac{3n}{4}$ is not an integer then round it up to the nearest
integer. Q3 is the corresponding value.
The interquartile range is Q3 − Q1.
13. Example
Order the data
0 0 0 0 0 1 1 2 2 8
$\frac{n}{4} = \frac{10}{4} = 2.5$, which is not an integer.
Round up to 3.
Q1 is the third value, so Q1 = 0.
$\frac{3n}{4} = \frac{3 \times 10}{4} = 7.5$, which is not an integer.
Round up to 8.
Q3 is the eighth value, so Q3 = 2.
The interquartile range is Q3 − Q1 = 2 − 0 = 2.
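A minimal Python sketch of the quartile rule above, applied to the pets data. The function names `quartile` and `interquartile_range` are illustrative choices. Other conventions for quartiles exist, so library functions may give slightly different answers from this rule.

```python
import math


def quartile(ordered, numerator):
    """Q1 (numerator=1) or Q3 (numerator=3) of already ordered data."""
    n = len(ordered)
    if (numerator * n) % 4 == 0:
        k = numerator * n // 4
        # midpoint of the k-th and (k+1)-th values (indices k-1 and k)
        return (ordered[k - 1] + ordered[k]) / 2
    k = math.ceil(numerator * n / 4)   # round the position up
    return ordered[k - 1]


def interquartile_range(data):
    """Return (Q1, Q3, Q3 - Q1) using the lecture's rule."""
    ordered = sorted(data)
    q1 = quartile(ordered, 1)
    q3 = quartile(ordered, 3)
    return q1, q3, q3 - q1


print(interquartile_range([0, 2, 0, 1, 0, 8, 2, 1, 0, 0]))  # (0, 2, 2)
```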
14. Sigma notation Σ
Given a data set X, we denote the sum of all the values x
in X by $\sum x$.
Example: If
X = 0 2 0 1 0 8 2 1 0 0
then $\sum x = 0 + 2 + 0 + 1 + 0 + 8 + 2 + 1 + 0 + 0 = 14$
15. Mean
The mean is our third average.
In a data set of size n the mean, denoted $\bar{x}$, is the sum of
all the values divided by n.
$$\bar{x} = \frac{\sum x}{n}$$
Example: What is the mean number of pets?
Calculate the sum of all the values and divide by n
$$\bar{x} = \frac{\sum x}{n} = \frac{0 + 2 + 0 + 1 + 0 + 8 + 2 + 1 + 0 + 0}{10} = \frac{14}{10} = 1.4$$
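The same calculation as a short Python sketch. The function name `mean` is an illustrative choice; Python's built-in `sum` plays the role of the sigma notation.

```python
def mean(data):
    """Sum of all the values divided by the number of values."""
    return sum(data) / len(data)


pets = [0, 2, 0, 1, 0, 8, 2, 1, 0, 0]
print(sum(pets))   # 14
print(mean(pets))  # 1.4
```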
16. Standard deviation, σ
The standard deviation, σ, is a measure of dispersion.
First calculate the variance, $\sigma^2$. The standard deviation, σ,
is the square root of the variance.
There are two formulae for variance. They give the same
answer. Usually the second formula is easier to use.
$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum x^2}{n} - \bar{x}^2$$
When you have found the variance, do not forget to take
the square root!
$$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$
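17. Proof that the two formulae for standard deviation are equivalent
$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum (x^2 - 2x\bar{x} + \bar{x}^2)}{n}$$
$$= \frac{\sum x^2}{n} - 2\bar{x}\,\frac{\sum x}{n} + \frac{\sum \bar{x}^2}{n}$$
$$= \frac{\sum x^2}{n} - 2\bar{x}^2 + \bar{x}^2\,\frac{\sum 1}{n}$$
$$= \frac{\sum x^2}{n} - \bar{x}^2$$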
18. Example
What is the standard deviation of the following data ?
0 2 0 1 0 8 2 1 0 0
Use the second formula to calculate the variance.
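$$\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2$$
We previously worked out the mean $\bar{x} = 1.4$.
$$\sum x^2 = 0^2 + 2^2 + 0^2 + 1^2 + 0^2 + 8^2 + 2^2 + 1^2 + 0^2 + 0^2 = 74$$
The variance is
$$\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2 = \frac{74}{10} - 1.4^2 = 5.44$$
The standard deviation is $\sigma = \sqrt{5.44} = 2.33$ to 2 d.p.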
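As a check, here is a Python sketch that computes the variance of the pets data with both formulae and then takes the square root. The function names are illustrative choices. Results are rounded for display because floating-point arithmetic is not exact.

```python
import math


def variance_from_deviations(data):
    """First formula: mean of the squared deviations from the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


def variance_from_squares(data):
    """Second formula: mean of the squares minus the square of the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(x * x for x in data) / n - mean ** 2


pets = [0, 2, 0, 1, 0, 8, 2, 1, 0, 0]
v1 = variance_from_deviations(pets)
v2 = variance_from_squares(pets)
sd = math.sqrt(v2)
print(round(v1, 2), round(v2, 2))  # 5.44 5.44
print(round(sd, 2))                # 2.33
```

The standard library function `statistics.pstdev` (population standard deviation) should give the same value, since it also divides by n.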
19. Absolute value
The absolute value function gives the size of a number
without its sign:
$$|x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases}$$
$|5| = 5$,
$|-8| = 8$,
$|-1.213| = 1.213$,
$|1,000,000| = 1,000,000$.
20. Absolute deviation
The absolute deviation measures the average distance
from each value to the mean. It is another measure of
dispersion.
As a formula:
$$AD = \frac{\sum |x - \bar{x}|}{n}$$
21. Example
What is the absolute deviation of the data
0 2 0 1 0 8 2 1 0 0
The mean is $\bar{x} = 1.4$. We first work out $|x - \bar{x}|$ for each value:
1.4 0.6 1.4 0.4 1.4 6.6 0.6 0.4 1.4 1.4
The absolute deviation is
$$AD = \frac{\sum |x - \bar{x}|}{n} = \frac{15.6}{10} = 1.56$$
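The absolute deviation of the pets data can be checked with a short Python sketch. The function name `absolute_deviation` is an illustrative choice; Python's built-in `abs` is the absolute value function.

```python
def absolute_deviation(data):
    """Average distance from each value to the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n


pets = [0, 2, 0, 1, 0, 8, 2, 1, 0, 0]
print(round(absolute_deviation(pets), 2))  # 1.56
```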
22. Hypothesis testing
We use hypothesis testing to compare the mean of a very large
data set, a population mean, with the mean of a sample data
set, a sample mean.
Example: A lightbulb company says their lightbulbs last a mean
time of 1000 hours with a standard deviation of 50. We think
their lightbulbs last longer than this and propose a test at a 5%
level of significance. We buy 75 lightbulbs and they last a mean
time of 1022 hours.
The population mean is 1000 hours.
The sample is the 75 light bulbs that we test.
The sample mean is 1022 hours.
23. Hypothesis testing
The null hypothesis, H0 is a statement which is assumed to
be true.
Sample data is collected and tested to see if it is consistent
with the null hypothesis.
If the sample mean is significantly different from the
population mean, then we say that we have sufficient
evidence to reject the null hypothesis, H0, in favour of the
alternative hypothesis, H1.
24. The null hypothesis and the alternative hypothesis
The null hypothesis concerns the population mean.
It is of the form
H0 : µ = A
where µ is the population mean and A is the hypothesised
value.
The alternative hypothesis is that the null hypothesis is
incorrect and will be one of
H1 : µ ≠ A
H1 : µ < A
H1 : µ > A
The wording of the question tells you which of these to use.
25. Significance level
The null hypothesis will always be tested to a given level of
significance.
A 5% level of significance means we are testing to see if,
assuming the null hypothesis is true, the probability of getting
a sample mean at least as extreme as ours is less than 0.05.
If the probability is less than this, we reject the null hypothesis in
favour of the alternative hypothesis.
A 1% level of significance translates to a probability of 0.01.
26. Critical value
A critical value is the value beyond which we reject the null
hypothesis. It tells us the boundary of the critical region(s)
In a Z-test this depends on the alternative hypothesis and
the significance level.
We look up the critical value(s) in tables.
| Significance level | 5%, one-tail | 5%, two-tail | 1%, one-tail | 1%, two-tail |
| Critical value | 1.65 | 1.96 | 2.33 | 2.58 |
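The tabulated critical values can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `critical_value` is an illustrative choice. For a two-tailed test the significance level is split equally between the two tails.

```python
from statistics import NormalDist


def critical_value(significance, tails):
    """Positive critical value of a Z-test (tails is 1 or 2)."""
    # The area left of the critical value is 1 - significance / tails.
    return NormalDist().inv_cdf(1 - significance / tails)


for significance in (0.05, 0.01):
    for tails in (1, 2):
        value = critical_value(significance, tails)
        print(significance, tails, round(value, 3))
```

The computed values agree with the table to within rounding; the 5% one-tail value is about 1.645, which tables conventionally round to 1.65.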
27. H1 : µ ≠ A
If our alternative hypothesis is H1 : µ ≠ A we are doing a
two-tailed test and we have 2 critical values, one negative and
one positive.
The critical values are the boundaries of the rejection regions.
For a 5% level of significance we have the following picture:
[Figure: standard normal curve with both tails shaded, below −1.96 and above 1.96.]
The rejection (shaded) regions have a combined area of 0.05.
28. H1 : µ > A
If our alternative hypothesis is H1 : µ > A we are doing a
one-tailed test and we have 1 critical value which is positive.
The critical value is the boundary of the rejection region.
For a 5% level of significance we have the following picture:
[Figure: standard normal curve with the right-hand tail shaded, above 1.65.]
The rejection region has an area of 0.05.
29. H1 : µ < A
If our alternative hypothesis is H1 : µ < A we are doing a
one-tailed test and we have 1 critical value which is negative.
The critical value is the boundary of the rejection region.
For a 5% level of significance we have the following picture:
[Figure: standard normal curve with the left-hand tail shaded, below −1.65.]
The rejection region has an area of 0.05.
30. Test statistic
The test statistic is the difference between the sample mean, $\bar{x}$,
and the (hypothetical) population mean A, divided by the
standard error.
The standard error is $\sigma/\sqrt{n}$ for the Z-test and $s/\sqrt{n}$ for the
T-test, where n is the sample size, σ is the population
standard deviation and s is the sample standard deviation.
The Z-test statistic is
$$Z = \frac{\bar{x} - A}{\sigma/\sqrt{n}}$$
If the test statistic lies beyond the critical value(s) (in the
rejection region) we reject H0. If it does not, we accept H0.
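The formula for the Z-test statistic translates directly into Python. This is a minimal sketch; the function and parameter names are illustrative choices. It is applied here to the lightbulb example from the hypothesis testing slide (population mean 1000, standard deviation 50, sample of 75 with mean 1022).

```python
import math


def z_statistic(sample_mean, hypothesised_mean, population_sd, n):
    """Z = (sample mean - A) / (sigma / sqrt(n))."""
    standard_error = population_sd / math.sqrt(n)
    return (sample_mean - hypothesised_mean) / standard_error


z = z_statistic(sample_mean=1022, hypothesised_mean=1000, population_sd=50, n=75)
print(round(z, 2))  # 3.81
```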
31. Z-test - Example 1
Research says that the mean height for a man is 182cm with a
standard deviation of 9. We suspect men might be shorter than
this. We get the heights of 100 men and their mean height is
176. We test at a 1% level of significance.
32. Z-test - Example 1
The null hypothesis and alternative hypothesis are:
H0 : µ = 182
H1 : µ < 182
We are doing a 1-tailed test at a 1% level of significance so
the critical value is: C = −2.33.
The test statistic is $Z = \frac{176 - 182}{9/\sqrt{100}} = -6.67$.
−6.67 < −2.33 so we reject the null hypothesis.
33. Z-test - Example 2
A company says employees are supposed to work an average
of 40 hours a week with a standard deviation of 5 hours. Alfred
wants to know if he fits this to a 5% level of significance. He
notes down how many hours he works over 48 weeks and has
a mean of 39 hours.
34. Z-test - Example 2
The null hypothesis and alternative hypothesis are:
H0 : µ = 40
H1 : µ ≠ 40
We are doing a 2-tailed test at a 5% level of significance so
the critical values are: C = −1.96, 1.96.
The test statistic is $Z = \frac{39 - 40}{5/\sqrt{48}} = -1.39$.
−1.96 < −1.39 < 1.96 so we accept the null hypothesis.
35. Z-test - Example 3
A lightbulb company says their lightbulbs last a mean time of
1000 hours with a standard deviation of 50. We think their
lightbulbs last longer than this and propose a test at a 5% level
of significance. We buy 75 lightbulbs and they last a mean time
of 1022 hours.
36. Z-test - Example 3
The null hypothesis and alternative hypothesis are:
H0 : µ = 1000
H1 : µ > 1000
We are doing a 1-tailed test at a 5% level of significance so
the critical value is: C = 1.65.
The test statistic is $Z = \frac{1022 - 1000}{50/\sqrt{75}} = 3.81$.
1.65 < 3.81 so we reject the null hypothesis.
37. Z-test summary
You will be given
1. Population mean, A
2. Population standard deviation, σ
3. Significance level
4. Sample mean, $\bar{x}$
5. Sample size, n
6. Quantifying word (for example "shorter" or "longer"), which tells you which alternative hypothesis to use.
You have to work out
1. Null hypothesis, alternative hypothesis
2. Critical value(s)
3. Test statistic
4. Decision - accept/reject H0 (sketch a picture if possible)
5. Conclusion
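The steps in this summary can be put together in one Python sketch and checked against the three worked examples. The function name `z_test`, the labels "less", "greater" and "not equal", and the dictionary of critical values are illustrative choices; the critical values themselves are copied from the table in this lecture.

```python
import math

# Critical values from the lecture's table: (significance level, tails) -> value
CRITICAL_VALUES = {(0.05, 1): 1.65, (0.05, 2): 1.96,
                   (0.01, 1): 2.33, (0.01, 2): 2.58}


def z_test(sample_mean, A, sigma, n, significance, alternative):
    """Return (Z, reject_H0) for H1: mu < A, mu > A or mu != A."""
    z = (sample_mean - A) / (sigma / math.sqrt(n))
    if alternative == "not equal":       # two-tailed test
        reject = abs(z) > CRITICAL_VALUES[(significance, 2)]
    elif alternative == "greater":       # one-tailed, positive critical value
        reject = z > CRITICAL_VALUES[(significance, 1)]
    elif alternative == "less":          # one-tailed, negative critical value
        reject = z < -CRITICAL_VALUES[(significance, 1)]
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'not equal'")
    return z, reject


examples = [
    (176, 182, 9, 100, 0.01, "less"),        # Example 1: heights
    (39, 40, 5, 48, 0.05, "not equal"),      # Example 2: working hours
    (1022, 1000, 50, 75, 0.05, "greater"),   # Example 3: lightbulbs
]
for example in examples:
    z, reject = z_test(*example)
    print(round(z, 2), "reject H0" if reject else "accept H0")
```

The printed decisions should match the worked examples: reject, accept, reject.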
38. The theory behind the Z-test and the T-test
If samples of size n are taken from a population with mean A
and standard deviation σ, then for large n the sample means are
approximately normally distributed, with mean A and standard
deviation $\sigma/\sqrt{n}$.
When we calculate the test statistic, we are calculating the
Z-score of the sample mean.
The critical value is the Z-score that a sample mean has only a
5% (or 1%) probability of lying beyond when the null hypothesis is true.
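The claim about the spread of sample means can be explored with a small simulation. This is only an illustrative sketch: the population (numbers drawn uniformly between 0 and 1, which has mean 0.5 and standard deviation $\sqrt{1/12}$), the sample size, the number of trials and the function name are all assumptions chosen for the demonstration.

```python
import math
import random
import statistics


def simulate_sample_means(n, trials, seed=0):
    """Draw `trials` samples of size n from Uniform(0, 1); return their means."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    return [statistics.fmean([rng.random() for _ in range(n)])
            for _ in range(trials)]


n = 30
population_sd = math.sqrt(1 / 12)            # sd of Uniform(0, 1)
means = simulate_sample_means(n, trials=5000)

observed_mean = statistics.fmean(means)      # should be close to 0.5
observed_sd = statistics.pstdev(means)       # should be close to sigma / sqrt(n)
predicted_sd = population_sd / math.sqrt(n)

print(round(observed_mean, 3), round(observed_sd, 4), round(predicted_sd, 4))
```

The observed standard deviation of the sample means should come out close to the predicted value $\sigma/\sqrt{n}$, though not exactly equal, because the simulation is random.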
For further information, try a statistics book from the library, or
the Khan Academy videos on YouTube.
39. Normal distribution $X \sim N(\mu, \sigma^2)$
The normal distribution is defined as
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where σ is the population standard deviation and µ is the
population mean.
The graph below is when µ = 0 and σ = 1.
[Figure: bell-shaped curve of y = f(x), with x from −4 to 4 and y from 0 to 0.5.]
Probabilities correspond to areas under this curve.
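The density formula and the link between probabilities and areas can be illustrated in Python. In this sketch `normal_pdf` is an illustrative name for a direct transcription of the formula above, and the area is computed with `statistics.NormalDist.cdf` from the standard library.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))


# Height of the standard normal curve at its peak (x = 0).
peak = normal_pdf(0)

# Area under the standard normal curve between -1.96 and 1.96.
standard = NormalDist(mu=0, sigma=1)
area = standard.cdf(1.96) - standard.cdf(-1.96)

print(round(peak, 3))  # 0.399
print(round(area, 3))  # 0.95
```

The area of about 0.95 between −1.96 and 1.96 leaves a combined area of about 0.05 in the two tails, which is where the two-tailed 5% critical value comes from.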