Elementary statistics

Davis Lazarus
Assistant Professor
ISIM, The IIS University

Too few categories

[Figure: histogram "Age of Spring 1998 Stat 250 Students"; x-axis: Age (in years), y-axis: Frequency (Count); n = 92 students]

Too many categories

[Figure: histogram "GPAs of Spring 1998 Stat 250 Students"; x-axis: GPA, y-axis: Frequency (Count); n = 92 students]

• Scatter Plot
• Scatter diagram
• Scattergram

[Figure: scatter plot of Y (about 30 to 75) against X (30 to 80)]

Classes   Class boundaries   Tally marks             Freq.   x (class mark)

70 – 78   69.5 – 78.5        /////                   5       74
61 – 69   60.5 – 69.5        /////                   5       65
52 – 60   51.5 – 60.5                                0       56
43 – 51   42.5 – 51.5        //                      2       47
34 – 42   33.5 – 42.5        /////-//                7       38
25 – 33   24.5 – 33.5        /////-/////-////        14      29
16 – 24   15.5 – 24.5        /////-/////-/////-//    17      20

A frequency distribution
table lists categories of
scores along with their
corresponding frequencies.

The frequency for a
particular category or
class is the number of
original scores that fall
into that class.

The classes or
categories refer to the
groupings of a
frequency table

• The range is the difference
between the highest value
and the lowest value.

R = highest value – lowest value

The class width is the
difference between two
consecutive lower class limits
or class boundaries.

The class limits are the
smallest or the largest
numbers that can actually
belong to different classes.

• Lower class limits are the
smallest numbers that can
actually belong to the different
classes.
• Upper class limits are the
largest numbers that can
actually belong to the different
classes.

• The class boundaries are obtained by
increasing the upper class limits and
decreasing the lower class limits by the
same amount so that there are no gaps
between consecutive classes. The
amount to be added or subtracted is ½
the difference between the upper limit of
one class and the lower limit of the
following class.

Essential Question :

• How do we construct a
frequency distribution
table?

Process of Constructing
a Frequency Table
• STEP 1. Determine the range.

R = Highest Value – Lowest Value

• STEP 2. Determine the
tentative number of classes (k)

k = 1 + 3.322 log N

where log is the base-10 logarithm and N is the number of observations.

• Always round k off to a whole number.
• Note: The number of classes should be between
5 and 20. The actual number of classes may be
affected by convenience or other subjective
factors.

• STEP 3. Find the class width
by dividing the range by the
number of classes.

class width = Range / number of classes, that is, c = R / k

(Always round off.)

• STEP 4. Write the classes or
categories starting with the
lowest score. Stop when the class
already includes the highest
score.
• Add the class width to the starting point to get the
second lower class limit. Add the class width to the
second lower class limit to get the third, and so on. List
the lower class limits in a vertical column and enter the
upper class limits, which can be easily identified at this
stage.

• STEP 5. Determine the
frequency for each class by
referring to the tally columns
and present the results in a
table.

When constructing frequency
tables, the following guidelines
should be followed.
• The classes must be mutually
exclusive. That is, each score
must belong to exactly one
class.
• Include all classes, even if the
frequency might be zero.

• All classes should have the
same width, although it is
sometimes impossible to avoid
open – ended intervals such as
“65 years or older”.
• The number of classes should
be between 5 and 20.

Let’s Try!!!
• Time magazine collected
information on all 464 people who
died from gunfire in the Philippines
during one week. Here are the ages
of 50 men randomly selected from
that population. Construct a
frequency distribution table.

19 18 30 40 41 33 73 25
23 25 21 33 65 17 20 76
47 69 20 31 18 24 35 24
17 36 65 70 22 25 65 16
24 29 42 37 26 46 27 63
21 27 23 25 71 37 75 25
27 23
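The five steps above can be sketched in Python for these 50 ages. This is a minimal sketch rather than a definitive procedure: the function name `frequency_table` is hypothetical, the data are assumed to be whole numbers, and the class width is rounded up (an assumption, chosen so that the classes always reach the highest score).

```python
import math

ages = [19, 18, 30, 40, 41, 33, 73, 25,
        23, 25, 21, 33, 65, 17, 20, 76,
        47, 69, 20, 31, 18, 24, 35, 24,
        17, 36, 65, 70, 22, 25, 65, 16,
        24, 29, 42, 37, 26, 46, 27, 63,
        21, 27, 23, 25, 71, 37, 75, 25,
        27, 23]

def frequency_table(data):
    """Return a list of (lower limit, upper limit, frequency) for integer data."""
    low, high = min(data), max(data)
    data_range = high - low                           # Step 1: range
    k = round(1 + 3.322 * math.log10(len(data)))      # Step 2: tentative number of classes
    width = math.ceil(data_range / k)                 # Step 3: class width (rounded up: assumption)
    table = []
    lower = low
    while lower <= high:                              # Step 4: stop once the highest score is covered
        upper = lower + width - 1                     # upper class limit for whole-number data
        freq = sum(lower <= x <= upper for x in data) # Step 5: frequency of the class
        table.append((lower, upper, freq))
        lower += width
    return table

table = frequency_table(ages)
for lower, upper, freq in table:
    print(f"{lower} - {upper}: {freq}")
```

For this data the range is 60, k rounds to 7 and the class width to 9, which gives the classes 16 – 24 through 70 – 78.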

Using Table:
• What is the lower class limit
of the highest class? Upper
class limit of the lowest class?
• Find the class mark of the
class 43 – 51.
• What is the frequency of the
class 16 – 24?

Example 1
The manager of Hudson Auto would like to have a better
understanding of the cost of parts used in the engine
tune-ups performed in the shop.
She examines 50 customer invoices
for tune-ups. The costs of parts,
rounded off to the nearest dollar,
are listed below.

91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73

CUMULATIVE FREQUENCY
DISTRIBUTION
• The less than cumulative frequency
distribution (F<) is constructed by adding the
frequencies from the lowest to the highest
interval while the more than cumulative
frequency distribution (F>) is constructed by
adding the frequencies from the highest class
interval to the lowest class interval.

Tabular Summary
Frequency Distribution of engine tune-ups

Cost ($)   Frequency   Relative Frequency   Cumulative Frequency   Cumulative Frequency
                                            (less than)            (more than)

50-59      2           0.04                 2                      50
60-69      13          0.26                 15                     48
70-79      16          0.32                 31                     35
80-89      7           0.14                 38                     19
90-99      7           0.14                 45                     12
100-109    5           0.10                 50                     5
Total      50          1.00

For example, 15 = 2 + 13 in the less than column and 12 = 5 + 7 in the more than column.
45 tune-ups cost less than $100; 12 tune-ups cost more than $89.
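As a rough check of the cumulative columns, the running totals can be computed from the class frequencies. This is a small sketch using only the Python standard library; the variable names are illustrative.

```python
from itertools import accumulate

classes = ["50-59", "60-69", "70-79", "80-89", "90-99", "100-109"]
freq = [2, 13, 16, 7, 7, 5]

total = sum(freq)
relative = [f / total for f in freq]                 # relative frequency of each class
less_than = list(accumulate(freq))                   # add from the lowest class upward
more_than = list(accumulate(reversed(freq)))[::-1]   # add from the highest class downward

for row in zip(classes, freq, relative, less_than, more_than):
    print(row)
```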

Graphical Summary: Histogram

[Figure: histogram of tune-up parts cost; x-axis: Cost ($) by class, from 50-59 to 100-109; y-axis: Frequency]

Unlike a bar graph, a histogram has no natural
separation between rectangles of adjacent classes.

Ogive

[Figure: less than ogive and more than ogive of the tune-up data; x-axis: Tune-up Cost ($), y-axis: Frequency; the median is marked]

Stem-and-Leaf Display

 5 | 2 7
 6 | 2 2 2 2 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3 5 8 9
 9 | 1 3 7 7 7 8 9
10 | 1 4 5 5 9

Each number to the left of the bar is a stem; each digit to the right is a leaf.

A single digit is used to define each leaf
Leaf units may be 100, 10, 1, 0.1, and so on
Where the leaf unit is not shown, it is assumed to equal 1
In the above example, the leaf unit was 1

Leaf Unit = 0.1
Data: 8.6 11.7 9.4 9.1 10.2 11.0 8.8

 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7

Leaf Unit = 10
Data: 1806 1717 1974 1791 1682 1910 1838

16 | 8
17 | 1 9
18 | 0 3
19 | 1 7

The 82 in 1682 is rounded down to 80 and is represented as an 8.
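The leaf-unit idea can be illustrated with a short Python sketch. The function name `stem_and_leaf` is hypothetical; digits below the leaf unit are dropped (rounded down), as in the 1682 example, and `Decimal` is used only to avoid floating-point surprises with units such as 0.1.

```python
from collections import defaultdict
from decimal import Decimal

def stem_and_leaf(values, leaf_unit=1):
    """Map each stem to its sorted list of single-digit leaves."""
    rows = defaultdict(list)
    for v in sorted(values):
        # Express the value in leaf units, dropping anything smaller than one unit.
        units = int(Decimal(str(v)) // Decimal(str(leaf_unit)))
        stem, leaf = divmod(units, 10)   # last digit is the leaf, the rest is the stem
        rows[stem].append(leaf)
    return dict(rows)

tens = stem_and_leaf([1806, 1717, 1974, 1791, 1682, 1910, 1838], leaf_unit=10)
tenths = stem_and_leaf([8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8], leaf_unit=0.1)

for stem, leaves in sorted(tens.items()):
    print(stem, "|", *leaves)
```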

Measures of Central Tendency
Arithmetic Mean, Weighted Mean, Geometric Mean,
Median, Mode, Partition Values – Quartiles, Deciles and
Percentiles

Measures of Dispersion
Range, Mean deviation, Standard deviation, Variance,
Co-efficient of variation

Measures of Position
Quartile deviation

• What is the “location” or “centre” of the data? (measures
of location or central tendency)
• How do the data vary? (measures of variability or
dispersion)

Mean: the average obtained by finding the sum of the
numbers and dividing by the number of numbers in the sum.

Median: When the numbers are listed from highest to lowest
or lowest to highest, the median is the number found
in the middle. If there is an even number of data values, find the
average of the middle two numbers.

Mode: The number that occurs the most often.

Mean is the most widely used measure of location and
shows the central value of the data.

Population mean: $\mu = \dfrac{\sum X_i}{N}$
µ is the population mean
N is the population size
X_i is a particular population value
Σ indicates the operation of adding

Sample mean: $\bar{X} = \dfrac{\sum x_i}{n}$
X̄ is the sample mean
n is the sample size
x_i is a particular sample value

• all values are used
• unique
• sum of the deviations from the mean is 0
• affected by unusually large or small data values

The Median is the midpoint of the values after they
have been ordered from the smallest to the largest.

For an even set of values, the median will be the
arithmetic average of the two middle numbers and is
found at the (n+1)/2 ranked observation.

There are as many values above the median as below it
in the data array.

• unique
• not affected by extremely large or small values
⇒ good measure of location when such values occur

The Mode is another measure of location and represents
the value of the observation that appears most frequently.

Data can have more than one mode.
If it has two modes, it is referred to as bimodal, three
modes, trimodal, and the like.
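The three measures of location can be tried out with Python's standard `statistics` module. The data list below is a made-up illustration, not taken from the slides.

```python
import statistics

data = [3, 5, 5, 8, 9, 12]             # hypothetical values for illustration

mean = statistics.mean(data)            # sum of the values divided by their count
median = statistics.median(data)        # average of the two middle values (even count)
modes = statistics.multimode(data)      # list of all most frequent values

# The deviations from the mean always sum to (approximately) zero.
deviation_sum = sum(x - mean for x in data)

print(mean, median, modes, deviation_sum)
```

`multimode` returns a list, so bimodal or trimodal data show up as two or three entries.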

Weighted Mean of a set of numbers X_1, X_2, ..., X_n
with corresponding weights w_1, w_2, ..., w_n:

$\bar{X}_w = \dfrac{w_1 X_1 + w_2 X_2 + \dots + w_n X_n}{w_1 + w_2 + \dots + w_n}$

Geometric Mean of a set of n numbers is
defined as the nth root of the product of the n numbers.

$GM = \sqrt[n]{(X_1)(X_2)(X_3)\cdots(X_n)}$

GM is used to average percents, indexes, and relatives.

Example 1

The interest rates on three bonds were 5, 21, and 4 percent.
The arithmetic mean is (5 + 21 + 4) / 3 = 10.0
The geometric mean is

$GM = \sqrt[3]{(5)(21)(4)} = 7.49$

The GM gives a more conservative profit figure because
it is not heavily weighted by the rate of 21%

Example 2

Another use of GM is to determine the percent increase in
sales, production or other business or economic series
from one time period to another.

[Figure: "Growth in Sales 1999-2004"; x-axis: Year, y-axis: Sales in Millions ($)]

$GM = \sqrt[n]{\dfrac{\text{Value at end of period}}{\text{Value at beginning of period}}} - 1$

where n is the number of periods.

Example 3

The total number of females enrolled in American
colleges increased from 755,000 in 1992 to 835,000 in
2000, a span of 8 years. The geometric mean annual rate of increase is
1.27%.

$GM = \sqrt[8]{\dfrac{835{,}000}{755{,}000}} - 1 = .0127$
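Both uses of the geometric mean can be checked numerically. The sketch below uses hypothetical helper names (`geometric_mean`, `mean_rate_of_increase`) and reproduces the bond example and the enrolment example.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))

def mean_rate_of_increase(start, end, periods):
    """Average rate of increase per period between a start and an end value."""
    return (end / start) ** (1 / periods) - 1

bond_gm = geometric_mean([5, 21, 4])                        # interest rates in percent
enrol_rate = mean_rate_of_increase(755_000, 835_000, 8)     # 1992 to 2000

print(round(bond_gm, 2), round(enrol_rate, 4))
```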

Measures of Dispersion

•Range
• Mean Deviation
•Quartile Deviation
•Standard Deviation
•Variance
•Co-efficient of Variation

Dispersion refers to the spread or variability in the data.

[Figure: frequency chart of data spread around the mean; horizontal scale 0 to 12, vertical scale 0 to 30]

Range = Largest value – Smallest value

Range Example

The following represents the current year’s Return on
Equity of the 25 companies in an investor’s portfolio.

-8.1 3.2 5.9 8.1 12.3
-5.1 4.1 6.3 9.2 13.3
-3.1 4.6 7.9 9.5 14.0
-1.4 4.8 7.9 9.7 15.0
1.2 5.7 8.0 10.3 22.1

Highest value: 22.1 Lowest value: -8.1

Range = Highest value – lowest value
= 22.1-(-8.1)
= 30.2

Mean Deviation
The arithmetic mean of the absolute values of the
deviations from the arithmetic mean.

$MD = \dfrac{\sum |X - \bar{X}|}{n}$

• All values are used in the calculation.
• It is not unduly influenced by large or small values.
• The absolute values are difficult to manipulate.

Example 5

The weights of a sample of crates containing books for
the bookstore (in pounds ) are: 103, 97, 101, 106, 103

$\bar{X} = 102$

$MD = \dfrac{\sum |X - \bar{X}|}{n} = \dfrac{|103 - 102| + \dots + |103 - 102|}{5} = \dfrac{1 + 5 + 1 + 4 + 1}{5} = 2.4$
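The mean deviation of the crate weights can be verified with a few lines of Python. The function name `mean_deviation` is an illustrative choice.

```python
def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

weights = [103, 97, 101, 106, 103]   # crate weights in pounds
md = mean_deviation(weights)
print(md)
```

The absolute deviations are 1, 5, 1, 4 and 1, so the result is 12 / 5.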

Standard deviation and Variance
Variance: the arithmetic mean of the squared deviations from the mean.
Standard deviation = √(variance)

Population Variance: $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$

X is the value of an observation in the population
μ is the arithmetic mean of the population
N is the number of observations in the population

Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$

Example 6

In Example 4 (the Return on Equity data of the range example, with mean µ = 6.62), the variance and standard deviation are:

$\sigma^2 = \dfrac{\sum (X - \mu)^2}{N} = \dfrac{(-8.1 - 6.62)^2 + (-5.1 - 6.62)^2 + \dots + (22.1 - 6.62)^2}{25} = 42.227$

$\sigma = \sqrt{42.227} = 6.498$

Sample variance: $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$

Sample standard deviation: $s = \sqrt{s^2}$

Example 7

The hourly wages earned by a sample of five students are
$7, $5, $11, $8, $6.

$\bar{X} = \dfrac{\sum X}{n} = \dfrac{37}{5} = 7.40$

$s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1} = \dfrac{(7 - 7.4)^2 + \dots + (6 - 7.4)^2}{5 - 1} = \dfrac{21.2}{5 - 1} = 5.30$

$s = \sqrt{s^2} = \sqrt{5.30} = 2.30$

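Example:
Data: X = {6, 10, 5, 4, 9, 8}; N = 6

X     X − µ   (X − µ)²
6     -1      1
10     3      9
5     -2      4
4     -3      9
9      2      4
8      1      1
Total: 42     Total: 28

Mean: $\mu = \dfrac{\sum X}{N} = \dfrac{42}{6} = 7$

Variance (the six values are treated as a population, so the divisor is N): $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N} = \dfrac{28}{6} = 4.67$

Standard Deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{4.67} = 2.16$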
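The difference between dividing by n − 1 and dividing by N can be seen with Python's `statistics` module, which provides both forms. The sketch below reruns the hourly-wage sample and the six-value example; the variable names are illustrative.

```python
import statistics

wages = [7, 5, 11, 8, 6]           # sample of five hourly wages
values = [6, 10, 5, 4, 9, 8]       # six values with divisor N

# Sample formulas divide the sum of squared deviations by n - 1.
sample_var = statistics.variance(wages)
sample_sd = statistics.stdev(wages)

# Population formulas divide by N.
pop_var = statistics.pvariance(values)
pop_sd = statistics.pstdev(values)

print(round(sample_var, 2), round(sample_sd, 2))
print(round(pop_var, 2), round(pop_sd, 2))
```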

Empirical Rule:
For any symmetrical, bell-shaped distribution

About 68% of the observations will lie within 1s of the mean

About 95% of the observations will lie within 2s of the
mean

Nearly all the observations will be within 3s of the mean

Interpretation and Uses of the Standard Deviation

[Figure: bell-shaped curve; 68% of the area lies between µ − 1σ and µ + 1σ, 95% between µ − 2σ and µ + 2σ, 99.7% between µ − 3σ and µ + 3σ]

Fractiles

Quartiles: Q1, Q2, Q3 divide ranked data into four equal parts (25% each).

Deciles: D1, D2, D3, D4, D5, D6, D7, D8, D9 divide ranked data into ten equal parts (10% each).

Percentiles: the 99 percentiles divide ranked data into 100 equal parts.

Relative Standing
Percentiles

percentile of value x = ((number of values < x) / total number of
values) * 100
(round the result to the nearest whole number)

Suppose that in a class of 25 people we have the following averages
(ordered in ascending order):

42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77, 78, 78, 79, 80, 81, 84, 85, 87, 89,
91, 94, 98
If you received a 77, what percentile are you?

percentile of 77 = (12/25)*100 = 48
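The percentile formula translates directly into code. The sketch below uses a hypothetical function name, `percentile_of`, and the 25 class averages listed above.

```python
grades = [42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77,
          78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98]

def percentile_of(value, data):
    """Percentage of values strictly below `value`, rounded to a whole number."""
    below = sum(1 for x in data if x < value)
    return round(below / len(data) * 100)

p77 = percentile_of(77, grades)
print(p77)
```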

Relative Standing
Quartiles

Instead of finding the percentile of a single data value as we did on
the previous page, it is often useful to group the data into 4, or more,
(nearly) equal groups. When grouping the data into four equal
groupings, we call these groupings quartiles.

Let n = number of items in the data set
k = percent desired (ex. k = 25)
L = locator: the position, in the ordered data, of the value separating the first k
percent of the data from the rest

L = (k/100) * n

Relative Standing
Let’s separate the 25 class grades into four quartiles.

• Step 1 – order the data in ascending order

42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77, 78, 78, 79, 80, 81, 84, 85, 87, 89,
91, 94, 98

• Step 2 – find the 3 locators L25, L50, L75, rounding any fraction part up to the next integer

L25 = (25/100) * 25 = 6.25 → 7
L50 = (50/100) * 25 = 12.5 → 13
L75 = (75/100) * 25 = 18.75 → 19

Q1 is the 7th value (70), Q2 the 13th value (77) and Q3 the 19th value (84).
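The locator rule can be sketched as a small function. The name `value_at_locator` is hypothetical, and one case is an assumption: when L comes out as a whole number, the function averages the L-th and (L+1)-th values, which is the convention the box diagram example appears to use for Q1 = (73 + 74)/2.

```python
import math

grades = [42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77,
          78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98]

def value_at_locator(sorted_data, k):
    """Value separating the first k percent of the ordered data (0 < k < 100)."""
    n = len(sorted_data)
    locator = k * n / 100                      # L = (k/100) * n
    if locator.is_integer():
        # Assumption: a whole-number locator averages the L-th and (L+1)-th values.
        i = int(locator)
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    # Otherwise round the fraction part up to the next integer position.
    return sorted_data[math.ceil(locator) - 1]

q1 = value_at_locator(grades, 25)
q2 = value_at_locator(grades, 50)
q3 = value_at_locator(grades, 75)
print(q1, q2, q3)
```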

Relative Standing

Other measures of relative standing
include
•Interquartile range (IQR) = Q3 - Q1
•Semi-interquartile range = (Q3 - Q1)/ 2
•Midquartile = (Q3 + Q1)/2
•10 – 90 percentile range = P90 - P10

For the data on the previous page we have:

Measures of variation:
IQR = 84 – 70 = 14
Semi IQR = (84 – 70)/2 = 7

Measure of central tendency:
Midquartile = (84 + 70)/2 = 77

Box Diagram

65, 67, 68, 68, 69, 69, 71, 71, 71, 72, 72, 72, 73, 73, 73,
74, 74, 75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 77, 78,
78, 78, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 81, 81,
81, 82, 82, 83, 84, 85, 85, 85, 86, 86, 87, 87, 88, 89, 92

To construct a box diagram to illustrate the extent to which the
extreme data values lie beyond the interquartile range, draw a line
with the low and high value highlighted at the two ends. Mark the
gradations between these two extremes, then locate the quartile
boundaries Q1, Med., and Q3 on this line. Construct a box about
these values.

Here n = 60, so L25 = 15 is a whole number and Q1 is the average of the 15th and 16th values:
Q1 = (73 + 74)/2 = 73.5

[Figure: box diagram on a scale from 65 to 92 with Q1, M (median) and Q3 marked]

Percentile of score a = (number of scores less than a / total number of scores) * 100

Relation between the different fractiles

• Q1 = P25
• Q2 = P50
• Q3 = P75

• D1 = P10
• D2 = P20
• D3 = P30
• ...
• D9 = P90

Interquartile Range: Q3 – Q1

Box plot: a graphical display, based on quartiles,
that helps to picture a set of data.

Five pieces of data are needed to construct a box plot:
Minimum Value,
First Quartile, Q1,
Median,
Third Quartile, Q3,
Maximum Value.

The box represents the interquartile range, which contains the middle 50% of the values.
The whiskers represent the range; they extend from the box to the
highest and lowest values, excluding outliers.
A line across the box indicates the median.

Example 8
Based on a sample of 20 deliveries, Buddy’s Pizza
determined the following information. The minimum
delivery time was 13 minutes and the maximum 30 minutes.
The first quartile was 15 minutes, the median 18 minutes, and
the third quartile 22 minutes. Develop a box plot for the
delivery times.

[Figure: box plot of the delivery times on a scale from 12 to 32 minutes, with Min, Q1, Median, Q3 and Max marked; a distance of 1.5 times the interquartile range is indicated on each side of the box]
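The numbers behind this box plot can be worked out in a few lines. The sketch below assumes the common convention that values more than 1.5 times the interquartile range beyond the box count as outliers; the variable names are illustrative.

```python
# Five-number summary of the delivery times (minutes).
minimum, q1, median, q3, maximum = 13, 15, 18, 22, 30

iqr = q3 - q1                      # length of the box
lower_fence = q1 - 1.5 * iqr       # values below this would be flagged as outliers
upper_fence = q3 + 1.5 * iqr       # values above this would be flagged as outliers

# With no outliers, the whiskers run all the way to the minimum and maximum.
has_outliers = minimum < lower_fence or maximum > upper_fence

print(iqr, lower_fence, upper_fence, has_outliers)
```

Under that assumption neither extreme delivery time lies beyond a fence, so the whiskers extend from 13 to 30 minutes.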

Skewness
measurement of the lack of symmetry of the distribution.

Symmetric distribution: A distribution having the same
shape on either side of the centre

Skewed distribution: One whose shapes on either side of
the center differ; a nonsymmetrical distribution.

Can be positively or negatively skewed, or bimodal

Relative Positions of the Mean, Median, and
Mode in a Symmetric Distribution

[Figure: symmetric bell-shaped curve with the Mean, Median and Mode marked at the centre]

Relative Positions of the Mean, Median, and Mode in a Right
Skewed or Positively Skewed Distribution

Mean > Median > Mode

[Figure: right-skewed curve with the Mode, Median and Mean marked from left to right]

The Relative Positions of the Mean, Median, and Mode in a Left
Skewed or Negatively Skewed Distribution

Mean < Median < Mode

[Figure: left-skewed curve with the Mean, Median and Mode marked from left to right]

The coefficient of skewness can range from -3.00 up to 3.00

A value of 0 indicates a symmetric distribution.

Example 9

Using the twelve stock prices, we find the mean to be
84.42, standard deviation, 7.18, median, 84.5.

$sk = \dfrac{3(\bar{X} - \text{Median})}{s} = -.035$
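The coefficient used here, three times the gap between mean and median divided by the standard deviation, is commonly known as Pearson's coefficient of skewness. The sketch below uses a hypothetical function name and the rounded summary figures quoted in the example; with those rounded inputs the result is about -0.033, so the -.035 above was presumably computed from unrounded values.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's coefficient of skewness: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev

sk = pearson_skewness(mean=84.42, median=84.5, std_dev=7.18)
print(round(sk, 3))
```

A value this close to 0 indicates a nearly symmetric distribution with a very slight negative skew.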

Kurtosis
• derived from the Greek word κυρτός, kyrtos or kurtos,
meaning bulging
• measure of the "peakedness" of the probability
distribution of a real-valued random variable
• higher kurtosis means more of the variance is due to
infrequent extreme deviations, as opposed to frequent
modestly-sized deviations.

A distribution with positive kurtosis (measured relative to the normal distribution) is called leptokurtic,
or leptokurtotic.
In terms of shape, a leptokurtic distribution has a more acute
"peak" around the mean (that is, a higher probability than a
normally distributed variable of values near the mean) and
"fat tails" (that is, a higher probability than a normally
distributed variable of extreme values).

A distribution with negative kurtosis (measured relative to the normal distribution) is called platykurtic,
or platykurtotic.
In terms of shape, a platykurtic distribution has a smaller
"peak" around the mean (that is, a lower probability than a
normally distributed variable of values near the mean) and
"thin tails" (that is, a lower probability than a normally
distributed variable of extreme values).

[Figure: a leptokurtic distribution compared with the normal (mesokurtic) distribution]

[Figure: a platykurtic distribution compared with the normal (mesokurtic) distribution]

Comparing Standard Deviations

[Figure: three dot plots, Data A, Data B and Data C, each on a scale from 11 to 21]

Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57

Co-efficient of variation
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Is used to compare two or more sets of data measured in different
units

$CV = \left(\dfrac{S}{\bar{X}}\right) \cdot 100\%$

When the mean value is near zero, the coefficient of
variation is sensitive to change in the standard deviation,
limiting its usefulness.

Stock A:
Average price last year = $50
Standard deviation = $5

$CV = \left(\dfrac{S}{\bar{X}}\right) \cdot 100\% = \left(\dfrac{\$5}{\$50}\right) \cdot 100\% = 10\%$

Stock B:
Average price last year = $100
Standard deviation = $5

$CV = \left(\dfrac{S}{\bar{X}}\right) \cdot 100\% = \left(\dfrac{\$5}{\$100}\right) \cdot 100\% = 5\%$
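The two stock comparisons can be reproduced with a one-line function. The name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

cv_a = coefficient_of_variation(5, 50)     # Stock A
cv_b = coefficient_of_variation(5, 100)    # Stock B
print(cv_a, cv_b)
```

Both stocks have the same standard deviation, but Stock A varies twice as much relative to its average price.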
