1. Statistics and Displays
A Basic Tutorial
This tutorial is designed as a refresher study for
teachers but may be adapted for use with students.
2. Tutorial Contents
This tutorial consists of
vocabulary;
data displays and their properties by grade
level.
The user of this tutorial may skip to desired
sections and pages via embedded links. Links
are indicated by black font and underlined text.
3. Vocabulary
Outliers
Measures of Central Tendency
Measures of Spread
Skewness
Types of data
Types of variables
4. Outliers
An outlier is a data point that lies outside
the overall pattern of a distribution.
Specifically, it is a point which falls more
than 1.5 times the interquartile range
above the third quartile or below the first
quartile.
An outlier may exist in both uni-variate
data and bi-variate data.
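The 1.5 × interquartile range rule can be sketched in a few lines of Python. This is a minimal illustration, not a definitive method: the function name `iqr_outliers` is hypothetical, the quartiles are assumed to be the medians of the lower and upper halves of the ordered data (one of several conventions), and the value 160 is made up to show an outlier.

```python
from statistics import median

def iqr_outliers(values):
    """Return values more than 1.5 * IQR below Q1 or above Q3.

    Assumes at least two values. Quartiles are taken as the medians
    of the lower and upper halves of the ordered data (an assumption).
    """
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])    # median of the lower half
    q3 = median(data[-half:])   # median of the upper half
    fence = 1.5 * (q3 - q1)     # 1.5 times the interquartile range
    return [x for x in data if x < q1 - fence or x > q3 + fence]

# Ten test scores plus one made-up extreme value (160).
scores = [65, 65, 70, 75, 80, 80, 85, 90, 95, 100, 160]
print(iqr_outliers(scores))
```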
5. Measures of Central Tendency
Mean
Median
Mode
All of these measures can describe the
“average” of a data set; thus, the term
“average” should not be treated as
synonymous with the term “mean.”
6. Mean (Arithmetic Mean)
The mean is the sum of all the values in
the data set divided by the number of
data points in the set.
The mean is a good measure for roughly
symmetric sets of data.
It may be misleading in skewed sets of
data as it is influenced by extreme values.
$$\text{mean} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
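As a small sketch of the calculation, the mean can be computed directly from its definition. The function name and the sample values are illustrative only; the second call shows how a single extreme value pulls the mean upward.

```python
def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

print(mean([2, 4, 6, 8]))        # 5.0
print(mean([2, 4, 6, 8, 100]))   # 24.0, pulled up by the extreme value 100
```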
7. Median
The median is the middle term in an ordered list
of data points.
It is the middle of a distribution of data values.
Thus, half the scores lie on one side, and half
lie on the other side.
The median is less sensitive to extreme scores.
It is a good measure to use when describing a
set with extreme outlier values.
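A minimal sketch of finding the median follows. The function name and sample values are illustrative. For an even number of data points the sketch averages the two middle values, which is the usual convention but is an assumption here, since the slide does not state it.

```python
def median(values):
    """Middle value of the ordered data (assumes at least one value)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                      # odd count: the single middle value
    return (data[mid - 1] + data[mid]) / 2    # even count: average the two middle values

print(median([2, 4, 6, 8, 100]))   # 6, barely affected by the extreme value 100
print(median([2, 4, 6, 8]))        # 5.0
```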
8. Mode
The mode is the value that appears most frequently in
the data set.
More than one mode can exist when two (or more)
values are tied for the highest frequency.
Bi-modal, tri-modal, etc. can be used to describe the number of
modes in a data set when there is more than one.
Mode is the ONLY measure of central tendency that
can be used with nominal data.
The mode greatly fluctuates with changes in a sample,
and is not recommended as the only measure of central
tendency to describe a data set.
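One way to find every mode, including ties, is to count occurrences and keep the values with the highest count. This is a sketch with a hypothetical function name `modes`; it assumes a non-empty data set and works for nominal data such as colors as well as numbers.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 3]))           # [2, 3], a bi-modal set
print(modes(["red", "blue", "red"]))    # ['red'], nominal data
```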
9. Interesting Notes about
Measures of Central Tendency
In a normal distribution of scores, the
mean, median and mode all have the
same value.
Mean is the most efficient measure to
use in a normal distribution.
Median is usually the best to use when
the distribution is skewed with outliers.
11. Range
The range is the difference between
the highest and lowest value in the
data set.
The range is highly sensitive to
extreme scores (outliers) and, thus,
is not good to use as the only
measure of spread.
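A short sketch shows how one extreme score changes the range. The function name `data_range` and the sample scores are illustrative, and the value 160 is made up.

```python
def data_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)

scores = [65, 65, 70, 75, 80, 80, 85, 90, 95, 100]
print(data_range(scores))            # 35
print(data_range(scores + [160]))    # 95, one extreme score nearly triples the range
```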
12. Variance
Variance is a measure of how spread
out a distribution is.
Variance is computed as the mean of
the squared differences of each value
from the mean of the set.
Variance is basically a measure of how
far, on average, each value lies
from the mean of the set, in squared units.
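$$\text{variance} = \frac{(\bar{X} - x_1)^2 + (\bar{X} - x_2)^2 + \cdots + (\bar{X} - x_n)^2}{n}$$
where $\bar{X}$ is the mean of the set.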
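The formula can be sketched directly in Python. The function name and sample values are illustrative. Note that this divides by n, as the slide does; a sample variance that divides by n - 1 is a different convention and is not shown here.

```python
def variance(values):
    """Mean of the squared differences of each value from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((mean - x) ** 2 for x in values) / n

# Mean is 5; squared differences are 9, 1, 1, 1, 0, 0, 4, 16 (sum 32).
print(variance([2, 4, 4, 4, 5, 5, 7, 9]))   # 4.0
```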
13. Standard Deviation
The standard deviation is computed as the
square root of the variance.
It is the most commonly used
measure of spread for a data set because it
takes into account all the data points rather
than just the extreme ends.
It is often used as a measure of risk in
real-world applications such as stock
investments.
The standard deviation is very useful when
working with a normal distribution.
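As a sketch, the standard deviation is one extra step beyond the variance. The function name and sample values are illustrative, and the variance here is assumed to divide by the number of data points n.

```python
import math

def standard_deviation(values):
    """Square root of the variance (variance computed by dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((mean - x) ** 2 for x in values) / n
    return math.sqrt(variance)

print(standard_deviation([2, 4, 4, 4, 5, 5, 7, 9]))   # 2.0
```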
14. Skewness
A distribution of data is skewed if one of the
tails (ends) is longer than the other.
Positive skew – long tail in the positive (right)
end
Mean is typically larger than the median
Negative skew – long tail in the negative (left)
end
Mean is typically smaller than the median
Symmetric distributions, such as the normal
curve, have tails of equal length
Mean and median are equal
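The comparison of mean and median can be sketched as a rough check. This is only a heuristic under the relationships listed above, not a formal skewness statistic; the function name and sample data are hypothetical, and equal mean and median do not by themselves prove symmetry.

```python
from statistics import mean, median

def compare_mean_median(values):
    """Rough check only: compares the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "mean > median (consistent with positive skew)"
    if m < med:
        return "mean < median (consistent with negative skew)"
    return "mean = median (consistent with symmetry)"

print(compare_mean_median([1, 2, 3, 4, 20]))      # long right tail
print(compare_mean_median([1, 17, 18, 19, 20]))   # long left tail
print(compare_mean_median([1, 2, 3, 4, 5]))       # symmetric
```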
15. Types of Data
Uni-variate data
The data is collected on only one
variable.
Bi-variate data
The data is collected on two
variables and plotted together for
investigation.
16. Types of Variables
A categorical variable has values that are labels for a
particular attribute (e.g., ice cream flavors).
Nominal – categories are in no particular order
Ordinal – categories are in a particular order
A quantitative variable has values that not only are
numerical but also allow descriptions such as mean and
range to be meaningful (e.g., test scores).
A discrete variable has only countable values (e.g., the
number of students in a class).
A continuous variable has numerical values that can be
any of the values in a range of numbers (e.g., the
speed of a car).
17. Data Displays
and their Properties
6th Grade
7th Grade
8th Grade
18. 6th Grade
Line plot
Line graph
Bar graph
Stem and leaf
Circle graph (sketch only)
19. 7th Grade
Line plot
Line graph
Bar graph
Stem and leaf
Circle graph
Venn diagram
20. 8th Grade
Line plot
Line graph
Bar graph
Stem and leaf
Circle graph
Venn diagram
Box and whisker
Histogram
Scatterplot
21. Line Plot
Consists of
a horizontal number line of the possible data
values;
one X for each element in the data set placed
over the corresponding value on the number
line.
Works well when
the data is quantitative (numerical);
there is one group of data (uni-variate);
the data set has fewer than 50 values;
the range of possible values is not too great.
22. Line Plot Example
Suppose thirty people live in an apartment building. The ages of
the residents are below.
58, 30, 37, 36, 34, 49, 35, 40,
47, 47, 39, 54, 47, 48, 54, 50,
35, 40, 38, 47, 48, 34, 40, 46,
49, 47, 35, 48, 47, 46
The graph is easier to create when the ages are placed in
order from smallest to largest, as the values will appear on the
number line.
30, 34, 34, 35, 35, 35, 36, 37,
38, 39, 40, 40, 40, 46, 46, 47,
47, 47, 47, 47, 47, 48, 48, 48,
49, 49, 50, 54, 54, 58
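A text version of this line plot can be sketched in Python. To keep the code short, the plot is turned on its side: the number line runs down the page and the X marks extend to the right. The function name `line_plot_rows` is hypothetical.

```python
from collections import Counter

ages = [58, 30, 37, 36, 34, 49, 35, 40, 47, 47, 39, 54, 47, 48, 54, 50,
        35, 40, 38, 47, 48, 34, 40, 46, 49, 47, 35, 48, 47, 46]

def line_plot_rows(values):
    """One row per possible value on the number line, one X per data element."""
    counts = Counter(values)
    rows = []
    for v in range(min(values), max(values) + 1):
        rows.append(f"{v:>3} | " + "X" * counts[v])   # missing values get an empty row
    return rows

for row in line_plot_rows(ages):
    print(row)
```

The empty rows from 41 to 45 show a gap in the data, and the longest row (47) is the mode.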
23. Advantages of Line Plots
The plot shows all the data.
Line plots allow several features of
the data to become more obvious,
including any outliers, data
clusters, or gaps.
The mode is easily visible.
The range can be calculated quite
easily from this data display.
24. Disadvantages of Line Plots
A line plot may only be used for
quantitative (numerical) data.
A line plot is not efficient when
the data set is large and/or the
range is large.
25. Questions to Ask
Is the data skewed?
How do the mean, median, and
mode compare to each other?
Are there any outliers, data
clusters, or gaps in the data?
26. Line Graph
Consists of
paired values graphed as points on a
plane defined by an x- and y-axis;
line segments connecting the graphed
points (much like a dot-to-dot).
Works well when
the data is paired (bi-variate);
the data is continuous.
27. Line Graph Example
[Figure: line graph of John's Weight in Kilograms (vertical axis, 65 to 75) by Year (horizontal axis, 1991 to 1995).]
John weighed 68 kg in 1991, 70 kg in 1992, 74 kg in
1993, 74 kg in 1994, and 73 kg in 1995.
28. Advantages of Line Graphs
A line graph is a way to summarize how two
pieces of information are related and how they
vary depending on one another.
29. Disadvantages of Line Graphs
Changing the scale of either axis can
dramatically change the visual impression of
the graph.
30. Questions to Ask
As one variable (displayed on the x-axis)
increases, what happens to the other variable
(displayed on the y-axis)?
What other trends in the data do you notice?
31. Bar Graph
Consists of
bars of the same width drawn either horizontally
or vertically;
bars whose length (or height) represents the
frequencies of each value in a data set.
Works well when
the data is numerical or categorical;
the data is discrete;
the data is collected using a frequency table.
33. Contrast Bar Graphs
with Case-Value Plots
In a case-value plot, the length of the bar drawn for each data
element represents the data value.
In a bar graph, the length of the bar drawn for each data value
represents the frequency of that value.
[Figure: case-value plot titled "Length of Six Cats". Horizontal axis: Cat (A to F). Vertical axis: Length in Inches (0 to 30).]
34. Advantages of Bar Graphs
The mode is easily visible.
A bar graph can be used with numerical or
categorical data.
35. Disadvantages of Bar Graphs
A bar graph shows only the frequencies of the
elements of a data set.
36. Questions to Ask
Is the data skewed?
What is the mode?
What if the data were collected _____ instead
of _______?
Why do you suppose ______ appears only
____ times in the data set?
What other conclusions can you draw about the
data?
37. Stem and Leaf Plot
Consists of
Numbers on the left, called the stems, which are the
leading digits of the data values (such as the tens
digits);
Numbers on the right, called the leaves, which are the
trailing digits of the data values (such as the
ones digits), so that each leaf represents one of the
data elements.
Works well when
the data contains more than 25 elements;
the data is collected in a frequency table;
the data values span many “tens” of values.
38. Stem and Leaf Plot
Additional Notes
A stem and leaf plot is also called a stem plot.
It is usually used for one set of data, but a back-to-back
stem and leaf plot can be used to compare two data
sets.
Data Set A          Data Set B
      Leaf | Stem | Leaf
     3 2 0 |  4   | 1 5 6 7
The numbers 40, 42, and 43 are from Data Set A.
The numbers 41, 45, 46, and 47 are from Data Set B.
39. Stem and Leaf Plot Example
The number of points scored by the Vikings basketball team this season:
78, 96, 88, 74, 63, 86, 92, 66, 72, 88, 83, 90, 67, 81, 85, 94.
Writing the data in numerical order may help to organize the data, but is
NOT a required step.
63, 66, 67, 72, 74, 78, 81, 83, 85, 86, 88, 88, 90, 92, 94, 96
Separate each number into a stem and a leaf. Since these are two digit
numbers, the tens digit is the stem and the units digit is the leaf.
The number 63 would be represented as
Stem Leaf
6    3
Group the numbers with the same stems. List the stems in numerical
order. Title the graph.
Points scored by the Vikings
Stem Leaf
6    3 6 7
7    2 4 8
8    1 3 5 6 8 8
9    0 2 4 6
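The steps above can be sketched in Python. This is a minimal illustration that assumes non-negative whole numbers with the tens digit as the stem; the function name `stem_and_leaf` is hypothetical, and stems with no leaves are simply omitted.

```python
from collections import defaultdict

points = [78, 96, 88, 74, 63, 86, 92, 66, 72, 88, 83, 90, 67, 81, 85, 94]

def stem_and_leaf(values):
    """Group the units digits (leaves) under their tens digits (stems)."""
    groups = defaultdict(list)
    for v in sorted(values):          # sorting first keeps each row of leaves in order
        stem, leaf = divmod(v, 10)    # 63 -> stem 6, leaf 3
        groups[stem].append(leaf)
    return {stem: groups[stem] for stem in sorted(groups)}

print("Points scored by the Vikings")
for stem, leaves in stem_and_leaf(points).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```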
40. Advantages of
Stem and Leaf Plots
It can be used to quickly organize a large list of
data values.
It is convenient to use in determining median or
mode of a data set quickly.
Outliers, data clusters, or gaps are easily
visible.
41. Disadvantages of
Stem and Leaf Plots
A stem and leaf plot is not very informative for a small
set of data.
42. Questions to Ask
Is the data skewed?
Are there any outliers, data clusters, or gaps?
What is the mode?
What is the median?
How would the median be affected by
removing a particular data element?
adding a particular data element?
What other conclusions can you draw about the data?
43. Circle Graph
also called Pie Chart
Consists of
a circle divided into sectors (or wedges) that show the
percent of the data elements that are categorized
similarly.
Works well when
there is only one set of data (uni-variate);
comparing the composition of each part to the whole
set of data.
44. Circle Graph Example
Cars in School Parking Lot
Color   Number
White   19
Black   25
Gray    11
Red     18
Blue    7
Other   10
Total   90
[Figure: circle graph of the car colors listed in the table.]
A proportion can be used to calculate the angle measure for each
sector. Using white as the example, 19 white cars compare to the
total of 90 in the same way that 76 degrees compares to the total
degrees (360) in a circle.
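The same proportion can be applied to every color. The sketch below uses the counts from the parking lot table; the function name `sector_angles` is hypothetical.

```python
cars = {"White": 19, "Black": 25, "Gray": 11, "Red": 18, "Blue": 7, "Other": 10}

def sector_angles(counts):
    """Angle of each sector: count / total = angle / 360."""
    total = sum(counts.values())
    return {label: 360 * n / total for label, n in counts.items()}

for color, angle in sector_angles(cars).items():
    print(f"{color}: {angle} degrees")
```

With 90 cars in total, each car accounts for 4 degrees, so the angles are 76, 100, 44, 72, 28, and 40 degrees, which add up to 360.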
45. Advantages of Circle Graphs
A circle graph can be used for either numerical
or categorical data.
A circle graph shows a part to whole
relationship.
46. Disadvantages of Circle Graphs
Without technology, a circle graph may be difficult to
make. Each percent must be converted to an angle by
calculating the fraction of 360 degrees. Then the
correct angle must be drawn.
A circle graph does not provide information about
measures of central tendency or spread.
47. Questions to Ask
How does each part compare to another?
Why do you suppose ________ was selected more
than _______?
What conclusions can you draw about the data?
48. Venn Diagram
Consists of
circles containing the value of each set or group;
overlapping or intersecting circles to illustrate the
common elements in groups;
any nonexamples displayed with a value outside of all
circles.
Works well when
a relationship exists between different groups of things
(sets).
50. Advantages of Venn Diagram
A Venn diagram visually illustrates the
relationship between different groups of things
(sets).
It shows which properties the sets have in
common.
51. Disadvantages of Venn Diagram
A Venn diagram is of little use when there
are no shared features among sets.
52. Questions to Ask
How many elements are in each set?
How many elements are common to set ___ and set
___?
How many elements are in set ___ but not in set ___?
What conclusions can you draw about the data?
53. Box and Whisker Plot
Consists of
the “five-point summary” (the least value, the greatest value, the
median, the first quartile, and the third quartile);
a box drawn to show the interval from the first (25th percentile) to
the third quartile (75th percentile) with a line drawn through the box
at the median;
line segments, called the whiskers, connecting the box to the least
and greatest values in the data distribution.
Works well when
there is only one set of data (uni-variate);
there are many data values.
54. Box and Whisker Plot Example
Math test scores: 80, 75, 90, 95, 65, 65, 80, 85, 70, 100.
Write the data in numerical order and find the five point summary.
65, 65, 70, 75, 80, 80, 85, 90, 95, 100
median = 80
first quartile = 70 (median of the lower part)
third quartile = 90 (median of the upper part)
smallest value = 65
largest value = 100
Place a point beneath each of these values on a number line.
Draw the box and whiskers and median line.
[Figure: number line marked 65, 70, 75, 80, 85, 90, 95, 100, shown first with the five points and then with the box, whiskers, and median line drawn.]
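The five point summary for the math test scores can be sketched in Python. The function name `five_point_summary` is hypothetical, and the quartiles are taken as the medians of the lower and upper parts of the ordered data, leaving out the middle value when the number of data points is odd (an assumption; other quartile conventions exist).

```python
from statistics import median

def five_point_summary(values):
    """Least value, first quartile, median, third quartile, greatest value."""
    data = sorted(values)
    half = len(data) // 2   # size of the lower and upper parts
    return {
        "smallest value": data[0],
        "first quartile": median(data[:half]),    # median of the lower part
        "median": median(data),
        "third quartile": median(data[-half:]),   # median of the upper part
        "largest value": data[-1],
    }

scores = [80, 75, 90, 95, 65, 65, 80, 85, 70, 100]
for name, value in five_point_summary(scores).items():
    print(name, "=", value)
```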
55. Box and Whisker Plot Example
The following set of numbers is the number of video games owned by
each boy in the club, arranged from least to greatest.
18 27 34 52 54 59 61 68 78 82 85 87 91 93 100
68 is the median
The median is the value exactly in the middle of an ordered set of
numbers.
52 is the lower quartile
The lower quartile is the median of the lower half of the values
(18, 27, 34, 52, 54, 59, 61).
87 is the upper quartile
The upper quartile is the median of the upper half of the values
(78, 82, 85, 87, 91, 93, 100).
56. Advantages of Box and Whisker Plots
A box-and-whisker plot immediately shows
the center, the spread, and the overall
range of the distribution.
Box plots are useful for comparing data sets,
especially when the data sets are large or when
they have different numbers of data elements.
57. Disadvantages of Box and Whisker Plots
It shows only certain statistics rather than
all the data.
Since the data elements are not
displayed, it is impossible to determine if
there are gaps or clusters in the data.
58. Questions to Ask
Is the data skewed?
What is the median?
How does the median compare to the mean?
What other conclusions can you draw about the
data?
59. Histogram
Consists of
equal intervals marked on the horizontal axis;
bars of equal width drawn for each interval, with the
height of each bar representing either the number of
elements or the percent of elements in that interval.
(There is no space between the bars.)
Works well when
data elements could assume any value in a range;
there is one set of data (uni-variate);
the data is collected using a frequency table.
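Counting the elements that fall in each equal interval can be sketched as follows. The data are the thirty ages from the Line Plot Example; the function name `histogram_counts`, the starting value of 30, and the interval width of 10 are choices made for illustration. Each interval includes its left endpoint and excludes its right endpoint, and values outside the intervals are ignored.

```python
ages = [58, 30, 37, 36, 34, 49, 35, 40, 47, 47, 39, 54, 47, 48, 54, 50,
        35, 40, 38, 47, 48, 34, 40, 46, 49, 47, 35, 48, 47, 46]

def histogram_counts(values, start, width, bins):
    """Number of elements in each of `bins` equal intervals beginning at `start`."""
    counts = [0] * bins
    for v in values:
        index = int((v - start) // width)   # which interval the value falls in
        if 0 <= index < bins:
            counts[index] += 1
    return counts

counts = histogram_counts(ages, start=30, width=10, bins=3)
for i, count in enumerate(counts):
    low = 30 + 10 * i
    print(f"{low}-{low + 9}: " + "#" * count)
```

The bar heights are 10, 16, and 4, so the 40 to 49 interval occurs most frequently.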
61. Advantages of Histograms
A histogram provides a way to display the
frequency of occurrences of data along an
interval.
62. Disadvantages of Histograms
The use of intervals prevents the calculation of
an exact measure of central tendency.
63. Questions to Ask
What is the most frequently occurring interval of
values?
What is the least frequently occurring interval of
values?
What conclusions can you draw from the data?
64. Scatterplot
Consists of
paired data (bi-variate) displayed on
a two-dimensional grid.
Works well when
multiple measurements are made for
each element of a sample.
65. Additional Notes about Scatterplots
If the relationship is thought to be a causal one, then
the independent variable is represented along the
x-axis and the dependent variable on the y-axis.
A scatterplot can show that there is a positive, negative,
constant, or no relationship (correlation) between the
variables.
Positive: As the value of one variable increases, so does the
other.
Negative: As the value of one variable increases, the other
decreases.
Constant: As the value of one variable increases (or
decreases), the other remains constant.
No relationship: There is no pattern to the points.
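One common way to put a number on these descriptions is the Pearson correlation coefficient, which the slides do not define; it is used here only as an illustration. A value above 0 suggests a positive relationship, a value below 0 a negative one, and a value near 0 no linear relationship. The function name and sample data are hypothetical, and the sketch returns `None` when one variable is constant, because the coefficient is undefined in that case.

```python
import math

def correlation(xs, ys):
    """Pearson correlation of paired data, or None if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None   # a constant variable has no spread to correlate
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
print(correlation(xs, [2, 4, 6, 8, 10]))   # positive: y rises as x rises
print(correlation(xs, [10, 8, 6, 4, 2]))   # negative: y falls as x rises
print(correlation(xs, [3, 3, 3, 3, 3]))    # constant: None
```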
67. Advantages of Scatterplots
A scatterplot is one of the best ways to
determine if two characteristics are related.
A scatterplot may be used when there are
multiple trials for the same input variable in an
experiment.
68. Disadvantages of Scatterplots
When a scatterplot shows an association
between two variables, there is not necessarily
a cause and effect relationship. Both variables
could be related to some third variable that
explains their variation or there could be some
other cause. Alternatively, an apparent
association could simply be a result of chance.
69. Questions to Ask
Is there a relationship between the variables?
If so, what kind?
What predictions can you make about the data
based on the graph?