This document discusses measures of distribution shape, relative location, and detecting outliers in data. It covers skewness, z-scores, the empirical rule, and box plots. Skewness measures the asymmetry of a distribution. Z-scores indicate how many standard deviations a value is from the mean. The empirical rule states what percentage of values fall within 1, 2, or 3 standard deviations of the mean. Box plots provide a graphical summary of a dataset's distribution using quartiles.
2. Measures of Distribution Shape,
Relative Location, and Detecting Outliers
Distribution Shape
z-Scores
Empirical Rule
Detecting Outliers
3. Distribution Shape: Skewness
An important measure of the shape of a distribution
is called skewness.
The formula for the skewness of sample data is
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$
Skewness can be easily computed using statistical
software.
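As a minimal sketch of what such software does, the sample skewness formula can be coded directly. The function name `sample_skewness` and both data sets are made up for illustration; they are not from the slides.

```python
import math

def sample_skewness(data):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x_i - mean) / s)^3)."""
    n = len(data)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    mean = sum(data) / n
    # Sample standard deviation (divides by n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

# Hypothetical data sets, chosen only to show the sign of the measure.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(sample_skewness(symmetric))     # zero for perfectly symmetric data
print(sample_skewness(right_skewed))  # positive: long tail to the right
```

A long right tail pulls the cubed deviations positive, which is why the second data set gives a positive value.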
4. Distribution Shape: Skewness
Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
[Figure: relative frequency histogram of a symmetric distribution; Skewness = 0]
5. Distribution Shape: Skewness
Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.
[Figure: relative frequency histogram of a distribution moderately skewed left; Skewness = −.31]
6. Distribution Shape: Skewness
Moderately Skewed Right
• Skewness is positive.
• Mean will usually be more than the median.
[Figure: relative frequency histogram of a distribution moderately skewed right; Skewness = .31]
7. Distribution Shape: Skewness
Highly Skewed Right
• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.
[Figure: relative frequency histogram of a distribution highly skewed right; Skewness = 1.25]
10. z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data value x_i is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
Excel’s STANDARDIZE function can be used to compute the z-score.
11. z-Scores
An observation’s z-score is a measure of the relative
location of the observation in a data set.
A data value less than the sample mean will have a
z-score less than zero.
A data value greater than the sample mean will have
a z-score greater than zero.
A data value equal to the sample mean will have a
z-score of zero.
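The three cases above can be checked with a short sketch. The helper `z_scores` and the three-value data set are hypothetical; the sample standard deviation (dividing by n − 1) is assumed, matching the s in the z-score formula.

```python
import statistics

def z_scores(data):
    """Standardize each value: (x_i - sample mean) / sample standard deviation."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 in the denominator)
    return [(x - mean) / s for x in data]

# Hypothetical data: mean is 20, sample standard deviation is 10.
values = [10, 20, 30]
print(z_scores(values))  # below the mean -> negative, at the mean -> 0, above -> positive
```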
13. Empirical Rule
When the data are believed to approximate a
bell-shaped distribution with moderate skew …
The empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean.
The empirical rule is based on the normal distribution, which we will discuss later.
14. Empirical Rule
For data having a bell-shaped distribution, approximately
• 68.26% of the values are within +/- 1 standard deviation of its mean.
• 95.44% of the values are within +/- 2 standard deviations of its mean.
• 99.72% of the values are within +/- 3 standard deviations of its mean.
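These percentages can be reproduced from the standard normal distribution. The sketch below is an illustration only; the helper names are made up, and it uses the error function from Python's standard library. The exact values differ from the table-rounded figures above by about 0.01 percentage points.

```python
import math

def normal_cdf(z):
    """Cumulative probability P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def share_within(k):
    """Share of a normal population within +/- k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)

for k in (1, 2, 3):
    print(f"within +/- {k} standard deviation(s): {100 * share_within(k):.2f}%")
```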
16. Detecting Outliers
An outlier is an unusually small or unusually large
value in a data set.
A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included in the
data set
• a data value that has occurred by chance
17. Detecting Outliers
Example: Apartment Rents
• The most extreme z-scores are -1.20 and 2.27
• Using |z| > 3 as the criterion for an outlier, there
are no outliers in this data set.
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
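The |z| > 3 criterion can be applied to the standardized values listed above with a few lines of code. This is a sketch only; the variable names are illustrative, and the values are copied from the table.

```python
# Standardized values for the apartment rents, copied from the table above.
z_values = [
    -1.20, -1.11, -1.11, -1.02, -1.02, -1.02, -1.02, -1.02, -0.93, -0.93,
    -0.93, -0.93, -0.93, -0.84, -0.84, -0.84, -0.84, -0.84, -0.75, -0.75,
    -0.75, -0.75, -0.75, -0.75, -0.75, -0.56, -0.56, -0.56, -0.47, -0.47,
    -0.47, -0.38, -0.38, -0.34, -0.29, -0.29, -0.29, -0.20, -0.20, -0.20,
    -0.20, -0.11, -0.01, -0.01, -0.01, 0.17, 0.17, 0.17, 0.17, 0.35,
    0.35, 0.44, 0.62, 0.62, 0.62, 0.81, 1.06, 1.08, 1.45, 1.45,
    1.54, 1.54, 1.63, 1.81, 1.99, 1.99, 1.99, 1.99, 2.27, 2.27,
]

# Flag any value whose z-score is more than 3 standard deviations from the mean.
outliers = [z for z in z_values if abs(z) > 3]

print(min(z_values), max(z_values))  # the most extreme z-scores
print(outliers)                      # empty list: no outliers by this criterion
```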
18. Exploratory Data Analysis
Exploratory data analysis looks at simple methods to summarize data.
For now we simply sort the data values into ascending order, identify the five-number summary, and then construct a box plot.
19. Five-Number Summary
1 Smallest Value
2 First Quartile
3 Median
4 Third Quartile
5 Largest Value
21. Box Plot
A box plot is a graphical summary of data that is based on a five-number summary.
A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.
Box plots provide another way to identify outliers.
They also tell us whether the data are skewed.
22. Box Plot
Example: Apartment Rents
• A box is drawn with its ends located at the first and
third quartiles (Q1 & Q3).
• A vertical line is drawn in the box at the location of
the median (second quartile).
[Figure: box drawn on an axis running from 400 to 625, with Q1 = 445, Q2 = 475, and Q3 = 525]
23. Box Plot
Limits are located (not drawn) using the interquartile range (IQR = Q3 - Q1): they are 1.5(IQR) below Q1 and 1.5(IQR) above Q3.
Data outside these limits are considered outliers.
The location of each outlier is shown with the symbol * .
24. Box Plot
Example: Apartment Rents
• The lower limit is located 1.5(IQR) below Q1.
Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325
• The upper limit is located 1.5(IQR) above Q3.
Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645
• There are no outliers (values less than 325 or
greater than 645) in the apartment rent data.
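The limit calculation above can be sketched as a small helper. The function names `box_plot_limits` and `is_outlier` are hypothetical; the quartiles are the ones given for the apartment rent data.

```python
def box_plot_limits(q1, q3, k=1.5):
    """Return (lower, upper) limits located k * IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3):
    """A value is an outlier if it falls outside the box plot limits."""
    lower, upper = box_plot_limits(q1, q3)
    return value < lower or value > upper

# Quartiles from the apartment rent example: Q1 = 445, Q3 = 525.
lower, upper = box_plot_limits(445, 525)
print(lower, upper)  # 325.0 645.0
```

Any rent between the two limits, including values well away from the box itself, is not flagged.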
25. Box Plot
Example: Apartment Rents
• Whiskers (dashed lines) are drawn from the ends
of the box to the smallest and largest data values
inside the limits.
[Figure: box plot on an axis running from 400 to 625; smallest value inside limits = 425, largest value inside limits = 615]
26. Box Plot
A box plot is an excellent graphical technique for making comparisons among two or more groups.
27. Measures of Association
Between Two Variables
Thus far we have examined numerical methods used to summarize the data for one variable at a time.
Often a manager or decision maker is interested in the relationship between two variables.
Two descriptive measures of the relationship between two variables are covariance and correlation coefficient.
28. Covariance
The covariance is a measure of the linear association between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
29. Covariance
The covariance is computed as follows:
For populations: $\sigma_{xy} = \frac{(x_1 - \mu_x)(y_1 - \mu_y) + \cdots + (x_N - \mu_x)(y_N - \mu_y)}{N}$
For samples: $s_{xy} = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})}{n - 1}$
30. Correlation Coefficient
Correlation is a measure of linear association.
There are also other types of associations not captured by correlation.
31. Correlation Coefficient
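The correlation coefficient is computed as follows:
For samples: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
For populations: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
where the population variances are
$\sigma_x^2 = \frac{(x_1 - \mu_x)^2 + \cdots + (x_N - \mu_x)^2}{N}$ and $\sigma_y^2 = \frac{(y_1 - \mu_y)^2 + \cdots + (y_N - \mu_y)^2}{N}$
and the sample variances are
$s_x^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$ and $s_y^2 = \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n - 1}$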
32. Correlation Coefficient
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear relationship.
Values near +1 indicate a strong positive linear relationship.
The closer the correlation is to zero, the weaker the relationship.
33. Correlation
A Positive Relationship: correlation close to 1
[Figure: scatter plot of y against x showing a positive relationship]
36. Covariance and Correlation Coefficient
Example: Golfing Study
A golfer is interested in investigating the
relationship, if any, between driving distance and
18-hole score.
Average Driving Distance (yds.)    Average 18-Hole Score
277.6                              69
259.5                              71
269.1                              70
267.0                              70
255.6                              71
272.9                              69
37. Covariance and Correlation Coefficient
Example: Golfing Study
x          y      (x_i − x̄)   (y_i − ȳ)   (x_i − x̄)(y_i − ȳ)
277.6      69     10.65       -1.0        -10.65
259.5      71     -7.45       1.0         -7.45
269.1      70     2.15        0           0
267.0      70     0.05        0           0
255.6      71     -11.35      1.0         -11.35
272.9      69     5.95        -1.0        -5.95
Average    267.0  70.0                    Total  -35.40
Std. Dev.  8.2192 .8944
38. Covariance and Correlation Coefficient
Example: Golfing Study
• Sample Covariance
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$
• Sample Correlation Coefficient
$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(.8944)} = -.9631$
So, longer driving distance is associated with a lower score, and the linear relationship is very strong.
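The golfing calculation can be reproduced with a short sketch. The helper names are hypothetical; the data are the six driving distances and scores from the example.

```python
import math

# Golfing study data from the example.
distance = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]  # average driving distance (yds.)
score = [69, 71, 70, 70, 71, 69]                        # average 18-hole score

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_std(x):
    """Sample standard deviation: square root of the covariance of x with itself."""
    return math.sqrt(sample_covariance(x, x))

s_xy = sample_covariance(distance, score)
r_xy = s_xy / (sample_std(distance) * sample_std(score))

print(round(s_xy, 2))  # -7.08
print(round(r_xy, 4))  # about -0.963
```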
39. Random Variables
A random variable is a numerical description of the outcome of an experiment.
A discrete random variable may assume either a finite number of values or an infinite sequence of values.
A continuous random variable may assume any numerical value in an interval or collection of intervals.
40. Random Variables
Question                       Random Variable x                                    Type
Family size                    x = Number of dependents reported on tax return      Discrete
Distance from home to store    x = Distance in miles from home to the store site    Continuous
Own dog or cat                 x = 1 if own no pet;                                 Discrete
                                 = 2 if own dog(s) only;
                                 = 3 if own cat(s) only;
                                 = 4 if own dog(s) and cat(s)
41. Discrete Probability Distributions
The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
42. Discrete Probability Distributions
The probability distribution is defined by a probability function, denoted by f(x), which provides the probability for each value of the random variable.
The required conditions for a discrete probability function are:
f(x) ≥ 0 (probabilities are not negative)
Σf(x) = 1 (sum of all probabilities = 1)
Remember that any probability is a number between 0 and 1.
43. Expected Value
The expected value, or mean, of a random variable is a measure of its central location.
E(x) = µ = Σxf(x)
The expected value does not have to be a value the random variable can assume.
44. Variance and Standard Deviation
The variance summarizes the variability in the values of a random variable.
Var(x) = σ² = Σ(x − µ)²f(x)
The standard deviation, σ, is defined as the positive square root of the variance.
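The definitions of a valid probability function, expected value, and variance can be illustrated together. In this sketch the distribution `pmf` is invented for the example (it does not come from the slides), and the function names are illustrative.

```python
import math

# Hypothetical discrete distribution: value -> probability f(x).
pmf = {0: 0.40, 1: 0.25, 2: 0.20, 3: 0.05, 4: 0.10}

def is_valid_pmf(pmf, tol=1e-9):
    """Check the two required conditions: f(x) >= 0 and the f(x) sum to 1."""
    return all(p >= 0 for p in pmf.values()) and abs(sum(pmf.values()) - 1) < tol

def expected_value(pmf):
    """E(x) = sum of x * f(x)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(x) = sum of (x - mu)^2 * f(x)."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

mu = expected_value(pmf)
var = variance(pmf)
sigma = math.sqrt(var)  # standard deviation: positive square root of the variance
print(is_valid_pmf(pmf), mu, var, sigma)
```

Here the expected value is about 1.2, which is not one of the values 0 through 4 that the variable can assume.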
45. Binomial Probability Distribution
Four Properties of a Binomial Experiment
1. The experiment consists of a sequence of n
identical trials.
2. Two outcomes, success and failure, are possible
on each trial.
3. The probability of a success, denoted by p, does
not change from trial to trial.
4. The trials are independent.
47. Binomial Probability Distribution
Binomial Probability Function
$f(x) = \binom{n}{x} p^x (1 - p)^{(n - x)}$
where:
x = the number of successes
p = the probability of a success on one trial
n = the number of trials
f(x) = the probability of x successes in n trials
$\binom{n}{x} = \frac{n!}{(n - x)!\, x!} = \frac{1 \times 2 \times 3 \cdots \times n}{(1 \times 2 \times 3 \cdots \times (n - x))\,(1 \times 2 \times 3 \cdots \times x)}$
= ‘n choose x’ = number of ways x people can be chosen out of n
48. Binomial Probability Distribution
Binomial Probability Function
$f(x) = \binom{n}{x} p^x (1 - p)^{(n - x)}$
The first factor, $\binom{n}{x}$, is the number of experimental outcomes providing exactly x successes in n trials.
The second factor, $p^x (1 - p)^{(n - x)}$, is the probability of a particular sequence of trial outcomes with x successes in n trials.
These values are available in Table 5 of our textbook.
49. Binomial Probability Distribution
Example: IIT Entrance
It is known that about 10% of the examinees taking
the IIT entrance qualify.
Thus, for any examinee chosen at random, there is a
probability of 0.1 that the person will qualify.
Choosing 3 examinees at random, what is
the probability that exactly 1 of them will qualify?
50. Binomial Probability Distribution
Example: IIT Entrance
Using the probability function with p = .10, n = 3, x = 1:
$f(x) = \frac{n!}{x!(n - x)!} p^x (1 - p)^{(n - x)}$
$f(1) = \frac{3!}{1!(3 - 1)!} (0.1)^1 (0.9)^2 = 3(.1)(.81) = .243$
You can also check the binomial probability table in the textbook for n = 3, p = 0.1, x = 1.
Or, in Excel, use ‘=BINOMDIST(1,3,0.1,FALSE)’. The last argument returns just f(1) if FALSE, and the cumulative probability f(0)+f(1) if TRUE.
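The same calculation can be sketched in code. The function names `binomial_pmf` and `binomial_cdf` are made up for this illustration; `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_cdf(x, n, p):
    """Cumulative probability f(0) + f(1) + ... + f(x)."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

# IIT entrance example: p = .10, n = 3, x = 1.
f1 = binomial_pmf(1, 3, 0.1)      # exactly 1 of 3 qualifies
at_most_1 = binomial_cdf(1, 3, 0.1)  # f(0) + f(1)

print(round(f1, 3))         # 0.243
print(round(at_most_1, 3))  # 0.972
```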
51. Binomial Probability Distribution
Expected Value
E(x) = µ = np
Variance
Var(x) = σ² = np(1 − p)
Standard Deviation
$\sigma = \sqrt{np(1 - p)}$
52. Binomial Probability Distribution
Example: Evans Electronics
• Expected Value
E(x) = np = 3(.1) = .3 employees out of 3
• Variance
Var(x) = np(1 – p) = 3(.1)(.9) = .27
• Standard Deviation
$\sigma = \sqrt{3(.1)(.9)} = .52$ employees
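As a cross-check, the shortcut formulas np and np(1 − p) can be compared with the general definitions of expected value and variance applied to the binomial probabilities. This is an illustrative sketch; the helper name `binomial_pmf` is an assumption.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 3, 0.1

# General definitions: E(x) = sum of x f(x), Var(x) = sum of (x - mu)^2 f(x).
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
var = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))
std = math.sqrt(var)

print(round(mean, 2), round(n * p, 2))           # both 0.3
print(round(var, 2), round(n * p * (1 - p), 2))  # both 0.27
print(round(std, 2))                             # 0.52
```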
53. Poisson Probability Distribution
A Poisson distributed random variable is often useful in estimating the number of occurrences over a specified interval of time or space.
It is a discrete random variable that may assume an infinite sequence of values (x = 0, 1, 2, ...).
54. Poisson Probability Distribution
Examples of a Poisson distributed random variable:
• the number of defects in 14 pages of a book
• the number of customers arriving at the post office in one hour
Bell Labs used the Poisson distribution to model the arrival of phone calls.
55. Poisson Probability Distribution
Two Properties of a Poisson Experiment
1. The probability of an occurrence is the same for any two time intervals of equal length.
2. The occurrence or nonoccurrence in any time interval is independent of the occurrence or nonoccurrence in any other time interval.
56. Poisson Probability Distribution
Poisson Probability Function
$f(x) = \frac{\mu^x e^{-\mu}}{x!}$
where:
x = the number of occurrences in an interval
f(x) = the probability of x occurrences in an interval
µ = mean number of occurrences in an interval
e = 2.71828
These values are available in Table 7 of our textbook.
57. Poisson Probability Distribution
Poisson Probability Function
Since there is no stated upper limit for the number
of occurrences, the probability function f(x) is
applicable for values x = 0, 1, 2, … without limit.
In practical applications, x will eventually become
large enough so that f(x) is very small and negligible.
58. Poisson Probability Distribution
Example: Mercy Hospital
Patients arrive at the emergency room of Mercy
Hospital at the average rate of 6 per hour on
weekend evenings.
What is the probability of 4 arrivals in 30 minutes
on a weekend evening?
59. Poisson Probability Distribution
Example: Mercy Hospital
Using the probability function with µ = 6/hour = 3/half-hour, x = 4:
$f(4) = \frac{3^4 (2.71828)^{-3}}{4!} = .1680$
Or, simply check the table for Poisson probabilities in the book for μ = 3, x = 4.
In Excel, use ‘=POISSON(4,3,FALSE)’. The last argument returns just f(4) if FALSE, and the cumulative probability f(0)+f(1)+...+f(4) if TRUE.
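The Mercy Hospital calculation can be sketched in code as well. The function names `poisson_pmf` and `poisson_cdf` are chosen for this illustration only.

```python
import math

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean number is mu."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

def poisson_cdf(x, mu):
    """Cumulative probability f(0) + f(1) + ... + f(x)."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

# 6 arrivals per hour means mu = 3 for a 30-minute interval.
f4 = poisson_pmf(4, 3)
at_most_4 = poisson_cdf(4, 3)

print(round(f4, 4))         # 0.168
print(round(at_most_4, 4))  # cumulative probability of 4 or fewer arrivals
```

The key step is rescaling the hourly rate to the half-hour interval before applying the formula.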
60. Poisson Probability Distribution
Example: Mercy Hospital
[Figure: bar chart of Poisson probabilities (vertical axis: Probability, 0.00 to 0.25) against Number of Arrivals in 30 Minutes (0 to 10). The sequence actually continues: 11, 12, …]
61. Poisson Probability Distribution
A property of the Poisson distribution is that the mean and variance are equal.
µ = σ²
62. Continuous Probability Distributions
A continuous random variable can assume any value
in an interval on the real line or in a collection of
intervals.
It is not possible to talk about the probability of the
random variable assuming a particular value.
Instead, we talk about the probability of the random
variable assuming a value within a given interval.
63. We denote the ‘density function’ by f(x). Also
$f(x) \ge 0; \quad \int f(x)\,dx = 1$
$E(X) = \int x f(x)\,dx$
$Var(X) = \int (x - E(X))^2 f(x)\,dx$
64. Area as a Measure of Probability
The area under the graph of f(x) and probability are
identical.
This is valid for all continuous random variables.
The probability that x takes on a value between some
lower value x1 and some higher value x2 can be found
by computing the area under the graph of f(x) over
the interval from x1 to x2.
65. Normal Probability Distribution
The normal probability distribution is the most
important distribution for describing a continuous
random variable.
It is used in a wide variety of applications
including:
• Heights of people • Test scores
• Rainfall amounts • Scientific measurements
For a large number of similar variables that are unrelated (independent), their sum and average are approximately normally distributed.
66. Normal Probability Distribution
Normal Probability Density Function
$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}$
where:
µ = mean
σ = standard deviation
π = 3.14159
e = 2.71828
68. Normal Probability Distribution
Characteristics
The highest point on the normal curve is at the
mean, the middle point.
[Figure: normal curve over the x axis, with its highest point at the mean]
69. Normal Probability Distribution
Characteristics
The mean can be any numerical value: negative,
zero, or positive.
[Figure: normal curves over the x axis with means at -10, 0, and 25]
70. Normal Probability Distribution
Characteristics
The standard deviation determines the width of the
curve: larger values result in wider, flatter curves.
[Figure: two normal curves over the x axis, one with σ = 15 and a wider, flatter one with σ = 25]
71. Normal Probability Distribution
Characteristics
Probabilities for the normal random variable are
given by areas under the curve. The total area
under the curve is 1 (.5 to the left of the mean and
.5 to the right).
[Figure: normal curve over the x axis with area .5 on each side of the mean]
72. Normal Probability Distribution
Characteristics (basis for the empirical rule)
• 68.26% of values of a normal random variable are within +/- 1 standard deviation of its mean.
• 95.44% of values of a normal random variable are within +/- 2 standard deviations of its mean.
• 99.72% of values of a normal random variable are within +/- 3 standard deviations of its mean.
73. Normal Probability Distribution
Characteristics (basis for the empirical rule)
[Figure: normal curve centered at µ, with 68.26% of the area between µ – 1σ and µ + 1σ, 95.44% between µ – 2σ and µ + 2σ, and 99.72% between µ – 3σ and µ + 3σ]
74. Standard Normal Probability Distribution
Characteristics
A random variable having a normal distribution with a mean of 0 and a standard deviation of 1 is said to have a standard normal probability distribution.
75. Standard Normal Probability Distribution
Characteristics
The letter z is used to designate the standard
normal random variable.
[Figure: standard normal curve over the z axis, centered at 0 with σ = 1]
77. Example: Demand
The daily demand for the new iPad in a store seems
to follow a normal distribution with an average of
15 and a standard deviation of 6.
The manager, who does not want to keep more than
20 iPads in his store at a time, would like to know
the probability of a stockout, i.e. that the demand in
a day will exceed 20.
P(x > 20) = ?
78. Solving for the Stockout Probability
Step 1: Convert x to the standard normal distribution.
z = (x - µ)/σ
= (20 - 15)/6
= .83
79. Step 2: Find the area under the standard normal
curve to the left of z = .83.
Cumulative Probability Table for Standard Normal Distribution
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
. . . . . . . . . . .
.5 .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
.6 .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
.7 .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
.8 .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
.9 .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
. . . . . . . . . . .
P(z < .83) = .7967 (row .8, column .03)
These values are available in Table 1 of our textbook.
80. In Excel, use ‘=NORMDIST(0.83,0,1,TRUE)’. The last argument returns just the density f(0.83) if FALSE, and the area up to 0.83 if TRUE.
In fact, you can directly use ‘=NORMDIST(20,15,6,TRUE)’, which gives P(X ≤ 20) with μ = 15, σ = 6.
81. Standard Normal Probability Distribution
Solving for the Stockout Probability
Step 3: Compute the area under the standard normal curve to the right of z = .83.
P(z > .83) = 1 – P(z < .83)
= 1 - .7967
= .2033
This is the probability of a stockout, P(x > 20).
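The three steps can be sketched with Python's standard library, which provides `statistics.NormalDist` (Python 3.8 or later). The variable names are illustrative. Note that the table lookup rounds z to .83; using the unrounded z gives a slightly smaller probability.

```python
from statistics import NormalDist

mu, sigma = 15, 6   # daily demand: mean and standard deviation
stock = 20          # maximum number of units kept in the store

# Step 1: convert x to a z value.
z = (stock - mu) / sigma

# Steps 2 and 3 with z rounded to two decimals, as in the table lookup.
p_table = 1 - NormalDist().cdf(round(z, 2))

# Same probability without rounding z, directly from the demand distribution.
p_exact = 1 - NormalDist(mu, sigma).cdf(stock)

print(round(z, 2))        # 0.83
print(round(p_table, 4))  # 0.2033
print(round(p_exact, 4))  # about 0.202
```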
82. Standard Normal Probability Distribution
Solving for the Stockout Probability
[Figure: standard normal curve over the z axis; the area to the left of z = .83 is .7967 and the area to the right is 1 - .7967 = .2033]
These values are available in Table 1 of our textbook.
83. Standard Normal Probability Distribution
If the manager wants the probability of a stockout
during replenishment lead-time to be no more than
.05, what should the reorder point be?
---------------------------------------------------------------
(Hint: Given a probability, we can use the standard
normal table in an inverse fashion to find the
corresponding z value. Give it a try.)
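One way to check an answer to this exercise is the inverse cumulative function in `statistics.NormalDist` (Python 3.8 or later). This sketch assumes that demand during the lead-time has the same distribution as the daily demand above (mean 15, standard deviation 6); the slide does not state this, so treat it as an assumption.

```python
from statistics import NormalDist

# Assumption: lead-time demand is normal with mean 15 and standard deviation 6.
demand = NormalDist(mu=15, sigma=6)
max_stockout_probability = 0.05

# The reorder point is the demand level exceeded with probability .05,
# i.e. the value with cumulative probability .95.
z = NormalDist().inv_cdf(1 - max_stockout_probability)
reorder_point = demand.inv_cdf(1 - max_stockout_probability)

print(round(z, 3))              # about 1.645
print(round(reorder_point, 2))  # about 24.87
```

This is the table lookup done in reverse: find the z value whose cumulative area is .95, then convert back with x = µ + zσ.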