Ling-Chieh Kung / Statistics and Data Analysis for Engineers 123 (2016/9/4)

Statistics and Data Analysis for Engineers
Part 1:
Introduction and Descriptive Statistics
Ling-Chieh Kung
Department of Information Management
National Taiwan University
September 4, 2016

What is Statistics?
Many things are unknown...
Consumers’ tastes.
Quality of a product.
Stock prices.
The effectiveness of a new way of teaching/training.
Statistics is the science of collecting, analyzing, interpreting, and
presenting (numerical) data.
Ultimate goal (of Business Statistics): to achieve better decision making.
The study of Statistics includes:
Descriptive Statistics.
Probability.
Inferential Statistics: Estimation.
Inferential Statistics: Hypothesis testing.
Inferential Statistics: Prediction.
In summary: To estimate, test, and predict those unknowns.

My plan for today
Descriptive Statistics.
Visualization and summarization.
Inferential Statistics.
(Probability).
Hypothesis testing and p-value.
Regression analysis.
Case studies.

Road map
Basic concepts.
Data visualization.
Data summarization.

Populations vs. samples
A population is a collection of persons, objects, or items.
A census is to investigate the whole population.
A sample is a portion of the population.
Sampling is to investigate only a subset of the population.
We then use the information contained in the sample to infer (“guess”)
about the population.
What are samples for the following populations?
All students in NTU.
All students in the business school.
All chips made in one factory.
All consumers who have bought iPhone 6.
Two important questions:
Why sampling?
Is a sample representative?

Descriptive vs. inferential statistics
Descriptive statistics:
Graphical or numerical summaries of data.
Describing (visualizing or summarizing) a set of data.
Inferential statistics:
Making a “scientific guess” about unknowns.
Trying to say something about the population.
Which is descriptive and which is inferential?
Calculating the average height of 1000 randomly selected NTU students.
Using this number to estimate the average height of all NTU students.
Another example (pharmaceutical research):
All the potential patients form the population.
A group of randomly selected patients is a sample.
Use the result on the sample to infer the result on the population.

Parameters vs. statistics
A numerical summary of a population is a parameter.
The average height of all NTU students.
The expected coffee demand when the price is 50 NTD.
A numerical summary of a sample is a statistic.
The average height of all NTU male students.
The average coﬀee demand when the price is 50 NTD in the past 6 days.
Almost always people use a statistic to infer a parameter.
Some statistics are “good” while some are “bad.”

Parameters vs. statistics: an example
What is the average height of all NTU students?
While a census is possible, it is still quite costly.
It is natural to:
Sample some NTU students.
Calculate a statistic.
Use that statistic to estimate the average height (the parameter).
Some (good or bad) samples and statistics:
The average height of all students in this classroom.
The average height of 100 students randomly drawn from all students.
The maximum height of 100 students randomly drawn from all students.
The sum of heights of 100 students randomly drawn from all students.
The average height of 60 male and 40 female students randomly drawn
from the population.

Levels of data measurement
Most data we will play with are numerical.
Numerical data may be categorized into three levels:
Nominal.
Ordinal.
Quantitative: interval or ratio.

Nominal level
A nominal scale classifies data into categories with no ranking.
Data are labels or names used to identify an attribute of the element.
The labels may be numeric or non-numeric.
Examples:
Categorical variable    Values (categories)
Laptop ownership        Yes / No
Citizenship             Taiwan / Japan / ...
Country code            886 / 86 / 1 / ...
Arithmetic operations cannot be applied on nominal data.

Ordinal level
An ordinal scale classifies data into categories with ranking.
The order or rank of the data is meaningful.
However, differences between numerical labels do not imply
distances.
Examples:
Categorical variable    Values (categories)
Product satisfaction    Satisfied, neutral, unsatisfied
Professor rank          Full, associate, assistant
Ranking of scores       1, 2, 3, 4, ...
It is still not meaningful to do arithmetic on ordinal data.
Assistant + associate = full?!
The grade difference between no. 1 and no. 5 may not be equal to that
between no. 11 and no. 15.

Quantitative (interval and ratio) levels
An interval scale is an ordered scale in which the difference between
measurements is a meaningful quantity but the measurements do not
have a true zero point.
A ratio scale is an ordered scale in which the difference between
measurements is a meaningful quantity and the measurements have a
true zero point.
Ratio data appear more often in the world.
Heights, weights, income, prices.
Interval data are actually rare.
Degrees in Celsius or Fahrenheit.
GRE or GMAT scores.
How about degrees in Kelvin?

Some remarks
Nominal and ordinal data are called qualitative data.
Interval and ratio data are called quantitative data.
Most statistical methods are for quantitative data; some are for
qualitative data.
Distinguishing nominal and ordinal scales is important.
Distinguishing interval and ratio scales is not.
Sometimes qualitative data are called categorical data.
Sometimes quantitative data are called numeric data.

A short summary
Understand these terms:
Populations vs. samples.
Parameters vs. statistics.
Inferential statistics vs. descriptive statistics.
For each scale of measurement, is it meaningful to calculate the
following numbers?
Level Ranking Distance
Nominal No No
Ordinal Yes No
Quantitative Yes Yes

Road map
Basic concepts.
Data visualization.
Data summarization.

An example
For each day in 2011 and 2012, we record
the number of daily rentals of the public
bike rental system in Washington, D.C.
985, 801, 1349, 1562, 1600, 1606, 1510, ...,
1341, 1796, and 2729.
The smallest and largest numbers are 22
and 8714, respectively.
How do we get a feel for 731 numbers?
date rental
2011/1/1 985
2011/1/2 801
2011/1/3 1349
2011/1/4 1562
2011/1/5 1600
2011/1/6 1606
2011/1/7 1510
...
2012/12/29 1341
2012/12/30 1796
2012/12/31 2729

Frequency distributions
The original 731 numbers form a set of ungrouped data.
We start by grouping them into a frequency distribution.
A frequency distribution is a set of grouped data presented in the form of class intervals and frequencies.
Let’s create an intuitive frequency distribution.

Frequency distributions: an example
The resulting classes:
Class Class interval (Which means)
1 [0, 1000) 0 ≤ x < 1000
2 [1000, 2000) 1000 ≤ x < 2000
3 [2000, 3000) 2000 ≤ x < 3000
...
8 [7000, 8000) 7000 ≤ x < 8000
9 [8000, 9000) 8000 ≤ x < 9000
How about [0, 999], [1000, 1999], etc.?
How about (0, 1000], (1000, 2000], etc.?

Frequency distributions: an example
Then we count to get the frequency
distribution at the right.
This is a set of grouped data.
Some remarks:
Typically we have 5 to 15 classes.
Typically all classes have the same
width.
Be aware of class endpoints! Classes
should NOT overlap with each other.
If there are outliers, they should be
removed first.
Class interval Frequency
[0, 1000) 18
[1000, 2000) 80
[2000, 3000) 74
[3000, 4000) 107
[4000, 5000) 166
[5000, 6000) 106
[6000, 7000) 86
[7000, 8000) 82
[8000, 9000) 12
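The counting step can be sketched in Python as follows. This is only an illustration: the function name `class_interval` and its `width` parameter are assumptions made here, and the list contains just the few rental counts quoted in the example, not the full 731-day data set.

```python
from collections import Counter

def class_interval(x, width=1000):
    """Return the half-open class [lower, upper) that contains x."""
    lower = (x // width) * width
    return (lower, lower + width)

# A few of the daily rental counts quoted in the example (not the full data set).
rentals = [985, 801, 1349, 1562, 1600, 1606, 1510, 1341, 1796, 2729, 22, 8714]

# Count how many values fall in each class.
frequency = Counter(class_interval(x) for x in rentals)
for (lower, upper), count in sorted(frequency.items()):
    print(f"[{lower}, {upper}): {count}")
```

Because each class is closed on the left and open on the right, a value of exactly 1000 lands in [1000, 2000) only, so the classes never overlap.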

Something more
We may add class midpoints, relative frequencies, and
cumulative frequencies into a frequency table:
Class interval    Frequency    Class midpoint    Relative frequency    Cumulative frequency
[0, 1000)         18           500               2.46%                 18
[1000, 2000)      80           1500              10.94%                98
[2000, 3000)      74           2500              10.12%                172
[3000, 4000)      107          3500              14.64%                279
[4000, 5000)      166          4500              22.71%                445
[5000, 6000)      106          5500              14.50%                551
[6000, 7000)      86           6500              11.76%                637
[7000, 8000)      82           7500              11.22%                719
[8000, 9000)      12           8500              1.64%                 731
How about cumulative relative frequencies?
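As a minimal sketch, the extra columns can be derived from the frequency column alone. The variable names below are illustrative; the frequencies are the nine class counts from the table, and the cumulative relative frequency is simply the cumulative frequency divided by the total.

```python
from itertools import accumulate

# Frequencies of the nine classes [0, 1000), ..., [8000, 9000) from the table.
frequencies = [18, 80, 74, 107, 166, 106, 86, 82, 12]
total = sum(frequencies)

midpoints = [1000 * k + 500 for k in range(len(frequencies))]
relative = [f / total for f in frequencies]
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / total for c in cumulative]

for m, f, r, c, cr in zip(midpoints, frequencies, relative,
                          cumulative, cumulative_relative):
    print(f"{m:5d} {f:4d} {r:7.2%} {c:4d} {cr:8.2%}")
```

The last cumulative relative frequency is always 100%, which is a quick check that no observation was dropped.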

Histograms
A frequency distribution may be depicted as a histogram.
Interval Freq.
[0, 1000) 18
[1000, 2000) 80
[2000, 3000) 74
[3000, 4000) 107
[4000, 5000) 166
[5000, 6000) 106
[6000, 7000) 86
[7000, 8000) 82
[8000, 9000) 12
It consists of a series of contiguous rectangles, each representing the
frequency in a class.

Histograms
Histograms may be the most important type of data graphs.
One particular reason to draw histograms is to get some ideas about
the distribution.
Bell shape? M shape? Skewed?
Any outlier?
We will discuss distributions in more detail.

Frequency polygons
Alternatively, we may draw a frequency polygon by using line
segments connecting dots plotted at class midpoints.
The information contained in a frequency polygon is quite similar to that
contained in a histogram.

Frequency polygons
It is more convenient to use a frequency polygon to compare
multiple frequency distributions.
Both: Uni-modal and symmetric.
2011: Bi-modal and skewed to the right (right-tailed).
2012: Uni-modal and skewed to the left (left-tailed).
Warning: People may misinterpret a frequency polygon as a line
chart (for data with a time sequence).

Line charts
A line chart is useful in depicting a time series data set.
A two-dimensional data set whose first dimension (the x-axis) is for
labels of time points.
It visualizes how a quantity changes as time goes by.
For our monthly bike rentals:

Pie charts
A pie chart is a circular depiction of data where each slice represents
the percentage of the corresponding category.
It visualizes relative frequency distributions well.
For our bike rental data set:
What are the proportions of rentals in the four seasons?
What are the proportions of rentals on the seven days of a week?

A pie chart for seasonal rentals
Season Total rentals Proportion
Winter (12/20-3/20) 471348 14.3%
Spring (3/21-6/20) 918589 27.9%
Summer (6/21-9/20) 1061129 32.2%
Fall (9/21-12/20) 841613 25.6%

A pie chart for rentals among weekdays
Day Total rentals
Sunday 444027
Monday 455503
Tuesday 469109
Wednesday 473048
Thursday 485395
Friday 487790
Saturday 477807

Data not appropriate for pie charts
Pie charts are used to visualize proportions, i.e., subtotals over the
overall total.
They should not be used to compare averages.
The total numbers of rentals made by male and female users are
appropriate for a pie chart.
The average numbers of rentals per male and female users are not
appropriate for a pie chart.

Bar charts
Pie charts are useful in visualizing the proportion of each category.
To show the differences among categories, a bar chart is a
better choice.
The larger the category, the longer the bar.
Some people draw bars vertically; some horizontally.

Bar charts
Let’s replace the pie chart with a bar chart.
Day Total rentals
Sunday 444027
Monday 455503
Tuesday 469109
Wednesday 473048
Thursday 485395
Friday 487790
Saturday 477807
Note that the y-axis does not start at 0!

Bar charts vs. histograms
What are the differences that distinguish a bar chart from a histogram?
A bar chart uses noncontiguous bars to visualize categorical data.
A histogram uses contiguous bars to visualize quantitative data.

Visualizing two variables
When we have data for two variables, typically we want to identify
whether there is any relationship between them.
Visualizing the data in a two-dimensional manner helps.
When the two variables are both measured on quantitative scales, we may
depict each observation as a point on a plane to create a scatter plot.
For our bike rental example:
How do monthly rentals in 2011 and those in 2012 relate to each other?
How do daily casual and registered rentals relate to each other?

Monthly rentals in 2011 and 2012
Month 2011 2012
1 38189 96744
2 48215 103137
3 64045 164875
4 94870 174224
5 135821 195865
6 143512 202830
7 141341 203607
8 136691 214503
9 127418 218573
10 123511 198841
11 102167 152664
12 87323 123713

Road map
Basic concepts.
Data visualization.
Data summarization.

Summarizing the data with numbers
Descriptive Statistics includes some common ways to describe data.
Summarization with numbers.
Visualization with graphs.
This is always the first step of any data analysis project: To get
intuitions that guide our directions.
Here we talk about summarization.
For a set of (a lot of) numbers, we use a few numbers to summarize them.
For a population: these numbers are parameters.
For a sample: these numbers are statistics.
We will talk about three things:
Measures of central tendency for the center or middle part of data.
Measures of variability for how variable the data are.
Measures of correlation for the relationship between two variables.

Medians
The median is the middle value in an ordered set of numbers.
Roughly speaking, half of the numbers are below and half are above it.
Suppose there are N numbers:
If N is odd, the median is the $\frac{N+1}{2}$th largest number.
If N is even, the median is the average of the $\frac{N}{2}$th and the
$(\frac{N}{2} + 1)$th largest numbers.
For example:
The median of {1, 2, 4, 5, 6, 8, 9} is 5.
The median of {1, 2, 4, 5, 6, 8} is $\frac{4+5}{2} = 4.5$.

Medians
A median is unaffected by the magnitude of extreme values:
The median of {1, 2, 4, 5, 6, 8, 9} is 5.
The median of {1, 2, 4, 5, 6, 8, 900} is still 5.
Medians may be calculated from quantitative or ordinal data.
They cannot be calculated from nominal data.
Unfortunately, a median uses only part of the information contained in
these numbers.
For quantitative data, a median only treats them as ordinal.
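The odd/even rule for the median can be written out directly. The sketch below is one straightforward way to do it in Python (the function name `median` is chosen here for illustration; Python's standard library offers `statistics.median` with the same behavior on these inputs).

```python
def median(values):
    """Median: the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]   # the (N+1)/2-th number, counted from 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median([1, 2, 4, 5, 6, 8, 9]))     # odd N
print(median([1, 2, 4, 5, 6, 8]))        # even N
print(median([1, 2, 4, 5, 6, 8, 900]))   # an extreme value does not move the median
```

Only the order of the numbers is used, which is why the same function would also work on ordinal data coded as ranks.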

Means
The mean is the average of a set of data.
Can be calculated only from quantitative data.
The mean of {1, 2, 4, 5, 6, 8, 9} is
$\frac{1 + 2 + 4 + 5 + 6 + 8 + 9}{7} = 5$.
A mean uses all the information contained in the numbers.
Unfortunately, a mean will be affected by extreme values.
The mean of {1, 2, 4, 5, 6, 8, 900} is $\frac{1+2+4+5+6+8+900}{7} \approx 132.29$!
Using the mean and median simultaneously can be a good idea.
We should try to identify outliers (extreme values that seem to be
“strange”) before calculating a mean (or any statistics).
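To see why reporting both numbers is useful, the following short sketch computes the mean and the median of the two example sets with Python's standard `statistics` module. The variable names are illustrative.

```python
import statistics

clean = [1, 2, 4, 5, 6, 8, 9]
with_extreme = [1, 2, 4, 5, 6, 8, 900]

for data in (clean, with_extreme):
    print(data,
          "mean =", round(statistics.mean(data), 2),
          "median =", statistics.median(data))
```

A large gap between the mean and the median, as in the second set, is a cheap signal that some extreme value deserves a second look.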

Population means vs. sample means
Let $\{x_i\}_{i=1,...,N}$ be a population with N as the population size. The
population mean is
$\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}$.
Let $\{x_i\}_{i=1,...,n}$ be a sample with n < N as the sample size. The
sample mean is
$\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}$.
People use $\mu$ and $\bar{x}$ in almost the whole statistics world.

Population means vs. sample means
$\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}$, $\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}$.
Aren’t these two means the same?
From the perspective of calculation, yes.
From the perspective of statistical inference, no.
Typically the population mean is fixed but unknown.
The sample mean is random: We may get different values of $\bar{x}$ today
and tomorrow.
To start from $\bar{x}$ and use inferential statistics to estimate or test $\mu$, we
need to apply probability.

Quartiles and percentiles
The median lies at the middle of the data.
The first quartile lies at the middle of the first half of the data.
The third quartile lies at the middle of the second half of the data.
For the pth percentile:
$\frac{p}{100}$ of the values are below it.
$1 - \frac{p}{100}$ of the values are above it.
Median, quartiles, and percentiles:
The 25th percentile is the first quartile.
The 50th percentile is the median (and the second quartile).
The 75th percentile is the third quartile.

Modes
The mode(s) is (are) the most frequently occurring value(s) in a set
of qualitative data.
In the set {A, A, A, B, B, C, D, E, F, F, F, G, H}, the modes are A and F.
The frequency of the modes (A and F) is 3.
Though the above definition may also be applied to quantitative data,
sometimes it is useless.
In many cases, all values are modes!
For quantitative data, we instead look for the modal class(es).

Modal classes
In a baseball team, players’ heights
(in cm) are:
178 172 175 184
172 175 165 178
177 175 180 182
177 183 180 178
179 162 170 171
For the classes [160, 165), [165, 170),
..., and [185, 190), the modal class is
[175, 180).
We sometimes say the mode of this
set is 177.5.
The way of grouping matters!
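A small sketch makes the last point concrete. The helper `modal_class` below is a hypothetical function written for this illustration; it groups the 20 heights into half-open classes of a given width and reports the most frequent class. The second call, with a width of 10 cm, is an extra grouping chosen here and is not in the slides.

```python
from collections import Counter

heights = [178, 172, 175, 184,
           172, 175, 165, 178,
           177, 175, 180, 182,
           177, 183, 180, 178,
           179, 162, 170, 171]

def modal_class(values, start, width):
    """Return (lower, upper, count) of the most frequent class [lower, upper)."""
    counts = Counter((v - start) // width for v in values)
    k, count = counts.most_common(1)[0]
    lower = start + k * width
    return lower, lower + width, count

print(modal_class(heights, start=160, width=5))    # classes [160, 165), [165, 170), ...
print(modal_class(heights, start=160, width=10))   # a coarser grouping
```

With 5 cm classes the modal class is [175, 180) with midpoint 177.5; with 10 cm classes it becomes [170, 180) with midpoint 175, so the reported "mode" depends on the grouping.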

Variability
Measures of variability describe the spread or dispersion of a set
of data.
Especially important when two sets of data have the same center.

Ranges and Interquartile ranges
The range of a set of data $\{x_i\}_{i=1,...,N}$ is the difference between the
maximum and minimum numbers, i.e.,
$\max_{i=1,...,N} \{x_i\} - \min_{i=1,...,N} \{x_i\}$.
The interquartile range of a set of data is the difference between the third
and first quartiles.
It is the range of the middle 50% of the data.
It excludes the effects of extreme values.
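The contrast between the two measures can be sketched as follows. The quartile rule used here (the median of each half, leaving out the middle value when the count is odd) is one common convention and is an assumption of this example; other conventions give slightly different quartiles. The function name `quartiles` is illustrative.

```python
import statistics

def quartiles(values):
    """Q1, median, Q3 using the 'median of each half' convention."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]          # first half (middle value excluded when n is odd)
    upper = ordered[(n + 1) // 2:]     # second half
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [1, 2, 4, 5, 6, 8, 900]
q1, q2, q3 = quartiles(data)
print("range =", max(data) - min(data))   # dominated by the extreme value 900
print("IQR   =", q3 - q1)                 # not affected by it
```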

Deviations from the mean
Consider a set of population data
$\{x_i\}_{i=1,...,N}$ with mean $\mu$.
Intuitively, a way to measure the
dispersion is to examine how each number
deviates from the mean.
For $x_i$, the deviation from the population
mean is defined as
$x_i - \mu$.
For a sample, the deviation from the
sample mean of $x_i$ is
$x_i - \bar{x}$.
i     x_i   Deviation
1     1     1 − 5 = −4
2     2     2 − 5 = −3
3     4     4 − 5 = −1
4     5     5 − 5 = 0
5     6     6 − 5 = 1
6     8     8 − 5 = 3
7     9     9 − 5 = 4
Mean  5

Mean deviations
May we summarize the N deviations into
a single number to summarize the
aggregate deviation?
Intuitively, we may sum them up and then
calculate the mean deviation:
$\frac{\sum_{i=1}^{N} (x_i - \mu)}{N}$.
Is it always 0?
i     x_i   Deviation
1     1     1 − 5 = −4
2     2     2 − 5 = −3
3     4     4 − 5 = −1
4     5     5 − 5 = 0
5     6     6 − 5 = 1
6     8     8 − 5 = 3
7     9     9 − 5 = 4
Mean  5     0
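The answer is yes. A short derivation, using only the definition of the population mean $\mu$:

```latex
\frac{\sum_{i=1}^{N} (x_i - \mu)}{N}
  = \frac{\sum_{i=1}^{N} x_i}{N} - \frac{N\mu}{N}
  = \mu - \mu
  = 0.
```

Positive and negative deviations always cancel, so the plain mean deviation cannot distinguish a tightly packed data set from a widely spread one.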

Adjusting mean deviations
People use two ways to adjust
mean deviations:
Mean absolute deviations/errors
(MAD):
$\frac{\sum_{i=1}^{N} |x_i - \mu|}{N}$.
Mean squared deviations/errors
(variance or MSE):
$\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$.
A larger MAD or variance means
that the data are more dispersed.
i     x_i   d_i   |d_i|   d_i^2
1     1     −4    4       16
2     2     −3    3       9
3     4     −1    1       1
4     5     0     0       0
5     6     1     1       1
6     8     3     3       9
7     9     4     4       16
Mean  5     0     2.29    7.43

MAD vs. variance
The main difference:
An MAD puts the same weight on all values.
A variance puts more weight on extreme values.
They may give different ranks of dispersion:
i     x_i   d_i   |d_i|   d_i^2
1     0     −5    5       25
2     4     −1    1       1
3     5     0     0       0
4     6     1     1       1
5     10    5     5       25
Mean  5     0     2.4     10.4
i     x_i   d_i   |d_i|   d_i^2
1     1     −4    4       16
2     2     −3    3       9
3     5     0     0       0
4     8     3     3       9
5     9     4     4       16
Mean  5     0     2.8     10
In general, people use variances more than MADs.
But MADs are still popular in some areas, e.g., demand forecasting.
It is the analyst’s discretion to choose the appropriate one.
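The two tables above can be reproduced with a few lines of Python. The function names `mad` and `variance` and the labels A and B are chosen here for illustration; both functions divide by N, matching the population formulas in the slides.

```python
def mad(values):
    """Mean absolute deviation from the mean."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)

def variance(values):
    """Mean squared deviation from the mean (population variance)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

set_a = [0, 4, 5, 6, 10]   # first table
set_b = [1, 2, 5, 8, 9]    # second table

for name, data in (("A", set_a), ("B", set_b)):
    print(name, "MAD =", mad(data), "variance =", variance(data))
```

Set A has the smaller MAD but the larger variance, because squaring gives its two extreme values (0 and 10) more weight.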

Standard deviations
One drawback of using variances is that the unit of measurement is the
square of the original one.
For the baseball team, the (population) variance of
member heights is about 30.73 cm$^2$. What is it?!
People take the square root of a variance
to generate a standard deviation.
The standard deviation of member heights
is $\sqrt{30.73} \approx 5.54$ cm.
178 172 175 184
172 175 165 178
177 175 180 182
177 183 180 178
179 162 170 171
A standard deviation typically has more managerial implications.

Population vs. sample variances
Recall that the formulas for population and sample means are
$\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}$ and $\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}$, respectively.
Formula-wise there is no difference.
However, population and sample variances are
$\sigma^2 \equiv \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ and $s^2 \equiv \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, respectively.
Note the difference between N and n − 1!
Population and sample standard deviations are
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$ and $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, respectively.
People use $\sigma^2$, $\sigma$, $s^2$, and $s$ in almost the whole statistics world.
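Python's standard `statistics` module keeps the two definitions apart: `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by n − 1. The sketch below applies both to the seven numbers used earlier; treating them once as a population and once as a sample is purely for illustration.

```python
import statistics

data = [1, 2, 4, 5, 6, 8, 9]

print("population variance (divide by N):", statistics.pvariance(data))
print("sample variance (divide by n - 1):", statistics.variance(data))
print("population standard deviation:    ", statistics.pstdev(data))
print("sample standard deviation:        ", statistics.stdev(data))
```

The sum of squared deviations is 52 in both cases; only the divisor changes (7 versus 6), so the sample variance is always the larger of the two.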

Coefficient of variation
The coefficient of variation is the ratio of the standard deviation to
the mean:
Coefficient of variation = $\frac{\sigma}{\mu}$.
When will you use coefficients of variation?

z-scores
Consider a set of sample data $\{x_i\}_{i=1,...,n}$ with sample mean $\bar{x}$ and
sample standard deviation $s$. For $x_i$, the z-score is
$z_i = \frac{x_i - \bar{x}}{s}$.
In a set of population data $\{x_i\}_{i=1,...,N}$ with population mean $\mu$ and
population standard deviation $\sigma$, the z-score of $x_i$ is
$z_i = \frac{x_i - \mu}{\sigma}$.
A value’s z-score measures how many standard deviations it
deviates from the mean.

z-scores vs. outliers
For detecting outliers, one common way is to double-check whether $x_i$ is
an outlier if
$|z_i| = \left| \frac{x_i - \mu}{\sigma} \right| > 3$.
It is quite rare for the magnitude of a value’s z-score to be so large.
For sample data, use $\frac{x_i - \bar{x}}{s}$.
Some people propose using the median and MAD in a similar way:
double-check whether $x_i$ is an outlier if (see footnote 1)
$\left| \frac{x_i - \text{median}}{\text{MAD}} \right| > 3$.
The above rules only suggest that one investigate some extreme values
again. These rules are neither sufficient nor necessary for outliers.
Footnote 1: The “MAD” here can be the mean absolute deviation from the mean, the mean
absolute deviation from the median, the median absolute deviation from the median, etc.
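A minimal sketch of the z-score rule for sample data follows. The function name `flag_by_z_score` is an assumption of this example, and the value 250 is a hypothetical data-entry error appended to the 20 player heights from the modal-class slide; it is not part of the original data.

```python
import statistics

def flag_by_z_score(values, threshold=3.0):
    """Return the values whose sample z-score exceeds the threshold in magnitude."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation (n - 1)
    return [x for x in values if abs((x - x_bar) / s) > threshold]

# The 20 player heights, plus one hypothetical erroneous entry (250 cm).
heights = [178, 172, 175, 184, 172, 175, 165, 178, 177, 175,
           180, 182, 177, 183, 180, 178, 179, 162, 170, 171, 250]

print(flag_by_z_score(heights))        # values worth a second look
print(flag_by_z_score(heights[:-1]))   # the original 20 heights alone
```

Note that the extreme value itself inflates $\bar{x}$ and $s$, so in very small samples no value can reach |z| > 3; this is one reason the rule is only a prompt to investigate, not a definition of an outlier.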

Correlation
Consider the size of a house and its price in a city:
Size (in m$^2$)    Price (in $1000)
75 315
59 229
85 355
65 261
72 234
46 216
107 308
91 306
75 289
65 204
88 265
59 195
How do we measure/describe the correlation (linear relationship)
between the two variables?

Intuition
Consider a set of paired data
$\{(x_i, y_i)\}_{i=1,...,N}$.
When one variable goes up, does
the other one tend to go up or
down?
More precisely, if $x_i$ is larger than
$\mu_x$ (the mean of the $x_i$s), is it more
likely to see $y_i > \mu_y$ or $y_i < \mu_y$?
If the other variable also tends to go up, we say that the two variables have
a positive correlation.
If one goes up when the other goes
down, there is a negative
correlation.

Covariances
We define the covariance of a set of two-dimensional (sample) data as
$s_{xy} \equiv \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$.
If most points fall in the first and third quadrants (with the origin placed at
$(\bar{x}, \bar{y})$), most $(x_i - \bar{x})(y_i - \bar{y})$ will be positive and $s_{xy}$ tends to be positive.
Otherwise, $s_{xy}$ tends to be negative.
So the covariance of house size and price is 617.16.
Is it large or small?
This depends on how variable the two variables themselves are.

Pearson’s correlation coefficients
To take away the auto-variability of each variable itself, we define the
sample correlation coefficient as
$r \equiv \frac{s_{xy}}{s_x s_y}$,
where $s_x$ and $s_y$ are the sample standard deviations of the $x_i$s and $y_i$s.
The population correlation coefficient is defined analogously with population quantities.
In our example, we have $r = \frac{617.16}{16.78 \times 50.45} \approx 0.729$.
It can be shown that we always have $-1 \leq r \leq 1$.
r > 0: Positive correlation.
r = 0: No correlation.
r < 0: Negative correlation.
People often determine the degree of correlation based on |r|:
0 ≤ |r| < 0.25: A weak correlation.
0.25 ≤ |r| < 0.5: A moderately weak correlation.
0.5 ≤ |r| < 0.75: A moderately strong correlation.
0.75 ≤ |r| ≤ 1: A strong correlation.
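The house example can be checked with a short script. The sketch below uses the 12 (size, price) pairs from the table and the sample formulas (divisor n − 1); the function name `sample_covariance` is chosen here for illustration.

```python
import math

size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]              # in m^2
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]  # in $1000

def sample_covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

s_xy = sample_covariance(size, price)
s_x = math.sqrt(sample_covariance(size, size))     # covariance of x with itself is its variance
s_y = math.sqrt(sample_covariance(price, price))
r = s_xy / (s_x * s_y)

print(round(s_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 3))
```

Dividing by $s_x s_y$ removes the units (m$^2$ times $1000), which is what makes r comparable across data sets while the covariance is not.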

Correlation vs. independence
A correlation coefficient only measures how one variable linearly
depends on the other variable.
[Figure: two scatter plots, one with r = 0.5973 and one with r = 0.]
Being uncorrelated does not mean being independent!

Correlation vs. causation
A correlation coefficient only measures whether two variables correlate
with each other. High correlation does not mean causation.
A causes B or B causes A? C causes A and B? Or just by chance?

Correlation of qualitative variables
Sometimes the variables are not quantitative/numeric.
For ordinal data, we calculate their Spearman’s rank correlation.
For nominal data, we calculate Cramer’s V.

Part 2:
Hypothesis Testing and p-value
Ling-Chieh Kung
September 4, 2016

Road map
Sampling.
Sampling distributions.
Hypothesis testing.
p-value, t test, and more.

Random vs. nonrandom sampling
Sampling is the process of selecting a subset of entities from the whole
population.
Sampling can be random or nonrandom.
If random, whether an entity is selected is probabilistic.
Randomly select 1000 phone numbers on the telephone book and then
call them.
If nonrandom, it is deterministic.
Ask all your classmates for their preferences on iOS/Android.
Most statistical methods are only for random sampling.
Some popular random sampling techniques:
Simple random sampling.
Stratified random sampling.
Cluster (or area) random sampling.

Simple random sampling
In simple random sampling, each entity has the same probability of
being selected.
The good part of simple random sampling is its simplicity.
However, it may result in nonrepresentative samples.
In simple random sampling, it is possible that too many of the
sampled data fall in the same stratum.
They have the same property.
E.g., it is possible that all randomly sampled voters are younger than 40.
The sample is thus nonrepresentative.
How to fix this problem?

Stratified random sampling
We may apply stratified random sampling.
We first split the whole population into several strata.
Data in one stratum should be (relatively) homogeneous.
Data in different strata should be (relatively) heterogeneous.
We then use simple random sampling for each stratum.

As an example, suppose that we want to sample 40 out of 1000
graduates to understand the number of credits they get at school.
Suppose that 100 students double majored, then we can split the whole
population into two strata:
Stratum            Stratum size
Double major       100
No double major    900
To sample 40 graduates, we sample $40 \times \frac{100}{1000} = 4$ from the
double-major stratum and 36 from the other stratum.
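The allocation above, followed by simple random sampling within each stratum, might be sketched like this. Everything named here is illustrative: the function `proportional_allocation` is hypothetical, graduates are stood in for by ID numbers, and the integer division assumes the proportional shares come out as whole numbers, as they do in this example.

```python
import random

def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes.
    Assumes each share is a whole number, as in the example."""
    total = sum(stratum_sizes.values())
    return {name: sample_size * size // total
            for name, size in stratum_sizes.items()}

strata = {"double major": 100, "no double major": 900}
allocation = proportional_allocation(strata, 40)
print(allocation)

# Simple random sampling within each stratum; graduates are represented by
# hypothetical ID numbers 0..size-1 for illustration only.
rng = random.Random(0)
sample = {name: rng.sample(range(strata[name]), k)
          for name, k in allocation.items()}
print({name: len(ids) for name, ids in sample.items()})
```

When the shares are not whole numbers, some rounding rule is needed so that the stratum sample sizes still add up to the intended total.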

We may further split the population into more strata.
Double major: Yes or no.
Class: 1994-1998, 1999-2003, 2004-2008, or 2009-2012.
This stratification makes sense only if students in different classes tend
to take different numbers of units.
Stratified random sampling is good at reducing sampling error.
But it can be hard to identify a reasonable stratification.
It is also more costly and time-consuming.

Cluster (or area) random sampling
Imagine that you are going to introduce a new product into all the
retail stores in Taiwan.
If the product is actually unpopular, an introduction with a large
quantity will incur a huge loss.
How to get an idea about the popularity?
Typically we first try to introduce the product in a small area. We
put the product on the shelves only in those stores in the specified area.
This is the idea of cluster (or area) random sampling.
Those consumers in the area form a sample.

Cluster (or area) random sampling
In cluster random sampling, we define clusters.
We will only choose one or some clusters and then collect all the
data in these clusters.
If a cluster is too large, we may further split it into multiple
second-stage clusters.
Therefore, we want data in a cluster to be heterogeneous, and data
across clusters somewhat homogeneous.
For example, people may do cluster random sampling to understand
the popularity of a new product. Those chosen cities (counties, states,
etc.) are called test market cities (counties, states, etc.).
People use cluster random sampling in this case because of its feasibility
and convenience.
We should select test market cities whose population profiles are similar
to those of the entire country.

Nonrandom sampling
Sometimes we do nonrandom sampling.
Convenience sampling.
The researcher samples data that are easy to sample.
Judgment sampling.
The researcher decides who to ask or what data to collect.
Quota sampling.
In each stratum, we use whatever method is easy to fill the quota, a
predetermined number of samples in the stratum.
Snowball sampling.
Once we ask one person, we ask her/him to suggest others.
Nonrandom sampling cannot be analyzed by the statistical methods
we introduce in this course.

Road map
Sampling.
Sampling distributions.
Hypothesis testing.
p-value, t test, and more.

Sampling distributions
When we cannot examine the whole population, we study a sample.
What will be contained in a random sample is unpredictable.
We need to know the probability distribution of a sample so that we
may connect the sample with the population.
The probability distribution of a sample is a sampling distribution.

Sampling distributions
A factory produces bags of candies. Ideally, each bag should weigh 2
kg. As the production process cannot be perfect, a bag of candies
should weigh between 1.8 and 2.2 kg.
Let X be the weight of a bag of candies. Let µ and σ be its expected
value and standard deviation.
Is µ = 2?
Is 1.8 < µ < 2.2?
How large is σ?
Let’s sample:
In a random sample of 1 bag of candies, suppose it weighs 2.1 kg. May
we conclude that 1.8 < µ < 2.2?
What if the average weight of 5 bags in a random sample is 2.1 kg?
What if the sample size is 10, 50, or 100?
What if the mean is 2.3 kg?
We need to know the sampling distribution of those statistics (sample
mean, sample standard deviation, etc.).

Sample means
The sample mean is one of the most important statistics.
Definition 1
Let $\{X_i\}_{i=1,...,n}$ be a sample from a population, then
$\bar{x} = \frac{\sum_{i=1}^{n} X_i}{n}$
is the sample mean.
Sometimes we write $\bar{x}_n$ to emphasize that the sample size is n.
We assume that $X_i$ and $X_j$ are independent for all $i \neq j$.
This is fine if $n \ll N$, i.e., we sample a few items from a large population.
In practice, we require $n \leq 0.05N$.

Means and variances of sample means
Suppose the population mean and variance are $\mu$ and $\sigma^2$, respectively.
These two numbers are fixed.
A sample mean $\bar{x}$ is a random variable.
It has its expected value $E[\bar{x}]$, variance $\mathrm{Var}(\bar{x})$, and standard deviation
$\sqrt{\mathrm{Var}(\bar{x})}$. These numbers are all fixed.
They are also denoted as $\mu_{\bar{x}}$, $\sigma^2_{\bar{x}}$, and $\sigma_{\bar{x}}$, respectively.
For any population, we have the following theorem:
Proposition 1 (Mean and variance of a sample mean)
Let $\{X_i\}_{i=1,...,n}$ be a size-n random sample from a population with
mean $\mu$ and variance $\sigma^2$, then we have
$\mu_{\bar{x}} = \mu$, $\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$, and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
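A sketch of why the proposition holds, assuming (as stated for the sample mean) that the $X_i$ are independent, each with mean $\mu$ and variance $\sigma^2$. The first line uses linearity of expectation; the second uses the fact that the variance of a sum of independent random variables is the sum of their variances.

```latex
E[\bar{x}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
           = \frac{1}{n}\sum_{i=1}^{n} E[X_i]
           = \frac{1}{n} \cdot n\mu
           = \mu,
\qquad
\mathrm{Var}(\bar{x}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
           = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i)
           = \frac{1}{n^2} \cdot n\sigma^2
           = \frac{\sigma^2}{n}.
```

Taking the square root of the variance gives $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. No assumption about the shape of the population distribution is used.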

Means and variances of sample means
Do the terms confuse you?
The sample mean vs. the mean of the sample mean.
The sample variance vs. the variance of the sample mean.
By definition, they are:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} X_i$; a random variable.
$E[\bar{x}]$; a constant.
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{x})^2$; a random variable.
$\mathrm{Var}(\bar{x})$; a constant.
The sample variance also has its mean and variance.

Example: Quality inspection
The weight of a bag of candies follows a normal distribution with mean
$\mu = 2$ and standard deviation $\sigma = 0.2$.
Suppose the quality control officer decides to sample 4 bags and
calculate the sample mean $\bar{x}$. She will punish me if $\bar{x} \notin [1.8, 2.2]$.
Note that my production process is actually “good”: $\mu = 2$.
Unfortunately, it is not perfect: $\sigma > 0$.
We may still be punished (if we are unlucky) even though $\mu = 2$.
What is the probability that I will be punished?
We want to calculate $1 - \Pr(1.8 < \bar{x} < 2.2)$.
We know that $\mu_{\bar{x}} = \mu = 2$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{4}} = 0.1$.
But we do not know the probability distribution of $\bar{x}$!

Sampling from a normal population
If the population is normal, the sample mean is also normal!
Proposition 2
Let $\{X_i\}_{i=1,...,n}$ be a size-n random sample from a normal population
with mean $\mu$ and standard deviation $\sigma$. Then
$\bar{x} \sim \text{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
We already know that $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. This is true regardless of
the population distribution.
When the population is normal, the sample mean will also be normal.

Example revisited: Quality inspection
The weight of a bag of candies follows a normal distribution with mean
$\mu = 2$ and standard deviation $\sigma = 0.2$.
Suppose the quality control officer decides to sample 4 bags and
calculate the sample mean $\bar{x}$. She will punish me if $\bar{x} \notin [1.8, 2.2]$.
What is the probability that I will be punished?
The distribution of the sample mean $\bar{x}$ is ND(2, 0.1).
$\Pr(\bar{x} < 1.8) + \Pr(\bar{x} > 2.2) \approx 0.045$.
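The probability can be checked numerically with `statistics.NormalDist` from Python's standard library (available since Python 3.8). This is only a sketch of the calculation; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, n = 2.0, 0.2, 4
x_bar_dist = NormalDist(mu, sigma / n ** 0.5)    # distribution of the sample mean: ND(2, 0.1)

# Probability that the sample mean falls below 1.8 or above 2.2.
p_punished = x_bar_dist.cdf(1.8) + (1 - x_bar_dist.cdf(2.2))
print(round(p_punished, 4))
```

Both limits are exactly two standard deviations of $\bar{x}$ away from the mean, so the result is the familiar two-sided tail probability beyond ±2 standard deviations, about 0.0455.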

Adjusting the standard deviation
When the population is
ND(µ = 2, σ = 0.2) and the sample
size is n = 4, the probability of
punishment is 0.045.
If we adjust our standard deviation
σ (by paying more or less attention
to the production process), the
probability will change.
Reducing σ reduces the probability
of being punished. With the
sampling distribution of $\bar{x}$, we may
optimize σ.
An improvement from 0.2 to 0.15
is helpful; from 0.15 to 0.1 is not.

Adjusting the sample size
When the population is ND(2, 0.2)
and the sample size is n = 4, the
probability of punishment is 0.045.
If the quality control officer
increases the sample size n, the
probability will decrease.
$\mu = 2$ is actually ideal. A larger
sample size makes the officer less
likely to make a mistake.
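Both adjustments can be explored with one small function. In this sketch the function name `punishment_probability` and the sample sizes 9, 16, and 25 are choices made here for illustration; the standard deviations 0.2, 0.15, and 0.1 are the ones discussed in the slides.

```python
from statistics import NormalDist

def punishment_probability(sigma, n, mu=2.0, low=1.8, high=2.2):
    """Pr(sample mean falls outside [low, high]) for a normal population."""
    x_bar_dist = NormalDist(mu, sigma / n ** 0.5)
    return x_bar_dist.cdf(low) + (1 - x_bar_dist.cdf(high))

# Effect of the population standard deviation, with n fixed at 4.
for sigma in (0.2, 0.15, 0.1):
    print(f"sigma = {sigma}, n = 4:  {punishment_probability(sigma, 4):.5f}")

# Effect of the sample size, with sigma fixed at 0.2.
for n in (4, 9, 16, 25):
    print(f"sigma = 0.2, n = {n}: {punishment_probability(0.2, n):.5f}")
```

Going from 0.2 to 0.15 already brings the probability below 1%, so the further step to 0.1 removes very little additional risk, which is the sense in which it "is not" helpful.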

Distribution of the sample mean
So now we have one general conclusion: When we sample from a
normal population, the sample mean is also normal.
And its mean and standard deviation are $\mu$ and $\frac{\sigma}{\sqrt{n}}$, respectively.
What if the population is non-normal?
Fortunately, we have a very powerful theorem, the central limit
theorem, which applies to any population.

Central limit theorem
The theorem says that a sample mean is approximately normal
when the sample size is large enough.
Proposition 3 (Central limit theorem)
Let {Xi}i=1,...,n be a size-n random sample from a population with
mean µ and standard deviation σ. Let ¯xn be the sample mean. If
σ < ∞, then ¯xn converges to $\mathrm{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ as n → ∞.
How large is “large enough”?
In practice, typically n ≥ 30 is believed to be large enough.
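A small simulation can illustrate the theorem. The sketch below assumes an exponential population with rate 1 (mean 1, standard deviation 1, clearly non-normal); the population, the seed, and the number of repetitions are arbitrary choices made for illustration, not part of the lecture.

```python
import random
from statistics import mean, stdev

random.seed(0)                     # fixed seed so that the run is repeatable
n, reps = 30, 5000                 # sample size and number of repeated samples

# Assumed population: exponential with rate 1 (mean 1, standard deviation 1).
sample_means = [mean([random.expovariate(1.0) for _ in range(n)])
                for _ in range(reps)]

se = 1 / n ** 0.5                  # sigma / sqrt(n) predicted by the theorem
within = sum(abs(m - 1) < 1.96 * se for m in sample_means) / reps

print(f"mean of sample means: {mean(sample_means):.3f} (theory: 1)")
print(f"sd of sample means:   {stdev(sample_means):.3f} (theory: {se:.3f})")
print(f"share within 1.96 se: {within:.3f} (normal theory: 0.95)")
```

If the sample means were exactly normal, about 95% of them would lie within 1.96 standard errors of µ; the simulated share should be close to that.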

Road map
Sampling.
Hypothesis testing.

Hypothesis testing
How do scientists (physicists, chemists, etc.) do research?
Observe phenomena.
Make hypotheses.
Test the hypotheses through experiments (or other methods).
Make conclusions about the hypotheses.
Social scientists and business researchers do the same thing with
hypothesis testing.
One of the most important techniques of statistical inference.
A technique for (statistically) proving things.
Relying on sampling distributions.

People ask questions
In the business (or social science) world, people ask questions:
Are older workers more loyal to a company?
Does the newly hired CEO enhance our proﬁtability?
Is one candidate preferred by more than 50% of voters?
Do teenagers eat fast food more often than adults?
Is the quality of our products stable enough?
How should we answer these questions?
Statisticians suggest:
First make a hypothesis.
Then test it with samples and statistical methods.

Statistical hypotheses
A statistical hypothesis is a formal way of stating a hypothesis.
Typically it is a mathematical description of parameters to test.
It contains two parts:
The null hypothesis (denoted as H0).
The alternative hypothesis (denoted as Ha or H1).
The alternative hypothesis is:
The thing that we want (need) to prove.
The conclusion that can be made only if we have strong evidence.
The null hypothesis corresponds to a default position.
We ﬁrst assume that the null hypothesis is correct.
Then we collect sample data.
If under the null hypothesis it is quite unlikely to see our observed
result, we claim that the null hypothesis is wrong.

Statistical hypotheses: example 1
In our factory, we produce packs of candy whose average weight should
be 1 kg.
One day, a consumer told us that his pack only weighs 900 g.
We need to know whether this is just a rare event or our production
system is out of control.
If (we believe) the system is out of control, we need to shut down the
machine and spend two days on inspection and maintenance. This will
cost us at least $100,000.
So we should not believe that our system is out of control just
because of one complaint. What should we do?

We ﬁrst state a hypothesis: “Our production system is under control.”
Then we ask: Is there strong enough evidence showing that the
hypothesis is wrong, i.e., the system is out of control?
Initially, we assume that our system is under control.
Then we do a survey to see if we have strong enough evidence.
We shut down machines only if we can “prove” that the system is indeed
out of control.
Let µ be the average weight (in kg). The statistical hypothesis is
H0 : µ = 1
Ha : µ ≠ 1.

In our society, we adopt the presumption of innocence.
One is considered innocent until proven guilty.
So when there is a person who probably stole some money:
H0 : The person is innocent
Ha : The person is guilty.
There are two possible errors:
One is guilty but we think she/he is innocent.
One is innocent but we think she/he is guilty.
Which one is more critical?
It is unacceptable that an innocent person is considered guilty.
We will say one is guilty only if there is strong evidence.

Consider the following hypothesis: “The candidate is preferred by more
than 50% of voters.”
As we need a default position, and the percentage that we care about
is 50%, we will choose our null hypothesis as
H0 : p = 0.5.
p is the population proportion of voters preferring the candidate.
More precisely, let Xi = 1 if voter i prefers this candidate and 0
otherwise, i = 1, ..., N, then $p = \frac{\sum_{i=1}^{N} X_i}{N}$.
How about the alternative hypothesis? Should it be
Ha : p > 0.5 or Ha : p < 0.5?

The choice of the alternative hypothesis depends on the related
decisions or actions to make.
Suppose one will go for the election only if she thinks she will win (i.e.,
p > 0.5), the alternative hypothesis will be
Ha : p > 0.5.
Suppose one tends to participate in the election and will give up only if
the chance is slim, the alternative hypothesis will be
Ha : p < 0.5.
The alternative hypothesis is “the thing we want (need) to prove.”

Two types of errors
Type-1 error (false positive): Rejecting a true null hypothesis.
There is nothing, but we say there is one.
Type-2 error (false negative): Not rejecting a false null hypothesis.
There is something, but we do not see it.

Remarks
We want to control the chances for us to make these mistakes.
Unfortunately, we cannot control both.
We choose to control the probability of a type-1 error.
The choice of the default position is important.
For setting up a statistical hypothesis:
Our default position will be put in the null hypothesis.
The thing we want to prove (i.e., the thing that needs strong evidence)
will be put in the alternative hypothesis.
For writing the mathematical statement:
The equal sign (=) will always be put in the null hypothesis.
The alternative hypothesis contains an unequal sign or strict
inequality: ≠, >, or <.
The direction of the alternative hypothesis, when it is an inequality,
depends on the context.

One-tailed tests and two-tailed tests
If the alternative hypothesis contains an unequal sign (≠), the test is a
two-tailed test.
If it contains a strict inequality (> or <), the test is a one-tailed test.
Suppose we want to test the value of the population mean.
In a two-tailed test, we test whether the population mean significantly
deviates from a hypothesized value. We do not care whether it is larger
than or smaller than.
In a one-tailed test, we test whether the population mean significantly
deviates from a hypothesized value in a specific direction.

The ﬁrst example: a two-tailed test
Let’s test the average weight (in g) of our products.
H0 : µ = 1000
Ha : µ ≠ 1000.
The variance of the product weights is σ² = 40000 g².
The case with unknown σ² will be discussed later.
A random sample has been collected.
Suppose the sample size n = 100.
Suppose the sample mean ¯x = 963.
How do we draw a conclusion?

Controlling the error probability
All we can do is to collect a random sample and make our conclusion
based on the observed sample.
It is natural that we may be wrong when we claim µ ≠ 1000.
We want to control the error probability.
Let α be the maximum probability for us to make this error.
α is called the significance level.
1 − α is called the confidence level.
Target: If µ = 1000, our sampling and testing process will make us claim
that µ ≠ 1000 with probability at most α.

Rejection rule
Now let’s test with the signiﬁcance level α = 0.05.
Intuitively, if ¯X deviates from 1000 a lot, we should reject the null
hypothesis and believe that µ ≠ 1000.
If µ = 1000, such a large deviation is very unlikely to be observed.
So such a large deviation provides strong evidence.
So we start by sampling and calculating the sample mean.
We want to construct a rejection rule: If |¯X − 1000| > d, we reject
H0. We need to calculate d.

Rejection rule
We want a distance d such that if H0 is true, the probability of
rejecting H0 is at most 5%, i.e.,
$\Pr\left(|\bar{X} - 1000| > d \mid \mu = 1000\right) \le 0.05$.
The smallest d that satisfies the above inequality requires
$\Pr\left(|\bar{X} - 1000| > d \mid \mu = 1000\right) = 0.05$.
Consider ¯X:
We know σ = 200 and n = 100.
We assume that µ = 1000.
Thanks to the central limit theorem, ¯X ∼ ND(1000, 20).

Rejection rule: the critical value
According to ¯X ∼ ND(1000, 20), Pr(|¯X − 1000| > 39.2) = 0.05. The
rejection region is R = (−∞, 960.8) ∪ (1039.2, ∞).
If ¯X falls in the rejection region, we reject H0.

Because ¯x = 963 /∈ R, we cannot reject H0.
The deviation from 1000 is not large enough.
The evidence is not strong enough.
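The critical values and the decision can be reproduced in a few lines. This is a sketch using Python's standard-library statistics.NormalDist, with the numbers taken from the example; the variable names are illustrative assumptions.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 1000, 200, 100, 0.05   # values from the example
x_bar = 963                                    # observed sample mean

se = sigma / n ** 0.5                          # standard deviation of the sample mean
z = NormalDist().inv_cdf(1 - alpha / 2)        # two-tailed critical z value
d = z * se                                     # largest deviation that is not rejected
lower, upper = mu0 - d, mu0 + d                # the two critical values

reject = x_bar < lower or x_bar > upper
print(f"d = {d:.1f}, critical values = ({lower:.1f}, {upper:.1f}), reject H0: {reject}")
```

Each tail receives α/2 of the probability, which is why the quantile is taken at 1 − α/2 rather than 1 − α.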

In this example, the two values 960.8 and 1039.2 are the critical
values for rejection.
If the sample mean is more extreme than one of the critical values, we
reject H0.
Otherwise, we do not reject H0.
¯x = 963 is not strong enough to support Ha: µ ≠ 1000.
Concluding statement:
Because the sample mean does not lie in the rejection region, we cannot
reject H0.
With a 95% conﬁdence level, there is no strong evidence showing that
the average weight is not 1000 g.
Therefore, we should not shut down machines to do an inspection.

Summary
We want to know whether the machine is out of control.
If the machine is actually good, we do not want to reach a conclusion
that requires an inspection and maintenance.
We will do the inspection only if we have strong evidence suggesting
that µ ≠ 1000.
We want to know whether H0 is false, i.e., µ ≠ 1000.
We control the probability of making a wrong conclusion.
We should not reject H0 if it is true.
We limit the probability at α = 5%.
We will conclude that H0 is false if ¯X falls in the rejection region.
The calculation of the critical values is based on the normal
distribution, which can always be transformed to the z distribution.
This is called a z test.

Not rejecting vs. accepting
We should be careful in writing our conclusions:
Wrong: Because the sample mean does not lie in the rejection region,
we accept H0. With a 95% confidence level, there is strong evidence
showing that the average weight is 1000 g.
Right: Because the sample mean does not lie in the rejection region, we
cannot reject H0. With a 95% conﬁdence level, there is no strong
evidence showing that the average weight is not 1000 g.
Being unable to prove that something is false does not mean that it is true!

The ﬁrst example (part 2)
Suppose that we modify the hypothesis into a directional one (some researchers write H0 : µ ≥ 1000 in this case):
H0 : µ = 1000
Ha : µ < 1000.
We still have σ² = 40000, n = 100, and α = 0.05.
This is a one-tailed test.
Once we have strong evidence supporting Ha, we will claim that
µ < 1000.
We need to find a distance d such that
$\Pr\left(1000 - \bar{X} > d \mid \mu = 1000\right) = 0.05$.

For 0.05 = Pr(1000 − ¯X > d), we have d = 32.9.
As the observed sample mean ¯x = 963 ∈ (−∞, 967.1), we reject H0.
The deviation from 1000 is large enough.
The evidence is strong enough.
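For the one-tailed version, all of α goes into the lower tail. The sketch below mirrors the calculation with Python's standard-library statistics.NormalDist; variable names are illustrative assumptions.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 1000, 200, 100, 0.05   # values from the example
x_bar = 963                                    # observed sample mean

se = sigma / n ** 0.5                          # standard deviation of the sample mean
z = NormalDist().inv_cdf(1 - alpha)            # one-tailed critical z value
d = z * se
critical = mu0 - d                             # rejection region is (-inf, critical)

reject = x_bar < critical
print(f"d = {d:.1f}, critical value = {critical:.1f}, reject H0: {reject}")
```

The same sample mean that was not rejected in the two-tailed test is rejected here, because the one-tailed critical value lies closer to 1000.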

In this example, 967.1 is the critical value for rejection.
If the sample mean is more extreme than (in this case, below) the critical
value, we reject H0.
Otherwise, we do not reject H0.
There is strong evidence supporting Ha: µ < 1000.
Concluding statement:
Because the sample mean lies in the rejection region, we reject H0.
With a 95% confidence level, there is strong evidence showing that the
average weight is less than 1000 g.

One-tailed tests vs. two-tailed tests
When should we use a two-tailed test?
We use a two-tailed test when we lack information about the direction.
E.g., we suspect that the population mean has changed, but we have
no idea about whether it becomes larger or smaller.
If we know or believe that the change is possible only in one
direction, we may use a one-tailed test.
Having more information (i.e., knowing the direction of change) makes
rejection “easier,” i.e., it is easier to find strong enough evidence.

Summary
Distinguish the following pairs:
One- and two-tailed tests.
No evidence showing H0 is false and having evidence showing H0 is true.
Not rejecting H0 and accepting H0.
Using = and using ≥ or ≤ in the null hypothesis.

Road map
Sampling.
Hypothesis testing.

The p-value
The p-value is an important, meaningful, and widely-adopted tool for
hypothesis testing.
Deﬁnition 2
For an observed value of a statistic in a statistical test, the p-value is
the probability of observing a value that is more extreme than the
observed value under the assumption that the null hypothesis is true.
Calculated based on an observed value of the statistic.
Is the tail probability of the observed value.
Assuming that the null hypothesis is true.

The p-value
Mathematically:
Suppose we test a population
mean µ with a one-tailed test
H0 : µ = 1000
Ha : µ < 1000.
Given an observed ¯x, the p-value
is defined as
Pr(¯X ≤ ¯x | µ = 1000).
In the previous example, σ = 200,
n = 100, α = 0.05, and ¯x = 963.
If H0 is true, i.e., µ = 1000, we
have Pr(¯X ≤ 963) = 0.032.
The p-value of ¯x is 0.032.

How to use the p-value?
The p-value can be used for constructing a rejection rule.
For a one-tailed test:
If the p-value is smaller than α, we reject H0.
If the p-value is greater than α, we do not reject H0.
In our example, the one-tailed test is
H0 : µ = 1000
Ha : µ < 1000.
We have α = 0.05.
Because the p-value 0.032 < 0.05, we reject H0.
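The p-value and the resulting decision can be computed directly. This is a sketch with Python's standard-library statistics.NormalDist, assuming the sampling distribution ND(1000, 20) from the example; variable names are illustrative.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 1000, 200, 100, 0.05   # values from the example
x_bar = 963                                    # observed sample mean

# Sampling distribution of the sample mean under H0.
null_dist = NormalDist(mu0, sigma / n ** 0.5)

# Left-tail probability, because Ha is mu < 1000.
p_value = null_dist.cdf(x_bar)
reject = p_value < alpha
print(f"p-value = {p_value:.3f}, reject H0: {reject}")
```

For an alternative of the form µ > 1000, the right tail 1 − cdf(x_bar) would be used instead.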

p-values vs. critical values
Using the p-value is equivalent to using the critical values.
The rejection-or-not decision we make will be the same based on the two
methods.

The benefit of using the p-value
In many studies, researchers do not determine the significance level α
before a test is conducted.
They calculate the p-value and then mark the significance of the
result with stars.
One typical way of assigning stars:
p-value        Significant?             Mark
(0, 0.01]      Highly significant       ***
(0.01, 0.05]   Moderately significant   **
(0.05, 0.1]    Slightly significant     *
(0.1, 1)       Insignificant            (Empty)

The size of a p-value
Suppose one is testing whether people at diﬀerent ages sleep for at
least eight hours per day on average.
Age groups: [10, 15), [15, 20), [20, 25), etc.
For group i, a one-tailed test is conducted. Ha : µi > 8.
The result may be presented in a table:
Group   Age group   p-value
1       [10,15)     0.0002***
2       [15,20)     0.2
3       [20,25)     0.06*
4       [25,30)     0.04**
5       [30,35)     0.03**
A smaller p-value does NOT mean a larger deviation!
We cannot conclude that µ5 > µ4, µ1 > µ3, etc.
There are other tests for the diﬀerence between two population means.

The p-value for two-tailed tests
How to construct the rejection rule for a two-tailed test?
If the p-value is smaller than α/2, we reject H0.
If the p-value is greater than α/2, we do not reject H0.
Consider the two-tailed test
H0 : µ = 1000
Ha : µ ≠ 1000.
We have α = 0.05.
Because the p-value 0.032 > α/2 = 0.025, we do not reject H0.
Some researchers/books/software use another definition:
The p-value for a two-tailed test is twice that for the
corresponding one-tailed test.
They then compare this p-value with α.
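The two conventions always lead to the same decision, as the following sketch shows for the running example. It uses Python's standard-library statistics.NormalDist; the variable names are illustrative assumptions.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 1000, 200, 100, 0.05   # values from the example
x_bar = 963                                    # observed sample mean

null_dist = NormalDist(mu0, sigma / n ** 0.5)
left = null_dist.cdf(x_bar)
one_tail_p = min(left, 1 - left)               # tail probability on the observed side

# Convention 1: compare the one-tailed p-value with alpha / 2.
reject_1 = one_tail_p < alpha / 2
# Convention 2: double the p-value and compare it with alpha.
two_tail_p = 2 * one_tail_p
reject_2 = two_tail_p < alpha

print(f"one-tailed p = {one_tail_p:.3f}, doubled p = {two_tail_p:.3f}")
print(f"reject H0: {reject_1} (convention 1), {reject_2} (convention 2)")
```

Comparing p with α/2 and comparing 2p with α are the same inequality, so the conventions differ only in which number is reported.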

Summary
The p-value is the tail probability of the realized value of a statistic
assuming the null hypothesis is true.
The p-value method is an alternative way of forming the rejection rule.
It is equivalent to the critical-value method.
A small p-value is evidence against H0, but the p-value is not the probability that H0 is false.
It does not measure the magnitude of the deviation.

The z test
In example 1, basically we use the fact that $\bar{X} \sim \mathrm{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
This implies that $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathrm{ND}(0, 1)$, the so-called standard normal
distribution, or the z distribution.
Therefore, this test is called the z test.
This requires the knowledge about σ.

When the variance is unknown
When the population variance σ² is unknown, the quantity $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ cannot be computed.
What if we use the sample variance S² as a substitute?
Proposition 4
For a normal population, the quantity
$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$
follows the t distribution with n − 1 degrees of freedom.
What is the t distribution?

The t distribution
The t distribution is deﬁned as follows:
Deﬁnition 3
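A random variable X follows the t distribution with n degrees of freedom,
denoted as X ∼ t(n), if
$f(x \mid n) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}$,
for all x ∈ (−∞, ∞).
$\Gamma(x) = \int_0^\infty z^{x-1} e^{-z}\,dz$ is the gamma function.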
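The density in Definition 3 is easy to evaluate numerically. The sketch below implements it with Python's standard-library math.gamma and compares it with the standard normal density; the function names and the evaluation points are illustrative assumptions.

```python
import math

def t_pdf(x, n):
    """Density of the t distribution with n degrees of freedom (Definition 3)."""
    coef = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return coef * (1 + x * x / n) ** (-(n + 1) / 2)

def z_pdf(x):
    """Density of the standard normal (z) distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Compare the two densities at the center (x = 0) and in the tail (x = 3).
for n in (1, 5, 30, 100):
    print(f"n = {n:3d}: t_pdf(0) = {t_pdf(0, n):.4f}, t_pdf(3) = {t_pdf(3, n):.5f}")
print(f"normal:  z_pdf(0) = {z_pdf(0):.4f}, z_pdf(3) = {z_pdf(3):.5f}")
```

For small n the t density is lower at the center and higher in the tails than the normal density, and the gap closes as n grows. Note that math.gamma overflows for very large arguments, so this direct formula is only suitable for moderate n.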

The z and t distributions
Let’s compare $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ and $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$.
Because we do not know σ, we use S to substitute it.
Z ∼ ND(0, 1) and T ∼ t(n − 1).
As the t distribution is a substitute for the z distribution, it is designed
to be also centered at 0: E[T] = E[Z] = 0.
However, as we add one more random variable into the formula (σ is a
known constant), T will be “more random” than Z, i.e.,
Var(T) > Var(Z).
Graphically, t curves will be ﬂatter than the z curve.
Fact: t(n) → ND(0, 1) as n → ∞.

The t test
We will use the t test to test the population mean if the population is
normal.
If the sample size is large, we may still use the z distribution with s
substituting σ.

Example 2
An MBA program seldom admits applicants who do not have more than two
years of work experience.
To test whether the average work experience of admitted students is above
two years, 20 admitted applicants are randomly selected.
Their work experiences prior to entering the program are recorded.
Prior to entering the program, they have an average work experience of
2.5 years. This is the sample mean.
The sample standard deviation is 1.3765 years.
The population is believed to be normal.
The conﬁdence level is set to 95%.

Example 2: hypothesis
Suppose the one asking the question is a potential applicant with one
year of work experience. He is pessimistic and will apply for the
program only if the average work experience is proven to be less than
two years.
The hypothesis is
H0 : µ = 2
Ha : µ < 2.
µ is the average work experience (in years) of all admitted applicants
prior to entering the program.
To encourage him, we need to give him strong evidence showing that
his chance is high.

Example 2: hypothesis and test
Suppose he is optimistic and will give up applying only if
the average work experience is proven to be greater than two years.
The hypothesis becomes
H0 : µ = 2
Ha : µ > 2.
To discourage him, we need to give him strong evidence showing that
his chance is slim.
Let’s consider the optimistic candidate (and Ha : µ > 2) ﬁrst.
Because the population variance is unknown and the population is
normal, we may use the t test.

Example 2A: calculation and interpretation
Calculation:
The p-value is Pr(¯X > 2.5 | µ = 2) = 0.0604.
Conclusion:
For this one-tailed test, as the p-value > 0.05 = α, we do not reject H0.
There is no strong evidence showing that the average work experience
is longer than two years.
The result is not strong enough to discourage the potential applicant,
who has only one year of work experience.
Decision:
The (optimistic) applicant should apply.
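The p-value in Example 2A comes from the t distribution with 19 degrees of freedom. The sketch below recomputes it with the Python standard library only: it integrates the t density numerically with Simpson's rule, which is an assumption made here to avoid third-party packages (statistical software would normally call a built-in t distribution function). All function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Pr(T > t) for t >= 0: Simpson's rule on [0, t], then symmetry around 0."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

n, x_bar, s, mu0, alpha = 20, 2.5, 1.3765, 2.0, 0.05   # values from Example 2
t_stat = (x_bar - mu0) / (s / math.sqrt(n))            # observed t statistic
p_value = t_upper_tail(t_stat, n - 1)                  # right tail, since Ha is mu > 2

print(f"t = {t_stat:.4f}, p-value = {p_value:.4f}, reject H0: {p_value < alpha}")
```

For the pessimistic applicant (Ha : µ < 2), the p-value is the complementary left-tail probability, 1 − p_value.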

Example 2B – a pessimistic applicant
Suppose the applicant is pessimistic and the hypothesis is
H0 : µ = 2
Ha : µ < 2.
The p-value will be Pr(¯X < 2.5 | µ = 2) = 1 − 0.0604 = 0.9396.
This is calculated based on the t distribution.
We do not reject H0 and cannot conclude that µ < 2. There is no strong
evidence to encourage him.
He should not apply.
Note that when we write different alternative hypotheses, the final
decision is different!
This happens if and only if in both cases we do not reject H0.

Summary
To test the population mean µ:
σ²        Sample size   Normal population   Nonnormal population
Known     n ≥ 30        z                   z
Known     n < 30        z                   Nonparametric
Unknown   n ≥ 30        t or z              z
Unknown   n < 30        t                   Nonparametric
More parameters that may be tested:
Population proportion (z test).
Population variance (χ² test).
Diﬀerence of two population means (t test).
Ratio of two population variances (F test).

Part 3:
Regression Analysis
Ling-Chieh Kung
September 4, 2016

Correlation and prediction
We often try to ﬁnd correlation among variables.
For example, prices and sizes of houses:
House            1    2    3    4    5    6    7    8    9   10   11   12
Size (m²)       75   59   85   65   72   46  107   91   75   65   88   59
Price ($1000)  315  229  355  261  234  216  308  306  289  204  265  195
We may calculate their correlation coefficient as r = 0.729.
Now given a house whose size is 100 m², may we predict its price?
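The correlation coefficient can be recomputed from the twelve houses. The sketch below uses plain Python and the usual sample correlation formula; the variable names are illustrative assumptions.

```python
import math

# Data of the twelve houses from the table.
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]              # in m^2
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]  # in $1000

n = len(size)
x_bar, y_bar = sum(size) / n, sum(price) / n

# Sums of squared deviations and of cross deviations.
sxx = sum((x - x_bar) ** 2 for x in size)
syy = sum((y - y_bar) ** 2 for y in price)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, price))

r = sxy / math.sqrt(sxx * syy)
print(f"r = {r:.3f}")
```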

Correlation among more than two variables
Sometimes we have more than two variables:
For example, we may also know the number of bedrooms in each house:
House            1    2    3    4    5    6    7    8    9   10   11   12
Size (m²)       75   59   85   65   72   46  107   91   75   65   88   59
Price ($1000)  315  229  355  261  234  216  308  306  289  204  265  195
Bedroom          1    1    2    2    2    1    3    3    2    1    3    1
How to summarize the correlation among the three variables?
How to predict house price based on size and number of bedrooms?

Regression analysis
Regression is a solution!
As one of the most widely used tools in Statistics, it discovers:
Which variables affect a given variable.
How they affect the target.
In general, we will predict/estimate one dependent variable by one
or multiple independent variables.
Independent variables: Potential factors that may affect the outcome.
Dependent variable: The outcome.
Independent variables are explanatory variables; the dependent variable
is the response variable.
As another example, suppose we want to predict the number of arrival
consumers for tomorrow:
Dependent variable: Number of arrival consumers.
Independent variables: Weather, holiday or not, promotion or not, etc.

Types of regression analysis
Based on the number of independent variables:
Simple regression: One independent variable.
Multiple regression: More than one independent variable.
The dependent variable may be quantitative or qualitative.
In ordinary regression, the dependent variable is quantitative.
In logistic regression, the dependent variable is qualitative.
There are other types of regression models.

Road map
Simple regression.
Multiple regression.
Indicator variables and interaction.
Endogeneity and residual analysis.
Logistic regression.

Basic principle
Consider the price-size relationship again. In the sequel, let xi be the
size and yi be the price of house i, i = 1, ..., 12.
Size (in m²)  Price (in $1000)
46 216
59 229
59 195
65 261
65 204
72 234
75 315
75 289
85 355
88 265
91 306
107 308
How to relate sizes and prices “in the best way?”

Linear estimation
If we believe that the relationship between the two variables is linear,
we will assume that
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$.
β0 is the intercept of the equation.
β1 is the slope of the equation.
εi is the random noise for estimating record i.
Somehow there is such a formula, but we do not know β0 and β1.
β0 and β1 are the parameters of the population.
We want to use our sample data (e.g., the information of the twelve
houses) to estimate β0 and β1.
We want to form two statistics ˆβ0 and ˆβ1 as our estimates of β0 and β1.

Linear estimation
Given the values of ˆβ0 and ˆβ1, we will use ˆyi = ˆβ0 + ˆβ1xi as our
estimate of yi.
Then we have
$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \epsilon_i$,
where $\epsilon_i$ is now interpreted as the estimation error.
We hope $\epsilon_i = y_i - \hat{y}_i$ to be small.
For all data points, let’s minimize the sum of squared errors (SSE):
$\sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2$.
The solution of
$\min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2$
is our least square approximation (estimation) of the given data.

Least square approximation
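The least square approximation problem
$\min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2$
has a closed-form formula for the best (ˆβ0, ˆβ1):
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.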
For our house example, we will get (ˆβ0, ˆβ1) = (102.717, 2.192).
Its SSE is 13118.63.
We will never know the true values of β0 and β1. However, according to
our sample data, the best (least square) estimate is (102.717, 2.192).
We tend to believe that β0 = 102.717 and β1 = 2.192.
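The closed-form formula can be applied directly to the twelve houses. The following is a sketch in plain Python; the variable names (b0, b1, sse) are illustrative assumptions standing for ˆβ0, ˆβ1, and SSE.

```python
# Data of the twelve houses from the table.
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]              # x_i, in m^2
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]  # y_i, in $1000

n = len(size)
x_bar, y_bar = sum(size) / n, sum(price) / n

# Closed-form least square estimates of the slope and the intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(size, price))
      / sum((x - x_bar) ** 2 for x in size))
b0 = y_bar - b1 * x_bar

# Sum of squared errors of the fitted line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(size, price))

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.2f}")
print(f"predicted price of a 100 m^2 house: {b0 + b1 * 100:.1f} (in $1000)")
```

The last line answers the earlier question about a 100 m² house by plugging x = 100 into the fitted line.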

Interpretations
Our regression model is
y = 102.717 + 2.192x.
Interpretation: When the house size increases by 1 m², the price is
expected to increase by $2,192.
(Bad) interpretation: For a house whose size is 0 m², the price is
expected to be $102,717.

Linear multiple regression
In most cases, more than one independent variable may be used to
explain the outcome of the dependent variable.
For example, consider the number of bedrooms.
We may take both variables as independent variables to do linear
multiple regression:
$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i$.
yi is the house price (in $1000).
x1,i is the house size (in m²).
x2,i is the number of bedrooms.
εi is the random noise.
Our (least square) estimate is
(ˆβ0, ˆβ1, ˆβ2) = (82.737, 2.854, −15.789).
Price (in $1000)  Size (in m²)  Bedroom
315  75  1
229  59  1
355  85  2
261  65  2
234  72  2
216  46  1
308  107  3
306  91  3
289  75  2
204  65  1
265  88  3
195  59  1
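With more than one independent variable, the least square estimate solves the normal equations (XᵀX)β = Xᵀy. The slides do not say how the software obtains the estimate, so the sketch below is only one possible way: plain Python with a small Gaussian elimination routine (the helper name solve is hypothetical).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]      # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):                         # back substitution
        x[r] = (m[r][k] - sum(m[r][c] * x[c] for c in range(r + 1, k))) / m[r][r]
    return x

price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]  # in $1000
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]              # in m^2
bedroom = [1, 1, 2, 2, 2, 1, 3, 3, 2, 1, 3, 1]

# Each row of the design matrix is (1, size, bedroom); the 1 carries the intercept.
rows = [[1.0, s, b] for s, b in zip(size, bedroom)]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(rows, price)) for i in range(3)]

b0, b1, b2 = solve(xtx, xty)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Solving the normal equations directly is fine for a tiny example like this one; numerical libraries usually prefer more stable factorizations.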

Interpretations
Our regression model is
y = 82.737 + 2.854x1 − 15.789x2.
When the house size increases by 1 m² (and all other independent
variables are fixed), we expect the price to increase by $2,854.
When there is one more bedroom (and all other independent variables
are fixed), we expect the price to decrease by $15,789.
One must interpret the results and determine whether the result is
meaningful by herself/himself.
The number of bedrooms may not be a good indicator of house price.
At least not in a linear way.
We need more than finding coefficients:
We need to judge the overall quality of a given regression model.
We may want to compare multiple regression models.
We must test the significance of regression coefficients.

Model validation: How good is a model?
How to measure the quality of a model?
For the model y = 102.717 + 2.192x, how good is it?
In general, for a given regression model y = ˆβ0 + ˆβ1x1 + · · · + ˆβkxk, how
may we evaluate its overall quality?
The sum of squared total errors (SST), $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$, is
the sum of squared errors of the worst model, which always predicts ¯y.
With our regression model, the sum of squared errors (SSE) is
$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \epsilon_i^2$.
The proportion of total variability that is explained by the regression
model is
$0 \le R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \le 1$.
The larger R², the better the regression model.

Obtaining R²
Whenever we find the estimated coefficients, we have R².
Statistical software includes R² in the regression report.
For the regression model y = 102.717 + 2.192x, we have R² = 0.5315:
Around 53% of the variability in house prices is explained by house size.
If (and only if) there is only one independent variable, then R² = r²,
where r is the correlation coefficient between the dependent and
independent variables.
−1 ≤ r ≤ 1.
0 ≤ r² = R² ≤ 1.
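For the size-only model, R² can be computed from SSE and SST and compared with the squared correlation coefficient. This is a sketch in plain Python; variable names are illustrative assumptions.

```python
import math

size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]              # in m^2
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]  # in $1000

n = len(size)
x_bar, y_bar = sum(size) / n, sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in size)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, price))

# Least square fit of the simple regression model.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sst = sum((y - y_bar) ** 2 for y in price)                    # error of always predicting y_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(size, price))
r_squared = 1 - sse / sst

r = sxy / math.sqrt(sxx * sst)                                # correlation coefficient
print(f"R^2 = {r_squared:.4f}, r^2 = {r * r:.4f}")
```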

Comparing regression models
Now we have a way to compare regression models.
For our example:
        Size only   Bedroom only   Size and bedroom
R²      0.5315      0.29           0.5513
Using sizes only is better than using numbers of bedrooms only.
Is using sizes and bedrooms better?
In general, adding more variables never decreases R²!
In the worst case, we may set the corresponding coefficients to 0.
Some variables may actually be meaningless.
To perform a “fair” comparison and identify those meaningful factors,
we need to adjust R² based on the number of independent variables.

Adjusted R²
The standard way to adjust R² to adjusted R² is
$R^2_{\mathrm{adj}} = 1 - \frac{n - 1}{n - k - 1}(1 - R^2)$.
n is the sample size and k is the number of independent variables used.
For our example:
         Size only   Bedroom only   Size and bedroom
R²       0.5315      0.290          0.5513
R²adj    0.4846      0.219          0.4516
Actually using sizes only results in the best model!
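The adjustment is a one-line formula. The sketch below applies it to the three R² values reported in the table (taken as given, rounded as shown); the function name adjusted_r2 is a hypothetical choice.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for sample size n and k independent variables."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

n = 12  # twelve houses
models = {
    "Size only": (0.5315, 1),
    "Bedroom only": (0.290, 1),
    "Size and bedroom": (0.5513, 2),
}
for name, (r2, k) in models.items():
    print(f"{name:17s} R^2 = {r2:.4f}, adjusted R^2 = {adjusted_r2(r2, n, k):.4f}")
```

Because the inputs are rounded, the last digit of the output may differ slightly from the table.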

Testing coefficient significance
Another important task for validating a regression model is to test the
significance of each coefficient.
Recall our model with two independent variables
y = 82.737 + 2.854x1 − 15.789x2.
Note that 2.854 and −15.789 are solely calculated based on the sample.
We never know whether β1 and β2 are really these two values!
In fact, we cannot even be sure that β1 and β2 are not 0. We need to
test them:
H0 : βi = 0
Ha : βi ≠ 0.
We look for strong enough evidence showing that βi ≠ 0.

The testing results are provided in regression reports.
Statistical software (e.g., R) tells us:
            Coefficients   Standard Error   t Stat    p-value
Intercept   82.737         59.873           1.382     0.200
Size        2.854          1.247            2.289     0.048 **
Bedroom     −15.789        25.056           −0.630    0.544
As we have no idea about population variance, we apply the t test.
“Coefficients” corresponds to the sample mean ¯x; “Standard Error” corresponds to $S/\sqrt{n}$;
“t Stat” corresponds to $T = \frac{\bar{x} - 0}{S/\sqrt{n}}$.
“p-value” are the tail probabilities of T multiplied by 2 (done by most
software). Simply compare them with α!
Recall the assumption that εi is normal!

Statistical software tells us:
            Coefficients   Standard Error   t Stat    p-value
Intercept   82.737         59.873           1.382     0.200
Size        2.854          1.247            2.289     0.048 **
Bedroom     −15.789        25.056           −0.630    0.544
At a 95% confidence level, we believe that β1 ≠ 0. House size really has
some impact on house price.
At a 95% confidence level, we have no evidence for β2 ≠ 0. We cannot
conclude that the number of bedrooms has an impact on house price.
If we use only size as an independent variable, its p-value will be
0.00714. We will be quite confident that it has an impact.
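The p-value for the size-only model can be recomputed from the data. The sketch below uses the textbook standard error of the slope in simple regression, sqrt(SSE / (n − 2) / Σ(xi − ¯x)²), and integrates the t density numerically with Simpson's rule; the numerical integration and all function names are assumptions made here to stay within the Python standard library.

```python
import math

size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]              # in m^2
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]  # in $1000

n = len(size)
x_bar, y_bar = sum(size) / n, sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in size)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, price)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(size, price))

df = n - 2                                   # two coefficients are estimated
se_b1 = math.sqrt(sse / df / sxx)            # standard error of the slope
t_stat = b1 / se_b1                          # tests H0: beta1 = 0

def t_pdf(x, df):
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Pr(T > t) for t >= 0: Simpson's rule on [0, t], then symmetry around 0."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

p_value = 2 * t_upper_tail(abs(t_stat), df)  # two-tailed, as reported by most software
print(f"se = {se_b1:.3f}, t = {t_stat:.3f}, p-value = {p_value:.5f}")
```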

Road map
Simple regression.

House age
The age of a house may also aﬀect its price.
Price (in $1000)  Size (in m²)  Bedroom  Age (in years)
315 75 1 16
229 59 1 20
355 85 2 16
261 65 2 15
234 72 2 21
216 46 1 16
308 107 3 15
306 91 3 15
289 75 2 14
204 65 1 21
265 88 3 15
195 59 1 26
Let’s add age as an independent variable in explaining house prices.
Because the number of bedrooms seems to be unhelpful, let’s ignore it.

House age
For house i, let yi be its price, x1,i be its size, and x3,i be its age. We
assume the following linear relationship:
$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{3,i} + \epsilon_i$.
Software gives us the following regression report:
            Coefficients   Standard Error   t Stat    p-value
Intercept   262.882        83.632           3.143     0.012
Size        1.533          0.628            2.443     0.037 **
Age         −6.368         2.881            −2.211    0.054 *
R² = 0.696, R²adj = 0.629
R²adj goes up from 0.485 (size only) to 0.629. Age is significant at a 10%
significance level. Seems good!

“Nonlinear” relationship
May we do better?
By looking at the age-price scatter plot
(and our intuition), maybe the impact of
age on price is “nonlinear”:
A new house’s value depreciates fast.
The value depreciates slowly when the
house is old.
At least this is true for a car.
It is worthwhile to try to capture this
nonlinear relationship.
For example, we may try to replace house
age by its reciprocal:
$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 \frac{1}{x_{3,i}} + \epsilon_i$.

Variable transformation
To fit
$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 \frac{1}{x_{3,i}} + \epsilon_i$
to our sample data:
Prepare a new column as 1/age.
Input these three columns to software.
Read the report.
We may consider any kind of nonlinear relationship.
This technique is called variable transformation.
Price (in $1000)  Size (in m²)  1/Age (in 1/years)
315  75  0.063
229  59  0.05
355  85  0.063
261  65  0.067
234  72  0.048
216  46  0.063
308  107  0.067
306  91  0.067
289  75  0.071
204  65  0.048
265  88  0.067
195  59  0.038
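Preparing the transformed column is a one-line operation in most tools. A sketch in plain Python follows; the list names are illustrative, and the squared-age column is included only as a second candidate transformation.

```python
# House ages (in years) from the earlier table, in the same row order.
age = [16, 20, 16, 15, 21, 16, 15, 15, 14, 21, 15, 26]

inv_age = [1 / a for a in age]     # the new column 1/age used in the model above
age_sq = [a ** 2 for a in age]     # another candidate transformation: age squared

for a, r, q in zip(age, inv_age, age_sq):
    print(f"age = {a:2d}   1/age = {r:.4f}   age^2 = {q}")
```

The transformed column then simply replaces the original age column in the input to the regression software; the estimation procedure itself is unchanged.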

The reciprocal of house age
Software gives us the following regression report:
            Coefficients   Standard Error   t Stat    p-value
Intercept   22.905         57.154           0.401     0.698
Size        1.524          0.647            2.356     0.043 **
1/Age       2185.575       1044.497         2.092     0.066 *
R² = 0.685, R²adj = 0.615
Validation:
Variables are both significant (at different significance levels).
Using size and age explains house price better than using size and 1/age
(at least for the given sample data).
The intuition that house value depreciates at different speeds is not
supported by the data.
Changing 1/age to age² also does not help.

Typical ways of variable transformation

Variable selection and model building
In general, we may have a lot of candidate independent variables.
Size, number of bedrooms, age, distance to a park, distance to a hospital,
safety in the neighborhood, etc.
If we consider only linear relationships, for p candidate independent
variables, we have $2^p - 1$ combinations.
For each variable, we have many ways to transform it.
In the next lecture, we will introduce the way of modeling interaction
among independent variables.
How to ﬁnd the “best” regression model (if there is one)?

Variable selection and model building
There is no “best” model; there are “good” models.
Some general suggestions:
Take each independent variable one at a time and observe the
relationship between it and the dependent variable. A scatter plot
helps. Use this to consider variable transformation.
For each pair of independent variables, check their relationship. If two
are highly correlated, quite likely one is not needed.
Once a model is built, check the p-values. You may want to remove
insignificant variables (but removing a variable may change the
significance of other variables).
Go back and forth to try various combinations. Stop when a good
enough one (with high R² and R²adj and small p-values) is found.
Software can somewhat automate the process, but its power is limited
(e.g., it cannot decide transformation).
We may need to find new independent variables.
Intuitions and experiences may help (or hurt).
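The suggestion to check each pair of independent variables can be sketched as a pairwise correlation screen. The Python below is only an illustration: the helper names and the 0.8 threshold are our assumptions, and the two columns are Size and 1/Age from the house table earlier in this part.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def highly_correlated_pairs(columns, threshold=0.8):
    """Return (name1, name2, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Independent variables from the house example
house = {
    "Size": [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59],
    "1/Age": [0.063, 0.05, 0.063, 0.067, 0.048, 0.063,
              0.067, 0.067, 0.071, 0.048, 0.067, 0.038],
}
print(highly_correlated_pairs(house))
```

A flagged pair is only a hint that one of the two variables may be redundant; the final choice still depends on the fitted model.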

Summary
With a regression model, we try to identify how independent variables
affect the dependent variable.
For a regression model, we adopt the least squares criterion for estimating
the coefficients.
Model validation:
The overall quality of a regression model is decided by its $R^2$ and
$R^2_{\text{adj}}$.
We may test the significance of independent variables by their p-values.
Model building:
Variable transformation.
Variable selection.

Case study: ticket selling
A theater staged hundreds of performances in the past six years.
The owner hopes that statistics and data analysis may help her
improve the ticket sales.
Key question: What makes a show popular?
Popularity is defined as the number of tickets sold.
Potential factors: year, month, day, time, location, actors/actresses,
drama type, ticket prices, etc.
100 performances are randomly drawn from the whole pool.
All were made during weekends.
Tickets were all publicly sold.
Tickets for all performances were sold through the same channels.
For each performance, the ticket price(s) remained the same.
As a group of consultants, how may we help the theater?

Variables
Six variables are obtained:
Variable Meaning
Year The year in which the performance was made
Time Morning, afternoon, or evening
Capacity The number of seats in the theater hall
AvgPrice The average of all prices
SalesQty The number of tickets sold
SalesDuration Performance day − Announcement day
Labeling and scaling:
Years are labeled as 1, 2, ..., and 6 (6 means the last year).
Capacities and sales quantities have been scaled in the same proportion.

Data (incomplete)
Yr. Tm. Cap. A.P. Qty S.D. Yr. Tm. Cap. A.P. Qty S.D.
5 A 230 400 218 50 2 M 190 575 190 289
5 A 150 500 119 46 6 A 130 500 108 89
5 A 230 400 160 126 4 E 200 775 169 100
5 A 200 775 200 324 4 E 200 775 135 259
6 E 190 1175 178 115 5 A 310 650 251 346
6 A 190 1175 183 109 2 A 250 550 250 145
5 E 190 775 161 58 1 A 190 675 183 254
3 A 200 675 200 112 6 A 200 1175 146 110
5 E 200 775 158 323 1 M 200 575 140 94
1 M 200 575 128 360 4 A 200 775 195 255

Regression
To construct a regression model, we first consider quantitative
independent variables.
Dependent variable: SalesQty.
Independent variables: Capacity, AvgPrice, Year.
Let’s ignore SalesDuration for a while.
Note that Year is a quantitative variable.
The difference between two values makes sense: 4 − 2 and 5 − 3 both
mean a difference of two years.
The values will keep increasing.
If we have a variable Month whose possible values are 1, 2, ..., and 12,
the difference between 12 and 1 is ambiguous: 11 months or 1 month.
Scatter plots help us consider:
Variable selection: Does a variable have an impact?
Transformation: What is a variable’s impact?
Multicollinearity: Are two variables highly correlated?

Regression
It seems that Capacity, AvgPrice, and Year are all worth a try.
Let’s put them into a regression model.
If we do this one by one:
SalesQty = 20.79 + 0.72 Capacity: $R^2 = 0.538$, p-value ≈ 0.
SalesQty = 174.9 + 0.0028 AvgPrice: $R^2 = 0.0002$, p-value = 0.885.
SalesQty = 203.6 − 6.77 Year: $R^2 = 0.063$, p-value = 0.0115.
If we include them together:
The regression model is
SalesQty = 24.742 + 0.702 Capacity + 0.027 AvgPrice − 4.696 Year.
$R^2 = 0.57$, $R^2_{\text{adj}} = 0.556$; p-values are 0, 0.056, and 0.019, respectively.
Do not try independent variables only separately; try them together
(e.g., the p-value of AvgPrice is 0.885 alone but 0.056 in the joint model).
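The one-at-a-time fits can be reproduced in spirit with a few lines of Python. This is a sketch under a loud assumption: it uses only the 20 rows shown in the incomplete data table (left block first, then right block), not the full sample of 100 performances, so its numbers will not match the coefficients reported on this slide. The helper simple_ols is ours.

```python
def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx            # slope
    b0 = my - b1 * mx         # intercept
    r2 = sxy ** 2 / (sxx * syy)
    return b0, b1, r2

# The 20 rows from the (incomplete) data table
capacity = [230, 150, 230, 200, 190, 190, 190, 200, 200, 200,
            190, 130, 200, 200, 310, 250, 190, 200, 200, 200]
avg_price = [400, 500, 400, 775, 1175, 1175, 775, 675, 775, 575,
             575, 500, 775, 775, 650, 550, 675, 1175, 575, 775]
sales_qty = [218, 119, 160, 200, 178, 183, 161, 200, 158, 128,
             190, 108, 169, 135, 251, 250, 183, 146, 140, 195]

fits = {
    "Capacity": simple_ols(capacity, sales_qty),
    "AvgPrice": simple_ols(avg_price, sales_qty),
}
for name, (b0, b1, r2) in fits.items():
    print(f"SalesQty = {b0:.2f} + {b1:.4f} {name}: R^2 = {r2:.3f}")
```

Fitting all variables together requires multiple regression (for example, through a statistics package); the simple fits here only show how each variable behaves on its own.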

Adding Time into the model
Time may also be an influential variable.
However, it is qualitative.
More precisely, it is nominal.
Even if we label Time with numeric values, we cannot treat it as a
quantitative variable and put it into a regression model.
For each qualitative variable, we need to introduce several indicator
variables to represent its values.

Road map
Simple regression.

Numeric labeling does not work
The variable Time has three values.
Morning, afternoon, and evening.
Why can’t we label them as 1, 2, and 3 and do regression?
Suppose we label (morning, afternoon, evening) as (1, 2, 3):
The regression model is
SalesQty = 164.021 + 6.313Time.
Why is this wrong?

Numeric labeling does not work
Different labelings give different regression results.
We may also label (morning, afternoon, evening) as (1, 2, 10) or (3, 1, 2):
SalesQty = 164.021 + 6.313 Time, p-value = 0.294
SalesQty = 177.224 − 0.075 Time, p-value = 0.95
SalesQty = 205.725 − 15.091 Time, p-value = 0.0084
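The dependence on labeling is easy to reproduce. The Python sketch below regresses SalesQty on a numerically labeled Time under three labelings. It assumes only the 20 rows shown in the incomplete data table (left block first, then right block), so the slopes will differ from the full-sample numbers on this slide; the point is only that they change with the labels.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Time (M = morning, A = afternoon, E = evening) and SalesQty for 20 rows
time = list("AAAAEAEAEM") + list("MAEEAAAAMA")
sales_qty = [218, 119, 160, 200, 178, 183, 161, 200, 158, 128,
             190, 108, 169, 135, 251, 250, 183, 146, 140, 195]

# Numeric labels for (morning, afternoon, evening)
labelings = [(1, 2, 3), (1, 2, 10), (3, 1, 2)]

slopes = {}
for m, a, e in labelings:
    code = {"M": m, "A": a, "E": e}
    slopes[(m, a, e)] = slope([code[t] for t in time], sales_qty)

for labels, b1 in slopes.items():
    print(labels, round(b1, 3))
```

The same data produce three different slopes, so no single one of them can be read as "the effect of Time".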

Binary variables
There is one exception: If a qualitative variable is binary, we may
label the values as 0 and 1 and then treat it as quantitative.
Labeling values as 1 and 0, 1 and 2, or 7 and 8 is also good.
Labeling values as 1 and −1, 1 and 5, or 4 and 8 is bad.
This is because a regression coefficient measures what happens to the
dependent variable “when that independent variable increases by 1.”
When the binary variable is labeled with 0 and 1, its regression
coefficient $\hat{\beta}_i$ tells us that “if the value changes from 0 to 1 (while all
others remain the same), we expect the dependent variable to increase
by $\hat{\beta}_i$.”
What if we have more than two values?
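The effect of the label spacing can be checked on a toy example. In the Python sketch below the data are hypothetical (seven made-up observations in two groups). With labels one unit apart, the fitted slope equals the difference between the two group means; with labels 1 and 5, the slope is that difference divided by 4.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: group membership and a dependent variable
group = ["no", "no", "no", "yes", "yes", "yes", "yes"]
y = [10, 12, 14, 20, 22, 24, 26]   # group means: 12 ("no") and 23 ("yes")

def slope_with_labels(label_no, label_yes):
    x = [label_yes if g == "yes" else label_no for g in group]
    return slope(x, y)

slope_01 = slope_with_labels(0, 1)   # difference in group means: 23 - 12 = 11
slope_78 = slope_with_labels(7, 8)   # labels still one unit apart: 11
slope_15 = slope_with_labels(1, 5)   # labels four units apart: 11 / 4 = 2.75

print(slope_01, slope_78, slope_15)
```

Labels that are not one unit apart still fit the data equally well, but the coefficient no longer reads directly as the difference between the two groups.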

Indicator variables
Consider a variable $x$ with three values A, B, and C.
We first choose a reference level, say, A.
We then manually create two indicator variables $x_B$ and $x_C$:
$$x_B = \begin{cases} 1 & \text{if } x = B \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad x_C = \begin{cases} 1 & \text{if } x = C \\ 0 & \text{otherwise} \end{cases}$$
In other words, we have a mapping:
x   x_B  x_C
A   0    0
B   1    0
C   0    1
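The mapping above can be written as a small helper. This Python sketch is one possible implementation; the function name make_indicators and the sample values are ours, chosen for illustration.

```python
def make_indicators(values, reference):
    """Create one 0/1 indicator column per non-reference level."""
    levels = sorted(set(values) - {reference})
    return {
        f"x_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# A qualitative variable with three values; A is the reference level
x = ["A", "B", "C", "B", "A"]
indicators = make_indicators(x, reference="A")

for name, column in indicators.items():
    print(name, column)
```

An observation at the reference level has 0 in every indicator column, so no separate column for A is needed.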
