Basic concepts Data visualization Data summarization
Statistics and Data Analysis for Engineers
Part 1:
Introduction and Descriptive Statistics
Ling-Chieh Kung
Department of Information Management
National Taiwan University
September 4, 2016
Introduction and Descriptive Statistics 1 / 62 Ling-Chieh Kung (NTU IM)
What is Statistics?
Many things are unknown...
Consumers’ tastes.
Quality of a product.
Stock prices.
The effectiveness of a new way of teaching/training.
Statistics is the science of collecting, analyzing, interpreting, and
presenting (numerical) data.
Ultimate goal (of Business Statistics): to achieve better decision making.
The study of Statistics includes:
Descriptive Statistics.
Probability.
Inferential Statistics: Estimation.
Inferential Statistics: Hypothesis testing.
Inferential Statistics: Prediction.
In summary: To estimate, test, and predict those unknowns.
My plan for today
Descriptive Statistics.
Visualization and summarization.
Inferential Statistics.
(Probability).
Hypothesis testing and p-value.
Regression analysis.
Case studies.
Road map
Basic concepts.
Data visualization.
Data summarization.
Populations vs. samples
A population is a collection of persons, objects, or items.
A census investigates the whole population.
A sample is a portion of the population.
Sampling investigates only a subset of the population.
We then use the information contained in the sample to make inferences (“guesses”) about the population.
What are samples for the following populations?
All students in NTU.
All students in the business school.
All chips made in one factory.
All consumers who have bought iPhone 6.
Two important questions:
Why sampling?
Is a sample representative?
Descriptive vs. inferential statistics
Descriptive statistics:
Graphical or numerical summaries of data.
Describing (visualizing or summarizing) a set of data.
Inferential statistics:
Making a “scientific guess” on unknowns.
Trying to say something about the population.
Which is descriptive and which is inferential?
Calculating the average height of 1000 randomly selected NTU students.
Using this number to estimate the average height of all NTU students.
Another example (pharmaceutical research):
All the potential patients form the population.
A group of randomly selected patients is a sample.
Use the result on the sample to infer the result on the population.
Parameters vs. statistics
A numerical summary of a population is a parameter.
The average height of all NTU students.
The expected coffee demand when the price is 50 NTD.
A numerical summary of a sample is a statistic.
The average height of all NTU male students.
The average coffee demand when the price is 50 NTD in the past 6 days.
Almost always people use a statistic to infer a parameter.
Some statistics are “good” while some are “bad.”
Parameters vs. statistics: an example
What is the average height of all NTU students?
While a census is possible, it is still quite costly.
It is natural to:
Sample some NTU students.
Calculate a statistic.
Use that statistic to estimate the average height (the parameter).
Some (good or bad) samples and statistics:
The average height of all students in this classroom.
The average height of 100 students randomly drawn from all students.
The maximum height of 100 students randomly drawn from all students.
The sum of heights of 100 students randomly drawn from all students.
The average height of 60 male and 40 female students randomly drawn
from the population.
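A small simulation can make the comparison above concrete. The sketch below is only an illustration: the population is synthetic (heights drawn from a normal distribution with an assumed mean of 170 cm and standard deviation of 8 cm, not real NTU data), and the function name `compare_statistics` is hypothetical.

```python
import random

def compare_statistics(population, sample_size, seed=0):
    """Draw one simple random sample and compute three candidate statistics."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    return {
        "mean": sum(sample) / sample_size,  # a reasonable estimate of the population mean
        "max": max(sample),                 # tends to be far above the population mean
        "sum": sum(sample),                 # grows with the sample size
    }

# Synthetic population of 30,000 heights in cm (assumed values for illustration).
population_rng = random.Random(42)
population = [population_rng.gauss(170, 8) for _ in range(30000)]
population_mean = sum(population) / len(population)

stats = compare_statistics(population, sample_size=100)
print("population mean:", round(population_mean, 1))
print({name: round(value, 1) for name, value in stats.items()})
```

Under these assumptions the sample mean lands close to the population mean, while the maximum and the sum do not estimate it at all.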
Levels of data measurement
Most data we will play with are numerical.
Numerical data may be categorized into three levels:
Nominal.
Ordinal.
Quantitative: interval or ratio.
Nominal level
A nominal scale classifies data into categories with no ranking.
Data are labels or names used to identify an attribute of the element.
The label may be numeric or non-numeric.
Examples:
Categorical variable    Values (categories)
Laptop ownership        Yes / No
Citizenship             Taiwan / Japan / ...
Country code            886 / 86 / 1 / ...
Arithmetic operations cannot be applied on nominal data.
Ordinal level
An ordinal scale classifies data into categories with ranking.
The order or rank of the data is meaningful.
However, differences between numerical labels do not imply
distances.
Examples:
Categorical variable    Values (categories)
Product satisfaction    Satisfied, neutral, unsatisfied
Professor rank          Full, associate, assistant
Ranking of scores       1, 2, 3, 4, ...
It is still not meaningful to do arithmetic on ordinal data.
Assistant + associate = full?!
The grade difference between no. 1 and no. 5 may not be equal to that
between no. 11 and no. 15.
Quantitative (interval and ratio) levels
An interval scale is an ordered scale in which the difference between
measurements is a meaningful quantity but the measurements do not
have a true zero point.
A ratio scale is an ordered scale in which the difference between
measurements is a meaningful quantity and the measurements have a
true zero point.
Ratio data appear more often in the world.
Heights, weights, income, prices.
Interval data are actually rare.
Degrees in Celsius or Fahrenheit.
GRE or GMAT scores.
How about degrees in Kelvin?
Some remarks
Nominal and ordinal data are called qualitative data.
Interval and ratio data are called quantitative data.
Most statistical methods are for quantitative data; some are for
qualitative data.
Distinguishing nominal and ordinal scales is important.
Distinguishing interval and ratio scales is not.
Sometimes qualitative data are called categorical data.
Sometimes quantitative data are called numeric data.
A short summary
Understand these terms:
Populations vs. samples.
Parameters vs. statistics.
Inferential statistics vs. descriptive statistics.
For each scale of measurement, is it meaningful to calculate the
following numbers?
Level Ranking Distance
Nominal No No
Ordinal Yes No
Quantitative Yes Yes
Road map
Basic concepts.
Data visualization.
Data summarization.
An example
For each day in 2011 and 2012, we record
the number of daily rentals of the public
bike rental system in Washington, D.C.
985, 801, 1349, 1562, 1600, 1606, 1510, ...,
1341, 1796, and 2729.
The smallest and largest numbers are 22
and 8714, respectively.
How do we get a feel for 731 numbers?
date rental
2011/1/1 985
2011/1/2 801
2011/1/3 1349
2011/1/4 1562
2011/1/5 1600
2011/1/6 1606
2011/1/7 1510
...
2012/12/29 1341
2012/12/30 1796
2012/12/31 2729
Frequency distributions
The original 731 numbers form a set of ungrouped data.
We start by grouping them into a frequency distribution.
Grouped data are presented in the form of class intervals and frequencies.
Let’s create an intuitive frequency distribution.
Frequency distributions: an example
The resulting classes:
Class Class interval (Which means)
1 [0, 1000) 0 ≤ x < 1000
2 [1000, 2000) 1000 ≤ x < 2000
3 [2000, 3000) 2000 ≤ x < 3000
...
8 [7000, 8000) 7000 ≤ x < 8000
9 [8000, 9000) 8000 ≤ x < 9000
How about [0, 999], [1000, 1999], etc.?
How about (0, 1000], (1000, 2000], etc.?
Frequency distributions: an example
Then we count to get the frequency
distribution at the right.
This is a set of grouped data.
Some remarks:
Typically we have 5 to 15 classes.
Typically all classes have the same
width.
Be aware of class endpoints! Classes
should NOT overlap with each other.
If there are outliers, they should be
removed first.
Class interval Frequency
[0, 1000) 18
[1000, 2000) 80
[2000, 3000) 74
[3000, 4000) 107
[4000, 5000) 166
[5000, 6000) 106
[6000, 7000) 86
[7000, 8000) 82
[8000, 9000) 12
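The counting step can be sketched in a few lines. This is a minimal illustration, assuming equal-width, left-closed and right-open classes as in the table; the function name `frequency_distribution` is hypothetical, and only the ten rental counts quoted earlier are used because the full 731-value data set is not reproduced here.

```python
from collections import Counter

def frequency_distribution(values, width=1000):
    """Count values in classes [k*width, (k+1)*width); empty classes are omitted."""
    counts = Counter((v // width) * width for v in values)
    return {(low, low + width): counts[low] for low in sorted(counts)}

# The ten daily rental counts quoted in the text (the full data set has 731 values).
rentals = [985, 801, 1349, 1562, 1600, 1606, 1510, 1341, 1796, 2729]

for (low, high), freq in frequency_distribution(rentals).items():
    print(f"[{low}, {high})  {freq}")
```

Using integer division to find the class start makes the endpoints unambiguous: a value of exactly 1000 falls in [1000, 2000), never in two classes.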
Something more
We may add class midpoints, relative frequencies, and
cumulative frequencies into a frequency table:
Class interval   Frequency   Class midpoint   Relative frequency   Cumulative frequency
[0, 1000)        18          500              2.46%                18
[1000, 2000)     80          1500             10.94%               98
[2000, 3000)     74          2500             10.12%               172
[3000, 4000)     107         3500             14.64%               279
[4000, 5000)     166         4500             22.71%               445
[5000, 6000)     106         5500             14.50%               551
[6000, 7000)     86          6500             11.76%               637
[7000, 8000)     82          7500             11.22%               719
[8000, 9000)     12          8500             1.64%                731
How about cumulative relative frequencies?
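The extra columns, including the cumulative relative frequencies asked about above, follow directly from the class frequencies. A minimal sketch using the frequencies from the table (the variable names are illustrative):

```python
from itertools import accumulate

class_starts = [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
frequencies = [18, 80, 74, 107, 166, 106, 86, 82, 12]
width = 1000

total = sum(frequencies)                                 # number of observations
midpoints = [start + width / 2 for start in class_starts]
relative = [f / total for f in frequencies]              # share of each class
cumulative = list(accumulate(frequencies))               # running total of frequencies
cumulative_relative = [c / total for c in cumulative]    # running share, ends at 1

for start, f, m, r, c, cr in zip(class_starts, frequencies, midpoints,
                                 relative, cumulative, cumulative_relative):
    print(f"[{start}, {start + width})  {f:4d}  {m:6.0f}  {r:7.2%}  {c:4d}  {cr:7.2%}")
```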
Histograms
A frequency distribution may be depicted as a histogram.
Interval Freq.
[0, 1000) 18
[1000, 2000) 80
[2000, 3000) 74
[3000, 4000) 107
[4000, 5000) 166
[5000, 6000) 106
[6000, 7000) 86
[7000, 8000) 82
[8000, 9000) 12
It consists of a series of contiguous rectangles, each representing the
frequency in a class.
Histograms
Histograms may be the most important type of data graphs.
One particular reason to draw histograms is to get some ideas about
the distribution.
Bell shape? M shape? Skewed?
Any outlier?
We will discuss distributions in more detail.
Frequency polygons
Alternatively, we may draw a frequency polygon by using line
segments connecting dots plotted at class midpoints.
The information contained in a frequency polygon is quite similar to that
contained in a histogram.
Frequency polygons
It is more convenient to use a frequency polygon to compare
multiple frequency distributions.
Both: Uni-modal and symmetric.
2011: Bi-modal and skewed to the right (right-tailed).
2012: Uni-modal and skewed to the left (left-tailed).
Warning: People may misinterpret a frequency polygon as a line
chart (for data with a time sequence).
Line charts
A line chart is useful in depicting a time series data set.
A two-dimensional data set whose first dimension (the x-axis) is for
labels of time points.
It visualizes how a quantity changes as time goes by.
For our monthly bike rentals:
Pie charts
A pie chart is a circular depiction of data where each slice represents
the percentage of the corresponding category.
It visualizes relative frequency distributions well.
For our bike rental data set:
What are the proportions of rentals in the four seasons?
What are the proportions of rentals on the seven days of a week?
A pie chart for seasonal rentals
Season Total rentals Proportion
Winter (12/20-3/20) 471348 14.3%
Spring (3/21-6/20) 918589 27.9%
Summer (6/21-9/20) 1061129 32.2%
Fall (9/21-12/20) 841613 25.6%
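The proportions in the table are each season's subtotal divided by the overall total. A short sketch that recomputes them from the totals above (variable names are illustrative):

```python
season_totals = {
    "Winter": 471348,
    "Spring": 918589,
    "Summer": 1061129,
    "Fall": 841613,
}

grand_total = sum(season_totals.values())
# Each slice of the pie is a subtotal over the overall total.
proportions = {season: total / grand_total for season, total in season_totals.items()}

for season, share in proportions.items():
    print(f"{season:<7} {share:.1%}")
```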
A pie chart for rentals among weekdays
Day Total rentals
Sunday 444027
Monday 455503
Tuesday 469109
Wednesday 473048
Thursday 485395
Friday 487790
Saturday 477807
Data not appropriate for pie charts
Pie charts are used to visualize proportions, i.e., subtotals over the
overall total.
They should not be used to compare averages.
The total numbers of rentals made by male and female users are
appropriate for a pie chart.
The average numbers of rentals per male and female users are not
appropriate for a pie chart.
Bar charts
Pie charts are useful in visualizing the proportion of each category.
In demonstrating the differences among categories, a bar chart is a
better choice.
The larger the category, the longer the bar.
Some people draw bars vertically; some horizontally.
Bar charts
Let’s replace the pie chart with a bar chart.
Day Total rentals
Sunday 444027
Monday 455503
Tuesday 469109
Wednesday 473048
Thursday 485395
Friday 487790
Saturday 477807
Note that the y-axis does not start at 0!
Bar charts vs. histograms
What are the differences that distinguish a bar chart from a histogram?
A bar chart uses noncontiguous bars to visualize categorical data.
A histogram uses contiguous bars to visualize quantitative data.
Visualizing two variables
When we have data for two variables, typically we want to identify
whether there is any relationship between them.
Visualizing the data in a two-dimensional manner helps.
When the two variables are both measured in quantitative scales, we may
depict each observation as a point on a plane to create a scatter plot.
For our bike rental example:
How do monthly rentals in 2011 and those in 2012 relate to each other?
How do daily casual and registered rentals relate to each other?
Monthly rentals in 2011 and 2012
Month 2011 2012
1 38189 96744
2 48215 103137
3 64045 164875
4 94870 174224
5 135821 195865
6 143512 202830
7 141341 203607
8 136691 214503
9 127418 218573
10 123511 198841
11 102167 152664
12 87323 123713
Road map
Basic concepts.
Data visualization.
Data summarization.
Summarizing the data with numbers
Descriptive Statistics includes some common ways to describe data.
Summarization with numbers.
Visualization with graphs.
This is always the first step of any data analysis project: To get
intuitions that guide our directions.
Here we talk about summarization.
For a set of (a lot of) numbers, we use a few numbers to summarize them.
For a population: these numbers are parameters.
For a sample: these numbers are statistics.
We will talk about three things:
Measures of central tendency for the center or middle part of data.
Measures of variability for how variable the data are.
Measures of correlation for the relationship between two variables.
Medians
The median is the middle value in an ordered set of numbers.
Roughly speaking, half of the numbers are below and half are above it.
Suppose there are N numbers:
If N is odd, the median is the $\frac{N+1}{2}$th largest number.
If N is even, the median is the average of the $\frac{N}{2}$th and the $\left(\frac{N}{2}+1\right)$th largest numbers.
For example:
The median of {1, 2, 4, 5, 6, 8, 9} is 5.
The median of {1, 2, 4, 5, 6, 8} is $\frac{4+5}{2} = 4.5$.
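The odd and even cases above can be sketched as a short function. This is a hand-written illustration (Python's standard library also offers `statistics.median`); the name `median_of` is hypothetical.

```python
def median_of(values):
    """Return the middle value of the ordered data (average of the two middle values if N is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # N odd: the (N+1)/2-th value, which is index N//2 when counting from 0.
        return ordered[mid]
    # N even: average of the N/2-th and (N/2 + 1)-th values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([1, 2, 4, 5, 6, 8, 9]))  # 5
print(median_of([1, 2, 4, 5, 6, 8]))     # 4.5
```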
Medians
A median is unaffected by the magnitude of extreme values:
The median of {1, 2, 4, 5, 6, 8, 9} is 5.
The median of {1, 2, 4, 5, 6, 8, 900} is still 5.
Medians may be calculated from quantitative or ordinal data.
It cannot be calculated from nominal data.
Unfortunately, a median uses only part of the information contained in
these numbers.
For quantitative data, a median only treats them as ordinal.
Means
The mean is the average of a set of data.
Can be calculated only from quantitative data.
The mean of {1, 2, 4, 5, 6, 8, 9} is $\frac{1 + 2 + 4 + 5 + 6 + 8 + 9}{7} = 5$.
A mean uses all the information contained in the numbers.
Unfortunately, a mean will be affected by extreme values.
The mean of {1, 2, 4, 5, 6, 8, 900} is $\frac{1+2+4+5+6+8+900}{7} \approx 132.29$!
Using the mean and median simultaneously can be a good idea.
We should try to identify outliers (extreme values that seem
“strange”) before calculating a mean (or any other statistic).
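Reporting the mean and the median side by side makes the effect of an extreme value visible. A minimal sketch using Python's standard `statistics` module and the two data sets from the slides:

```python
from statistics import mean, median

clean = [1, 2, 4, 5, 6, 8, 9]
with_outlier = [1, 2, 4, 5, 6, 8, 900]

# The mean moves a lot when 9 is replaced by 900; the median does not move at all.
print("clean:       ", mean(clean), median(clean))
print("with outlier:", round(mean(with_outlier), 2), median(with_outlier))
```

A large gap between the two numbers is a hint that the data are skewed or contain extreme values worth a second look.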
Population means vs. sample means
Let $\{x_i\}_{i=1,...,N}$ be a population with N as the population size. The
population mean is
$$\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}.$$
Let $\{x_i\}_{i=1,...,n}$ be a sample with n < N as the sample size. The
sample mean is
$$\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}.$$
People use $\mu$ and $\bar{x}$ in almost the whole statistics world.
Population means vs. sample means
$$\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}, \qquad \bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}.$$
Aren’t these two means the same?
From the perspective of calculation, yes.
From the perspective of statistical inference, no.
Typically the population mean is fixed but unknown.
The sample mean is random: We may get different values of $\bar{x}$ today
and tomorrow.
To start from $\bar{x}$ and use inferential statistics to estimate or test $\mu$, we
need to apply probability.
Quartiles and percentiles
The median lies at the middle of the data.
The first quartile lies at the middle of the first half of the data.
The third quartile lies at the middle of the second half of the data.
For the pth percentile:
A fraction $\frac{p}{100}$ of the values are below it.
A fraction $1 - \frac{p}{100}$ of the values are above it.
Median, quartiles, and percentiles:
The 25th percentile is the first quartile.
The 50th percentile is the median (and the second quartile).
The 75th percentile is the third quartile.
Modes
The mode(s) is (are) the most frequently occurring value(s) in a set
of qualitative data.
In the set {A, A, A, B, B, C, D, E, F, F, F, G, H}, the modes are A and F.
The frequency of each mode (A and F) is 3.
Though the above definition may also be applied to quantitative data,
sometimes it is useless.
In many cases, all values are modes!
For quantitative data, we instead look for the modal class(es).
Modal classes
In a baseball team, players’ heights
(in cm) are:
178 172 175 184
172 175 165 178
177 175 180 182
177 183 180 178
179 162 170 171
For the classes [160, 165), [165, 170),
..., and [185, 190), the modal class is
[175, 180).
We sometimes say the mode of this
set is 177.5.
The way of grouping matters!
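To see how the grouping changes the answer, the sketch below finds the modal class of the 20 heights for two different class widths. The function name `modal_classes` and the choice of a 4 cm width for the second run are illustrative assumptions.

```python
from collections import Counter

heights = [178, 172, 175, 184, 172, 175, 165, 178, 177, 175,
           180, 182, 177, 183, 180, 178, 179, 162, 170, 171]

def modal_classes(values, width, start):
    """Return the most frequent class(es) [low, low + width) and their frequency."""
    counts = Counter(start + ((v - start) // width) * width for v in values)
    top = max(counts.values())
    classes = [(low, low + width) for low in sorted(counts) if counts[low] == top]
    return classes, top

print(modal_classes(heights, width=5, start=160))  # classes [160, 165), [165, 170), ...
print(modal_classes(heights, width=4, start=160))  # classes [160, 164), [164, 168), ...
```

With 5 cm classes the modal class is [175, 180) with 9 players; with 4 cm classes it becomes [176, 180) with 6 players, so the reported "mode" shifts from 177.5 to 178.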
Variability
Measures of variability describe the spread or dispersion of a set
of data.
Especially important when two sets of data have the same center.
Ranges and Interquartile ranges
The range of a set of data $\{x_i\}_{i=1,...,N}$ is the difference between the
maximum and minimum numbers, i.e.,
$$\max_{i=1,...,N}\{x_i\} - \min_{i=1,...,N}\{x_i\}.$$
The interquartile range of a set of data is the difference between the third
and first quartiles.
It is the range of the middle 50% of the data.
It excludes the effects of extreme values.
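Both measures can be computed with the standard library. This is a sketch under one assumption worth stating loudly: there are several conventions for locating quartiles, and `statistics.quantiles(..., method="inclusive")` is only one of them, so other tools may report slightly different quartiles for the same data.

```python
from statistics import quantiles

def range_and_iqr(values):
    """Return (range, interquartile range) using the 'inclusive' quartile convention."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return max(values) - min(values), q3 - q1

print(range_and_iqr([1, 2, 4, 5, 6, 8, 9]))    # range 8, IQR 4.0
print(range_and_iqr([1, 2, 4, 5, 6, 8, 900]))  # range 899, IQR still 4.0
```

Replacing 9 with 900 inflates the range but leaves the interquartile range unchanged, which is what "excludes the effects of extreme values" means in practice.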
Deviations from the mean
Consider a set of population data $\{x_i\}_{i=1,...,N}$ with mean $\mu$.
Intuitively, a way to measure the dispersion is to examine how each number
deviates from the mean.
For $x_i$, the deviation from the population mean is defined as $x_i - \mu$.
For a sample, the deviation from the sample mean of $x_i$ is $x_i - \bar{x}$.
i      x_i    deviation
1      1      1 − 5 = −4
2      2      2 − 5 = −3
3      4      4 − 5 = −1
4      5      5 − 5 = 0
5      6      6 − 5 = 1
6      8      8 − 5 = 3
7      9      9 − 5 = 4
Mean   5
Mean deviations
May we combine the N deviations into a single number that summarizes the
aggregate deviation?
Intuitively, we may sum them up and then calculate the mean deviation:
$$\frac{\sum_{i=1}^{N}(x_i - \mu)}{N}.$$
Is it always 0?
i      x_i    deviation
1      1      1 − 5 = −4
2      2      2 − 5 = −3
3      4      4 − 5 = −1
4      5      5 − 5 = 0
5      6      6 − 5 = 1
6      8      8 − 5 = 3
7      9      9 − 5 = 4
Mean   5      0
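The mean deviation is indeed always 0, for any data set. A short derivation, using only the definition of the population mean, $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$:

$$\frac{\sum_{i=1}^{N}(x_i - \mu)}{N} = \frac{\sum_{i=1}^{N} x_i - N\mu}{N} = \frac{N\mu - N\mu}{N} = 0.$$

Positive and negative deviations cancel exactly, so this number carries no information about dispersion.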
Adjusting mean deviations
People use two ways to adjust mean deviations:
Mean absolute deviations/errors (MAD):
$$\frac{\sum_{i=1}^{N} |x_i - \mu|}{N}.$$
Mean squared deviations/errors (variance or MSE):
$$\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}.$$
A larger MAD or variance means that the data are more dispersed.
i      x_i    d_i    |d_i|   d_i^2
1      1      −4     4       16
2      2      −3     3       9
3      4      −1     1       1
4      5      0      0       0
5      6      1      1       1
6      8      3      3       9
7      9      4      4       16
Mean   5      0      2.29    7.43
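The two formulas translate directly into code. A minimal sketch that reproduces the last row of the table; the function names are illustrative, and both functions use the population formulas (dividing by N).

```python
def mean_absolute_deviation(values):
    """Average of |x_i - mu| over all N values."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)

def population_variance(values):
    """Average of (x_i - mu)^2 over all N values."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

data = [1, 2, 4, 5, 6, 8, 9]
print(round(mean_absolute_deviation(data), 2))  # 2.29
print(round(population_variance(data), 2))      # 7.43
```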
MAD vs. variance
The main difference:
An MAD puts the same weight on all values.
A variance puts more weight on extreme values.
They may give different ranks of dispersion:
i      x_i    d_i    |d_i|   d_i^2
1      0      −5     5       25
2      4      −1     1       1
3      5      0      0       0
4      6      1      1       1
5      10     5      5       25
Mean   5      0      2.4     10.4

i      x_i    d_i    |d_i|   d_i^2
1      1      −4     4       16
2      2      −3     3       9
3      5      0      0       0
4      8      3      3       9
5      9      4      4       16
Mean   5      0      2.8     10
In general, people use variances more than MADs.
But MADs are still popular in some areas, e.g., demand forecasting.
It is the analyst’s discretion to choose the appropriate one.
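The rank reversal between the two tables can be checked numerically. A small sketch using the standard `statistics` module; the helper name `mad` is illustrative and computes the mean absolute deviation from the mean.

```python
from statistics import fmean, pvariance

def mad(values):
    """Mean absolute deviation from the mean."""
    mu = fmean(values)
    return fmean([abs(x - mu) for x in values])

set_a = [0, 4, 5, 6, 10]  # two far-out values, the rest close to the mean
set_b = [1, 2, 5, 8, 9]   # four moderately distant values

print("MAD:     ", mad(set_a), mad(set_b))
print("variance:", pvariance(set_a), pvariance(set_b))
```

By MAD the second set is more dispersed, while by variance the first set is, because squaring gives the two extreme values 0 and 10 more weight.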
Standard deviations
One drawback of using variances is that the unit of measurement is the
square of the original one.
For the baseball team, the variance of
member heights is 30.73 cm². What is it?!
People take the square root of a variance
to generate a standard deviation.
The standard deviation of member heights
is $\sqrt{30.73} \approx 5.54$ cm.
178 172 175 184
172 175 165 178
177 175 180 182
177 183 180 178
179 162 170 171
A standard deviation typically has more managerial implications.
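Population vs. sample variances
Recall that the formulas for population and sample means are
$$\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N} \quad \text{and} \quad \bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n},$$
respectively.
Formula-wise there is no difference.
However, population and sample variances are
$$\sigma^2 \equiv \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} \quad \text{and} \quad s^2 \equiv \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1},$$
respectively.
Note the difference between N and n − 1!
Population and sample standard deviations are
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}} \quad \text{and} \quad s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}},$$
respectively.
People use $\sigma^2$, $\sigma$, $s^2$, and $s$ in almost the whole statistics world.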
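Python's standard `statistics` module exposes both versions, which makes the N versus n − 1 difference easy to see. In the sketch below the same seven numbers are treated first as a whole population and then as a sample; that double use is only for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [1, 2, 4, 5, 6, 8, 9]

# Population formulas divide the sum of squared deviations (52) by N = 7.
print("sigma^2 =", round(pvariance(data), 4), " sigma =", round(pstdev(data), 4))

# Sample formulas divide the same sum by n - 1 = 6, so the results are larger.
print("s^2     =", round(variance(data), 4), " s     =", round(stdev(data), 4))
```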
Coefficient of variation
The coefficient of variation is the ratio of the standard deviation to
the mean:
$$\text{Coefficient of variation} = \frac{\sigma}{\mu}.$$
When will you use coefficients of variation?
z-scores
Consider a set of sample data $\{x_i\}_{i=1,...,n}$ with sample mean $\bar{x}$ and
sample standard deviation $s$. For $x_i$, the z-score is
$$z_i = \frac{x_i - \bar{x}}{s}.$$
In a set of population data $\{x_i\}_{i=1,...,N}$ with population mean $\mu$ and
population standard deviation $\sigma$, the z-score of $x_i$ is
$$z_i = \frac{x_i - \mu}{\sigma}.$$
A value’s z-score measures how many standard deviations it
deviates from the mean.
z-scores vs. outliers
For detecting outliers, one common rule is to double-check whether $x_i$ is
an outlier if
$$|z_i| = \left|\frac{x_i - \mu}{\sigma}\right| > 3.$$
It is quite rare for the magnitude of a value’s z-score to be so large.
For sample data, use $\frac{x_i - \bar{x}}{s}$.
Some people propose using the median and MAD in a similar way:
double-check whether $x_i$ is an outlier if¹
$$\frac{|x_i - \text{median}|}{\text{MAD}} > 3.$$
The above rules only suggest that one investigate some extreme values
again. These rules are neither sufficient nor necessary for outliers.
¹ The “MAD” here can be mean absolute deviation from mean, mean absolute
deviation from median, median absolute deviation from median, etc.
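The two rules can disagree, especially on small data sets. The sketch below applies both to the seven-number example with 900 in it. Assumptions: population formulas are used for the z-scores, "MAD" is taken to be the median absolute deviation from the median (one of the options in the footnote), and the function names are illustrative. The median-based score is undefined when the MAD is 0, which this sketch does not handle.

```python
from statistics import mean, median, pstdev

def z_scores(values):
    """z_i = (x_i - mu) / sigma with the population mean and standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

def median_mad_scores(values):
    """|x_i - median| / MAD, with MAD = median absolute deviation from the median."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    return [abs(x - med) / mad for x in values]

data = [1, 2, 4, 5, 6, 8, 900]
flagged_by_z = [x for x, z in zip(data, z_scores(data)) if abs(z) > 3]
flagged_by_mad = [x for x, s in zip(data, median_mad_scores(data)) if s > 3]
print(flagged_by_z)    # []
print(flagged_by_mad)  # [900]
```

Here the value 900 inflates the standard deviation so much that its own z-score stays below 3, while the median-based rule flags it. This is one concrete way in which the z-score rule is not necessary for a value to deserve a second look.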
Correlation
Consider the size of a house and its price in a city:
Size (in m²)   Price (in $1000)
75             315
59             229
85             355
65             261
72             234
46             216
107            308
91             306
75             289
65             204
88             265
59             195
How do we measure/describe the correlation (linear relationship)
between the two variables?
Intuition
Consider a set of paired data $\{(x_i, y_i)\}_{i=1,...,N}$.
When one variable goes up, does the other one tend to go up or down?
More precisely, if $x_i$ is larger than $\mu_x$ (the mean of the $x_i$s), is it more
likely to see $y_i > \mu_y$ or $y_i < \mu_y$?
If $y_i > \mu_y$ is more likely, we say that the two variables have
a positive correlation.
If one goes up when the other goes down, there is a negative
correlation.
Covariances
We define the covariance of a set of two-dimensional (sample) data as
$$s_{xy} \equiv \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}.$$
If most points fall in the first and third quadrants (with the origin placed at $(\bar{x}, \bar{y})$), most
$(x_i - \bar{x})(y_i - \bar{y})$ will be positive and $s_{xy}$ tends to be positive.
Otherwise, $s_{xy}$ tends to be negative.
So the covariance of house size and price is 617.16.
Is it large or small?
This depends on how variable the two variables themselves are.
Pearson’s correlation coefficients
To take away the variability of each variable itself, we define the
sample correlation coefficient as
$$r \equiv \frac{s_{xy}}{s_x s_y},$$
where $s_x$ and $s_y$ are the sample standard deviations of the $x_i$s and $y_i$s.
The population correlation coefficient is defined analogously.
In our example, we have $r = \frac{617.16}{16.78 \times 50.45} \approx 0.729$.
It can be shown that we always have $-1 \le r \le 1$.
r > 0: Positive correlation.
r = 0: No correlation.
r < 0: Negative correlation.
People often determine the degree of correlation based on |r|:
0 ≤ |r| < 0.25: A weak correlation.
0.25 ≤ |r| < 0.5: A moderately weak correlation.
0.5 ≤ |r| < 0.75: A moderately strong correlation.
0.75 ≤ |r| ≤ 1: A strong correlation.
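The covariance and correlation coefficient quoted for the house example can be recomputed from the twelve (size, price) pairs. A minimal sketch that follows the sample formulas (dividing by n − 1); the function names are illustrative.

```python
from math import sqrt

sizes = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]              # in m^2
prices = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]  # in $1000

def sample_covariance(xs, ys):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = sqrt(sample_covariance(xs, xs))  # the covariance of x with itself is its variance
    s_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

print(round(sample_covariance(sizes, prices), 2))   # 617.16
print(round(sample_correlation(sizes, prices), 3))  # 0.729
```

By the thresholds listed above, 0.729 counts as a moderately strong positive correlation.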
Correlation vs. independence
A correlation coefficient only measures how one variable linearly
depends on the other variable.
[Figure: two data sets, one with r = 0.5973 and one with r = 0]
Being uncorrelated does not mean being independent!
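A tiny constructed example shows how two variables can be perfectly dependent yet uncorrelated. The data below are made up for illustration (they are not the data behind the figure): y is completely determined by x through y = x², but the relationship is not linear.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient: covariance over the product of standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)  # the n - 1 factors cancel

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # y depends on x exactly, but symmetrically around 0

print(pearson_r(xs, ys))
```

The positive and negative products cancel, so r is 0 even though knowing x tells us y exactly.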
Correlation vs. causation
A correlation coefficient only measures whether two variables correlate
with each other. High correlation does not mean causation.
A causes B or B causes A? C causes A and B? Or just by chance?
Correlation of qualitative variables
Sometimes the variables are not quantitative/numeric.
For ordinal data, we calculate their Spearman’s rank correlation.
For nominal data, we calculate Cramer’s V.
Sampling Sampling distributions Hypothesis testing p-value, t test, and more
Statistics and Data Analysis for Engineers
Part 2:
Hypothesis Testing and p-value
Ling-Chieh Kung
Department of Information Management
National Taiwan University
September 4, 2016
Hypothesis Testing and p-value 1 / 71 Ling-Chieh Kung (NTU IM)
Road map
Sampling.
Sampling distributions.
Hypothesis testing.
p-value, t test, and more.
Random vs. nonrandom sampling
Sampling is the process of selecting a subset of entities from the whole
population.
Sampling can be random or nonrandom.
If random, whether an entity is selected is probabilistic.
Randomly select 1000 phone numbers on the telephone book and then
call them.
If nonrandom, it is deterministic.
Ask all your classmates for their preferences on iOS/Android.
Most statistical methods are only for random sampling.
Some popular random sampling techniques:
Simple random sampling.
Stratified random sampling.
Cluster (or area) random sampling.
Simple random sampling
In simple random sampling, each entity has the same probability of
being selected.
The good part of simple random sampling is that it is simple.
However, it may result in nonrepresentative samples.
In simple random sampling, it is possible that too many of the
data we sample fall in the same stratum.
They have the same property.
E.g., it is possible that all randomly sampled voters are younger than 40.
The sample is thus nonrepresentative.
How to fix this problem?
Stratified random sampling
We may apply stratified random sampling.
We first split the whole population into several strata.
Data in one stratum should be (relatively) homogeneous.
Data in different strata should be (relatively) heterogeneous.
We then use simple random sampling for each stratum.
Stratified random sampling
As an example, suppose that we want to sample 40 out of 1000
graduates to understand the number of credits they get at school.
Suppose that 100 students double majored. We can then split the whole
population into two strata:
Stratum            Stratum size
Double major       100
No double major    900
To sample 40 graduates, we sample $40 \times \frac{100}{1000} = 4$ from the
double-major stratum and 36 from the other stratum.
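The proportional allocation above can be sketched in a few lines of Python. This is only an illustration: the function names and the graduate IDs are hypothetical, and simple rounding is assumed to be good enough (in general the rounded quotas may not add up exactly to the requested sample size).

```python
import random

def proportional_allocation(strata_sizes, sample_size):
    """Number of entities to draw from each stratum, proportional to its size."""
    total = sum(strata_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}

def stratified_sample(strata, sample_size, seed=None):
    """Apply simple random sampling inside every stratum."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    quota = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(members, quota[name])
            for name, members in strata.items()}

# Hypothetical graduate IDs: 0-99 double majored, 100-999 did not.
strata = {
    "double major": list(range(100)),
    "no double major": list(range(100, 1000)),
}
allocation = proportional_allocation(
    {name: len(members) for name, members in strata.items()}, 40)
sample = stratified_sample(strata, 40, seed=1)
print(allocation)
```

With 100 and 900 graduates in the two strata, the allocation is 4 and 36, as in the slide.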
Stratified random sampling
We may further split the population into more strata.
Double major: Yes or no.
Class: 1994-1998, 1999-2003, 2004-2008, or 2009-2012.
This stratification makes sense only if students in different classes tend
to take different numbers of units.
Stratified random sampling is good at reducing sampling error.
But it can be hard to identify a reasonable stratification.
It is also more costly and time-consuming.
Cluster (or area) random sampling
Imagine that you are going to introduce a new product into all the
retail stores in Taiwan.
If the product is actually unpopular, an introduction with a large
quantity will incur a huge loss.
How to get an idea about the popularity?
Typically we first try to introduce the product in a small area. We
put the product on the shelves only in those stores in the specified area.
This is the idea of cluster (or area) random sampling.
Those consumers in the area form a sample.
Cluster (or area) random sampling
In cluster random sampling, we define clusters.
We will only choose one or some clusters and then collect all the
data in these clusters.
If a cluster is too large, we may further split it into multiple
second-stage clusters.
Therefore, we want data in a cluster to be heterogeneous, and data
across clusters somewhat homogeneous.
For example, people may do cluster random sampling to understand
the popularity of a new product. Those chosen cities (counties, states,
etc.) are called test market cities (counties, states, etc.).
People use cluster random sampling in this case because of its feasibility
and convenience.
We should select test market cities whose population profiles are similar
to that of the entire country.
Nonrandom sampling
Sometimes we do nonrandom sampling.
Convenience sampling.
The researcher samples data that are easy to sample.
Judgment sampling.
The researcher decides who to ask or what data to collect.
Quota sampling.
In each stratum, we use whatever method is easiest to fill the quota, a
predetermined number of samples in the stratum.
Snowball sampling.
Once we ask one person, we ask her/him to suggest others.
Nonrandom sampling cannot be analyzed by the statistical methods
we introduce in this course.
Sampling distributions
When we cannot examine the whole population, we study a sample.
What will be contained in a random sample is unpredictable.
We need to know the probability distribution of a sample so that we
may connect the sample with the population.
The probability distribution of a sample is a sampling distribution.
Sampling distributions
A factory produces bags of candies. Ideally, each bag should weigh 2
kg. As the production process cannot be perfect, a bag of candies
should weigh between 1.8 and 2.2 kg.
Let X be the weight of a bag of candies. Let µ and σ be its expected
value and standard deviation.
Is µ = 2?
Is 1.8 < µ < 2.2?
How large is σ?
Let’s sample:
In a random sample of 1 bag of candies, suppose it weighs 2.1 kg. May
we conclude that 1.8 < µ < 2.2?
What if the average weight of 5 bags in a random sample is 2.1 kg?
What if the sample size is 10, 50, or 100?
What if the mean is 2.3 kg?
We need to know the sampling distribution of those statistics (sample
mean, sample standard deviation, etc.).
Sample means
The sample mean is one of the most important statistics.
Definition 1
Let $\{X_i\}_{i=1,\dots,n}$ be a sample from a population, then
$$\bar{x} = \frac{\sum_{i=1}^{n} X_i}{n}$$
is the sample mean.
Sometimes we write $\bar{x}_n$ to emphasize that the sample size is n.
We assume that $X_i$ and $X_j$ are independent for all $i \neq j$.
This is fine if $n \ll N$, i.e., we sample a few items from a large population.
In practice, we require $n \leq 0.05N$.
Means and variances of sample means
Suppose the population mean and variance are µ and $\sigma^2$, respectively.
These two numbers are fixed.
A sample mean $\bar{x}$ is a random variable.
It has its expected value $E[\bar{x}]$, variance $\mathrm{Var}(\bar{x})$, and standard deviation
$\sqrt{\mathrm{Var}(\bar{x})}$. These numbers are all fixed.
They are also denoted as $\mu_{\bar{x}}$, $\sigma^2_{\bar{x}}$, and $\sigma_{\bar{x}}$, respectively.
For any population, we have the following theorem:
Proposition 1 (Mean and variance of a sample mean)
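Let $\{X_i\}_{i=1,\dots,n}$ be a size-n random sample from a population with
mean µ and variance $\sigma^2$, then we have
$$\mu_{\bar{x}} = \mu, \quad \sigma^2_{\bar{x}} = \frac{\sigma^2}{n}, \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}.$$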
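Proposition 1 can be checked numerically. The sketch below is an illustration only: it assumes a Uniform(0, 1) population (mean 1/2, variance 1/12), which is not part of the slides, and the function name is hypothetical. It draws many samples of size n and compares the mean and variance of the resulting sample means with µ and σ²/n.

```python
import random
import statistics

def simulate_sample_means(n, replications, seed=0):
    """Draw `replications` samples of size n from Uniform(0, 1); return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.random() for _ in range(n))
            for _ in range(replications)]

mu, sigma2 = 0.5, 1 / 12      # mean and variance of the assumed Uniform(0, 1) population
n = 25
means = simulate_sample_means(n, replications=20000)

# The first number in each line is simulated; the second is the theoretical value.
print(statistics.fmean(means), mu)
print(statistics.variance(means), sigma2 / n)
```

Note that the population here is not normal; the proposition about the mean and variance holds for any population with finite variance.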
Means and variances of sample means
Do the terms confuse you?
The sample mean vs. the mean of the sample mean.
The sample variance vs. the variance of the sample mean.
By definition, they are:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} X_i$; a random variable.
$E[\bar{x}]$; a constant.
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{x})^2$; a random variable.
$\mathrm{Var}(\bar{x})$; a constant.
The sample variance also has its mean and variance.
Example: Quality inspection
The weight of a bag of candies follows a normal distribution with mean
µ = 2 and standard deviation σ = 0.2.
Suppose the quality control officer decides to sample 4 bags and
calculate the sample mean $\bar{x}$. She will punish me if $\bar{x} \notin [1.8, 2.2]$.
Note that my production process is actually “good”: µ = 2.
Unfortunately, it is not perfect: σ > 0.
We may still be punished (if we are unlucky) even though µ = 2.
What is the probability that I will be punished?
We want to calculate $1 - \Pr(1.8 < \bar{x} < 2.2)$.
We know that $\mu_{\bar{x}} = \mu = 2$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{4}} = 0.1$.
But we do not know the probability distribution of $\bar{x}$!
Sampling from a normal population
If the population is normal, the sample mean is also normal!
Proposition 2
Let {Xi}i=1,...,n be a size-n random sample from a normal population
with mean µ and standard deviation σ. Then
$$\bar{x} \sim \mathrm{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$
We already know that $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. This is true regardless of
the population distribution.
When the population is normal, the sample mean will also be normal.
Example revisited: Quality inspection
The weight of a bag of candies follows a normal distribution with mean
µ = 2 and standard deviation σ = 0.2.
Suppose the quality control officer decides to sample 4 bags and
calculate the sample mean $\bar{x}$. She will punish me if $\bar{x} \notin [1.8, 2.2]$.
What is the probability that I will be punished?
The distribution of the sample mean $\bar{x}$ is ND(2, 0.1).
$\Pr(\bar{x} < 1.8) + \Pr(\bar{x} > 2.2) \approx 0.045$.
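The probability above can be reproduced with Python's standard library. This is a minimal sketch; the variable names are illustrative and not from the slides.

```python
from statistics import NormalDist

mu, sigma, n = 2.0, 0.2, 4

# Sampling distribution of the sample mean: ND(mu, sigma / sqrt(n)) = ND(2, 0.1).
sample_mean_dist = NormalDist(mu, sigma / n ** 0.5)

# Punished when the sample mean falls below 1.8 or above 2.2.
p_punish = sample_mean_dist.cdf(1.8) + (1 - sample_mean_dist.cdf(2.2))
print(round(p_punish, 4))
```

The two limits lie two standard deviations of the sample mean away from µ, so the result is the familiar two-sided tail probability of about 4.5%.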
Adjusting the standard deviation
When the population is
ND(µ = 2, σ = 0.2) and the sample
size is n = 4, the probability of
punishment is 0.045.
If we adjust our standard deviation
σ (by paying more or less attention
to the production process), the
probability will change.
Reducing σ reduces the probability
of being punished. With the
sampling distribution of ¯x, we may
optimize σ.
An improvement from 0.2 to 0.15 is helpful (the probability drops from
about 0.045 to about 0.008); a further one from 0.15 to 0.1 adds little.
Adjusting the sample size
When the population is ND(2, 0.2)
and the sample size is n = 4, the
probability of punishment is 0.045.
If the quality control officer
increases the sample size n, the
probability will decrease.
µ = 2 is actually ideal. A larger
sample size makes the officer less
likely to make a mistake.
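To see how the probability of punishment reacts to σ and n, the calculation can be wrapped in a small function and evaluated over a few values. The function name and the chosen grids are assumptions made for illustration.

```python
from statistics import NormalDist

def punishment_probability(sigma, n, mu=2.0, low=1.8, high=2.2):
    """Pr(sample mean falls outside [low, high]) when the population is normal."""
    dist = NormalDist(mu, sigma / n ** 0.5)
    return dist.cdf(low) + (1 - dist.cdf(high))

# Effect of the population standard deviation (sample size fixed at 4).
for sigma in (0.2, 0.15, 0.1):
    print(f"sigma={sigma}, n=4: {punishment_probability(sigma, 4):.4f}")

# Effect of the sample size (standard deviation fixed at 0.2).
for n in (4, 9, 16, 25):
    print(f"sigma=0.2, n={n}: {punishment_probability(0.2, n):.4f}")
```

Both a smaller σ and a larger n shrink the standard deviation of the sample mean, which is why the probability falls in both loops.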
Distribution of the sample mean
So now we have one general conclusion: When we sample from a
normal population, the sample mean is also normal.
And its mean and standard deviation are µ and $\frac{\sigma}{\sqrt{n}}$, respectively.
What if the population is non-normal?
Fortunately, we have a very powerful theorem, the central limit
theorem, which applies to any population.
Central limit theorem
The theorem says that a sample mean is approximately normal
when the sample size is large enough.
Proposition 3 (Central limit theorem)
Let $\{X_i\}_{i=1,\dots,n}$ be a size-n random sample from a population with
mean µ and standard deviation σ. Let $\bar{x}_n$ be the sample mean. If
σ < ∞, then $\bar{x}_n$ converges to $\mathrm{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ as n → ∞.
How large is “large enough”?
In practice, typically n ≥ 30 is believed to be large enough.
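A small simulation can make the theorem more concrete. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1), which is clearly non-normal and is chosen here only for illustration. It checks how often the sample mean of n = 30 observations falls within 1.96 standard errors of µ; under a normal approximation this should happen roughly 95% of the time.

```python
import random
import statistics

def sample_means(n, replications, seed=0):
    """Means of `replications` size-n samples from an assumed Exponential(1) population."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(replications)]

mu, sigma, n = 1.0, 1.0, 30
means = sample_means(n, replications=20000)
se = sigma / n ** 0.5          # theoretical standard deviation of the sample mean

# Fraction of sample means within 1.96 standard errors of mu.
coverage = sum(abs(m - mu) <= 1.96 * se for m in means) / len(means)
print(statistics.fmean(means), statistics.stdev(means), se, coverage)
```

Repeating the experiment with a much smaller n (say 3) should give a visibly worse match, which is one way to get a feeling for "large enough."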
Hypothesis testing
How do scientists (physicists, chemists, etc.) do research?
Observe phenomena.
Make hypotheses.
Test the hypotheses through experiments (or other methods).
Make conclusions about the hypotheses.
Social scientists and business researchers do the same thing with
hypothesis testing.
One of the most important techniques of statistical inference.
A technique for (statistically) proving things.
Relying on sampling distributions.
People ask questions
In the business (or social science) world, people ask questions:
Are older workers more loyal to a company?
Does the newly hired CEO enhance our profitability?
Is one candidate preferred by more than 50% of voters?
Do teenagers eat fast food more often than adults?
Is the quality of our products stable enough?
How should we answer these questions?
Statisticians suggest:
First make a hypothesis.
Then test it with samples and statistical methods.
Statistical hypotheses
A statistical hypothesis is a formal way of stating a hypothesis.
Typically it is a mathematical description of parameters to test.
It contains two parts:
The null hypothesis (denoted as H0).
The alternative hypothesis (denoted as Ha or H1).
The alternative hypothesis is:
The thing that we want (need) to prove.
The conclusion that can be made only if we have strong evidence.
The null hypothesis corresponds to a default position.
We first assume that the null hypothesis is correct.
Then we collect sample data.
If under the null hypothesis it is quite unlikely to see our observed
result, we claim that the null hypothesis is wrong.
Statistical hypotheses: example 1
In our factory, we produce packs of candy whose average weight should
be 1 kg.
One day, a consumer told us that his pack only weighs 900 g.
We need to know whether this is just a rare event or our production
system is out of control.
If (we believe) the system is out of control, we need to shut down the
machine and spend two days on inspection and maintenance. This will
cost us at least $100,000.
So we should not believe that our system is out of control just
because of one complaint. What should we do?
Statistical hypotheses: example 1
We first state a hypothesis: “Our production system is under control.”
Then we ask: Is there strong enough evidence showing that the
hypothesis is wrong, i.e., the system is out of control?
Initially, we assume that our system is under control.
Then we do a survey to see if we have strong enough evidence.
We shut down machines only if we can “prove” that the system is indeed
out of control.
Let µ be the average weight. The statistical hypothesis is
H0 : µ = 1
Ha : µ ≠ 1.
Statistical hypotheses: example 2
In our society, we adopt the presumption of innocence.
One is considered innocent until proven guilty.
So when there is a person who probably stole some money:
H0 : The person is innocent
Ha : The person is guilty.
There are two possible errors:
One is guilty but we think she/he is innocent.
One is innocent but we think she/he is guilty.
Which one is more critical?
It is unacceptable that an innocent person is considered guilty.
We will say one is guilty only if there is strong evidence.
Statistical hypotheses: example 3
Consider the following hypothesis: “The candidate is preferred by more
than 50% of voters.”
As we need a default position, and the percentage that we care about
is 50%, we will choose our null hypothesis as
H0 : p = 0.5.
p is the population proportion of voters preferring the candidate.
More precisely, let $X_i = 1$ if voter i prefers this candidate and 0
otherwise, i = 1, ..., N, then $p = \frac{\sum_{i=1}^{N} X_i}{N}$.
How about the alternative hypothesis? Should it be
Ha : p > 0.5 or Ha : p < 0.5?
Statistical hypotheses: example 3
The choice of the alternative hypothesis depends on the related
decisions or actions to make.
If one will run in the election only if she thinks she will win (i.e.,
p > 0.5), the alternative hypothesis will be
Ha : p > 0.5.
If one tends to participate in the election and will give up only if
the chance is slim, the alternative hypothesis will be
Ha : p < 0.5.
The alternative hypothesis is “the thing we want (need) to prove.”
Two types of errors
Type-1 error (false positive): Rejecting a true null hypothesis.
There is nothing, but we say there is something.
Type-2 error (false negative): Not rejecting a false null hypothesis.
There is something, but we do not see it.
Remarks
We want to control the chances for us to make these mistakes.
Unfortunately, we cannot control both.
We choose to control the probability of a type-1 error.
The choice of the default position is important.
For setting up a statistical hypothesis:
Our default position will be put in the null hypothesis.
The thing we want to prove (i.e., the thing that needs strong evidence)
will be put in the alternative hypothesis.
For writing the mathematical statement:
The equal sign (=) will always be put in the null hypothesis.
The alternative hypothesis contains an unequal sign or strict
inequality: ≠, >, or <.
The direction of the alternative hypothesis, when it is an inequality,
depends on the context.
One-tailed tests and two-tailed tests
If the alternative hypothesis contains an unequal sign (≠), the test is a
two-tailed test.
If it contains a strict inequality (> or <), the test is a one-tailed test.
Suppose we want to test the value of the population mean.
In a two-tailed test, we test whether the population mean significantly
deviates from a hypothesized value. We do not care whether it is larger
or smaller.
In a one-tailed test, we test whether the population mean significantly
deviates from a hypothesized value in a specific direction.
The first example: a two-tailed test
Let’s test the average weight (in g) of our products.
H0 : µ = 1000
Ha : µ ≠ 1000.
The variance of the product weights is $\sigma^2 = 40000$ g$^2$.
The case with unknown $\sigma^2$ will be discussed later.
A random sample has been collected.
Suppose the sample size n = 100.
Suppose the sample mean $\bar{x} = 963$.
How do we draw a conclusion?
Controlling the error probability
All we can do is to collect a random sample and make our conclusion
based on the observed sample.
It is natural that we may be wrong when we claim µ ≠ 1000.
We want to control the error probability.
Let α be the maximum probability for us to make this error.
α is called the significance level.
1 − α is called the confidence level.
Target: If µ = 1000, our sampling and testing process will make us claim
that µ ≠ 1000 with probability at most α.
Rejection rule
Now let’s test with the significance level α = 0.05.
Intuitively, if $\bar{X}$ deviates from 1000 a lot, we should reject the null
hypothesis and believe that µ ≠ 1000.
If µ = 1000, it is very unlikely to observe such a large deviation.
So such a large deviation provides strong evidence.
So we start by sampling and calculating the sample mean.
We want to construct a rejection rule: If $|\bar{X} - 1000| > d$, we reject
H0. We need to calculate d.
Rejection rule
We want a distance d such that if H0 is true, the probability of
rejecting H0 is at most 5%, i.e.,
$$\Pr\left(|\bar{X} - 1000| > d \mid \mu = 1000\right) \leq 0.05.$$
The smallest d that satisfies the above inequality requires
$$\Pr(|\bar{X} - 1000| > d) = 0.05.$$
Consider $\bar{X}$:
We know σ = 200 and n = 100.
We assume that µ = 1000.
Thanks to the central limit theorem, $\bar{X} \sim \mathrm{ND}(1000, 20)$.
Rejection rule: the critical value
According to $\bar{X} \sim \mathrm{ND}(1000, 20)$, $\Pr(|\bar{X} - 1000| > 39.2) = 0.05$. The
rejection region is R = (−∞, 960.8) ∪ (1039.2, ∞).
If $\bar{X}$ falls in the rejection region, we reject H0.
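The distance d and the two critical values can be computed from the inverse cumulative distribution function of the standard normal distribution. The following is a sketch using Python's standard library; the variable names are illustrative.

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 1000, 200, 100, 0.05
se = sigma / n ** 0.5                           # standard deviation of the sample mean: 20

# Two-tailed test: put alpha / 2 in each tail.
d = NormalDist().inv_cdf(1 - alpha / 2) * se
lower, upper = mu0 - d, mu0 + d

x_bar = 963                                     # observed sample mean from the example
reject = x_bar < lower or x_bar > upper
print(round(d, 1), round(lower, 1), round(upper, 1), reject)
```

The observed sample mean 963 lies just inside the interval between the two critical values, so the rule does not reject H0.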
Rejection rule: the critical value
Because $\bar{x} = 963 \notin R$, we cannot reject H0.
The deviation from 1000 is not large enough.
The evidence is not strong enough.
Rejection rule: the critical value
In this example, the two values 960.8 and 1039.2 are the critical
values for rejection.
If the sample mean is more extreme than one of the critical values, we
reject H0.
Otherwise, we do not reject H0.
$\bar{x} = 963$ is not strong enough to support Ha: µ ≠ 1000.
Concluding statement:
Because the sample mean does not lie in the rejection region, we cannot
reject H0.
With a 95% confidence level, there is no strong evidence showing that
the average weight is not 1000 g.
Therefore, we should not shut down machines to do an inspection.
Summary
We want to know whether the machine is out of control.
If the machine is actually good, we do not want to reach a conclusion
that requires an inspection and maintenance.
We will do the inspection only if we have strong evidence suggesting
that µ ≠ 1000.
We want to know whether H0 is false, i.e., µ ≠ 1000.
We control the probability of making a wrong conclusion.
We should not reject H0 if it is true.
We limit the probability at α = 5%.
We will conclude that H0 is false if $\bar{X}$ falls in the rejection region.
The calculation of the critical values is based on the normal
distribution, which can always be transformed to the z distribution.
This is called a z test.
Not rejecting vs. accepting
We should be careful in writing our conclusions:
Wrong: Because the sample mean does not lie in the rejection region,
we accept H0. With a 95% confidence level, there is strong evidence
showing that the average weight is 1000 g.
Right: Because the sample mean does not lie in the rejection region, we
cannot reject H0. With a 95% confidence level, there is no strong
evidence showing that the average weight is not 1000 g.
Being unable to prove that something is false does not mean it is true!
The first example (part 2)
Suppose that we modify the hypothesis into a directional one (some
researchers write H0 : µ ≥ 1000 in this case):
H0 : µ = 1000.
Ha : µ < 1000.
We still have $\sigma^2 = 40000$, n = 100, and α = 0.05.
This is a one-tailed test.
Once we have strong evidence supporting Ha, we will claim that
µ < 1000.
We need to find a distance d such that
$$\Pr\left(1000 - \bar{X} > d \mid \mu = 1000\right) = 0.05.$$
Rejection rule: the critical value
For $0.05 = \Pr(1000 - \bar{X} > d)$, we have d = 32.9.
As the observed sample mean $\bar{x} = 963 \in (-\infty, 967.1)$, we reject H0.
The deviation from 1000 is large enough.
The evidence is strong enough.
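For the one-tailed version, the whole of α goes into the left tail. A minimal sketch of the calculation, with illustrative variable names:

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 1000, 200, 100, 0.05
se = sigma / n ** 0.5                      # standard deviation of the sample mean: 20

# One-tailed (left-tailed) test: all of alpha is placed in the lower tail.
d = NormalDist().inv_cdf(1 - alpha) * se
critical_value = mu0 - d

x_bar = 963                                # observed sample mean from the example
reject = x_bar < critical_value
print(round(d, 1), round(critical_value, 1), reject)
```

Compared with the two-tailed test, the critical value moves from 960.8 up to 967.1, which is why the same sample mean now leads to rejection.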
Rejection rule: the critical value
In this example, 967.1 is the critical value for rejection.
If the sample mean is more extreme than (in this case, below) the critical
value, we reject H0.
Otherwise, we do not reject H0.
There is strong evidence supporting Ha: µ < 1000.
Concluding statement:
Because the sample mean lies in the rejection region, we reject H0.
With a 95% confidence level, there is strong evidence showing that the
average weight is less than 1000 g.
One-tailed tests vs. two-tailed tests
When should we use a two-tailed test?
We use a two-tailed test when we lack information about the direction.
E.g., we suspect that the population mean has changed, but we have
no idea about whether it becomes larger or smaller.
If we know or believe that the change is possible only in one
direction, we may use a one-tailed test.
Having more information (i.e., knowing the direction of change) makes
rejection “easier,” i.e., it is easier to find strong enough evidence.
Summary
Distinguish the following pairs:
One- and two-tailed tests.
No evidence showing H0 is false and having evidence showing H0 is true.
Not rejecting H0 and accepting H0.
Using = and using ≥ or ≤ in the null hypothesis.
The p-value
The p-value is an important, meaningful, and widely-adopted tool for
hypothesis testing.
Definition 2
For an observed value of a statistic in a statistical test, the p-value is
the probability of observing a value that is at least as extreme as the
observed value under the assumption that the null hypothesis is true.
Calculated based on an observed value of the statistic.
Is the tail probability of the observed value.
Assuming that the null hypothesis is true.
The p-value
Mathematically:
Suppose we test a population mean µ with a one-tailed test
H0 : µ = 1000
Ha : µ < 1000.
Given an observed $\bar{x}$, the p-value is defined as
$$\Pr(\bar{X} \leq \bar{x}).$$
In the previous example, σ = 200, n = 100, α = 0.05, and $\bar{x} = 963$.
If H0 is true, i.e., µ = 1000, we have $\Pr(\bar{X} \leq 963) = 0.032$.
The p-value of $\bar{x}$ is 0.032.
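The p-value in this example is a lower-tail probability of the sampling distribution under H0. It can be computed by standardizing the observed sample mean, as in the sketch below (variable names are illustrative):

```python
from statistics import NormalDist

mu0, sigma, n, x_bar = 1000, 200, 100, 963
se = sigma / n ** 0.5                  # standard deviation of the sample mean: 20

z = (x_bar - mu0) / se                 # standardized observed value: -1.85
p_value = NormalDist().cdf(z)          # lower-tail probability under H0
print(z, round(p_value, 3))
```

The same number results from evaluating the cumulative distribution function of ND(1000, 20) at 963 directly; standardizing first simply makes the link to the z distribution explicit.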
How to use the p-value?
The p-value can be used for constructing a rejection rule.
For a one-tailed test:
If the p-value is smaller than α, we reject H0.
If the p-value is greater than α, we do not reject H0.
In our example, the one-tailed test is
H0 : µ = 1000
Ha : µ < 1000.
We have α = 0.05.
Because the p-value 0.032 < 0.05, we reject H0.
p-values vs. critical values
Using the p-value is equivalent to using the critical values.
The rejection-or-not decision we make will be the same based on the two
methods.
The benefit of using the p-value
In many studies, researchers do not determine the significance level α
before a test is conducted.
They calculate the p-value and then mark the significance of the
result with stars.
One typical way of assigning stars:
p-value        Significant?             Mark
(0, 0.01]      Highly significant       ***
(0.01, 0.05]   Moderately significant   **
(0.05, 0.1]    Slightly significant     *
(0.1, 1)       Insignificant            (Empty)
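The star convention in the table is easy to encode. The helper below is a hypothetical sketch (the function name is an assumption) that maps a p-value to its mark using exactly the intervals listed above.

```python
def significance_mark(p_value):
    """Return the star mark for a p-value, following the intervals in the table."""
    if not 0 < p_value < 1:
        raise ValueError("p-value must lie strictly between 0 and 1")
    if p_value <= 0.01:
        return "***"      # highly significant
    if p_value <= 0.05:
        return "**"       # moderately significant
    if p_value <= 0.1:
        return "*"        # slightly significant
    return ""             # insignificant

# Illustrative p-values, not results from a real study.
for p in (0.0002, 0.04, 0.06, 0.2):
    print(p, significance_mark(p))
```

Other fields and journals use different thresholds, so the cut-offs should be treated as one convention among several.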
The size of a p-value
Suppose one is testing whether people at different ages sleep for at
least eight hours per day on average.
Age groups: [10, 15), [15, 20), [20, 25), etc.
For group i, a one-tailed test is conducted. Ha : $\mu_i > 8$.
The result may be presented in a table:
Group   Age group   p-value
1       [10, 15)    0.0002***
2       [15, 20)    0.2
3       [20, 25)    0.06*
4       [25, 30)    0.04**
5       [30, 35)    0.03**
A smaller p-value does NOT mean a larger deviation!
We cannot conclude that $\mu_5 > \mu_4$, $\mu_1 > \mu_3$, etc.
There are other tests for the difference between two population means.
The p-value for two-tailed tests
How to construct the rejection rule for a two-tailed test?
If the p-value is smaller than $\frac{\alpha}{2}$, we reject H0.
If the p-value is greater than $\frac{\alpha}{2}$, we do not reject H0.
Consider the two-tailed test
H0 : µ = 1000.
Ha : µ ≠ 1000.
We have α = 0.05.
Because the p-value $0.032 > \frac{\alpha}{2} = 0.025$, we do not reject H0.
Some researchers/books/software use another definition:
The p-value for a two-tailed test is twice that for the
corresponding one-tailed test.
They then compare this p-value with α.
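The two conventions for a two-tailed test always lead to the same decision, because comparing p with α/2 is the same as comparing 2p with α. A short sketch for the running example (variable names are illustrative):

```python
from statistics import NormalDist

mu0, sigma, n, x_bar, alpha = 1000, 200, 100, 963, 0.05
z = (x_bar - mu0) / (sigma / n ** 0.5)

# Tail probability on the side where the observed value lies.
cdf = NormalDist().cdf(z)
one_tailed_p = min(cdf, 1 - cdf)

# Convention 1: compare the one-tailed p-value with alpha / 2.
reject_by_half_alpha = one_tailed_p < alpha / 2

# Convention 2: double the p-value and compare it with alpha.
two_tailed_p = 2 * one_tailed_p
reject_by_doubling = two_tailed_p < alpha

print(round(one_tailed_p, 3), round(two_tailed_p, 3),
      reject_by_half_alpha, reject_by_doubling)
```

When reading software output, it is worth checking which convention the reported p-value follows before comparing it with α.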
Summary
The p-value is the tail probability of the realized value of a statistic
assuming the null hypothesis is true.
The p-value method is an alternative way of forming the rejection rule.
It is equivalent to the critical-value method.
The p-value measures the strength of the evidence against H0; it is not
the probability for H0 to be false.
It does not measure the magnitude of the deviation.
The z test
In example 1, basically we use the fact that $\bar{X} \sim \mathrm{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
This implies that $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathrm{ND}(0, 1)$, the so-called standard normal
distribution, or the z distribution.
Therefore, this test is called the z test.
This requires the knowledge about σ.
When the variance is unknown
When the population variance $\sigma^2$ is unknown, the quantity $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ is
unknown.
What if we use the sample variance $S^2$ as a substitute?
Proposition 4
For a normal population, the quantity
$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$$
follows the t distribution with n − 1 degrees of freedom.
What is the t distribution?
The t distribution
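The t distribution is defined as follows:
Definition 3
A random variable X follows the t distribution with n degrees of freedom, denoted as X ∼ t(n), if its density is
$$f(x \mid n) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}$$
for all x ∈ (−∞, ∞).
$\Gamma(x) = \int_0^\infty z^{x-1} e^{-z}\,dz$ is the gamma function.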
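As a sanity check on Definition 3, the density can be coded directly and integrated numerically. This is a minimal sketch using only the Python standard library; the helper names `t_pdf` and `integrate`, the integration range, and the step count are illustrative choices, not part of the slides.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom (Definition 3)."""
    # Work with log-gamma so that large df does not overflow.
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def integrate(f, a, b, steps=12000):
    """Composite Simpson's rule; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


# A density should integrate to (almost) 1 over a wide enough range.
area = integrate(lambda x: t_pdf(x, 5), -60, 60)
print(round(area, 4))
```

The integral is taken over a wide finite range rather than the whole real line, so the result is only approximately 1.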
The z and t distributions
Let’s compare $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ and $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$.
Because we do not know σ, we use S to substitute it.
Z ∼ ND(0, 1) and T ∼ t(n − 1).
As the t distribution is a substitute for the z distribution, it is also centered at 0: E[T] = E[Z] = 0.
However, as T has one more random variable in its formula (S is random, whereas σ is a constant), T is “more random” than Z, i.e., Var(T) > Var(Z).
Graphically, t curves are flatter than the z curve.
Fact: t(n) → ND(0, 1) as n → ∞.
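The "flatter curve" claim and the limit t(n) → ND(0, 1) can be checked numerically by comparing the two densities at the center and in the tail. This is a small sketch; `t_pdf` and `z_pdf` are illustrative helper names, and the degrees of freedom 5, 30, and 1000 are arbitrary choices.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def z_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


# Lower peak at 0 and heavier tail at 3 for small df; nearly identical for large df.
for df in (5, 30, 1000):
    print(df, round(t_pdf(0, df), 4), round(t_pdf(3, df), 4))
print("z", round(z_pdf(0), 4), round(z_pdf(3), 4))
```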
The t test
We will use the t test to test the population mean if the population is
normal.
If the sample size is large, we may still use the z distribution with s
substituting σ.
Example 2
An MBA program seldom admits applicants who do not have more than two years of work experience.
To test whether the average work experience of admitted students is above two years, 20 admitted applicants are randomly selected.
Their work experiences prior to entering the program are recorded.
They have an average work experience of 2.5 years. This is the sample mean.
The sample standard deviation is 1.3765 years.
The population is believed to be normal.
The confidence level is set to 95%, i.e., α = 0.05.
Example 2: hypothesis
Suppose the one asking the question is a potential applicant with one
year of work experience. He is pessimistic and will apply for the
program only if the average work experience is proven to be less than
two years.
The hypothesis is
H0 : µ = 2
Ha : µ < 2.
µ is the average work experience (in years) of all admitted applicants
prior to entering the program.
To encourage him, we need to give him strong evidence showing that his chance is high.
Example 2: hypothesis and test
Suppose he is optimistic and will give up applying only if the average work experience is proven to be greater than two years.
The hypothesis becomes
H0 : µ = 2
Ha : µ > 2.
To discourage him, we need to give him strong evidence showing that his chance is slim.
Let’s consider the optimistic candidate (and Ha : µ > 2) first.
Because the population variance is unknown and the population is
normal, we may use the t test.
Example 2A: calculation and interpretation
Calculation:
The p-value is $\Pr(\bar{X} > 2.5 \mid \mu = 2) = 0.0604$.
Conclusion:
For this one-tailed test, as the p-value > 0.05 = α, we do not reject H0.
There is no strong evidence showing that the average work experience
is longer than two years.
The result is not strong enough to discourage the potential applicant,
who has only one year of work experience.
Decision:
The (optimistic) applicant should apply.
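The p-value in Example 2A can be reproduced from the sample summary (n = 20, sample mean 2.5, sample standard deviation 1.3765). The sketch below uses only the Python standard library; the numerical integration by Simpson's rule and the helper names are illustrative assumptions, since the slides do not say how the tail probability was computed.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Pr(T > t): Simpson's rule on [0, |t|] plus the symmetry of the density."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    upper = 0.5 - total * h / 3
    return upper if t >= 0 else 1 - upper


n, x_bar, s, mu0 = 20, 2.5, 1.3765, 2.0
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_upper = t_upper_tail(t_stat, n - 1)   # for Ha: mu > 2
p_lower = 1 - p_upper                   # for Ha: mu < 2
print(round(t_stat, 3), round(p_upper, 4), round(p_lower, 4))
```

With α = 0.05, the upper-tail p-value is above α, so H0 is not rejected, in line with the conclusion on the slide.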
Example 2B – a pessimistic applicant
Suppose the applicant is pessimistic and the hypothesis is
H0 : µ = 2
Ha : µ < 2.
The p-value will be $\Pr(\bar{X} < 2.5 \mid \mu = 2) = 1 - 0.0604 = 0.9396$.
This is calculated based on the t distribution.
We do not reject H0 and cannot conclude that µ < 2. There is no strong
evidence to encourage him.
He should not apply.
Note that when we write different alternative hypotheses, the final
decision is different!
This happens if and only if in both cases we do not reject H0.
Summary
To test the population mean µ:
σ²        Sample size   Normal population   Nonnormal population
Known     n ≥ 30        z                   z
Known     n < 30        z                   Nonparametric
Unknown   n ≥ 30        t or z              z
Unknown   n < 30        t                   Nonparametric
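The table can be read as a small decision rule. A sketch in Python, where the function name `choose_test` and its argument names are illustrative and not from the slides:

```python
def choose_test(sigma_known, n, normal_population):
    """Return the test suggested by the summary table for a population mean."""
    large_sample = n >= 30
    if not normal_population:
        # Nonnormal population: rely on a large sample, otherwise go nonparametric.
        return "z" if large_sample else "nonparametric"
    if sigma_known:
        return "z"
    # Normal population with unknown variance.
    return "t or z" if large_sample else "t"


# Example 2: normal population, unknown variance, n = 20.
print(choose_test(sigma_known=False, n=20, normal_population=True))
```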
More parameters that may be tested:
Population proportion (z test).
Population variance ($\chi^2$ test).
Difference of two population means (t test).
Ratio of two population variances (F test).
Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression
Statistics and Data Analysis for Engineers
Part 3:
Regression Analysis
Ling-Chieh Kung
Department of Information Management
National Taiwan University
September 4, 2016
Regression Analysis 1 / 83 Ling-Chieh Kung (NTU IM)
Correlation and prediction
We often try to find correlation among variables.
For example, prices and sizes of houses:
House            1    2    3    4    5    6    7    8    9   10   11   12
Size (m²)       75   59   85   65   72   46  107   91   75   65   88   59
Price ($1000)  315  229  355  261  234  216  308  306  289  204  265  195
We may calculate their correlation coefficient as r = 0.729.
Now given a house whose size is 100 m², may we predict its price?
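The correlation coefficient can be recomputed from the twelve (size, price) pairs. A minimal sketch with the standard library; the function name `correlation` is illustrative.

```python
import math

size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]


def correlation(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(size, price)
print(round(r, 3))
```

The result should agree with the value r = 0.729 reported on the slide.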
Correlation among more than two variables
Sometimes we have more than two variables:
For example, we may also know the number of bedrooms in each house:
House            1    2    3    4    5    6    7    8    9   10   11   12
Size (m²)       75   59   85   65   72   46  107   91   75   65   88   59
Price ($1000)  315  229  355  261  234  216  308  306  289  204  265  195
Bedroom          1    1    2    2    2    1    3    3    2    1    3    1
How to summarize the correlation among the three variables?
How to predict house price based on size and number of bedrooms?
Regression analysis
Regression is a solution!
As one of the most widely used tools in Statistics, it discovers:
Which variables affect a given variable.
How they affect the target.
In general, we will predict/estimate one dependent variable by one
or multiple independent variables.
Independent variables: Potential factors that may affect the outcome.
Dependent variable: The outcome.
Independent variables are explanatory variables; the dependent variable
is the response variable.
As another example, suppose we want to predict the number of arrival
consumers for tomorrow:
Dependent variable: Number of arrival consumers.
Independent variables: Weather, holiday or not, promotion or not, etc.
Types of regression analysis
Based on the number of independent variables:
Simple regression: One independent variable.
Multiple regression: More than one independent variable.
The dependent variable may be quantitative or qualitative.
In ordinary regression, the dependent variable is quantitative.
In logistic regression, the dependent variable is qualitative.
There are other types of regression models.
Road map
Simple regression.
Multiple regression.
Indicator variables and interaction.
Endogeneity and residual analysis.
Logistic regression.
Basic principle
Consider the price-size relationship again. In the sequel, let xi be the
size and yi be the price of house i, i = 1, ..., 12.
Size (in m²)  Price (in $1000)
46 216
59 229
59 195
65 261
65 204
72 234
75 315
75 289
85 355
88 265
91 306
107 308
How to relate sizes and prices “in the best way?”
Linear estimation
If we believe that the relationship between the two variables is linear, we will assume that
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i.$$
$\beta_0$ is the intercept of the equation.
$\beta_1$ is the slope of the equation.
$\epsilon_i$ is the random noise for estimating record i.
Somehow there is such a formula, but we do not know $\beta_0$ and $\beta_1$.
$\beta_0$ and $\beta_1$ are the parameters of the population.
We want to use our sample data (e.g., the information of the twelve houses) to estimate $\beta_0$ and $\beta_1$.
We want to form two statistics $\hat\beta_0$ and $\hat\beta_1$ as our estimates of $\beta_0$ and $\beta_1$.
Linear estimation
Given the values of $\hat\beta_0$ and $\hat\beta_1$, we will use $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ as our estimate of $y_i$.
Then we have
$$y_i = \hat\beta_0 + \hat\beta_1 x_i + \epsilon_i,$$
where $\epsilon_i$ is now interpreted as the estimation error.
We hope $\epsilon_i = y_i - \hat{y}_i$ to be small.
For all data points, let’s minimize the sum of squared errors (SSE):
$$\sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\right)^2.$$
The solution of
$$\min_{\hat\beta_0, \hat\beta_1} \sum_{i=1}^n \left(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\right)^2$$
is our least square approximation (estimation) of the given data.
Least square approximation
The least square approximation problem
$$\min_{\hat\beta_0, \hat\beta_1} \sum_{i=1}^n \left(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\right)^2$$
has a closed-form formula for the best $(\hat\beta_0, \hat\beta_1)$:
$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \quad \text{and} \quad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}.$$
For our house example, we will get $(\hat\beta_0, \hat\beta_1) = (102.717, 2.192)$.
Its SSE is 13118.63.
We will never know the true values of $\beta_0$ and $\beta_1$. However, according to our sample data, the best (least square) estimate is (102.717, 2.192).
We tend to believe that $\beta_0 = 102.717$ and $\beta_1 = 2.192$.
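The closed-form least square solution is short enough to code by hand. The sketch below applies it to the twelve houses and also answers the earlier question about a 100 m² house; the function name `fit_line` is illustrative.

```python
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]


def fit_line(x, y):
    """Closed-form least square estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(size, price)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(size, price))
predicted_100 = b0 + b1 * 100  # predicted price (in $1000) of a 100 m² house
print(round(b0, 3), round(b1, 3), round(sse, 2), round(predicted_100, 1))
```

The estimates and the SSE should agree with the slide's (102.717, 2.192) and 13118.63 up to rounding.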
Interpretations
Our regression model is
y = 102.717 + 2.192x.
Interpretation: When the house size increases by 1 m², the price is expected to increase by $2,192.
(Bad) interpretation: For a house whose size is 0 m², the price is expected to be $102,717.
Linear multiple regression
In most cases, more than one independent variable may be used to explain the outcome of the dependent variable.
For example, consider the number of bedrooms.
We may take both variables as independent variables to do linear multiple regression:
$$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i.$$
$y_i$ is the house price (in thousands of dollars).
$x_{1,i}$ is the house size (in m²).
$x_{2,i}$ is the number of bedrooms.
$\epsilon_i$ is the random noise.
Our (least square) estimate is $(\hat\beta_0, \hat\beta_1, \hat\beta_2) = (82.737, 2.854, -15.789)$.
Price (in $1000)   Size (in m²)   Bedroom
315                75             1
229                59             1
355                85             2
261                65             2
234                72             2
216                46             1
308                107            3
306                91             3
289                75             2
204                65             1
265                88             3
195                59             1
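With more than one independent variable there is no two-line formula, but the least square estimate still solves a small linear system (the normal equations). The sketch below does this in plain Python with Gaussian elimination; the helper names `solve` and `least_squares` are illustrative, and the slides do not say which method the software uses internally.

```python
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
bedroom = [1, 1, 2, 2, 2, 1, 3, 3, 2, 1, 3, 1]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(columns, y):
    """Least square coefficients [b0, b1, ...] via the normal equations."""
    rows = [[1.0] + list(values) for values in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


b0, b1, b2 = least_squares([size, bedroom], price)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

The output should agree with the slide's estimate (82.737, 2.854, −15.789) up to rounding.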
Interpretations
Our regression model is
y = 82.737 + 2.854x1 − 15.789x2.
When the house size increases by 1 m² (and all other independent variables are fixed), we expect the price to increase by $2,854.
When there is one more bedroom (and all other independent variables are fixed), we expect the price to decrease by $15,789.
One must interpret the results and determine whether the result is
meaningful by herself/himself.
The number of bedrooms may not be a good indicator of house price.
At least not in a linear way.
We need more than finding coefficients:
We need to judge the overall quality of a given regression model.
We may want to compare multiple regression models.
We must test the significance of regression coefficients.
Model validation: How good is a model?
How to measure the quality of a model?
For the model y = 102.717 + 2.192x, how good is it?
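In general, for a given regression model $y = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k$, how may we evaluate its overall quality?
The sum of squared total errors (SST), $\mathrm{SST} = \sum_{i=1}^n (y_i - \bar{y})^2$, is the SSE of the worst model, which ignores all independent variables and always predicts $\bar{y}$.
With our regression model, the sum of squared errors (SSE) is
$$\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\right)^2.$$
The proportion of total variability that is explained by the regression model is
$$0 \le R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \le 1.$$
The larger $R^2$, the better the regression model.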
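For the size-only model, R² can be computed directly from its definition. The sketch below plugs in the rounded coefficients from the slides (102.717 and 2.192); the variable names are illustrative.

```python
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]

b0, b1 = 102.717, 2.192  # rounded estimates reported on the slides

y_bar = sum(price) / len(price)
sst = sum((y - y_bar) ** 2 for y in price)                          # baseline: always predict the mean
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(size, price))    # our regression model
r_squared = 1 - sse / sst
print(round(sst, 2), round(sse, 2), round(r_squared, 4))
```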
Obtaining $R^2$
Whenever we find the estimated coefficients, we have $R^2$.
Statistical software includes $R^2$ in the regression report.
For the regression model y = 102.717 + 2.192x, we have $R^2 = 0.5315$:
Around 53% of the variability in house prices is explained by house size.
If (and only if) there is only one independent variable, then $R^2 = r^2$, where r is the correlation coefficient between the dependent and independent variables.
−1 ≤ r ≤ 1.
$0 \le r^2 = R^2 \le 1$.
Comparing regression models
Now we have a way to compare regression models.
For our example:
        Size only   Bedroom only   Size and bedroom
$R^2$   0.5315      0.29           0.5513
Using sizes only is better than using numbers of bedrooms only.
Is using sizes and bedrooms better?
In general, adding more variables never decreases $R^2$!
In the worst case, we may set the corresponding coefficients to 0.
Some variables may actually be meaningless.
To perform a “fair” comparison and identify those meaningful factors, we need to adjust $R^2$ based on the number of independent variables.
Adjusted $R^2$
The standard way to adjust $R^2$ to adjusted $R^2$ is
$$R^2_{\text{adj}} = 1 - \left(\frac{n - 1}{n - k - 1}\right)(1 - R^2).$$
n is the sample size and k is the number of independent variables used.
For our example:
                     Size only   Bedroom only   Size and bedroom
$R^2$                0.5315      0.290          0.5513
$R^2_{\text{adj}}$   0.4846      0.219          0.4516
Actually using sizes only results in the best model!
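The adjustment is a one-line formula. The sketch below applies it to the three R² values in the table with n = 12 houses; the function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for sample size n and k independent variables."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)


n = 12
size_only = adjusted_r2(0.5315, n, 1)
bedroom_only = adjusted_r2(0.290, n, 1)
size_and_bedroom = adjusted_r2(0.5513, n, 2)
print(round(size_only, 4), round(bedroom_only, 4), round(size_and_bedroom, 4))
```

Although adding the bedroom variable raises R², the penalty for the extra variable pushes the adjusted value below that of the size-only model.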
Testing coefficient significance
Another important task for validating a regression model is to test the
significance of each coefficient.
Recall our model with two independent variables
y = 82.737 + 2.854x1 − 15.789x2.
Note that 2.854 and −15.789 are solely calculated based on the sample.
We never know whether β1 and β2 are really these two values!
In fact, we cannot even be sure that β1 and β2 are not 0. We need to
test them:
H0 : βi = 0
Ha : βi ≠ 0.
We look for strong enough evidence showing that βi ≠ 0.
Testing coefficient significance
The testing results are provided in regression reports.
Statistical software (e.g., R) tells us:
Coefficients Standard Error t Stat p-value
Intercept 82.737 59.873 1.382 0.200
Size 2.854 1.247 2.289 0.048 **
Bedroom −15.789 25.056 −0.630 0.544
As we have no idea about population variance, we apply the t test.
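“Coefficients” records the estimates $\hat\beta_i$, which play the role of the sample mean $\bar{x}$; “Standard Error” records their standard errors, which play the role of $S/\sqrt{n}$; “t Stat” records $T = \frac{\hat\beta_i - 0}{\text{Standard Error}}$.
“p-value” records the tail probabilities of T multiplied by 2 (done by most software). Simply compare them with α!
Recall the assumption that $\epsilon_i$ is normal!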
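The "t Stat" and "p-value" columns can be rebuilt from the first two columns of the report. The sketch below assumes the usual degrees of freedom n − k − 1 = 12 − 2 − 1 = 9 for this model (not stated on the slide) and computes tail probabilities by numerically integrating the t density; the helper names are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """2 * Pr(T > |t|), using Simpson's rule on [0, |t|] and symmetry."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * (0.5 - total * h / 3)


df = 12 - 2 - 1  # assumed: n - k - 1
report = {"Intercept": (82.737, 59.873), "Size": (2.854, 1.247), "Bedroom": (-15.789, 25.056)}
results = {}
for name, (coef, se) in report.items():
    t_stat = coef / se
    results[name] = (t_stat, two_sided_p(t_stat, df))
    print(name, round(t_stat, 3), round(results[name][1], 3))
```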
Testing coefficient significance
Statistical software tells us:
Coefficients Standard Error t Stat p-value
Intercept 82.737 59.873 1.382 0.200
Size 2.854 1.247 2.289 0.048 **
Bedroom −15.789 25.056 −0.630 0.544
At a 95% confidence level, we believe that β1 ≠ 0. House size really has some impact on house price.
At a 95% confidence level, we have no evidence for β2 ≠ 0. We cannot conclude that the number of bedrooms has an impact on house price.
If we use only size as an independent variable, its p-value will be
0.00714. We will be quite confident that it has an impact.
Road map
Simple regression.
Multiple regression.
Indicator variables and interaction.
Endogeneity and residual analysis.
Logistic regression.
House age
The age of a house may also affect its price.
Price (in $1000)  Size (in m²)  Bedroom  Age (in years)
315 75 1 16
229 59 1 20
355 85 2 16
261 65 2 15
234 72 2 21
216 46 1 16
308 107 3 15
306 91 3 15
289 75 2 14
204 65 1 21
265 88 3 15
195 59 1 26
Let’s add age as an independent variable in explaining house prices.
Because the number of bedrooms seems to be unhelpful, let’s ignore it.
House age
For house i, let $y_i$ be its price, $x_{1,i}$ be its size, and $x_{3,i}$ be its age. We assume the following linear relationship:
$$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{3,i} + \epsilon_i.$$
Software gives us the following regression report:
            Coefficients   Standard Error   t Stat    p-value
Intercept   262.882        83.632           3.143     0.012
Size        1.533          0.628            2.443     0.037 **
Age         −6.368         2.881            −2.211    0.054 *
$R^2 = 0.696$, $R^2_{\text{adj}} = 0.629$
$R^2_{\text{adj}}$ goes up from 0.485 (size only) to 0.629. Age is significant at a 10% significance level. Seems good!
“Nonlinear” relationship
May we do better?
By looking at the age-price scatter plot (and our intuition), maybe the impact of age on price is “nonlinear”:
A new house’s value depreciates fast.
The value depreciates slowly when the house is old.
At least this is true for a car.
It is worthwhile to try to capture this nonlinear relationship.
For example, we may try to replace house age by its reciprocal:
$$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 \left(\frac{1}{x_{3,i}}\right) + \epsilon_i.$$
Variable transformation
To fit
$$y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 \left(\frac{1}{x_{3,i}}\right) + \epsilon_i$$
to our sample data:
Prepare a new column as 1/age.
Input these three columns to software.
Read the report.
We may consider any kind of nonlinear relationship.
This technique is called variable transformation.
Price (in $1000)   Size (in m²)   1/Age (in 1/years)
315                75             0.063
229                59             0.05
355                85             0.063
261                65             0.067
234                72             0.048
216                46             0.063
308                107            0.067
306                91             0.067
289                75             0.071
204                65             0.048
265                88             0.067
195                59             0.038
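Preparing the transformed column is a one-line operation. The sketch below builds the 1/age column from the ages listed earlier and pairs it with price and size, ready to be passed to any regression routine; the variable names are illustrative.

```python
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
age = [16, 20, 16, 15, 21, 16, 15, 15, 14, 21, 15, 26]

# Variable transformation: replace age by its reciprocal.
inv_age = [1 / a for a in age]

# Each row is (price, size, 1/age), matching the three columns of the table.
rows = list(zip(price, size, inv_age))
for p, s, r in rows:
    print(p, s, f"{r:.4f}")
```

Any other transformation (for example, squaring the age) is prepared the same way, by computing a new column before fitting.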
The reciprocal of house age
Software gives us the following regression report:
Coefficients Standard Error t Stat p-value
Intercept 22.905 57.154 0.401 0.698
Size 1.524 0.647 2.356 0.043 **
1/Age 2185.575 1044.497 2.092 0.066 *
$R^2 = 0.685$, $R^2_{\text{adj}} = 0.615$
Validation:
Variables are both significant (at different significance levels).
Using size and age directly (adjusted $R^2$ of 0.629) explains house price better than using size and 1/age (0.615), at least for the given sample data.
The intuition that house value depreciates at different speeds is not supported by the data.
Changing 1/age to age² also does not help.
Typical ways of variable transformation
Variable selection and model building
In general, we may have a lot of candidate independent variables.
Size, number of bedrooms, age, distance to a park, distance to a hospital,
safety in the neighborhood, etc.
If we consider only linear relationships, for p candidate independent variables, we have $2^p - 1$ combinations.
For each variable, we have many ways to transform it.
In the next lecture, we will introduce the way of modeling interaction
among independent variables.
How to find the “best” regression model (if there is one)?
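To see where the count of combinations comes from, the candidate models can be enumerated as all non-empty subsets of the candidate variables. A short sketch with three of the candidates named on the slide; the function name `nonempty_subsets` is illustrative.

```python
from itertools import combinations


def nonempty_subsets(variables):
    """All non-empty subsets of the candidate independent variables."""
    return [
        subset
        for r in range(1, len(variables) + 1)
        for subset in combinations(variables, r)
    ]


candidates = ["Size", "Bedroom", "Age"]
models = nonempty_subsets(candidates)
for model in models:
    print(" + ".join(model))
print(len(models))
```

The count doubles with every added candidate, which is why trying every combination by hand quickly becomes impractical.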
Variable selection and model building
There is no “best” model; there are “good” models.
Some general suggestions:
Take each independent variable one at a time and observe the
relationship between it and the dependent variable. A scatter plot
helps. Use this to consider variable transformation.
For each pair of independent variables, check their relationship. If two
are highly correlated, quite likely one is not needed.
Once a model is built, check the p-values. You may want to remove
insignificant variables (but removing a variable may change the
significance of other variables).
Go back and forth to try various combinations. Stop when a good enough one (with high $R^2$ and $R^2_{\text{adj}}$ and small p-values) is found.
Software can somewhat automate the process, but its power is limited
(e.g., it cannot decide transformation).
We may need to find new independent variables.
Intuitions and experiences may help (or hurt).
Summary
With a regression model, we try to identify how independent variables
affect the dependent variable.
For a regression model, we adopt the least square criterion for estimating
the coefficients.
Model validation:
The overall quality of a regression model is decided by its $R^2$ and $R^2_{\text{adj}}$.
We may test the significance of independent variables by their p-values.
Model building:
Variable transformation.
Variable selection.
Case study: ticket selling
A theater made hundreds of stage performances in the past six years.
The owner hopes that statistics and data analysis may help her
improve the ticket sales.
Key questions: What makes a show popular?
Popularity is defined as the number of tickets sold.
Potential factors: year, month, day, time, location, actors/actresses,
drama type, ticket prices, etc.
100 performances are randomly drawn from the whole pool.
All were made during weekends.
Tickets were all publicly sold.
Tickets for all performances were sold through the same channels.
For each performance, the ticket price(s) remained the same.
As a group of consultants, how may we help the theater?
Variables
Six variables are obtained:
Variable Meaning
Year The year in which the performance was made
Time Morning, afternoon, or evening
Capacity The number of seats in the theater hall
AvgPrice The average of all prices
SalesQty The number of tickets sold
SalesDuration Performance day − Announcement day
Labeling and scaling:
Years are labeled as 1, 2, ..., and 6 (6 means the last year).
Capacities and sales quantities have been scaled in the same proportion.
Data (incomplete)
Yr. Tm. Cap. A.P. Qty S.D. Yr. Tm. Cap. A.P. Qty S.D.
5 A 230 400 218 50 2 M 190 575 190 289
5 A 150 500 119 46 6 A 130 500 108 89
5 A 230 400 160 126 4 E 200 775 169 100
5 A 200 775 200 324 4 E 200 775 135 259
6 E 190 1175 178 115 5 A 310 650 251 346
6 A 190 1175 183 109 2 A 250 550 250 145
5 E 190 775 161 58 1 A 190 675 183 254
3 A 200 675 200 112 6 A 200 1175 146 110
5 E 200 775 158 323 1 M 200 575 140 94
1 M 200 575 128 360 4 A 200 775 195 255
Regression
To construct a regression model, we first consider quantitative
independent variables.
Dependent variable: SalesQty.
Independent variables: Capacity, AvgPrice, Year.
Let’s ignore SalesDuration for a while.
Note that Year is a quantitative variable.
The difference between two values makes sense: 4 − 2 and 5 − 3 both
mean a difference of two years.
The values will keep increasing.
If we have a variable Month whose possible values are 1, 2, ..., and 12,
the difference between 12 and 1 is ambiguous: 11 months or 1 month.
Scatter plots help us consider:
Variable selection: Does a variable have an impact?
Transformation: What is a variable’s impact?
Multicollinearity: Are two variables highly correlated?
Regression
It seems that Capacity, AvgPrice, and Year are all worth a try.
Let’s put them into a regression model.
If we do this one by one:
SalesQty = 20.79 + 0.72 Capacity: $R^2 = 0.538$, p-value ≈ 0.
SalesQty = 174.9 + 0.0028 AvgPrice: $R^2 = 0.0002$, p-value = 0.885.
SalesQty = 203.6 − 6.77 Year: $R^2 = 0.063$, p-value = 0.0115.
If we include them together:
The regression model is
SalesQty = 24.742 + 0.702 Capacity + 0.027 AvgPrice − 4.696 Year.
$R^2 = 0.57$, $R^2_{\text{adj}} = 0.556$; p-values are 0, 0.056, and 0.019, respectively.
Do not try independent variables separately; try them together.
Adding Time into the model
Time may also be an influential variable.
However, it is qualitative.
More precisely, it is nominal.
Even if we label Time with numeric values, we cannot treat it as a
quantitative variable and put it into a regression model.
For each qualitative variable, we need to introduce several indicator
variables to represent its values.
Road map
Simple regression.
Multiple regression.
Indicator variables and interaction.
Endogeneity and residual analysis.
Logistic regression.
Numeric labeling does not work
The variable Time has three values.
Morning, afternoon, and evening.
Why can’t we label them as 1, 2, and 3 and do regression?
Suppose we label (morning, afternoon, evening) as (1, 2, 3):
The regression model is
SalesQty = 164.021 + 6.313Time.
Why is this wrong?
Numeric labeling does not work
Different labeling gives different regression results.
We may also label (morning, afternoon, evening) as (1, 2, 10) or (3, 1, 2). The three labelings give three different fitted models:
SalesQty = 164.021 + 6.313 Time, p-value = 0.294
SalesQty = 177.224 − 0.075 Time, p-value = 0.95
SalesQty = 205.725 − 15.091 Time, p-value = 0.0084
Binary variables
There is one exception: If a qualitative variable is binary, we may
label the values as 0 and 1 and then treat it as quantitative.
Labeling values as 1 and 0, 1 and 2, or 7 and 8 is also good.
Labeling values as 1 and −1, 1 and 5, or 4 and 8 is bad.
This is because a regression coefficient measures what happens to the
dependent variable “when that independent variable increases by 1.”
When the binary variable is labeled with 0 and 1, its regression coefficient $\hat\beta_i$ tells us that “if the value changes from 0 to 1 (while all others remain the same), we expect the dependent variable to increase by $\hat\beta_i$.”
What if we have more than two values?
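For the binary case, the effect of the labeling on the coefficient can be seen on a toy example. The sales figures below are hypothetical (they are not the theater data); the sketch fits a simple regression under three labelings of the same two groups.

```python
def slope(x, y):
    """Least square slope of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    return s_xy / s_xx


# Hypothetical data: three shows in group "a", three in group "b".
group = ["a", "a", "a", "b", "b", "b"]
sales = [10, 12, 14, 20, 22, 24]  # group means: 12 and 22


def coded(labels):
    """Replace each group name by its numeric label."""
    return [labels[g] for g in group]


slope_01 = slope(coded({"a": 0, "b": 1}), sales)
slope_78 = slope(coded({"a": 7, "b": 8}), sales)
slope_15 = slope(coded({"a": 1, "b": 5}), sales)
print(slope_01, slope_78, slope_15)
```

When the two labels differ by 1, the slope equals the difference between the two group means. When they differ by 4, the slope is a quarter of that difference and no longer reads as "the effect of switching groups".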
Indicator variables
Consider a variable x with three values A, B, and C.
We first choose a reference level, say, A.
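We then manually create two indicator variables $x^B$ and $x^C$:
$$x^B = \begin{cases} 1 & \text{if } x = B \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad x^C = \begin{cases} 1 & \text{if } x = C \\ 0 & \text{otherwise} \end{cases}$$
In other words, we have a mapping:
x   $x^B$   $x^C$
A   0       0
B   1       0
C   0       1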
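The mapping generalizes to any number of levels: one 0/1 column per level other than the reference. A minimal sketch, where the function name `indicator_columns` is illustrative:

```python
def indicator_columns(values, reference):
    """One 0/1 indicator column per level, skipping the reference level."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


x = ["A", "B", "C", "B"]
columns = indicator_columns(x, reference="A")
for level, column in columns.items():
    print(level, column)
```

A row whose indicators are all 0 belongs to the reference level, so no separate column is needed for it.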
Ling-Chieh Kung / Statistics and Data Analysis for Engineers 123 (2016/9/4)
Machine Learning and Causal InferenceNBER
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
Machine Learning Unit 4 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 4 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 4 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 4 Semester 3 MSc IT Part 2 Mumbai UniversityMadhav Mishra
 
Active learning lecture
Active learning lectureActive learning lecture
Active learning lectureazuring
 
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Edureka!
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Parth Khare
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier ananth
 
Quant Data Analysis
Quant Data AnalysisQuant Data Analysis
Quant Data AnalysisSaad Chahine
 
Machine Learning 1 - Introduction
Machine Learning 1 - IntroductionMachine Learning 1 - Introduction
Machine Learning 1 - Introductionbutest
 
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningBill Liu
 
Dowhy: An end-to-end library for causal inference
Dowhy: An end-to-end library for causal inferenceDowhy: An end-to-end library for causal inference
Dowhy: An end-to-end library for causal inferenceAmit Sharma
 
Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion antimo musone
 

Tendances (20)

Machine learning algorithms and business use cases
Machine learning algorithms and business use casesMachine learning algorithms and business use cases
Machine learning algorithms and business use cases
 
Lecture 7
Lecture 7Lecture 7
Lecture 7
 
Alleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal ModelsAlleviating Privacy Attacks Using Causal Models
Alleviating Privacy Attacks Using Causal Models
 
Nbe rtopicsandrecomvlecture1
Nbe rtopicsandrecomvlecture1Nbe rtopicsandrecomvlecture1
Nbe rtopicsandrecomvlecture1
 
Machine Learning and Causal Inference
Machine Learning and Causal InferenceMachine Learning and Causal Inference
Machine Learning and Causal Inference
 
sigir2018tutorial
sigir2018tutorialsigir2018tutorial
sigir2018tutorial
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
 
sigir2020
sigir2020sigir2020
sigir2020
 
LR2. Summary Day 2
LR2. Summary Day 2LR2. Summary Day 2
LR2. Summary Day 2
 
Machine Learning Unit 4 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 4 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 4 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 4 Semester 3 MSc IT Part 2 Mumbai University
 
Active learning lecture
Active learning lectureActive learning lecture
Active learning lecture
 
AI Algorithms
AI AlgorithmsAI Algorithms
AI Algorithms
 
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
Decision Tree Algorithm & Analysis | Machine Learning Algorithm | Data Scienc...
 
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
Machine learning basics using trees algorithm (Random forest, Gradient Boosting)
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
 
Quant Data Analysis
Quant Data AnalysisQuant Data Analysis
Quant Data Analysis
 
Machine Learning 1 - Introduction
Machine Learning 1 - IntroductionMachine Learning 1 - Introduction
Machine Learning 1 - Introduction
 
Causal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine LearningCausal Inference in Data Science and Machine Learning
Causal Inference in Data Science and Machine Learning
 
Dowhy: An end-to-end library for causal inference
Dowhy: An end-to-end library for causal inferenceDowhy: An end-to-end library for causal inference
Dowhy: An end-to-end library for causal inference
 
Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion Tech meetup Data Driven - Codemotion
Tech meetup Data Driven - Codemotion
 

En vedette

「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會台灣資料科學年會
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探台灣資料科學年會
 
[DSC 2016] 系列活動:吳牧恩、林佳緯 / 用 R 輕鬆做交易策略分析及自動下單
[DSC 2016] 系列活動:吳牧恩、林佳緯 / 用 R 輕鬆做交易策略分析及自動下單[DSC 2016] 系列活動:吳牧恩、林佳緯 / 用 R 輕鬆做交易策略分析及自動下單
[DSC 2016] 系列活動:吳牧恩、林佳緯 / 用 R 輕鬆做交易策略分析及自動下單台灣資料科學年會
 
陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰台灣資料科學年會
 
Blockchain Cloudminds: Human-Machine Pooled-Mind DACs
Blockchain Cloudminds: Human-Machine Pooled-Mind DACsBlockchain Cloudminds: Human-Machine Pooled-Mind DACs
Blockchain Cloudminds: Human-Machine Pooled-Mind DACsMelanie Swan
 
Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...
Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...
Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...Databricks
 
Philosophy of Deep Learning
Philosophy of Deep LearningPhilosophy of Deep Learning
Philosophy of Deep LearningMelanie Swan
 
Blockchain Economic Theory
Blockchain Economic TheoryBlockchain Economic Theory
Blockchain Economic TheoryMelanie Swan
 
Workplace harassment of health worker
Workplace harassment of health workerWorkplace harassment of health worker
Workplace harassment of health workerShahid Imran Khan
 
Is the Data Scaled, Ordinal, or Nominal Proportional?
Is the Data Scaled, Ordinal, or Nominal Proportional?Is the Data Scaled, Ordinal, or Nominal Proportional?
Is the Data Scaled, Ordinal, or Nominal Proportional?Ken Plummer
 
Descriptive v inferential
Descriptive v inferentialDescriptive v inferential
Descriptive v inferentialKen Plummer
 
Basic Statistics & Data Analysis
Basic Statistics & Data AnalysisBasic Statistics & Data Analysis
Basic Statistics & Data AnalysisAjendra Sharma
 
Quick reminder ordinal or scaled or nominal porportional
Quick reminder   ordinal or scaled or nominal porportionalQuick reminder   ordinal or scaled or nominal porportional
Quick reminder ordinal or scaled or nominal porportionalKen Plummer
 
Standard Deviation
Standard DeviationStandard Deviation
Standard DeviationJRisi
 
Scales of measurement in statistics
Scales of measurement in statisticsScales of measurement in statistics
Scales of measurement in statisticsShahid Imran Khan
 
Quickreminder nature of the data (relationship)
Quickreminder nature of the data (relationship)Quickreminder nature of the data (relationship)
Quickreminder nature of the data (relationship)Ken Plummer
 
1a difference between inferential and descriptive statistics (explanation)
1a difference between inferential and descriptive statistics (explanation)1a difference between inferential and descriptive statistics (explanation)
1a difference between inferential and descriptive statistics (explanation)Ken Plummer
 

En vedette (20)

「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
「資料視覺化」有志一同場次 at 2016 台灣資料科學年會
 
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
[DSC 2016] 系列活動:李泳泉 / 星火燎原 - Spark 機器學習初探
 
[DSC 2016] 系列活動:吳牧恩、林佳緯 / 用 R 輕鬆做交易策略分析及自動下單
[DSC 2016] 系列活動:吳牧恩、林佳緯 / 用 R 輕鬆做交易策略分析及自動下單[DSC 2016] 系列活動:吳牧恩、林佳緯 / 用 R 輕鬆做交易策略分析及自動下單
[DSC 2016] 系列活動:吳牧恩、林佳緯 / 用 R 輕鬆做交易策略分析及自動下單
 
陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰
 
Blockchain Cloudminds: Human-Machine Pooled-Mind DACs
Blockchain Cloudminds: Human-Machine Pooled-Mind DACsBlockchain Cloudminds: Human-Machine Pooled-Mind DACs
Blockchain Cloudminds: Human-Machine Pooled-Mind DACs
 
Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...
Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...
Spark, GraphX, and Blockchains: Building a Behavioral Analytics Platform for ...
 
Philosophy of Deep Learning
Philosophy of Deep LearningPhilosophy of Deep Learning
Philosophy of Deep Learning
 
Blockchain Economic Theory
Blockchain Economic TheoryBlockchain Economic Theory
Blockchain Economic Theory
 
Workplace harassment of health worker
Workplace harassment of health workerWorkplace harassment of health worker
Workplace harassment of health worker
 
Is the Data Scaled, Ordinal, or Nominal Proportional?
Is the Data Scaled, Ordinal, or Nominal Proportional?Is the Data Scaled, Ordinal, or Nominal Proportional?
Is the Data Scaled, Ordinal, or Nominal Proportional?
 
Descriptive v inferential
Descriptive v inferentialDescriptive v inferential
Descriptive v inferential
 
EEX 501 Assess Ch4,5,6,7,All
EEX 501 Assess Ch4,5,6,7,AllEEX 501 Assess Ch4,5,6,7,All
EEX 501 Assess Ch4,5,6,7,All
 
Basic Statistics & Data Analysis
Basic Statistics & Data AnalysisBasic Statistics & Data Analysis
Basic Statistics & Data Analysis
 
Quick reminder ordinal or scaled or nominal porportional
Quick reminder   ordinal or scaled or nominal porportionalQuick reminder   ordinal or scaled or nominal porportional
Quick reminder ordinal or scaled or nominal porportional
 
Standard Deviation
Standard DeviationStandard Deviation
Standard Deviation
 
Basic Statistics
Basic  StatisticsBasic  Statistics
Basic Statistics
 
Burns And Bush Chapter 15
Burns And Bush Chapter 15Burns And Bush Chapter 15
Burns And Bush Chapter 15
 
Scales of measurement in statistics
Scales of measurement in statisticsScales of measurement in statistics
Scales of measurement in statistics
 
Quickreminder nature of the data (relationship)
Quickreminder nature of the data (relationship)Quickreminder nature of the data (relationship)
Quickreminder nature of the data (relationship)
 
1a difference between inferential and descriptive statistics (explanation)
1a difference between inferential and descriptive statistics (explanation)1a difference between inferential and descriptive statistics (explanation)
1a difference between inferential and descriptive statistics (explanation)
 

Similaire à 孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4)

probability and statistics-4.pdf
probability and statistics-4.pdfprobability and statistics-4.pdf
probability and statistics-4.pdfhabtamu292245
 
Intra Cranial Pressure ( Icp ) Measurements Are Taken Via...
Intra Cranial Pressure ( Icp ) Measurements Are Taken Via...Intra Cranial Pressure ( Icp ) Measurements Are Taken Via...
Intra Cranial Pressure ( Icp ) Measurements Are Taken Via...Michelle Love
 
MAC411(A) Analysis in Communication Researc.ppt
MAC411(A) Analysis in Communication Researc.pptMAC411(A) Analysis in Communication Researc.ppt
MAC411(A) Analysis in Communication Researc.pptPreciousOsoOla
 
Types of Statistics Descriptive and Inferential Statistics
Types of Statistics Descriptive and Inferential StatisticsTypes of Statistics Descriptive and Inferential Statistics
Types of Statistics Descriptive and Inferential StatisticsDr. Amjad Ali Arain
 
IDS-Unit-II. bachelor of computer applicatio notes
IDS-Unit-II. bachelor of computer applicatio notesIDS-Unit-II. bachelor of computer applicatio notes
IDS-Unit-II. bachelor of computer applicatio notesAnkurTiwari813070
 
data analysis in Statistics-2023 guide 2023
data analysis in Statistics-2023 guide 2023data analysis in Statistics-2023 guide 2023
data analysis in Statistics-2023 guide 2023ayesha455941
 
Statistics Assignments 090427
Statistics Assignments 090427Statistics Assignments 090427
Statistics Assignments 090427amykua
 
INTRODUCTION TO STATISTICS.pptx
INTRODUCTION TO STATISTICS.pptxINTRODUCTION TO STATISTICS.pptx
INTRODUCTION TO STATISTICS.pptxAvilosErgelaKram
 
Analysing & interpreting data.ppt
Analysing & interpreting data.pptAnalysing & interpreting data.ppt
Analysing & interpreting data.pptmanaswidebbarma1
 

Similaire à 孔令傑 / 給工程師的統計學及資料分析 123 (2016/9/4) (20)

Data Analysis
Data Analysis Data Analysis
Data Analysis
 
probability and statistics-4.pdf
probability and statistics-4.pdfprobability and statistics-4.pdf
probability and statistics-4.pdf
 
Presentation of BRM.pptx
Presentation of BRM.pptxPresentation of BRM.pptx
Presentation of BRM.pptx
 
Bio stat
Bio statBio stat
Bio stat
 
Intra Cranial Pressure ( Icp ) Measurements Are Taken Via...
Intra Cranial Pressure ( Icp ) Measurements Are Taken Via...Intra Cranial Pressure ( Icp ) Measurements Are Taken Via...
Intra Cranial Pressure ( Icp ) Measurements Are Taken Via...
 
Statistics
StatisticsStatistics
Statistics
 
chapter 1.pptx
chapter 1.pptxchapter 1.pptx
chapter 1.pptx
 
MAC411(A) Analysis in Communication Researc.ppt
MAC411(A) Analysis in Communication Researc.pptMAC411(A) Analysis in Communication Researc.ppt
MAC411(A) Analysis in Communication Researc.ppt
 
Types of Statistics Descriptive and Inferential Statistics
Types of Statistics Descriptive and Inferential StatisticsTypes of Statistics Descriptive and Inferential Statistics
Types of Statistics Descriptive and Inferential Statistics
 
Statistics
StatisticsStatistics
Statistics
 
Statistics
StatisticsStatistics
Statistics
 
IDS-Unit-II. bachelor of computer applicatio notes
IDS-Unit-II. bachelor of computer applicatio notesIDS-Unit-II. bachelor of computer applicatio notes
IDS-Unit-II. bachelor of computer applicatio notes
 
Mm1
Mm1Mm1
Mm1
 
Lecture-1 Introduction to statistics.ppt
Lecture-1 Introduction to statistics.pptLecture-1 Introduction to statistics.ppt
Lecture-1 Introduction to statistics.ppt
 
Data analysis aug-11
Data analysis aug-11Data analysis aug-11
Data analysis aug-11
 
data analysis in Statistics-2023 guide 2023
data analysis in Statistics-2023 guide 2023data analysis in Statistics-2023 guide 2023
data analysis in Statistics-2023 guide 2023
 
Statistics Assignments 090427
Statistics Assignments 090427Statistics Assignments 090427
Statistics Assignments 090427
 
INTRODUCTION TO STATISTICS.pptx
INTRODUCTION TO STATISTICS.pptxINTRODUCTION TO STATISTICS.pptx
INTRODUCTION TO STATISTICS.pptx
 
Basic concept of statistics
Basic concept of statisticsBasic concept of statistics
Basic concept of statistics
 
Analysing & interpreting data.ppt
Analysing & interpreting data.pptAnalysing & interpreting data.ppt
Analysing & interpreting data.ppt
 

Plus de 台灣資料科學年會

[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用台灣資料科學年會
 
[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告台灣資料科學年會
 
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰台灣資料科學年會
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機台灣資料科學年會
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機台灣資料科學年會
 
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話台灣資料科學年會
 
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇台灣資料科學年會
 
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 [TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 台灣資料科學年會
 
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵台灣資料科學年會
 
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用台灣資料科學年會
 
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告台灣資料科學年會
 
[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話台灣資料科學年會
 
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人台灣資料科學年會
 
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維台灣資料科學年會
 
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察台灣資料科學年會
 
[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰台灣資料科學年會
 
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT台灣資料科學年會
 
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達台灣資料科學年會
 
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳台灣資料科學年會
 

Plus de 台灣資料科學年會 (20)

[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用[台灣人工智慧學校] 人工智慧技術發展與應用
[台灣人工智慧學校] 人工智慧技術發展與應用
 
[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告[台灣人工智慧學校] 執行長報告
[台灣人工智慧學校] 執行長報告
 
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
[台灣人工智慧學校] 工業 4.0 與智慧製造的發展趨勢與挑戰
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
 
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
[台灣人工智慧學校] 開創台灣產業智慧轉型的新契機
 
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
[台灣人工智慧學校] 台北總校第三期結業典禮 - 執行長談話
 
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
[TOxAIA台中分校] AI 引爆新工業革命,智慧機械首都台中轉型論壇
 
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察 [TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
[TOxAIA台中分校] 2019 台灣數位轉型 與產業升級趨勢觀察
 
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
[TOxAIA台中分校] 智慧製造成真! 產線導入AI的致勝關鍵
 
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用[台灣人工智慧學校] 從經濟學看人工智慧產業應用
[台灣人工智慧學校] 從經濟學看人工智慧產業應用
 
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
[台灣人工智慧學校] 台中分校第二期開學典禮 - 執行長報告
 
台灣人工智慧學校成果發表會
台灣人工智慧學校成果發表會台灣人工智慧學校成果發表會
台灣人工智慧學校成果發表會
 
[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話[台中分校] 第一期結業典禮 - 執行長談話
[台中分校] 第一期結業典禮 - 執行長談話
 
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
[TOxAIA新竹分校] 工業4.0潛力新應用! 多模式對話機器人
 
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
[TOxAIA新竹分校] AI整合是重點! 竹科的關鍵轉型思維
 
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
[TOxAIA新竹分校] 2019 台灣數位轉型與產業升級趨勢觀察
 
[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰[TOxAIA新竹分校] 深度學習與Kaggle實戰
[TOxAIA新竹分校] 深度學習與Kaggle實戰
 
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
[台灣人工智慧學校] Bridging AI to Precision Agriculture through IoT
 
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
[2018 台灣人工智慧學校校友年會] 產業經驗分享: 如何用最少的訓練樣本,得到最好的深度學習影像分析結果,減少一半人力,提升一倍品質 / 李明達
 
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
[2018 台灣人工智慧學校校友年會] 啟動物聯網新關鍵 - 未來由你「喚」醒 / 沈品勳
 

Dernier

➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx9to5mart
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 

Dernier (20)

➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
hybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptxhybrid Seed Production In Chilli & Capsicum.pptx
hybrid Seed Production In Chilli & Capsicum.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 

Ling-Chieh Kung / Statistics and Data Analysis for Engineers 123 (2016/9/4)

  • 1. Basic concepts Data visualization Data summarization Statistics and Data Analysis for Engineers Part 1: Introduction and Descriptive Statistics Ling-Chieh Kung Department of Information Management National Taiwan University September 4, 2016 Introduction and Descriptive Statistics 1 / 62 Ling-Chieh Kung (NTU IM)
  • 2. What is Statistics? Many things are unknown... Consumers’ tastes. Quality of a product. Stock prices. The effectiveness of a new way of teaching/training. Statistics is the science of collecting, analyzing, interpreting, and presenting (numerical) data. Ultimate goal (of Business Statistics): to achieve better decision making. The study of Statistics includes: Descriptive Statistics. Probability. Inferential Statistics: Estimation. Inferential Statistics: Hypothesis testing. Inferential Statistics: Prediction. In summary: To estimate, test, and predict those unknowns.
  • 3. My plan for today: Descriptive Statistics. Visualization and summarization. Inferential Statistics. (Probability). Hypothesis testing and p-value. Regression analysis. Case studies.
  • 4. Road map: Basic concepts. Data visualization. Data summarization.
  • 5. Populations vs. samples: A population is a collection of persons, objects, or items. A census is to investigate the whole population. A sample is a portion of the population. Sampling is to investigate only a subset of the population. We then use the information contained in the sample to infer (“guess”) about the population. What are samples for the following populations? All students in NTU. All students in the business school. All chips made in one factory. All consumers who have bought iPhone 6. Two important questions: Why sampling? Is a sample representative?
  • 6. Descriptive vs. inferential statistics: Descriptive statistics: Graphical or numerical summaries of data. Describing (visualizing or summarizing) a set of data. Inferential statistics: Making a “scientific guess” on unknowns. Trying to say something about the population. Which is descriptive and which is inferential? Calculating the average height of 1000 randomly selected NTU students. Using this number to estimate the average height of all NTU students. Another example (pharmaceutical research): All the potential patients form the population. A group of randomly selected patients is a sample. Use the result on the sample to infer the result on the population.
  • 7. Parameters vs. statistics: A numerical summary of a population is a parameter. The average height of all NTU students. The expected coffee demand when the price is 50 NTD. A numerical summary of a sample is a statistic. The average height of all NTU male students. The average coffee demand when the price is 50 NTD in the past 6 days. Almost always people use a statistic to infer a parameter. Some statistics are “good” while some are “bad.”
  • 8. Parameters vs. statistics: an example. What is the average height of all NTU students? While a census is possible, it is still quite costly. It is natural to: Sample some NTU students. Calculate a statistic. Use that statistic to estimate the average height (the parameter). Some (good or bad) samples and statistics: The average height of all students in this classroom. The average height of 100 students randomly drawn from all students. The maximum height of 100 students randomly drawn from all students. The sum of heights of 100 students randomly drawn from all students. The average height of 60 male and 40 female students randomly drawn from the population.
  • 9. Levels of data measurement: Most data we will play with are numerical. Numerical data may be categorized into three levels: Nominal. Ordinal. Quantitative: interval or ratio.
  • 10. Nominal level: A nominal scale classifies data into categories with no ranking. Data are labels or names used to identify an attribute of the element. The label may be numeric or non-numeric. Examples:
      Categorical variable   Values (categories)
      Laptop ownership       Yes / No
      Citizenship            Taiwan / Japan / ...
      Country code           886 / 86 / 1 / ...
    Arithmetic operations cannot be applied to nominal data.
  • 11. Ordinal level: An ordinal scale classifies data into categories with ranking. The order or rank of the data is meaningful. However, differences between numerical labels do not imply distances. Examples:
      Categorical variable   Values (categories)
      Product satisfaction   Satisfied, neutral, unsatisfied
      Professor rank         Full, associate, assistant
      Ranking of scores      1, 2, 3, 4, ...
    It is still not meaningful to do arithmetic on ordinal data. Assistant + associate = full?! The grade difference between no. 1 and no. 5 may not be equal to that between no. 11 and no. 15.
  • 12. Quantitative (interval and ratio) levels: An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements do not have a true zero point. A ratio scale is an ordered scale in which the difference between measurements is a meaningful quantity and the measurements have a true zero point. Ratio data appear more often in the world. Heights, weights, income, prices. Interval data are actually rare. Degrees in Celsius or Fahrenheit. GRE or GMAT scores. How about degrees in Kelvin?
  • 13. Some remarks: Nominal and ordinal data are called qualitative data. Interval and ratio data are called quantitative data. Most statistical methods are for quantitative data; some are for qualitative data. Distinguishing nominal and ordinal scales is important. Distinguishing interval and ratio scales is not. Sometimes qualitative data are called categorical data. Sometimes quantitative data are called numeric data.
  • 14. A short summary: Understand these terms: Populations vs. samples. Parameters vs. statistics. Inferential statistics vs. descriptive statistics. For each scale of measurement, is it meaningful to calculate the following numbers?
      Level          Ranking   Distance
      Nominal        No        No
      Ordinal        Yes       No
      Quantitative   Yes       Yes
  • 15. Road map: Basic concepts. Data visualization. Data summarization.
  • 16. An example: For each day in 2011 and 2012, we record the number of daily rentals of the public bike rental system in Washington, D.C.: 985, 801, 1349, 1562, 1600, 1606, 1510, ..., 1341, 1796, and 2729. The smallest and largest numbers are 22 and 8714, respectively. How do we get a feel for 731 numbers?
      Date          Rentals
      2011/1/1      985
      2011/1/2      801
      2011/1/3      1349
      2011/1/4      1562
      2011/1/5      1600
      2011/1/6      1606
      2011/1/7      1510
      ...
      2012/12/29    1341
      2012/12/30    1796
      2012/12/31    2729
  • 17. Frequency distributions: The original 731 numbers form a set of ungrouped data. We start by grouping them into a frequency distribution, i.e., grouped data presented in the form of class intervals and frequencies. Let’s create an intuitive frequency distribution.
  • 18. Frequency distributions: an example. The resulting classes:
      Class   Class interval   (Which means)
      1       [0, 1000)        0 ≤ x < 1000
      2       [1000, 2000)     1000 ≤ x < 2000
      3       [2000, 3000)     2000 ≤ x < 3000
      ...
      8       [7000, 8000)     7000 ≤ x < 8000
      9       [8000, 9000)     8000 ≤ x < 9000
    How about [0, 999], [1000, 1999], etc.? How about (0, 1000], (1000, 2000], etc.?
  • 19. Frequency distributions: an example. Then we count to get the frequency distribution below. This is a set of grouped data. Some remarks: Typically we have 5 to 15 classes. Typically all classes have the same width. Be aware of class endpoints! Classes should NOT overlap with each other. If there are outliers, they should be removed first.
      Class interval   Frequency
      [0, 1000)        18
      [1000, 2000)     80
      [2000, 3000)     74
      [3000, 4000)     107
      [4000, 5000)     166
      [5000, 6000)     106
      [6000, 7000)     86
      [7000, 8000)     82
      [8000, 9000)     12
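The counting step can be sketched in a few lines of Python. This is a minimal sketch, assuming half-open classes [a, b) of equal width that start at multiples of the width. It uses only the ten daily values quoted in the example (the full 731-value data set is not reproduced here), and the function name `frequency_distribution` is hypothetical.

```python
from collections import Counter

def frequency_distribution(values, width):
    """Count values per half-open class [k*width, (k+1)*width).

    Classes that contain no value are simply omitted from the result.
    """
    counts = Counter((v // width) * width for v in values)
    return {(lo, lo + width): counts[lo] for lo in sorted(counts)}

# Only the ten daily rentals quoted in the example, not the full data set.
rentals = [985, 801, 1349, 1562, 1600, 1606, 1510, 1341, 1796, 2729]
table = frequency_distribution(rentals, 1000)

for (lo, hi), freq in table.items():
    print(f"[{lo}, {hi}): {freq}")
```

Integer floor division puts a value such as 1000 into [1000, 2000) rather than [0, 1000), which is exactly the endpoint convention the remarks warn about.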
  • 20. Something more: We may add class midpoints, relative frequencies, and cumulative frequencies into a frequency table:
      Class interval   Frequency   Class midpoint   Relative frequency   Cumulative frequency
      [0, 1000)        18          500              2.46%                18
      [1000, 2000)     80          1500             10.94%               98
      [2000, 3000)     74          2500             10.12%               172
      [3000, 4000)     107         3500             14.64%               279
      [4000, 5000)     166         4500             22.71%               445
      [5000, 6000)     106         5500             14.50%               551
      [6000, 7000)     86          6500             11.76%               637
      [7000, 8000)     82          7500             11.22%               719
      [8000, 9000)     12          8500             1.64%                731
    How about cumulative relative frequencies?
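The extra columns, including the cumulative relative frequencies the slide asks about, follow mechanically from the frequencies. A minimal Python sketch, using the class frequencies from the slide; the variable names are illustrative only:

```python
from itertools import accumulate

# Class lower bounds and frequencies as given on the slide.
width = 1000
lower_bounds = list(range(0, 9000, width))
frequencies = [18, 80, 74, 107, 166, 106, 86, 82, 12]

total = sum(frequencies)
midpoints = [lo + width / 2 for lo in lower_bounds]
relative = [f / total for f in frequencies]
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / total for c in cumulative]

for lo, f, m, r, c, cr in zip(lower_bounds, frequencies, midpoints,
                              relative, cumulative, cumulative_relative):
    print(f"[{lo}, {lo + width}): {f:4d} {m:6.0f} {r:7.2%} {c:4d} {cr:8.2%}")
```

The last cumulative frequency equals the number of observations, and the last cumulative relative frequency equals 100%.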
  • 21. Histograms: A frequency distribution may be depicted as a histogram. It consists of a series of contiguous rectangles, each representing the frequency in a class.
      Interval         Freq.
      [0, 1000)        18
      [1000, 2000)     80
      [2000, 3000)     74
      [3000, 4000)     107
      [4000, 5000)     166
      [5000, 6000)     106
      [6000, 7000)     86
      [7000, 8000)     82
      [8000, 9000)     12
  • 22. Histograms: Histograms may be the most important type of data graphs. One particular reason to draw histograms is to get some ideas about the distribution. Bell shape? M shape? Skewed? Any outlier? We will discuss distributions in more detail.
  • 23. Frequency polygons: Alternatively, we may draw a frequency polygon by using line segments connecting dots plotted at class midpoints. The information contained in a frequency polygon is quite similar to that contained in a histogram.
  • 24. Frequency polygons: It is more convenient to use a frequency polygon to compare multiple frequency distributions. Both: Uni-modal and symmetric. 2011: Bi-modal and skewed to the right (right-tailed). 2012: Uni-modal and skewed to the left (left-tailed). Warning: People may misinterpret a frequency polygon as a line chart (for data with a time sequence).
  • 25. Line charts: A line chart is useful in depicting a time series data set, i.e., a two-dimensional data set whose first dimension (the x-axis) is for labels of time points. It visualizes how a quantity changes as time goes by. [Figure: line chart of monthly bike rentals.]
  • 26. Pie charts: A pie chart is a circular depiction of data where each slice represents the percentage of the corresponding category. It visualizes relative frequency distributions well. For our bike rental data set: What are the proportions of rentals in the four seasons? What are the proportions of rentals on the seven days of a week?
  • 27. A pie chart for seasonal rentals:
      Season                 Total rentals   Proportion
      Winter (12/20-3/20)    471348          14.3%
      Spring (3/21-6/20)     918589          27.9%
      Summer (6/21-9/20)     1061129         32.2%
      Fall (9/21-12/20)      841613          25.6%
  • 28. A pie chart for rentals among weekdays:
      Day          Total rentals
      Sunday       444027
      Monday       455503
      Tuesday      469109
      Wednesday    473048
      Thursday     485395
      Friday       487790
      Saturday     477807
  • 29. Data not appropriate for pie charts: Pie charts are used to visualize proportions, i.e., subtotals over the overall total. They should not be used to compare averages. The total numbers of rentals made by male and female users are appropriate for a pie chart. The average numbers of rentals per male and female user are not appropriate for a pie chart.
  • 30. Bar charts: Pie charts are useful in visualizing the proportion of each category. In demonstrating the differences among categories, a bar chart is a better choice. The larger the category, the longer the bar. Some people draw bars vertically; some horizontally.
  • 31. Bar charts: Let’s replace the pie chart with a bar chart.
      Day          Total rentals
      Sunday       444027
      Monday       455503
      Tuesday      469109
      Wednesday    473048
      Thursday     485395
      Friday       487790
      Saturday     477807
    Note that the y-axis does not start at 0!
  • 32. Bar charts vs. histograms: What are the differences that distinguish a bar chart from a histogram? A bar chart uses noncontiguous bars to visualize categorical data. A histogram uses contiguous bars to visualize quantitative data.
  • 33. Visualizing two variables: When we have data for two variables, typically we want to identify whether there is any relationship between them. Visualizing the data in a two-dimensional manner helps. When the two variables are both measured on quantitative scales, we may depict each observation as a point on a plane to create a scatter plot. For our bike rental example: How do monthly rentals in 2011 and those in 2012 relate to each other? How do daily casual and registered rentals relate to each other?
  • 34. Monthly rentals in 2011 and 2012:
      Month   2011      2012
      1       38189     96744
      2       48215     103137
      3       64045     164875
      4       94870     174224
      5       135821    195865
      6       143512    202830
      7       141341    203607
      8       136691    214503
      9       127418    218573
      10      123511    198841
      11      102167    152664
      12      87323     123713
  • 35. Road map: Basic concepts. Data visualization. Data summarization.
  • 36. Summarizing the data with numbers: Descriptive Statistics includes some common ways to describe data. Summarization with numbers. Visualization with graphs. This is always the first step of any data analysis project: To get intuitions that guide our directions. Here we talk about summarization. For a set of (a lot of) numbers, we use a few numbers to summarize them. For a population: these numbers are parameters. For a sample: these numbers are statistics. We will talk about three things: Measures of central tendency for the center or middle part of data. Measures of variability for how variable the data are. Measures of correlation for the relationship between two variables.
  • 37. Medians: The median is the middle value in an ordered set of numbers. Roughly speaking, half of the numbers are below and half are above it. Suppose there are $N$ numbers: If $N$ is odd, the median is the $\frac{N+1}{2}$th number in the ordered set. If $N$ is even, the median is the average of the $\frac{N}{2}$th and the $(\frac{N}{2} + 1)$th numbers in the ordered set. For example: The median of {1, 2, 4, 5, 6, 8, 9} is 5. The median of {1, 2, 4, 5, 6, 8} is $\frac{4+5}{2} = 4.5$.
  • 38. Medians: A median is unaffected by the magnitude of extreme values: The median of {1, 2, 4, 5, 6, 8, 9} is 5. The median of {1, 2, 4, 5, 6, 8, 900} is still 5. Medians may be calculated from quantitative or ordinal data. They cannot be calculated from nominal data. Unfortunately, a median uses only part of the information contained in these numbers. For quantitative data, a median only treats them as ordinal.
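The odd/even rule translates directly into code. A minimal Python sketch (the function name `median` is chosen here for illustration; Python's standard library also provides `statistics.median`):

```python
def median(values):
    """Median via the odd/even rule applied to the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the (N+1)/2-th value (index mid when counting from 0).
        return ordered[mid]
    # Even N: average of the N/2-th and (N/2 + 1)-th values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 2, 4, 5, 6, 8, 9]))    # odd N
print(median([1, 2, 4, 5, 6, 8]))       # even N
print(median([1, 2, 4, 5, 6, 8, 900]))  # extreme value does not move the median
```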
  • 39. Means: The mean is the average of a set of data. It can be calculated only from quantitative data. The mean of {1, 2, 4, 5, 6, 8, 9} is $\frac{1 + 2 + 4 + 5 + 6 + 8 + 9}{7} = 5$. A mean uses all the information contained in the numbers. Unfortunately, a mean will be affected by extreme values. The mean of {1, 2, 4, 5, 6, 8, 900} is $\frac{1 + 2 + 4 + 5 + 6 + 8 + 900}{7} \approx 132.29$! Using the mean and median simultaneously can be a good idea. We should try to identify outliers (extreme values that seem to be “strange”) before calculating a mean (or any statistic).
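Reporting the mean and the median side by side makes the effect of an extreme value visible. A small Python sketch using the two sets from the slides and the standard `statistics` module:

```python
from statistics import mean, median

clean = [1, 2, 4, 5, 6, 8, 9]
with_outlier = [1, 2, 4, 5, 6, 8, 900]

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name}: mean = {mean(data):.2f}, median = {median(data)}")
```

A large gap between the two numbers is a hint (not a proof) that extreme values are present and worth a second look.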
  • 40. Population means vs. sample means: Let $\{x_i\}_{i=1,\dots,N}$ be a population with $N$ as the population size. The population mean is $\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}$. Let $\{x_i\}_{i=1,\dots,n}$ be a sample with $n < N$ as the sample size. The sample mean is $\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}$. People use $\mu$ and $\bar{x}$ in almost the whole statistics world.
  • 41. Population means vs. sample means: $\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}$ and $\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}$. Aren’t these two means the same? From the perspective of calculation, yes. From the perspective of statistical inference, no. Typically the population mean is fixed but unknown. The sample mean is random: We may get different values of $\bar{x}$ today and tomorrow. To start from $\bar{x}$ and use inferential statistics to estimate or test $\mu$, we need to apply probability.
  • 42. Quartiles and percentiles: The median lies at the middle of the data. The first quartile lies at the middle of the first half of the data. The third quartile lies at the middle of the second half of the data. For the $p$th percentile: $\frac{p}{100}$ of the values are below it. $1 - \frac{p}{100}$ of the values are above it. Median, quartiles, and percentiles: The 25th percentile is the first quartile. The 50th percentile is the median (and the second quartile). The 75th percentile is the third quartile.
  • 43. Modes: The mode(s) is (are) the most frequently occurring value(s) in a set of qualitative data. In the set {A, A, A, B, B, C, D, E, F, F, F, G, H}, the modes are A and F. The frequency of each mode (A and F) is 3. Though the above definition may also be applied to quantitative data, sometimes it is useless. In many cases, all values are modes! For quantitative data, we instead look for the modal class(es).
  • 44. Modal classes: In a baseball team, players’ heights (in cm) are: 178, 172, 175, 184, 172, 175, 165, 178, 177, 175, 180, 182, 177, 183, 180, 178, 179, 162, 170, 171. For the classes [160, 165), [165, 170), ..., and [185, 190), the modal class is [175, 180). We sometimes say the mode of this set is 177.5 (the midpoint of the modal class). The way of grouping matters!
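The baseball example can be reproduced with a short Python sketch. It assumes classes of equal width that start at multiples of the width (which matches [160, 165), [165, 170), ...); the helper name `modal_class` is hypothetical, and if several classes tie, only the first one encountered is returned.

```python
from collections import Counter

heights = [178, 172, 175, 184, 172, 175, 165, 178, 177, 175,
           180, 182, 177, 183, 180, 178, 179, 162, 170, 171]

def modal_class(values, width):
    """Return ((lower, upper), frequency) of the most frequent class."""
    counts = Counter((v // width) * width for v in values)
    lo, freq = max(counts.items(), key=lambda item: item[1])
    return (lo, lo + width), freq

(lo, hi), freq = modal_class(heights, 5)
print(f"width 5:  [{lo}, {hi}) with frequency {freq}, midpoint {(lo + hi) / 2}")

(lo10, hi10), freq10 = modal_class(heights, 10)
print(f"width 10: [{lo10}, {hi10}) with frequency {freq10}, midpoint {(lo10 + hi10) / 2}")
```

Changing the class width from 5 to 10 moves the reported "mode" from 177.5 to 175, which illustrates why the way of grouping matters.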
  • 45. Variability: Measures of variability describe the spread or dispersion of a set of data. They are especially important when two sets of data have the same center.
  • 46. Ranges and interquartile ranges: The range of a set of data $\{x_i\}_{i=1,\dots,N}$ is the difference between the maximum and minimum numbers, i.e., $\max_{i=1,\dots,N}\{x_i\} - \min_{i=1,\dots,N}\{x_i\}$. The interquartile range of a set of data is the difference between the third and first quartiles. It is the range of the middle 50% of the data. It excludes the effects of extreme values.
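A short Python sketch contrasting the two measures on the set with an extreme value used earlier. Quartile conventions differ between textbooks and software; the sketch assumes the "inclusive" interpolation method of `statistics.quantiles` (Python 3.8+), so other tools may report slightly different quartiles.

```python
from statistics import quantiles

data = [1, 2, 4, 5, 6, 8, 900]

data_range = max(data) - min(data)

# Quartiles under the "inclusive" convention (an assumption of this sketch).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(f"range = {data_range}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

The range is dominated by the single value 900, while the interquartile range is not affected by it.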
  • 47. Deviations from the mean: Consider a set of population data $\{x_i\}_{i=1,\dots,N}$ with mean $\mu$. Intuitively, a way to measure the dispersion is to examine how each number deviates from the mean. For $x_i$, the deviation from the population mean is defined as $x_i - \mu$. For a sample, the deviation from the sample mean of $x_i$ is $x_i - \bar{x}$.
      i      x_i   Deviation
      1      1     1 − 5 = −4
      2      2     2 − 5 = −3
      3      4     4 − 5 = −1
      4      5     5 − 5 = 0
      5      6     6 − 5 = 1
      6      8     8 − 5 = 3
      7      9     9 − 5 = 4
      Mean   5
  • 48. Mean deviations: May we combine the $N$ deviations into a single number that summarizes the aggregate deviation? Intuitively, we may sum them up and then calculate the mean deviation: $\frac{\sum_{i=1}^{N}(x_i - \mu)}{N}$. Is it always 0?
      i      x_i   Deviation
      1      1     1 − 5 = −4
      2      2     2 − 5 = −3
      3      4     4 − 5 = −1
      4      5     5 − 5 = 0
      5      6     6 − 5 = 1
      6      8     8 − 5 = 3
      7      9     9 − 5 = 4
      Mean   5     0
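The question can be settled in one line, using only the definition of the population mean $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$:

```latex
\frac{\sum_{i=1}^{N}(x_i - \mu)}{N}
  = \frac{\sum_{i=1}^{N} x_i}{N} - \frac{N\mu}{N}
  = \mu - \mu
  = 0.
```

So the mean deviation is 0 for every data set: positive and negative deviations always cancel, which is why it cannot serve as a measure of dispersion.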
  • 49. Adjusting mean deviations: People use two ways to adjust mean deviations: Mean absolute deviations/errors (MAD): $\frac{\sum_{i=1}^{N} |x_i - \mu|}{N}$. Mean squared deviations/errors (variance or MSE): $\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$. A larger MAD or variance means that the data are more dispersed. With $d_i = x_i - \mu$:
      i      x_i   d_i   |d_i|   d_i^2
      1      1     −4    4       16
      2      2     −3    3       9
      3      4     −1    1       1
      4      5     0     0       0
      5      6     1     1       1
      6      8     3     3       9
      7      9     4     4       16
      Mean   5     0     2.29    7.43
  • 50. MAD vs. variance: The main difference: An MAD puts the same weight on all values. A variance puts more weight on extreme values. They may give different rankings of dispersion:
      i      x_i   d_i   |d_i|   d_i^2
      1      0     −5    5       25
      2      4     −1    1       1
      3      5     0     0       0
      4      6     1     1       1
      5      10    5     5       25
      Mean   5     0     2.4     10.4

      i      x_i   d_i   |d_i|   d_i^2
      1      1     −4    4       16
      2      2     −3    3       9
      3      5     0     0       0
      4      8     3     3       9
      5      9     4     4       16
      Mean   5     0     2.8     10
    In general, people use variances more than MADs. But MADs are still popular in some areas, e.g., demand forecasting. It is at the analyst’s discretion to choose the appropriate one.
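The two five-value examples can be checked with a short Python sketch. Both functions follow the population formulas given on the slides (dividing by N); the function names and the labels `set_a` and `set_b` are illustrative only.

```python
def mean_absolute_deviation(values):
    """Mean of |x_i - mu| over all values (population form, divides by N)."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)

def population_variance(values):
    """Mean of (x_i - mu)^2 over all values (population form, divides by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

set_a = [0, 4, 5, 6, 10]   # two far-out values, three near the mean
set_b = [1, 2, 5, 8, 9]    # values spread more evenly

for name, data in [("A", set_a), ("B", set_b)]:
    print(f"set {name}: MAD = {mean_absolute_deviation(data):.2f}, "
          f"variance = {population_variance(data):.2f}")
```

Set A has the smaller MAD but the larger variance, because squaring gives its two extreme values (0 and 10) more weight.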
  • 51. Standard deviations: One drawback of using variances is that the unit of measurement is the square of the original one. For the baseball team, the variance of member heights is 34.05 cm$^2$. What is it?! People take the square root of a variance to generate a standard deviation. The standard deviation of member heights is $\sqrt{34.05} \approx 5.84$ cm. Heights (in cm): 178, 172, 175, 184, 172, 175, 165, 178, 177, 175, 180, 182, 177, 183, 180, 178, 179, 162, 170, 171. A standard deviation typically has more managerial implications.
  • 52. Population vs. sample variances: Recall that the formulas for population and sample means are $\mu \equiv \frac{\sum_{i=1}^{N} x_i}{N}$ and $\bar{x} \equiv \frac{\sum_{i=1}^{n} x_i}{n}$, respectively. Formula-wise there is no difference. However, population and sample variances are $\sigma^2 \equiv \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$ and $s^2 \equiv \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$, respectively. Note the difference between $N$ and $n - 1$! Population and sample standard deviations are $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$ and $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$, respectively. People use $\sigma^2$, $\sigma$, $s^2$, and $s$ in almost the whole statistics world.
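Python's standard `statistics` module exposes both versions, which makes the $N$ versus $n - 1$ difference easy to see. In this sketch the seven numbers from the earlier slides are treated once as a whole population and once as a sample, purely for illustration:

```python
from statistics import pvariance, pstdev, variance, stdev

data = [1, 2, 4, 5, 6, 8, 9]

sigma2 = pvariance(data)   # population variance: divides by N
s2 = variance(data)        # sample variance: divides by n - 1

print(f"population: variance = {sigma2:.4f}, std = {pstdev(data):.4f}")
print(f"sample:     variance = {s2:.4f}, std = {stdev(data):.4f}")
```

The sum of squared deviations is 52 in both cases; only the divisor changes (7 versus 6), so the sample variance is always the larger of the two.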
  • 53. Coefficient of variation: The coefficient of variation is the ratio of the standard deviation to the mean: Coefficient of variation $= \frac{\sigma}{\mu}$. When will you use coefficients of variation?
  • 54. z-scores: Consider a set of sample data $\{x_i\}_{i=1,\dots,n}$ with sample mean $\bar{x}$ and sample standard deviation $s$. For $x_i$, the z-score is $z_i = \frac{x_i - \bar{x}}{s}$. In a set of population data $\{x_i\}_{i=1,\dots,N}$ with population mean $\mu$ and population standard deviation $\sigma$, the z-score of $x_i$ is $z_i = \frac{x_i - \mu}{\sigma}$. A value’s z-score measures how many standard deviations it deviates from the mean.
  • 55. z-scores vs. outliers: For detecting outliers, one common way is to double check whether $x_i$ is an outlier if $|z_i| = \left|\frac{x_i - \mu}{\sigma}\right| > 3$. It is quite rare for the magnitude of a value’s z-score to be so large. For sample data, use $\frac{x_i - \bar{x}}{s}$. Some people propose using the median and MAD in a similar way: double check whether $x_i$ is an outlier if $\left|\frac{x_i - \text{median}}{\text{MAD}}\right| > 3$. (The “MAD” here can be mean absolute deviation from mean, mean absolute deviation from median, median absolute deviation from median, etc.) The above rules only suggest that one investigate some extreme values again. These rules are neither sufficient nor necessary for outliers.
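A minimal Python sketch of the z-score rule, using the population formulas. The baseball heights come from the earlier slide; the extra entry of 250 cm is a made-up value added only to trigger the rule, and the function name `flag_by_z_score` is hypothetical.

```python
from statistics import mean, pstdev

def flag_by_z_score(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold.

    Uses the population mean and standard deviation of `values`.
    Flagged values are candidates for a second look, not proven outliers.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return [x for x in values if abs(x - mu) / sigma > threshold]

heights = [178, 172, 175, 184, 172, 175, 165, 178, 177, 175,
           180, 182, 177, 183, 180, 178, 179, 162, 170, 171]

print(flag_by_z_score(heights))            # nothing flagged in the original data
print(flag_by_z_score(heights + [250]))    # 250 is a made-up entry
print(flag_by_z_score([1, 2, 4, 5, 6, 8, 900]))
```

The last call flags nothing even though 900 is clearly extreme: the extreme value inflates the standard deviation itself, and with only seven values no z-score can exceed 3 in magnitude. This is one concrete way in which the rule is not necessary for a value to be an outlier.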
  • 56. Correlation: Consider the size of a house and its price in a city:
      Size (in m^2)   Price (in $1000)
      75              315
      59              229
      85              355
      65              261
      72              234
      46              216
      107             308
      91              306
      75              289
      65              204
      88              265
      59              195
    How do we measure/describe the correlation (linear relationship) between the two variables?
  • 57. Intuition: Consider a set of paired data $\{(x_i, y_i)\}_{i=1,\dots,N}$. When one variable goes up, does the other one tend to go up or down? More precisely, if $x_i$ is larger than $\mu_x$ (the mean of the $x_i$s), is it more likely to see $y_i > \mu_y$ or $y_i < \mu_y$? If one tends to go up when the other goes up, we say that the two variables have a positive correlation. If one goes up when the other goes down, there is a negative correlation.
  • 58. Covariances: We define the covariance of a set of two-dimensional (sample) data as $s_{xy} \equiv \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$. If most points fall in the first and third quadrants (relative to the means), most $(x_i - \bar{x})(y_i - \bar{y})$ will be positive and $s_{xy}$ tends to be positive. Otherwise, $s_{xy}$ tends to be negative. The covariance of house size and price is 617.16. Is it large or small? This depends on how variable the two variables themselves are.
  • 59. Pearson’s correlation coefficients: To take away the variability of each variable itself, we define the sample correlation coefficient as $r \equiv \frac{s_{xy}}{s_x s_y}$, where $s_x$ and $s_y$ are the sample standard deviations of the $x_i$s and $y_i$s (the population version is defined analogously). In our example, we have $r = \frac{617.16}{16.78 \times 50.45} \approx 0.729$. It can be shown that we always have $-1 \le r \le 1$. $r > 0$: Positive correlation. $r = 0$: No correlation. $r < 0$: Negative correlation. People often determine the degree of correlation based on $|r|$: $0 \le |r| < 0.25$: A weak correlation. $0.25 \le |r| < 0.5$: A moderately weak correlation. $0.5 \le |r| < 0.75$: A moderately strong correlation. $0.75 \le |r| \le 1$: A strong correlation.
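The house example can be recomputed from the twelve (size, price) pairs on the slide. A minimal Python sketch of the sample covariance and Pearson's r as defined above; the function names are illustrative only.

```python
from math import sqrt

sizes = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]               # m^2
prices = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]   # $1000

def sample_covariance(xs, ys):
    """s_xy: sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Pearson's r = s_xy / (s_x * s_y)."""
    s_x = sqrt(sample_covariance(xs, xs))   # covariance of x with itself is s_x^2
    s_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

cov = sample_covariance(sizes, prices)
r = sample_correlation(sizes, prices)
print(f"s_xy = {cov:.2f}, r = {r:.3f}")
```

Dividing by $s_x s_y$ removes the units, so r stays the same if sizes are recorded in square feet or prices in another currency, whereas the covariance would change.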
  • 60. Correlation vs. independence: A correlation coefficient only measures how one variable linearly depends on the other variable. [Figure: two scatter plots, one with r = 0.5973 and one with r = 0.] Being uncorrelated does not mean being independent!
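A tiny constructed example makes the point concrete. In this Python sketch (the data are made up for illustration and are not the data behind the slide's figure), y is completely determined by x, yet the correlation coefficient is 0 because the relationship is not linear:

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Pearson's r computed from the sample covariance and standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]     # y depends on x exactly, but not linearly

r = sample_correlation(xs, ys)
print(f"r = {r}")
```

Because the x values are symmetric around 0, positive and negative products cancel in the covariance, so r is 0 even though knowing x tells us y exactly.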
  • 61. Correlation vs. causation: A correlation coefficient only measures whether two variables correlate with each other. High correlation does not mean causation. A causes B or B causes A? C causes A and B? Or just by chance?
  • 62. Correlation of qualitative variables: Sometimes the variables are not quantitative/numeric. For ordinal data, we calculate their Spearman’s rank correlation. For nominal data, we calculate Cramer’s V.
  • 63. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Statistics and Data Analysis for Engineers Part 2: Hypothesis Testing and p-value Ling-Chieh Kung Department of Information Management National Taiwan University September 4, 2016 Hypothesis Testing and p-value 1 / 71 Ling-Chieh Kung (NTU IM)
  • 64. Road map. Sampling. Sampling distributions. Hypothesis testing. p-value, t test, and more.
  • 65. Random vs. nonrandom sampling. Sampling is the process of selecting a subset of entities from the whole population. Sampling can be random or nonrandom. If random, whether an entity is selected is probabilistic. E.g., randomly select 1000 phone numbers from the telephone book and then call them. If nonrandom, it is deterministic. E.g., ask all your classmates for their preferences on iOS/Android. Most statistical methods are only for random sampling. Some popular random sampling techniques: simple random sampling, stratified random sampling, and cluster (or area) random sampling.
  • 66. Simple random sampling. In simple random sampling, each entity has the same probability of being selected. The advantage of simple random sampling is its simplicity. However, it may result in nonrepresentative samples: it is possible that too many of the sampled entities fall in the same stratum, i.e., share the same property. E.g., it is possible that all randomly sampled voters are younger than 40. The sample is thus nonrepresentative. How to fix this problem?
  • 67. Stratified random sampling. We may apply stratified random sampling. We first split the whole population into several strata. Data in one stratum should be (relatively) homogeneous. Data in different strata should be (relatively) heterogeneous. We then use simple random sampling for each stratum.
  • 68. Stratified random sampling. As an example, suppose that we want to sample 40 out of 1000 graduates to understand the number of credits they get at school. Suppose that 100 students double majored; then we can split the whole population into two strata:
      Stratum            Stratum size
      Double major       100
      No double major    900
    To sample 40 graduates, we sample $40 \times \frac{100}{1000} = 4$ from the double-major stratum and 36 from the other stratum.
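The proportional allocation in this example can be sketched as follows. The function names, the seed, and the graduate IDs are hypothetical, and rounding each quota is a simplifying assumption (with awkward stratum sizes the rounded quotas may not add up to the target exactly).

```python
import random

def proportional_allocation(strata_sizes, sample_size):
    """Quota for each stratum: sample_size * (stratum size / population size)."""
    total = sum(strata_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}

def stratified_sample(strata, sample_size, seed=0):
    """Simple random sampling (without replacement) inside each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    quota = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(members, quota[name])
            for name, members in strata.items()}

# Hypothetical IDs for the 1000 graduates on the slide.
graduates = {
    "Double major": [f"D{i}" for i in range(100)],
    "No double major": [f"N{i}" for i in range(900)],
}
sample = stratified_sample(graduates, 40)
print({name: len(chosen) for name, chosen in sample.items()})
```

With the slide's sizes, the quotas are 4 and 36, matching the calculation above.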
  • 69. Stratified random sampling. We may further split the population into more strata. Double major: yes or no. Class: 1994-1998, 1999-2003, 2004-2008, or 2009-2012. This stratification makes sense only if students in different classes tend to take different numbers of credits. Stratified random sampling is good at reducing sampling error. But it can be hard to identify a reasonable stratification. It is also more costly and time-consuming.
  • 70. Cluster (or area) random sampling. Imagine that you are going to introduce a new product into all the retail stores in Taiwan. If the product is actually unpopular, an introduction with a large quantity will incur a huge loss. How to get an idea about the popularity? Typically we first try to introduce the product in a small area. We put the product on the shelves only in those stores in the specified area. This is the idea of cluster (or area) random sampling. Those consumers in the area form a sample.
  • 71. Cluster (or area) random sampling. In cluster random sampling, we define clusters. We will only choose one or some clusters and then collect all the data in these clusters. If a cluster is too large, we may further split it into multiple second-stage clusters. Therefore, we want data in a cluster to be heterogeneous, and data across clusters somewhat homogeneous. For example, people may do cluster random sampling to understand the popularity of a new product. Those chosen cities (counties, states, etc.) are called test market cities (counties, states, etc.). People use cluster random sampling in this case because of its feasibility and convenience. We should select test market cities whose population profiles are similar to that of the entire country.
  • 72. Nonrandom sampling. Sometimes we do nonrandom sampling. Convenience sampling: the researcher samples data that are easy to sample. Judgment sampling: the researcher decides whom to ask or what data to collect. Quota sampling: in each stratum, we use whatever method is easiest to fill the quota, a predetermined number of samples in the stratum. Snowball sampling: once we ask one person, we ask her/him to suggest others. Nonrandom sampling cannot be analyzed by the statistical methods we introduce in this course.
  • 74. Sampling distributions. When we cannot examine the whole population, we study a sample. What will be contained in a random sample is unpredictable. We need to know the probability distribution of a sample statistic so that we may connect the sample with the population. The probability distribution of a sample statistic is a sampling distribution.
  • 75. Sampling distributions. A factory produces bags of candies. Ideally, each bag should weigh 2 kg. As the production process cannot be perfect, a bag of candies should weigh between 1.8 and 2.2 kg. Let $X$ be the weight of a bag of candies. Let $\mu$ and $\sigma$ be its expected value and standard deviation. Is $\mu = 2$? Is $1.8 < \mu < 2.2$? How large is $\sigma$? Let's sample: In a random sample of 1 bag of candies, suppose it weighs 2.1 kg. May we conclude that $1.8 < \mu < 2.2$? What if the average weight of 5 bags in a random sample is 2.1 kg? What if the sample size is 10, 50, or 100? What if the mean is 2.3 kg? We need to know the sampling distribution of those statistics (sample mean, sample standard deviation, etc.).
  • 76. Sample means. The sample mean is one of the most important statistics. Definition 1: Let $\{X_i\}_{i=1,\dots,n}$ be a sample from a population; then $\bar{x} = \frac{\sum_{i=1}^{n} X_i}{n}$ is the sample mean. Sometimes we write $\bar{x}_n$ to emphasize that the sample size is $n$. We assume that $X_i$ and $X_j$ are independent for all $i \neq j$. This is fine if $n \ll N$, i.e., we sample a few items from a large population of size $N$. In practice, we require $n \le 0.05N$.
  • 77. Means and variances of sample means. Suppose the population mean and variance are $\mu$ and $\sigma^2$, respectively. These two numbers are fixed. A sample mean $\bar{x}$ is a random variable. It has its expected value $E[\bar{x}]$, variance $\mathrm{Var}(\bar{x})$, and standard deviation $\sqrt{\mathrm{Var}(\bar{x})}$. These numbers are all fixed. They are also denoted as $\mu_{\bar{x}}$, $\sigma^2_{\bar{x}}$, and $\sigma_{\bar{x}}$, respectively. For any population, we have the following result. Proposition 1 (Mean and variance of a sample mean): Let $\{X_i\}_{i=1,\dots,n}$ be a size-$n$ random sample from a population with mean $\mu$ and variance $\sigma^2$; then we have $\mu_{\bar{x}} = \mu$, $\sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$, and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
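The slides state Proposition 1 without proof. A short derivation, assuming (as on the "Sample means" slide) that the $X_i$ are independent and each has mean $\mu$ and variance $\sigma^2$, could read:

```latex
\begin{align*}
\mu_{\bar{x}} = E[\bar{x}]
  &= E\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
   = \frac{1}{n}\sum_{i=1}^{n} E[X_i]
   = \frac{1}{n}\cdot n\mu = \mu, \\
\sigma^2_{\bar{x}} = \mathrm{Var}(\bar{x})
  &= \frac{1}{n^2}\,\mathrm{Var}\!\left(\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i)
   = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}, \\
\sigma_{\bar{x}}
  &= \sqrt{\mathrm{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}}.
\end{align*}
```

Independence is used only in the second line, where the variance of the sum splits into the sum of the variances.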
  • 78. Means and variances of sample means. Do the terms confuse you? The sample mean vs. the mean of the sample mean. The sample variance vs. the variance of the sample mean. By definition, they are: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} X_i$, a random variable; $E[\bar{x}]$, a constant; $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{x})^2$, a random variable; $\mathrm{Var}(\bar{x})$, a constant. The sample variance also has its mean and variance.
  • 79. Example: Quality inspection. The weight of a bag of candies follows a normal distribution with mean $\mu = 2$ and standard deviation $\sigma = 0.2$. Suppose the quality control officer decides to sample 4 bags and calculate the sample mean $\bar{x}$. She will punish me if $\bar{x} \notin [1.8, 2.2]$. Note that my production process is actually "good": $\mu = 2$. Unfortunately, it is not perfect: $\sigma > 0$. We may still be punished (if we are unlucky) even though $\mu = 2$. What is the probability that I will be punished? We want to calculate $1 - \Pr(1.8 < \bar{x} < 2.2)$. We know that $\mu_{\bar{x}} = \mu = 2$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{4}} = 0.1$. But we do not know the probability distribution of $\bar{x}$!
  • 80. Sampling from a normal population. If the population is normal, the sample mean is also normal! Proposition 2: Let $\{X_i\}_{i=1,\dots,n}$ be a size-$n$ random sample from a normal population with mean $\mu$ and standard deviation $\sigma$. Then $\bar{x} \sim \mathrm{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$. We already know that $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. This is true regardless of the population distribution. When the population is normal, the sample mean will also be normal.
  • 81. Example revisited: Quality inspection. The weight of a bag of candies follows a normal distribution with mean $\mu = 2$ and standard deviation $\sigma = 0.2$. Suppose the quality control officer decides to sample 4 bags and calculate the sample mean $\bar{x}$. She will punish me if $\bar{x} \notin [1.8, 2.2]$. What is the probability that I will be punished? The distribution of the sample mean $\bar{x}$ is $\mathrm{ND}(2, 0.1)$, so $\Pr(\bar{x} < 1.8) + \Pr(\bar{x} > 2.2) \approx 0.045$.
  • 82. Adjusting the standard deviation. When the population is $\mathrm{ND}(\mu = 2, \sigma = 0.2)$ and the sample size is $n = 4$, the probability of punishment is 0.045. If we adjust our standard deviation $\sigma$ (by paying more or less attention to the production process), the probability will change. Reducing $\sigma$ reduces the probability of being punished. With the sampling distribution of $\bar{x}$, we may optimize $\sigma$. An improvement from 0.2 to 0.15 is helpful; a further improvement from 0.15 to 0.1 helps much less.
  • 83. Adjusting the sample size. When the population is $\mathrm{ND}(2, 0.2)$ and the sample size is $n = 4$, the probability of punishment is 0.045. If the quality control officer increases the sample size $n$, the probability will decrease. $\mu = 2$ is actually ideal. A larger sample size makes the officer less likely to make a mistake.
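The punishment probability in the quality inspection example, and how it reacts to $\sigma$ and $n$, can be checked with Python's standard library. This is a sketch: the function name and the particular values of $\sigma$ and $n$ tried below (other than $\sigma = 0.2$, $n = 4$) are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def punishment_probability(sigma, n, mu=2.0, lower=1.8, upper=2.2):
    """Pr(x_bar < lower) + Pr(x_bar > upper) when x_bar ~ ND(mu, sigma / sqrt(n))."""
    x_bar = NormalDist(mu, sigma / sqrt(n))
    return x_bar.cdf(lower) + (1 - x_bar.cdf(upper))

# Baseline from the slides: sigma = 0.2, n = 4, i.e. x_bar ~ ND(2, 0.1).
baseline = punishment_probability(0.2, 4)
print(f"baseline: {baseline:.4f}")

# Reducing sigma with n fixed at 4.
for sigma in (0.2, 0.15, 0.1):
    print(f"sigma = {sigma}, n = 4: {punishment_probability(sigma, 4):.5f}")

# Increasing n with sigma fixed at 0.2.
for n in (4, 9, 16):
    print(f"sigma = 0.2, n = {n}: {punishment_probability(0.2, n):.5f}")
```

Both adjustments shrink $\sigma/\sqrt{n}$, which is why both lower the probability. The baseline evaluates to about 0.0455, which the slides report as 0.045.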
  • 84. Distribution of the sample mean. So now we have one general conclusion: when we sample from a normal population, the sample mean is also normal, and its mean and standard deviation are $\mu$ and $\frac{\sigma}{\sqrt{n}}$, respectively. What if the population is non-normal? Fortunately, we have a very powerful theorem, the central limit theorem, which applies to any population.
  • 85. Central limit theorem. The theorem says that a sample mean is approximately normal when the sample size is large enough. Proposition 3 (Central limit theorem): Let $\{X_i\}_{i=1,\dots,n}$ be a size-$n$ random sample from a population with mean $\mu$ and standard deviation $\sigma$. Let $\bar{x}_n$ be the sample mean. If $\sigma < \infty$, then the distribution of $\bar{x}_n$ approaches $\mathrm{ND}\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ as $n \to \infty$. How large is "large enough"? In practice, typically $n \ge 30$ is believed to be large enough.
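A small simulation can make the theorem concrete. The sketch below is an illustration under assumptions not in the slides: the population is exponential with rate 1 (so $\mu = \sigma = 1$, and clearly non-normal), and the number of repetitions and the seed are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulate_sample_means(n, repetitions=5000, seed=1):
    """Draw `repetitions` samples of size n from Exponential(1); return their means."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(repetitions)]

n = 30
mu, sigma = 1.0, 1.0          # mean and sd of the Exponential(1) population
se = sigma / sqrt(n)          # sigma / sqrt(n), the sd the theorem predicts
means = simulate_sample_means(n)

# Share of sample means within 1.96 predicted sds of mu; a normal would give 95%.
inside = sum(abs(m - mu) <= 1.96 * se for m in means) / len(means)

print(f"mean of sample means: {mean(means):.3f} (theory: {mu})")
print(f"sd of sample means:   {stdev(means):.3f} (theory: {se:.3f})")
print(f"share within 1.96 sd: {inside:.3f}")
```

The simulated mean and standard deviation of $\bar{x}_n$ should land close to $\mu$ and $\sigma/\sqrt{n}$, and the coverage close to 95%, even though individual observations are far from normal.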
  • 87. Hypothesis testing. How do scientists (physicists, chemists, etc.) do research? Observe phenomena. Make hypotheses. Test the hypotheses through experiments (or other methods). Make conclusions about the hypotheses. Social scientists and business researchers do the same thing with hypothesis testing. It is one of the most important techniques of statistical inference. It is a technique for (statistically) proving things. It relies on sampling distributions.
  • 88. People ask questions. In the business (or social science) world, people ask questions: Are older workers more loyal to a company? Does the newly hired CEO enhance our profitability? Is one candidate preferred by more than 50% of voters? Do teenagers eat fast food more often than adults? Is the quality of our products stable enough? How should we answer these questions? Statisticians suggest: first make a hypothesis, then test it with samples and statistical methods.
  • 89. Statistical hypotheses. A statistical hypothesis is a formal way of stating a hypothesis. Typically it is a mathematical description of parameters to test. It contains two parts: the null hypothesis (denoted as $H_0$) and the alternative hypothesis (denoted as $H_a$ or $H_1$). The alternative hypothesis is the thing that we want (need) to prove, the conclusion that can be made only if we have strong evidence. The null hypothesis corresponds to a default position. We first assume that the null hypothesis is correct. Then we collect sample data. If under the null hypothesis it is quite unlikely to see our observed result, we claim that the null hypothesis is wrong.
  • 90. Statistical hypotheses: example 1. In our factory, we produce packs of candy whose average weight should be 1 kg. One day, a consumer told us that his pack only weighs 900 g. We need to know whether this is just a rare event or our production system is out of control. If (we believe) the system is out of control, we need to shut down the machine and spend two days on inspection and maintenance. This will cost us at least $100,000. So we should not believe that our system is out of control just because of one complaint. What should we do?
  • 91. Statistical hypotheses: example 1. We first state a hypothesis: "Our production system is under control." Then we ask: is there strong enough evidence showing that the hypothesis is wrong, i.e., that the system is out of control? Initially, we assume that our system is under control. Then we do a survey to see if we have strong enough evidence. We shut down machines only if we can "prove" that the system is indeed out of control. Let $\mu$ be the average weight (in kg); the statistical hypothesis is $H_0: \mu = 1$ versus $H_a: \mu \neq 1$.
  • 92. Statistical hypotheses: example 2. In our society, we adopt the presumption of innocence. One is considered innocent until proven guilty. So when there is a person who probably stole some money: $H_0$: the person is innocent; $H_a$: the person is guilty. There are two possible errors: one is guilty but we think she/he is innocent, or one is innocent but we think she/he is guilty. Which one is more critical? It is unacceptable that an innocent person is considered guilty. We will say one is guilty only if there is strong evidence.
  • 93. Statistical hypotheses: example 3. Consider the following hypothesis: "The candidate is preferred by more than 50% of voters." As we need a default position, and the percentage that we care about is 50%, we will choose our null hypothesis as $H_0: p = 0.5$. Here $p$ is the population proportion of voters preferring the candidate. More precisely, let $X_i = 1$ if voter $i$ prefers this candidate and 0 otherwise, $i = 1, \dots, N$; then $p = \frac{\sum_{i=1}^{N} X_i}{N}$. How about the alternative hypothesis? Should it be $H_a: p > 0.5$ or $H_a: p < 0.5$?
  • 94. Statistical hypotheses: example 3. The choice of the alternative hypothesis depends on the related decisions or actions to make. If one will run in the election only if she thinks she will win (i.e., $p > 0.5$), the alternative hypothesis will be $H_a: p > 0.5$. If one tends to participate in the election and will give up only if the chance is slim, the alternative hypothesis will be $H_a: p < 0.5$. The alternative hypothesis is "the thing we want (need) to prove."
  • 95. Two types of errors. Type-1 error (false positive): rejecting a true null hypothesis. There is nothing, but we say there is something. Type-2 error (false negative): not rejecting a false null hypothesis. There is something, but we do not see it.
  • 97. Remarks. We want to control the chances for us to make these mistakes. Unfortunately, we cannot control both. We choose to control the probability of a type-1 error. The choice of the default position is important. For setting up a statistical hypothesis: our default position will be put in the null hypothesis, and the thing we want to prove (i.e., the thing that needs strong evidence) will be put in the alternative hypothesis. For writing the mathematical statement: the equal sign ($=$) will always be put in the null hypothesis, and the alternative hypothesis contains an unequal sign or a strict inequality: $\neq$, $>$, or $<$. The direction of the alternative hypothesis, when it is an inequality, depends on the context.
  • 98. One-tailed tests and two-tailed tests. If the alternative hypothesis contains an unequal sign ($\neq$), the test is a two-tailed test. If it contains a strict inequality ($>$ or $<$), the test is a one-tailed test. Suppose we want to test the value of the population mean. In a two-tailed test, we test whether the population mean significantly deviates from a hypothesized value. We do not care whether it is larger or smaller. In a one-tailed test, we test whether the population mean significantly deviates from a hypothesized value in a specific direction.
  • 99. The first example: a two-tailed test. Let's test the average weight (in g) of our products: $H_0: \mu = 1000$ versus $H_a: \mu \neq 1000$. The variance of the product weights is $\sigma^2 = 40000\ \mathrm{g}^2$. The case with unknown $\sigma^2$ will be discussed later. A random sample has been collected. Suppose the sample size is $n = 100$. Suppose the sample mean is $\bar{X} = 963$. How to make a conclusion?
  • 100. Controlling the error probability. All we can do is to collect a random sample and make our conclusion based on the observed sample. It is natural that we may be wrong when we claim $\mu \neq 1000$. We want to control the error probability. Let $\alpha$ be the maximum probability for us to make this error. $\alpha$ is called the significance level. $1 - \alpha$ is called the confidence level. Target: if $\mu = 1000$, our sampling and testing process will make us claim that $\mu \neq 1000$ with probability at most $\alpha$.
  • 101. Rejection rule. Now let's test with the significance level $\alpha = 0.05$. Intuitively, if $\bar{X}$ deviates from 1000 a lot, we should reject the null hypothesis and believe that $\mu \neq 1000$. If $\mu = 1000$, it is very unlikely to observe such a large deviation. So such a large deviation provides strong evidence. So we start by sampling and calculating the sample mean. We want to construct a rejection rule: if $|\bar{X} - 1000| > d$, we reject $H_0$. We need to calculate $d$.
  • 102. Rejection rule. We want a distance $d$ such that if $H_0$ is true, the probability of rejecting $H_0$ is at most 5%, i.e., $\Pr\left(|\bar{X} - 1000| > d \mid \mu = 1000\right) \le 0.05$. The smallest $d$ that satisfies the above inequality requires $\Pr(|\bar{X} - 1000| > d) = 0.05$. Consider $\bar{X}$: we know $\sigma = 200$ and $n = 100$, and we assume that $\mu = 1000$. Thanks to the central limit theorem, $\bar{X} \sim \mathrm{ND}(1000, 20)$, where $20 = 200/\sqrt{100}$. We then solve $\Pr(|\bar{X} - 1000| > d) = 0.05$ for $d$.
  • 103. Rejection rule: the critical value. According to $\bar{X} \sim \mathrm{ND}(1000, 20)$, $\Pr(|\bar{X} - 1000| > 39.2) = 0.05$. The rejection region is $R = (-\infty, 960.8) \cup (1039.2, \infty)$. If $\bar{X}$ falls in the rejection region, we reject $H_0$.
  • 104. Rejection rule: the critical value. Because $\bar{x} = 963 \notin R$, we cannot reject $H_0$. The deviation from 1000 is not large enough. The evidence is not strong enough.
  • 105. Rejection rule: the critical value. In this example, the two values 960.8 and 1039.2 are the critical values for rejection. If the sample mean is more extreme than one of the critical values, we reject $H_0$. Otherwise, we do not reject $H_0$. $\bar{x} = 963$ is not strong enough to support $H_a: \mu \neq 1000$. Concluding statement: Because the sample mean does not lie in the rejection region, we cannot reject $H_0$. With a 95% confidence level, there is no strong evidence showing that the average weight is not 1000 g. Therefore, we should not shut down machines to do an inspection.
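The critical values 960.8 and 1039.2 can be reproduced with the standard library's normal distribution. This is a sketch; the function name is an assumption, and the numbers fed in are those of the example ($\mu_0 = 1000$, $\sigma = 200$, $n = 100$, $\alpha = 0.05$).

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_critical_values(mu0, sigma, n, alpha):
    """Return (lower, upper) such that Pr(X_bar outside [lower, upper]) = alpha under H0."""
    standard_error = sigma / sqrt(n)             # sd of the sample mean
    z = NormalDist().inv_cdf(1 - alpha / 2)      # about 1.96 for alpha = 0.05
    d = z * standard_error                       # the distance d in the rejection rule
    return mu0 - d, mu0 + d

lower, upper = two_tailed_critical_values(1000, 200, 100, 0.05)
x_bar = 963
reject = x_bar < lower or x_bar > upper
print(f"rejection region: below {lower:.1f} or above {upper:.1f}; reject H0: {reject}")
```

Because $\alpha$ is split evenly between the two tails, each tail holds probability 0.025, which is where the factor of about 1.96 comes from.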
  • 106. Summary. We want to know whether the machine is out of control. If the machine is actually good, we do not want to reach a conclusion that requires an inspection and maintenance. We will do the inspection only if we have strong evidence suggesting that $\mu \neq 1000$. We want to know whether $H_0$ is false, i.e., $\mu \neq 1000$. We control the probability of making a wrong conclusion. We should not reject $H_0$ if it is true. We limit the probability at $\alpha = 5\%$. We will conclude that $H_0$ is false if $\bar{X}$ falls in the rejection region. The calculation of the critical values is based on the normal distribution, which can always be transformed to the z distribution. This is called a z test.
  • 107. Not rejecting vs. accepting. We should be careful in writing our conclusions. Wrong: Because the sample mean does not lie in the rejection region, we accept $H_0$. With a 95% confidence level, there is strong evidence showing that the average weight is 1000 g. Right: Because the sample mean does not lie in the rejection region, we cannot reject $H_0$. With a 95% confidence level, there is no strong evidence showing that the average weight is not 1000 g. Being unable to prove that something is false does not mean it is true!
  • 108. The first example (part 2). Suppose that we modify the hypothesis into a directional one: $H_0: \mu = 1000$ versus $H_a: \mu < 1000$. (Some researchers write $H_0: \mu \ge 1000$ in this case.) We still have $\sigma^2 = 40000$, $n = 100$, and $\alpha = 0.05$. This is a one-tailed test. Once we have strong evidence supporting $H_a$, we will claim that $\mu < 1000$. We need to find a distance $d$ such that $\Pr\left(1000 - \bar{X} > d \mid \mu = 1000\right) = 0.05$.
  • 109. Rejection rule: the critical value. For $0.05 = \Pr(1000 - \bar{X} > d)$, we have $d = 32.9$. As the observed sample mean $\bar{x} = 963 \in (-\infty, 967.1)$, we reject $H_0$. The deviation from 1000 is large enough. The evidence is strong enough.
  • 110. Rejection rule: the critical value. In this example, 967.1 is the critical value for rejection. If the sample mean is more extreme than (in this case, below) the critical value, we reject $H_0$. Otherwise, we do not reject $H_0$. There is strong evidence supporting $H_a: \mu < 1000$. Concluding statement: Because the sample mean lies in the rejection region, we reject $H_0$. With a 95% confidence level, there is strong evidence showing that the average weight is less than 1000 g.
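For the one-tailed version, the whole of $\alpha$ sits in the left tail, so the critical value moves closer to 1000. A minimal sketch, with a hypothetical function name and the example's numbers:

```python
from math import sqrt
from statistics import NormalDist

def left_tailed_critical_value(mu0, sigma, n, alpha):
    """Return c such that Pr(X_bar < c) = alpha when X_bar ~ ND(mu0, sigma / sqrt(n))."""
    return NormalDist(mu0, sigma / sqrt(n)).inv_cdf(alpha)

critical = left_tailed_critical_value(1000, 200, 100, 0.05)
d = 1000 - critical            # the distance d in Pr(1000 - X_bar > d) = 0.05
x_bar = 963
reject = x_bar < critical
print(f"d = {d:.1f}, critical value = {critical:.1f}, reject H0: {reject}")
```

The same observation $\bar{x} = 963$ that was not rejected against $H_a: \mu \neq 1000$ is rejected against $H_a: \mu < 1000$, because the one-tailed critical value is about 967.1 rather than 960.8.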
  • 111. One-tailed tests vs. two-tailed tests. When should we use a two-tailed test? We use a two-tailed test when we lack information about the direction. E.g., we suspect that the population mean has changed, but we have no idea whether it has become larger or smaller. If we know or believe that the change is possible only in one direction, we may use a one-tailed test. Having more information (i.e., knowing the direction of change) makes rejection "easier," i.e., it is easier to find strong enough evidence.
  • 112. Summary. Distinguish the following pairs: One- and two-tailed tests. No evidence showing $H_0$ is false and having evidence showing $H_0$ is true. Not rejecting $H_0$ and accepting $H_0$. Using $=$ and using $\ge$ or $\le$ in the null hypothesis.
  • 114. The p-value. The p-value is an important, meaningful, and widely adopted tool for hypothesis testing. Definition 2: For an observed value of a statistic in a statistical test, the p-value is the probability of observing a value that is more extreme than the observed value under the assumption that the null hypothesis is true. It is calculated based on an observed value of the statistic. It is the tail probability of the observed value. It assumes that the null hypothesis is true.
  • 115. The p-value. Mathematically: Suppose we test a population mean $\mu$ with a one-tailed test $H_0: \mu = 1000$ versus $H_a: \mu < 1000$. Given an observed $\bar{x}$, the p-value is defined as $\Pr(\bar{X} \le \bar{x})$. In the previous example, $\sigma = 200$, $n = 100$, $\alpha = 0.05$, and $\bar{x} = 963$. If $H_0$ is true, i.e., $\mu = 1000$, we have $\Pr(\bar{X} \le 963) = 0.032$. The p-value of $\bar{x}$ is 0.032.
  • 116. How to use the p-value? The p-value can be used for constructing a rejection rule. For a one-tailed test: if the p-value is smaller than $\alpha$, we reject $H_0$; if the p-value is greater than $\alpha$, we do not reject $H_0$. In our example, the one-tailed test is $H_0: \mu = 1000$ versus $H_a: \mu < 1000$. We have $\alpha = 0.05$. Because the p-value $0.032 < 0.05$, we reject $H_0$.
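The p-value of 0.032 and the resulting decision can be reproduced as follows. This is a sketch with a hypothetical function name; it assumes, as the example does, that $\bar{X} \sim \mathrm{ND}(1000, 200/\sqrt{100})$ under $H_0$.

```python
from math import sqrt
from statistics import NormalDist

def left_tail_p_value(x_bar, mu0, sigma, n):
    """Pr(X_bar <= x_bar) when X_bar ~ ND(mu0, sigma / sqrt(n)), i.e. under H0."""
    return NormalDist(mu0, sigma / sqrt(n)).cdf(x_bar)

alpha = 0.05
p_value = left_tail_p_value(963, 1000, 200, 100)
reject = p_value < alpha       # one-tailed rule: reject H0 when p-value < alpha
print(f"p-value = {p_value:.3f}, reject H0: {reject}")
```

Comparing the p-value with $\alpha$ gives the same decision as comparing $\bar{x}$ with the critical value 967.1, since both compare the same tail probability with 0.05.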
  • 117. p-values vs. critical values. Using the p-value is equivalent to using the critical values. The rejection-or-not decision we make will be the same based on the two methods.
  • 118. The benefit of using the p-value. In many studies, researchers do not determine the significance level $\alpha$ before a test is conducted. They calculate the p-value and then mark the significance of the result with stars. One typical way of assigning stars:
      p-value         Significant?              Mark
      (0, 0.01]       Highly significant        ***
      (0.01, 0.05]    Moderately significant    **
      (0.05, 0.1]     Slightly significant      *
      (0.1, 1)        Insignificant             (empty)
  • 119. The size of a p-value. Suppose one is testing whether people at different ages sleep for at least eight hours per day on average. Age groups: [10, 15), [15, 20), [20, 25), etc. For group $i$, a one-tailed test is conducted with $H_a: \mu_i > 8$. The result may be presented in a table:
      Group    Age group    p-value
      1        [10, 15)     0.0002***
      2        [15, 20)     0.2
      3        [20, 25)     0.06*
      4        [25, 30)     0.04**
      5        [30, 35)     0.03**
    A smaller p-value does NOT mean a larger deviation! We cannot conclude that $\mu_5 > \mu_4$, $\mu_1 > \mu_3$, etc. There are other tests for the difference between two population means.
  • 120. The p-value for two-tailed tests
    How to construct the rejection rule for a two-tailed test? If the p-value is smaller than α/2, we reject H0. If the p-value is greater than α/2, we do not reject H0. Consider the two-tailed test H0 : µ = 1000, Ha : µ ≠ 1000. We have α = 0.05. Because the p-value 0.032 > α/2 = 0.025, we do not reject H0.
    Some researchers/books/software use another definition: the p-value for a two-tailed test is twice that of the corresponding one-tailed test. They then compare this doubled p-value with α.
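A minimal sketch of the two conventions, showing that they lead to the same decision. The function names are hypothetical; the p-value 0.032 and α = 0.05 are taken from the example.

```python
def reject_h0(one_tailed_p: float, alpha: float, two_tailed: bool = False) -> bool:
    """Slide convention: compare the tail probability with alpha (or alpha/2)."""
    threshold = alpha / 2 if two_tailed else alpha
    return one_tailed_p < threshold


def reject_h0_doubled(one_tailed_p: float, alpha: float) -> bool:
    """Alternative convention: double the tail probability, compare with alpha."""
    return 2 * one_tailed_p < alpha


p, alpha = 0.032, 0.05
print(reject_h0(p, alpha))                   # one-tailed test
print(reject_h0(p, alpha, two_tailed=True))  # two-tailed test, alpha/2 rule
print(reject_h0_doubled(p, alpha))           # two-tailed test, doubled p-value
```

Comparing p with α/2 and comparing 2p with α are the same inequality, so the choice of convention only changes the number that gets reported.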
  • 121. Summary
    The p-value is the tail probability of the realized value of a statistic, assuming the null hypothesis is true. The p-value method is an alternative way of forming the rejection rule. It is equivalent to the critical-value method. A smaller p-value means stronger evidence against H0, but the p-value is not the probability that H0 is false. It does not measure the magnitude of the deviation.
  • 122. The z test
    In Example 1, we basically use the fact that $\bar{X} \sim ND(\mu, \sigma/\sqrt{n})$. This implies that $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim ND(0, 1)$, the so-called standard normal distribution, or the z distribution. Therefore, this test is called the z test. It requires knowledge of σ.
  • 123. When the variance is unknown
    When the population variance $\sigma^2$ is unknown, the quantity $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ cannot be computed. What if we use the sample variance $S^2$ as a substitute?
    Proposition 4: For a normal population, the quantity $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ follows the t distribution with n − 1 degrees of freedom.
    What is the t distribution?
  • 124. The t distribution
    The t distribution is defined as follows:
    Definition 3: A random variable X follows the t distribution with n degrees of freedom, denoted as X ∼ t(n), if
      $f(x \mid n) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}$ for all x ∈ (−∞, ∞),
    where $\Gamma(x) = \int_0^\infty z^{x-1} e^{-z}\,dz$ is the gamma function.
  • 125. The z and t distributions
    Let’s compare $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ and $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$. Because we do not know σ, we use S to substitute for it. Z ∼ ND(0, 1) and T ∼ t(n − 1).
    As the t distribution is a substitute for the z distribution, it is also centered at 0: E[T] = E[Z] = 0. However, as we add one more random variable into the formula (S is random, whereas σ is a known constant), T will be “more random” than Z, i.e., Var(T) > Var(Z). Graphically, t curves are flatter than the z curve.
    Fact: t(n) → ND(0, 1) as n → ∞.
  • 127. Sampling Sampling distributions Hypothesis testing p-value, t test, and more The t test We will use the t test to test the population mean if the population is normal. If the sample size is large, we may still use the z distribution with s substituting σ. Hypothesis Testing and p-value 65 / 71 Ling-Chieh Kung (NTU IM)
  • 128. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example 2 An MBA program seldom admits applicants without a work experience longer than two years. To test whether the average work year of admitted students is above two years, 20 admitted applicants are randomly selected. Their work experiences prior to entering the program are recorded. Prior to entering the program, they have an average work experience of 2.5 years. This is the sample mean. The sample standard deviation is 1.3765 years. The population is believed to be normal. The confidence level is set to 95%. Hypothesis Testing and p-value 66 / 71 Ling-Chieh Kung (NTU IM)
  • 129. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example 2: hypothesis Suppose the one asking the question is a potential applicant with one year of work experience. He is pessimistic and will apply for the program only if the average work experience is proven to be less than two years. The hypothesis is H0 : µ = 2 Ha : µ < 2. µ is the average work experience (in years) of all admitted applicants prior to entering the program. To encourage him, we need to give him a strong evidence showing that his chance is high. Hypothesis Testing and p-value 67 / 71 Ling-Chieh Kung (NTU IM)
  • 130. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example 2: hypothesis and test Suppose he is optimistic and will not apply for the program only if the average work experience is proven to be greater than two. The hypothesis becomes H0 : µ = 2 Ha : µ > 2. To discourage him, we need to give him a strong evidence showing that his chance is slim. Let’s consider the optimistic candidate (and Ha : µ > 2) first. Because the population variance is unknown and the population is normal, we may use the t test. Hypothesis Testing and p-value 68 / 71 Ling-Chieh Kung (NTU IM)
  • 131. Example 2A: calculation and interpretation
    Calculation: The p-value is $\Pr(\bar{X} > 2.5 \mid \mu = 2) = 0.0604$, based on the t distribution with 19 degrees of freedom.
    Conclusion: For this one-tailed test, as the p-value > 0.05 = α, we do not reject H0. There is no strong evidence showing that the average work experience is longer than two years. The result is not strong enough to discourage the potential applicant, who has only one year of work experience.
    Decision: The (optimistic) applicant should apply.
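The p-value in Example 2 can be reproduced with the standard library alone. The sketch below is not the slides' method (they presumably rely on statistical software): it integrates the t density numerically with Simpson's rule, and the helper names `t_pdf` and `t_upper_tail` are illustrative.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t: float, df: int, steps: int = 20000) -> float:
    """Pr(T > t), via Simpson's rule on [0, |t|] and symmetry about 0."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # Pr(0 < T < |t|)
    upper = 0.5 - area            # Pr(T > |t|)
    return upper if t >= 0 else 1.0 - upper


# Example 2: n = 20, sample mean 2.5, sample standard deviation 1.3765, mu0 = 2.
n, x_bar, s, mu0 = 20, 2.5, 1.3765, 2.0
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_right = t_upper_tail(t_stat, n - 1)  # Ha: mu > 2 (optimistic applicant)
p_left = 1.0 - p_right                 # Ha: mu < 2 (pessimistic applicant)
print(round(t_stat, 3), round(p_right, 4), round(p_left, 4))
```

With statistical software, the same tail probability is usually a single call to the t distribution's survival function.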
  • 132. Sampling Sampling distributions Hypothesis testing p-value, t test, and more Example 2B – a pessimistic applicant Suppose the applicant is pessimistic and the hypothesis is H0 : µ = 2 Ha : µ < 2. The p-value will be Pr(X < 2.5|µ = 2) = 1 − 0.0604 = 0.9396. This is calculated based on the t distribution. We do not reject H0 and cannot conclude that µ < 2. There is no strong evidence to encourage him. He should not apply. Note that when we write different alternative hypotheses, the final decision is different! This happens if and only if in both cases we do not reject H0. Hypothesis Testing and p-value 70 / 71 Ling-Chieh Kung (NTU IM)
  • 133. Summary
    To test the population mean µ:
      σ²        Sample size   Normal population   Nonnormal population
      Known     n ≥ 30        z                   z
      Known     n < 30        z                   Nonparametric
      Unknown   n ≥ 30        t or z              z
      Unknown   n < 30        t                   Nonparametric
    More parameters that may be tested: Population proportion (z test). Population variance (χ² test). Difference of two population means (t test). Ratio of two population variances (F test).
  • 134. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Statistics and Data Analysis for Engineers Part 3: Regression Analysis Ling-Chieh Kung Department of Information Management National Taiwan University September 4, 2016 Regression Analysis 1 / 83 Ling-Chieh Kung (NTU IM)
  • 135. Correlation and prediction
    We often try to find correlation among variables. For example, prices and sizes of houses:
      House           1    2    3    4    5    6    7    8    9    10   11   12
      Size (m²)       75   59   85   65   72   46   107  91   75   65   88   59
      Price ($1000)   315  229  355  261  234  216  308  306  289  204  265  195
    We may calculate their correlation coefficient as r = 0.729. Now, given a house whose size is 100 m², may we predict its price?
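As a quick check of the quoted correlation, here is a minimal sketch that computes the sample correlation coefficient from the twelve houses. The function name `pearson_r` is an illustrative choice.

```python
import math

size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]              # m^2
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]  # $1000


def pearson_r(x, y):
    """Sample correlation coefficient between two equally long sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(size, price)
print(round(r, 3))
```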
  • 136. Correlation among more than two variables
    Sometimes we have more than two variables. For example, we may also know the number of bedrooms in each house:
      House           1    2    3    4    5    6    7    8    9    10   11   12
      Size (m²)       75   59   85   65   72   46   107  91   75   65   88   59
      Price ($1000)   315  229  355  261  234  216  308  306  289  204  265  195
      Bedroom         1    1    2    2    2    1    3    3    2    1    3    1
    How to summarize the correlation among the three variables? How to predict house price based on size and number of bedrooms?
  • 137. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Regression analysis Regression is a solution! As one of the most widely used tools in Statistics, it discovers: Which variables affect a given variable. How they affect the target. In general, we will predict/estimate one dependent variable by one or multiple independent variables. Independent variables: Potential factors that may affect the outcome. Dependent variable: The outcome. Independent variables are explanatory variables; the dependent variable is the response variable. As another example, suppose we want to predict the number of arrival consumers for tomorrow: Dependent variable: Number of arrival consumers. Independent variables: Weather, holiday or not, promotion or not, etc. Regression Analysis 4 / 83 Ling-Chieh Kung (NTU IM)
  • 138. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Types of regression analysis Based on the number of independent variables: Simple regression: One independent variable. Multiple regression: More than one independent variables. The dependent variable may be quantitative or qualitative. In ordinary regression, the dependent variable is quantitative. In logistic regression, the dependent variable is qualitative. There are other types of regression models. Regression Analysis 5 / 83 Ling-Chieh Kung (NTU IM)
  • 139. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Road map Simple regression. Multiple regression. Indicator variables and interaction. Endogeneity and residual analysis. Logistic regression. Regression Analysis 6 / 83 Ling-Chieh Kung (NTU IM)
  • 140. Basic principle
    Consider the price-size relationship again. In the sequel, let xi be the size and yi be the price of house i, i = 1, ..., 12.
      Size (m²)   Price ($1000)
      46          216
      59          229
      59          195
      65          261
      65          204
      72          234
      75          315
      75          289
      85          355
      88          265
      91          306
      107         308
    How to relate sizes and prices “in the best way”?
  • 141. Linear estimation
    If we believe that the relationship between the two variables is linear, we will assume that $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$. β0 is the intercept of the equation. β1 is the slope of the equation. εi is the random noise for estimating record i.
    Somehow there is such a formula, but we do not know β0 and β1. β0 and β1 are parameters of the population. We want to use our sample data (e.g., the information of the twelve houses) to estimate β0 and β1. We want to form two statistics $\hat\beta_0$ and $\hat\beta_1$ as our estimates of β0 and β1.
  • 142. Linear estimation
    Given the values of $\hat\beta_0$ and $\hat\beta_1$, we will use $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ as our estimate of yi. Then we have $y_i = \hat\beta_0 + \hat\beta_1 x_i + \epsilon_i$, where εi is now interpreted as the estimation error. We hope $\epsilon_i = y_i - \hat{y}_i$ to be small.
    For all data points, let’s minimize the sum of squared errors (SSE):
      $\sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2$.
    The solution of $\min_{\hat\beta_0, \hat\beta_1} \sum_{i=1}^{n} \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2$ is our least squares approximation (estimation) of the given data.
  • 143. Least squares approximation
    The least squares approximation problem $\min_{\hat\beta_0, \hat\beta_1} \sum_{i=1}^{n} \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2$ has a closed-form formula for the best $(\hat\beta_0, \hat\beta_1)$:
      $\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$.
    For our house example, we will get $(\hat\beta_0, \hat\beta_1) = (102.717, 2.192)$. Its SSE is 13118.63.
    We will never know the true values of β0 and β1. However, according to our sample data, the best (least squares) estimate is (102.717, 2.192). We tend to believe that β0 = 102.717 and β1 = 2.192.
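The closed-form formulas translate directly into code. This is a minimal sketch using the twelve houses from the example. The function name `fit_simple_regression` is hypothetical, and the final prediction for a 100 m² house applies the fitted line to the question raised on the earlier slide.

```python
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]              # x_i, m^2
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]  # y_i, $1000


def fit_simple_regression(x, y):
    """Return (beta0_hat, beta1_hat) from the closed-form least squares formulas."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple_regression(size, price)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(size, price))
predicted_100 = b0 + b1 * 100  # predicted price ($1000) of a 100 m^2 house
print(round(b0, 3), round(b1, 3), round(sse, 2), round(predicted_100, 1))
```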
  • 144. Interpretations
    Our regression model is y = 102.717 + 2.192x.
    Interpretation: When the house size increases by 1 m², the price is expected to increase by $2,192.
    (Bad) interpretation: For a house whose size is 0 m², the price is expected to be $102,717. This extrapolates far outside the observed sizes (46 to 107 m²).
  • 145. Linear multiple regression
    In most cases, more than one independent variable may be used to explain the outcome of the dependent variable. For example, consider the number of bedrooms. We may take both variables as independent variables to do linear multiple regression:
      $y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i$.
    yi is the house price (in $1000). x1,i is the house size (in m²). x2,i is the number of bedrooms. εi is the random noise.
    Our (least squares) estimate is $(\hat\beta_0, \hat\beta_1, \hat\beta_2) = (82.737, 2.854, -15.789)$.
      Price ($1000)   Size (m²)   Bedroom
      315             75          1
      229             59          1
      355             85          2
      261             65          2
      234             72          2
      216             46          1
      308             107         3
      306             91          3
      289             75          2
      204             65          1
      265             88          3
      195             59          1
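The slides do not say how the multiple-regression estimate is computed. One standard route, sketched here as an assumption, is to solve the normal equations (X'X)b = X'y. The helper names are hypothetical, and in practice a statistics package would do this step.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_least_squares(columns, y):
    """Least squares with an intercept: solve (X'X) b = X'y."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]
bedroom = [1, 1, 2, 2, 2, 1, 3, 3, 2, 1, 3, 1]
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]

b0, b1, b2 = fit_least_squares([size, bedroom], price)
residuals = [p - (b0 + b1 * s + b2 * r) for s, r, p in zip(size, bedroom, price)]
print(round(b0, 3), round(b1, 3), round(b2, 3))
```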
  • 146. Interpretations
    Our regression model is y = 82.737 + 2.854x1 − 15.789x2. When the house size increases by 1 m² (and all other independent variables are fixed), we expect the price to increase by $2,854. When there is one more bedroom (and all other independent variables are fixed), we expect the price to decrease by $15,789.
    One must interpret the results and judge for oneself whether they are meaningful. The number of bedrooms may not be a good indicator of house price, at least not in a linear way.
    We need more than finding coefficients: We need to judge the overall quality of a given regression model. We may want to compare multiple regression models. We must test the significance of regression coefficients.
  • 147. Model validation: How good is a model?
    How to measure the quality of a model? For the model y = 102.717 + 2.192x, how good is it? In general, for a given regression model $y = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_k x_k$, how may we evaluate its overall quality?
    The total sum of squares (SST), $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$, is the sum of squared errors of the worst model, the one that predicts $\bar{y}$ for every record. With our regression model, the sum of squared errors (SSE) is
      $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2$.
    The proportion of total variability that is explained by the regression model is $0 \le R^2 = 1 - \frac{SSE}{SST} \le 1$. The larger $R^2$, the better the regression model.
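As a worked instance for the size-only model: the SSE of 13118.63 is quoted in the slides, while the SST figure below is not. It is computed here from the twelve prices in the example (mean price 264.75), so treat it as a derived value.

```latex
\[
SST = \sum_{i=1}^{12} (y_i - 264.75)^2 = 28000.25,
\qquad
R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{13118.63}{28000.25} \approx 1 - 0.4685 = 0.5315 .
\]
```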
  • 148. Obtaining R²
    Whenever we find the estimated coefficients, we have $R^2$. Statistical software includes $R^2$ in the regression report. For the regression model y = 102.717 + 2.192x, we have $R^2 = 0.5315$: Around 53% of the variability in house prices is explained by house size.
    If (and only if) there is only one independent variable, then $R^2 = r^2$, where r is the correlation coefficient between the dependent and independent variables. −1 ≤ r ≤ 1. 0 ≤ r² = R² ≤ 1.
  • 149. Comparing regression models
    Now we have a way to compare regression models. For our example:
            Size only   Bedroom only   Size and bedroom
      R²    0.5315      0.29           0.5513
    Using sizes only is better than using numbers of bedrooms only. Is using sizes and bedrooms better?
    In general, adding more variables never decreases $R^2$! In the worst case, the corresponding coefficients can be set to 0, which reproduces the smaller model. Some variables may actually be meaningless. To perform a “fair” comparison and identify the meaningful factors, we need to adjust $R^2$ based on the number of independent variables.
  • 150. Adjusted R²
    The standard way to adjust $R^2$ to adjusted $R^2$ is
      $R^2_{adj} = 1 - \frac{n - 1}{n - k - 1} (1 - R^2)$.
    n is the sample size and k is the number of independent variables used. For our example:
               Size only   Bedroom only   Size and bedroom
      R²       0.5315      0.290          0.5513
      R²adj    0.4846      0.219          0.4516
    Actually, using sizes only results in the best model!
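The adjustment is a one-line formula. The sketch below applies it to the three R² values from the table, with n = 12 houses. The function name `adjusted_r2` is an illustrative choice.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for sample size n and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)


models = {
    "Size only": (0.5315, 1),
    "Bedroom only": (0.29, 1),
    "Size and bedroom": (0.5513, 2),
}
for name, (r2, k) in models.items():
    print(f"{name:<18} R2 = {r2:.4f}  adjusted R2 = {adjusted_r2(r2, 12, k):.4f}")
```

The penalty factor (n − 1)/(n − k − 1) grows with k, which is why the two-variable model ends up below the size-only model despite its higher R².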
  • 151. Testing coefficient significance
    Another important task for validating a regression model is to test the significance of each coefficient. Recall our model with two independent variables y = 82.737 + 2.854x1 − 15.789x2. Note that 2.854 and −15.789 are calculated solely from the sample. We never know whether β1 and β2 are really these two values! In fact, we cannot even be sure that β1 and β2 are not 0. We need to test them:
      H0 : βi = 0, Ha : βi ≠ 0.
    We look for strong enough evidence showing that βi ≠ 0.
  • 152. Testing coefficient significance
    The testing results are provided in regression reports. Statistical software (e.g., R) tells us:
                  Coefficients   Standard Error   t Stat    p-value
      Intercept   82.737         59.873           1.382     0.200
      Size        2.854          1.247            2.289     0.048 **
      Bedroom     −15.789        25.056           −0.630    0.544
    As we have no idea about the population variance, we apply the t test. “Coefficients” records the estimates $\hat\beta_i$ (playing the role of the sample mean $\bar{x}$); “Standard Error” records the estimated standard deviation of $\hat\beta_i$ (playing the role of $S/\sqrt{n}$); “t Stat” records $T = \frac{\hat\beta_i - 0}{\text{Standard Error}}$. “p-value” records the tail probabilities of T multiplied by 2 (done by most software). Simply compare them with α! Recall the assumption that εi is normal!
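The t Stat column can be rebuilt from the first two columns. The sketch below does that and applies the equivalent critical-value rule. The critical value 2.262 is an assumption brought in from a standard t table (two-tailed, α = 0.05, 12 − 2 − 1 = 9 degrees of freedom), not a number from the slides, and the function name is hypothetical.

```python
# (coefficient, standard error) pairs copied from the regression report.
report = {
    "Intercept": (82.737, 59.873),
    "Size": (2.854, 1.247),
    "Bedroom": (-15.789, 25.056),
}

# Assumption: two-tailed critical value for alpha = 0.05 and 9 degrees of
# freedom, taken from a standard t table.
T_CRITICAL = 2.262


def coefficient_t_tests(rows, t_critical):
    """Return {name: (t statistic, reject H0: beta_i = 0?)} for each row."""
    results = {}
    for name, (coef, std_err) in rows.items():
        t_stat = (coef - 0.0) / std_err
        results[name] = (t_stat, abs(t_stat) > t_critical)
    return results


tests = coefficient_t_tests(report, T_CRITICAL)
for name, (t_stat, reject) in tests.items():
    print(f"{name:<10} t = {t_stat:7.3f}  reject H0: {reject}")
```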
  • 153. Testing coefficient significance
    Statistical software tells us:
                  Coefficients   Standard Error   t Stat    p-value
      Intercept   82.737         59.873           1.382     0.200
      Size        2.854          1.247            2.289     0.048 **
      Bedroom     −15.789        25.056           −0.630    0.544
    At a 95% confidence level, we believe that β1 ≠ 0. House size really has some impact on house price. At a 95% confidence level, we have no evidence for β2 ≠ 0. We cannot conclude that the number of bedrooms has an impact on house price.
    If we use only size as an independent variable, its p-value will be 0.00714. We will be quite confident that it has an impact.
  • 154. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Road map Simple regression. Multiple regression. Indicator variables and interaction. Endogeneity and residual analysis. Logistic regression. Regression Analysis 21 / 83 Ling-Chieh Kung (NTU IM)
  • 155. House age
    The age of a house may also affect its price.
      Price ($1000)   Size (m²)   Bedroom   Age (years)
      315             75          1         16
      229             59          1         20
      355             85          2         16
      261             65          2         15
      234             72          2         21
      216             46          1         16
      308             107         3         15
      306             91          3         15
      289             75          2         14
      204             65          1         21
      265             88          3         15
      195             59          1         26
    Let’s add age as an independent variable in explaining house prices. Because the number of bedrooms seems to be unhelpful, let’s ignore it.
  • 156. House age
    For house i, let yi be its price, x1,i be its size, and x3,i be its age. We assume the following linear relationship: $y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{3,i} + \epsilon_i$. Software gives us the following regression report:
                  Coefficients   Standard Error   t Stat    p-value
      Intercept   262.882        83.632           3.143     0.012
      Size        1.533          0.628            2.443     0.037 **
      Age         −6.368         2.881            −2.211    0.054 *
      R² = 0.696, R²adj = 0.629
    $R^2_{adj}$ goes up from 0.485 (size only) to 0.629. Age is significant at a 10% significance level. Seems good!
  • 157. “Nonlinear” relationship
    May we do better? By looking at the age-price scatter plot (and our intuition), maybe the impact of age on price is “nonlinear”: A new house’s value depreciates fast. The value depreciates slowly when the house is old. At least this is true for a car.
    It is worthwhile to try to capture this nonlinear relationship. For example, we may try to replace house age by its reciprocal: $y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 \frac{1}{x_{3,i}} + \epsilon_i$.
  • 158. Variable transformation
    To fit $y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 \frac{1}{x_{3,i}} + \epsilon_i$ to our sample data: Prepare a new column as 1/age. Input these three columns to software. Read the report.
    We may consider any kind of nonlinear relationship. This technique is called variable transformation.
      Price ($1000)   Size (m²)   1/Age (1/years)
      315             75          0.063
      229             59          0.05
      355             85          0.063
      261             65          0.067
      234             72          0.048
      216             46          0.063
      308             107         0.067
      306             91          0.067
      289             75          0.071
      204             65          0.048
      265             88          0.067
      195             59          0.038
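Preparing the transformed column is a one-line step before the data go into the regression software. A minimal sketch with the twelve house ages follows. The variable names are illustrative, and the fitting itself is left to whichever package is used.

```python
price = [315, 229, 355, 261, 234, 216, 308, 306, 289, 204, 265, 195]  # $1000
size = [75, 59, 85, 65, 72, 46, 107, 91, 75, 65, 88, 59]              # m^2
age = [16, 20, 16, 15, 21, 16, 15, 15, 14, 21, 15, 26]                # years

# New column: the reciprocal of house age.
inv_age = [1 / a for a in age]

# The three columns that would be handed to the regression software.
table = list(zip(price, size, inv_age))
for p, s, v in table:
    print(f"{p:>5} {s:>5} {v:8.4f}")
```

Swapping the transformation (for example `a ** 2` or `math.log(a)`) changes only the line that builds the new column.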
  • 159. The reciprocal of house age
    Software gives us the following regression report:
                  Coefficients   Standard Error   t Stat   p-value
      Intercept   22.905         57.154           0.401    0.698
      Size        1.524          0.647            2.356    0.043 **
      1/Age       2185.575       1044.497         2.092    0.066 *
      R² = 0.685, R²adj = 0.615
    Validation: Both variables are significant (at different significance levels). However, using size and untransformed age explains house price better ($R^2_{adj}$ = 0.629 versus 0.615), at least for the given sample data. The intuition that house value depreciates at different speeds is not supported by the data. Changing 1/age to age² does not help either.
  • 160. Typical ways of variable transformation (figure)
  • 161. Variable selection and model building
    In general, we may have a lot of candidate independent variables: size, number of bedrooms, age, distance to a park, distance to a hospital, safety in the neighborhood, etc. If we consider only linear relationships, for p candidate independent variables, we have $2^p - 1$ combinations. For each variable, we have many ways to transform it. In the next lecture, we will introduce the way of modeling interaction among independent variables.
    How to find the “best” regression model (if there is one)?
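The count of candidate models can be checked by enumerating every non-empty subset of the variables. The sketch below uses hypothetical short variable names for the six candidates listed on the slide.

```python
from itertools import combinations


def candidate_models(variables):
    """All non-empty subsets of the candidate independent variables."""
    return [subset
            for size in range(1, len(variables) + 1)
            for subset in combinations(variables, size)]


candidates = ["Size", "Bedroom", "Age", "ParkDist", "HospitalDist", "Safety"]
models = candidate_models(candidates)
print(len(models), 2 ** len(candidates) - 1)
for subset in candidate_models(candidates[:3]):
    print(subset)
```

Because the count doubles with each added variable, exhaustive search is practical only for small p.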
  • 162. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Variable selection and model building There is no “best” model; there are “good” models. Some general suggestions: Take each independent variable one at a time and observe the relationship between it and the dependent variable. A scatter plot helps. Use this to consider variable transformation. For each pair of independent variables, check their relationship. If two are highly correlated, quite likely one is not needed. Once a model is built, check the p-values. You may want to remove insignificant variables (but removing a variable may change the significance of other variables). Go back and forth to try various combinations. Stop when a good enough one (with high R2 and R2 adj and small p-values) is found. Software can somewhat automate the process, but its power is limited (e.g., it cannot decide transformation). We may need to find new independent variables. Intuitions and experiences may help (or hurt). Regression Analysis 29 / 83 Ling-Chieh Kung (NTU IM)
  • 163. Summary
    With a regression model, we try to identify how independent variables affect the dependent variable. For a regression model, we adopt the least squares criterion for estimating the coefficients.
    Model validation: The overall quality of a regression model is judged by its $R^2$ and $R^2_{adj}$. We may test the significance of independent variables by their p-values.
    Model building: Variable transformation. Variable selection.
  • 164. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Case study: ticket selling A theater made hundreds of stage performances in the past six years. The owner hopes that statistics and data analysis may help her improve the ticket sales. Key questions: What makes a show popular? Popularity is defined as the numbers of tickets sold. Potential factors: year, month, day, time, location, actors/actresses, drama type, ticket prices, etc. 100 performances are randomly drawn from the whole pool. All were made during weekends. Tickets were all publicly sold. Tickets for all performances were sold through the same channels. For each performance, the ticket price(s) remained the same. As a group of consultants, how may we help the theater? Regression Analysis 31 / 83 Ling-Chieh Kung (NTU IM)
  • 165. Variables
    Six variables are obtained:
      Variable        Meaning
      Year            The year in which the performance was made
      Time            Morning, afternoon, or evening
      Capacity        The number of seats in the theater hall
      AvgPrice        The average of all prices
      SalesQty        The number of tickets sold
      SalesDuration   Performance day − announcement day
    Labeling and scaling: Years are labeled as 1, 2, ..., and 6 (6 means the last year). Capacities and sales quantities have been scaled in the same proportion.
  • 166. Data (incomplete)
    Yr. = Year, Tm. = Time (M, A, E), Cap. = Capacity, A.P. = AvgPrice, Qty = SalesQty, S.D. = SalesDuration.
      Yr.  Tm.  Cap.  A.P.  Qty  S.D.     Yr.  Tm.  Cap.  A.P.  Qty  S.D.
      5    A    230   400   218  50       2    M    190   575   190  289
      5    A    150   500   119  46       6    A    130   500   108  89
      5    A    230   400   160  126      4    E    200   775   169  100
      5    A    200   775   200  324      4    E    200   775   135  259
      6    E    190   1175  178  115      5    A    310   650   251  346
      6    A    190   1175  183  109      2    A    250   550   250  145
      5    E    190   775   161  58       1    A    190   675   183  254
      3    A    200   675   200  112      6    A    200   1175  146  110
      5    E    200   775   158  323      1    M    200   575   140  94
      1    M    200   575   128  360      4    A    200   775   195  255
  • 167. Regression
    To construct a regression model, we first consider quantitative independent variables. Dependent variable: SalesQty. Independent variables: Capacity, AvgPrice, Year. Let’s ignore SalesDuration for a while.
    Note that Year is a quantitative variable. The difference between two values makes sense: 4 − 2 and 5 − 3 both mean a difference of two years. The values will keep increasing. If we have a variable Month whose possible values are 1, 2, ..., and 12, the difference between 12 and 1 is ambiguous: 11 months or 1 month.
    Scatter plots help us consider: Variable selection: Does a variable have an impact? Transformation: What is a variable’s impact? Multicollinearity: Are two variables highly correlated?
  • 169. Regression
    It seems that Capacity, AvgPrice, and Year are all worth a try. Let’s put them into a regression model. If we do this one by one:
      SalesQty = 20.79 + 0.72 Capacity: R² = 0.538, p-value ≈ 0.
      SalesQty = 174.9 + 0.0028 AvgPrice: R² = 0.0002, p-value = 0.885.
      SalesQty = 203.6 − 6.77 Year: R² = 0.063, p-value = 0.0115.
    If we include them together, the regression model is
      SalesQty = 24.742 + 0.702 Capacity + 0.027 AvgPrice − 4.696 Year.
      R² = 0.57, R²adj = 0.556; the p-values are approximately 0, 0.056, and 0.019, respectively.
    Do not try independent variables separately; try them together.
  • 170. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Adding Time into the model Time may also be an influential variable. However, it is qualitative. More precisely, it is nominal. Even if we label Time with numeric values, we cannot treat it as a quantitative variable and put it into a regression model. For each qualitative variable, we need to introduce several indicator variables to represent its values. Regression Analysis 37 / 83 Ling-Chieh Kung (NTU IM)
  • 171. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Road map Simple regression. Multiple regression. Indicator variables and interaction. Endogeneity and residual analysis. Logistic regression. Regression Analysis 38 / 83 Ling-Chieh Kung (NTU IM)
  • 172. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Numeric labeling does not work The variable Time has three values. Morning, afternoon, and evening. Why can’t we label them as 1, 2, and 3 and do regression? Suppose we label (morning, afternoon, evening) as (1, 2, 3): The regression model is SalesQty = 164.021 + 6.313Time. Why is this wrong? Regression Analysis 39 / 83 Ling-Chieh Kung (NTU IM)
  • 173. Numeric labeling does not work
    Different labeling gives different regression results. Besides (1, 2, 3), we may also label (morning, afternoon, evening) as (1, 2, 10) or (3, 1, 2):
      Labeling     Regression model                     p-value
      (1, 2, 3)    SalesQty = 164.021 + 6.313 Time      0.294
      (1, 2, 10)   SalesQty = 177.224 − 0.075 Time      0.95
      (3, 1, 2)    SalesQty = 205.725 − 15.091 Time     0.0084
  • 174. Simple regression Multiple regression Indicators, interaction Endogeneity, residuals Logistic regression Binary variables There is one exception: If a qualitative variable is binary, we may label the values as 0 and 1 and then treat it as quantitative. Labeling values as 1 and 0, 1 and 2, or 7 and 8 is also good. Labeling values as 1 and −1, 1 and 5, or 4 and 8 is bad. This is because a regression coefficient measures what happens to the dependent variable “when that independent variable increases by 1.” When the binary variable is labeled with 0 and 1, its regression coefficient ˆβi tells us that “if the value changes from 0 to 1 (while all others remain the same), we expect the dependent variable to increase by ˆβi.” What if we have more than two values? Regression Analysis 41 / 83 Ling-Chieh Kung (NTU IM)
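A small illustration of why the gap between the two labels matters. The data below are hypothetical (two groups of three outcomes, invented for this sketch); only the labeling schemes come from the slide.

```python
def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
            / sum((a - x_bar) ** 2 for a in x))


# Hypothetical data: group "b" outcomes are 10 higher on average than group "a".
group = ["a", "a", "a", "b", "b", "b"]
y = [10.0, 12.0, 11.0, 20.0, 22.0, 21.0]


def fitted_slope(labels):
    """Slope obtained when the two groups are coded with the given numbers."""
    return slope([labels[g] for g in group], y)


s01 = fitted_slope({"a": 0, "b": 1})  # labels differ by 1
s78 = fitted_slope({"a": 7, "b": 8})  # labels differ by 1
s15 = fitted_slope({"a": 1, "b": 5})  # labels differ by 4
print(s01, s78, s15)
```

When the labels differ by 1, the slope equals the difference between the two group means. With a gap of 4 the slope is a quarter of that, so it no longer reads as the effect of switching groups.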
  • 175. Indicator variables
    Consider a variable x with three values A, B, and C. We first choose a reference level, say, A. We then manually create two indicator variables xB and xC:
      xB = 1 if x = B, and xB = 0 otherwise; xC = 1 if x = C, and xC = 0 otherwise.
    In other words, we have a mapping:
      x   xB   xC
      A   0    0
      B   1    0
      C   0    1
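The mapping can be generated mechanically. The sketch below uses a hypothetical helper, `indicator_columns`, that builds one 0/1 column per non-reference level. The column names (`xB`, `xC`) follow the slide's notation, and the Time example assumes afternoon as the reference level.

```python
def indicator_columns(values, reference):
    """Build one 0/1 indicator column per level other than the reference."""
    levels = sorted(set(values) - {reference})
    return {f"x{level}": [1 if v == level else 0 for v in values]
            for level in levels}


x = ["A", "B", "C", "B"]
print(indicator_columns(x, reference="A"))

# Applied to Time (M = morning, A = afternoon, E = evening), with afternoon
# chosen as the reference level for illustration.
time = ["A", "M", "E", "A"]
print(indicator_columns(time, reference="A"))
```

A record at the reference level has 0 in every indicator column, so each indicator's coefficient is read as the difference from the reference level.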