This document discusses regression analysis and its assumptions. Regression can be used to determine if one or more independent variables (IVs) can predict a dependent variable (DV). Simple regression looks at the relationship between one IV and the DV, while multiple regression examines the relationship between multiple IVs and the DV. The key assumptions of regression include linear relationships between variables, no multicollinearity, independence of observations, and normally distributed errors. An example of applying regression to examine the relationship between average daily clicks and direct sales revenue is provided.
1. Purpose – Determine if one or more IVs can predict a DV
Examples:
• Does your height (IV) predict how much money you will spend (DV)?
• Does the number of store managers (IV) predict how often the machine will break down (DV)?
• Do the number of clicks (IV1) and the number of comments (IV2) on the blog predict the size of revenue (DV)?
2. Research Question → Inferential Statistics
• Compare means of 2 numeric variables → t test
• Relate 2 categorical variables → Pearson chi-square
• Relate 2 numeric variables → Pearson correlation r
• Use 1+ IVs to explain 1 numeric DV → Regression
3. Correlation tells us how X relates to Y (in the past).
Simple regression tells us how X predicts Y (in the future).
• E.g., does AvgDailyClicks predict DirectSalesRevenue?
Multiple regression tells us how X1, X2, X3, … predict Y.
• E.g., do NumberBlogAuthors & AvgDailyClicks predict SponsorRevenue?
4. Assumptions:
• The relationships between the Xs and Y are linear
• If you have 2 or more Xs, they are not perfectly correlated with each other (no perfect multicollinearity)
• Xs are not correlated with external variables
• Independence – any two observations should be independent of each other
• Errors are normally distributed
• And a few others
5. Example: Does Number of Stupid Customers predict Self Checkout Error Rate?
6. When we use X to predict Y:
• X = the predictor = the independent variable (IV)
• Y = the predicted value = the dependent variable (DV) – the value of Y depends on the predictor X
• You’re basically building a linear model between X and Y:
Y = Constant + B*X + error
7. Who is the best fitting model? (Hint: not Kate Moss)
The line that’s closest to all the dots.
8. Goodness of Fit (R2): How well does the line fit the data?
(How well does Kate fit the average woman?)
[Figure: scatterplot with regression line, labeled with the constant (intercept) and the slope B]
Distances from the dots to the regression line = errors
Good fit = small errors
9.
10. Y = Constant + B*X + error
DirectSalesRevenue = 19.466 - .003*AvgDailyClicks + error
• The constant is significantly greater than zero
• The slope (-.003) is significantly less than zero
• Goodness of Fit (R2): the model explains 59% of the variation in DirectSalesRevenue
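Using the fitted equation from this slide for point predictions (ignoring the error term) looks like this; the helper function name is mine, not from the course materials:

```python
# Fitted model from the slide: DirectSalesRevenue = 19.466 - .003 * AvgDailyClicks
def predicted_revenue(avg_daily_clicks):
    """Point prediction from the slide's fitted equation (error term omitted)."""
    return 19.466 - 0.003 * avg_daily_clicks
```

The negative slope means every extra click lowers the predicted revenue by .003, so a site with 2,000 daily clicks gets a lower prediction than one with 1,000.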
11. The number of average daily clicks significantly predicted direct sales revenue, b = -.03, t(39) = 14.72, p < .001. The number of average daily clicks also explained a significant proportion of variance in direct sales revenue, R2 = .59, F(1, 38) = 42.64, p < .001. These findings suggest that websites with more average daily clicks tend to have a lower direct sales revenue level.
12. Y = 200X (R2 = 45%)
Given any X, the model gives us a predicted value of Y; R2 = 45% means the model explains 45% of the variation in Y.
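Slide 12's model can be sketched in two lines. Note that R² describes the share of variance the model explains, not the probability that any one prediction is right; the function name is mine.

```python
# Slide 12's fitted model: Y = 200X (with R^2 = 45%)
def predict_y(x):
    return 200 * x

# Given any X we get a point prediction, e.g. the 1,000-click
# extrapolation from the speaker notes:
revenue_at_1000_clicks = predict_y(1000)  # 200 * 1000 = 200000
```

The line can be extended to any X, but predictions far outside the range of the original data (say, 1,000 clicks from a model fit on 5-20 clicks) rest on the assumption that the linear trend continues.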
13. Assumptions: Xs are somewhat independent; Y values are independent; Y values are normally distributed; errors are normally distributed; X-Y relations are linear; no outliers
• Example: time series data are NOT independent – stock price today depends on stock price yesterday, which depends on stock price the day before, etc.
Multiple regression is just an extension of simple regression
• Use multiple Xs (e.g., both AvgDailyClicks and NumberAuthors) to predict Y
• When you have a condition (e.g., customer choice depends on gender; brand awareness depends on comm. channel; number of applications depends on program of study), you need to create an interaction term (next class)
When an X is categorical (e.g., whether the blog host is Google or WordPress): code X in numbers – e.g., 0 is Google, 1 is WordPress
When Y is categorical (e.g., whether the blog won the Outstanding Blog Award): code Y in numbers – e.g., 0 is No, 1 is Yes – and use logistic regression
14. What is your Y (the value you want to predict)?
• Is your Y categorical? Do you need logistic regression? See the instructor for help.
What is your X (your predictor variable)? How many Xs do you have?
• Are any of your Xs categorical? Do you have a coding scheme?
• Do you have a condition? (e.g., customer choice depends on gender; brand awareness depends on comm. channel; number of applications depends on program of study) See the instructor for help.
15. Research Question → Inferential Statistics
• Compare means of 2 numeric variables → t test
• Relate 2 numeric variables → Pearson correlation r
• Relate 2 categorical variables → Pearson chi-square
• Use 1+ IVs to explain 1 numeric DV → Regression
Speaker notes
This slide is self-explanatory. Make sure you can recognize a research question that can be answered by simple linear regression – they are all predictive in nature.
This is a review slide. Again, this table shows you where regression sits in the world of inferential statistics.
Inferential statistics are more powerful when they can help us predict the future.
It’s totally possible that the relationship between two variables is NOT linear. Check your scatterplots first to make sure the relationship looks somewhat linear; otherwise the simple linear regression method should NOT be used.
For example, suppose you want to see if gender (IV1) and race (IV2) predict spending (DV), but in your sample all men are Caucasian and all women are African American (perfect correlation between gender and race) – then you will NOT be able to run regression.
For example, suppose you want to see if eating ice cream (IV) causes people to go to the beach more often (DV). You probably will find a positive relationship; however, the IV correlates with an external variable (temperature) which causes variance in your DV (it determines whether people go to the beach or not). In this case running a regression would not make sense.
For example, if you interview 20 men and 30 women, but it turns out that 2 of the 20 men are the same person being interviewed twice, then the “independence” principle is violated.
After the regression model is built, if you can still see a recognizable pattern in the errors, then the model is not good enough. The model should capture the trend of the data completely and leave behind completely random errors.
This is the example used in Individual Assignment 6
To understand regression you need to first understand how a straight line is expressed mathematically:
1. All straight lines can be expressed in mathematical terms as a constant and a slope.
2. We use y = 2x + 1 as an example.
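The note's y = 2x + 1 example can be checked in a few lines: the constant (intercept) is 1, the slope is 2, and moving x by one unit always moves y by the slope.

```python
# The note's example line y = 2x + 1: constant (intercept) = 1, slope = 2
constant, slope = 1, 2
points = [(x, slope * x + constant) for x in range(5)]

# Moving x by 1 always moves y by exactly the slope
assert all(y2 - y1 == slope for (_, y1), (_, y2) in zip(points, points[1:]))
```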
Regression is like what Ralph Lauren and Armani do every day – finding the runway model that fits the best. (Note: Kate Moss is one of the most prototypical runway models.)
Like Kate Moss (or other runway models), the regression line represents an idealized version of the real world. The reality is the messy data we collected (aka the dots). The line is an idealized model that best represents the messy-data reality. A good model, in the world of statistics, is close to reality; the goal is to minimize the difference between the model and the reality. When a regression model represents the real world well, the errors (distances from the dots to the line) are minimal and the goodness-of-fit measure, or R-square, is large.
According to this definition of “goodness of fit”, Kate Moss is a really bad model (poor goodness of fit, large errors; her R-square would be very small). Your goal is to do better than Kate Moss!
In this case, Average Daily Clicks significantly predicted Direct Sales Revenue. However, the beta coefficient (the slope) is negative – this means the more clicks, the lower the revenue. The model fit is pretty good.
Get the total df value (39 in this case) from the ANOVA table
So with regression you get a model, which is a line with a linear function (Y = BX). This means that given any X we can predict the value of Y. For example, say we would like to see if the number of clicks (X) predicts revenue (Y), and we get a regression line of Y = 200X with an R-square of 45%. This means the model explains 45% of the variation in revenue. Perhaps this regression model is based on data of 5-20 clicks; but because the line can be extended infinitely toward the upper right corner of the graph, the model predicts that when we get 1,000 clicks, our revenue will be $200,000! (Keep in mind that extrapolating this far beyond the observed data assumes the linear trend continues.)
There is a LOT more to regression than what we discussed. We covered the basic concepts and you’re not expected to know more than that. However, this slide gives you some ideas about other considerations when running regression.
Here’s some food for thought for your group project.
To summarize, we have discussed 4 different kinds of inferential statistics in this course: t test, correlation, chi-square, and regression. How do you know which test is appropriate for your project? Use this summary table to determine.
Many students ask which test is better than the others. This question is like asking, “Is a pregnancy test better than a DNA test?” It’s impossible to answer without knowing your objective.
Some people also wonder if we can use more than one test in a research study. The answer is obvious: of course! We take the same approach to other “research questions” in our lives. For example, if you want to know if you’re pregnant, you get a pregnancy test. If you want to know if you’re diabetic, you get a blood test. If you want to know who’s the father of your child, you get a DNA test! If you need to know the answers to all 3 questions, you order all 3 tests!
Again, it’s all about your research question!