This document discusses regression analysis and its assumptions. Regression can be used to determine if one or more independent variables (IVs) can predict a dependent variable (DV). Simple regression looks at the relationship between one IV and the DV, while multiple regression examines the relationship between multiple IVs and the DV. The key assumptions of regression include linear relationships between variables, no multicollinearity, independence of observations, and normally distributed errors. An example of applying regression to examine the relationship between average daily clicks and direct sales revenue is provided.
1. Purpose – Determine if one or more IVs can predict a DV
Examples:
• Does your height (IV) predict how much money you will spend (DV)?
• Does the number of store managers (IV) predict how often the machine will break down (DV)?
• Do the number of clicks (IV1) and the number of comments (IV2) on the blog predict the size of revenue (DV)?
2. Research Question → Inferential Statistics
• Compare means of 2 numeric variables → t test
• Relate 2 categorical variables → Pearson chi-square
• Relate 2 numeric variables → Pearson correlation r
• Use 1+ IVs to explain 1 numeric DV → Regression
3. Correlation tells us how X relates to Y (in the past).
Simple regression tells us how X predicts Y (in the future).
• E.g., does AvgDailyClicks predict DirectSalesRevenue?
Multiple regression tells us how X1, X2, X3, … predict Y.
• E.g., do NumberBlogAuthors & AvgDailyClicks predict SponsorRevenue?
4. Assumptions:
• The relationships between the Xs and Y are linear
• If you have 2 or more Xs, they are not perfectly correlated with each other (no perfect multicollinearity)
• Xs are not correlated with external variables
• Independence – any two observations should be independent of each other
• Errors are normally distributed
• And a few others
5. Example: Does Number of Stupid Customers predict Self Checkout Error Rate?
6. When we use X to predict Y:
• X = the predictor = the independent variable (IV)
• Y = the predicted value = the dependent variable (DV) – the value of Y depends on the predictor X
• You’re basically building a linear model between X and Y:
Y = Constant + B*X + error
7. Who is the best fitting model? (Hint: not Kate Moss)
The line that’s closest to all the dots.
8. Goodness of Fit (R2): How well does the line fit the data?
(How well does Kate fit the average woman?)
[Figure: scatterplot with regression line, labeled with the constant (intercept) and the slope B]
Distances from the dots to the regression line = errors
Good fit = small errors
9.
10. Y = Constant + B*X + error
DirectSalesRevenue = 19.466 - .003*AvgDailyClicks + error
• The constant is significantly greater than zero
• The slope (-.003) is significantly less than zero
• Goodness of Fit (R2): the model explains 59% of the variation in DirectSalesRevenue
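Using the fitted equation from this slide for point predictions (ignoring the error term) looks like this; the helper function name is mine, not from the course materials:

```python
# Fitted model from the slide: DirectSalesRevenue = 19.466 - .003 * AvgDailyClicks
def predicted_revenue(avg_daily_clicks):
    """Point prediction from the slide's fitted equation (error term omitted)."""
    return 19.466 - 0.003 * avg_daily_clicks
```

The negative slope means every extra click lowers the predicted revenue by .003, so a site with 2,000 daily clicks gets a lower prediction than one with 1,000.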
11. The number of average daily clicks significantly predicted direct sales revenue, b = -.03, t(39) = 14.72, p < .001. The number of average daily clicks also explained a significant proportion of variance in direct sales revenue, R2 = .59, F(1, 38) = 42.64, p < .001. These findings suggest that websites with more average daily clicks tend to have a lower direct sales revenue level.
12. Y = 200X (R2 = 45%)
Given any X, the model gives us a predicted value of Y; R2 = 45% means the model explains 45% of the variation in Y.
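Slide 12's model can be sketched in two lines. Note that R² describes the share of variance the model explains, not the probability that any one prediction is right; the function name is mine.

```python
# Slide 12's fitted model: Y = 200X (with R^2 = 45%)
def predict_y(x):
    return 200 * x

# Given any X we get a point prediction, e.g. the 1,000-click
# extrapolation from the speaker notes:
revenue_at_1000_clicks = predict_y(1000)  # 200 * 1000 = 200000
```

The line can be extended to any X, but predictions far outside the range of the original data (say, 1,000 clicks from a model fit on 5-20 clicks) rest on the assumption that the linear trend continues.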
13. Assumptions: Xs are somewhat independent; Y values are independent; Y values are normally distributed; errors are normally distributed; X-Y relations are linear; no outliers
• Example: time series data are NOT independent – stock price today depends on stock price yesterday, which depends on stock price the day before, etc.
Multiple regression is just an extension of simple regression
• Use multiple Xs (e.g., both AvgDailyClicks and NumberAuthors) to predict Y
• When you have a condition (e.g., customer choice depends on gender; brand awareness depends on comm. channel; number of applications depends on program of study), you need to create an interaction term (next class)
When an X is categorical (e.g., whether the blog host is Google or WordPress): code X in numbers – e.g., 0 is Google, 1 is WordPress
When Y is categorical (e.g., whether the blog won the Outstanding Blog Award): code Y in numbers – e.g., 0 is No, 1 is Yes – and use logistic regression
14. What is your Y (the value you want to predict)?
• Is your Y categorical? Do you need logistic regression? See the instructor for help.
What is your X (your predictor variable)? How many Xs do you have?
• Are any of your Xs categorical? Do you have a coding scheme?
• Do you have a condition? (e.g., customer choice depends on gender; brand awareness depends on comm. channel; number of applications depends on program of study) See the instructor for help.
15. Research Question → Inferential Statistics
• Compare means of 2 numeric variables → t test
• Relate 2 numeric variables → Pearson correlation r
• Relate 2 categorical variables → Pearson chi-square
• Use 1+ IVs to explain 1 numeric DV → Regression
Speaker notes
This slide is self-explanatory. Make sure you can recognize a research question that can be answered by simple linear regression – they are all predictive in nature.
This is a review slide. Again, this table shows you where regression sits in the world of inferential statistics.
Inferential statistics are more powerful when they can help us predict the future.
It’s totally possible that the relationship between two variables is NOT linear. Check your scatterplots first to make sure the relationship looks somewhat linear; otherwise the simple linear regression method should NOT be used.
For example, suppose you want to see if gender (IV1) and race (IV2) predict spending (DV), but in your sample all men are Caucasian and all women are African American (perfect correlation between gender and race) – then you will NOT be able to run regression.
For example, suppose you want to see if eating ice cream (IV) causes people to go to the beach more often (DV). You probably will find a positive relationship; however, the IV correlates with an external variable (temperature) which causes variance in your DV (it determines whether people go to the beach or not). In this case running a regression would not make sense.
For example, if you interview 20 men and 30 women, but it turns out that 2 of the 20 men are the same person being interviewed twice, then the “independence” principle is violated.
After the regression model is built, if you can still see a recognizable pattern in the errors, then the model is not good enough. The model should capture the trend of the data completely and leave behind completely random errors.
This is the example used in Individual Assignment 6
To understand regression you need to first understand how a straight line is expressed mathematically:
1. All straight lines can be expressed in mathematical terms as a constant and a slope.
2. We use y = 2x + 1 as an example.
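The note's y = 2x + 1 example can be checked in a few lines: the constant (intercept) is 1, the slope is 2, and moving x by one unit always moves y by the slope.

```python
# The note's example line y = 2x + 1: constant (intercept) = 1, slope = 2
constant, slope = 1, 2
points = [(x, slope * x + constant) for x in range(5)]

# Moving x by 1 always moves y by exactly the slope
assert all(y2 - y1 == slope for (_, y1), (_, y2) in zip(points, points[1:]))
```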
Regression is like what Ralph Lauren and Armani do every day – finding the runway model that fits the best. (Note: Kate Moss is one of the most prototypical runway models.)
Like Kate Moss (or other runway models), the regression line represents an idealized version of the real world. The reality is the messy data we collected (aka the dots). The line is an idealized model that best represents the messy-data reality. A good model, in the world of statistics, is close to reality; the goal is to minimize the difference between the model and the reality. When a regression model represents the real world well, the errors (distances from the dots to the line) are minimal and the goodness-of-fit measure, or R-square, is large.
According to this definition of “goodness of fit”, Kate Moss is a really bad model (poor goodness of fit, large errors; her R-square would be very small). Your goal is to do better than Kate Moss!
In this case, Average Daily Clicks significantly predicted Direct Sales Revenue. However, the beta coefficient (the slope) is negative – this means the more clicks, the lower the revenue. The model fit is pretty good.
Get the total df value (39 in this case) from the ANOVA table
So with regression you get a model, which is a line with a linear function (Y = BX). This means that given any X we can predict the value of Y. For example, say we would like to see if the number of clicks (X) predicts revenue (Y), and we get a regression line of Y = 200X with an R-square of 45%. This means the model explains 45% of the variation in revenue. Perhaps this regression model is based on data of 5-20 clicks; but because the line can be extended infinitely toward the upper right corner of the graph, the model predicts that when we get 1,000 clicks, our revenue will be $200,000! (Keep in mind that extrapolating this far beyond the observed data assumes the linear trend continues.)
There is a LOT more to regression than what we discussed. We covered the basic concepts and you’re not expected to know more than that. However, this slide gives you some ideas about other considerations when running regression.
Here’s some food for thought for your group project.
To summarize, we have discussed 4 different kinds of inferential statistics in this course: t test, correlation, chi-square, and regression. How do you know which test is appropriate for your project? Use this summary table to determine.
Many students ask which test is better than the others. This question is like asking, “Is a pregnancy test better than a DNA test?” It’s impossible to answer without knowing your objective.
Some people also wonder if we can use more than one test in a research study. The answer is obvious: of course! We take the same approach to other “research questions” in our lives. For example, if you want to know if you’re pregnant, you get a pregnancy test. If you want to know if you’re diabetic, you get a blood test. If you want to know who’s the father of your child, you get a DNA test! If you need to know the answers to all 3 questions, you order all 3 tests!
Again, it’s all about your research question!