Here are the steps to perform linear regression on the advertising dataset to predict sales based on TV spend:
1. Import necessary libraries
2. Split data into training and test sets
3. Create linear regression object
4. Fit the model on training data
5. Predict on test data
6. Evaluate the model on test data (e.g. the R² score)
7. Print coefficient and intercept values
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split data into training and test sets (X = TV spend, y = Sales)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
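A minimal sketch completing steps 3–7; the data here are hypothetical stand-ins (in the assignment, X is the TV spend column and y is Sales):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data for illustration only.
rng = np.random.default_rng(42)
X = rng.uniform(0, 300, size=(200, 1))            # "TV spend", 2-D as sklearn expects
y = 7 + 0.05 * X.ravel() + rng.normal(0, 1, 200)  # "Sales" with random noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model = LinearRegression()         # step 3: create the regression object
model.fit(X_train, y_train)        # step 4: fit on training data
y_pred = model.predict(X_test)     # step 5: predict on test data

print("test R^2:", r2_score(y_test, y_pred))                           # step 6
print("coefficient:", model.coef_[0], "intercept:", model.intercept_)  # step 7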
1. History of Regression
• The term regression was introduced by Francis Galton in 1886 in his research paper "Family Likeness in Stature"
• He observed that tall parents have tall sons and short parents have short sons
• He came up with the view that the heights of the sons regress ("revert back") towards the average height of the population
Dr. Deepak Mehta
Associate Professor
AIT CSE
2. Regression Introduction
• Let's take a simple example: suppose your manager asks you to predict annual sales.
• There can be hundreds of factors (drivers) that affect sales.
• In this case, sales is your dependent variable. The factors affecting sales are independent
variables.
• Regression analysis would help you solve this problem.
• In simple words, regression analysis is used to model the relationship between a
dependent variable and one or more independent variables.
It helps us answer the following questions:
• Which of the drivers have a significant impact on sales?
• Which is the most important driver of sales?
• How do the drivers interact with each other?
• What would the annual sales be next year?
3. Regression Terminologies
Outliers
Suppose there is an observation in the dataset that has a very high or very low value compared
to the other observations, i.e. it does not appear to belong to the population; such an
observation is called an outlier. In simple words, it is an extreme value.
An outlier is a problem because it often distorts the results we get; a quick check is sketched below.
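A minimal sketch of one common outlier check, the IQR rule; the sales values here are hypothetical:

import numpy as np

sales = np.array([12, 14, 13, 15, 14, 13, 95])  # hypothetical data; 95 looks extreme

q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers.
outliers = sales[(sales < lower) | (sales > upper)]
print("outliers:", outliers)  # flags 95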
Multicollinearity
When the independent variables are highly correlated with each other, the variables are said to
be multicollinear. Many regression techniques assume that multicollinearity is not present in
the dataset, because it causes problems in ranking variables by importance and makes it
difficult to select the most important independent variable (factor). A correlation matrix, as
sketched below, is a quick first check.
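A minimal sketch of checking for multicollinearity with a pandas correlation matrix; the column names and data are hypothetical:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, 100)
df = pd.DataFrame({
    "TV": tv,
    "Radio": 0.3 * tv + rng.normal(0, 10, 100),  # deliberately correlated with TV
    "Newspaper": rng.uniform(0, 100, 100),
})

# Pairwise correlations close to +/-1 signal multicollinearity.
print(df.corr())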
4. Terminologies
Heteroscedasticity
When the dependent variable's variability is not equal across values of an independent
variable, it is called heteroscedasticity. Example: as one's income increases, the
variability of food consumption increases. A poorer person will spend a rather constant
amount by always eating inexpensive food; a wealthier person may occasionally buy
inexpensive food and at other times eat expensive meals. Those with higher incomes
display a greater variability of food consumption, as the simulation below illustrates.
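A minimal sketch simulating the income/food-spend example, with the noise scaled to grow with income (all numbers hypothetical):

import numpy as np

rng = np.random.default_rng(1)
income = rng.uniform(20_000, 200_000, 500)

# Noise standard deviation scales with income: heteroscedastic errors.
food_spend = 2_000 + 0.05 * income + rng.normal(0, 0.02 * income)

# Spread of spending among the poorest vs richest groups.
low, high = income < 40_000, income > 180_000
print("std (low income): ", food_spend[low].std())
print("std (high income):", food_spend[high].std())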
Underfitting and Overfitting
Underfitting means the model is too simple to capture the pattern in the data, so it performs
poorly even on the training set; it is known as the problem of high bias. When we use
unnecessary explanatory variables, it might lead to overfitting. Overfitting means that our
algorithm works well on the training set but fails to generalise to the test set; it is known
as the problem of high variance. The sketch below shows an overly flexible model overfitting.
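A minimal sketch of overfitting on hypothetical noisy data: a degree-15 polynomial fits the training set almost perfectly but scores far worse on the test set than a plain line:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(40, 1))
y = 1 + 2 * X.ravel() + rng.normal(0, 2, 40)  # truly linear plus noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 15):
    Xp_tr = PolynomialFeatures(degree).fit_transform(X_tr)
    Xp_te = PolynomialFeatures(degree).fit_transform(X_te)
    model = LinearRegression().fit(Xp_tr, y_tr)
    print(f"degree {degree}: train R^2 = {model.score(Xp_tr, y_tr):.2f}, "
          f"test R^2 = {model.score(Xp_te, y_te):.2f}")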
5. History of Regression
Galton's law of universal regression was confirmed by his friend Karl Pearson, who collected data
from more than 1000 families.
He found that the average height of the sons of tall fathers is less than their fathers' height,
and the average height of the sons of short fathers is more than their fathers' height.
Research paper of Karl Pearson: "On the Laws of Inheritance in Man", Biometrika, 1903.
6. Modern Regression
Regression analysis investigates the dependence of one variable, conventionally called the
dependent variable, on one or more other variables, called independent variables, and provides
an equation to be used for estimating or predicting the average value of the dependent variable
from known values of the independent variables.
7. Modern Regression
A variable whose variation we try to explain is the dependent variable, while an independent
variable is a variable that is used to explain the variation in the dependent variable.
When the dependency of a variable on one other variable is studied, it is called simple linear
regression or two-variable regression,
while dependency upon two or more variables is studied in multiple regression.
8. Deterministic vs Probabilistic
If there is an exact relationship between variables such that, knowing the value of one variable,
we can obtain a unique value of the other, such a relation is known as a deterministic or
mathematical relationship.
e.g. consider y = a + bx: substituting a value of x, we obtain a unique value of y.
An example of such a relationship is F = 32 + (9/5)C, the relation between Fahrenheit and Celsius.
Another is the area of a circle, A = πr². A small sketch of a deterministic relation follows.
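A minimal sketch of the deterministic Celsius-to-Fahrenheit relation: the same input always yields the same output, with no error term:

def celsius_to_fahrenheit(c: float) -> float:
    # Deterministic: F is fully determined by C.
    return 32 + (9 / 5) * c

print(celsius_to_fahrenheit(100))  # always 212.0
print(celsius_to_fahrenheit(0))    # always 32.0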
9. Deterministic vs Probabilistic
If the relationship between variables is not exact, we cannot obtain a precise value of one
variable by putting in the value of the other; such a relationship is known as a probabilistic,
stochastic, or statistical relationship.
e.g. consider y = a + bx + e: substituting a value of x, we cannot obtain a unique value of y
until we know e, where e is an unknown random error.
For example, consider the relationship between the weight and height of a person: all persons
with the same height will not have the same weight.
In a deterministic model, all the ordered pairs of the two variables (x, y) fall on the line of
relationship. A simulation of the stochastic case is sketched below.
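A minimal sketch of the stochastic relation y = a + bx + e, with hypothetical values of a and b and random error e:

import numpy as np

rng = np.random.default_rng(3)
a, b = 2.0, 0.5                   # hypothetical intercept and slope
x = np.array([10.0, 10.0, 10.0])  # the same x three times

e = rng.normal(0, 1, size=3)      # unknown random error
y = a + b * x + e

print(y)  # three different y values for the same x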
10. Dependent Variable
Whereas in a probabilistic model, the ordered pairs of the two variables (x, y) do not all fall
on the line of relationship; there is scattering around the line of relationship.
The dependent variable is a random variable, also called:
◦ Explained variable
◦ Predictand
◦ Regressand
◦ Response
◦ Endogenous
◦ Outcome
◦ Controlled variable
11. Independent variable
The independent variable is a fixed variable, also called:
◦ Explanatory variable
◦ Predictor
◦ Regressor
◦ Stimulus
◦ Exogenous
◦ Covariate
◦ Control variable
13. Assumptions of Linear Regression
Linear relationship
Very low / no multicollinearity
No heteroscedasticity
No autocorrelation between the errors
Normal distribution of errors
All observations are independent
14. Summary
Linear regression (LR) is a machine learning algorithm based on supervised learning
A regression model is a target-prediction model based on independent variables
Regression techniques find the relationship between X (input) and Y (output), e.g. work experience → salary
Hypothesis function of LR: y = w0 + w1x
When training this model, it fits the best line to predict the value of y for a given value of X
15. Summary (contd.)
•It shows the significant relationship between the label (dependent variable) and the features
(independent variables)
•It indicates the strength of impact of multiple independent variables on a dependent variable
•A regression model is used to predict continuous values
•LR establishes a relationship between the dependent variable (y) and one or more independent
variables (X) using a best-fit straight line (also known as the regression line)
16. There must be a linear relation between the independent and dependent variables
There should not be any outliers present
No heteroscedasticity
Sample observations should be independent
Error terms should be normally distributed with mean 0 and constant variance
Absence of multicollinearity and autocorrelation (simple residual diagnostics are sketched below)
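A minimal sketch checking two of these assumptions (zero-mean, constant-variance errors) from residuals; the data are hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 + 1.5 * X.ravel() + rng.normal(0, 1, 200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

print("mean residual:", residuals.mean())  # should be close to 0
# Compare residual spread on low vs high X: similar values suggest homoscedasticity.
low, high = X.ravel() < 5, X.ravel() >= 5
print("std (low X): ", residuals[low].std())
print("std (high X):", residuals[high].std())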
17. Interpretation of regression coefficients
Let us consider an example where the dependent variable is the marks obtained by a student and
the explanatory variables are the number of hours studied and the number of classes attended.
Suppose on fitting a linear regression we obtained:
Marks obtained = 5 + 2 (no. of hours studied) + 0.5 (no. of classes attended)
Thus we have the regression coefficients 2 and 0.5, which can be interpreted as:
- If the no. of hours studied and the no. of classes attended are both 0, the student will obtain 5 marks.
- Keeping the no. of classes attended constant, if the student studies for one more hour, they
will score 2 more marks in the examination.
- Similarly, keeping the no. of hours studied constant, if the student attends one more class,
they will obtain 0.5 more marks. A sketch recovering such coefficients from data follows.
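A minimal sketch fitting a multiple regression on hypothetical marks data generated to match the equation above, then printing the recovered intercept and coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
hours = rng.uniform(0, 20, 100)
classes = rng.integers(0, 40, 100)

# Hypothetical data following: marks = 5 + 2*hours + 0.5*classes + noise
marks = 5 + 2 * hours + 0.5 * classes + rng.normal(0, 1, 100)

X = np.column_stack([hours, classes])
model = LinearRegression().fit(X, marks)

print("intercept:", model.intercept_)  # close to 5
print("coefficients:", model.coef_)    # close to [2, 0.5]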
18. Least Squares
As we know, y_i = w0 + w1*x_i + e_i, i.e. y_i = h(x_i) + e_i
Error: e_i = y_i - h(x_i)
e_i is the residual error in the i-th observation.
The main aim is to minimize the residual error.
Squared error, or cost function: J(w0, w1) = (1/n) * sum_i (y_i - h(x_i))^2
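A minimal sketch of the closed-form least-squares solution for w0 and w1 (slope from the covariance/variance ratio), on hypothetical data:

import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 50)
y = 4 + 3 * x + rng.normal(0, 1, 50)  # hypothetical data with true w0=4, w1=3

# Closed-form least squares: w1 = cov(x, y) / var(x), w0 = mean(y) - w1*mean(x)
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

cost = np.mean((y - (w0 + w1 * x)) ** 2)  # squared-error cost at the optimum
print(f"w0 = {w0:.2f}, w1 = {w1:.2f}, cost = {cost:.3f}")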
24. Linear Regression Assignment: Advertisement Dataset
import pandas as pd
from sklearn.model_selection import train_test_split

advertising = pd.read_csv('advertising.csv')  # assumed file name for the dataset

X = advertising['TV']
y = advertising['Sales']
-------------------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)
X_train.head()
y_train.head()
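A sketch of how the assignment could continue, fitting the model and reporting the fit; note that sklearn expects a 2-D feature matrix, so the single TV column is reshaped first:

import numpy as np
from sklearn.linear_model import LinearRegression

# sklearn expects 2-D X, so reshape the single TV column.
X_train_2d = X_train.values.reshape(-1, 1)
X_test_2d = X_test.values.reshape(-1, 1)

model = LinearRegression()
model.fit(X_train_2d, y_train)

print("intercept:", model.intercept_)
print("TV coefficient:", model.coef_[0])
print("test R^2:", model.score(X_test_2d, y_test))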