BOSTON HOUSING DATA
A Comprehensive Regression Analysis
Ravish Kalra
Graduate Student, Business Analytics
University of Cincinnati
Table of Contents
Executive Summary - Boston Housing Data
Boston Housing Data
  Introduction
  Exploratory Data Analysis
  Variable Selection and Modelling
  Residual Diagnostics
  Final Model
  Comparison with CART
Executive Summary - Boston Housing Data
This report analyzes and evaluates the factors affecting the median value of owner-occupied
homes in the suburbs of Boston. The built-in Boston Housing data set is used for this analysis.
The study takes into account factors describing structural quality, neighbourhood,
accessibility and air pollution, such as the per capita crime rate by town, the proportion of
non-retail business acres per town, and the index of accessibility to radial highways.
Methods of analysis include, but are not limited to, summary statistics and visualization of the
distribution of the variables, correlation analysis between variables, and linear regression on
the data.
Further, several variable selection methods (Best Subset, Stepwise Selection and LASSO) were
applied to arrive at the best linear regression model for predicting the median value of
owner-occupied homes. These models were then compared with a custom model designed to
incorporate the findings from the initial exploration.
Finally, linear regression was compared with CART for predicting the median home value on the
same data. The results indicated that while CART outperformed the full linear regression model,
the custom linear model built on the findings of the exploratory phase was still the better
choice.
The final model included interaction terms and variable transformations. It achieved an
adjusted R-squared of 0.85 and an average RMSE of about 3.6:
medv ~ nox + ptratio + age + [log(lstat) + rm + log(crim) + dis] * rad_c
Boston Housing Data
Introduction
The full data set consists of 506 observations and 14 variables. An 80:20 train-test split was
sampled for the study, leaving 404 observations and 14 variables in the training set. All
variables are stored as numeric. The variable chas (which captures the amenities of a riverside
location) is categorical, while the rest are continuous. The exploratory data analysis and model
selection below aim to find the best model to predict the median value of owner-occupied homes.
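The 80:20 split described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code used in the report; the seed value and the function name `train_test_indices` are assumptions made for the example.

```python
import random

N_OBS = 506           # total observations reported in the text
TRAIN_FRACTION = 0.8  # 80:20 train-test ratio

def train_test_indices(n_obs, train_fraction, seed):
    """Return (train, test) lists of row indices from one random split."""
    rng = random.Random(seed)              # seed is an arbitrary assumption
    n_train = int(n_obs * train_fraction)  # floor(506 * 0.8) = 404
    shuffled = list(range(n_obs))
    rng.shuffle(shuffled)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])

train_idx, test_idx = train_test_indices(N_OBS, TRAIN_FRACTION, seed=42)
print(len(train_idx), len(test_idx))
```

A different seed yields a different split of the same sizes, which is why error figures from a single split can vary from run to run.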
Exploratory Data Analysis
An initial look at the summary statistics of the data gives the following insights:
• There are no NA / missing values in the data set.
• The median value of owner-occupied homes (medv, the dependent variable) ranges
from 5 to 50 (in $1000s).
• The average number of rooms per dwelling is ~6.
• The full-value property-tax rate (per $10,000) varies from 187 to 711.
• The proportion of owner-occupied units built prior to 1940 is on the higher side: in more
than 50% of the observations, this proportion exceeds 75%.
From the distributions shown in Figure 1, the following can be concluded about the variables
used in this study:
• The proportion of owner-occupied units built prior to 1940 (age) and the proportion of
blacks by town (black) are highly skewed to the left, which means that most values of
these variables fall at the higher end.
• The average number of rooms per dwelling (rm) is approximately normally distributed,
with most dwellings averaging about 6 rooms.
• Most dwellings are at smaller distances from the five Boston employment centers
(dis is skewed to the right).
• More dwellings have a lower median value (less than $25,000) than a higher one
(medv is skewed to the right).
• Most observations have a low proportion of adults without high school education and of
male workers classified as laborers (lstat is skewed to the right).
• The full-value property-tax rate (tax, per $10,000) separates into 2 distinct clusters:
one below 500 and the other above 650.
• The index of accessibility to radial highways (rad) also separates into 2 distinct
clusters: a large number of dwellings have an index below 10 and the rest have an
index of 24.
Figure 1: Histograms of different variables of Boston data set
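The direction of skew read off the histograms can also be checked numerically. The sketch below computes the moment-based sample skewness in plain Python; the two small lists are made-up values chosen only to show the sign convention, not values from the Boston data.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up illustrative values (assumption, not taken from the data set).
right_tailed = [1, 1, 2, 2, 3, 4, 12]      # long tail towards high values
left_tailed = [-v for v in right_tailed]   # mirror image, long tail towards low values

print(sample_skewness(right_tailed) > 0)   # right skew gives a positive value
print(sample_skewness(left_tailed) < 0)    # left skew gives a negative value
```

Under this convention, variables described as skewed to the right (dis, medv, lstat) would give positive values and those skewed to the left (age, black) negative ones.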
Studying the correlation between the variables, the following observations were made:
• A strong correlation of 0.912 between the variables rad and tax. This is expected, as the
property tax rate of dwellings often increases with accessibility to radial highways.
• A correlation of 0.76 between the proportion of non-retail business acres per town (indus)
and the nitrogen oxide concentration (nox). This is consistent with non-retail businesses
contributing substantially to the nitrogen oxide concentration in the air.
• A correlation of 0.73 between the proportion of non-retail business acres per town (indus)
and the property tax rate of the dwellings (tax). The tax rate may also be influenced by the
presence of non-retail businesses near the dwellings.
• A correlation of 0.73 between the proportion of owner-occupied units built prior to 1940 (age)
and the nitrogen oxide concentration (nox). This suggests that the older parts of the city,
where the older houses are situated, have more air pollution.
• A correlation of -0.74 between the mean distance to five Boston employment centers (dis)
and the proportion of owner-occupied units built prior to 1940 (age). It is interesting to
note that areas with older homes tend to be closer to the employment centers, which is
consistent with a city growing outward from where its employment centers are located.
Correlation with the median value of owner-occupied homes (medv):
• A correlation of -0.74 with lstat (percent of lower status of the population), i.e. the
higher the proportion of people with lower status, the lower the value of the house. This
can be attributed to affordability.
• A correlation of 0.70 with the average number of rooms per dwelling (rm), i.e. as the
number of rooms increases, the price of the dwelling rises.
Figure 2: Correlation matrix
Figure 3 shows scatter plots of the various variables against medv. Linear regression lines
are plotted to better visualize each relationship with medv. The plots also confirm our
understanding of the variables rad and tax, which are highly correlated. It can also be seen
that applying a log transformation to the variables crim and lstat gives a better fit to the
straight line.
Figure 3: Scatter plots of different variables and medv (including log transformed variables)
Table 1: Correlation coefficients with respect to medv
Variable                 lstat_log  lstat    rm       ptratio  indus    crim_log  crim
Correlation coefficient  -0.82      -0.74    0.70     -0.51    -0.48    -0.45     -0.39
p-value                  9.2e-122   5.0e-88  2.4e-74  1.6e-34  4.9e-31  3.8e-27   1.1e-19
The correlation coefficients of the variables with respect to medv (shown in Table 1) confirm
that the log-transformed variables are more linearly correlated with medv than the raw ones.
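The effect of a log transformation on linear correlation can be illustrated with a small sketch. The data below are synthetic, constructed so that y is exactly linear in log(x); they are an assumption for illustration and do not reproduce the coefficients in Table 1.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Synthetic example (assumption): y falls linearly with log(x).
x = [1, 2, 4, 8, 16, 32, 64]
y = [40 - 5 * math.log(v) for v in x]

r_raw = pearson(x, y)                         # correlation with the raw predictor
r_log = pearson([math.log(v) for v in x], y)  # correlation after the log transform
print(abs(r_log) > abs(r_raw))
```

When the underlying relationship is curved in this way, the transformed predictor is the one a straight line can follow, which is the pattern reported for lstat and crim.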
The high correlation between tax and rad can also be observed in Figure 4. Since both
distributions fall into two clusters, a new categorical variable called rad_c was created
from rad, and the tax variable was dropped, as rad_c explains most of the variation in tax.
Figure 4: Correlation and plots of variables tax and rad
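One way to derive a two-level variable such as rad_c from rad is a simple threshold. The report does not state the exact rule, so the cutoff of 10 (taken from the cluster description) and the level names "low" and "high" in this sketch are assumptions.

```python
RAD_CUTOFF = 10  # assumed boundary between the two rad clusters

def rad_category(rad):
    """Map the rad index to a two-level category (hypothetical level names)."""
    return "high" if rad >= RAD_CUTOFF else "low"

# Illustrative rad values only; not rows from the data set.
example_rad = [1, 4, 5, 8, 24, 24]
rad_c = [rad_category(r) for r in example_rad]
print(rad_c)
```

Because the two clusters are far apart, any cutoff between them would produce the same grouping.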
With the introduction of the new variable, a change of slope is observed in the variables
shown in Figure 5.
Figure 5: Introduction of rad_c variable forces a change in slope
Variable Selection and Modelling
For the modelling phase, both classical and regularization techniques for variable selection
were used to arrive at the best linear regression model for the dependent variable medv. Best
subset selection, stepwise selection and LASSO (with parameter tuning to select the best
lambda) were performed. Table 2 summarizes these models.
Table 2: Comparison of different models obtained through variable selection techniques
Method             Formula                                                                                  10-fold CV  In-sample   Out-of-sample  R2     Adj R2  AIC   BIC
                                                                                                                        prediction  prediction
Best subset        medv ~ chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + tax           23.56       24.35       12.75          0.741  0.735   2462  2514
Stepwise B/F/Both  medv ~ chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + tax           23.56       24.35       12.75          0.741  0.735   2462  2514
Full model         medv ~ .                                                                                 24.42       25.10       13.39          0.742  0.734   3030  3102
Lasso (λ = 0.034)  medv ~ chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + age + indus   24.12       24.26       12.00          0.735  0.729   3036  3095
The gap between in-sample and out-of-sample prediction error was large and, surprisingly, the
error was lower out of sample. This was due to the single random train / test split and shows
why a result from a single split should not be trusted. When the evaluation was repeated with
10-fold cross-validation, a more realistic picture emerged, which was very different from the
out-of-sample result. Since the folds are assigned at random, the 10-fold cross-validation
results also vary from run to run.
From our exploratory data analysis, we discovered that taking the log of the crim and lstat
variables increased their linear correlation with medv. We also observed that interaction
terms with the derived rad_c variable explained more of the variation around the regression
line. We therefore compare the above models with a customized model that incorporates these
findings from the exploratory data analysis.
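The idea of replacing a single split with repeated k-fold cross-validation can be sketched as below. This is a simplified illustration in plain Python with a one-predictor least-squares fit on synthetic data; the function names, seeds and data are assumptions and this is not the procedure or tooling used in the report.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def kfold_mse(xs, ys, k, rng):
    """Mean of the k held-out-fold MSEs for one random fold assignment."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errs = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / k

def repeated_kfold_mse(xs, ys, k, repeats, seed):
    """Run k-fold CV several times with different random fold assignments."""
    rng = random.Random(seed)
    return [kfold_mse(xs, ys, k, rng) for _ in range(repeats)]

# Synthetic data (assumption): a noisy straight line, not the Boston data.
data_rng = random.Random(0)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [3.0 + 2.0 * x + data_rng.gauss(0, 1) for x in xs]

scores = repeated_kfold_mse(xs, ys, k=10, repeats=5, seed=1)
print(len(scores), min(scores) <= max(scores))
```

Every observation is held out exactly once per repeat, so the spread of the repeated scores gives a sense of how much of a single-split result is due to chance.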
A comparison using repeated cross-validation with 5 repeats and 20 folds is depicted in Figure 6.
Figure 6: Model comparison (RMSE and R sq) at 95% Confidence Interval
The customized model performed much better at explaining the variation in median housing prices
and predicting out of sample.
Residual Diagnostics
Stepwise Selection Model vs. Custom Model Residual Comparison
Figure 7: Residual plots comparison for stepwise model (left) and custom model (right)
Figure 7 shows that the custom model displays a slight improvement in the Q-Q plot, indicating
that its residuals are nearly normal. The curvature in the Scale-Location graph has also been
linearized to an extent in the custom model. This indicates that the assumptions of linear
regression hold better for the custom model than for the other models. Thus, the custom model
should be preferred for out-of-sample predictions.
Final Model
Table 3: Model summary of the final selected model
Formula                                                                   R2     Adj R2  AIC   BIC   RMSE
medv ~ nox + ptratio + age + (log(lstat) + rm + log(crim) + dis) * rad_c  0.854  0.850   2735  2794  3.607
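As a consistency check on Table 3, the formula can be expanded into its individual terms and the adjusted R-squared recomputed from R-squared. The sketch assumes that rad_c has two levels (so it contributes one column and each interaction one more) and that the model was fitted on the 404 training observations; neither is stated explicitly alongside the table.

```python
MAIN_ONLY = ["nox", "ptratio", "age"]
INTERACTED = ["log(lstat)", "rm", "log(crim)", "dis"]
MODERATOR = "rad_c"  # assumed to be a two-level factor (one dummy column)

def expand_terms(main_only, interacted, moderator):
    """Expand 'a + (b + c) * m' into main effects plus b:m and c:m terms."""
    terms = list(main_only) + list(interacted) + [moderator]
    terms += [f"{t}:{moderator}" for t in interacted]
    return terms

def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

terms = expand_terms(MAIN_ONLY, INTERACTED, MODERATOR)
adj = adjusted_r2(r2=0.854, n_obs=404, n_predictors=len(terms))
print(len(terms), round(adj, 3))
```

Under these assumptions the expansion gives 12 predictors, and the recomputed value agrees with the reported adjusted R-squared of 0.850 to within rounding.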
Comparison with CART
After fitting a regression tree to the same train / test split, we observed the following
prediction errors for CART and the linear model:
Table 4: Comparison of predictions made by linear regression and CART
Sample Type       Linear Regression (full model)  CART (cp = 0.015642)
In-Sample (80%)   21.50                           17.81
Out-Sample (20%)  23.91                           21.76
The values in Table 4 suggest that CART performed better than the full regression model. These
values, however, are volatile, i.e. the prediction errors vary with a slight change in the
train / test split. Thus, to compare these two models with the custom model arrived at
earlier, we ran a repeated cross-validation with 5 repeats and 20 folds. Figure 8 depicts
the summary of these repeats and the predictions at a 95% confidence interval.
Figure 8: Comparison of model prediction between full linear regression, CART and custom model
From the above, it is evident that CART performs better than the full linear regression model.
However, the final custom model, which retains the simplicity of linear regression and
incorporates the analysis done in the exploratory phase, outperforms the CART model.

Contenu connexe

Tendances

Normal Distribution Presentation
Normal Distribution PresentationNormal Distribution Presentation
Normal Distribution Presentation
sankarshanjoshi
 
Stochastic Process
Stochastic ProcessStochastic Process
Stochastic Process
knksmart
 
Variable Selection Methods
Variable Selection MethodsVariable Selection Methods
Variable Selection Methods
joycemi_la
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
saba khan
 

Tendances (20)

Lasso regression
Lasso regressionLasso regression
Lasso regression
 
Normal Distribution Presentation
Normal Distribution PresentationNormal Distribution Presentation
Normal Distribution Presentation
 
Binary Logistic Regression
Binary Logistic RegressionBinary Logistic Regression
Binary Logistic Regression
 
Linear regression
Linear regressionLinear regression
Linear regression
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
 
Stochastic Process
Stochastic ProcessStochastic Process
Stochastic Process
 
Logistic regression (blyth 2006) (simplified)
Logistic regression (blyth 2006) (simplified)Logistic regression (blyth 2006) (simplified)
Logistic regression (blyth 2006) (simplified)
 
Variable Selection Methods
Variable Selection MethodsVariable Selection Methods
Variable Selection Methods
 
Estimation
EstimationEstimation
Estimation
 
Presentation On Regression
Presentation On RegressionPresentation On Regression
Presentation On Regression
 
Linear algebra to solve autosomal inheritance
Linear algebra to solve autosomal inheritanceLinear algebra to solve autosomal inheritance
Linear algebra to solve autosomal inheritance
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Correlation and Regression Analysis using SPSS and Microsoft Excel
Correlation and Regression Analysis using SPSS and Microsoft ExcelCorrelation and Regression Analysis using SPSS and Microsoft Excel
Correlation and Regression Analysis using SPSS and Microsoft Excel
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Simple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-StepSimple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-Step
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Ames housing
Ames housingAmes housing
Ames housing
 
Mcqs (probability distribution)
Mcqs (probability distribution)Mcqs (probability distribution)
Mcqs (probability distribution)
 
Ridge regression
Ridge regressionRidge regression
Ridge regression
 
Logistic Regression Analysis
Logistic Regression AnalysisLogistic Regression Analysis
Logistic Regression Analysis
 

Similaire à Regression Study: Boston Housing

Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to mars
Salford Systems
 
Analysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 censusAnalysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 census
Shuai Yuan
 
Predicting US house prices using Multiple Linear Regression in R
Predicting US house prices using Multiple Linear Regression in RPredicting US house prices using Multiple Linear Regression in R
Predicting US house prices using Multiple Linear Regression in R
Sotiris Baratsas
 
Project 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docxProject 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docx
wkyra78
 
1 BBS300 Empirical Research Methods for Business .docx
1  BBS300 Empirical  Research  Methods  for  Business .docx1  BBS300 Empirical  Research  Methods  for  Business .docx
1 BBS300 Empirical Research Methods for Business .docx
oswald1horne84988
 
regression-linearandlogisitics-220524024037-4221a176 (1).pdf
regression-linearandlogisitics-220524024037-4221a176 (1).pdfregression-linearandlogisitics-220524024037-4221a176 (1).pdf
regression-linearandlogisitics-220524024037-4221a176 (1).pdf
lisow86669
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMS
Ali T. Lotia
 

Similaire à Regression Study: Boston Housing (20)

RegressionProjectReport
RegressionProjectReportRegressionProjectReport
RegressionProjectReport
 
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to mars
 
Analysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 censusAnalysis of the Boston Housing Data from the 1970 census
Analysis of the Boston Housing Data from the 1970 census
 
Real - estate pricing valuation
Real - estate pricing valuationReal - estate pricing valuation
Real - estate pricing valuation
 
Dm
DmDm
Dm
 
Predicting US house prices using Multiple Linear Regression in R
Predicting US house prices using Multiple Linear Regression in RPredicting US house prices using Multiple Linear Regression in R
Predicting US house prices using Multiple Linear Regression in R
 
Multiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate PricingMultiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate Pricing
 
Multiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate PricingMultiple Linear Regression Applications in Real Estate Pricing
Multiple Linear Regression Applications in Real Estate Pricing
 
(Machine Learning) Clustering & Classifying Houses in King County, WA
(Machine Learning) Clustering & Classifying Houses in King County, WA(Machine Learning) Clustering & Classifying Houses in King County, WA
(Machine Learning) Clustering & Classifying Houses in King County, WA
 
Project 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docxProject 1FINA 415-15BGroup of 5.Due by 18092015..docx
Project 1FINA 415-15BGroup of 5.Due by 18092015..docx
 
1 BBS300 Empirical Research Methods for Business .docx
1  BBS300 Empirical  Research  Methods  for  Business .docx1  BBS300 Empirical  Research  Methods  for  Business .docx
1 BBS300 Empirical Research Methods for Business .docx
 
regression-linearandlogisitics-220524024037-4221a176 (1).pdf
regression-linearandlogisitics-220524024037-4221a176 (1).pdfregression-linearandlogisitics-220524024037-4221a176 (1).pdf
regression-linearandlogisitics-220524024037-4221a176 (1).pdf
 
Linear and Logistics Regression
Linear and Logistics RegressionLinear and Logistics Regression
Linear and Logistics Regression
 
Bab 3.ppt
Bab 3.pptBab 3.ppt
Bab 3.ppt
 
Unsupervised learning
Unsupervised learning Unsupervised learning
Unsupervised learning
 
1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf
 
Chap003.ppt
Chap003.pptChap003.ppt
Chap003.ppt
 
Guide for building GLMS
Guide for building GLMSGuide for building GLMS
Guide for building GLMS
 
Regression Analysis.ppt
Regression Analysis.pptRegression Analysis.ppt
Regression Analysis.ppt
 
Eli plots visualizing innumerable number of correlations
Eli plots   visualizing innumerable number of correlationsEli plots   visualizing innumerable number of correlations
Eli plots visualizing innumerable number of correlations
 

Dernier

Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 

Dernier (20)

Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 

Regression Study: Boston Housing

  • 1. BOSTON HOUSING DATA A Comprehensive Regression Analysis Ravish Kalra Graduate Student, Business Analytics University of Cincinnati
  • 2. Table of Contents Executive Summary - Boston Housing Data.................................................................................................2 Boston Housing Data.....................................................................................................................................3 Introduction ..............................................................................................................................................3 Exploratory Data Analysis .........................................................................................................................3 Variable Selection and Modelling .............................................................................................................7 Residual Diagnostics .................................................................................................................................9 Final Model ...............................................................................................................................................9 Comparison with CART ...........................................................................................................................10 Executive Summary - Boston Housing Data This report provides an analysis and evaluation of the factors affecting the median value of the owner occupied homes in the suburbs of Boston. The in-built data set of Boston Housing Data is used for this analysis and various factors about the structural quality, neighbourhood, accessibility and air pollution such as per capita crime rate by town, proportion of non-retail business acres per town, index of accessibility to radial highways etc are taken into account for this study. Methods of analysis include (but not limited to) summary statistics and visualization of the distribution of the variables, finding correlation between variables and conducting linear regression on the data. 
Further, various variable selection methods like Best Subset, Stepwise Selection and LASSO was performed to come up with the best linear regresssion model to predict the median value of the owner occupied homes. These models were then compared with a custom model designed after including all the analysis from the initial exploration. Finally, a comprehensive comparison was made between linear regression and CART to predict the median price values after supplying the same data. The results indicated that while CART outperformed linear regression, the additional details captured by the linear regression model in the exploratory phase was still a better choice. The final model included interaction term and variable transformation. This model resulted in an adjuted R-squared value of 0.85 and an avg MSE value of 3.60 medv ~ nox + ptratio + age + [log(lstat) + rm + log(crim) + dis] * rad_c
  • 3. Boston Housing Data
Introduction
The full data set consists of 506 observations and 14 variables, all stored as numeric. An 80:20 train-test split was sampled for the study, giving a training set of 404 observations and 14 variables. The variable chas (which captures the amenities of a riverside location) is categorical, while the rest are continuous. The exploratory data analysis and the selection of the best model for predicting the median value of owner-occupied homes are given below.
Exploratory Data Analysis
An initial look at the summary statistics of the data gives the following insights:
• There are no NA / missing values in the data set.
• The median value of owner-occupied homes (medv, the dependent variable) ranges from 5 to 50 (in $1000s).
• The average number of rooms per dwelling is about 6.
• The full-value property-tax rate (per $10,000) varies from 187 to 711.
• The proportion of owner-occupied units built prior to 1940 (age) is on the higher side: in more than half of the observations, over 75% of the units were built before 1940.
From the distributions shown in Figure 1, the following can be concluded about the variables taken for this study:
• The proportion of owner-occupied units built prior to 1940 (age) and the proportion of blacks by town (black) are highly skewed to the left, which means that most observations of these variables fall at the higher end.
• The average number of rooms per dwelling (rm) is approximately normally distributed, i.e. most of the dwellings have an average of about 6 rooms.
• More dwellings have small distances to the five Boston employment centers than large ones (dis is skewed to the right).
• More dwellings have a lower median value (less than $25,000) than a higher one (medv is skewed to the right).
• Most observations have a low proportion of adults without high school education and of male workers classified as laborers (lstat is skewed to the right).
• The full-value property-tax rate (tax, per $10,000) separates into 2 distinct clusters, one below 500 and the other above 650.
• The index of accessibility to radial highways (rad) also separates into 2 distinct clusters: a large number of dwellings have an index below 10 and the rest have an index of 24.
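The 80:20 split described in the introduction can be sketched as follows. This is a minimal illustration using only the Python standard library and row indices; the function name train_test_split_indices and the seed are assumptions made for the example, not taken from the report.

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.8, seed=42):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_train = int(n_rows * train_fraction)    # rounds down
    return indices[:n_train], indices[n_train:]

# 506 is the number of observations stated in the report.
train_idx, test_idx = train_test_split_indices(506)
print(len(train_idx), len(test_idx))
```

With 506 rows, rounding 404.8 down reproduces the 404 training observations quoted above; the remaining 102 rows form the test set.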
  • 4. Figure 1: Histograms of different variables of Boston data set
Studying the correlation between the variables, the following observations were made:
• A strong correlation of 0.912 between the variables rad and tax. This is plausible, since the property-tax rate of the dwellings tends to increase as accessibility to radial highways increases.
• A correlation of 0.76 between the proportion of non-retail business acres per town (indus) and the nitrogen oxides concentration (nox). This is consistent with non-retail businesses contributing substantially to the nitrogen oxides concentration in the air.
• A correlation of 0.73 between the proportion of non-retail business acres per town (indus) and the property-tax rate of the dwellings (tax). The tax rate may also be influenced by the presence of non-retail business near the dwellings.
• A correlation of 0.73 between the proportion of owner-occupied units built prior to 1940 (age) and the nitrogen oxides concentration (nox). This suggests that the older parts of the city, where the older houses are situated, have more air pollution.
• A correlation of -0.74 between the mean distance to five Boston employment centers (dis) and the proportion of owner-occupied units built prior to 1940 (age). Older homes therefore tend to lie closer to the employment centers, which is consistent with a city expanding outward from where the employment centers are located.
Correlation with the median value of owner-occupied homes (medv):
• A correlation of -0.74 with lstat (percentage of lower-status population), i.e. the higher the proportion of lower-status population, the lower the value of the house. This can be attributed to affordability.
  • 5. • A positive correlation of 0.70 with the average number of rooms per dwelling (rm), i.e. as the number of rooms increases, the price of the dwellings rises.
Figure 2: Correlation matrix
Figure 3 shows scatter plots of the various variables against medv. Linear regression lines are plotted to better visualize each relationship with medv. The plots also reinforce our understanding of the variables rad and tax, which are highly correlated. It can also be seen that after a log transformation the variables crim and lstat follow the fitted line more closely.
Figure 3: Scatter plots of different variables and medv (including log transformed variables)
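The effect of a log transformation on a linear correlation can be sketched as follows. The data here are synthetic and deterministic (they are not the Boston data), and the helper name pearson is an assumption for the example. The response is constructed to be exactly linear in log(x), so the correlation with log(x) is -1 by design while the correlation with raw x is weaker.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic predictor spanning several orders of magnitude (hypothetical values).
x = [0.05 * 1.5 ** k for k in range(20)]
# Synthetic response that is linear in log(x), not in x.
y = [40.0 - 4.0 * math.log(v) for v in x]

r_raw = pearson(x, y)                          # correlation with the raw variable
r_log = pearson([math.log(v) for v in x], y)   # correlation after the log transform
print(round(r_raw, 3), round(r_log, 3))
```

The same comparison on real data would use the observed crim or lstat column in place of x and medv in place of y.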
  • 6. Table 1: Correlation coefficients with respect to medv

Variable     Correlation coefficient   p-value
lstat_log    -0.82                     9.2e-122
lstat        -0.74                     5.0e-88
rm            0.70                     2.4e-74
ptratio      -0.51                     1.6e-34
indus        -0.48                     4.9e-31
crim_log     -0.45                     3.8e-27
crim         -0.39                     1.1e-19

Further analysis of the correlation coefficients of the variables with respect to medv (Table 1) confirms that the log-transformed variables are more linearly correlated with medv than the original ones. The high correlation between tax and rad can also be observed (Figure 4). Since both distributions fall into two clusters, a new categorical variable called rad_c was created, and the tax variable was dropped because rad_c explains most of the variation in tax.
Figure 4: Correlation and plots of variables tax and rad
With the introduction of the new variable, a change of slope is observed in the following variables
  • 7. Figure 5: Introduction of rad_c variable forces a change in slope
Variable Selection and Modelling
For the modelling phase, both classical and regularization techniques for variable selection were used to find the best linear regression model for the dependent variable medv. Best subset selection, stepwise selection and LASSO (with parameter tuning to select the best lambda) were performed. Table 2 gives a summary of these models.
Table 2: Comparison of different models obtained through variable selection techniques

Method              10-fold CV   In-sample    Out-sample   R2      Adj R2   AIC    BIC
                                 prediction   prediction
Best subset         23.56        24.35        12.75        0.741   0.735    2462   2514
Stepwise B/F/Both   23.56        24.35        12.75        0.741   0.735    2462   2514
Full model          24.42        25.10        13.39        0.742   0.734    3030   3102
Lasso (λ = 0.034)   24.12        24.26        12.00        0.735   0.729    3036   3095

Formulas:
Best subset:        medv ~ chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + tax
Stepwise B/F/Both:  medv ~ chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + tax
Full model:         medv ~ .
Lasso (λ = 0.034):  medv ~ chas + nox + rm + dis + ptratio + black + lstat + crim + zn + rad + age + indus

The in-sample and out-of-sample prediction errors differed considerably and, surprisingly, the error was lower out of sample. This is an artefact of the single random train / test split and shows why the result of a single split should not be trusted. When the same comparison was repeated with 10-fold cross validation, a more realistic picture surfaced, which was very different from the out-of-sample prediction. Since the fold assignments are random, the 10-fold cross validation results vary from run to run.
  • 8. From our exploratory data analysis, we discovered that taking the log of the crim and lstat variables increased their linear correlation with medv. We also observed that interaction terms with the derived rad_c variable explained more of the variation around the regression line. We now compare the above models with a customized model that incorporates the discoveries from the exploratory data analysis. A comparison using repeated cross-validation with 5 repeats and 20 folds is depicted in Figure 6.
Figure 6: Model comparison (RMSE and R sq) at 95% Confidence Interval
The customized model performed much better at explaining the variation in median housing prices and at predicting out of sample.
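The repeated cross-validation used for this comparison (5 repeats of 20 folds) can be sketched as follows. This is a minimal standard-library illustration on synthetic data with a one-predictor least-squares line; the function names, the seed values and the data are all assumptions for the example, not the report's actual models or code.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b * mean_x, b

def repeated_kfold_rmse(xs, ys, k=20, repeats=5, seed=0):
    """Return one held-out RMSE per fold, over all repeats (k * repeats values)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        order = list(range(len(xs)))
        rng.shuffle(order)                       # new random partition each repeat
        folds = [order[i::k] for i in range(k)]  # k disjoint folds
        for held_out in folds:
            held = set(held_out)
            train = [i for i in order if i not in held]
            a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
            sq_err = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
            scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return scores

# Synthetic stand-in data (NOT the Boston data): linear signal plus Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(404)]
ys = [3.0 + 2.0 * x + data_rng.gauss(0, 1.5) for x in xs]

scores = repeated_kfold_rmse(xs, ys)
mean_rmse = sum(scores) / len(scores)
print(len(scores), round(mean_rmse, 2))
```

Averaging over 100 held-out folds rather than relying on one 80:20 split is what removes the dependence on a single lucky or unlucky partition noted above.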
  • 9. Residual Diagnostics
Stepwise Selection Model v/s Custom Model Residual Comparison
Figure 7: Residual plots comparison for stepwise model (left) and custom model (right)
Figure 7 shows that the custom model displays a slight improvement in the Q-Q plot, indicating that the residuals of the model are nearly normal. The curvature in the Scale-Location graph has also been flattened to an extent in the custom model. This indicates that the assumptions of linear regression hold better for the custom model than for the other models. Thus, the custom model should be preferred for out-of-sample predictions.
Final Model
Table 3: Model summary of the final selected model

Formula: medv ~ nox + ptratio + age + (log(lstat) + rm + log(crim) + dis) * rad_c

R2      Adj R2   AIC    BIC    RMSE
0.854   0.850    2735   2794   3.607
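As a rough consistency check on Table 3, the formula can be expanded into its individual terms and the adjusted R2 recomputed from R2. The sketch below relies on two assumptions that the report does not state explicitly: that rad_c has two levels (so it and each of its interactions contribute one coefficient each), and that the model was fitted on the 404 training observations. The helper names are hypothetical.

```python
def expand_terms(main_only, interacted, by):
    """Terms of  main_only + (interacted) * by  under R formula semantics:
    main effects for everything, plus one interaction term per interacted variable."""
    terms = list(main_only) + list(interacted) + [by]
    terms += [f"{name}:{by}" for name in interacted]
    return terms

def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

terms = expand_terms(
    main_only=["nox", "ptratio", "age"],
    interacted=["log(lstat)", "rm", "log(crim)", "dis"],
    by="rad_c",
)
n_predictors = len(terms)   # assumes one coefficient per term (two-level rad_c)
adj = adjusted_r2(0.854, n_obs=404, n_predictors=n_predictors)
print(n_predictors, round(adj, 3))
```

Under these assumptions the recomputed value agrees with the adjusted R2 of 0.850 reported in Table 3 to within rounding.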
  • 10. Comparison with CART
After constructing the tree from the available split data, we observed the following values in comparison to the linear model:
Table 4: Comparison of predictions made by linear regression and CART

Sample Type        Linear Regression (full model)   CART (cp = 0.015642)
In-Sample (80%)    21.50                            17.81
Out-Sample (20%)   23.91                            21.76

The values in Table 4 suggest that CART performed better than the full regression model. These values are volatile, however: the prediction errors vary with a slight change in the train / test split. Thus, to compare these two models and the custom model arrived at earlier, we ran a repeated cross validation with 5 repeats and 20 folds. Figure 8 depicts the summary of these repeats and predictions at a 95% confidence interval.
Figure 8: Comparison of model prediction between full linear regression, CART and custom model
From the above, it is evident that CART performs better than the full linear regression model. However, the final custom model, which incorporates the analysis done in the exploratory phase, outperforms the CART model while retaining the simplicity of linear regression.
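The core step of a CART regression tree, choosing the split that most reduces the sum of squared errors, can be sketched as follows. This is a simplified single-split illustration on made-up toy values, not the tree fitted in the report; the names sse and best_split are hypothetical. In R's rpart, the cp value is, to our understanding, the minimum relative improvement in fit a split must achieve to be kept, which is the role of the final ratio below.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) of the split x < threshold that minimises
    the combined SSE of the two child nodes; threshold is None if no split helps."""
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                          # cannot split between equal x values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Toy data (hypothetical): rooms per dwelling and median value in $1000s.
rooms = [4.0, 5.0, 5.5, 6.0, 7.0, 8.0]
value = [15.0, 16.0, 17.0, 18.0, 35.0, 40.0]

threshold, child_sse = best_split(rooms, value)
relative_improvement = (sse(value) - child_sse) / sse(value)
print(threshold, child_sse, round(relative_improvement, 3))
```

A full tree repeats this search over every predictor and recursively within each child node, stopping when no split clears the complexity threshold.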