Table of Contents
1. Objective
2. Methodology
2.1 Defining the Premise
2.2 Collecting the data
2.3 Organizing the data
2.4.1 Visualizing the data
2.4.2 Hypothesis Testing
2.5 Analysis
2.5.1 Construction of the Model
2.5.2 Testing the Assumptions of OLS for the Final Model – A Graphical Representation
3. Limitations of the Study
Table of Figures
Table 1: Skewness & Kurtosis
Table 2: Results of Shapiro-Wilk W Test
Figure 1: Histogram & Box-Plot
Figure 2: Histogram & Probability Curve
Figure 3: Normal Probability Plot
Figure 4: Normal Quantiles Plot (QQ Plot)
Figure 5: Histogram & Probability Curve – Final Model
Figure 6: P-P Plot – Final Model
1. Objective
A retrospective cohort study of adult patients, diagnosed with a specific ailment and admitted to the
related unit of a medical center, was to be conducted in order to:
I. Identify demographic and clinical predictors associated with hospital Length of Stay (LOS) for
patients.
II. Predict LOS at the medical center on the basis of predictors found to have a statistically
significant bearing on the LOS.
2. Methodology
The structure of the approach adopted for the study was outlined using the DCOVA methodology:
• Define
• Collect
• Organize
• Visualize and
• Analyse
2.1 Defining the Premise
The Length of Stay (LOS) of patients depends on a host of factors, demographic as well as clinical.
Predicting LOS with accuracy can present a significant opportunity for the institution to better plan its
resources and amenities. The benefits would then not be confined to improved administration at the
institution but would also extend to a much improved overall experience for patients suffering from the specific
ailment.
Moreover, predicting LOS with resources readily available at the institution is a simple and
cost-effective opportunity too.
Thus, the study was aimed at conducting preliminary groundwork in order to identify and further
investigate factors, clinical as well as demographic, that can better predict LOS for the specific ailment in
complex patient populations admitted to the specialized unit of the medical center.
2.2 Collecting the data
Data on covariates, demographic as well as clinical, was collected from the medical center’s electronic
database for a six-month period. The covariates were identified on the basis of a priori knowledge and validated
criteria. The data collected covered related diagnoses of the specific ailment under study. The data so
collected was IRB approved.
To begin with, the data was cleaned of any anomalies. The final data available for further analysis
consisted of both categorical and continuous variables.
2.3 Organizing the data
A few of the variables in the dataset were continuous in nature and needed a meaningful categorization
for any further analysis. They were hence converted into an ordinal format. For other categorical
variables, dummies were created for N-1 sub-categories of each variable in order to avoid the
dummy variable trap and the resulting multi-collinearity. The target variable was identified and descriptive
statistics were then run.
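The N-1 dummy coding described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the function name `make_dummies` and the admission-type column are hypothetical, and the choice of the first category in sorted order as the reference is an assumption.

```python
def make_dummies(values, reference=None):
    """Encode a categorical column as N-1 indicator (0/1) columns.

    The reference category gets no column of its own; it is represented
    by all indicators being 0, which avoids the dummy variable trap.
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]  # assumption: first category is the baseline
    kept = [c for c in categories if c != reference]
    # One 0/1 column per non-reference category.
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Hypothetical admission-type column, for illustration only.
admission = ["elective", "emergency", "transfer", "emergency", "elective"]
dummies = make_dummies(admission)
```

With three categories, only two indicator columns are produced; including a third would make the columns sum to the intercept column and create perfect multi-collinearity.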
2.4.1 Visualizing the data
Further, a bi-variate analysis was performed for each of the explanatory variables along with the target
variable to identify and investigate any interesting patterns found in the descriptives of the variables.
In order to determine an appropriate statistical test for conducting a bi-variate analysis of explanatory
variables with respect to the dependent variable, tests for Normality of the dependent variable (LOS)
were performed. The results and their interpretation are as follows:
A. Skewness & Kurtosis
Moments
N                 346           Sum Weights        346
Mean              5.69653179    Sum Observations   1971
Std Deviation     5.71322794    Variance           32.6409734
Skewness          2.07916408    Kurtosis           5.0826891
Uncorrected SS    22489         Corrected SS       11261.1358
Coeff Variation   100.293093    Std Error Mean     0.30714504
Table 1: Skewness & Kurtosis
• A skewness greater than 0 indicates right-skewed data: most observations lie on the left, with a long tail to the right.
• A kurtosis greater than 3 indicates a sharper peak and heavier tails than a normal distribution.
• Thus, the variable LOS was highly skewed to the right, with a high peak and heavy tails.
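Several entries in Table 1 follow arithmetically from just three of them: N, the sum of observations and the uncorrected sum of squares. The sketch below recomputes the derived quantities as a consistency check. The variable names are illustrative, and the skewness and kurtosis are not recomputed because they require the raw data.

```python
import math

# Values taken from Table 1.
n = 346
total = 1971.0            # Sum Observations
uncorrected_ss = 22489.0  # sum of squared observations

mean = total / n
corrected_ss = uncorrected_ss - total ** 2 / n   # sum of squared deviations from the mean
variance = corrected_ss / (n - 1)                # sample variance (n - 1 denominator)
std_dev = math.sqrt(variance)
coeff_variation = 100 * std_dev / mean           # standard deviation as a percentage of the mean
std_error_mean = std_dev / math.sqrt(n)

print(f"mean={mean:.4f} variance={variance:.4f} std_dev={std_dev:.4f}")
print(f"cv={coeff_variation:.2f} se_mean={std_error_mean:.4f}")
```

A coefficient of variation of about 100% means the standard deviation is as large as the mean. For a non-negative quantity such as LOS, that is itself a sign of a long right tail.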
B. Shapiro-Wilk W Test
Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W      0.768865    Pr < W      <0.0001
Kolmogorov-Smirnov    D      0.205526    Pr > D      <0.0100
Cramer-von Mises      W-Sq   3.640111    Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq   21.50446    Pr > A-Sq   <0.0050
Table 2: Results of Shapiro-Wilk W Test
• Since the number of observations was less than 2000, a Shapiro-Wilk W test was
found to be appropriate in this case to test normality of the variable of interest.
• A W statistic close to 1 indicates normality.
• Thus, the W statistic rejected the null hypothesis that the variable is normally distributed: W
was 0.7689 and the p-value was less than 0.0001.
C. Histogram & Box-Plot
Figure 1: Histogram & Box-Plot
Figure 2: Histogram & Probability Curve
D. Normal Probability Plot
Figure 3: Normal Probability Plot
E. Normal Quantiles Plot (QQ Plot)
Figure 4: Normal Quantiles Plot (QQ Plot)
• The Histogram, Box Plot, Normal Probability Curve and QQ Plot shown above all indicated that
the variable LOS was not distributed normally.
• Hence, the use of a T-Test for statistically testing the associations between each of the
independent variables and LOS (target variable) was ruled out.
• For the purpose of analysis, LOS was therefore binned as a categorical variable in order to
establish the association with other categorical variables.
• A Chi-Square Test was thus used to test associations between the categorical outcome, i.e., LOS,
and other categorical determining variables.
• A contingency table format was used to bring out the associations between the variables.
However, in cases where the expected frequency of a cell was less than 5, Fisher’s exact test
was used.
• Further, since the target outcome, LOS, was a continuous variable, a Multiple Linear Regression
technique was used to model the data and explore the following:
o How much variance in LOS is accounted for by the linear combination of the
independent variables?
o How strongly is each independent variable related to LOS, as measured by its beta coefficient?
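The rule for choosing between the Chi-Square test and Fisher's exact test can be sketched as below. This is an illustration under stated assumptions: the contingency tables are hypothetical counts, the function names are invented for this example, and the Fisher computation is written for 2x2 tables only, whereas the study's tables may have been larger.

```python
from math import comb


def expected_counts(table):
    """Expected cell frequencies under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value for a 2x2 table."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def choose_test(table):
    """Fall back to Fisher's exact test when any expected count is below 5."""
    smallest = min(min(row) for row in expected_counts(table))
    return "fisher" if smallest < 5 else "chi-square"


# Hypothetical counts: rows = binned LOS (short, long), columns = a binary covariate.
sparse = [[3, 1], [1, 3]]
dense = [[20, 30], [30, 20]]
print(choose_test(sparse), fisher_exact_2x2(sparse))
print(choose_test(dense), chi_square_statistic(dense))
```

The Chi-Square test relies on a large-sample approximation that becomes unreliable with small expected counts, which is why the exact test is substituted in sparse tables.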
2.4.2 Hypothesis Testing
H0: There is no relation between the various explanatory variables and the target variable in the dataset
available.
H1: The null hypothesis is false; statistically significant covariates are identified at a significance level of 5%.
2.5 Analysis
2.5.1 Construction of the Model
2.5.1.1 Model Building
Multiple iterations were run in the following order before arriving at a model that had the highest
Adjusted R-Square:
i. Once an exploratory data analysis was conducted on the prepared data, a model was run with
the as-is variables. This was Model 1 (R-Square = 0.3679, Adjusted R-Square = 0.3288).
ii. On the basis of the results obtained, dummies were created for certain variables for N-1
categories of each such variable in order to avoid the dummy variable trap. The model was then run with the
dummies so created. This was Model 2 (R-Square = 0.3795, Adjusted R-Square = 0.3226).
iii. Next, the covariates found to be statistically insignificant were dropped and only those
covariates that were found to be statistically significant at 5% in Model 2 were retained in the
subsequent model run, along with the dummy variables created for Model 2. This was Model 3
(R-Square = 0.3564, Adjusted R-Square = 0.3168).
iv. Further, statistically insignificant variables in Model 3 were dropped and the fourth iteration
was run for the significant variables in Model 3 along with the dummies. This was Model 4
(R-Square = 0.3456, Adjusted R-Square = 0.3179).
v. The VIF (Variance Inflation Factor) scores of the variables in Model 4 were obtained. The cut-off
for the VIF scores was set at 10. Covariates with a VIF score greater than 10 were
dropped and the model was re-run. This was Model 5 (R-Square = 0.3274, Adjusted R-Square = 0.3135).
vi. Model 6 was obtained for only the significant variables in Model 5 (R-Square = 0.3255, Adjusted R-Square = 0.3215).
vii. Model 7 was attempted by creating certain interaction variables from the Model 4 results
(R-Square = 0.3270, Adjusted R-Square = 0.3171).
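Two quantities drive the iterations above: Adjusted R-Square, which penalises R-Square for the number of covariates, and the VIF, which is computed from the R-Square of regressing one covariate on all the others. The sketch below applies the standard formulas. It reads the first figure reported for each model as R-Square, and it assumes that Model 6 was fitted on all 346 observations from Table 1 with the two covariates mentioned in Section 2.5.1.2; neither assumption is confirmed by the report.

```python
def adjusted_r_square(r_square, n_obs, n_predictors):
    """Adjusted R-Square = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_predictors - 1)


def vif(r_square_j):
    """VIF of covariate j, given R-Square from regressing it on the other covariates."""
    return 1 / (1 - r_square_j)


# Model 6, assuming n = 346 and two covariates (see lead-in).
model6_adj = adjusted_r_square(0.3255, n_obs=346, n_predictors=2)
print(f"Model 6 adjusted R-Square: {model6_adj:.4f}")

# A VIF cut-off of 10 corresponds to an auxiliary R-Square of 0.9.
vif_cutoff = 10
print(vif(0.9) >= vif_cutoff - 1e-9)
```

Under these assumptions the computed value agrees with the reported 0.3215 to three decimal places, which supports reading the first reported figure as R-Square rather than the correlation R.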
2.5.1.2 Tests for assumptions of Ordinary Least Squares and BLUE estimates
Hence, Model 6 (R-Square = 0.3255, Adjusted R-Square = 0.3215) was obtained as a parsimonious model with only two
covariates that emerged as statistically significant at 5%. However, though the results of Model
6 met the OLS (Ordinary Least Squares) assumptions of Linearity, Normality (Shapiro-Wilk W
Stat: 0.81) and Independence (Durbin-Watson D Stat: 1.93), the assumption of
Homoskedasticity (White’s Test & Breusch-Pagan Test P < 0.0001) was unmet. This meant that
the OLS estimates were not BLUE (Best Linear Unbiased Estimates).
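The Durbin-Watson D statistic used above for the Independence assumption is computed from the model residuals alone. A minimal sketch follows; the residual sequences are hypothetical and chosen only to show the two extremes of the statistic.

```python
def durbin_watson(residuals):
    """D = sum of squared successive differences / sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals illustrating the range of the statistic.
alternating = [1.0, -1.0, 1.0, -1.0]   # strong negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]        # strong positive autocorrelation
print(durbin_watson(alternating))
print(durbin_watson(constant))
```

D ranges from 0 to 4, and values near 2 suggest no first-order autocorrelation in the residuals, so the reported D of 1.93 is consistent with the Independence assumption being met.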
The following steps were then performed on Model 6 in order to obtain estimates that not only
met all OLS assumptions but were also BLUE:
1. Weighted Least Squares was performed on this model (Adjusted R-Square = 0.6649; W = 0.12;
White’s Test & Breusch-Pagan Test P < 0.0001; D = 1.93). While the assumption of Independence
was met, all other OLS assumptions were found to be violated.
2. Transformation of the variables in the model stated above was then attempted over
several iterations until the highest value of Adjusted R-Square (0.3813) with P < 0.0001 was
obtained (taking the log of both the X and Y variables resulted in the highest Adjusted R-Square). OLS
assumptions were then tested for: Linearity (R = 0.58), Normality (W = 0.99), Homoskedasticity
(White’s Test P < 0.0001; Breusch-Pagan Test P = 0.3219) and Independence (D = 1.907). Though
the OLS assumptions were largely met, the assumption of Homoskedasticity was still unmet as
per White’s Test but was met as per the Breusch-Pagan Test.
3. A final attempt was made to validate the results of White’s Test obtained after the
transformation stated above. Weighted Least Squares was then performed on the model arrived
at from the transformation of the variables stated above (Adjusted R-Square = 0.3884; W =
0.98; White’s Test: P < 0.0001 & Breusch-Pagan Test P = 0.3129; D = 1.907). The results confirmed
that all OLS assumptions were largely met and that the parameter estimates were BLUE.
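The combination used in the final step, Weighted Least Squares on log-transformed variables, can be sketched for a single predictor as below. The report does not state which covariates or weights were used, so this is only a sketch: the `severity` predictor, the data values and the 1/x weights are all hypothetical, and the actual model had two covariates.

```python
import math


def weighted_least_squares(x, y, w):
    """Fit y = a + b*x by minimising sum(w_i * (y_i - a - b*x_i)**2)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data that follow LOS = 2 * severity**0.5 exactly.
severity = [1.0, 2.0, 4.0, 8.0, 16.0]
los = [2.0 * s ** 0.5 for s in severity]

# Log-transform both sides: log(LOS) = log(2) + 0.5 * log(severity).
log_x = [math.log(s) for s in severity]
log_y = [math.log(v) for v in los]

# Hypothetical weights that down-weight observations assumed to be noisier.
weights = [1.0 / s for s in severity]

intercept, slope = weighted_least_squares(log_x, log_y, weights)
print(f"intercept={intercept:.4f} slope={slope:.4f}")
```

In a log-log model the slope is read as an elasticity: a 1% change in the predictor is associated with roughly a slope% change in LOS. The weights matter only when the data are noisy; here the relationship is exact, so any positive weights recover the same coefficients.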
2.5.2 Testing the Assumptions of OLS for the Final Model – A Graphical Representation
The following shows the results of the statistical procedures performed in order to determine whether
the OLS assumptions were all met for the model in which the Weighted Least Squares method was
performed on transformed variables (Adjusted R-Square = 0.3884; W = 0.98; White’s Test: P < 0.0001 &
Breusch-Pagan Test P = 0.3129; D = 1.907):
A. Histogram
Figure 5: Histogram & Probability Curve – Final Model
B. P-P Plot
Figure 6: P-P Plot – Final Model
The plots shown above, hence, also confirmed that the OLS assumptions were met in the Final Model
and that the Beta estimates obtained were BLUE.
3. Limitations of the Study
This being a pilot study, the sample data set available was small in size. As a result, the following
limitations were found to be inherent in the overall model:
1. The impact of interaction variables, depicting the combined effect of variables found to be
insignificant but otherwise expected to have a significant bearing on the model, could not be
explored.
2. Lack of significance of covariates that were examined as a part of the study and high
multi-collinearity amongst covariates expected to have a bearing on the primary outcome appeared
to be largely due to a small sample size. A larger sample data set is expected to allow the
covariates and their bearing on the dependent variable to be explored better.
