Linear Model Artefact

Table of Contents
1. Objective
2. Methodology
2.1 Defining the Premise
2.2 Collecting the data
2.3 Organizing the data
2.4.1 Visualizing the data
2.4.2 Hypothesis Testing
2.5 Analysis
2.5.1 Construction of the Model
2.5.2 Testing the Assumptions of OLS for the Final Model – A Graphical Representation
3. Limitations of the Study
Table of Figures
Table 1: Skewness & Kurtosis
Table 2: Results of Shapiro-Wilk W Test
Figure 1: Histogram & Box-Plot
Figure 2: Histogram & Probability Curve
Figure 3: Normal Probability Plot
Figure 4: Normal Quantiles Plot (QQ Plot)
Figure 5: Histogram & Probability Curve – Final Model
Figure 6: P-P Plot – Final Model

1. Objective
A retrospective cohort study of adult patients, diagnosed with a specific ailment and admitted to the
related unit of a medical center, was to be conducted in order to:
I. Identify demographic and clinical predictors associated with hospital Length of Stay (LOS) for
patients.
II. Predict LOS at the medical center on the basis of predictors found to have a statistically
significant bearing on the LOS.
2. Methodology
The structure of the approach adopted for the study was outlined using the DCOVA methodology:
• Define
• Collect
• Organize
• Visualize and
• Analyse
2.1 Defining the Premise
The Length of Stay (LOS) of patients depends on a host of factors, demographic as well as clinical.
Predicting LOS with accuracy can present a significant opportunity for the institution to better plan its
resources and amenities. The benefit would then not be confined to improved administration at the
institution but would also extend to a much improved overall experience for patients suffering from
the specific ailment.
Moreover, predicting LOS with resources readily available at the institution is a simple and
cost-effective opportunity too.
Thus, the study was aimed at conducting preliminary groundwork in order to identify and further
investigate factors, clinical as well as demographic, that can better predict LOS for the specific ailment in
complex patient populations admitted to the specialized unit of the medical center.
2.2 Collecting the data
Data on covariates, demographic as well as clinical, was collected from the medical center’s electronic
database for a six-month period. The covariates were identified using a priori knowledge and validated
criteria. The data collected covered related diagnoses of the specific ailment under study. The data so
collected was IRB approved.

To begin with, the data was cleaned of any anomalies. The final data available for further analysis
consisted of both categorical and continuous variables.
2.3 Organizing the data
A few of the variables in the data set were continuous in nature and needed a meaningful categorization
for any further analysis. They were hence converted into an ordinal format. For other categorical
variables, dummies were created for N-1 sub-categories of each variable in order to avoid the dummy
variable trap and the resulting multi-collinearity. The target variable was identified and Descriptive
Statistics were then run.
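The N-1 dummy coding described above can be sketched as follows. This is a minimal illustration only: the column name, its sub-categories and the helper `dummy_encode` are hypothetical and are not taken from the study data.

```python
def dummy_encode(values, drop_first=True):
    """Encode a categorical column as 0/1 dummies.

    With drop_first=True the first level (alphabetically) becomes the
    reference category, so N sub-categories yield N-1 dummy columns.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    return [{f"is_{lvl}": int(v == lvl) for lvl in kept} for v in values]


# Hypothetical categorical variable with N = 3 sub-categories.
admission = ["elective", "emergency", "transfer", "emergency"]

rows = dummy_encode(admission)                        # N-1 = 2 dummies per row
full_rows = dummy_encode(admission, drop_first=False)  # all N = 3 dummies

# With all N dummies, every row sums to 1, which duplicates the intercept
# column of a regression: this is the dummy variable trap.
row_sums_full = [sum(r.values()) for r in full_rows]
```

With N-1 dummies the reference category ("elective" here) is represented by all dummies being 0, so the design matrix no longer contains a column that is an exact linear combination of the others.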
2.4.1 Visualizing the data
Further, a bi-variate analysis was performed for each of the explanatory variables along with the target
variable to identify and investigate any interesting patterns found in the descriptives of the variables.
In order to determine an appropriate statistical test for conducting a bi-variate analysis of explanatory
variables with respect to the dependent variable, tests for Normality of the dependent variable (LOS)
were performed. The results and their interpretation are as follows:
A. Skewness & Kurtosis
Moments
N                 346           Sum Weights        346
Mean              5.69653179    Sum Observations   1971
Std Deviation     5.71322794    Variance           32.6409734
Skewness          2.07916408    Kurtosis           5.0826891
Uncorrected SS    22489         Corrected SS       11261.1358
Coeff Variation   100.293093    Std Error Mean     0.30714504
Table 1: Skewness & Kurtosis
• A Skewness of > 0 indicated right-skewed data, with most observations concentrated on the left
and a long right tail
• A Kurtosis of > 3 indicated a sharper peak and heavier tails than a normal distribution
• Thus, the variable, LOS, was highly skewed to the right with a high peak and a heavy right tail
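As a consistency check, most entries of Table 1 can be recomputed from just three of its values: N, Sum Observations and Uncorrected SS. The sketch below uses the standard sample formulas (variance with an N-1 denominator); skewness and kurtosis cannot be reproduced this way because they need the raw observations.

```python
import math

# Values taken from Table 1.
n = 346
total = 1971             # Sum Observations
uncorrected_ss = 22489   # sum of squared LOS values

mean = total / n
# Corrected SS = sum of squared deviations from the mean.
corrected_ss = uncorrected_ss - total ** 2 / n
variance = corrected_ss / (n - 1)           # sample variance
std_dev = math.sqrt(variance)
coeff_variation = 100 * std_dev / mean      # in percent
std_error_mean = std_dev / math.sqrt(n)

print(f"Mean            {mean:.8f}")
print(f"Corrected SS    {corrected_ss:.4f}")
print(f"Variance        {variance:.7f}")
print(f"Std Deviation   {std_dev:.8f}")
print(f"Coeff Variation {coeff_variation:.6f}")
print(f"Std Error Mean  {std_error_mean:.8f}")
```

A coefficient of variation of about 100% means the standard deviation is as large as the mean, which is in line with the strong right skew reported for LOS.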

B. Shapiro-Wilk W Test
Tests for Normality
Test                  Statistic           p Value
Shapiro-Wilk          W      0.768865     Pr < W      <0.0001
Kolmogorov-Smirnov    D      0.205526     Pr > D      <0.0100
Cramer-von Mises      W-Sq   3.640111     Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq   21.50446     Pr > A-Sq   <0.0050
Table 2: Results of Shapiro-Wilk W Test
• Since the no. of observations was less than 2000, a Shapiro-Wilk W test was
found to be appropriate in this case to test normality of the variable of interest.
• A W statistic close to 1 indicates normality.
• With W = 0.7689 and a p-value of less than 0.0001, the null hypothesis that the
variable is normally distributed was rejected.
C. Histogram & Box-Plot
Figure 1: Histogram & Box-Plot

Figure 2: Histogram & Probability Curve
D. Normal Probability Plot
Figure 3: Normal Probability Plot

E. Normal Quantiles Plot (QQ Plot)
Figure 4: Normal Quantiles Plot (QQ Plot)
• The Histogram, Box Plot, Normal Probability Curve and QQ Plot shown above all indicated that
the variable LOS was not distributed normally.
• Hence, the use of a T-Test for statistically testing the associations between each of the
independent variables and LOS (target variable) was ruled out.
• For the purpose of the bi-variate analysis, LOS was therefore binned as a categorical variable in
order to establish the association with other categorical variables.
• A Chi-Square Test was thus used to test associations between the categorical outcome, i.e., LOS,
and other categorical determining variables.
• A Contingency table format was used to bring out the associations between the variables.
However, in cases where the expected frequency of a cell was less than 5, a Fisher’s Exact Test
was used.
• Further, since the target outcome, LOS, was a continuous variable, a Multiple Linear Regression
technique was used to model the data and explore the following:
o How much variance in LOS is accounted for by the linear combination of the
independent variables?
o How strongly is each independent variable related to LOS, as measured by its beta
coefficient?
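The choice between the Chi-Square Test and Fisher’s Exact Test described above rests on the expected cell frequencies of the contingency table. A minimal sketch of that check is given below; the counts are hypothetical and the helper names are illustrative, not part of the study.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def needs_fisher(table, threshold=5):
    """True if any expected cell count falls below the threshold."""
    return any(e < threshold for row in expected_counts(table) for e in row)


# Hypothetical counts: rows = binned LOS (short, long), columns = a binary covariate.
large_table = [[30, 10], [20, 40]]
small_table = [[3, 1], [2, 6]]

stat = chi_square_statistic(large_table)
use_fisher_large = needs_fisher(large_table)
use_fisher_small = needs_fisher(small_table)
```

For the larger table all expected counts are at least 20, so the chi-square statistic applies; for the smaller table every expected count is below 5, which is the situation in which the study fell back on Fisher’s Exact Test.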

2.4.2 Hypothesis Testing
Ho: There is no relation between the various explanatory variables and the target variable in the data set
available.
H1: Ho is false; at least one covariate has a statistically significant relation with the target variable at a
significance level of 5%.
2.5 Analysis
2.5.1 Construction of the Model
2.5.1.1 Model Building
Multiple iterations were run in the following order before arriving at a model that had the highest
Adjusted R-Square:
i. Once an exploratory data analysis was conducted on the prepared data, a model was run with
the as-is variables. This was Model 1 (R-Square = 0.3679, Adjusted R-Square = 0.3288).
ii. On the basis of the results obtained, dummies were created for certain variables for N-1
categories of each such variable in order to avoid the dummy variable trap. The model was then
run with the dummies so created. This was Model 2 (R-Square = 0.3795, Adjusted R-Square = 0.3226).
iii. Next, the covariates found to be statistically insignificant were dropped and only those
covariates that were found to be statistically significant at 5% in Model 2 were retained in the
subsequent model run, along with the dummy variables created for Model 2. This was Model 3
(R-Square = 0.3564, Adjusted R-Square = 0.3168).
iv. Further, statistically insignificant variables in Model 3 were dropped and the fourth iteration
was run with the significant variables in Model 3 along with the dummies. This was Model 4
(R-Square = 0.3456, Adjusted R-Square = 0.3179).
v. The VIF (Variance Inflation Factor) scores of variables in Model 4 were obtained. The cut-off
criterion for the VIF scores was set at 10. Covariates with a VIF score greater than 10 were
dropped and the model was re-run. This was Model 5 (R-Square = 0.3274, Adjusted R-Square = 0.3135).
vi. Model 6 was obtained for only the significant variables in Model 5
(R-Square = 0.3255, Adjusted R-Square = 0.3215).
vii. Model 7 was attempted by creating certain interaction variables from Model 4 results
(R-Square = 0.3270, Adjusted R-Square = 0.3171).
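The two quantities that drive the iterations above, Adjusted R-Square and the VIF score, follow from simple formulas. The sketch below assumes that the first figure quoted for each model is its R-Square, that all 346 observations from Table 1 entered the fit, and that Model 6 has two covariates (as stated for Model 6 in the study); the function names are illustrative.

```python
def adjusted_r_square(r_square, n, p):
    """Adjusted R-Square for a model with p covariates fitted on n observations."""
    return 1 - (1 - r_square) * (n - 1) / (n - p - 1)


def vif(aux_r_square):
    """Variance Inflation Factor of a covariate.

    aux_r_square is the R-Square obtained by regressing that covariate
    on all the other covariates in the model.
    """
    return 1 / (1 - aux_r_square)


# Model 6: R-Square = 0.3255, assumed n = 346 and p = 2.
model_6_adjusted = adjusted_r_square(0.3255, n=346, p=2)
print(f"Model 6 Adjusted R-Square: {model_6_adjusted:.4f}")

# The VIF cut-off of 10 corresponds to an auxiliary R-Square of 0.90.
print(f"VIF at auxiliary R-Square 0.90: {vif(0.90):.2f}")
```

Under these assumptions the recomputed value agrees with the reported Adjusted R-Square of 0.3215 to within rounding of the published R-Square. The penalty term (n - 1) / (n - p - 1) grows with the number of covariates, which is why dropping insignificant covariates can leave Adjusted R-Square almost unchanged even though R-Square falls.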

2.5.1.2 Tests for assumptions of Ordinary Least Squares and BLUE estimates
Hence, Model 6 (R-Square = 0.3255, Adjusted R-Square = 0.3215) was obtained as a parsimonious model
with only two covariates that emerged as statistically significant at 5%. However, though the results of
Model 6 met the OLS (Ordinary Least Squares) assumptions of Linearity, Normality (Shapiro-Wilk W
Stat: 0.81) and Independence (Durbin-Watson D Stat: 1.93), the assumption of
Homoskedasticity (White’s Test & Breusch Pagan Test P < 0.0001) was unmet. This meant that
the OLS estimates were not BLUE (Best Linear Unbiased Estimates).
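The Durbin-Watson D statistic used above for the Independence assumption is computed directly from the model residuals. The following is a minimal sketch with hypothetical residuals, not the residuals of Model 6.

```python
def durbin_watson(residuals):
    """Durbin-Watson D: squared successive differences over squared residuals.

    D lies between 0 and 4; values near 2 suggest no first-order
    autocorrelation, values near 0 positive and near 4 negative autocorrelation.
    """
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual sequences for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]   # strong negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]        # strong positive autocorrelation

d_alternating = durbin_watson(alternating)
d_constant = durbin_watson(constant)
```

The reported values of 1.93 and 1.907 sit close to 2, which is why the Independence assumption was treated as met.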
The following steps were then performed on Model 6 in order to obtain estimates that not only
met all OLS assumptions but were also BLUE:
1. Weighted Least Squares was performed on this model (Adjusted R-Square = 0.6649; W = 0.12;
White’s Test & Breusch Pagan Test P < 0.0001; D = 1.93). While the assumption of Independence
was met, all other OLS assumptions were found to be violated.
2. Transformation of the covariates obtained in the model stated above was then attempted with
several iterations until the highest value of Adjusted R-Square (0.3813) with P < 0.0001 was
obtained (taking the log of both the X and Y variables resulted in the highest Adjusted R-Square).
OLS assumptions were then tested for: Linearity (R = 0.58), Normality (W = 0.99), Homoskedasticity
(White’s Test P < 0.0001; Breusch Pagan Test P = 0.3219) and Independence (D = 1.907). Though
the OLS assumptions were largely met, the assumption of Homoskedasticity was still unmet as
per White’s Test but was met as per the Breusch Pagan Test.
3. A final attempt was made to validate the results of White’s Test obtained after the
transformation stated above. Weighted Least Squares was then performed on the model arrived
at from the transformation of the variables stated above (Adjusted R-Square = 0.3884; W =
0.98; White’s Test: P < 0.0001 & Breusch Pagan Test P = 0.3129; D = 1.907). The results confirmed
that all OLS assumptions were largely met and that the parameter estimates were BLUE.
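The combination used in the final step, a log transformation of both sides followed by Weighted Least Squares, can be sketched for a single covariate as below. The data, the weights and the helper `weighted_least_squares` are hypothetical: the study used two covariates and does not state how its weights were chosen.

```python
import math


def weighted_least_squares(x, y, w):
    """Fit y = intercept + slope * x by minimising sum(w * residual^2)."""
    sum_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data following y = 3 * x^0.5 exactly.
x = [1.0, 2.0, 4.0, 8.0]
y = [3.0 * xi ** 0.5 for xi in x]

# Log-log transformation: log(y) = log(3) + 0.5 * log(x).
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]

# Hypothetical weights; larger weights pull the fit towards those observations.
weights = [1.0, 2.0, 1.0, 0.5]

intercept, slope = weighted_least_squares(log_x, log_y, weights)
```

In the log-log form the slope reads as an elasticity: the approximate percentage change in the outcome for a one percent change in the covariate. In practice the weights are commonly taken as the inverse of an estimated error variance, so that noisier observations count for less.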
2.5.2 Testing the Assumptions of OLS for the Final Model – A Graphical Representation
The following shows the results of the statistical procedures performed in order to determine whether
the OLS assumptions were all met for the model in which the Weighted Least Squares Method was
performed on transformed variables (Adjusted R-Square = 0.3884; W = 0.98; White’s Test: P < 0.0001 &
Breusch Pagan Test P = 0.3129; D = 1.907):

A. Histogram
Figure 5: Histogram & Probability Curve – Final Model
B. P-P Plot
Figure 6: P-P Plot – Final Model

The plots shown above, hence, also confirmed that the OLS assumptions were met in the Final Model
and that the Beta estimates obtained were BLUE.
3. Limitations of the Study
This being a pilot study, the sample data set available was small in size. As a result, the following
limitations were found to be inherent in the overall model:
1. The impact of interaction variables, depicting the combined effect of variables found to be
insignificant but otherwise expected to have a significant bearing on the model, could not be
explored.
2. Lack of significance of covariates that were examined as a part of the study and high
multi-collinearity amongst covariates expected to have a bearing on the primary outcome appeared
to be largely due to a small sample size. A larger sample data set is expected to allow the
covariates and their bearing on the dependent variable to be explored better.
