Factors influencing the Human Development Index (HDI) using Multiple Linear Regression
1. Factors influencing the Human Development Index (HDI) using Multiple Linear Regression. Aditya Panuganti, 1202062944, Industrial Engineering. Year of data: 2008. Source: UN Development Programme database.
2. Objective and dataset description: to find which of the following variables have an effect on the Human Development Index (HDI).
3. Fitting the full model without interaction terms
The regression equation for the full model is
y = 0.0596 + 0.00440 LIF + 0.000007 GDP - 0.000748 GRO + 0.0158 SCH + 0.0080 GEN + 0.0159 EXP - 0.000004 GNI + 0.000003 MAT - 0.000051 HOM - 0.000540 MOR + 0.000176 LIT - 0.0185 DEP + 0.0023 CON1 - 0.0117 CON2 - 0.0100 CON3 + 0.00431 CON4 - 0.0268 CON5
The coefficients of this equation are difficult to interpret, so the regression coefficients were standardized using unit normal scaling.
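As a rough illustration of unit normal scaling, the sketch below standardizes two regressors on very different scales and refits the model. The data are synthetic stand-ins (the names `life` and `gdp`, the helper `fit_ols`, and all numbers are assumptions, not the UNDP dataset). It is meant only to show why the scaled slopes become comparable: each one equals the raw slope multiplied by that regressor's standard deviation, and the intercept becomes the mean of the response.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two regressors on very different scales
# (hypothetical data, not the UNDP dataset).
n = 102
life = rng.normal(70, 10, n)        # e.g. years
gdp = rng.normal(12000, 9000, n)    # e.g. dollars per capita
y = 0.1 + 0.004 * life + 0.000007 * gdp + rng.normal(0, 0.02, n)


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


X = np.column_stack([life, gdp])
raw = fit_ols(X, y)

# Unit normal scaling: subtract the mean and divide by the sample
# standard deviation of each regressor. The response is left unscaled.
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)
Z = (X - mean) / std
scaled = fit_ols(Z, y)

print("raw slopes   :", raw[1:])
print("scaled slopes:", scaled[1:])
print("scaled intercept vs mean of y:", scaled[0], y.mean())
```

After scaling, each slope is the change in y per one standard deviation of the regressor, so magnitudes can be compared directly.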
4. Fitting the full model after standardization
The regression equation is
y = 0.684 + 0.0404 LIF + 0.100 GDP - 0.0117 GRO + 0.0408 SCH + 0.00136 GEN + 0.0443 EXP - 0.0627 GNI + 0.00089 MAT - 0.00068 HOM - 0.0196 MOR + 0.00259 LIT - 0.0185 DEP + 0.0023 CON1 - 0.0117 CON2 - 0.0100 CON3 + 0.00431 CON4 - 0.0268 CON5
Model statistics: R-Sq = 98.5%, R-Sq(adj) = 98.2%
Analysis of Variance (ANOVA)
Source            DF   SS        MS        F        P
Regression        17   2.21784   0.13046   325.49   0.000
Residual Error    84   0.03367   0.00040
Total            101   2.25150
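The entries of the ANOVA table above are tied together by simple arithmetic, which the short sketch below reproduces from the reported degrees of freedom and sums of squares. The variable names are illustrative. Small differences from the reported F value are expected, because the table shows rounded sums of squares.

```python
# Values copied from the ANOVA table above.
df_reg, df_err, df_tot = 17, 84, 101
ss_reg, ss_err, ss_tot = 2.21784, 0.03367, 2.25150

# Mean squares are sums of squares divided by their degrees of freedom.
ms_reg = ss_reg / df_reg
ms_err = ss_err / df_err

# The F statistic is the ratio of the two mean squares.
f_stat = ms_reg / ms_err

# R-squared and adjusted R-squared from the same quantities.
r_sq = 1 - ss_err / ss_tot
r_sq_adj = 1 - (ss_err / df_err) / (ss_tot / df_tot)

print(f"MS regression = {ms_reg:.5f}")
print(f"MS error      = {ms_err:.5f}")
print(f"F             = {f_stat:.2f}")
print(f"R-Sq          = {100 * r_sq:.1f}%")
print(f"R-Sq(adj)     = {100 * r_sq_adj:.1f}%")
```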
7. Indicator interactions
Interaction terms between DEP and other numerical variables were considered, giving 24 variables in all including the interaction terms.
S = 0.0220704, R-Sq = 98.3%, R-Sq(adj) = 97.8%, R-Sq(pred) = 96.80%
[Figure: residual plots]
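One way interaction columns of this kind could be built is sketched below, assuming DEP is coded as a 0/1 indicator (the slides do not state the coding). The data are synthetic, only two numerical regressors are shown, and the `_D` suffix simply mirrors names such as GDP_D and MOR_D that appear later in the selected model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 102

# Hypothetical data: two numerical regressors and a 0/1 indicator.
numeric = {
    "GDP": rng.normal(0, 1, n),
    "MOR": rng.normal(0, 1, n),
}
dep = rng.integers(0, 2, n).astype(float)  # assumed 0/1 coding of DEP

# Each interaction column is the product of the indicator and a regressor,
# so it equals the regressor where DEP = 1 and zero elsewhere.
columns = dict(numeric)
columns["DEP"] = dep
for name, values in numeric.items():
    columns[f"{name}_D"] = values * dep

names = list(columns)
X = np.column_stack([columns[k] for k in names])
print(names)
print(X.shape)
```

With this coding, the coefficient of an interaction column is the change in that regressor's slope when DEP = 1.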
9. Other outliers in graph
The model was refit to check whether each of data points 45, 50 and 80 changes the summary statistics. These points have neither high leverage nor high influence; they are outliers only. R-Sq also does not change much, so they are left in the model.
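Leverage and influence checks of this kind can be sketched with the hat-matrix diagonal and Cook's distance. The code below is a minimal illustration on synthetic data with one planted outlier in the response. The function name `influence_measures`, the data, and the rule-of-thumb cutoff 2p/n are assumptions for illustration, not the procedure used in the slides.

```python
import numpy as np


def influence_measures(X, y):
    """Return (leverage, Cook's D, studentized residuals) for OLS with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    n, p = A.shape
    # Hat-matrix diagonal via QR: h_ii is the squared norm of row i of Q.
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q**2, axis=1)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    mse = resid @ resid / (n - p)
    r = resid / np.sqrt(mse * (1 - h))   # internally studentized residuals
    cooks = r**2 * h / (p * (1 - h))     # Cook's distance
    return h, cooks, r


rng = np.random.default_rng(2)
n = 102
X = rng.normal(size=(n, 3))
y = 0.7 + X @ np.array([0.04, 0.1, -0.02]) + rng.normal(0, 0.02, n)
y[44] += 0.2   # plant an outlier in the response at (1-based) point 45

h, cooks, r = influence_measures(X, y)
p = X.shape[1] + 1
print("leverage cutoff 2p/n:", 2 * p / n)
print("point 45: leverage =", h[44], "Cook's D =", cooks[44], "residual =", r[44])
```

A point with a large studentized residual but leverage below the cutoff and a small Cook's distance is an outlier in the response without being influential.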
12. Residual plots after transformation
Some outliers are visible in the normal probability plot.
16. Fit the selected model
Regression equation:
y2 = 0.476 - 0.0164 GEN + 0.0403 GRO + 0.0422 LIF + 0.0557 GDP + 0.0449 SCH - 0.0181 CON2 - 0.0388 MOR + 0.0523 GDP_D + 0.0289 CON5 + 0.0412 MOR_D - 0.0476 HOM_D
Multicollinearity was detected using principal component analysis: condition number = 134.837 (> 100, moderate multicollinearity).
Linear dependency equation: 0.107 GRO + 0.337 LIF + 0.798 MOR - 0.467 MOR_D (the dependency between the variables in the equation).
The correlation matrix showed that MOR is highly correlated with LIF and MOR_D. Dropping MOR removed the multicollinearity from the model: condition number = 39.04617 (< 100, no multicollinearity).
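A condition-number check of this kind can be sketched as the ratio of the largest to the smallest eigenvalue of the regressors' correlation matrix, with the eigenvector of the smallest eigenvalue giving the weights of the near-linear dependency. The code below assumes that definition (the slides do not spell out the exact computation). It uses synthetic data in which a hypothetical `mor` is nearly a linear function of `lif`.

```python
import numpy as np


def condition_number(X):
    """Return (lambda_max / lambda_min, eigenvalues, eigenvectors) of corr(X)."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    return eigvals[-1] / eigvals[0], eigvals, eigvecs


rng = np.random.default_rng(3)
n = 102
lif = rng.normal(size=n)
gro = rng.normal(size=n)
# Hypothetical near-dependency: mor is almost a linear function of lif.
mor = -0.95 * lif + 0.1 * rng.normal(size=n)

X_full = np.column_stack([gro, lif, mor])
kappa_full, vals, vecs = condition_number(X_full)
# The eigenvector of the smallest eigenvalue holds the dependency weights
# among the standardized regressors.
print("condition number with mor   :", kappa_full)
print("dependency weights (gro, lif, mor):", vecs[:, 0])

X_reduced = np.column_stack([gro, lif])
kappa_reduced, _, _ = condition_number(X_reduced)
print("condition number without mor:", kappa_reduced)
```

Dropping one variable from the near-dependent pair brings the ratio back below the threshold of 100 used in the slide.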
20. Conclusion
The reduced model has a better R-Sq than the original model, and most of its variables are significant (low p-values). The following variables were found to be significant: gender inequality index; combined gross enrolment; life expectancy at birth; GDP; mean schooling years; countries in continent 2; GDP & intensity of deprivation; under-5 mortality rate & intensity of deprivation; homicide rate & intensity of deprivation.
21. Possible improvements
More data points; ridge regression to reduce the effect of multicollinearity; robust regression to down-weight the outlying data points while retaining them in the model.
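As a minimal sketch of the ridge regression idea, the code below solves the penalized normal equations for standardized regressors and a centered response. The data, the helper `ridge_fit`, and the penalty values `lam` are hypothetical. In practice the penalty would be chosen by a ridge trace or cross-validation.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients for standardized regressors and a centered response."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yc = y - y.mean()
    p = Z.shape[1]
    # Solve (Z'Z + lam * I) b = Z'y; lam = 0 reduces to ordinary least squares.
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ yc)


rng = np.random.default_rng(4)
n = 102
lif = rng.normal(size=n)
mor = -0.95 * lif + 0.1 * rng.normal(size=n)   # hypothetical collinear pair
X = np.column_stack([lif, mor])
y = 0.7 + 0.04 * lif - 0.04 * mor + rng.normal(0, 0.02, n)

b_ols = ridge_fit(X, y, 0.0)
b_ridge = ridge_fit(X, y, 10.0)
print("lam = 0 :", b_ols)
print("lam = 10:", b_ridge)
```

The penalty shrinks the coefficient vector and stabilizes it when regressors are nearly collinear, at the cost of some bias.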