Blanka Láng, László Kovács and László Mohácsi: Linear regression model selection using a hybrid genetic improved harmony search parallelized algorithm
3. Linear Regression
We have:
$Y$: dependent variable
$X = X_1, X_2, \dots, X_m$: vector of independent variables
Goal:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m + \varepsilon$
OLS model: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_m X_m = \hat{\beta}_0 + \sum_{j=1}^{m} \hat{\beta}_j X_j$
Parsimony: $X' \subseteq X$ → minimize the residuals while using as few independents as possible → maximize the model's ability to generalize.
Partial effects of the independents → keep only significant variables in the model; these hypotheses can be statistically tested.
Objective functions:
AIC, SBC, HQC → MIN
adjusted R² → MAX
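The deck does not spell these criteria out, so here is a minimal Python sketch (the language the authors already use for PLS), assuming the standard log-likelihood-based definitions; the helper name `ols_criteria` is ours, and the slides appear to report the criteria on a different (possibly normalised) scale, so only the rankings should be compared.

```python
# Hedged sketch, assuming textbook forms of the criteria for a Gaussian
# OLS model with n observations and k estimated coefficients.
import numpy as np

def ols_criteria(X, y):
    """Fit OLS on a design matrix X (intercept column included) and
    return the selection criteria listed on the slide."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = resid @ resid                                   # residual sum of squares
    tss = ((y - y.mean()) ** 2).sum()                     # total sum of squares
    loglik = -n / 2 * (np.log(2 * np.pi * rss / n) + 1)   # Gaussian log-likelihood
    return {
        "AIC": -2 * loglik + 2 * k,                       # -> MIN
        "SBC": -2 * loglik + k * np.log(n),               # -> MIN (a.k.a. BIC)
        "HQC": -2 * loglik + 2 * k * np.log(np.log(n)),   # -> MIN
        "adj_R2": 1 - (rss / (n - k)) / (tss / (n - 1)),  # -> MAX
    }
```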
4. Dataset #1
Body Fat Measurements – real dataset from 1996
$n = 252$
$Y$: percentage of body fat relative to muscle tissue
$m = 16$ independents (age, abdomen circumference, weight, height, etc.)
Multicollinearity: redundancy between the independents.
E.g., which of two strongly correlated independents matters most when predicting $Y$? How can we interpret the partial effects of such independents?
Measure: regress the independents on each other → a VIF indicator for each independent; if VIF > 2 → multicollinearity.
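A minimal sketch of the VIF measure just described: each independent is regressed on all the others, and $\mathrm{VIF}_j = 1/(1 - R_j^2)$. The function name `vif` is ours.

```python
# Hedged sketch: VIF for each independent; X holds the independents only
# (no intercept column), one column per variable.
import numpy as np

def vif(X):
    n, m = X.shape
    ones = np.ones((n, 1))
    out = []
    for j in range(m):
        # regress X_j on all the other independents (plus an intercept)
        others = np.hstack([ones, np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - (resid @ resid) / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))   # VIF_j = 1 / (1 - R_j^2)
    return np.array(out)               # any value > 2 flags multicollinearity here
```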
5. Dataset #2
DATA26 – simulated dataset from a Gumbel copula
$n = 1000$
$m = 25$ (plus $Y$)
Generating a correlation matrix (CM) with high correlations in absolute value: the vineBeta method (Lewandowski et al., 2009).
Simulating multicollinearity: all 26 generated variables follow $N(\mu, \sigma)$ distributions, where $\mu$ and $\sigma$ are randomly generated for each variable.
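A hedged sketch of a DATA26-style simulation. The `vine_beta` function follows the vine method of Lewandowski et al. (2009), where a small Beta parameter pushes correlations towards ±1 ("high correlations in absolute value"); the µ and σ ranges are our assumptions. One deliberate simplification: the slide mentions a Gumbel copula, while this sketch draws jointly normal data from the generated correlation matrix, which reproduces the $N(\mu, \sigma)$ marginals and the correlations but not Gumbel tail dependence.

```python
# Hedged sketch: random correlation matrix via the vine method, then
# correlated normal data. Ranges for mu and sigma are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def vine_beta(d, beta_param):
    P = np.zeros((d, d))                      # partial correlations
    S = np.eye(d)                             # resulting correlation matrix
    for k in range(d - 1):
        for i in range(k + 1, d):
            P[k, i] = 2 * rng.beta(beta_param, beta_param) - 1
            p = P[k, i]
            for l in range(k - 1, -1, -1):    # convert partial -> raw correlation
                p = (p * np.sqrt((1 - P[l, i] ** 2) * (1 - P[l, k] ** 2))
                     + P[l, i] * P[l, k])
            S[k, i] = S[i, k] = p
    return S

d, n = 26, 1000
S = vine_beta(d, beta_param=0.1)              # small parameter -> |r| close to 1
mu = rng.uniform(-5, 5, d)                    # random mean per variable
sigma = rng.uniform(0.5, 3, d)                # random std per variable
cov = S * np.outer(sigma, sigma)              # correlation -> covariance
data = rng.multivariate_normal(mu, cov, size=n)   # 1000 x 26 dataset
```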
6. Performance of Selection Algorithms – FAT
| Method | AIC | SBC | Adj. R² | Runtime (sec) | St. Dev. (sec) |
|--------|-----|-----|---------|---------------|----------------|
| Best Subsets (SPSS Leaps and Bound) | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15) | 4.558 | 0.878 |
| Best Subsets (Minerva: GARS) | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9829 (Variables: 1, 2, 3, 5, 6, 8, 11, 12, 15) | 5.921 | 1.658 |
| improved GARS | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 11.268 | 2.941 |
| IHSRS | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.968 | 0.188 |
| Forward+Backward | 0.058 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.239 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.9822 (Variables: 1, 3, 5, 6, 8, 12, 15) | 0.976 | 0.050 |
| Variable Importance in Projection (Partial Least Squares) | -0.247 (Variables: 1, 2, 5, 6, 8, 9) | -0.092 (Variables: 1, 2, 5, 6, 8, 9) | 0.9618 (Variables: 1, 2, 5, 6, 8, 9) | 1.807 | 0.896 |
| Elastic Net | -2.013 (Variables: 1) | -1.987 (Variables: 1) | 0.9410 (Variables: 1) | 50.858 | 9.019 |
| Stepwise VIF Selection | -0.189 (Variables: 1, 2, 15) | -0.008 (Variables: 1, 2, 15) | 0.954 (Variables: 1, 2, 15) | 0.832 | 0.034 |
| Nested Estimate Procedure | -1.402 (Variables: 1, 8) | -1.351 (Variables: 1, 8) | 0.9538 (Variables: 1, 8) | 0.352 | 0.047 |
8. Problem with the Results
Collinearity statistics of the optimal IHSRS solutions for Adj. R²:

FAT
| Variable | Tolerance | VIF    |
|----------|-----------|--------|
| X1       | 0.069     | 14.490 |
| X3       | 0.017     | 59.097 |
| X5       | 0.089     | 11.271 |
| X6       | 0.030     | 33.682 |
| X8       | 0.105     | 9.540  |
| X12      | 0.239     | 4.182  |
| X15      | 0.399     | 2.509  |
DATA26
| Variable | Tolerance | VIF      |
|----------|-----------|----------|
| X1       | 0.065     | 15.347   |
| X4       | 0.001     | 1644.939 |
| X5       | 0.003     | 388.860  |
| X6       | 0.002     | 538.248  |
| X8       | 0.005     | 197.505  |
| X10      | 0.050     | 20.165   |
| X12      | 0.001     | 1366.452 |
| X13      | 0.030     | 33.293   |
| X15      | 0.001     | 1133.939 |
| X16      | 0.048     | 20.828   |
| X17      | 0.041     | 24.297   |
| X18      | 0.016     | 64.340   |
| X21      | 0.003     | 393.569  |
| X23      | 0.002     | 554.800  |
| X24      | 0.004     | 262.232  |
| X25      | 0.001     | 825.023  |
9. Modify the IHSRS
Include an "all VIFs < 2" condition in the optimization task.
Optimal solutions of IHSRS with VIF conditions:
FAT (Adj. R² = 0.9854)
| Variable | Tolerance | VIF   |
|----------|-----------|-------|
| X1       | 0.508     | 1.970 |
| X2       | 0.879     | 1.138 |
| X8       | 0.558     | 1.791 |

DATA26 (Adj. R² = 0.991)
| Variable | Tolerance | VIF   |
|----------|-----------|-------|
| X2       | 0.503     | 1.986 |
| X6       | 0.548     | 1.825 |
| X10      | 0.500     | 1.999 |
| X14      | 0.526     | 1.902 |
| X23      | 0.565     | 1.770 |
Other models with all VIF values smaller than 2:
Backward – VIF: Adj. R² = 0.9540 (FAT); 0.940 (DATA26)
Nested Estimates: Adj. R² = 0.9538 (FAT); 0.917 (DATA26)
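One way to read the modification, as a hedged sketch: a subset's fitness is its adjusted R², but any subset containing a VIF of 2 or more is rejected outright. This reuses the hypothetical `vif()` and `ols_criteria()` helpers sketched earlier; the reject-with-minus-infinity mechanism is our assumption, not necessarily how the authors encoded the constraint in their C# implementation.

```python
# Hedged sketch of a VIF-constrained fitness for subset selection.
import numpy as np

def fitness(subset, X, y):
    """subset: boolean mask over the m independents; maximise adjusted R^2
    subject to all VIFs < 2 (infeasible subsets score -inf)."""
    Xs = X[:, subset]
    if Xs.shape[1] == 0:
        return -np.inf                       # empty model: nothing to fit
    if Xs.shape[1] > 1 and (vif(Xs) >= 2).any():
        return -np.inf                       # reject: multicollinearity present
    design = np.hstack([np.ones((len(y), 1)), Xs])
    return ols_criteria(design, y)["adj_R2"]
```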
10. A Great Setback for the Modified IHSRS
[Bar charts: average solution time and its standard deviation for IHSRS without VIF vs. IHSRS with VIF, measured both in number of steps and in seconds, on FAT and on DATA26.]
On DATA26, the average runtime of the VIF-constrained IHSRS is almost an hour!
11. We Cannot Parallelize the IHSRS
Individual (melody): ● = 0 0 1 0 1 1 1 (one bit per independent; 1 = included in the model)
Population (harmony memory): ● ● ● ●
STEPS 1 & 2: Generate a random harmony memory and evaluate the regressions for each individual.
STEP 3: Improvise one new individual:
  with probability HMCR, take a melody ● from the harmony memory, then with probability PAR mutate it (flipping bits with probability bw), otherwise leave ● unmodified;
  with probability 1-HMCR, generate a RANDOM individual.
STEP 4: Increase PAR and decrease bw.
STEP 5: Is the new ● better than the worst individual in memory? If YES, replace the worst individual; if NO, discard it.
STEP 6: Termination criterion met? If YES, STOP; if NO, go back to STEP 3.
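Read as pseudocode, the loop looks like the hedged sketch below (reusing the hypothetical `fitness()` helper from the previous slide; the PAR/bw schedules and the step budget are our assumptions). The point is structural: every iteration improvises exactly one new melody and immediately compares it against the worst member of the shared harmony memory, so the expensive regression evaluations cannot be batched.

```python
# Hedged sketch of the IHSRS improvisation loop for model selection.
import numpy as np

rng = np.random.default_rng(1)

def ihsrs(X, y, hms=4, hmcr=0.8, par=0.3, bw=0.5, max_steps=10_000):
    m = X.shape[1]
    # STEPS 1 & 2: random harmony memory, one regression per individual
    memory = rng.random((hms, m)) < 0.5
    scores = np.array([fitness(ind, X, y) for ind in memory])
    for _ in range(max_steps):               # stand-in termination criterion
        if rng.random() < hmcr:              # take a melody from memory...
            new = memory[rng.integers(hms)].copy()
            if rng.random() < par:           # ...and maybe pitch-adjust it
                flip = rng.random(m) < bw    # flip each bit with probability bw
                new[flip] = ~new[flip]
        else:                                # ...or improvise a random one
            new = rng.random(m) < 0.5
        par = min(0.99, par * 1.001)         # increase PAR, decrease bw
        bw = max(0.01, bw * 0.999)           # (schedules are assumptions)
        s = fitness(new, X, y)               # ONE evaluation per step; this
        worst = scores.argmin()              # serial compare-with-worst update
        if s > scores[worst]:                # is what blocks parallelization
            memory[worst], scores[worst] = new, s
    return memory[scores.argmax()]
```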
12. Our GA-HS Hybrid Solution
Individual: ● = 0 0 1 0 1 1 1
Population: ● ● ● ●
STEPS 1 & 2: Generate a random population and evaluate the regressions for each individual.
STEP 3: Select the better-than-average individuals and start a new population with them: ● ● x x (x marks an empty slot).
STEP 4: Fill each empty slot x:
  with probability HMCR, copy a surviving ● and mutate it (flipping bits with probability bw);
  with probability 1-HMCR, generate a RANDOM individual.
STEP 5: Increase HMCR and decrease bw.
STEP 6: Once every x is filled, evaluate the regressions for the new individuals in the population. This batch evaluation can be parallelized!
STEP 7: Termination criterion met? If YES, STOP; if NO, go back to STEP 3.
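A hedged sketch of the hybrid generation loop as we read the flowchart above, again reusing the hypothetical `fitness()` helper; the parameter schedules and pool mechanics are our assumptions, not the authors' C# implementation. Because a whole batch of new individuals is assembled before any of them is scored, the expensive regression evaluations can run in parallel, so with one worker per new individual a generation costs roughly one regression's wall-clock time instead of pop_size of them.

```python
# Hedged sketch of the GA-HS hybrid loop with parallel batch evaluation.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

rng = np.random.default_rng(2)

def ga_hs(X, y, pop_size=4, hmcr=0.8, bw=0.5, generations=100):
    m = X.shape[1]
    pop = rng.random((pop_size, m)) < 0.5     # STEPS 1 & 2: random population
    scores = np.array([fitness(ind, X, y) for ind in pop])
    with ProcessPoolExecutor() as ex:         # pool reused across generations
        for _ in range(generations):
            # STEP 3: better-than-average individuals survive
            survivors = list(pop[scores >= scores.mean()])
            new_pop = list(survivors)
            while len(new_pop) < pop_size:    # STEP 4: fill the empty slots
                if rng.random() < hmcr:       # copy + mutate a survivor
                    ind = survivors[rng.integers(len(survivors))].copy()
                    flip = rng.random(m) < bw
                    ind[flip] = ~ind[flip]
                else:                         # or a brand-new random one
                    ind = rng.random(m) < 0.5
                new_pop.append(ind)
            hmcr = min(0.99, hmcr * 1.001)    # STEP 5: raise HMCR, lower bw
            bw = max(0.01, bw * 0.999)        # (schedules are assumptions)
            pop = np.array(new_pop)
            # STEP 6: the parallelizable batch of regressions (survivors are
            # re-scored for brevity). On Windows, call ga_hs from under
            # `if __name__ == "__main__":` so the pool can spawn workers.
            scores = np.array(list(ex.map(fitness, pop,
                                          [X] * pop_size, [y] * pop_size)))
    return pop[scores.argmax()]
```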
13. Differences from GA
1. More than one kind of mutation
2. No crossover
In linear regression model selection, randomization is more important than inherited good properties: the inclusion or exclusion of a single independent can save or ruin a model. We observed that GA is a relatively slow algorithm when applied to model selection.
15. Environment
The solution times are an average of 30 runs; the standard deviation of the runtimes is determined from the same 30 runs.
Most selection algorithms were run in IBM SPSS Statistics 22.
Elastic Net: the CATREG SPSS macro by Leiden University.
Partial Least Squares: the NumPy and SciPy Python libraries.
The metaheuristics (GARS, improved GARS, IHSRS, GAIHSRS) are implemented in C#.
OS and Hardware Configuration
OS: Windows 8.1 Ultimate, 64-bit
CPU: Intel Core i7-2700K, 3.5 GHz
RAM: 16 GB DDR3 SDRAM