Cross-project defect prediction is very appealing because (i) it allows predicting defects in projects for which the availability of data is limited, and (ii) it allows producing generalizable prediction models. However, existing research suggests that cross-project prediction is particularly challenging and, due to the heterogeneity of projects, prediction accuracy is not always good.
This paper proposes a novel, multi-objective approach for cross-project defect prediction, based on a multi-objective logistic regression model built using a genetic algorithm. Instead of providing the software engineer with a single predictive model, the multi-objective approach allows software engineers to choose predictors that achieve a compromise between the number of likely defect-prone artifacts (effectiveness) and the LOC to be analyzed/tested (a proxy for the cost of code inspection).
Results of an empirical evaluation on 10 datasets from the Promise repository indicate the superiority and usefulness of the multi-objective approach with respect to single-objective predictors. The proposed approach also outperforms an alternative approach for cross-project prediction, based on local prediction upon clusters of similar classes.
4. Indicators of defects
• Cached history information (Kim et al., ICSE 2007)
• Change metrics (Moser et al., ICSE 2008)
• A metrics suite for object-oriented design (Chidamber et al., TSE 1994)
8. Defect Prediction Methodology
[Diagram: in within-project prediction, a predicting model is trained on a project's training set and classifies the classes of its test set as defect-prone or not (Class1: YES, Class2: YES, Class3: NO, …, ClassN: …); in cross-project prediction, the model is trained on past projects and applied to a new project.]
9.-10. Defect Prediction Methodology (animation builds of the previous diagram: the model trained on Project A is applied to Project B)
11. Cost Effectiveness
1) Cross-project prediction does not necessarily work worse than within-project prediction
2) Better precision (accuracy) does not translate into lower inspection cost
3) Traditional prediction model: logistic regression
Recalling the "imprecision" of cross-project defect prediction, Rahman et al., FSE 2012
13. Cost Effectiveness: an example
[Diagram: the system contains Class A, Class B, Class C, and Class D. Predicting model 1 flags Class A (100 LOC) and Class B (10,000 LOC); predicting model 2 flags Class A, Class C, and Class D (100 LOC each). Model 2 identifies more classes while requiring far less code to be inspected.]
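The trade-off above can be made concrete with a small sketch. Class names and LOC sizes are taken from the slide; which classes are actually defect-prone is an assumption made for illustration:

```python
# Hypothetical illustration of the slide's example: inspection cost is the
# total LOC of the classes a model flags as defect-prone, while effectiveness
# is how many flagged classes are actually defect-prone.
loc = {"A": 100, "B": 10_000, "C": 100, "D": 100}
actually_defective = {"A", "C", "D"}          # assumed ground truth

model1_flags = {"A", "B"}                     # model 1 flags A and B
model2_flags = {"A", "C", "D"}                # model 2 flags A, C and D

def inspection_cost(flags):
    """LOC a developer must inspect/test if they trust the model."""
    return sum(loc[c] for c in flags)

def effectiveness(flags):
    """Number of truly defect-prone classes the model catches."""
    return len(flags & actually_defective)

print(inspection_cost(model1_flags), effectiveness(model1_flags))  # 10100 1
print(inspection_cost(model2_flags), effectiveness(model2_flags))  # 300 3
```

Under these assumptions, model 2 catches three defective classes for 300 LOC of inspection, while model 1 catches one for 10,100 LOC.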
19. Building Predicting Model on Training Set
Training Set:
        P1   P2   …
Class1  m11  m12  …
Class2  m21  m22  …
Class3  m31  m32  …
Class4  …    …    …
…       …    …    …

Logistic Regression:
Pred_i = e^(a + b·m_i1 + c·m_i2 + …) / (1 + e^(a + b·m_i1 + c·m_i2 + …))

The fitted model yields a Pred. value for each class C1, C2, C3, C4, …
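The logistic prediction formula can be sketched in Python; the coefficient values and metric values below are illustrative, not taken from any fitted model:

```python
import math

def logistic_pred(coeffs, metrics):
    """Defect-proneness score: e^z / (1 + e^z), with
    z = a + b*m_i1 + c*m_i2 + ...
    `coeffs` = (a, b, c, ...), `metrics` = (m_i1, m_i2, ...)."""
    z = coeffs[0] + sum(b * m for b, m in zip(coeffs[1:], metrics))
    return math.exp(z) / (1 + math.exp(z))

# A class is typically predicted defect-prone when Pred exceeds 0.5.
p = logistic_pred((2, 3, 4), (0.1, 0.2))   # z = 2 + 0.3 + 0.8 = 3.1
print(p > 0.5)   # True
```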
20. Building Predicting Model on Training Set
Training Set: the same metrics table (Class1…ClassN × P1, P2, …).
Two candidate logistic regression models with different coefficients produce different predictions:

Model 1: Pred_i = e^(2 + 3·m_i1 + 4·m_i2 + …) / (1 + e^(2 + 3·m_i1 + 4·m_i2 + …))
  Pred.: C1 = 1, C2 = 1, C3 = 0, C4 = 1, … = 0

Model 2: Pred_i = e^(1 - 2·m_i1 + 1·m_i2 + …) / (1 + e^(1 - 2·m_i1 + 1·m_i2 + …))
  Pred.: C1 = 0, C2 = 0, C3 = 1, C4 = 1, … = 1
21. Building Predicting Model on Training Set
Training Set:
        P1   P2   …
Class1  m11  m12  …
Class2  m21  m22  …
Class3  m31  m32  …
Class4  …    …    …
…       …    …    …

The Logistic Regression predictions are compared against the actual values:
      Pred.  Actual Val
C1    1      1
C2    1      0
C3    0      1
C4    1      1
…     0      0
22. Building Predicting Model on Training Set
(Same training set and Pred./Actual comparison as the previous slide.)
GOAL: minimizing the prediction error (i.e., maximizing precision)
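The comparison step can be sketched with the Pred./Actual columns shown on the slide; computing precision here as a proxy for the single-objective fitness is an assumption for illustration:

```python
# Predicted labels from the fitted logistic model are compared with the
# actual ones on the training set; the single-objective fit minimizes
# the prediction error.
predicted = [1, 1, 0, 1, 0]   # Pred. column from the slide
actual    = [1, 0, 1, 1, 0]   # Actual Val column

errors = sum(p != a for p, a in zip(predicted, actual))
error_rate = errors / len(actual)

# Precision: fraction of classes flagged defect-prone that really are.
flagged = sum(predicted)
true_positives = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
precision = true_positives / flagged

print(errors, error_rate)  # 2 0.4
```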
28. Multi-objective Genetic Algorithm
Chromosome: (a, b, c, …), the coefficients of the logistic model
Pred_i = e^(a + b·m_i1 + c·m_i2 + …) / (1 + e^(a + b·m_i1 + c·m_i2 + …))

Fitness function, with two objectives:
  min InspectionCost(Pred_i)   (LOC of the classes predicted defect-prone)
  max Effectiveness(Actual_i)  (actual defect-prone classes identified)

Multiple objectives are optimized using Pareto-efficient approaches.
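A minimal sketch of the Pareto comparison such a multi-objective GA relies on; the function names and the example population are assumptions for illustration, not the paper's implementation:

```python
# Each chromosome (a, b, c, ...) yields a predictor scored on two objectives:
# (inspection cost, to minimize; effectiveness, to maximize). The GA keeps a
# front of non-dominated solutions rather than a single best model.
def dominates(s1, s2):
    """s = (inspection_cost, effectiveness); s1 dominates s2 if it is no
    worse on both objectives and strictly better on at least one."""
    c1, e1 = s1
    c2, e2 = s2
    return (c1 <= c2 and e1 >= e2) and (c1 < c2 or e1 > e2)

def pareto_front(solutions):
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

pop = [(10_100, 1), (300, 3), (50, 1), (300, 2)]
print(pareto_front(pop))  # [(300, 3), (50, 1)]
```

The software engineer then picks a point on the front, trading inspection cost against effectiveness.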
36. Experiment outline
• Cross-project defect prediction (RQ1):
  train the model on nine projects and test on the remaining one (10 times)
• Within-project defect prediction (RQ1):
  10-fold cross-validation
37. Experiment outline
• Cross-project defect prediction (RQ1):
  train the model on nine projects and test on the remaining one (10 times)
• Within-project defect prediction (RQ1):
  10-fold cross-validation
• Local prediction (RQ2):
  K-means clustering algorithm
  Silhouette Coefficient
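The cross-project setup above amounts to leave-one-project-out validation; a minimal sketch (the project names are placeholders, not the actual Promise datasets):

```python
# Train on nine projects, test on the held-out one, repeated for each of
# the 10 datasets.
projects = [f"project_{i}" for i in range(10)]

for held_out in projects:
    training = [p for p in projects if p != held_out]
    # fit the predictor on `training`, evaluate it on `held_out` (not shown)
    assert len(training) == 9
```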
53. Experiment outline
• Multi-objective Logistic Regression and GA Settings (cross-project defect validation):
  Population size = 100
  Max number of generations = 400
  Mutation probability = 0.05
  Crossover function = arithmetic crossover
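These settings could be wired into a GA loop roughly as follows; this is a sketch under the stated settings, and the mutation scale and blending factor alpha are assumptions not given on the slide:

```python
import random

# GA settings from the slide.
POP_SIZE, MAX_GEN, P_MUT = 100, 400, 0.05

def arithmetic_crossover(parent1, parent2, alpha=0.5):
    """Arithmetic crossover: child = alpha*p1 + (1-alpha)*p2, gene by gene."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(parent1, parent2)]

def mutate(chrom, p=P_MUT, scale=0.1):
    """Perturb each gene with probability p (Gaussian noise, assumed scale)."""
    return [g + random.gauss(0, scale) if random.random() < p else g
            for g in chrom]

child = arithmetic_crossover([2.0, 3.0, 4.0], [1.0, -2.0, 1.0])
print(child)  # [1.5, 0.5, 2.5]
```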