This document presents research on improving bug prediction models by accounting for the effort (approximated by lines of code) needed to act on the predictions. It finds that process metrics are still better predictors than product metrics in effort-aware models. However, while previous research found package-level predictions more effective than file-level predictions, this study finds that file-level predictions are more effective once effort is taken into account. Three approaches to package-level prediction are compared: directly using package metrics, lifting file metrics to the package level, and lifting file predictions to the package level; lifting file predictions to the package level yields the best package-level performance.
13-14. Do We Consider Effort?
[Figure: two files to review. File A: 5 bugs, effort 1; File B: 6 bugs, effort 20.]
15-19. Effort-aware Models*
* T. Mende and R. Koschke, "Effort-aware defect prediction models," in Proc. of the European Conference on Software Maintenance and Reengineering (CSMR'10), 2010, pp. 109-118.
[Figure: relative risk per file. File A: 5 bugs, effort 1, RR = 5 / 1 = 5; File B: 6 bugs, effort 20, RR = 6 / 20 = 0.30. Effort is measured as SLOC.]
20-21. Major Findings in Prediction Studies
Process metrics are better defect predictors than product metrics (RQ1) [Graves2000TSE], [Nagappan2006ICSE], [Moser2008ICSE], ...
Package-level prediction has higher precision and recall than file-level prediction (RQ2) [Schroter2006ISESE], [Zimmermann2007PROMISE], ...
26. Cross-release Prediction
[Figure: timeline from Platform 3.1 to Platform 3.2. Training data: measure metrics and bugs (e.g., 5, 7, 0 bugs) in release 3.1 and build a prediction model. Testing data: measure metrics in release 3.2 and predict the bugs.]
27-28. Cumulative Lift Chart
All modules are ordered by decreasing predicted RR(x).
[Figure: cumulative lift chart; x-axis: KSLOC (= Effort), y-axis: #Bugs. At 20% of the effort, 54% of all bugs are detected.]
29-30. Research Questions
RQ1: Are process metrics still more effective than product metrics in effort-aware models?
RQ2: Are package-level predictions still more effective than file-level predictions?
31. RQ1: Process vs. Product Metrics
Compare prediction models based on process and product metrics at the file-level.
[Figure: process metrics vs. product metrics]
36-37. Process Metrics are still more effective than Product Metrics
[Figure: cumulative lift chart; x-axis: KSLOC (= Effort), y-axis: #Bugs. At 20% of the effort, process metrics detect 74% of all bugs vs. 29% for product metrics, a factor of 2.6 (= 74/29).]
38. Impact of Process and Product Metrics
The top five metrics are all process metrics.
[Figure: random forest variable importance (IncNodePurity), sorted decreasingly. The process metrics Revisions, BugFixes, Age, LOCDeleted, Codechurn and LOCAdded rank highest, followed by the product metrics (SLOC, NBD, MLOC, PAR, VG, WMC, NOM, NOF, NSF, DIT, SIX, LCOM, NORM, NSC, NSM); Refactorings is the only low-ranked process metric.]
39. Research Questions
RQ1: Are process metrics still more effective than product metrics in effort-aware models? YES
RQ2: Are package-level predictions still more effective than file-level predictions?
45. Model Building Approach
B1: Package-level metrics
B2: Lift file-level metrics to package-level
B3: Lift file-level predictions to package-level
46-50. B2: Lift File-level Metrics to Package-level (RQ2: Model Building Approach)
[Figure: file-level metrics are lifted to package-level metrics, and a model built on them yields package-level predictions. Example: Package A contains files a, b and c with complexity 9, 4 and 5; the lifted package-level complexity is (9 + 4 + 5) / 3 = 6.]
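A minimal sketch of B2 LiftUp(Input) in R, assuming (as in the example above) that file-level metrics are lifted by averaging; the data frame and column names are hypothetical:

```r
# B2 LiftUp(Input): lift file-level metrics to the package level by averaging
# them over the files of each package (here: complexity 9, 4, 5 -> 6).
files <- data.frame(package    = c("A", "A", "A"),
                    file       = c("a", "b", "c"),
                    complexity = c(9, 4, 5))
pkg_metrics <- aggregate(complexity ~ package, data = files, FUN = mean)
pkg_metrics  # package A: complexity = 6; a model is then built on these rows
```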
52-57. B3: Lift File-level Predictions to Package-level (RQ2: Model Building Approach)
[Figure: a model built on file-level metrics yields file-level predictions, which are then lifted to package-level predictions. Example: Package A contains files a, b and c with predicted bug counts 6, 3 and 2 and sizes 1.0, 0.5 and 1.5 KSLOC; the lifted package-level relative risk is (6 + 3 + 2) / (1.0 + 0.5 + 1.5) ≈ 3.7 bugs per KSLOC.]
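A minimal sketch of B3 LiftUp(Pred) in R, again with hypothetical data frame and column names:

```r
# B3 LiftUp(Pred): lift file-level predicted bug counts to a package-level
# relative risk = sum(predicted bugs) / sum(effort in KSLOC).
preds <- data.frame(package = c("A", "A", "A"),
                    file    = c("a", "b", "c"),
                    bugs    = c(6, 3, 2),        # file-level predictions
                    ksloc   = c(1.0, 0.5, 1.5))  # effort per file
pkg_rr <- tapply(preds$bugs, preds$package, sum) /
          tapply(preds$ksloc, preds$package, sum)
pkg_rr  # package A: (6 + 3 + 2) / (1.0 + 0.5 + 1.5) is about 3.7 bugs per KSLOC
```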
58-60. Summary of Model Building Approaches (RQ2: Model Building Approach)
B1 Martin Metrics: package-level metrics -> build a model at package-level -> package-level predictions
B2 LiftUp(Input): file-level metrics -> lift metrics to package-level -> build a model at package-level -> package-level predictions
B3 LiftUp(Pred): file-level metrics -> build a model at file-level -> file-level predictions -> lift predictions to package-level -> package-level predictions
61-63. Lifting Predictions yields the Best Performance at Package-level
[Figure: cumulative lift chart; x-axis: KSLOC (= Effort), y-axis: #Bugs. At 20% of the effort: B3 LiftUp(Prediction) detects 62% of all bugs, B2 LiftUp(Input) 57%, and B1 Martin Metrics 19%.]
65. File-level Predictions are more effective than Package-level
[Figure: cumulative lift chart; x-axis: KSLOC (= Effort), y-axis: #Bugs. At 20% of the effort, file-level predictions detect 74% of all bugs vs. 62% for the best package-level approach, B3 LiftUp(Pred.).]
66. Research Questions
RQ1: Are process metrics still more effective than product metrics in effort-aware models? YES
RQ2: Are package-level predictions still more effective than file-level predictions? NO
67-68. Why is RQ2 not Supported?
The larger the package, the more likely a bug is introduced.
[Figure: package-level vs. file-level. Package: 8 bugs, SLOC 20; file: 2 bugs, SLOC 0.5.]
76. Example of Counting the Number of Bugs
[Figure: two timelines, A and B, across the v3.0, v3.1 and v3.2 releases, each marking when a bug is introduced and when it is fixed.]
Hello, my name is Yasutaka Kamei. I'm a postdoc at Queen's University, Canada.
Today's talk is titled "Revisiting Common Bug Prediction Findings Using Effort-Aware Models".
We seek to revisit some of the major findings in the prediction literature by taking into account the cost of additional software quality assurance.
Bugs are everywhere: in laptop applications, on the iPhone, on YouTube, and so on.
Sometimes, when we write a research paper, a laptop suddenly shuts down.
Software bugs in released products have expensive consequences for a company.
For example, software field defects cost the U.S. economy an estimated $60 billion annually.
These bugs also seriously affect a company's reputation.
But, unfortunately, a company has only limited developers and time for software quality assurance activities.
What can we do?
Please ask Paul!
He is an octopus, but a super octopus: he correctly predicted the outcomes of all 8 games he was given in the 2010 World Cup.
He could help us predict bugs. But we have one piece of bad news: the life span of an octopus is only about 3 years, so we will need another solution in one or two years.
Another approach is a bug prediction technique. Many studies show these techniques can be effective for software quality assurance.
When we use these techniques, we first measure source code metrics, for example complexity, cohesion, churn, and the number of bugs.
Then we build a prediction model using statistical and machine learning techniques.
Given new files, we measure their source code metrics and predict the number of bugs in each file using this prediction model.
We can then allocate the limited testing and reviewing effort to the files with the largest predicted number of bugs.
But do we consider effort? We should always take into account the effort needed to review a file in order to detect a bug.
Do a small file and a large file require the same amount of effort to review? We don't think so, and therefore we should consider the effectiveness of reviewing files in terms of effort.
For example, suppose the predicted numbers of bugs are 5 and 6, and that the files require 1 and 20 units of effort, respectively. In terms of effectiveness, we should review file A.
But previous prediction approaches tell us to allocate more effort to file B, because its predicted number of bugs is larger.
To solve this problem, effort-aware models have been proposed. The dependent variable of such a model is the relative risk RR(x) = (number of bugs in x) / (effort for x), i.e., how many bugs we can detect per unit of effort; we use it instead of the raw number of bugs when building a prediction model.
In our example, we can finish reviewing file A with 1 unit of effort and detect 5 bugs, so its relative risk is 5. On the other hand, reviewing file B takes 20 units of effort to detect 6 bugs, so its relative risk is only 0.30.
The effort-aware model thus tells us to allocate more effort to file A, which we believe is more intuitive.
There are many ways to measure effort; one easy starting point, which we adopt in this study, is to consider the number of lines of code of a file as the effort to review it.
This effort-aware prediction model offers a totally new interpretation and a practical view of bug prediction results.
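As a minimal sketch of this effort-aware prioritization in R (the environment used in this study); the data frame below only mirrors the File A / File B example:

```r
# Relative risk RR(x) = (predicted) number of bugs / review effort,
# where effort is approximated by the file's SLOC.
files <- data.frame(name   = c("FileA", "FileB"),
                    bugs   = c(5, 6),    # predicted number of bugs
                    effort = c(1, 20))   # review effort (SLOC)
files$rr <- files$bugs / files$effort
files[order(-files$rr), ]  # File A (RR = 5) is reviewed before File B (RR = 0.30)
```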
We'd like to revisit some of the major findings in the prediction literature when considering effort.
One major finding in prediction studies is that process metrics are better defect predictors than product metrics. Process metrics are measured from a project's development history (e.g., its version control system), while product metrics are measured from the source code itself.
Another is that package-level prediction has higher precision and recall than file-level prediction.
We revisit these two major findings: process versus product metrics, and package-level versus file-level prediction.
The target of our study is three subprojects of the Eclipse software system, one of the well-known open development platforms.
Each subproject is fairly large: they have about 5,000 files, 3,000 files and 1,000 files, respectively.
In this presentation, we show the results for Platform, because this project is the largest of the three.
We perform a cross-release prediction analysis to study the performance of our experiments.
We build a prediction model using a training dataset collected from a past release of a project, and evaluate its performance using a test dataset collected from the following release.
This cross-release evaluation leads to an evaluation in a more practical setting.
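A minimal sketch of this setup in R, assuming hypothetical per-release CSV files with one row per source file:

```r
# Cross-release prediction: train on a past release, test on the next one.
train <- read.csv("platform-3.1.csv")  # metrics + bugs from Platform 3.1
test  <- read.csv("platform-3.2.csv")  # metrics from Platform 3.2
```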
We evaluate the prediction performance using a cumulative lift chart. We order all files by decreasing predicted fault density.
Then we calculate what percentage of all bugs we detect when we use 20% of the effort, where 100% of the effort means the effort it takes to test all files. This example shows that we can detect 54% of all bugs.
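A minimal sketch of such a chart in R, assuming a test set with hypothetical columns rr_pred (predicted relative risk), ksloc (effort) and bugs (actual bugs):

```r
# Cumulative lift chart: order modules by decreasing predicted RR(x), then
# plot cumulative effort (KSLOC) against cumulative bugs found.
lift <- test[order(-test$rr_pred), ]
cum_effort <- cumsum(lift$ksloc) / sum(lift$ksloc)
cum_bugs   <- cumsum(lift$bugs)  / sum(lift$bugs)
plot(100 * cum_effort, 100 * cum_bugs, type = "l",
     xlab = "KSLOC (= Effort) [%]", ylab = "#Bugs [%]")
# Performance measure used in the talk: % of bugs detected at 20% of the effort.
100 * max(cum_bugs[cum_effort <= 0.20])
```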
So we are interested in addressing the following two research questions.
RQ1: are process metrics still more effective than product metrics?
RQ2: are package-level predictions still more effective than file-level predictions?
Let me talk about research question 1: process metrics versus product metrics.
We compare models based on product metrics to models based on process metrics at the file-level.
As product metrics, we measured the product metrics using the Eclipse Metrics plug-in.
As process metrics, we measured 7 kinds of metrics, such as code churn, the number of revisions of a file, and the number of times a file has been refactored. We wrote a script that obtains these measures from the version control system.
We use three modeling techniques (regression models, regression trees and random forests) to build the prediction models. This environment is in R.
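A minimal sketch of these three techniques in R, assuming hypothetical train/test data frames that hold the metrics plus the effort-aware target rr:

```r
library(rpart)          # regression trees
library(randomForest)   # random forests

m_lm <- lm(rr ~ ., data = train)            # regression model
m_rt <- rpart(rr ~ ., data = train)         # regression tree
m_rf <- randomForest(rr ~ ., data = train)  # random forest

pred <- predict(m_rf, newdata = test)       # cross-release prediction
```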
I'll show the results of process metrics versus product metrics.
This graph is a cumulative lift chart: solid lines show the results for process metrics and dashed lines show the results for product metrics.
Using our bug predictions based on process metrics, we only need to spend 20% of the effort to detect up to 74% of all faults.
On the other hand, using product metrics, we detect only up to 29% of all faults.
Process metrics outperform product metrics by a factor of 2.6 (= 74/29) when considering effort.
To illustrate the impact of product and process metrics, we build a random forest using both process and product metrics and use the IncNodePurity in its output. A higher IncNodePurity means that a variable plays a more important role in the built prediction model. This figure shows the IncNodePurity of the metrics, sorted decreasingly from top to bottom, with process metrics colored red. The top five metrics are all process metrics (Revisions, BugFixes, Age, LOCDeleted and Codechurn), and the number of lines of code added is ranked sixth. Only the number of refactorings is unimportant.
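A minimal sketch of this importance analysis, reusing the hypothetical train data frame from above:

```r
library(randomForest)
# Fit a random forest on process + product metrics together and rank the
# variables by IncNodePurity (higher = more important in the model).
rf  <- randomForest(rr ~ ., data = train, importance = TRUE)
imp <- importance(rf)[, "IncNodePurity"]
sort(imp, decreasing = TRUE)  # process metrics should dominate the top ranks
```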
Next, we talk about research question 2: package-level versus file-level prediction. We compare package-level predictions to file-level predictions.
When building a file-level prediction model, product metrics and process metrics can be used as-is. However, since a package consists of multiple files, we either need to lift the file-level metrics up to the package-level or use special package-level metrics.
Martin's package design metrics indicate the instability and abstractness of packages, such as the number of abstract and concrete classes and the fan-in and fan-out of packages. To our knowledge, no study has reported the effects of Martin metrics on bug prediction.
We show an overview of the three model building approaches.
When we use Martin Metrics (B1), we measure the metrics and build a model at the package level.
When we use LiftUp(Input) (B2), we lift the file-level metrics up to the package level.
When we use LiftUp(Prediction) (B3), we lift the file-level prediction results up to the package level.
None of the Martin metrics show up in the top five metrics (two process and three product metrics do). One of the Martin metrics is ranked in the top 10, but the others are not important for building a model.
Why is RQ2 not supported?
Normally, a package is larger and has more bugs than a file. If we don't consider effort, it is easier to detect a bug in a package, because the larger the package, the more likely a bug is introduced. That is why, in previous work, package-level prediction had higher precision and recall than file-level prediction.
In this paper, we use effort-aware models to revisit some of the major findings in bug prediction.
RQ1 is process metrics versus product metrics.
RQ2 is package-level versus file-level prediction, for which we introduce three model building approaches for package-level prediction.
To study the two research questions, we evaluate prediction performance using datasets of the Eclipse project.
We find that the first finding holds when effort is considered, while the second finding does not.
List of Questions:
Why do you use the number of lines of code as the effort to test files?
Package-level prediction has some characteristics that differ from file-level prediction.
Do you use bug density as the dependent variable? If yes, what is the novel point of your research?
How do you collect bug information from CVS?
Martin's package design metrics indicate the instability and abstractness of packages, such as the number of classes inside a package that depend on classes outside that package, and the number of abstract classes. For package-level predictions, in addition to product metrics and process metrics, we use this metrics suite proposed by Martin.