Presentation of the paper published in Empirical Software Engineering, at the European Open Symposium on Empirical Software Engineering (EOSESE’2015), EVOLille, 03 Dec. 2015.
The corresponding paper appeared in the Empirical Software Engineering journal, Springer, http://link.springer.com/article/10.1007/s10664-015-9391-7
Using text clustering to predict defect resolution time: A conceptual replication and an evaluation of prediction accuracy
1. Using text clustering to predict defect
resolution time: A conceptual replication
and an evaluation of prediction accuracy
Saïd Assar (1), Markus Borg (2), Dietmar Pfahl (3)
(1) Institut Mines Télécom, Ecole de Management, France
(2) Lund University / Swedish ICT, Sweden
(3) Lund University / Univ. of Tartu, Estonia
4. USING TEXT CLUSTERING TO PREDICT DEFECT RESOLUTION
TIME: A CONCEPTUAL REPLICATION AND AN EVALUATION
OF PREDICTION ACCURACY
Leveraging …
textual similarity among defect textual descriptions
unsupervised ML
to analyse Defect Resolution Time (DRT)
Main result: clusters of similar defects
have significant differences in mean DRT
=> To what extent does this result hold?
Claim: mean DRT for clusters of similar
defects can be used for DRT prediction
=> To what extent is this claim valid?
Raja (2014). All complaints are not created equal: text analysis of open source software defect reports.
Empirical Software Engineering 18:117–138
5. Related works (1)
Leveraging attributes of previous defects to manage new
incoming defects
– Attributes:
o Severity
o Software component
o Developer
o Textual description
o Textual comments
o etc.
– Goals:
o Duplicate detection
o Severity prediction
o Triaging
o Defect resolution time (DRT) prediction
o etc.
6. Related works (2)
DRT prediction using different defect attributes
– Kim and Whitehead (2006): descriptive statistics
– Panjer (2007), Bougie (2010): different methods and different attributes
(excluding textual descriptions)
– Anbalagan and Vouk (2009): number of persons as predictor
– Giger et al. (2010): different attributes, decision tree analysis
– Bhattacharya and Neamtiu (2011): different attributes, univariate and
multivariate regression analysis
– Lamkanfi and Demeyer (2012): filtering outliers
– Zhang et al. (2013): Markov model
Prediction using text similarity (Weiss et al., 2007)
[Figure: a new incoming defect (textual desc.) is compared with previous defects; the prediction set is controlled by K (K = 1, 2, 3, … ∞) and by the similarity level (α = 0 to 100%)]
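The prediction-set idea above can be sketched in a few lines. This is a minimal illustration, not the method of Weiss et al. as implemented: it assumes cosine similarity over raw word counts and the mean as the aggregate, and the names `cosine`, `predict_drt` and the `history` data are hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def predict_drt(new_desc, previous, k=3, alpha=0.0):
    """Mean DRT of the k most similar previous defects with similarity >= alpha.

    previous: list of (description, resolution_time) pairs.
    k=None means no limit on the size of the prediction set.
    Returns None when no previous defect reaches the similarity level.
    """
    query = Counter(new_desc.lower().split())
    scored = [(cosine(query, Counter(d.lower().split())), t) for d, t in previous]
    candidates = sorted((s for s in scored if s[0] >= alpha),
                        key=lambda s: s[0], reverse=True)
    prediction_set = candidates[:k]
    if not prediction_set:
        return None
    return sum(t for _, t in prediction_set) / len(prediction_set)

# Made-up history of closed defects: (description, resolution time in days).
history = [
    ("crash when opening large file", 10.0),
    ("crash when saving large file", 20.0),
    ("typo in settings dialog", 1.0),
]
print(predict_drt("crash when opening file", history, k=2, alpha=0.3))  # 15.0
```

Raising α shrinks the prediction set and can leave it empty, in which case no prediction is made; K caps how many of the remaining neighbours are averaged.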
7. Baseline experiment (1)
TEXT CLUSTERING TO PREDICT DEFECT RESOLUTION TIME (Raja, 2014)
Studied projects (#closed defects) | FileZilla (956), jEdit (1682), phpMyAdmin (1521), Pidgin (2689), Slash (3580)
Tool support | SAS Text Miner
Manual preprocessing | Removal of report clones; manual examination of issues closed in <1 hour
Automatic preprocessing | Stop words; stemming; term weighting (entropy weights); Latent Semantic Indexing (LSI)
Clustering algorithm | Entropy minimization
Manual cluster tuning | Several labor-intensive steps
# of clusters | 3-5
8. Baseline experiment (2)
TEXT CLUSTERING TO PREDICT DEFECT RESOLUTION TIME (Raja, 2014)
Statistical test | Goal of the test | Results
Kolmogorov-Smirnov test | Normal distribution of DRT among clusters | Partially positive: for each project's data, “there was a slight violation of normality assumption” for certain clusters (p.129)
Levene’s test | Equal variance of DRT among clusters | Negative: unequal variance for all projects’ data (p.129)
One-way ANOVA test | Analysis of DRT variance among clusters | Positive: for each project, the DRT mean of at least one cluster differs from all other clusters (p.130)
Brown-Forsythe test | Equality of DRT means among clusters | Positive: the results were consistent, across all projects, with the findings of ANOVA (p.130)
Games-Howell’s post-hoc test | Identify the exact pattern of differences among clusters’ DRT means | Partially positive: some of the clusters do not have significant differences in their DRT (p.130)
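To make the ANOVA row concrete, the F statistic it relies on can be computed by hand. The sketch below is illustrative only: the function name `one_way_anova_f` and the three toy clusters are hypothetical, and no p-value is derived (that needs the F distribution, e.g. via `scipy.stats.f_oneway`).

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of samples (one list per cluster)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the cluster means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of each observation around its own cluster mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up resolution times (days) for three clusters.
clusters = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [7.0, 8.0, 9.0]]
print(one_way_anova_f(clusters))  # 31.0
```

A large F indicates that at least one cluster mean differs from the others; it does not say which one, which is the role of the post-hoc test in the last row.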
9. Replication (1)
A CONCEPTUAL REPLICATION
Aspect | Baseline (Raja, 2014) | Replication
Studied projects (#closed defects) | FileZilla (956), jEdit (1682), phpMyAdmin (1521), Pidgin (2689), Slash (3580) | Android (4684, OSS), Eclipse (4158, OSS), Company A (6790, proprietary)
Tool support | SAS Text Miner | RapidMiner
Manual preprocessing | Removal of report clones; manual examination of issues closed in <1 hour | Removal of report clones
Automatic preprocessing | Stop words; stemming; term weighting (entropy weights); Latent Semantic Indexing (LSI) | Stop words; stemming; term weighting (TF-IDF)
Clustering algorithm | Entropy minimization | K-means clustering
Manual cluster tuning | Several labor-intensive steps | None
# of clusters | 3-5 | 4
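The replication pipeline (stop words, TF-IDF weighting, K-means with 4 clusters) can be sketched with the standard library alone. This is not the RapidMiner process used in the study: stemming is omitted, the stop-word list is a tiny placeholder, the farthest-point initialisation is an assumption chosen to keep the example deterministic, and all names and sample descriptions are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"the", "a", "in", "on", "when", "with", "after", "is", "to"}

def tfidf_vectors(docs):
    """Map each description to an L2-normalised TF-IDF vector (term -> weight)."""
    tokenised = [[w for w in d.lower().split() if w not in STOP_WORDS] for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(t for toks in tokenised for t in set(toks))
    vectors = []
    for toks in tokenised:
        vec = {t: c * math.log(n_docs / doc_freq[t]) for t, c in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))

def assign(vectors, centroids):
    """Label each vector with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors]

def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(dict(far))
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                terms = set().union(*members)
                centroids[c] = {t: sum(m.get(t, 0.0) for m in members) / len(members)
                                for t in terms}
    return assign(vectors, centroids)

# Made-up defect descriptions: four topics, two reports each.
descriptions = [
    "crash when opening the file", "crash when saving the file",
    "typo in settings dialog", "typo in preferences dialog",
    "login fails with wrong password", "login fails after timeout",
    "slow rendering on startup", "slow rendering on resize",
]
labels = kmeans(tfidf_vectors(descriptions), k=4)
print(labels)
```

On this toy input each pair of related reports ends up with the same cluster label; the label numbers themselves carry no meaning.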
14. Simulation (4)
Varying the simulation parameters
– Error threshold = 25% or 50%
– K = 4, 6, 8 and 10
– SSF = 0.1, 0.3, 0.5
The results were globally negative, i.e., no significant
difference when compared with naïve prediction
EVALUATION OF PREDICTION ACCURACY
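The comparison against naïve prediction can be illustrated as follows. This is a sketch under stated assumptions, not the simulation used in the study: it assumes the cluster-based prediction is the mean DRT of the defect's cluster, the naïve prediction is the overall mean DRT, and accuracy is the share of predictions whose relative error stays within the error threshold. All names and numbers are made up and do not reproduce the study's (negative) results.

```python
from collections import defaultdict

def hit_rate(predicted, actual, threshold=0.25):
    """Share of predictions whose relative error is within the threshold.

    Assumes every actual resolution time is strictly positive.
    """
    hits = sum(abs(p - a) / a <= threshold for p, a in zip(predicted, actual))
    return hits / len(actual)

def cluster_means(labels, drts):
    """Mean DRT per cluster label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for label, drt in zip(labels, drts):
        sums[label] += drt
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

# Made-up training data: cluster label and DRT (days) of already closed defects.
train_labels = [0, 0, 1, 1]
train_drt = [10.0, 12.0, 100.0, 120.0]
# Made-up test data: cluster label assigned to each new defect and its true DRT.
test_labels = [0, 1]
test_drt = [11.0, 100.0]

means = cluster_means(train_labels, train_drt)
cluster_pred = [means[label] for label in test_labels]
naive_pred = [sum(train_drt) / len(train_drt)] * len(test_drt)

for threshold in (0.25, 0.50):
    print(threshold,
          hit_rate(cluster_pred, test_drt, threshold),
          hit_rate(naive_pred, test_drt, threshold))
```

The interesting quantity is the gap between the two hit rates at a given threshold; a gap near zero means the clusters add nothing over the naïve baseline.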
16. Simulation (4)
Results with data from Company A only
– K = 6, 8, 10
– SSF = 0.3, 0.5
EVALUATION OF PREDICTION ACCURACY
17. Conclusion
A simple, fully automated clustering approach based
on term frequency in defect report descriptions cannot
predict DRT with sufficient accuracy
Replication without a grounding theory “is by far the most
risky type of replication” (Kitchenham 2008)
Future work:
– Does the similarity assumption hold?
– Semantic clustering?