Using text clustering to predict defect resolution time: A conceptual replication and an evaluation of prediction accuracy

Saïd Assar (1), Markus Borg (2), Dietmar Pfahl (3)
(1) Institut Mines Télécom, Ecole de Management, France
(2) Lund University / Swedish ICT, Sweden
(3) Lund University / Univ. of Tartu, Estonia

Leveraging …
 textual similarity among defect descriptions
 unsupervised machine learning (ML)
to analyse defect resolution time (DRT)
Main result: clusters of similar defects have significantly different mean resolution times (RT)
=> To what extent does this result hold?
Claim: the mean RT of clusters of similar defects can be used for DRT prediction
=> To what extent is this claim valid?
Raja (2014). All complaints are not created equal: text analysis of open source software defect reports.
Empirical Software Engineering 18:117–138

Related works (1)
 Leveraging attributes of previous defects to manage new
incoming defects
– Attributes:
o Severity
o Software component
o Developer
o Textual description
o Textual comments
o etc.
– Goals:
o Duplicate detection
o Severity prediction
o Triaging
o Defect resolution time (DRT) prediction
o etc.

Related works (2)
 DRT prediction using different defect attributes
– Kim and Whitehead (2006): descriptive statistics
– Panjer (2007), Bougie (2010): different methods and different attributes
(excluding textual descriptions)
– Anbalagan and Vouk (2009): number of persons as predictor
– Giger et al. (2010): different attributes, decision tree analysis
– Bhattacharya and Neamtiu (2011): different attributes, univariate and multivariate regression analysis
– Lamkanfi and Demeyer (2012): filtering outliers
– Zhang et al. (2013): Markov model
 Prediction using text similarity (Weiss et al., 2007)
[Figure: previous defects; new incoming defect (textual desc.); prediction set (K = 1, 2, 3, …, ∞); similarity level (α = 0 to 100%)]
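The similarity-based scheme above can be sketched as a nearest-neighbour lookup: the new defect's description is compared with previous defects, the K most similar ones above a similarity level α form the prediction set, and their mean resolution time is the prediction. This is a minimal sketch, not the implementation of Weiss et al.: the cosine similarity on raw term counts, the function names, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase a description and split it on whitespace."""
    return text.lower().split()


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_drt(new_description, previous_defects, k=3, alpha=0.0):
    """Mean resolution time of the k most similar previous defects.

    previous_defects: list of (description, resolution_time) pairs.
    alpha: minimum similarity (0.0 to 1.0) required to enter the prediction set.
    Returns None when no previous defect reaches the similarity level.
    """
    new_tokens = tokenize(new_description)
    scored = [
        (cosine_similarity(new_tokens, tokenize(description)), drt)
        for description, drt in previous_defects
    ]
    # Keep only defects at or above the similarity level, most similar first.
    candidates = sorted(
        (item for item in scored if item[0] >= alpha),
        key=lambda item: item[0],
        reverse=True,
    )
    prediction_set = candidates[:k]
    if not prediction_set:
        return None
    return sum(drt for _, drt in prediction_set) / len(prediction_set)


# Made-up example data (resolution times in days), for illustration only.
history = [
    ("crash when opening file", 2.0),
    ("crash when saving file", 4.0),
    ("typo in help menu", 30.0),
]
print(predict_drt("crash when closing file", history, k=2))
```

Raising α shrinks the prediction set, so some new defects get no prediction at all; lowering it trades coverage against similarity.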

Baseline experiment (1)
TEXT CLUSTERING TO PREDICT DEFECT RESOLUTION TIME (Raja, 2014)
 Studied projects (#closed defects)
– FileZilla (956)
– jEdit (1682)
– phpMyAdmin (1521)
– Pidgin (2689)
– Slash (3580)
 Tool support: SAS Text Miner
 Manual preprocessing
– Removal of report clones
– Manual examination of issues closed in <1 hour
 Automatic preprocessing
– Stop words
– Stemming
– Term weighting (entropy weights)
– Latent Semantic Indexing (LSI)
 Clustering algorithm: entropy minimization
 Manual cluster tuning: several labor-intensive steps
 # of clusters: 3-5

Baseline experiment (2)
TEXT CLUSTERING TO PREDICT DEFECT RESOLUTION TIME (Raja, 2014)
 Kolmogorov-Smirnov test
– Goal: normal distribution of DRT among clusters
– Result: partially positive; for each project's data, “there was a slight violation of normality assumption” for certain clusters (p.129)
 Levene’s test
– Goal: equal variance of DRT among clusters
– Result: negative; unequal variance for all projects’ data (p.129)
 One-way ANOVA test
– Goal: analysis of DRT variance among clusters
– Result: positive; for each project, the DRT mean of at least one cluster differs from all other clusters (p.130)
 Brown-Forsythe test
– Goal: equality of RT means among clusters
– Result: positive; the results were consistent, across all projects, with the findings of ANOVA (p.130)
 Games-Howell’s post-hoc test
– Goal: identify the exact pattern of differences among clusters’ RT means
– Result: partially positive; some of the clusters do not have significant differences in their DRT (p.130)

Replication (1)
A CONCEPTUAL REPLICATION
 Studied projects (#closed defects)
– Baseline: FileZilla (956), jEdit (1682), phpMyAdmin (1521), Pidgin (2689), Slash (3580)
– Replication: Android (4684, OSS), Eclipse (4158, OSS), Company A (6790, proprietary)
 Tool support
– Baseline: SAS Text Miner
– Replication: RapidMiner
 Manual preprocessing
– Baseline: removal of report clones; manual examination of issues closed in <1 hour
– Replication: removal of report clones
 Automatic preprocessing
– Baseline: stop words, stemming, term weighting (entropy weights), Latent Semantic Indexing (LSI)
– Replication: stop words, stemming, term weighting (TF-IDF)
 Clustering algorithm
– Baseline: entropy minimization
– Replication: K-means clustering
 Manual cluster tuning
– Baseline: several labor-intensive steps
– Replication: none
 # of clusters
– Baseline: 3-5
– Replication: 4
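The replication pipeline (stop words, term weighting with TF-IDF, K-means clustering) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the RapidMiner process used in the study: the stop-word list is a tiny made-up subset, stemming is omitted, and the farthest-first choice of initial centroids is an assumption chosen to keep the example deterministic.

```python
import math
from collections import Counter

# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"the", "a", "an", "in", "on", "when", "is", "to", "of"}


def preprocess(text):
    """Lowercase, split on whitespace and drop stop words (stemming omitted)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF vector (dict term -> weight) per document."""
    tokenized = [preprocess(d) for d in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {t: c * math.log(len(docs) / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def distance_sq(u, v):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in set(u) | set(v))


def centroid(members):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in members:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(members) for term, weight in total.items()}


def kmeans(vectors, k, iterations=20):
    """Plain K-means; returns one cluster label per vector."""
    # Farthest-first initialisation (an assumption): start from the first
    # vector, then repeatedly add the vector farthest from chosen centroids.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(distance_sq(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: distance_sq(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = centroid(members)
    return labels


# Made-up defect descriptions, for illustration only.
docs = [
    "crash when opening file",
    "crash when saving file",
    "typo in help menu",
    "typo in settings menu",
]
print(kmeans(tfidf_vectors(docs), k=2))
```

With the number of clusters fixed in advance and no manual tuning, the whole pipeline runs unattended, which is the point of the replication's simplification.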

Replication (2)
RESULTS: CONFIRMATION …
           One-way ANOVA test                 Brown-Forsythe test
Android    F=13.47  p-value = 0.0000 ***      F=14.03  p-value = 0.0000 ***
Eclipse    F=21.44  p-value = 0.0000 ***      F=18.78  p-value = 0.0000 ***
CompA      F=92.99  p-value = 0.0000 ***      F=48.5   p-value = 0.0000 ***

Android      Cluster 2     Cluster 3     Cluster 4
Cluster 1    0.0004 ***    0.0472 *      0.9127
Cluster 2                  0.0000 ***    0.8930
Cluster 3                                0.3840

Eclipse      Cluster 2     Cluster 3     Cluster 4
Cluster 1    0.0250 *      0.0000 ***    0.9308
Cluster 2                  0.0000 ***    0.0000 ***
Cluster 3                                0.0000 ***

Company A    Cluster 2     Cluster 3     Cluster 4
Cluster 1    0.0000 ***    0.0000 ***    0.0000 ***
Cluster 2                  0.0000 ***    0.0000 ***
Cluster 3                                0.0000 ***
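For readers who want to see what the two F statistics measure, the sketch below computes the one-way ANOVA F and the Brown-Forsythe F* (the robust test of equal means, which does not assume equal variances) from per-cluster samples. It is a hedged illustration with made-up numbers, not the analysis script of the study; p-values are left out because they require the F distribution, which the Python standard library does not provide.

```python
from statistics import mean, variance


def anova_f(groups):
    """One-way ANOVA F statistic; groups is a list of samples, one per cluster."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def brown_forsythe_f(groups):
    """Brown-Forsythe F* statistic for equality of means.

    Same numerator as the ANOVA between-groups sum of squares, but the
    denominator weights each sample variance by (1 - n_i / N), so unequal
    variances across clusters are tolerated.
    """
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    grand_mean = mean(all_values)
    numerator = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    denominator = sum((1 - len(g) / n_total) * variance(g) for g in groups)
    return numerator / denominator


# Made-up resolution times for three clusters, for illustration only.
clusters = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(anova_f(clusters), brown_forsythe_f(clusters))
```

The two statistics coincide on this toy data because the clusters have equal sizes and equal variances; they diverge when variances differ, which is the situation Levene's test flagged in the baseline.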

Simulation (1)
EVALUATION OF PREDICTION ACCURACY
Simulation principles

Simulation (2)
Simulation principles (cont’d)
Magnitude of Relative Error (MRE) accuracy indicator
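MRE is conventionally defined as |actual - predicted| / actual. The sketch below computes it, together with the share of predictions whose MRE stays within an error threshold. Treating the threshold this way (the usual Pred(x) indicator) is an assumption about how the simulation scores predictions, and the function names and sample values are illustrative.

```python
def mre(actual, predicted):
    """Magnitude of Relative Error of one prediction (actual must be non-zero)."""
    return abs(actual - predicted) / actual


def pred(actuals, predictions, threshold=0.25):
    """Share of predictions whose MRE is at most the threshold."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= threshold)
    return hits / len(actuals)


# Made-up resolution times (days), for illustration only.
actual_drt = [10.0, 20.0, 40.0]
predicted_drt = [8.0, 30.0, 41.0]
print([mre(a, p) for a, p in zip(actual_drt, predicted_drt)])
print(pred(actual_drt, predicted_drt, threshold=0.25))
```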

Simulation (3)
Simulation principles (cont’d)

Simulation (4)
 Varying the simulation parameters
– Error threshold = 25% or 50%
– K = 4, 6, 8 and 10
– SSF = 0.1, 0.3, 0.5
 The results were negative overall, i.e., no significant difference compared with naïve prediction
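The comparison against naïve prediction can be sketched as follows: each test defect is predicted once with the mean DRT of its cluster and once with a naïve predictor, and the two hit rates at a given error threshold are compared. This is a sketch under assumptions, not the simulation used in the study: the naïve predictor is taken here to be the overall mean DRT of the training defects, the SSF parameter is not modelled, and the data are made-up toy values that say nothing about the actual results.

```python
from statistics import mean


def hit_rate(actuals, predictions, threshold):
    """Share of predictions whose relative error is within the threshold."""
    hits = sum(
        1 for a, p in zip(actuals, predictions) if abs(a - p) / a <= threshold
    )
    return hits / len(actuals)


def compare(train, test, threshold=0.25):
    """Return (cluster-mean hit rate, naive hit rate).

    train and test are lists of (cluster_id, drt) pairs; the cluster of each
    test defect is assumed to be known already.
    """
    by_cluster = {}
    for cluster_id, drt in train:
        by_cluster.setdefault(cluster_id, []).append(drt)
    cluster_means = {c: mean(values) for c, values in by_cluster.items()}
    overall_mean = mean([drt for _, drt in train])  # naive predictor (assumption)

    actuals = [drt for _, drt in test]
    # Fall back to the overall mean for clusters unseen in the training data.
    cluster_predictions = [cluster_means.get(c, overall_mean) for c, _ in test]
    naive_predictions = [overall_mean] * len(test)
    return (
        hit_rate(actuals, cluster_predictions, threshold),
        hit_rate(actuals, naive_predictions, threshold),
    )


# Toy data only: two well-separated clusters, chosen to show the mechanics.
train = [(0, 2.0), (0, 4.0), (1, 20.0), (1, 40.0)]
test = [(0, 3.0), (1, 32.0)]
print(compare(train, test, threshold=0.25))
```

When resolution times vary widely inside each cluster, the cluster means sit close to the overall mean and the two hit rates converge, which is one way to read the negative result above.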

Simulation (5)
 Results at
– K = 4
– SSF = 0.1, 0.3, 0.5

Simulation (6)
 Results with data from Company A only
– K = 6, 8, 10
– SSF = 0.3, 0.5

Conclusion
 A simple, fully automated clustering approach based on term frequency in defect report descriptions cannot predict DRT with sufficient accuracy
 Replication without a grounding theory “is by far the most risky type of replication” (Kitchenham 2008)
 Future work:
– Does the similarity assumption hold?
– Semantic clustering?