A Machine Learning approach to predict Software Defects
1. Escalation Prediction on Defects Database
Guide: Dr. K. V. Subramaniam
Project author: Chetan Hireholi, 01FM14ESE006
2. Problem statement
Determine what led to an escalation by interpreting the defects corpus of the
customer support cases.
Alert on escalations based on the nature of the defects, correlate the
escalations with defects discovered by the customers, and find the trigger point
that leads to such an escalation.
3. Data Source
The Incident Database: contains the customer support cases.
The CRs Database: an internally used database that details the cases that
were Change Requests (CRs).
4. Data Cleansing
The data in the Incidents and CRs Databases had many discrepancies (e.g. rows
not in order, special characters in the Date field, multiple spellings of the
same company name, viz. Boeing, Boeing Inc.).
Tools such as OpenRefine and Microsoft Excel helped in removing such
discrepancies.
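The company-name discrepancy mentioned above can be illustrated with a minimal Python sketch. This is not the OpenRefine or Excel workflow that was actually used; the suffix list and the function name `normalize_company` are assumptions made for illustration, and the only example taken from the slides is "Boeing" vs. "Boeing Inc.".

```python
import re
from collections import Counter

# Hypothetical legal-form suffixes to strip; the slides only name
# "Boeing" vs. "Boeing Inc." as an example of the discrepancy.
LEGAL_SUFFIXES = {"inc", "ltd", "gmbh", "ag", "corp", "co"}

def normalize_company(name: str) -> str:
    """Return a canonical key: upper-case, no punctuation, no trailing legal suffix."""
    # Replace punctuation with spaces, then split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", name).split()
    # Drop trailing legal-form tokens such as "Inc" or "Ltd".
    while len(tokens) > 1 and tokens[-1].lower() in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens).upper()

# Illustrative input values, not rows from the real databases.
raw_names = ["Boeing", "Boeing Inc.", "boeing inc", "Apple Inc"]
counts = Counter(normalize_company(n) for n in raw_names)
print(counts)
```

A key of this kind only merges spelling variants; names that differ in wording (for example a leading "The" or a trailing "Company") would still need a manual mapping.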
[Bar chart: Incident Database, escalation counts (Total). Green: 3831, Red: 125, Yellow: 329]
6. Algorithms
1. J48 Decision Tree:
a. Attributes selected: Escalation, Expectation, Modules, Severity.
b. [Chart: J48 Decision Tree, correctly classified vs. incorrectly classified instances; data labels: 20.779 and 70.22]
2. Naïve Bayes (RED & YELLOW corpus):
a. Attributes selected: Escalation, Expectation, Modules, Severity.
b. Probability distribution for:
i. RED Escalation: 0.242 (24.2%)
ii. YELLOW Escalation: 0.758 (75.8%)
iii. When Escalation is RED, the most likely Severity is URGENT, with probability 0.449 (44.9%)
iv. When Escalation is YELLOW, the most likely Severity is HIGH, with probability 0.634 (63.4%)
3. Simple K Means method:
a. Cluster 1 formed: YELLOW, Investigate Issue & Hotfix required, Installation, High
b. Cluster 2 formed: RED, Investigate Issue, Installation, High
Motivation for textual analysis:
The discussion between the client and the developer is captured in the ‘Comments’ attribute of the
Incidents Database. Analyzing it can unearth additional information about the defects (viz. what triggered the
escalation, the initial escalation of a defect, the nature of the client, etc.). This led to the use of R for
text mining.
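The Naïve Bayes figures in item 2 above are a class prior (how often RED or YELLOW occurs) and a per-class conditional probability of Severity. A minimal Python sketch of how such values can be estimated from counts follows. The records are hypothetical toy data, not rows from the Incidents Database, and the estimate is a plain unsmoothed relative frequency, which is an assumption and may differ from what the tool used for the analysis computes.

```python
from collections import Counter, defaultdict

# Hypothetical toy records (escalation, severity); not taken from the real data.
records = [
    ("RED", "URGENT"), ("RED", "URGENT"), ("RED", "HIGH"), ("RED", "MEDIUM"),
    ("YELLOW", "HIGH"), ("YELLOW", "HIGH"), ("YELLOW", "HIGH"),
    ("YELLOW", "HIGH"), ("YELLOW", "MEDIUM"), ("YELLOW", "URGENT"),
]

# Count how often each escalation class occurs, and each severity within a class.
class_counts = Counter(esc for esc, _ in records)
severity_counts = defaultdict(Counter)
for esc, sev in records:
    severity_counts[esc][sev] += 1

total = len(records)
# Prior: P(escalation)
priors = {esc: n / total for esc, n in class_counts.items()}
# Conditional: P(severity | escalation)
conditionals = {
    esc: {sev: n / class_counts[esc] for sev, n in sev_counts.items()}
    for esc, sev_counts in severity_counts.items()
}

for esc in priors:
    likely = max(conditionals[esc], key=conditionals[esc].get)
    print(f"P({esc}) = {priors[esc]:.3f}; most likely severity: {likely} "
          f"(P = {conditionals[esc][likely]:.3f})")
```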
7. Text Mining using R
Why R over NLTK (Python)?
Easy to code, abundant packages
Faster pre-processing of the text
Mining the e-mail dump:
a. Create the corpus (RED, YELLOW & GREEN)
b. Pre-process the text (remove punctuation, stop words, numbers, noise)
c. Apply the ‘tm’ package for text mining the corpus
d. Extract graphs and word clouds of the trigger points that are causing escalations
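The pre-processing and frequency-counting steps above can be sketched in a few lines. The project itself used R and the ‘tm’ package; the version below is a standard-library Python illustration of the same idea, not a port of the original code. The stop-word list, the function names and the sample comments are all assumptions.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "for", "we", "this", "on", "in"}

def preprocess(text: str) -> list[str]:
    """Lower-case the text, strip numbers and punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and other noise
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def term_frequencies(documents: list[str]) -> Counter:
    """Count how often each remaining term occurs across a corpus."""
    freq = Counter()
    for doc in documents:
        freq.update(preprocess(doc))
    return freq

# Hypothetical comments standing in for one escalation corpus.
red_corpus = [
    "Please escalate this case, the agent installation failed 3 times!",
    "We need support on the escalation; please respond.",
]
top_terms = term_frequencies(red_corpus).most_common(3)
print(top_terms)
```

The resulting term counts are the kind of input a word cloud or frequency graph would be drawn from.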
9. Final escalation state = GREEN; Observations made prior to RED
[Figure labels: "Most frequently used"; "The affected module"]
10. Final escalation state = GREEN; Observations made prior to YELLOW
[Figure labels: "Aiding words / Prefix-Postfix"; "Most frequently used"; "Words with highest frequency mined"]
11. Final escalation state = YELLOW; Observations made prior to RED
(only 4 cases)
[Figure label: "Developer who is associated with the bug/incident"]
12. Final escalation state = RED; Observations made prior to RED
(Incidents jumped to RED from the YELLOW state)
[Figure labels: "Most frequently used"; "The affected module"]
13. Observations made on RED corpus
(The whole RED escalated dump)
The term “escalation” used along with “please” and “support” indicates that the
escalation is RED or that it will be converted to RED.
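The observation above can be turned into a simple keyword rule. The Python sketch below is only a heuristic derived from that observation, not a validated classifier and not part of the original analysis; the function name `has_red_trigger` and the sample comments are hypothetical.

```python
import re

# The three terms named in the observation on the RED corpus.
TRIGGER_TERMS = ("escalation", "please", "support")

def has_red_trigger(comment: str) -> bool:
    """True when "escalation", "please" and "support" all occur in the comment."""
    words = set(re.findall(r"[a-z]+", comment.lower()))
    return all(term in words for term in TRIGGER_TERMS)

# Hypothetical comments, for illustration only.
comments = [
    "Please support us, this escalation is blocking production.",
    "Please check the attached logs.",
]
flagged = [c for c in comments if has_red_trigger(c)]
print(flagged)
```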
14. Observations made on GREEN corpus
(The whole GREEN escalated dump)
The use of “please” is not frequent, which in turn indicates that there are
not many RED escalations in the incident history.
[Bar chart: Escalation count on the defect dump (Total). Green: 3831, Red: 125, Yellow: 329]
15. Other observations made on Incidents
For RED cases, the average number of days for a case to get escalated:
(Where SEVERITY is URGENT) 13.56 days
(Where SEVERITY is HIGH) 25.29 days
(Where SEVERITY is MEDIUM) 19.66 days
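Averages of this kind come from grouping cases by severity and averaging the number of days between two dates. A minimal Python sketch follows. The field layout (severity, date opened, date escalated) and the sample rows are assumptions; the slides do not say which date columns were used.

```python
from collections import defaultdict
from datetime import date

# Hypothetical RED cases: (severity, date opened, date escalated).
cases = [
    ("URGENT", date(2015, 1, 1), date(2015, 1, 11)),
    ("URGENT", date(2015, 2, 1), date(2015, 2, 21)),
    ("HIGH", date(2015, 3, 1), date(2015, 3, 31)),
]

def average_days_to_escalation(rows):
    """Average number of days from opening to escalation, per severity."""
    days_by_severity = defaultdict(list)
    for severity, opened, escalated in rows:
        days_by_severity[severity].append((escalated - opened).days)
    return {sev: sum(days) / len(days) for sev, days in days_by_severity.items()}

averages = average_days_to_escalation(cases)
print(averages)
```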
16. Analyzing Incidents: Customers vs Escalations
RHEINENERGIE, HEWLETT PACKARD, DEUTSCHE BUNDESBANK: highest number of
RED escalations
[Bar chart: RED escalations per customer (Total); x-axis: customer names. Data labels: 4 for the top customer, 3 for the next two, 2 for the next fifteen, and 1 for each of the remaining customers]
17. Analyzing Incidents: Modules vs Escalations
Total RED escalations: 125/6433. The chart below shows the modules with the highest number of
escalations.
Ops - Action Agent (opcacta) & Installation: highest number of RED escalations
[Bar chart: RED escalations per module (Total). Data labels in descending order: 31, 12, 9, 9, 8, 6, 5, 5, 4, 4, 4, 3, 3, 2 (x7), 1 (x8)]
20. Analyzing Incidents: Developer vs Escalations
prasad.m.k_hp.com: handled a high number of escalations
[Bar chart: escalations per developer (Total). Data labels in descending order: 29, 15, 10, 10, 8, 8, 5 (x4), 3, 3, 2 (x5), 1 (x9)]
21. Analyzing CR data
Escalations in CR (count of ESCALATION):
ESCALATION     Count
N              10219
Showstopper       75
Y                 93
Grand Total    10387
Note: For defects or CRs (QCCR), Showstopper is marked for defects that are
must-fixes or that need an immediate fix for a release.
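To put the CR escalation counts in proportion, the short Python sketch below computes each category's share of the grand total. The counts are the ones reported on this slide; the variable names are illustrative.

```python
# Counts of the ESCALATION field in the CRs Database, as reported on the slide.
cr_counts = {"N": 10219, "Showstopper": 75, "Y": 93}

grand_total = sum(cr_counts.values())
# Share of each category as a percentage of all CRs.
shares = {label: 100 * count / grand_total for label, count in cr_counts.items()}

for label, pct in shares.items():
    print(f"{label}: {cr_counts[label]} of {grand_total} ({pct:.2f}%)")
```

Both escalated categories together cover well under 2% of the CRs, so the escalated classes are heavily outnumbered by the "N" class.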
24. Analyzing CRs: S/w release vs Escalations
Release 11: highest number of “Showstopper” and “Y” escalations
[Bar chart: escalations per software release; series: Showstopper, Y; y-axis from 0 to 70]
25. Analyzing CRs: OS vs Escalations
Windows (version number not clear): highest number of escalations, both
“Showstopper” and “Y”
[Bar chart: escalations per OS; series: Showstopper, Y; y-axis from 0 to 10]
Note: Submitters of CRs tend to fill in the OS field as they see fit.
Some choose the exact version where the issue was seen or reported,
while others choose only a high-level value.
No strict rules were observed.
27. Company behavior analysis: RHEINENERGIE
(Had maximum RED escalations)
28 incident cases
Patterns observed:
◦ 6 RED escalations (6/28); 21.43% chance that a logged incident will be a
RED escalation
◦ Most reported modules:
◦ Ops - Monitor Agent (opcmona) (7 nos.); 3 of them were RED escalated
◦ Installation (6 nos.)
◦ Perf - Collector (3 nos.)
◦ Average number of days to handle a single incident: 73.5 days
◦ Number of incidents that moved to CR: 15; 53.57% of the incidents moved to CRs
◦ All 6 RED escalations moved to CR
◦ 8 GREEN escalations moved to CR
◦ 1 YELLOW escalation moved to CR
28. Company behavior analysis: APPLE INC
27 incident cases
Patterns observed:
No RED escalations ever
Mostly contains GREEN escalations (19/27); 70.37% chance that a logged incident will be a
GREEN escalation
Most reported modules:
◦ Ops Monitor Agent (4 nos)
◦ Perf Collector (3 nos)
◦ Installation, Ops- Action Agent, Ops- Ops Agent, Perf Other (2 nos each)
Average number of days to handle a single incident: 463.777 days
Number of incidents that moved to CR: 10; 37.04% of the incidents moved to CRs
29. Company behavior analysis: BOEING
33 incident cases
Patterns observed:
1 RED escalation
Mostly contains GREEN escalations (31/33); 93.94% chance that a logged incident will be a GREEN
escalation
Most reported modules:
◦ Installation (7 nos.)
◦ Perf Collector, Other (5 nos.)
◦ Perf GlancePlus (4 nos.)
◦ Perf ARM (RED escalation); 3% chance that an incident will be a RED escalation
Average number of days to handle a single incident: 399.322 days
Number of incidents that moved to CR: 22; 66.67% of the incidents moved to CRs
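The percentages in the three company profiles are simple ratios of counts. The Python sketch below recomputes them from the counts given on the slides, rounded to two decimals. The helper name `share` and the dictionary layout are illustrative assumptions.

```python
def share(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)

# Counts as reported on the company slides.
companies = {
    "RHEINENERGIE": {"incidents": 28, "red": 6, "moved_to_cr": 15},
    "APPLE INC": {"incidents": 27, "red": 0, "moved_to_cr": 10},
    "BOEING": {"incidents": 33, "red": 1, "moved_to_cr": 22},
}

# Derive the RED share and the share of incidents that moved to CR per company.
profile = {
    name: {
        "red_pct": share(c["red"], c["incidents"]),
        "cr_pct": share(c["moved_to_cr"], c["incidents"]),
    }
    for name, c in companies.items()
}

for name, p in profile.items():
    print(f"{name}: RED {p['red_pct']}%, moved to CR {p['cr_pct']}%")
```

With samples of around 30 incidents per company, these percentages are rough indicators rather than stable probabilities.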
30. Other observations made on Incidents
DIFFERENCE_INITIAL_CLOSED and DAYS_SUPPORT_TO_CPE do not match
[Line chart: DIFFERENCE_INITIAL_CLOSED and DAYS_SUPPORT_TO_CPE for cases 1 to 121; y-axis from -400 to 500]