Software Analytics: Towards Software Mining that Matters
1. Software Analytics: Towards Software Mining that Matters
Tao Xie
University of Illinois at Urbana-Champaign
http://www.cs.illinois.edu/homes/taoxie/
taoxie@illinois.edu
6. Software analytics is to enable software practitioners
to perform data exploration and analysis in order to
obtain insightful and actionable information for data-driven tasks around software and services.
[MALETS’11 Zhang et al.]
7. Software Intelligence &
Analytics for Software Development
http://people.engr.ncsu.edu/txie/publications/foser10-si.pdf
http://thomas-zimmermann.com/publications/files/buse-foser-2010.pdf
8. • use Data Exploration and Analysis
Mining Software Repositories (MSR)
• for Software Practitioners
Beyond Software Developers
• obtain Insightful and Actionable info
Need to get real as well
• Analytic Techniques
• Producing Impact on Practice
39. • use Data Exploration and Analysis
Mining Software Repositories (MSR)
• for Software Practitioners
Beyond Software Developers
• obtain Insightful and Actionable info
Need to get real as well
• Analytic Techniques
• Case Studies
40. Predicting Bugs
• Studies have shown that most complexity metrics
correlate well with LOC!
– Graves et al. 2000 on commercial systems
– Herraiz et al. 2007 on open source systems
• Noteworthy findings:
– Previous bugs are good predictors of future bugs
– The more a file changes, the more likely it will have
bugs in it
– Recent changes affect the bug potential of a file more than older changes do
(weighted time damp models)
– Number of developers is of little help in predicting bugs
– Hard to generalize bug predictors across projects
unless in similar domains [Nagappan, Ball et al. 2006]
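The finding that recent changes matter more than old ones can be illustrated with a small sketch. The slide only names "weighted time damp models" and does not give a formula, so the exponential half-life decay below, the class name `ChangeDecayScore`, and the `halfLifeDays` parameter are all assumptions chosen for illustration, not the model used in the cited studies.

```java
import java.util.List;

/** Hypothetical sketch: score a file by its changes, weighting recent changes more heavily. */
final class ChangeDecayScore {

    /**
     * Sums one exponentially decayed weight per change.
     * A change made today contributes 1.0; a change that is halfLifeDays old contributes 0.5.
     * The half-life form is an assumption, not taken from the slides.
     */
    static double score(List<Double> changeAgesInDays, double halfLifeDays) {
        double total = 0.0;
        for (double age : changeAgesInDays) {
            total += Math.pow(0.5, age / halfLifeDays);
        }
        return total;
    }

    public static void main(String[] args) {
        // Two files with three changes each; only the timing of the changes differs.
        double recentlyChanged = score(List.of(1.0, 3.0, 7.0), 30.0);
        double changedLongAgo = score(List.of(300.0, 400.0, 500.0), 30.0);
        System.out.println("recently changed file: " + recentlyChanged);
        System.out.println("file changed long ago: " + changedLongAgo);
    }
}
```

With such a score, two files with the same number of changes rank differently: the recently changed file scores higher, which matches the direction of the finding above.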
41. Using Imports in Eclipse to Predict Bugs
71% of files that import compiler packages had to be fixed later on.
import org.eclipse.jdt.internal.compiler.lookup.*;
import org.eclipse.jdt.internal.compiler.*;
import org.eclipse.jdt.internal.compiler.ast.*;
import org.eclipse.jdt.internal.compiler.util.*;
...
import org.eclipse.pde.core.*;
import org.eclipse.jface.wizard.*;
import org.eclipse.ui.*;
14% of all files that import ui packages had to be fixed later on.
[Schröter et al. 06]
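A statistic of this kind can be computed from a list of files, their imports, and a flag saying whether each file was fixed later. The sketch below is a minimal illustration under stated assumptions: the `SourceFile` type, the `fixedFraction` helper, and the five toy files are hypothetical and are not the data or tooling of the cited study.

```java
import java.util.List;

/** Hypothetical sketch: fraction of files importing a package family that were fixed later. */
final class ImportDefectRate {

    /** Minimal stand-in for a mined file record (assumed shape, not from the study). */
    static final class SourceFile {
        final List<String> imports;
        final boolean fixedLater;

        SourceFile(List<String> imports, boolean fixedLater) {
            this.imports = imports;
            this.fixedLater = fixedLater;
        }
    }

    /** Among files with at least one import starting with the prefix, the share fixed later. */
    static double fixedFraction(List<SourceFile> files, String packagePrefix) {
        int importing = 0;
        int fixed = 0;
        for (SourceFile file : files) {
            boolean matches = file.imports.stream().anyMatch(i -> i.startsWith(packagePrefix));
            if (matches) {
                importing++;
                if (file.fixedLater) {
                    fixed++;
                }
            }
        }
        return importing == 0 ? 0.0 : (double) fixed / importing;
    }

    /** Toy data for illustration only. */
    static List<SourceFile> toyFiles() {
        return List.of(
            new SourceFile(List.of("org.eclipse.jdt.internal.compiler.ast"), true),
            new SourceFile(List.of("org.eclipse.jdt.internal.compiler.util"), true),
            new SourceFile(List.of("org.eclipse.jdt.internal.compiler"), false),
            new SourceFile(List.of("org.eclipse.ui"), false),
            new SourceFile(List.of("org.eclipse.ui", "org.eclipse.jface.wizard"), true));
    }

    public static void main(String[] args) {
        List<SourceFile> files = toyFiles();
        System.out.println(fixedFraction(files, "org.eclipse.jdt.internal.compiler")); // 2 of 3 files
        System.out.println(fixedFraction(files, "org.eclipse.ui"));                    // 1 of 2 files
    }
}
```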
42. Don’t program on Fridays ;-)
Percentage of bug-introducing changes for Eclipse
[Zimmermann et al. 05]
43. Failure is a 4-letter Word
[PROMISE’11 Zeller et al.]
48. Analytic Techniques in SE
• Association rules and frequent patterns
• Classification
• Clustering
• Text mining/Natural language processing
• Visualization
More details are at
• https://sites.google.com/site/xsoftanalytics/
49. Solution-Driven vs. Problem-Driven
Solution-driven: where can I apply X miner?
• Basic mining algorithms, e.g., association rule, frequent itemset mining…
• Advanced mining algorithms, e.g., frequent partial order mining [ESEC/FSE 07]
Problem-driven: what patterns do we really need?
• New/adapted mining algorithms, e.g., [ICSE 09], [ASE 09]
52. Mining vs. Searching + Mining
Traditional approaches: mine patterns from a few code repositories (1, 2: Eclipse, Linux, …)
• Often lack sufficient relevant data points (e.g., API call sites)
Our new approaches: search many code repositories (1, 2, …, N) and open source code on the web with a code search engine, then mine patterns from the search results
57. Existing approaches produce high % of false positives
One major observation:
• Programmers often write code in different ways for achieving the same task
• Some ways are more frequent than others
[Figure: existing approaches mine patterns from the frequent ways, then detect violations of the mined patterns in the infrequent ways]
S. Thummalapenta and T. Xie. Alattin: Mining Alternative Patterns for Detecting Neglected Conditions. ASE 2009.
58. Example: java.util.Iterator.next()
java.util.Iterator.next() throws NoSuchElementException when the iteration has no more elements, e.g., when invoked on the iterator of a list without any elements
Code Sample 1
void PrintEntries1(ArrayList<String> entries) {
    …
    Iterator<String> it = entries.iterator();
    // Guard: ask the iterator whether an element is available
    if (it.hasNext()) {
        String last = it.next();
    }
    …
}
Code Sample 2
void PrintEntries2(ArrayList<String> entries) {
    …
    // Guard: check that the list is non-empty before iterating
    if (entries.size() > 0) {
        Iterator<String> it = entries.iterator();
        String last = it.next();
    }
    …
}
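The behavior behind both code samples can be checked with a small runnable sketch. The class and method names below (`IteratorNextDemo`, `firstViaHasNext`, `firstViaSize`, `unguardedThrows`) are hypothetical helpers written for this illustration. Returning null for an empty list is likewise a choice made for the demo, not something the slides prescribe.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical demo of the two guards from Code Sample 1 and Code Sample 2. */
final class IteratorNextDemo {

    /** Guard in the style of Code Sample 1: ask the iterator itself. */
    static String firstViaHasNext(List<String> entries) {
        Iterator<String> it = entries.iterator();
        return it.hasNext() ? it.next() : null;
    }

    /** Guard in the style of Code Sample 2: check the collection size first. */
    static String firstViaSize(List<String> entries) {
        if (entries.size() > 0) {
            return entries.iterator().next();
        }
        return null;
    }

    /** No guard at all: reports whether next() threw NoSuchElementException. */
    static boolean unguardedThrows(List<String> entries) {
        try {
            entries.iterator().next();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> empty = new ArrayList<>();
        List<String> one = new ArrayList<>(List.of("a"));

        System.out.println(unguardedThrows(empty));  // true: next() on an empty list throws
        System.out.println(unguardedThrows(one));    // false
        System.out.println(firstViaHasNext(empty));  // null, no exception
        System.out.println(firstViaSize(empty));     // null, no exception
        System.out.println(firstViaHasNext(one));    // a
        System.out.println(firstViaSize(one));       // a
    }
}
```

Both guards protect the call equally well, which is why a checker that accepts only one of them reports the other as a violation.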
65. Example: java.util.Iterator.next()
Code Sample 1 (it.hasNext() check): 1218 / 1243 code examples
Code Sample 2 (entries.size() > 0 check): 6 / 1243 code examples
Mined Pattern from existing approaches:
“boolean check on return of Iterator.hasNext before Iterator.next”
71. Example: java.util.Iterator.next()
Need more general patterns (alternative patterns): P1 or P2
P1: boolean check on return of Iterator.hasNext before Iterator.next (Code Sample 1)
P2: const check on return of ArrayList.size before Iterator.next (Code Sample 2)
Cannot be mined by existing approaches, since alternative P2 is infrequent
74. Our Solution: ImMiner Algorithm [ASE 09]
Mines alternative patterns of the form P1 or P2
Based on the observation that infrequent alternatives such as P2 are frequent among code examples that do not support P1
1243 code examples: Sample 1 (1218 / 1243), Sample 2 (6 / 1243)
• P2 is infrequent among the entire 1243 code examples
• P2 is frequent among code examples not supporting P1
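The observation can be made concrete with the counts on the slide. The sketch below assumes that the Sample 1 count (1218) is the number of examples supporting P1, and that all 6 Sample 2 examples lie among the examples that do not support P1. The slides suggest both but do not state them explicitly. The `support` helper is a hypothetical name.

```java
/** Hypothetical worked arithmetic for the support of P2, using the counts from the slide. */
final class SupportArithmetic {

    /** Support = fraction of examples in a set that follow a pattern. */
    static double support(int supporting, int total) {
        return (double) supporting / total;
    }

    public static void main(String[] args) {
        int total = 1243;        // all code examples
        int supportingP1 = 1218; // assumed: the Sample 1 count
        int supportingP2 = 6;    // assumed: the Sample 2 count

        int notSupportingP1 = total - supportingP1; // 25 examples remain

        // P2 over all examples: 6 / 1243, roughly half a percent
        System.out.println(support(supportingP2, total));
        // P2 over examples not supporting P1: 6 / 25, i.e. 24 percent
        System.out.println(support(supportingP2, notSupportingP1));
    }
}
```

Restricting attention to the 25 examples that do not support P1 raises the support of P2 from about 0.5% to 24%, roughly a fifty-fold increase.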
76. Alternative Patterns
ImMiner mines three kinds of alternative patterns of the general form “P1 or P2”
• Balanced: all alternatives (both P1 and P2) are frequent
• Imbalanced: some alternatives (P1) are frequent and others (P2) are infrequent. Represented as “P1 or P̂2”
• Single: only one alternative
77. ImMiner Algorithm
Uses frequent-itemset mining [Burdick et al. ICDE 01] iteratively
An input database with the following APIs for Iterator.next()
[Figure: input database; mapping of IDs to APIs]
78. ImMiner Algorithm: Frequent Alternatives
Input database → frequent itemset mining (min_sup 0.5) → Frequent item: 1
P1: boolean-check on the return of Iterator.hasNext() before Iterator.next()
84. ImMiner: Infrequent Alternatives of P1
Split input database into two databases: Positive and Negative
• Positive database (PSD)
• Negative database (NSD)
Mine patterns that are frequent in NSD and are infrequent in PSD
Reason: Only such patterns serve as alternatives for P1
Alternative Pattern: P2 “const check on the return of ArrayList.size() before Iterator.next()”
Alattin applies the ImMiner algorithm to detect neglected conditions
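The split-and-mine step can be sketched on a toy database. This is a simplified illustration, not the authors' ImMiner implementation: it counts single items instead of running an iterative frequent-itemset miner, and it assumes PSD holds the transactions that contain P1 while NSD holds the rest. The item IDs (1 for the hasNext check, 2 for the size check, 3 for an unrelated call), the toy transactions, and all names are hypothetical. Only the 0.5 minimum support comes from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical single-item sketch of mining infrequent alternatives of a frequent pattern. */
final class AlternativePatternSketch {

    /** Fraction of transactions in db that contain the item. */
    static double support(List<Set<Integer>> db, int item) {
        if (db.isEmpty()) {
            return 0.0;
        }
        long hits = db.stream().filter(t -> t.contains(item)).count();
        return (double) hits / db.size();
    }

    /** Items frequent in NSD (transactions without p1) and infrequent in PSD (transactions with p1). */
    static Set<Integer> alternativesOf(int p1, List<Set<Integer>> db, double minSup) {
        List<Set<Integer>> psd = new ArrayList<>();
        List<Set<Integer>> nsd = new ArrayList<>();
        for (Set<Integer> transaction : db) {
            (transaction.contains(p1) ? psd : nsd).add(transaction);
        }
        // Candidate alternatives are the items that occur somewhere in NSD.
        Set<Integer> candidates = new TreeSet<>();
        for (Set<Integer> transaction : nsd) {
            candidates.addAll(transaction);
        }
        Set<Integer> alternatives = new TreeSet<>();
        for (int item : candidates) {
            if (support(nsd, item) >= minSup && support(psd, item) < minSup) {
                alternatives.add(item);
            }
        }
        return alternatives;
    }

    /** Toy data: 1 = hasNext check, 2 = size check, 3 = unrelated call. */
    static List<Set<Integer>> toyDatabase() {
        return List.of(
            Set.of(1, 3), Set.of(1, 3), Set.of(1, 3), Set.of(1, 3), Set.of(1), Set.of(1),
            Set.of(2, 3), Set.of(2), Set.of(2, 3));
    }

    public static void main(String[] args) {
        List<Set<Integer>> db = toyDatabase();
        System.out.println(support(db, 1));             // 6 of 9: frequent at min_sup 0.5
        System.out.println(support(db, 2));             // 3 of 9: infrequent overall
        System.out.println(alternativesOf(1, db, 0.5)); // [2]
    }
}
```

Item 3 is frequent in NSD as well, but it is also frequent in PSD, so it is not reported as an alternative. This mirrors the reason given on the slide for requiring infrequency in PSD.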
85. Neglected Conditions
Neglected conditions refer to
• Missing conditions that check the arguments or receiver of the API call before the API call
• Missing conditions that check the return or receiver of the API call after the API call
One primary reason for many fatal issues, e.g., security or buffer-overflow vulnerabilities [Chang et al. ISSTA 07]
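As an illustration of the second kind of neglected condition, the sketch below omits and then adds a check on the return value of `Map.get`, which returns null when the key is absent. The choice of `Map.get` and the helper names are assumptions made for this example. The slides do not name this API.

```java
import java.util.Map;

/** Hypothetical demo of a neglected condition on the return value of an API call. */
final class NeglectedConditionDemo {

    /** Neglected condition: the return of get() is used without a null check. */
    static int lengthUnchecked(Map<String, String> settings, String key) {
        return settings.get(key).length(); // NullPointerException if the key is absent
    }

    /** Condition present: the return of get() is checked after the call. */
    static int lengthChecked(Map<String, String> settings, String key) {
        String value = settings.get(key);
        return value == null ? 0 : value.length();
    }

    public static void main(String[] args) {
        Map<String, String> settings = Map.of("name", "xyz");
        System.out.println(lengthChecked(settings, "name"));    // 3
        System.out.println(lengthChecked(settings, "missing")); // 0
        try {
            lengthUnchecked(settings, "missing");
        } catch (NullPointerException e) {
            System.out.println("unchecked call failed on a missing key");
        }
    }
}
```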
87. Machine Learning that Matters
[ICML’12 Wagstaff]
http://arxiv.org/ftp/arxiv/papers/1206/1206.4656.pdf
88. • Hyper-Focus on Benchmark Data Sets
• Hyper-Focus on Abstract Metrics
• Lack of Follow-Through
[ICML’12 Wagstaff]
http://arxiv.org/ftp/arxiv/papers/1206/1206.4656.pdf
89. • Meaningful Evaluation Methods
• Involvement of the World Outside ML
• Eyes on the Prize
[ICML’12 Wagstaff]
http://arxiv.org/ftp/arxiv/papers/1206/1206.4656.pdf
94. MSRA Software Analytics Group
Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services.
Research Topics (vertical): Software Users, Software Development Process, Software Systems
Technology Pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing
Contact: Dongmei Zhang (dongmeiz@microsoft.com)
http://research.microsoft.com/groups/sa/
100. "Are Automated Debugging [Research]
Techniques Actually Helping Programmers?"
• 50 years of automated debugging research
– N papers, only 5 evaluated with actual programmers
[ISSTA11 Parnin & Orso]
101. Are Regression Testing [Research] Techniques
Actually Helping Industry?
• Likely one of the most studied testing problems
– N papers
[STVR11 Yoo & Harman]
102. Are [Some] Failure-Proneness Prediction
[Research] Techniques Actually Helping?
• Empirical software engineering (on prediction)
– N papers
[PROMISE11 Zeller et al.]
111. MS Academic Search: “Clone Detection”
Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps
112. Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/en-us/groups/sa/
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
http://patterninsight.com/
http://www.blackducksoftware.com/
113. Suggested Actions Tech Adoption
• Get research problems from real practice
• Get feedback from real practice
• Collaborate across disciplines
• Collaborate with industry
114. • Software Analytics
Data Exploration and Analysis
For Software Practitioners
Obtain Insightful and Actionable info
With Analytic Techniques
• Producing Impact on Practice