MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY
Anna Corazza, Sergio Di Martino, Valerio Maggio
Alessandro Moschitti, Andrea Passerini, Giuseppe Scanniello,
Fabrizio Silvestri
JIMSE 2012
August 28, 2012 Montpellier, France
SOFTWARE MAINTENANCE
“A software system must be continuously
adapted during its overall life cycle or it
progressively becomes less satisfactory”
(cit. Lehman’s Law of Software Evolution)
• Software maintenance is one of the most expensive and time-consuming phases of the whole life cycle
• Anticipating maintenance operations reduces the cost
• 85%-90% of the total cost is related to the effort necessary to comprehend the system and its source code [Erlikh, 2000]
Software Artifacts
[Figure: software artifacts of a system: UI Components, UI Process Components, Business Components, Business Workflows, Application Facade, Data Access Components, Data Helpers / Utilities, Service Interfaces, Messages Interfaces, Security, Operational Management, Communications]
• Provide models and views representing the relationships among different software artifacts
• Clustering of Software Artifacts
• Advantages:
  • To aid comprehension
  • To reduce maintenance effort
SOFTWARE ARCHITECTURE
Clusters of Software Artifacts
[Figure: the software artifacts grouped into clusters]
• External Systems, Service Consumers
• Services: Service Interfaces, Messages Interfaces
• Cross Cutting: Security, Operational Management, Communications
• Data: Data Access Components, Data Helpers / Utilities
• Presentation: UI Components, UI Process Components
• Business: Application Facade, Business Workflows, Business Components
Software artifacts may be analyzed at different levels of abstraction.
The different levels of abstraction lead to different analysis tasks:
• Identification of functional modules and their hierarchical arrangement
  • i.e., clustering of software classes
• Identification of code clones
  • i.e., clustering of duplicated code fragments (blocks,
SOFTWARE ARTIFACTS
• Mine information directly from the source code:
  • Exploit the syntactic/lexical information provided in the source code text
  • Exploit the relational information between artifacts
    • e.g., Program Dependencies
Problem: defining a proper similarity measure for the clustering analysis, one able to exploit the chosen representation of software artifacts
SOFTWARE ARTIFACTS CLUSTERING
• Analysis of large and complex systems
• Solutions and algorithms must be able to scale efficiently (in the large and in the many)
MINING LARGE REPOSITORIES
Idea: Define Machine Learning techniques to mine information from the source code
• Combine different kinds of information (lexical and structural)
• Application of Kernel Methods to software artifacts
• Provide flexible and computationally effective solutions to analyze large data sets
Advanced Machine Learning
• Learning with syntactic/semantic information (Natural Language Processing)
• Learning in relational domains (Structured-output learning, Logic Learning, Statistical Relational Learning)
ADVANCED MACHINE LEARNING FOR SOFTWARE MAINTENANCE
KERNEL METHODS FOR STRUCTURED DATA
• A kernel is a function between (arbitrary) pairs of entities
• It can be seen as a kind of similarity measure
• Based on the idea that structured objects can be described in terms of their constituent parts
• Generalizes the computation of the dot product to arbitrary domains
• Can be easily tailored to specific domains
  • Tree Kernels
  • Graph Kernels
  • ....
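The idea of a kernel as a dot product over constituent parts can be illustrated with a minimal sketch. It is not taken from the slides: the names `parts_kernel` and `normalized` are hypothetical, and whitespace-separated tokens are assumed to be the "parts" of a code fragment.

```python
from collections import Counter

def parts_kernel(x, y):
    """Count matching pairs of parts between two objects.

    Each object is an iterable of hashable parts (here: tokens).
    The value equals the dot product of the two part-count vectors.
    """
    cx, cy = Counter(x), Counter(y)
    return sum(cx[p] * cy[p] for p in cx.keys() & cy.keys())

def normalized(kernel, x, y):
    """Cosine-style normalization, so identical objects score 1.0."""
    kxx, kyy = kernel(x, x), kernel(y, y)
    if kxx == 0 or kyy == 0:
        return 0.0
    return kernel(x, y) / (kxx * kyy) ** 0.5

# Toy fragments (assumed for illustration only).
a = "i = i + 1".split()
b = "j = j + 1".split()
shared = parts_kernel(a, b)               # the tokens "=", "+" and "1" match
similarity = normalized(parts_kernel, a, b)
```

Tree and graph kernels follow the same pattern, with subtrees or paths taking the place of tokens.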
KERNELS FOR STRUCTURES
[Figure: computation of the dot product between (graph) structures]
• Parse trees represent the syntactic structure of a sentence
• Tree Kernels can be used to measure the similarity between parse trees
KERNELS FOR LANGUAGES
• Abstract Syntax Trees (AST) represent the syntactic structure of a piece of code
• Research on Tree Kernels for NLP carries over to AST (with adjustments)
KERNELS FOR SOURCE CODE
KERNELS FOR PARSE TREES
KERNELS FOR AST
• Supervised Learning
  • Binary Classification
  • Multi-class Classification
  • Ranking
• Unsupervised Learning
  • Clustering
  • Anomaly Detection
Idea: Any learning algorithm relying on a similarity measure can be used
KERNEL MACHINES
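To make the point about similarity-based learners concrete, here is a minimal sketch of a kernel perceptron, one simple learner that only ever touches the data through a kernel. The choice of the perceptron, the toy kernel `token_overlap`, and the four training fragments are assumptions for illustration, not the method used in the talk.

```python
def kernel_perceptron(examples, labels, kernel, epochs=10):
    """Train a kernel perceptron; labels are +1 or -1.

    Returns one mistake counter per training example.
    """
    alpha = [0] * len(examples)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(examples, labels)):
            score = sum(a * yj * kernel(xj, x)
                        for a, xj, yj in zip(alpha, examples, labels))
            if y * score <= 0:          # misclassified (or undecided)
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:               # converged on the training set
            break
    return alpha

def predict(x, examples, labels, alpha, kernel):
    score = sum(a * yj * kernel(xj, x)
                for a, xj, yj in zip(alpha, examples, labels))
    return 1 if score > 0 else -1

def token_overlap(x, y):
    """Toy kernel: number of distinct tokens shared by two fragments."""
    return len(set(x.split()) & set(y.split()))

# Hypothetical training set: +1 = loop header, -1 = assignment.
examples = ["for i in range ( n )", "while i < n", "x = y + 1", "total = a + b"]
labels = [1, 1, -1, -1]
alpha = kernel_perceptron(examples, labels, token_overlap)
loop_guess = predict("for j in items", examples, labels, alpha, token_overlap)
assign_guess = predict("z = x + y", examples, labels, alpha, token_overlap)
```

Swapping `token_overlap` for a tree or graph kernel changes what "similar" means without touching the learner.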
KERNEL MACHINES FOR CLONE DETECTION
• Supervised Learning
  • Pairwise classifier: predict whether a pair of fragments is a clone
• Unsupervised Learning
  • Clustering: cluster together all candidate clones
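The unsupervised variant can be sketched as follows, under stated assumptions: fragments are linked whenever their similarity reaches a threshold, and clusters are the connected components of those links. The function names, the Jaccard stand-in for a kernel, and the sample fragments are all hypothetical.

```python
from itertools import combinations

def jaccard(x, y):
    """Toy similarity in [0, 1] on token sets (a stand-in for a kernel)."""
    a, b = set(x.split()), set(y.split())
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_clones(fragments, similarity, threshold):
    """Group fragments whose pairwise similarity reaches the threshold.

    Pairs at or above the threshold are linked; clusters are the connected
    components of the resulting graph (single-linkage style).
    """
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(fragments)), 2):
        if similarity(fragments[i], fragments[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(fragments)):
        groups.setdefault(find(i), []).append(i)
    # Singletons are not clone candidates, so they are dropped.
    return sorted(g for g in groups.values() if len(g) > 1)

fragments = ["x = a + b", "y = a + b", "print ( x )", "print ( y )", "return 0"]
clusters = cluster_clones(fragments, jaccard, threshold=0.5)
```

The result lists fragment indices per cluster; the threshold plays the same role as the similarity thresholds explored in the evaluation.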
KERNELS FOR CLONES
KERNEL LEARNING
• Construct a number of candidate kernels with different characteristics
  • e.g., ignore variable names or not
• Employ kernel learning approaches which learn a weighted combination of candidate kernels
• Useless/harmful kernels will get zero weight and will be discarded in the final model
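A minimal sketch of combining candidate kernels is shown below. It assumes two hypothetical token-based candidates, one that keeps variable names and one that replaces every identifier with a placeholder. The weights are set by hand here; a kernel learning method would estimate them from data.

```python
import keyword
from collections import Counter

def tokens(code, ignore_names=False):
    """Whitespace tokenizer; optionally abstract identifiers to 'ID'."""
    out = []
    for tok in code.split():
        if ignore_names and tok.isidentifier() and not keyword.iskeyword(tok):
            tok = "ID"
        out.append(tok)
    return out

def bag_kernel(x, y):
    cx, cy = Counter(x), Counter(y)
    return sum(cx[t] * cy[t] for t in cx.keys() & cy.keys())

def k_exact(a, b):
    """Candidate kernel that is sensitive to variable names."""
    return bag_kernel(tokens(a), tokens(b))

def k_abstract(a, b):
    """Candidate kernel that ignores variable names."""
    return bag_kernel(tokens(a, True), tokens(b, True))

def combine(kernels, weights):
    """Non-negative weighted sum of kernels, which is again a kernel."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    # Kernels with zero weight are dropped from the final model.
    active = [(w, k) for w, k in zip(weights, kernels) if w > 0]
    return lambda a, b: sum(w * k(a, b) for w, k in active)

a, b = "x = x + 1", "y = y + 1"
mixed = combine([k_exact, k_abstract], [0.5, 0.5])
names_ignored = combine([k_exact, k_abstract], [0.0, 1.0])
```

With the name-sensitive kernel switched off, the two fragments become indistinguishable, which is the behaviour wanted for renamed clones.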
LEARNING SIMILARITIES
Supervised Clustering
• Exploit information on already annotated pieces of software
• Training examples are software projects/portions annotated with existing clones (clustering)
• A learning model uses training examples to refine the similarity measure so that novel examples are clustered correctly
STRUCTURED-OUTPUT LEARNING
• Software has a rich structure and heterogeneous information
• Advanced Machine Learning approaches are promising for exploiting such information
• Kernel Methods are a natural candidate
  • e.g., see the analogy between NLP parse trees and AST
• Many applications:
  • architecture recovery, code clone detection, vulnerability detection ....
SUMMARY
CASE STUDY: KERNELS FOR CLONES
• Goal: “Identify and group all duplicated code fragments/functions”
• Copy&Paste programming
• Taxonomy of 4 different types of clones
  • Program text similarities and functional similarities
• Clones affect the reliability and the maintainability of a software system
CODE CLONE DETECTION
• Abstract Syntax Tree (AST)
  • Tree structure representing the syntactic structure of the different instructions of a program (function)
• Program Dependencies Graph (PDG)
  • (Directed) graph structure representing the relationships among the different statements of a program
KERNELS FOR CLONES
CODE STRUCTURES
Kernels for Structured Data:
• The source code could be represented by many different data structures
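ABSTRACT SYNTAX TREE (AST)
CODE STRUCTURES: AST
[Figure: AST of a function body, with nodes for assignments (=), a while loop, a print statement, the operators + and <, the variables k, i, p, and the literals 10, 0, 1.0, 7]
AST embeds both Syntactic and Lexical Information
• Program Instructions
• Names of Variables, Literals...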
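Both kinds of information can be seen with Python's standard `ast` module. The snippet below is a sketch: the parsed function is a hypothetical example loosely modelled on the node labels of the figure, not the code shown in the talk.

```python
import ast

# Hypothetical source, assumed for illustration.
source = """
def f():
    k = 10
    i = 0
    while i < 7:
        i = i + 1.0
    print(k)
"""

tree = ast.parse(source)

# Syntactic information: the kinds of instructions in the tree.
node_types = [type(n).__name__ for n in ast.walk(tree)]

# Lexical information: identifiers and literal values at the leaves.
names = sorted({n.id for n in ast.walk(tree) if isinstance(n, ast.Name)})
literals = sorted({n.value for n in ast.walk(tree)
                   if isinstance(n, ast.Constant)})
```

Here `node_types` contains entries such as `While` and `Assign`, while `names` and `literals` carry the lexemes.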
[Figure: PDG with nodes labelled while, call-site, expr, decl, param and arg]
CODE STRUCTURES: PDG
• Nodes correspond to instructions
• Edges represent relationships between pairs of nodes
PROGRAM DEPENDENCIES GRAPH (PDG)
CODE STRUCTURES: PDG
• Two Types of Nodes
  • Control Nodes (dashed)
    • e.g., if - for - while - function calls...
  • Data Nodes
    • e.g., expressions - parameters...
• Two Types of Edges (i.e., dependencies)
  • Control edges (dashed)
  • Data edges
NODES AND EDGES
[Figure: PDG fragment with nodes while, call-site, arg and expr]
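A PDG with typed nodes and typed edges could be represented as in the sketch below. The class and method names are hypothetical, and the example wiring between the while, call-site, expr and arg nodes is assumed, since the slide text does not record which edges the figure draws.

```python
from dataclasses import dataclass, field
from enum import Enum

class Kind(Enum):
    CONTROL = "control"
    DATA = "data"

@dataclass
class PDG:
    nodes: dict = field(default_factory=dict)   # node id -> (label, Kind)
    edges: list = field(default_factory=list)   # (source id, target id, Kind)

    def add_node(self, node_id, label, kind):
        self.nodes[node_id] = (label, kind)

    def add_edge(self, src, dst, kind):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added first")
        self.edges.append((src, dst, kind))

    def successors(self, node_id, kind=None):
        """Targets of outgoing edges, optionally filtered by edge type."""
        return [d for s, d, k in self.edges
                if s == node_id and (kind is None or k == kind)]

pdg = PDG()
pdg.add_node(0, "while", Kind.CONTROL)
pdg.add_node(1, "call-site", Kind.CONTROL)
pdg.add_node(2, "expr", Kind.DATA)
pdg.add_node(3, "arg", Kind.DATA)
pdg.add_edge(0, 1, Kind.CONTROL)    # assumed: the loop controls the call
pdg.add_edge(0, 2, Kind.CONTROL)    # assumed: the loop controls the expression
pdg.add_edge(2, 3, Kind.DATA)       # assumed: the expression feeds the argument
```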
DEFINING KERNELS FOR STRUCTURED DATA
• The definition of a new kernel for a structured object requires the definition of:
  • A set of features to annotate each part of the object
  • A kernel function to measure the similarity on the smallest parts of the object
    • e.g., nodes of ASTs and graphs
  • A kernel function to apply the computation on the different (sub)parts of the structured object
KERNELS FOR CODE STRUCTURES
• Features: each node is characterized by a set of 4 features
  • Instruction Class
    • e.g., LOOP, CONDITIONAL_STATEMENT, CALL
  • Instruction
    • e.g., FOR, IF, WHILE, RETURN
  • Context
    • i.e., Instruction Class of the closest statement node
  • Lexemes
    • Lexical information gathered (recursively) from leaves
KERNELS FOR CODE STRUCTURES: AST
Example, for a FOR node:
  Instruction Class = LOOP
  Instruction = FOR
  Context = (e.g., LOOP)
  Lexemes = (e.g., names of variables in FOR-INIT..)
TREE KERNELS FOR AST
[Figure: FOR node with children FOR-INIT and FOR-BODY]
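The four node features, and a kernel on the smallest parts, can be sketched as follows. The `NodeFeatures` type and the way `node_kernel` averages the features are assumptions for illustration; the slides do not state how the features are weighted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeFeatures:
    instruction_class: str      # e.g. "LOOP"
    instruction: str            # e.g. "FOR"
    context: str                # instruction class of the closest statement node
    lexemes: frozenset          # lexical items gathered from the leaves

def node_kernel(a, b):
    """Hypothetical node similarity in [0, 1].

    Averages three exact-match tests with the Jaccard overlap of the lexemes.
    """
    matches = [
        a.instruction_class == b.instruction_class,
        a.instruction == b.instruction,
        a.context == b.context,
    ]
    union = a.lexemes | b.lexemes
    lexical = len(a.lexemes & b.lexemes) / len(union) if union else 1.0
    return (sum(matches) + lexical) / 4

for_a = NodeFeatures("LOOP", "FOR", "LOOP", frozenset({"i", "n"}))
for_b = NodeFeatures("LOOP", "FOR", "CONDITIONAL_STATEMENT", frozenset({"i", "m"}))
while_a = NodeFeatures("LOOP", "WHILE", "LOOP", frozenset({"i", "n"}))
```

In this sketch a FOR and a WHILE loop over the same variables still score 0.75, because they share the instruction class, the context and the lexemes.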
• Goal: Identify the maximum isomorphic tree/subtree
• Comparison of blocks to each other
• Blocks: atomic units for the (sub)trees considered
KERNELS FOR CODE STRUCTURES: AST
TREE KERNELS FOR AST
[Figure: two BLOCK subtrees, each with two assignment (=) nodes and a print node; the first uses x, 1.0, y and f, the second s, 0.0, 1.0, p and f]
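A simplified version of the block comparison is sketched below. It measures the largest common top-down subtree, matching children position by position, which is a deliberate simplification of general subtree isomorphism. The two blocks are one possible reading of the figure and should be treated as hypothetical, as should all function names.

```python
def common_size(a, b, same):
    """Size of the common top-down subtree rooted at a and b.

    Trees are (label, [children]) pairs; children are matched by position.
    """
    if not same(a[0], b[0]):
        return 0
    return 1 + sum(common_size(ca, cb, same) for ca, cb in zip(a[1], b[1]))

def nodes(tree):
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def max_common_subtree(t1, t2, same):
    """Best match over all pairs of nodes taken from the two trees."""
    return max(common_size(a, b, same) for a in nodes(t1) for b in nodes(t2))

def exact(a, b):
    return a == b

def ignoring_names(a, b):
    """Treat any two lower-case identifiers as interchangeable."""
    def is_name(s):
        return s.isidentifier() and s.islower()
    return a == b or (is_name(a) and is_name(b))

def leaf(label):
    return (label, [])

block1 = ("BLOCK", [
    ("=", [leaf("x"), leaf("1.0")]),
    ("=", [leaf("y"), ("f", [leaf("x"), leaf("x")])]),
    ("print", [leaf("y")]),
])
block2 = ("BLOCK", [
    ("=", [leaf("s"), leaf("0.0")]),
    ("=", [leaf("p"), ("f", [leaf("s"), leaf("1.0")])]),
    ("print", [leaf("p")]),
])

strict = max_common_subtree(block1, block2, exact)
loose = max_common_subtree(block1, block2, ignoring_names)
```

Each block has 11 nodes, so the gap between `strict` and `loose` shows how much of the mismatch comes from renamed variables alone.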
• Features of nodes:
  • Node Label
    • e.g., WHILE, CALL-SITE, EXPR, ...
  • Node Type
    • i.e., Data Node or Control Node
• Features of edges:
  • Edge Type
    • i.e., Data Edge or Control Edge
KERNELS FOR CODE STRUCTURES: PDG
Example, for the while node:
  Node Label = WHILE
  Node Type = Control Node
GRAPH KERNELS FOR PDG
[Figure: PDG fragment with nodes while, call-site, arg and two expr nodes, connected by control edges and data edges]
[Figure: two PDG fragments compared; the first has nodes while, call-site, arg, expr, expr, the second while, call-site, arg, expr, call-site]
GRAPH KERNELS FOR PDG
• Goal: Identify common subgraphs
• Selectors: compare nodes to each other and explore the subgraphs of only “compatible” nodes (i.e., nodes of the same type)
• Context: the subgraph of a node (with paths whose lengths are at most L to avoid loops)
KERNELS FOR CODE STRUCTURES: PDG
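The selector and context ideas can be sketched with a simple path-based graph kernel. This is an illustrative stand-in rather than the kernel used in the talk: it counts label paths of at most `max_len` edges shared by nodes of the same type, ignores edge types for brevity, and runs on two graphs whose wiring is assumed.

```python
from collections import Counter

def label_paths(graph, start, max_len):
    """Multiset of label sequences on paths of at most max_len edges."""
    labels, edges = graph
    found = Counter()
    stack = [(start, (labels[start][0],))]
    while stack:
        node, path = stack.pop()
        found[path] += 1
        if len(path) - 1 < max_len:     # the bound also guards against loops
            for nxt in edges.get(node, []):
                stack.append((nxt, path + (labels[nxt][0],)))
    return found

def graph_kernel(g1, g2, max_len=2):
    """Count matching bounded label paths between nodes of the same type."""
    total = 0
    for u, (_, type_u) in g1[0].items():
        paths_u = label_paths(g1, u, max_len)
        for v, (_, type_v) in g2[0].items():
            if type_u != type_v:        # selector: only compatible nodes
                continue
            paths_v = label_paths(g2, v, max_len)
            total += sum(paths_u[p] * paths_v[p]
                         for p in paths_u.keys() & paths_v.keys())
    return total

# A graph is (node id -> (label, type), node id -> successor ids).
g1 = ({0: ("while", "control"), 1: ("call-site", "control"),
       2: ("arg", "data"), 3: ("expr", "data"), 4: ("expr", "data")},
      {0: [1, 3], 1: [2, 4]})
g2 = ({0: ("while", "control"), 1: ("call-site", "control"),
       2: ("arg", "data"), 3: ("expr", "data"), 4: ("call-site", "control")},
      {0: [1, 3], 1: [2, 4]})

raw = graph_kernel(g1, g2)
score = raw / (graph_kernel(g1, g1) * graph_kernel(g2, g2)) ** 0.5
```

The normalized `score` lies in [0, 1] and can be compared against a similarity threshold.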
EVALUATION PROTOCOL
• Comparison of results with two other clone detection tools:
  • AST-based clone detector
  • PDG-based clone detector
• No publicly available clone detection dataset
• No unique set of analyzed open source systems
• Usually clone results are not available
• Two possible strategies:
  • Automatically modify an existing system with randomly generated clones
  • Manual classification of candidate results
EMPIRICAL EVALUATION
BENCHMARKS AND DATASET
Project         Size (KLOC)   # PDGs
Apache-2.2.14   343           3017
Python-2.5.1    435           5091
• Comparison with another graph-based clone detector
  • MeCC (ICSE 2011)
• Baseline Dataset
  • Results provided by MeCC
• Extended Dataset
  • Extension of clone results by manual evaluation of candidate clones
  • Agreement rate calculation between the evaluators
EMPIRICAL EVALUATION
EMPIRICAL EVALUATION OF TREE KERNEL FOR AST
• Comparison with another (pure) AST-based clone detector
  • Clone Digger http://clonedigger.sourceforge.net/
• Comparison on a system with randomly seeded clones
Results refer to clones where code fragments have been modified by adding/removing or changing code statements
EVALUATION: TREE KERNELS FOR AST
PRECISION, RECALL AND F1 PLOT
[Figure: Precision, Recall and F1 (vertical axis, 0 to 1) plotted against similarity thresholds from 0.6 to 0.98]
Clone results with different similarity thresholds
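How the three measures vary with the threshold can be sketched as below. The scored pairs and the oracle are invented toy data, used only to show the computation; they are not results from the talk.

```python
def precision_recall_f1(reported, oracle):
    """Standard precision, recall and F1 of reported clone pairs."""
    reported, oracle = set(reported), set(oracle)
    tp = len(reported & oracle)
    precision = tp / len(reported) if reported else 0.0
    recall = tp / len(oracle) if oracle else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def sweep(scored_pairs, oracle, thresholds):
    """One (threshold, precision, recall, f1) row per threshold."""
    rows = []
    for t in thresholds:
        reported = {pair for pair, score in scored_pairs.items() if score >= t}
        rows.append((t, *precision_recall_f1(reported, oracle)))
    return rows

# Hypothetical similarity scores and oracle (toy data).
scored_pairs = {("a", "b"): 0.95, ("c", "d"): 0.80,
                ("e", "f"): 0.65, ("g", "h"): 0.70}
oracle = {("a", "b"), ("c", "d"), ("g", "h")}
rows = sweep(scored_pairs, oracle, thresholds=[0.6, 0.9])
```

Raising the threshold trades recall for precision, which is the trade-off the plot visualizes.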
EVALUATION
TREE KERNELS
FOR AST
Threshold   #Clones in the Baseline   #Clones in the Extended Dataset
1.00        874                       1089
0.99        874                       1514
[Figure: Apache 2.2.14, Precision with the Extended Oracle; Baseline vs. Extended precision at similarity thresholds 0.99 and 1]
[Figure: Apache 2.2.14, Recall with the Extended Oracle; Baseline vs. Extended recall at similarity thresholds 0.99 and 1]
RESULTS WITH
APACHE 2.2.14
EVALUATION
GRAPH KERNELS
FOR PDG
Threshold   #Clones in the Baseline   #Clones in the Extended Dataset
1.00        858                       1066
0.99        858                       2119
[Figure: Python 2.5.1, Precision with the Extended Oracle; Baseline vs. Extended precision at similarity thresholds 0.99 and 1]
[Figure: Python 2.5.1, Recall with the Extended Oracle; Baseline vs. Extended recall at similarity thresholds 0.99 and 1]
RESULTS WITH PYTHON 2.5.1
EVALUATION
GRAPH KERNELS
FOR PDG
CHALLENGES AND OPPORTUNITIES
• Learning kernel functions from data sets
• Kernel Methods advantages:
  • flexible solutions that can be tailored to a specific domain
  • efficient solutions that are easy to parallelize
  • combinations of multiple kernels
• Provide a publicly available data set
THANK YOU FOR YOUR KIND ATTENTION

More Related Content

What's hot

The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewDr. Ananth Krishnamoorthy
Β 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studioDerek Kane
Β 
Using electronic laboratory notebooks in the academic life sciences: a group ...
Using electronic laboratory notebooks in the academic life sciences: a group ...Using electronic laboratory notebooks in the academic life sciences: a group ...
Using electronic laboratory notebooks in the academic life sciences: a group ...SC CTSI at USC and CHLA
Β 
Makine Γ–ΔŸrenmesi ile GΓΆrΓΌntΓΌ TanΔ±ma | Image Recognition using Machine Learning
Makine Γ–ΔŸrenmesi ile GΓΆrΓΌntΓΌ TanΔ±ma | Image Recognition using Machine LearningMakine Γ–ΔŸrenmesi ile GΓΆrΓΌntΓΌ TanΔ±ma | Image Recognition using Machine Learning
Makine Γ–ΔŸrenmesi ile GΓΆrΓΌntΓΌ TanΔ±ma | Image Recognition using Machine LearningAli Alkan
Β 
Determining the Credibility of Science Communication
Determining the Credibility of Science CommunicationDetermining the Credibility of Science Communication
Determining the Credibility of Science CommunicationIsabelle Augenstein
Β 
Intelligent Software Engineering: Synergy between AI and Software Engineering...
Intelligent Software Engineering: Synergy between AI and Software Engineering...Intelligent Software Engineering: Synergy between AI and Software Engineering...
Intelligent Software Engineering: Synergy between AI and Software Engineering...Tao Xie
Β 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Isabelle Augenstein
Β 
Machine Learning vs. Deep Learning
Machine Learning vs. Deep LearningMachine Learning vs. Deep Learning
Machine Learning vs. Deep LearningBelatrix Software
Β 
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...Isabelle Augenstein
Β 
Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models IJECEIAES
Β 
Bug or Not? Bug Report Classification using N-Gram Idf
Bug or Not? Bug Report Classification using N-Gram IdfBug or Not? Bug Report Classification using N-Gram Idf
Bug or Not? Bug Report Classification using N-Gram IdfHideaki Hata
Β 
Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Ola Spjuth
Β 
R programming for psychometrics
R programming for psychometricsR programming for psychometrics
R programming for psychometricsDiane Talley
Β 
(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contentsSteffen Staab
Β 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningKuppusamy P
Β 
Requirementv4
Requirementv4Requirementv4
Requirementv4stat
Β 
Integrating natural language processing and software engineering
Integrating natural language processing and software engineeringIntegrating natural language processing and software engineering
Integrating natural language processing and software engineeringNakul Sharma
Β 
Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad DataSteffen Staab
Β 

What's hot (20)

The Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape OverviewThe Python ecosystem for data science - Landscape Overview
The Python ecosystem for data science - Landscape Overview
Β 
Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017 Proposed Talk Outline for Pycon2017
Proposed Talk Outline for Pycon2017
Β 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studio
Β 
Using electronic laboratory notebooks in the academic life sciences: a group ...
Using electronic laboratory notebooks in the academic life sciences: a group ...Using electronic laboratory notebooks in the academic life sciences: a group ...
Using electronic laboratory notebooks in the academic life sciences: a group ...
Β 
Makine Γ–ΔŸrenmesi ile GΓΆrΓΌntΓΌ TanΔ±ma | Image Recognition using Machine Learning
Makine Γ–ΔŸrenmesi ile GΓΆrΓΌntΓΌ TanΔ±ma | Image Recognition using Machine LearningMakine Γ–ΔŸrenmesi ile GΓΆrΓΌntΓΌ TanΔ±ma | Image Recognition using Machine Learning
Makine Γ–ΔŸrenmesi ile GΓΆrΓΌntΓΌ TanΔ±ma | Image Recognition using Machine Learning
Β 
Determining the Credibility of Science Communication
Determining the Credibility of Science CommunicationDetermining the Credibility of Science Communication
Determining the Credibility of Science Communication
Β 
Intelligent Software Engineering: Synergy between AI and Software Engineering...
Intelligent Software Engineering: Synergy between AI and Software Engineering...Intelligent Software Engineering: Synergy between AI and Software Engineering...
Intelligent Software Engineering: Synergy between AI and Software Engineering...
Β 
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Sum...
Β 
Machine Learning vs. Deep Learning
Machine Learning vs. Deep LearningMachine Learning vs. Deep Learning
Machine Learning vs. Deep Learning
Β 
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Sc...
Β 
Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models Finding Bad Code Smells with Neural Network Models
Finding Bad Code Smells with Neural Network Models
Β 
Bug or Not? Bug Report Classification using N-Gram Idf
Bug or Not? Bug Report Classification using N-Gram IdfBug or Not? Bug Report Classification using N-Gram Idf
Bug or Not? Bug Report Classification using N-Gram Idf
Β 
Olap, expert system, data visualisation
Olap, expert system, data visualisationOlap, expert system, data visualisation
Olap, expert system, data visualisation
Β 
Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...
Β 
R programming for psychometrics
R programming for psychometricsR programming for psychometrics
R programming for psychometrics
Β 
(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents
Β 
Anomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine LearningAnomaly detection (Unsupervised Learning) in Machine Learning
Anomaly detection (Unsupervised Learning) in Machine Learning
Β 
Requirementv4
Requirementv4Requirementv4
Requirementv4
Β 
Integrating natural language processing and software engineering
Integrating natural language processing and software engineeringIntegrating natural language processing and software engineering
Integrating natural language processing and software engineering
Β 
Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad Data
Β 

Machine Learning for Software Maintainability

  • 1. MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY
      Anna Corazza, Sergio Di Martino, Valerio Maggio, Alessandro Moschitti, Andrea Passerini, Giuseppe Scanniello, Fabrizio Silvestri
      JIMSE 2012, August 28, 2012, Montpellier, France
  • 2. SOFTWARE MAINTENANCE
      “A software system must be continuously adapted during its overall life cycle or it progressively becomes less satisfactory” (cit. Lehman’s Law of Software Evolution)
      • Software maintenance is one of the most expensive and time-consuming phases of the whole life cycle
      • Anticipating maintenance operations reduces the cost
      • 85%-90% of the total cost is related to the effort necessary to comprehend the system and its source code [Erlikh, 2000]
  • 3. SOFTWARE ARCHITECTURE
      • Provide models and views representing the relationships among different software artifacts
      • Clustering of software artifacts
      • Advantages:
          • To aid comprehension
          • To reduce maintenance effort
      Software Artifacts: UI Process Components, UI Components, Data Access Components, Data Helpers / Utilities, Security, Operational Management, Communications, Business Components, Application Facade, Business Workflows, Messages Interfaces, Service Interfaces
  • 4. SOFTWARE ARCHITECTURE
      • Provide models and views representing the relationships among different software artifacts
      • Clustering of software artifacts
      • Advantages:
          • To aid comprehension
          • To reduce maintenance effort
      Clusters of Software Artifacts:
          • External Systems, Service Consumers
          • Services: Service Interfaces, Messages Interfaces
          • Cross Cutting: Security, Operational Management, Communications
          • Data: Data Access Components, Data Helpers / Utilities
          • Presentation: UI Components, UI Process Components
          • Business: Application Facade, Business Workflows, Business Components
  • 8. SOFTWARE ARTIFACTS
      Software artifacts may be analyzed at different levels of abstraction. The different levels of abstraction lead to different analysis tasks:
      • Identification of functional modules and their hierarchical arrangement
          • i.e., clustering of software classes
      • Identification of code clones
          • i.e., clustering of duplicated code fragments (blocks, ...)
  • 9. SOFTWARE ARTIFACTS CLUSTERING
      • Mine information directly from the source code:
          • Exploit the syntactic/lexical information provided in the source code text
          • Exploit the relational information between artifacts
              • e.g., program dependencies
      Problem: define a proper similarity measure for the clustering analysis, one that is able to exploit the chosen representation of software artifacts
  • 10. MINING LARGE REPOSITORIES
      • Analysis of large and complex systems
      • Solutions and algorithms must be able to scale efficiently (in the large and in the many)
  • 12. ADVANCED MACHINE LEARNING FOR SOFTWARE MAINTENANCE
      Idea: define Machine Learning techniques to mine information from the source code
      • Combine different kinds of information (lexical and structural)
      • Application of Kernel Methods to software artifacts
      • Provide flexible and computationally effective solutions to analyze large data sets
      Advanced Machine Learning
      • Learning with syntactic/semantic information (Natural Language Processing)
      • Learning in relational domains (Structured-output learning, Logic Learning, Statistical Relational Learning)
  • 13. KERNEL METHODS FOR STRUCTURED DATA
      • A kernel is a function between (arbitrary) pairs of entities
          • It can be seen as a kind of similarity measure
          • Based on the idea that structured objects can be described in terms of their constituent parts
          • Generalizes the computation of the dot product to arbitrary domains
      • Can be easily tailored to specific domains
          • Tree Kernels
          • Graph Kernels
          • ...
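As a rough illustration of slide 13, the sketch below treats an object as a multiset of parts and defines the kernel as the dot product of the two part-count vectors. The function names (`bigrams`, `parts_kernel`, `normalized`) and the choice of character bigrams as "parts" are assumptions made for this example; they do not come from the slides.

```python
from collections import Counter
from math import sqrt


def bigrams(text):
    """Decompose a string into its constituent parts (character bigrams)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def parts_kernel(x, y, decompose=bigrams):
    """Dot product of the part-count vectors of x and y."""
    parts_x, parts_y = decompose(x), decompose(y)
    return sum(count * parts_y[part] for part, count in parts_x.items())


def normalized(kernel, x, y):
    """Scale a kernel so that an object compared with itself scores 1.0."""
    denom = sqrt(kernel(x, x) * kernel(y, y))
    return kernel(x, y) / denom if denom else 0.0
```

Swapping `decompose` for a function that enumerates subtrees or graph walks would give, in the same spirit, the tree and graph kernels listed on the slide.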
  • 16. KERNELS FOR LANGUAGES
      • Parse trees represent the syntactic structure of a sentence
      • Tree Kernels can be used to measure the similarity between parse trees
      KERNELS FOR SOURCE CODE
      • Abstract Syntax Trees (AST) represent the syntactic structure of a piece of code
      • Research on Tree Kernels for NLP carries over to ASTs (with adjustments)
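To make the idea of a tree kernel over ASTs concrete, here is a minimal sketch that uses Python's own `ast` module as the parser. It implements a simple subtree kernel (it counts pairs of identical complete subtrees, comparing node types only), which is one basic member of the tree-kernel family and is not claimed to be the kernel used by the authors. The names `subtree_signatures` and `subtree_kernel` are hypothetical.

```python
import ast
from collections import Counter


def subtree_signatures(source):
    """Count the structural signature of every subtree in the AST of `source`.

    Only node types are recorded, so identifier names and literal values
    are ignored (an assumption of this sketch).
    """
    counts = Counter()

    def visit(node):
        children = [visit(child) for child in ast.iter_child_nodes(node)]
        signature = "(" + type(node).__name__ + "".join(children) + ")"
        counts[signature] += 1
        return signature

    visit(ast.parse(source))
    return counts


def subtree_kernel(src_a, src_b):
    """Number of pairs of identical complete subtrees shared by two fragments."""
    a, b = subtree_signatures(src_a), subtree_signatures(src_b)
    return sum(n * b[sig] for sig, n in a.items())
```

Because names and literals are dropped, `x = y + 1` and `a = b + 2` score exactly as high against each other as a fragment does against itself, while an unrelated statement such as `print(x)` scores lower.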
  • 17. KERNELS FOR PARSE TREE [figure]
  • 18. KERNELS FOR AST [figure]
  • 19. KERNEL MACHINES
      • Supervised Learning
          • Binary Classification
          • Multi-class Classification
          • Ranking
      • Unsupervised Learning
          • Clustering
          • Anomaly Detection
      Idea: any learning algorithm relying on a similarity measure can be used
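One of the simplest learners that touches the data only through a similarity function is the kernel perceptron. The sketch below is a generic illustration of that idea for binary classification, not the learner used in the talk; the toy `token_kernel`, the training fragments, and the labels are all made up for the example.

```python
from collections import Counter


def token_kernel(a, b):
    """Toy kernel: dot product of whitespace-token counts (an assumption)."""
    ta, tb = Counter(a.split()), Counter(b.split())
    return sum(n * tb[t] for t, n in ta.items())


def decision(x, examples, labels, alpha, kernel):
    """Kernel expansion: sum of alpha_j * y_j * k(x_j, x)."""
    return sum(a * yj * kernel(xj, x)
               for a, xj, yj in zip(alpha, examples, labels) if a)


def train_kernel_perceptron(examples, labels, kernel, epochs=10):
    """Return dual coefficients; labels must be +1 or -1."""
    alpha = [0] * len(examples)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(examples, labels)):
            if y * decision(x, examples, labels, alpha, kernel) <= 0:
                alpha[i] += 1          # remember the misclassified example
                mistakes += 1
        if mistakes == 0:              # converged on the training set
            break
    return alpha


def predict(x, examples, labels, alpha, kernel):
    return 1 if decision(x, examples, labels, alpha, kernel) > 0 else -1


# Hypothetical training data: +1 for loop headers, -1 for print calls.
examples = ["for i in range ( n )", "while i < n", "print ( x )", "print ( y , z )"]
labels = [1, 1, -1, -1]
alpha = train_kernel_perceptron(examples, labels, token_kernel)
```

Replacing `token_kernel` with a tree or graph kernel changes what "similar" means without touching the learning code, which is the point made on the slide.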
  • 20. KERNEL MACHINES FOR CLONE DETECTION
      • Supervised Learning
          • Pairwise classifier: predict whether a pair of fragments is a clone
      • Unsupervised Learning
          • Clustering: cluster together all candidate clones
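For the unsupervised route, a minimal sketch is to group fragments whose normalized kernel similarity reaches a threshold, merging groups transitively (single-link clustering via union-find). The threshold value, the toy `token_kernel`, and the function name `cluster_candidates` are assumptions of this example rather than details from the slides.

```python
from collections import Counter
from math import sqrt


def token_kernel(a, b):
    """Toy kernel: dot product of whitespace-token counts (an assumption)."""
    ta, tb = Counter(a.split()), Counter(b.split())
    return sum(n * tb[t] for t, n in ta.items())


def cluster_candidates(fragments, kernel, threshold=0.8):
    """Group fragments whose normalized similarity is at least `threshold`."""
    n = len(fragments)
    self_sim = [kernel(f, f) for f in fragments]
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            denom = sqrt(self_sim[i] * self_sim[j])
            sim = kernel(fragments[i], fragments[j]) / denom if denom else 0.0
            if sim >= threshold:
                parent[find(i)] = find(j)   # merge the two groups

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(fragments[i])
    return list(clusters.values())


clusters = cluster_candidates(["x = x + 1", "x = x + 1", "print ( y )"], token_kernel)
```

The all-pairs loop is quadratic in the number of fragments, so a real system analyzing large repositories would need some form of candidate filtering before this step.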
  • 21. KERNEL FOR CLONES [figure]
  • 22. LEARNING SIMILARITIES: KERNEL LEARNING
      • Construct a number of candidate kernels with different characteristics
          • e.g., ignore variable names or not
      • Employ kernel learning approaches, which learn a weighted combination of candidate kernels
      • Useless/harmful kernels will get zero weight and will be discarded in the final model
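The sketch below shows only the combination step: two hypothetical candidate kernels (one that keeps identifier names, one that replaces them with a placeholder) are merged into a weighted sum, and zero-weight kernels are dropped. The weights are fixed by hand here; learning them from data, which is the actual point of the slide, is not shown. All names and the tokenizer regex are assumptions.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def tokens(code, abstract_names=False):
    """Token counts; optionally replace every identifier-like token by 'ID'.

    Note: keywords look like identifiers to this toy tokenizer.
    """
    found = TOKEN.findall(code)
    if abstract_names:
        found = ["ID" if IDENTIFIER.fullmatch(t) else t for t in found]
    return Counter(found)


def exact_kernel(a, b):
    """Candidate kernel that takes variable names into account."""
    ta, tb = tokens(a), tokens(b)
    return sum(n * tb[t] for t, n in ta.items())


def abstract_kernel(a, b):
    """Candidate kernel that ignores variable names."""
    ta, tb = tokens(a, True), tokens(b, True)
    return sum(n * tb[t] for t, n in ta.items())


def combine(kernels, weights):
    """Return k(x, y) = sum_i w_i * k_i(x, y), discarding zero-weight kernels."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    active = [(w, k) for w, k in zip(weights, kernels) if w > 0]

    def combined(x, y):
        return sum(w * k(x, y) for w, k in active)

    return combined


mixed = combine([exact_kernel, abstract_kernel], [0.5, 1.0])
names_ignored = combine([exact_kernel, abstract_kernel], [0.0, 1.0])
```

Non-negative weights are required so that the weighted sum of valid kernels is itself a valid kernel.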
  • 23. STRUCTURED-OUTPUT LEARNING
      Supervised Clustering
      • Exploit information on already annotated pieces of software
      • Training examples are software projects/portions annotated with existing clones (clustering)
      • A learning model uses training examples to refine the similarity measure so that novel examples are clustered correctly
  • 24. SUMMARY
      • Software has a rich structure and heterogeneous information
      • Advanced Machine Learning approaches are promising for exploiting such information
      • Kernel Methods are a natural candidate
          • e.g., see the analogy between NLP parse trees and ASTs
      • Many applications:
          • architecture recovery, code clone detection, vulnerability detection, ...
  • 26. CODE CLONE DETECTION
      • Goal: “Identify and group all duplicated code fragments/functions”
      • Copy&Paste programming
      • Taxonomy of 4 different types of clones
          • Program text similarities and functional similarities
      • Clones affect the reliability and the maintainability of a software system
  • 27. KERNELS FOR CLONES: CODE STRUCTURES
      Kernels for Structured Data:
      • The source code could be represented by many different data structures
      • Abstract Syntax Tree (AST)
          • Tree structure representing the syntactic structure of the different instructions of a program (function)
      • Program Dependencies Graph (PDG)
          • (Directed) graph structure representing the relationships among the different statements of a program
  • 28. CODE STRUCTURES: ABSTRACT SYNTAX TREE (AST)
      [Figure: AST of a function body; node labels: =, while, print, k, 10, i, +, p, 0, 1.0, <, 7]
      The AST embeds both syntactic and lexical information
      • Program instructions
      • Names of variables, literals, ...
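The two kinds of information carried by an AST can be seen by parsing a small fragment. The slides do not say which language the figure's program is written in, so this sketch uses Python's standard `ast` module on a made-up fragment; the variable names (`instructions`, `names`, `literals`) are only for illustration.

```python
import ast

# Hypothetical fragment, loosely inspired by the labels in the figure.
source = "i = 0\nwhile i < 7:\n    i = i + 1\n"
tree = ast.parse(source)

# Syntactic information: the kinds of program instructions in the tree.
instructions = [type(node).__name__ for node in ast.walk(tree)
                if isinstance(node, ast.stmt)]

# Lexical information: variable names and literal values found at the leaves.
names = sorted({node.id for node in ast.walk(tree)
                if isinstance(node, ast.Name)})
literals = sorted(node.value for node in ast.walk(tree)
                  if isinstance(node, ast.Constant))
```

A kernel can then choose to compare only the instruction structure, only the lexemes, or both.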
  • 32. CODE STRUCTURES: PROGRAM DEPENDENCIES GRAPH (PDG)
      [Figure: PDG with nodes while, call-site, expr, decl, param, arg]
      • Nodes correspond to instructions
      • Edges represent relationships between pairs of nodes
  • 38. CODE STRUCTURES: PDG NODES AND EDGES
      • Two types of nodes
          • Control nodes (dashed ones)
              • e.g., if, for, while, function calls, ...
          • Data nodes
              • e.g., expressions, parameters, ...
      • Two types of edges (i.e., dependencies)
          • Control edges (dashed ones)
          • Data edges
      [Figure: PDG fragment with nodes while, call-site, arg, expr]
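A PDG with typed nodes and typed edges can be held in a very small data structure. The sketch below is one possible representation, not the authors' implementation; the class and method names are hypothetical, and the edges in the example graph are invented because the figure's edges are not recoverable from the transcript.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    CONTROL = "control"
    DATA = "data"


@dataclass
class PDG:
    labels: list = field(default_factory=list)      # node index -> label
    node_kinds: list = field(default_factory=list)  # node index -> Kind
    edges: list = field(default_factory=list)       # (source, target, Kind)

    def add_node(self, label, kind):
        self.labels.append(label)
        self.node_kinds.append(kind)
        return len(self.labels) - 1

    def add_edge(self, source, target, kind):
        self.edges.append((source, target, kind))

    def successors(self, node, kind=None):
        """Targets of edges leaving `node`, optionally filtered by edge kind."""
        return [t for s, t, k in self.edges
                if s == node and (kind is None or k == kind)]


# Example graph; the edges below are assumptions made for illustration.
pdg = PDG()
w = pdg.add_node("while", Kind.CONTROL)
c = pdg.add_node("call-site", Kind.CONTROL)
a = pdg.add_node("arg", Kind.DATA)
e = pdg.add_node("expr", Kind.DATA)
pdg.add_edge(w, c, Kind.CONTROL)   # the loop controls the call
pdg.add_edge(e, w, Kind.DATA)      # the expression feeds the loop condition
pdg.add_edge(a, c, Kind.DATA)      # the argument feeds the call
```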
  • 39. KERNELS FOR CODE STRUCTURES: DEFINING KERNELS FOR STRUCTURED DATA
      • Defining a new kernel for a structured object requires defining:
          • A set of features to annotate each part of the object
          • A kernel function to measure the similarity on the smallest parts of the object
              • e.g., nodes of ASTs and graphs
          • A kernel function to apply the computation on the different (sub)parts of the structured object
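The three ingredients on slide 39 can be sketched in a few lines: parts annotated with features, a kernel on single parts, and a kernel on whole objects that sums the part kernel over all pairs of parts (a convolution-style construction). The feature tuples, the equal weighting of the two features, and the function names are assumptions of this example.

```python
def part_kernel(a, b):
    """Similarity of two parts: fraction of matching features.

    Each part is a (label, node_type) tuple, a hypothetical feature set.
    """
    return ((a[0] == b[0]) + (a[1] == b[1])) / 2


def structure_kernel(parts_a, parts_b, kernel=part_kernel):
    """Sum the part kernel over all pairs of parts of the two objects."""
    return sum(kernel(a, b) for a in parts_a for b in parts_b)


# Two hypothetical objects, each flattened to a list of annotated parts.
first = [("WHILE", "control"), ("EXPR", "data"), ("EXPR", "data")]
second = [("WHILE", "control"), ("EXPR", "data"), ("CALL-SITE", "control")]
score = structure_kernel(first, second)
```

Tree and graph kernels refine this by restricting which pairs of parts are compared and by recursing into the neighbourhood of each part.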
  • 41. KERNELS FOR CODE STRUCTURES: AST (TREE KERNELS FOR AST)
      • Features: each node is characterized by a set of 4 features
          • Instruction Class
              • e.g., LOOP, CONDITIONAL_STATEMENT, CALL
          • Instruction
              • e.g., FOR, IF, WHILE, RETURN
          • Context
              • i.e., Instruction Class of the closest statement node
          • Lexemes
              • Lexical information gathered (recursively) from leaves
      Example, for the FOR node of a tree with children FOR-INIT and FOR-BODY:
      • Instruction Class = LOOP
      • Instruction = FOR
      • Context = (e.g., LOOP)
      • Lexemes = (e.g., names of variables in FOR-INIT, ...)
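The four node features map naturally onto a small record type, and a node-level similarity can then be built from them. In the sketch below the scoring rule (exact match on the three categorical features, Jaccard overlap on the lexemes, equal weights) is an assumption; the slides name the features but do not say how they are compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeFeatures:
    instruction_class: str   # e.g. "LOOP"
    instruction: str         # e.g. "FOR"
    context: str             # instruction class of the closest statement node
    lexemes: frozenset       # lexical information gathered from the leaves


def node_kernel(a, b):
    """Average of four per-feature similarities, each in [0, 1]."""
    score = 0.0
    score += a.instruction_class == b.instruction_class
    score += a.instruction == b.instruction
    score += a.context == b.context
    union = a.lexemes | b.lexemes
    score += len(a.lexemes & b.lexemes) / len(union) if union else 1.0
    return score / 4.0


# Hypothetical nodes: a FOR loop over i and n, and a WHILE loop over i.
for_node = NodeFeatures("LOOP", "FOR", "LOOP", frozenset({"i", "n"}))
while_node = NodeFeatures("LOOP", "WHILE", "LOOP", frozenset({"i"}))
```

Dropping the lexeme term from `node_kernel` gives a similarity that ignores variable names, which is one way to obtain the alternative candidate kernels mentioned earlier in the talk.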
  • 42. KERNELS FOR CODE STRUCTURES: AST (TREE KERNELS FOR AST)
      • Goal: identify the maximum isomorphic tree/subtree
      • Comparison of blocks to each other
          • Blocks: atomic unit for the (sub)trees considered
      [Figure: two BLOCK subtrees, each with children =, =, print; leaves x, 1.0, y, f, x, x, y and s, 0.0, p, f, s, 1.0, p]
  • 43. KERNELS FOR CODE STRUCTURES: AST (TREE KERNELS FOR AST) • Goal: identify the maximum isomorphic tree/subtree • Blocks are compared to each other • Block: the atomic unit of the (sub)trees considered [Figure: two BLOCK subtrees, each with two = nodes and a print node, with leaves x, 1.0, y, f and s, 0.0, p, f, 1.0 respectively]
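The search for a largest shared subtree can be sketched in a simplified form. This is not the authors' kernel: it assumes ordered children compared by position, exact label equality, and top-down matching only; `Node`, `matched_size`, and `largest_common_subtree` are hypothetical names, and the two example trees are invented.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def matched_size(a: Node, b: Node) -> int:
    """Count nodes shared by the top-down subtrees rooted at a and b."""
    if a.label != b.label:
        return 0
    # Children are paired by position; unpaired extra children are ignored.
    return 1 + sum(matched_size(x, y) for x, y in zip(a.children, b.children))


def all_nodes(root: Node) -> Iterator[Node]:
    """Yield every node of the tree in pre-order."""
    yield root
    for child in root.children:
        yield from all_nodes(child)


def largest_common_subtree(t1: Node, t2: Node) -> int:
    """Size of the largest matched subtree over all pairs of candidate roots."""
    return max(matched_size(a, b) for a in all_nodes(t1) for b in all_nodes(t2))


# Hypothetical blocks: identifiers and literals are abstracted to ID and LIT.
block_1 = Node("BLOCK", [Node("ASSIGN", [Node("ID"), Node("LIT")]),
                         Node("PRINT", [Node("ID")])])
block_2 = Node("BLOCK", [Node("ASSIGN", [Node("ID"), Node("LIT")]),
                         Node("CALL", [Node("ID")])])
```

Here the two blocks share the BLOCK root and the whole ASSIGN subtree, while the PRINT and CALL branches do not match.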
  • 45. KERNELS FOR CODE STRUCTURES: PDG (GRAPH KERNELS FOR PDG) • Features of nodes: • Node Label, e.g., WHILE, CALL-SITE, EXPR, ... • Node Type, i.e., Data Node or Control Node • Features of edges: • Edge Type, i.e., Data Edge or Control Edge • Example: Node Label = WHILE; Node Type = Control Node [Figure: PDG fragment with nodes while, call-site, arg, expr, expr, connected by control and data edges]
  • 49. KERNELS FOR CODE STRUCTURES: PDG (GRAPH KERNELS FOR PDG) • Goal: identify common subgraphs • Selectors: compare nodes to each other and explore the subgraphs of only “compatible” nodes (i.e., nodes of the same type) • Context: the subgraph of a node (restricted to paths of length at most L, to avoid loops) [Figure: two PDG fragments, with nodes while, call-site, arg, expr, expr and while, call-site, arg, expr, call-site]
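The selector and context ideas can be sketched as below. The slides do not give the kernel's formula, so counting shared label sequences along bounded walks is an assumption; `PDG`, `context_paths`, and `pdg_kernel` are hypothetical names, and the two graphs are invented chains loosely modelled on the figure.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PDG:
    # node id -> (label, node type); node type is "control" or "data"
    nodes: Dict[int, Tuple[str, str]]
    # node id -> outgoing (edge type, target id); edge type is "control" or "data"
    edges: Dict[int, List[Tuple[str, int]]] = field(default_factory=dict)


def context_paths(g: PDG, start: int, max_len: int) -> Counter:
    """Label sequences of all walks from `start` that use at most `max_len` edges."""
    paths: Counter = Counter()
    stack = [(start, (g.nodes[start][0],), 0)]
    while stack:
        node, labels, depth = stack.pop()
        paths[labels] += 1
        if depth == max_len:  # the length bound keeps cyclic graphs finite
            continue
        for edge_type, target in g.edges.get(node, []):
            stack.append((target, labels + (edge_type, g.nodes[target][0]), depth + 1))
    return paths


def pdg_kernel(g1: PDG, g2: PDG, max_len: int = 2) -> int:
    """Count walks shared by the contexts of every pair of same-type nodes."""
    ctx1 = {n: context_paths(g1, n, max_len) for n in g1.nodes}
    ctx2 = {n: context_paths(g2, n, max_len) for n in g2.nodes}
    total = 0
    for n1, (_, type1) in g1.nodes.items():
        for n2, (_, type2) in g2.nodes.items():
            if type1 != type2:  # selector: only compatible nodes are compared
                continue
            total += sum((ctx1[n1] & ctx2[n2]).values())
    return total


chain = {0: [("control", 1)], 1: [("data", 2)], 2: [("data", 3)]}
g1 = PDG({0: ("while", "control"), 1: ("call-site", "control"),
          2: ("arg", "data"), 3: ("expr", "data")}, chain)
g2 = PDG({0: ("while", "control"), 1: ("call-site", "control"),
          2: ("arg", "data"), 3: ("call-site", "control")}, chain)
```

Bounding the walk length plays the role of L on the slide: it guarantees termination on cyclic dependence graphs and limits how much context each node contributes.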
  • 56. EMPIRICAL EVALUATION: EVALUATION PROTOCOL • Comparison of results with two other clone detection tools: • An AST-based clone detector • A PDG-based clone detector • No publicly available clone detection dataset • No common set of analyzed open source systems • Clone results are usually not available • Two possible strategies: • Automatically modify an existing system with randomly generated clones • Manually classify candidate results
  • 57. EMPIRICAL EVALUATION: BENCHMARKS AND DATASET • Projects: Apache-2.2.14 (343 KLOC, 3017 PDGs); Python-2.5.1 (435 KLOC, 5091 PDGs) • Comparison with another graph-based clone detector: MeCC (ICSE 2011) • Baseline dataset: results provided by MeCC • Extended dataset: clone results extended by manual evaluation of candidate clones • Agreement rate calculated between the evaluators
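The slides do not say which agreement measure was used. As a sketch, the simplest option is the fraction of candidate clones on which two evaluators give the same verdict; `agreement_rate` is a hypothetical name and the verdicts below are made-up illustration data, not results from the study.

```python
from typing import Sequence


def agreement_rate(labels_a: Sequence[bool], labels_b: Sequence[bool]) -> float:
    """Fraction of candidates on which both evaluators give the same verdict."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both evaluators must label the same candidates")
    if not labels_a:
        raise ValueError("at least one candidate is required")
    agreements = sum(a == b for a, b in zip(labels_a, labels_b))
    return agreements / len(labels_a)


# Hypothetical verdicts (True = "is a clone") on five candidate pairs.
evaluator_1 = [True, True, False, True, False]
evaluator_2 = [True, False, False, True, False]
rate = agreement_rate(evaluator_1, evaluator_2)
```

Raw agreement does not correct for agreement by chance; a chance-corrected statistic would be a natural refinement.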
  • 58. EVALUATION OF TREE KERNELS FOR AST • Comparison with another (pure) AST-based clone detector: Clone Digger, http://clonedigger.sourceforge.net/ • Comparison on a system with randomly seeded clones • Results refer to clones whose code fragments have been modified by adding, removing, or changing code statements
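Seeding clones by adding, removing, or changing statements can be sketched as a small mutation step. The slides do not describe the actual seeding tool, so `seed_clone`, the placeholder statements it injects, and the sample fragment are all hypothetical.

```python
import random
from typing import List


def seed_clone(statements: List[str], rng: random.Random) -> List[str]:
    """Return a copy of a non-empty fragment with one statement added, removed, or changed."""
    clone = list(statements)
    operation = rng.choice(["add", "remove", "change"])
    position = rng.randrange(len(clone))
    if operation == "add":
        clone.insert(position, "seeded_var = 0")  # placeholder statement
    elif operation == "remove" and len(clone) > 1:
        del clone[position]
    else:
        # "change", or "remove" on a one-statement fragment
        clone[position] = "seeded_var = 1"        # placeholder replacement
    return clone


original = ["total = 0", "for x in items:", "    total += x"]
# A fixed seed per clone keeps the generated benchmark reproducible.
clones = [seed_clone(original, random.Random(seed)) for seed in range(20)]
```

Because each seeded fragment is generated from a known original, the set of true clone pairs is known in advance and can serve as the oracle.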
  • 59. EVALUATION OF TREE KERNELS FOR AST: PRECISION, RECALL AND F1 PLOT [Figure: precision, recall, and F1 (y-axis, 0 to 1) plotted against the similarity threshold (x-axis, 0.6 to 0.98)] Clone results with different similarity thresholds
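The curves in this plot come from recomputing precision, recall, and F1 at each similarity threshold. A minimal sketch of that sweep, using the standard definitions; `precision_recall_f1` and `sweep` are hypothetical names, and the scores and oracle are invented, not the study's data.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def precision_recall_f1(detected: Set[Pair], oracle: Set[Pair]) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 of detected clone pairs against an oracle."""
    true_positives = len(detected & oracle)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(oracle) if oracle else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def sweep(scores: Dict[Pair, float], oracle: Set[Pair],
          thresholds: Iterable[float]) -> Dict[float, Tuple[float, float, float]]:
    """Report the metrics for the pairs whose similarity reaches each threshold."""
    return {
        t: precision_recall_f1({p for p, s in scores.items() if s >= t}, oracle)
        for t in thresholds
    }


# Hypothetical kernel similarities for four candidate pairs; three are true clones.
scores = {("a", "b"): 0.95, ("c", "d"): 0.80, ("e", "f"): 0.65, ("g", "h"): 0.99}
oracle = {("a", "b"), ("e", "f"), ("g", "h")}
results = sweep(scores, oracle, [0.6, 0.9])
```

Raising the threshold in this toy data trades recall for precision, which is the trade-off the plot visualizes.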
  • 60. EVALUATION OF GRAPH KERNELS FOR PDG: RESULTS WITH APACHE 2.2.14 • Threshold 1.00: 874 clones in the baseline, 1089 clones in the extended dataset • Threshold 0.99: 874 clones in the baseline, 1514 clones in the extended dataset [Figure: "Apache 2.2.14 - Precision with Extended Oracle", baseline vs. extended precision against the similarity threshold] [Figure: "Apache 2.2.14 - Recall with Extended Oracle", baseline vs. extended recall against the similarity threshold]
  • 61. EVALUATION OF GRAPH KERNELS FOR PDG: RESULTS WITH PYTHON 2.5.1 • Threshold 1.00: 858 clones in the baseline, 1066 clones in the extended dataset • Threshold 0.99: 858 clones in the baseline, 2119 clones in the extended dataset [Figure: "Python 2.5.1 - Precision with Extended Oracle", baseline vs. extended precision against the similarity threshold] [Figure: "Python 2.5.1 - Recall with Extended Oracle", baseline vs. extended recall against the similarity threshold]
  • 62. CHALLENGES AND OPPORTUNITIES • Learning kernel functions from data sets • Advantages of kernel methods: • A flexible solution that can be tailored to a specific domain • An efficient solution that is easy to parallelize • Combinations of multiple kernels • Provide a publicly available data set
  • 63. THANK YOU FOR YOUR KIND ATTENTION