A Tree Kernel Based Approach for Clone Detection
Anna Corazza (1), Sergio Di Martino (1), Valerio Maggio (1), Giuseppe Scanniello (2)
1) University of Naples Federico II
2) University of Basilicata
Outline
►Background
○ Clone detection definition
○ State of the Art Techniques Taxonomy
►Our Abstract Syntax Tree based Proposal
○ A Tree Kernel based approach for clone detection
►A preliminary evaluation
Code Clones
► Two code fragments form a clone if they are similar enough
according to a given measure of similarity (I.D. Baxter, 1998)
► Similarity based on Program Text or on “Semantics”
► Program Text clones can be further distinguished by their degree of similarity [1]
○ Type 1 Clone: Exact Copy
○ Type 2 Clone: Parameter Substituted Clone
○ Type 3 Clone: Modified/Structure Substituted Clone
1. R. Tiarks, R. Koschke, and R. Falke,
An assessment of type-3 clones as detected by state-of-the-art tools
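The three clone types can be illustrated with a small sketch. It uses Python's standard `ast` module rather than the authors' tool, and the sample functions and the `shape` and `_Anonymize` helpers are hypothetical names introduced only for this example. The idea: a Type 1 clone has the same syntax tree as the original, a Type 2 clone has the same tree once identifiers are ignored, and a Type 3 clone differs structurally even then.

```python
import ast

ORIGINAL = """
def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

# Type 1: exact copy (only comments and layout differ).
TYPE_1 = """
def total(values):
    result = 0   # running sum
    for v in values:
        result += v

    return result
"""

# Type 2: same structure, identifiers substituted.
TYPE_2 = """
def add_all(items):
    acc = 0
    for x in items:
        acc += x
    return acc
"""

# Type 3: copy with a modified structure (statements added).
TYPE_3 = """
def add_all(items):
    acc = 0
    for x in items:
        if x is None:
            continue
        acc += x
    return acc
"""


class _Anonymize(ast.NodeTransformer):
    """Replace every identifier with a placeholder (hypothetical helper)."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Name(self, node):
        node.id = "_"
        return node


def shape(source, anonymize=False):
    """Dump the AST of `source`, optionally ignoring identifier names."""
    tree = ast.parse(source)
    if anonymize:
        tree = _Anonymize().visit(tree)
    # ast.dump omits line/column attributes by default, so layout is ignored.
    return ast.dump(tree)


print(shape(ORIGINAL) == shape(TYPE_1))              # Type 1: identical trees
print(shape(ORIGINAL, True) == shape(TYPE_2, True))  # Type 2: identical after renaming
print(shape(ORIGINAL, True) == shape(TYPE_3, True))  # Type 3: still different
```

Exact tree equality is enough for Type 1 and Type 2 in this toy setting; Type 3 needs a graded similarity measure, which is what the rest of the talk proposes.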
State of the Art Techniques
► Classified in terms of Program Text representation [2]
○ String, token, syntax tree, control structures, metric vectors
► String/Token based Techniques
► Abstract Syntax Tree (AST) Techniques
► ...
► Combined Techniques (a.k.a. Hybrid)
○ Combine different representations
○ Combine different techniques
○ Combine different sources of information
● Tree Kernel based approach (our approach)
2. Roy, Cordy, Koschke, Comparison and Evaluation of Clone Detection Tools and Technique, 2009
The Proposed Approach
The Goal
► Define an AST based technique able to detect up to Type 3 Clones
► The Key Ideas:
○ Improve the amount of information carried by ASTs by adding (also) lexical information
○ Define a proper measure to compute similarities among (sub)trees, exploiting such information
► As a measure we propose the use of a (Tree) Kernel Function
Kernels for Structured Data
► Kernels are a class of functions with many appealing features:
○ They are based on the idea that a complex object can be described in terms of its constituent parts
○ They can be easily tailored to a specific domain
► There exist different classes of Kernels:
○ String Kernels
○ Graph Kernels
○ …
○ Tree Kernels
● Applied to NLP Parse Trees (Collins and Duffy 2004)
Defining a new Tree Kernel
► The definition of a new Tree Kernel requires the specification of:
(1) A set of features to annotate nodes of compared trees
(2) A (primitive) Kernel Function to measure the similarity of each pair of nodes
(3) A proper Kernel Function to compare subparts of trees
(1) The defined features
► We annotate each AST node with 4 features:
○ Instruction Class
● e.g. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,...
○ Instruction
● e.g. FOR, WHILE, IF, RETURN, CONTINUE,...
○ Context
● Instruction class of the statement in which the node is enclosed
○ Lexemes
● Lexical information within the code
Context Feature
► Rationale: two nodes are more similar if they appear in the same
Instruction class
for (int i=0; i<10; i++)
    x += i+2;
if (i<10)
    x += i+2;
while (i<10)
    x += i+2;
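The Instruction Class, Instruction and Context features can be sketched on Python sources with the standard `ast` module. This is only an illustration under assumptions: the `INSTRUCTIONS` mapping, the `annotate` function and the `ASSIGNMENT` / `AUG_ASSIGN` labels are hypothetical and are not taken from the authors' prototype. The three inputs mirror the `for`, `if` and `while` fragments above.

```python
import ast

# Hypothetical mapping: Python AST node type -> (Instruction Class, Instruction).
INSTRUCTIONS = {
    ast.For: ("LOOP", "FOR"),
    ast.While: ("LOOP", "WHILE"),
    ast.If: ("CONDITIONAL CONTROL", "IF"),
    ast.Return: ("CONTROL FLOW CONTROL", "RETURN"),
    ast.Continue: ("CONTROL FLOW CONTROL", "CONTINUE"),
    ast.AugAssign: ("ASSIGNMENT", "AUG_ASSIGN"),
}


def annotate(source):
    """Return (instruction class, instruction, context) for each known statement.

    The context is the instruction class of the closest enclosing annotated
    statement, or None at top level.
    """
    rows = []

    def visit(node, context):
        features = INSTRUCTIONS.get(type(node))
        if features is not None:
            rows.append((features[0], features[1], context))
            context = features[0]  # becomes the context of nested statements
        for child in ast.iter_child_nodes(node):
            visit(child, context)

    visit(ast.parse(source), None)
    return rows


in_for = annotate("for i in range(10):\n    x += i + 2\n")
in_if = annotate("if i < 10:\n    x += i + 2\n")
in_while = annotate("while i < 10:\n    x += i + 2\n")

# The same statement `x += i + 2` gets a different context in each fragment.
print(in_for[1], in_if[1], in_while[1])
```

With this labelling the statement inside `for` and the one inside `while` share the context LOOP, while the one inside `if` has the context CONDITIONAL CONTROL, which matches the rationale stated above.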
Lexemes Feature
► For leaf nodes:
○ It is the lexeme associated with the node
► For internal nodes:
○ It is the set of lexemes propagated recursively from the subtrees with minimum height
Lexemes Propagation
[Figure: AST with the nodes block, while, <, x, 0, block, %=, x, y, return, y. Lexeme sets are propagated bottom-up over successive steps: first the leaf lexemes x, 0, x, y and y, then the sets {x, 0}, {x, y}, {x, y}, {x, 0, while}, {y, return} and {y, return} on the internal nodes.]
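The propagation rule can be sketched as follows. This is one reading of the figure, not the authors' code: the `Node` class and the `height` and `lexemes` functions are hypothetical, and the sketch assumes that an internal node takes the lexemes of its minimum-height child subtrees and adds its own lexeme only when it has one (keywords such as `while` and `return` in the figure).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    label: str
    lexeme: Optional[str] = None  # lexeme contributed by the node itself, if any
    children: List["Node"] = field(default_factory=list)


def height(node: Node) -> int:
    """Height of the subtree rooted at `node` (a leaf has height 0)."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children)


def lexemes(node: Node) -> Set[str]:
    """Lexeme set of a node, propagated from its minimum-height subtrees."""
    own = {node.lexeme} if node.lexeme is not None else set()
    if not node.children:
        return own
    lowest = min(height(child) for child in node.children)
    propagated: Set[str] = set()
    for child in node.children:
        if height(child) == lowest:
            propagated |= lexemes(child)
    return propagated | own


# The tree from the figure, as read here: a while loop followed by a return.
condition = Node("<", children=[Node("x", "x"), Node("0", "0")])
body = Node("block", children=[Node("%=", children=[Node("x", "x"), Node("y", "y")])])
loop = Node("while", "while", [condition, body])
ret = Node("return", "return", [Node("y", "y")])
root = Node("block", children=[loop, ret])

for n in (condition, body, loop, ret, root):
    print(n.label, sorted(lexemes(n)))
```

Under these assumptions the sketch yields the same six sets listed in the figure, which suggests the reading is at least consistent with it.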
(2) Applying features in a Kernel
We exploit these features to compute the similarity between pairs of nodes, as follows:
► Instruction Class filters comparable nodes
○ We compare only nodes with the same Instruction Class
► Instruction, Context and Lexemes are used to define a similarity value between the compared nodes
(Primitive) Kernel Function between nodes
\[
s(n_1, n_2) =
\begin{cases}
1.0  & \text{if the two nodes have the same values of features} \\
0.8  & \text{if the two nodes differ in lexemes (same instruction and context)} \\
0.7  & \text{if the two nodes share lexemes and are the same instruction} \\
0.5  & \text{if the two nodes share lexemes and are enclosed in the same context} \\
0.25 & \text{if the two nodes have at least one feature in common} \\
0.0  & \text{no match}
\end{cases}
\]
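The piecewise definition of s(n1, n2) translates almost directly into code. The sketch below is a reading of the slide under stated assumptions: `NodeFeatures` and `node_similarity` are hypothetical names, "share lexemes" is taken to mean that the two lexeme sets have at least one element in common, the cases are tested from top to bottom, and nodes with different Instruction Classes are never compared and score 0.0.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class NodeFeatures:
    instruction_class: str
    instruction: str
    context: str
    lexemes: FrozenSet[str]


def node_similarity(n1: NodeFeatures, n2: NodeFeatures) -> float:
    """Primitive kernel s(n1, n2), with cases tested from top to bottom."""
    # Instruction Class acts as a filter: other pairs are not comparable.
    if n1.instruction_class != n2.instruction_class:
        return 0.0

    same_instruction = n1.instruction == n2.instruction
    same_context = n1.context == n2.context
    same_lexemes = n1.lexemes == n2.lexemes
    share_lexemes = bool(n1.lexemes & n2.lexemes)  # assumption: non-empty overlap

    if same_instruction and same_context and same_lexemes:
        return 1.0
    if same_instruction and same_context:
        return 0.8
    if share_lexemes and same_instruction:
        return 0.7
    if share_lexemes and same_context:
        return 0.5
    if same_instruction or same_context or share_lexemes:
        return 0.25
    return 0.0


# `x += i+2` in different contexts (labels are illustrative only).
in_for = NodeFeatures("ASSIGNMENT", "AUG_ASSIGN", "LOOP", frozenset({"x", "i", "2"}))
in_while = NodeFeatures("ASSIGNMENT", "AUG_ASSIGN", "LOOP", frozenset({"x", "i", "2"}))
in_if = NodeFeatures("ASSIGNMENT", "AUG_ASSIGN", "CONDITIONAL CONTROL", frozenset({"x", "i", "2"}))
renamed = NodeFeatures("ASSIGNMENT", "AUG_ASSIGN", "LOOP", frozenset({"y", "j", "3"}))
loop = NodeFeatures("LOOP", "FOR", "LOOP", frozenset({"i"}))

print(node_similarity(in_for, in_while))  # same features
print(node_similarity(in_for, renamed))   # lexemes differ
print(node_similarity(in_for, in_if))     # context differs
print(node_similarity(in_for, loop))      # filtered out
```

The weights are the hand-picked values from the slide; the future-work item about avoiding manual weighting refers to exactly these constants.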
(3) Tree Kernel: Kernel on entire Tree Structures
► We apply node comparison recursively to compute the similarity between subtrees
► We aim to identify the maximum isomorphic tree/subtree
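The slides do not spell out how node scores are combined, so the following is only one possible way to lift a node-level score to whole subtrees, not the authors' kernel. Everything here is an assumption made for illustration: the `Tree` class, the label-equality `label_similarity`, the positional pairing of children, and the normalisation that keeps the result between 0 and 1.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Tree:
    label: str
    children: List["Tree"] = field(default_factory=list)


def label_similarity(a: Tree, b: Tree) -> float:
    """Stand-in for the primitive kernel: 1.0 on equal labels, else 0.0."""
    return 1.0 if a.label == b.label else 0.0


def tree_similarity(a: Tree, b: Tree,
                    node_sim: Callable[[Tree, Tree], float] = label_similarity) -> float:
    """Recursive similarity of two subtrees, normalised to [0, 1]."""
    root = node_sim(a, b)
    if root == 0.0:
        return 0.0  # roots do not match: the subtrees are not compared further
    # Children are paired by position; unpaired children contribute 0.
    width = max(len(a.children), len(b.children))
    paired = sum(tree_similarity(ca, cb, node_sim)
                 for ca, cb in zip(a.children, b.children))
    return (root + paired) / (1 + width)


def make_loop(constant: str) -> Tree:
    """while (x < constant) { x %= y } as a small labelled tree."""
    return Tree("while", [
        Tree("<", [Tree("x"), Tree(constant)]),
        Tree("block", [Tree("%=", [Tree("x"), Tree("y")])]),
    ])


print(tree_similarity(make_loop("0"), make_loop("0")))  # identical trees
print(tree_similarity(make_loop("0"), make_loop("1")))  # one leaf changed
```

Identical trees score 1.0 and a single changed leaf lowers the score only slightly, which is the kind of graded behaviour needed before applying a similarity threshold for Type 3 clones.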
Overall Process
1. Preprocessing
2. Extraction
3. Match Detection
4. Aggregation
A Preliminary evaluation
Evaluation Description
► We considered a small Java software system
○ We chose to identify clones at method level
► We checked the system for the presence of clones up to Type 3
○ We removed all detected clones through refactoring operations
► We manually and randomly injected a set of artificially created clones
○ One set for each type of clone
► We applied our prototype and CloneDigger* to the mutated systems
► We evaluated performance in terms of Precision, Recall and F1
*http://clonedigger.sourceforge.net/
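Precision, Recall and F1 can be computed from the set of injected clone pairs and the set of pairs a tool reports. The sketch below uses the standard definitions; the function name and the method identifiers in the example are made up for illustration and are not data from the evaluation.

```python
def precision_recall_f1(detected, injected):
    """Return (precision, recall, F1) for detected vs. injected clone pairs.

    Each clone pair is a frozenset of two method identifiers, so the order
    in which a tool reports the two methods does not matter.
    """
    detected, injected = set(detected), set(injected)
    true_positives = len(detected & injected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three injected pairs, three reported pairs, two correct.
injected = {
    frozenset({"A.load", "B.read"}),
    frozenset({"A.save", "C.store"}),
    frozenset({"D.parse", "E.scan"}),
}
detected = {
    frozenset({"B.read", "A.load"}),
    frozenset({"A.save", "C.store"}),
    frozenset({"F.init", "G.setup"}),
}

precision, recall, f1 = precision_recall_f1(detected, injected)
print(precision, recall, f1)
```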
Results (1)
► Type 1 and Type 2 Clones:
○ We were able to detect all clones without any false positives
○ CloneDigger obtained the same result
○ Both tools show the potential of AST-based approaches
Results (2)
► Type 3 clones:
○ We classified results as “true Type 3 clones” according to different thresholds on similarity values
○ We measured performance at different thresholds
► We obtained the best results with a threshold equal to 0.70
Conclusions and Future Work
► Measure performance on real systems and projects
○ Bellon's Benchmark
○ Investigate the best results obtained with 0.7 as threshold
○ Measure time performance
► Improve the scalability of the approach
○ Avoid comparing all pairs
► Improve similarity computation
○ Avoid manual weighting of features
► Extend Supported Languages
○ Currently we support Java, C and Python
Thank you for listening.
Questions?

A tree kernel based approach for clone detection

  • 1. A Tree Kernel Based Approach for Clone Detection 1) University of Naples Federico II 2) University of Basilicata Anna Corazza1 , Sergio Di Martino1 , Valerio Maggio1 , Giuseppe Scanniello2
  • 2. Outline ►Background ○ Clone detection definition ○ State of the Art Techniques Taxonomy ►Our Abstract Syntax Tree based Proposal ○ A Tree Kernel based approach for clone detection ►A preliminary evaluation
  • 3. Code Clones ► Two code fragments form a clone if they are similar enough according to a given measure of similarity (I.D. Baxter, 1998) 3. R. Tiarks, R. Koschke, and R. Falke, An assessment of type-3 clones as detected by state-of-the-art tools 1
  • 4. Code Clones ► Two code fragments form a clone if they are similar enough according to a given measure of similarity (I.D. Baxter, 1998) ► Similarity based on Program Text or on “Semantics” 3. R. Tiarks, R. Koschke, and R. Falke, An assessment of type-3 clones as detected by state-of-the-art tools 1
  • 5. Code Clones ► Two code fragments form a clone if they are similar enough according to a given measure of similarity (I.D. Baxter, 1998) ► Similarity based on Program Text or on “Semantics” ► Program Text can be further distinguished by their degree of similarity1 ○ Type 1 Clone: Exact Copy ○ Type 2 Clone: Parameter Substituted Clone ○ Type 3 Clone: Modified/Structure Substituted Clone 1. R. Tiarks, R. Koschke, and R. Falke, An assessment of type-3 clones as detected by state-of-the-art tools 1
  • 6. State of the Art Techniques ► Classified in terms of Program Text representation2 ○ String, token, syntax tree, control structures, metric vectors ► String/Token based Techniques ► Abstract Syntax Tree (AST) Techniques ► ... 2 2. Roy, Cordy, Koschke Comparison and Evaluation of Clone Detection Tools and Technique 2009
  • 7. State of the Art Techniques ► String/Token based Techniques ► Abstract Syntax Tree (AST) Techniques ► ... ► Combined Techniques (a.k.a. Hybrid) ○Combine different representations ○Combine different techniques ○Combine different sources of information ●Tree Kernel based approach (Our approach :) 2
  • 9. The Goal ► Define an AST based technique able to detect up to Type 3 Clones 3
  • 10. The Goal ► Define an AST based technique able to detect up to Type 3 Clones ► The Key Ideas: ○ Improve the amount of information carried by ASTs by adding (also) lexical information ○ Define a proper measure to compute similarities among (sub)trees, exploiting such information 3
  • 11. The Goal ► Define an AST based technique able to detect up to Type 3 Clones ► The Key Ideas: ○ Improve the amount of information carried by ASTs by adding (also) lexical information ○ Define a proper measure to compute similarities among (sub)trees, exploiting such information ► As a measure we propose the use of a (Tree) Kernel Function 3
  • 12. Kernels for Structured Data ► Kernels are a class of functions with many appealing features: ○ Are based on the idea that a complex object can be described in terms of its constituent parts ○ Can be easily tailored to a specific domain ► There exist different classes of Kernels: ○ String Kernels ○ Graph Kernels ○ … ○ Tree Kernels ● Applied to NLP Parse Trees (Collins and Duffy 2004) 4
  • 15. Defining a new Tree Kernel ► The definition of a new Tree Kernel requires the specification of: (1) A set of features to annotate nodes of compared trees (2) A (primitive) Kernel Function to measure the similarity of each pair of nodes (3) A proper Kernel Function to compare subparts of trees
  • 20. (1) The defined features ► We annotate each AST node with 4 features: ○ Instruction Class ● e.g. LOOP, CONDITIONAL CONTROL, CONTROL FLOW CONTROL,... ○ Instruction ● e.g. FOR, WHILE, IF, RETURN, CONTINUE,... ○ Context ● Instruction class of the statement in which the node is enclosed ○ Lexemes ● Lexical information within the code
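One way to picture the four annotations is as a small record attached to each AST node. The sketch below is an assumption about the data layout, not the authors' implementation; the class name `AnnotatedNode` and the labels `"ASSIGNMENT"` and `"ADD_ASSIGN"` are hypothetical, since the slides only list a few example values.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class AnnotatedNode:
    instruction_class: str      # e.g. "LOOP", "CONDITIONAL CONTROL"
    instruction: str            # e.g. "FOR", "WHILE", "IF"
    context: Optional[str]      # instruction class of the enclosing statement
    lexemes: FrozenSet[str]     # lexical information attached to the node
    children: List["AnnotatedNode"] = field(default_factory=list)


# `for (int i=0; i<10; i++) x += i+2;` as two annotated nodes.
# "ASSIGNMENT" and "ADD_ASSIGN" are hypothetical labels.
body = AnnotatedNode("ASSIGNMENT", "ADD_ASSIGN", "LOOP", frozenset({"x", "i", "2"}))
loop = AnnotatedNode("LOOP", "FOR", None, frozenset({"i", "0", "10"}), [body])

print(loop.children[0].context)
```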
  • 21. Context Feature ► Rationale: two nodes are more similar if they appear in the same Instruction class ○ for (int i=0; i<10; i++) x += i+2; ○ if (i<10) x += i+2; ○ while (i<10) x += i+2;
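The context feature can be illustrated by walking a syntax tree and remembering the instruction class of the nearest enclosing statement. This sketch uses Python's standard `ast` module on Python versions of the three fragments above; the mapping `INSTRUCTION_CLASS` and the function `contexts` are illustrative assumptions, not the authors' code.

```python
import ast

# Assumed mapping from Python statement types to the slide's instruction classes.
INSTRUCTION_CLASS = {
    ast.For: "LOOP",
    ast.While: "LOOP",
    ast.If: "CONDITIONAL CONTROL",
}


def contexts(source):
    """Return (statement type, context) pairs, in source order."""
    found = []

    def walk(node, context):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.stmt):
                found.append((type(child).__name__, context))
                # Statements without a class of their own pass the context on.
                walk(child, INSTRUCTION_CLASS.get(type(child), context))
            else:
                walk(child, context)

    walk(ast.parse(source), None)
    return found


for_src = "for i in range(10):\n    x += i + 2\n"
if_src = "if i < 10:\n    x += i + 2\n"
while_src = "while i < 10:\n    x += i + 2\n"

for src in (for_src, if_src, while_src):
    print(contexts(src)[-1])
```

The assignment gets the same context inside the `for` and `while` loops and a different one inside the `if`, which is the distinction the rationale relies on.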
  • 26. Lexemes Feature ► For leaf nodes: ○ It is the lexeme associated with the node ► For internal nodes: ○ It is the set of lexemes propagated recursively from the subtrees with minimum height
  • 32. Lexemes Propagation [Figure: an AST with nodes block, while, <, x, 0, block, %=, x, y, return, y; each node is annotated with its propagated lexemes: {x, 0}, {x, y}, {x, y}, {x, 0, while}, {y, return}, {y, return}]
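The propagation rule can be sketched as follows. This is one reading of the slide, stated as an assumption: a leaf carries its own lexeme, and an internal node takes the union of the lexeme sets of its minimum-height child subtrees, plus its own keyword when it has one (as `while` and `return` appear to do in the figure). The names `Node`, `height` and `lexemes` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    lexeme: Optional[str] = None            # own token or keyword, if any
    children: List["Node"] = field(default_factory=list)


def height(node):
    """Height of a subtree; a leaf has height 0."""
    return 0 if not node.children else 1 + max(height(c) for c in node.children)


def lexemes(node):
    """Lexeme set of a node under the minimum-height propagation rule."""
    if not node.children:
        return {node.lexeme} if node.lexeme is not None else set()
    lowest = min(height(c) for c in node.children)
    result = set()
    for child in node.children:
        if height(child) == lowest:
            result |= lexemes(child)
    if node.lexeme is not None:
        result.add(node.lexeme)
    return result


# Tree for: while (x < 0) { y %= x; } return y;
cond = Node("<", children=[Node("x", "x"), Node("0", "0")])
assign = Node("%=", children=[Node("y", "y"), Node("x", "x")])
loop = Node("while", "while", [cond, Node("block", children=[assign])])
ret = Node("return", "return", [Node("y", "y")])
root = Node("block", children=[loop, ret])

print(sorted(lexemes(loop)))
print(sorted(lexemes(root)))
```

Under this reading the `while` node ends up with {x, 0, while} and the outer block with {y, return}, matching the annotations listed for the figure.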
  • 33. (2) Applying features in a Kernel We exploit these features to compute the similarity between pairs of nodes, as follows: ► Instruction Class filters comparable nodes ○ We compare only nodes with the same Instruction Class ► Instruction, Context and Lexemes are used to define a similarity value between the compared nodes
  • 34. (Primitive) Kernel Function between nodes ► s(n1,n2) = ○ 1.0 if the two nodes have the same values of features ○ 0.8 if the two nodes differ in lexemes (same instruction and context) ○ 0.7 if the two nodes share lexemes and are the same instruction ○ 0.5 if the two nodes share lexemes and are enclosed in the same context ○ 0.25 if the two nodes have at least one feature in common ○ 0.0 if there is no match
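The case table translates almost directly into code. The sketch below makes several assumptions that the slide leaves open: nodes with different instruction classes score 0.0 (they are never compared), "share lexemes" means a non-empty intersection, and "at least one feature in common" refers to instruction, context or lexemes. `Features` and `node_similarity` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Features:
    instruction_class: str
    instruction: str
    context: Optional[str]
    lexemes: FrozenSet[str]


def node_similarity(a, b):
    """Primitive kernel s(n1, n2) following the slide's case table."""
    if a.instruction_class != b.instruction_class:
        return 0.0                       # assumed: filtered-out pairs score 0
    same_instruction = a.instruction == b.instruction
    same_context = a.context == b.context
    share_lexemes = bool(a.lexemes & b.lexemes)   # assumed: non-empty overlap
    if same_instruction and same_context:
        return 1.0 if a.lexemes == b.lexemes else 0.8
    if share_lexemes and same_instruction:
        return 0.7
    if share_lexemes and same_context:
        return 0.5
    if same_instruction or same_context or share_lexemes:
        return 0.25
    return 0.0


for_loop = Features("LOOP", "FOR", "BLOCK", frozenset({"i", "x"}))
while_loop = Features("LOOP", "WHILE", "BLOCK", frozenset({"i", "x"}))
other_for = Features("LOOP", "FOR", "BLOCK", frozenset({"j"}))
branch = Features("CONDITIONAL CONTROL", "IF", "BLOCK", frozenset({"i", "x"}))

print(node_similarity(for_loop, while_loop))
```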
  • 35. (3) Tree Kernel: Kernel on entire Tree Structures ► We apply the node comparison recursively to compute the similarity between subtrees ► We aim to identify the maximum isomorphic tree/subtree
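The slide does not give the recursion itself, so the following is only a simplified stand-in showing how a node-level score can be lifted to subtrees. It pairs children by position and normalizes by the wider node, which is a much cruder strategy than searching for the maximum isomorphic subtree. `Node`, `label_match` and `tree_similarity` are hypothetical names, and `label_match` is a placeholder for a richer primitive kernel.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def label_match(a, b):
    """Placeholder primitive kernel: 1.0 on equal labels, else 0.0."""
    return 1.0 if a.label == b.label else 0.0


def tree_similarity(a, b, node_sim=label_match):
    """Recursive similarity in [0, 1] when node_sim is in [0, 1]."""
    score = node_sim(a, b)
    if score == 0.0:
        return 0.0                       # incomparable roots: stop descending
    # Assumed strategy: compare children pairwise, in order.
    child_total = sum(
        tree_similarity(x, y, node_sim) for x, y in zip(a.children, b.children)
    )
    widest = max(len(a.children), len(b.children))
    return (score + child_total) / (1 + widest)


full = Node("while", [Node("cond"), Node("block")])
partial = Node("while", [Node("cond")])

print(tree_similarity(full, full))
print(tree_similarity(full, partial))
```

Identical trees score 1.0, and the score drops as children go unmatched, which gives the graded values that a threshold can then be applied to.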
  • 36. Overall Process 1. Preprocessing 2. Extraction 3. Match Detection 4. Aggregation
  • 38. Evaluation Description ► We considered a small Java software system ○ We chose to identify clones at method level ► We checked the system for the presence of clones up to Type 3 ○ We removed all detected clones through refactoring operations ► We manually and randomly injected a set of artificially created clones ○ One set for each type of clone ► We applied our prototype and CloneDigger* to the mutated systems ► We evaluated performance in terms of Precision, Recall and F1 *http://clonedigger.sourceforge.net/
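For reference, the three measures can be computed from the set of reported clone pairs and the set of injected ones. The function name and the method identifiers below are made up for illustration; they are not data from the evaluation.

```python
def precision_recall_f1(detected, injected):
    """Precision, recall and F1 of detected clone pairs against injected ones."""
    true_positives = len(detected & injected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical method pairs, for illustration only.
detected = {("a", "b"), ("c", "d"), ("e", "f")}
injected = {("a", "b"), ("c", "d"), ("g", "h"), ("i", "j")}

precision, recall, f1 = precision_recall_f1(detected, injected)
print(precision, recall, f1)
```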
  • 39. Results (1) ► Type 1 and Type 2 Clones: ○ We were able to detect all clones without any false positives ○ CloneDigger obtained the same result ○ Both tools demonstrate the potential of AST-based approaches
  • 40. Results (2) ► Type 3 clones: ○ We classified results as “true Type 3 clones” according to different thresholds on similarity values ○ We measured performance at different thresholds ► We obtained the best results with a threshold equal to 0.70
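Thresholding similarity values amounts to a simple filter over scored pairs. In this sketch the scores are toy values, not results from the evaluation, and treating a score equal to the threshold as a match is an assumption; `report_clones` is a hypothetical helper.

```python
def report_clones(scores, threshold):
    """Return the pairs whose similarity reaches the threshold, sorted."""
    return sorted(pair for pair, score in scores.items() if score >= threshold)


# Toy similarity values for hypothetical method pairs.
toy_scores = {("m1", "m2"): 0.92, ("m3", "m4"): 0.71, ("m5", "m6"): 0.55}

for threshold in (0.50, 0.70, 0.95):
    print(threshold, report_clones(toy_scores, threshold))
```

Lower thresholds report more candidate pairs and higher ones report fewer, which is the trade-off behind measuring performance at several thresholds.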
  • 41. Conclusions and Future Work ► Measure performance on real systems and projects ○ Bellon's Benchmark ○ Investigate the best results obtained with 0.7 as threshold ○ Measure time performance ► Improve the scalability of the approach ○ Avoid comparing all pairs ► Improve similarity computation ○ Avoid manually weighting features ► Extend Supported Languages ○ Currently supported: Java, C, Python
  • 42. Thank you for listening. Questions? 18