UNSUPERVISED MACHINE LEARNING FOR CLONE DETECTION
Valerio Maggio, Ph.D.
June 25, 2013
valerio.maggio@unina.it
General Disclaimer:
All the Maths appearing in the next slides is only intended to better introduce the considered case studies. Speakers are not responsible for any possible disease or “brain consumption” caused by too many formulas.
So BEWARE; use this information at your own risk!
Its intention is solely educational. We would strongly encourage you to use this information in cooperation with a medical or health professional.
AwfulMaths
Number one in the stink parade is duplicated code.
If you see the same code structure in more than one
place, you can be sure that your program will be better
if you find a way to unify them.
[Figures: code examples from JHOTDRAW (ImageMapOutputFormat.java, SVGOutputFormat.java), CPYTHON 2.5.1 and PYTHON (NLTK)]
PROBLEM STATEMENT: CLONE DETECTION
Software clones are fragments of code that are similar according
to some predefined measure of similarity
I.D. Baxter, 1998
Clones Textual Similarity
Clones Functional Similarity
Clones affect the reliability of the system!
Sneaky Bug!
DIFFERENT TYPES OF CLONES
THE ORIGINAL ONE
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)  # Stores only lines that end with "marker"
    return lines  # Return the list of different lines
TYPE 1: Exact Copy
• Identical code segments except for differences in layout, whitespace, and comments
def do_something_cool_in_Python (filepath, marker='---end---'):
    lines = list()  # This list is initially empty
    with open(filepath) as report:
        for l in report:  # It goes through the lines of the file
            if l.endswith(marker):
                lines.append(l)
    return lines
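A minimal sketch of how a Type 1 check might look in Python, assuming that comparing token streams with comments and blank lines dropped is an acceptable approximation. The helper names `code_tokens` and `is_type1_clone` are hypothetical and do not come from the slides; indentation tokens are kept (without their width) because indentation is part of Python's structure.

```python
import io
import tokenize


def code_tokens(source):
    """Return the token stream of `source` without comments and layout details."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Comments and blank/continuation line breaks never matter for Type 1.
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue
        # Keep the structural role of indentation, but not its exact width.
        if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
            tokens.append((tok.type, ""))
        else:
            tokens.append((tok.type, tok.string))
    return tokens


def is_type1_clone(fragment_a, fragment_b):
    """True when both fragments differ only in layout, whitespace and comments."""
    return code_tokens(fragment_a) == code_tokens(fragment_b)
```

Renaming a single identifier is enough to make this check fail, which is exactly where Type 2 detection has to take over.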
TYPE 2: Parameter Substituted
• Structurally identical segments except for differences in identifiers, literals, layout, whitespace, and comments
# Type 2 Clone
def do_something_cool_in_Python(path, end='---end---'):
    targets = list()
    with open(path) as data_file:
        for t in data_file:
            if t.endswith(end):
                targets.append(t)  # Stores only lines that end with "end"
    # Return the list of different lines
    return targets
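One way to approximate a Type 2 check in Python is to parse both fragments and replace every identifier and literal with a fixed placeholder before comparing. This is only a sketch under stated assumptions: `_Normalizer`, `normalized_dump` and `is_type2_clone` are hypothetical names, attribute names such as `endswith` are left untouched, and the renaming is not checked for consistency (all identifiers collapse to the same placeholder).

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Replace identifiers and literals with fixed placeholders."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_ID", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_ID"
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = "_ID"
        self.generic_visit(node)
        return node

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value="_LIT"), node)


def normalized_dump(source):
    """Textual form of the AST after normalisation (positions are not included)."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree)


def is_type2_clone(fragment_a, fragment_b):
    """True when the fragments are structurally identical up to names and literals."""
    return normalized_dump(fragment_a) == normalized_dump(fragment_b)
```

Because parsing already discards comments and layout, Type 1 differences are ignored as a side effect.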
TYPE 3: Structure Substituted
• Similar segments with further modifications such as changed, added (or deleted) statements, in addition to variations in identifiers, literals, layout and comments
import os
def do_something_with(path, marker='---end---'):
    # Check if the input path corresponds to a file
    if not os.path.isfile(path):
        return None
    bad_ones = list()
    good_ones = list()
    with open(path) as report:
        for line in report:
            line = line.strip()
            if line.endswith(marker):
                good_ones.append(line)
            else:
                bad_ones.append(line)
    # Return the lists of different lines
    return good_ones, bad_ones
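Type 3 clones cannot be found by an equality test, so a similarity score with a threshold is needed. The sketch below is an assumption-laden illustration, not the technique used in the talk: it compares the pre-order sequences of AST node type names with `difflib.SequenceMatcher`, and the names `node_types`, `similarity` and the `0.7` threshold are hypothetical.

```python
import ast
import difflib


def node_types(node):
    """Yield AST node type names in pre-order (identifiers and literals are ignored)."""
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from node_types(child)


def similarity(fragment_a, fragment_b):
    """Ratio in [0, 1] between the node-type sequences of two fragments."""
    seq_a = list(node_types(ast.parse(fragment_a)))
    seq_b = list(node_types(ast.parse(fragment_b)))
    return difflib.SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def is_near_miss_clone(fragment_a, fragment_b, threshold=0.7):
    """Hypothetical Type 3 test: similar enough, without requiring equality."""
    return similarity(fragment_a, fragment_b) >= threshold
```

The threshold trades precision against recall: lowering it retrieves more modified copies but also more unrelated fragments.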
TYPE 4: “Functional” Copies
• Semantically equivalent segments that perform the same
computation but are implemented by different syntactic variants
def do_always_the_same_stuff(filepath, marker='---end---'):
    report = open(filepath)
    file_lines = report.readlines()
    report.close()
    # Keeps only the lines ending with marker (list() is needed on Python 3,
    # where filter returns an iterator)
    return list(filter(lambda l: len(l) and l.endswith(marker), file_lines))
HTTPD 2.2.14: TYPE 1
HTTPD 2.2.14: TYPE 2
HTTPD 2.2.14: TYPE 3
SOURCE CODE INFORMATION
[Figure: AST of FUNCTION parser_compare with PARAMS (node *left, node *right) and a BODY containing two IF-STMT nodes and a RETURN-STMT; the tree also shows COND nodes (OR, ==), RETURN-STMT nodes returning 0, and a CALL-STMT to parser_compare_node whose PARAMS are STRUCT-OP nodes on left st_node and right st_node]
[Figure: PDG with ENTRY, EXIT, FORMAL-IN, FORMAL-OUT, ACTUAL-IN, ACTUAL-OUT, BODY, CONTROL-POINT, EXPR, CALL-SITE and RETURN nodes]
STATE OF THE ART TOOLS
Duplix, Scorpio, PMD, CCFinder, Dup, CPD, Shinobi, Clone Detective, Gemini, iClones, KClone, ConQAT, Deckard, Clone Digger, JCCD, CloneDr, SimScan, CLICS, NiCAD, Simian, Duploc, Dude, SDD
Text Based Tools:
Text is compared line by line
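To make the line-by-line idea concrete, here is a small sketch that hashes every window of consecutive non-blank lines and reports the windows that occur more than once. It is not the algorithm of any of the listed tools; the function name `duplicated_windows` and the default window size are assumptions for illustration.

```python
from collections import defaultdict


def duplicated_windows(source, window=3):
    """Map each repeated block of `window` stripped lines to its starting line numbers."""
    # Keep (line_number, text) pairs for non-blank lines, ignoring indentation.
    lines = [(number, text.strip())
             for number, text in enumerate(source.splitlines(), start=1)
             if text.strip()]
    seen = defaultdict(list)
    for k in range(len(lines) - window + 1):
        chunk = tuple(text for _, text in lines[k:k + window])
        seen[chunk].append(lines[k][0])
    # Only blocks seen at least twice are clone candidates.
    return {chunk: starts for chunk, starts in seen.items() if len(starts) > 1}
```

A comparison like this is fast, but renaming one variable inside a block is enough to hide the clone.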
Token Based Tools:
Token sequences are compared to sequences
Syntax Based Tools:
Syntax subtrees are compared to each other
Graph Based Tools:
(Sub)graphs are compared to each other
STATE OF THE ART TECHNIQUES
• String/Token based Techniques:
• Pros: Run very fast
• Cons: Too many false clones
• Syntax based (AST) Techniques:
• Pros: Well suited to detect structural similarities
• Cons: Not properly suited to detect Type 3 Clones
• Graph based Techniques:
• Pros: The only ones able to deal with Type 4 Clones
• Cons: Performance Issues
USE MACHINE LEARNING, LUKE
• Provides computationally effective solutions to analyze large data sets
• Provides solutions that can be tailored to different tasks/domains
• Requires considerable effort in:
• the definition of the relevant information best suited for the specific task/domain
• the application of the learning algorithms to the considered data
UNSUPERVISED LEARNING
• Supervised Learning:
• Learn from labelled samples
• Unsupervised Learning:
• Learn (directly) from the data
Learn by examples
(+) No cost of labeling samples
(-) Trade-off imposed on the quality of the data
STRUCTURES
KERNELSFORSTRUCTURES
Abstract Syntax Tree (AST)
Tree structure representing the syntactic structure of
the different instructions of a program (function)
Program Dependencies Graph (PDG)
(Directed) Graph structure representing the relationship
among the different statement of a program
Computation of the dot product between (Graph) Structures
K( ),
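The idea of a kernel as a dot product between structures can be illustrated with a deliberately simple feature map. The sketch below counts AST node type names and takes the dot product of the two count vectors; this bag-of-labels kernel is a stand-in chosen for brevity, not the tree or graph kernel presented in the talk, and `phi`, `kernel` and `normalized_kernel` are hypothetical names.

```python
import ast
import math
from collections import Counter


def phi(source):
    """Feature map: how many times each AST node type occurs in the fragment."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def kernel(fragment_a, fragment_b):
    """K(a, b) = <phi(a), phi(b)>, summed over the features both fragments share."""
    features_a, features_b = phi(fragment_a), phi(fragment_b)
    return sum(features_a[name] * features_b[name]
               for name in features_a.keys() & features_b.keys())


def normalized_kernel(fragment_a, fragment_b):
    """Cosine-normalised kernel, so that K(a, a) is 1 regardless of fragment size."""
    return kernel(fragment_a, fragment_b) / math.sqrt(
        kernel(fragment_a, fragment_a) * kernel(fragment_b, fragment_b))
```

Structure-aware kernels follow the same pattern but count richer features, such as shared subtrees or shared subgraphs, instead of isolated labels.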
KERNEL FOR CLONES
[Figure: three panels labelled CODE, AST and AST KERNEL. The ASTs shown include a while loop with condition x < y whose block assigns x = x + 1 and y = y - 1, a while loop with condition b > a whose block assigns a = a + 1 and b = b - 1, and an if statement with condition b > 0 whose block assigns c = 3.]
KERNELS FOR CODE STRUCTURES: AST
KERNEL FEATURES
[Figure: AST subtree with a while node whose children are < (with leaves x and y) and block]
• Instruction Class (IC): e.g., LOOP, CALL, CONDITIONAL_STATEMENT
• Instruction (I): e.g., FOR, IF, WHILE, RETURN
• Context (C): i.e., the Instruction Class of the closest statement node
• Lexemes (Ls): lexical information gathered (recursively) from leaves
Feature values for the nodes in the figure:
• while node: IC = Loop, I = while-loop, C = Function-Body, Ls = [x, y]
• < node: IC = Conditional-Expr, I = Less-operator, C = Loop, Ls = [x, y]
• block node: IC = Block, I = while-body, C = Loop, Ls = [x]
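The four features can be approximated on Python's own `ast` module. This is a sketch under assumptions: the `INSTRUCTION_CLASS` mapping, the `"MODULE"` default context and the choice to collect only identifiers as lexemes are all hypothetical, since the slides do not say how the features are computed.

```python
import ast

# Hypothetical mapping from Python AST node types to instruction classes.
INSTRUCTION_CLASS = {
    ast.FunctionDef: "FUNCTION_BODY",
    ast.While: "LOOP",
    ast.For: "LOOP",
    ast.If: "CONDITIONAL_STATEMENT",
    ast.Compare: "CONDITIONAL_EXPR",
    ast.Assign: "ASSIGNMENT",
    ast.Call: "CALL",
}


def lexemes(node):
    """Identifiers gathered recursively from the leaves, in first-occurrence order."""
    found = []
    if isinstance(node, ast.Name):
        found.append(node.id)
    for child in ast.iter_child_nodes(node):
        found.extend(lexemes(child))
    return list(dict.fromkeys(found))  # drop repeats, keep order


def features(node, context="MODULE"):
    """Yield (IC, I, C, Ls) for every node with a known instruction class."""
    ic = INSTRUCTION_CLASS.get(type(node))
    if ic is not None:
        yield (ic, type(node).__name__, context, lexemes(node))
    # A classified statement becomes the context of everything nested inside it.
    inner = ic if ic is not None and isinstance(node, ast.stmt) else context
    for child in ast.iter_child_nodes(node):
        yield from features(child, inner)
```

Calling `list(features(ast.parse(source)))` on a function that contains `while x < y:` yields one tuple for the loop and one for its condition, mirroring the while and < nodes described above.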
RESULTS: CLONE DETECTION
• Comparison with another (pure) AST-based clone detector
• Comparison on a system with randomly seeded clones
[Figure: bar chart of Precision, Recall and F-measure (scale 0 to 1) for CloneDigger and the Tree Kernel Tool]
Results refer to clones where code fragments have been modified by adding, removing or changing code statements
[Figure: "Precision, Recall and F-Measure" plot with series Precision, Recall and F1; y-axis from 0 to 1, x-axis values from 0.6 to 0.98 in steps of 0.02]
Precision: How accurate are the obtained results?
(Alternatively) How many errors do they contain?
Recall: How complete are the obtained results?
(Alternatively) How many clones have been retrieved w.r.t. the total number of clones?
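The two questions translate directly into set arithmetic over detected and actual clone pairs. A minimal sketch follows; the function name `precision_recall_f1` and the representation of a clone as a hashable pair are assumptions for illustration.

```python
def precision_recall_f1(detected, actual):
    """Compute precision, recall and F1 from two collections of clone pairs."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    # Precision: fraction of reported clones that are real clones.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: fraction of real clones that were reported.
    recall = true_positives / len(actual) if actual else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For instance, a detector that reports four pairs of which two are among three seeded clones has precision 2/4, recall 2/3 and F1 4/7.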
CODE STRUCTURES: PDG
NODES AND EDGES
• Two Types of Nodes
• Control Nodes (Dashed ones)
• e.g., if - for - while - function calls...
• Data Nodes
• e.g., expressions - parameters...
• Two Types of Edges (i.e., dependencies)
• Control edges (Dashed ones)
• Data edges
[Figure: example PDG with nodes while, call-site, arg and expr]
KERNELS FOR CODE STRUCTURES: PDG
GRAPH KERNELS FOR PDG
• Features of nodes:
• Node Label
• e.g., WHILE, CALL-SITE, EXPR, ...
• Node Type
• i.e., Data Node or Control Node
• Features of edges:
• Edge Type
• i.e., Data Edge or Control Edge
Example: Node Label = WHILE, Node Type = Control Node
[Figure: PDG with nodes while, call-site, arg, expr and expr, connected by Control Edges and Data Edges]
GRAPH KERNELS FOR PDG
[Figure: two PDGs being compared; the first has nodes while, call-site, arg, expr and expr, the second has nodes while, call-site, arg, expr and call-site]
• Goal: Identify common subgraphs
• Selectors: Compare nodes to each other and explore the subgraphs of only “compatible” nodes (i.e., nodes of the same type)
• Context: The subgraph of a node (with paths whose lengths are at most L, to avoid loops)
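The three ingredients (goal, selectors, context) can be sketched with a bounded path-counting kernel. This is a simplified stand-in for illustration only, not the kernel from the talk: a graph is assumed to be a pair of dictionaries, `nodes` mapping an id to `(label, node_type)` and `edges` mapping an id to a list of `(target_id, edge_type)`, and the names `bounded_paths` and `pdg_kernel` are hypothetical.

```python
from collections import Counter


def bounded_paths(nodes, edges, start, max_len):
    """Count the labelled paths of at most `max_len` edges that start at `start`."""
    counts = Counter()
    stack = [(start, (nodes[start],), 0)]
    while stack:
        node, path, depth = stack.pop()
        counts[path] += 1
        # The length bound plays the role of L: it guarantees termination on cycles.
        if depth < max_len:
            for target, edge_type in edges.get(node, ()):
                stack.append((target, path + (edge_type, nodes[target]), depth + 1))
    return counts


def pdg_kernel(graph_a, graph_b, max_len=2):
    """Sum, over compatible node pairs, the number of matching bounded paths."""
    nodes_a, edges_a = graph_a
    nodes_b, edges_b = graph_b
    total = 0
    for a in nodes_a:
        for b in nodes_b:
            # Selector: only nodes of the same type (control/data) are compared.
            if nodes_a[a][1] != nodes_b[b][1]:
                continue
            paths_a = bounded_paths(nodes_a, edges_a, a, max_len)
            paths_b = bounded_paths(nodes_b, edges_b, b, max_len)
            total += sum(paths_a[p] * paths_b[p] for p in paths_a.keys() & paths_b.keys())
    return total
```

Larger values of `max_len` capture more context around each node at the price of more paths to enumerate, which reflects the performance concern raised for graph based techniques.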
SCENARIO-BASED EVALUATION
FUTURE RESEARCH DIRECTIONS
PROBLEM STATEMENT: (MODEL) CLONE DETECTION
Models: models are typically represented visually, as box-and-arrow diagrams, and the clones we are searching for are similar subgraphs of these diagrams.
Model Granularity: models can be represented at different levels of granularity (as with source code), corresponding to different syntactic (and semantic) units.
Model clones are categorized into three different types.
REFERENCE EXAMPLE
TYPE 1 CLONES: (MODEL) CLONE DETECTION
• Type 1 (exact) model clones: Identical model fragments except for variations in visual presentation, layout and formatting.
TYPE 2 CLONES: (MODEL) CLONE DETECTION
• Type 2 (renamed) model clones: Structurally identical model fragments except for variations in labels, values, types, visual presentation, layout and formatting.
[Figure: model@Friction Mode Logic/Break Apart Detection and model@Friction Mode Logic/Lockup Detection/Required Friction for Lockup]
TYPE 3 CLONES: (MODEL) CLONE DETECTION
• Type 3 (near-miss) model clones: Model fragments with further modifications, such as changes in position or connection with respect to other model fragments and small additions or removals of blocks or lines, in addition to variations in labels, values, types, visual presentation, layout and formatting.
[Figure: model@Speed.speed_estimation and model@Throttle.throttle_estimation]
MODELS AS SOURCE CODE
THANK YOU
Valerio Maggio
Ph.D., University of Naples “Federico II”
valerio.maggio@unina.it

Contenu connexe

Tendances

Something About Dynamic Linking
Something About Dynamic LinkingSomething About Dynamic Linking
Something About Dynamic LinkingWang Hsiangkai
 
Phyton Learning extracts
Phyton Learning extracts Phyton Learning extracts
Phyton Learning extracts Pavan Babu .G
 
Yacc (yet another compiler compiler)
Yacc (yet another compiler compiler)Yacc (yet another compiler compiler)
Yacc (yet another compiler compiler)omercomail
 
Advanced C Language for Engineering
Advanced C Language for EngineeringAdvanced C Language for Engineering
Advanced C Language for EngineeringVincenzo De Florio
 
OpenGurukul : Language : C Programming
OpenGurukul : Language : C ProgrammingOpenGurukul : Language : C Programming
OpenGurukul : Language : C ProgrammingOpen Gurukul
 
Let’s Learn Python An introduction to Python
Let’s Learn Python An introduction to Python Let’s Learn Python An introduction to Python
Let’s Learn Python An introduction to Python Jaganadh Gopinadhan
 
Report on c and c++
Report on c and c++Report on c and c++
Report on c and c++oggyrao
 
Managing input and output operation in c
Managing input and output operation in cManaging input and output operation in c
Managing input and output operation in cyazad dumasia
 
C LANGUAGE - BESTECH SOLUTIONS
C LANGUAGE - BESTECH SOLUTIONSC LANGUAGE - BESTECH SOLUTIONS
C LANGUAGE - BESTECH SOLUTIONSBESTECH SOLUTIONS
 
Hands-on Introduction to the C Programming Language
Hands-on Introduction to the C Programming LanguageHands-on Introduction to the C Programming Language
Hands-on Introduction to the C Programming LanguageVincenzo De Florio
 
Python Programming Basics for begginners
Python Programming Basics for begginnersPython Programming Basics for begginners
Python Programming Basics for begginnersAbishek Purushothaman
 
'C' language notes (a.p)
'C' language notes (a.p)'C' language notes (a.p)
'C' language notes (a.p)Ashishchinu
 
Programming languages
Programming languagesProgramming languages
Programming languagesEelco Visser
 
Introduction to C Language - Version 1.0 by Mark John Lado
Introduction to C Language - Version 1.0 by Mark John LadoIntroduction to C Language - Version 1.0 by Mark John Lado
Introduction to C Language - Version 1.0 by Mark John LadoMark John Lado, MIT
 

Tendances (20)

Something About Dynamic Linking
Something About Dynamic LinkingSomething About Dynamic Linking
Something About Dynamic Linking
 
7.0 files and c input
7.0 files and c input7.0 files and c input
7.0 files and c input
 
Phyton Learning extracts
Phyton Learning extracts Phyton Learning extracts
Phyton Learning extracts
 
Yacc (yet another compiler compiler)
Yacc (yet another compiler compiler)Yacc (yet another compiler compiler)
Yacc (yet another compiler compiler)
 
Flex
FlexFlex
Flex
 
Advanced C Language for Engineering
Advanced C Language for EngineeringAdvanced C Language for Engineering
Advanced C Language for Engineering
 
OpenGurukul : Language : C Programming
OpenGurukul : Language : C ProgrammingOpenGurukul : Language : C Programming
OpenGurukul : Language : C Programming
 
C Programming Project
C Programming ProjectC Programming Project
C Programming Project
 
Let’s Learn Python An introduction to Python
Let’s Learn Python An introduction to Python Let’s Learn Python An introduction to Python
Let’s Learn Python An introduction to Python
 
Report on c and c++
Report on c and c++Report on c and c++
Report on c and c++
 
Notes part 8
Notes part 8Notes part 8
Notes part 8
 
Managing input and output operation in c
Managing input and output operation in cManaging input and output operation in c
Managing input and output operation in c
 
C LANGUAGE - BESTECH SOLUTIONS
C LANGUAGE - BESTECH SOLUTIONSC LANGUAGE - BESTECH SOLUTIONS
C LANGUAGE - BESTECH SOLUTIONS
 
Hands-on Introduction to the C Programming Language
Hands-on Introduction to the C Programming LanguageHands-on Introduction to the C Programming Language
Hands-on Introduction to the C Programming Language
 
Python Programming Basics for begginners
Python Programming Basics for begginnersPython Programming Basics for begginners
Python Programming Basics for begginners
 
Embedded C - Lecture 2
Embedded C - Lecture 2Embedded C - Lecture 2
Embedded C - Lecture 2
 
'C' language notes (a.p)
'C' language notes (a.p)'C' language notes (a.p)
'C' language notes (a.p)
 
C++ How to program
C++ How to programC++ How to program
C++ How to program
 
Programming languages
Programming languagesProgramming languages
Programming languages
 
Introduction to C Language - Version 1.0 by Mark John Lado
Introduction to C Language - Version 1.0 by Mark John LadoIntroduction to C Language - Version 1.0 by Mark John Lado
Introduction to C Language - Version 1.0 by Mark John Lado
 

En vedette

네이티브 웹앱 기술 동향 및 전망
네이티브 웹앱 기술 동향 및 전망네이티브 웹앱 기술 동향 및 전망
네이티브 웹앱 기술 동향 및 전망Wonsuk Lee
 
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniquesImproving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniquesValerio Maggio
 
Principles in Refactoring
Principles in RefactoringPrinciples in Refactoring
Principles in RefactoringChamnap Chhorn
 
Refactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeRefactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeValerio Maggio
 
Unit testing with Junit
Unit testing with JunitUnit testing with Junit
Unit testing with JunitValerio Maggio
 
Refactoring Tips by Martin Fowler
Refactoring Tips by Martin FowlerRefactoring Tips by Martin Fowler
Refactoring Tips by Martin FowlerIgor Crvenov
 
Refactoring 101
Refactoring 101Refactoring 101
Refactoring 101Adam Culp
 
영화 예매 프로그램 (DB 설계, 프로그램 연동)
영화 예매 프로그램 (DB 설계, 프로그램 연동)영화 예매 프로그램 (DB 설계, 프로그램 연동)
영화 예매 프로그램 (DB 설계, 프로그램 연동)_ce
 
딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향홍배 김
 

En vedette (11)

네이티브 웹앱 기술 동향 및 전망
네이티브 웹앱 기술 동향 및 전망네이티브 웹앱 기술 동향 및 전망
네이티브 웹앱 기술 동향 및 전망
 
Improving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniquesImproving Software Maintenance using Unsupervised Machine Learning techniques
Improving Software Maintenance using Unsupervised Machine Learning techniques
 
Junit
JunitJunit
Junit
 
Principles in Refactoring
Principles in RefactoringPrinciples in Refactoring
Principles in Refactoring
 
Refactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeRefactoring: Improve the design of existing code
Refactoring: Improve the design of existing code
 
Unit testing with Junit
Unit testing with JunitUnit testing with Junit
Unit testing with Junit
 
Junit
JunitJunit
Junit
 
Refactoring Tips by Martin Fowler
Refactoring Tips by Martin FowlerRefactoring Tips by Martin Fowler
Refactoring Tips by Martin Fowler
 
Refactoring 101
Refactoring 101Refactoring 101
Refactoring 101
 
영화 예매 프로그램 (DB 설계, 프로그램 연동)
영화 예매 프로그램 (DB 설계, 프로그램 연동)영화 예매 프로그램 (DB 설계, 프로그램 연동)
영화 예매 프로그램 (DB 설계, 프로그램 연동)
 
딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향
 

Similaire à Unsupervised Machine Learning for Clone Detection

RubyConf Portugal 2014 - Why ruby must go!
RubyConf Portugal 2014 - Why ruby must go!RubyConf Portugal 2014 - Why ruby must go!
RubyConf Portugal 2014 - Why ruby must go!Gautam Rege
 
Python language data types
Python language data typesPython language data types
Python language data typesHarry Potter
 
Python language data types
Python language data typesPython language data types
Python language data typesHoang Nguyen
 
Python language data types
Python language data typesPython language data types
Python language data typesLuis Goldster
 
Python language data types
Python language data typesPython language data types
Python language data typesTony Nguyen
 
Python language data types
Python language data typesPython language data types
Python language data typesFraboni Ec
 
Python language data types
Python language data typesPython language data types
Python language data typesJames Wong
 
Python language data types
Python language data typesPython language data types
Python language data typesYoung Alista
 
Licão 07 operating the shell
Licão 07 operating the shellLicão 07 operating the shell
Licão 07 operating the shellAcácio Oliveira
 
Clojure: Simple By Design
Clojure: Simple By DesignClojure: Simple By Design
Clojure: Simple By DesignAll Things Open
 
Shell Scripts
Shell ScriptsShell Scripts
Shell ScriptsDr.Ravi
 
Ruby -the wheel Technology
Ruby -the wheel TechnologyRuby -the wheel Technology
Ruby -the wheel Technologyppparthpatel123
 
Documenting with xcode
Documenting with xcodeDocumenting with xcode
Documenting with xcodeGoran Blazic
 
Introduction of bison
Introduction of bisonIntroduction of bison
Introduction of bisonvip_du
 
Ruby_Coding_Convention
Ruby_Coding_ConventionRuby_Coding_Convention
Ruby_Coding_ConventionJesse Cai
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparationKushaal Singla
 
OSCON2014 : Quick Introduction to System Tools Programming with Go
OSCON2014 : Quick Introduction to System Tools Programming with GoOSCON2014 : Quick Introduction to System Tools Programming with Go
OSCON2014 : Quick Introduction to System Tools Programming with GoChris McEniry
 
Hierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLPHierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLPRupak Roy
 

Similaire à Unsupervised Machine Learning for Clone Detection (20)

RubyConf Portugal 2014 - Why ruby must go!
RubyConf Portugal 2014 - Why ruby must go!RubyConf Portugal 2014 - Why ruby must go!
RubyConf Portugal 2014 - Why ruby must go!
 
Theperlreview
TheperlreviewTheperlreview
Theperlreview
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Syntax analysis
Syntax analysisSyntax analysis
Syntax analysis
 
Licão 07 operating the shell
Licão 07 operating the shellLicão 07 operating the shell
Licão 07 operating the shell
 
Clojure: Simple By Design
Clojure: Simple By DesignClojure: Simple By Design
Clojure: Simple By Design
 
Shell Scripts
Shell ScriptsShell Scripts
Shell Scripts
 
Ruby -the wheel Technology
Ruby -the wheel TechnologyRuby -the wheel Technology
Ruby -the wheel Technology
 
Documenting with xcode
Documenting with xcodeDocumenting with xcode
Documenting with xcode
 
Introduction of bison
Introduction of bisonIntroduction of bison
Introduction of bison
 
Ruby_Coding_Convention
Ruby_Coding_ConventionRuby_Coding_Convention
Ruby_Coding_Convention
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
OSCON2014 : Quick Introduction to System Tools Programming with Go
OSCON2014 : Quick Introduction to System Tools Programming with GoOSCON2014 : Quick Introduction to System Tools Programming with Go
OSCON2014 : Quick Introduction to System Tools Programming with Go
 
Hierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLPHierarchical Clustering - Text Mining/NLP
Unsupervised Machine Learning for Clone Detection

  • 1. UNSUPERVISED MACHINE LEARNING FOR CLONE DETECTION Valerio Maggio, Ph.D. June 25, 2013 valerio.maggio@unina.it
  • 2. General Disclaimer: All the Maths appearing in the next slides is only intended to better introduce the considered case studies. Speakers are not responsible for any possible disease or “brain consumption” caused by too many formulas. So BEWARE; use this information at your own risk! Its intention is solely educational. We would strongly encourage you to use this information in cooperation with a medical or health professional.
  • 3. Number one in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them.
  • 7. PROBLEM STATEMENT: CLONE DETECTION. Software clones are fragments of code that are similar according to some predefined measure of similarity (I.D. Baxter, 1998)
  • 9. PROBLEM STATEMENT: CLONE DETECTION. Clones: Textual Similarity
  • 10. PROBLEM STATEMENT: CLONE DETECTION. Clones: Functional Similarity
  • 11. PROBLEM STATEMENT: CLONE DETECTION. Clones affect the reliability of the system! Sneaky Bug!
  • 13. THE ORIGINAL ONE
        # Original Fragment
        def do_something_cool_in_Python(filepath, marker='---end---'):
            lines = list()
            with open(filepath) as report:
                for l in report:
                    if l.endswith(marker):
                        lines.append(l)  # Stores only lines that end with "marker"
            return lines  # Return the list of different lines
  • 14. TYPE 1: Exact Copy • Identical code segments except for differences in layout, whitespace, and comments
  • 15. TYPE 1: Exact Copy (compare with the original fragment on slide 13)
        def do_something_cool_in_Python (filepath, marker='---end---'):
            lines = list()  # This list is initially empty
            with open(filepath) as report:
                for l in report:  # It goes through the lines of the file
                    if l.endswith(marker):
                        lines.append(l)
            return lines
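The Type 1 definition can be checked mechanically for Python fragments. The following is a minimal sketch, not the detector presented in these slides: it assumes that comparing `ast.dump` output is an acceptable stand-in for "identical except layout, whitespace, and comments". The function names `is_type1_clone` and `keep_marked` are hypothetical.

```python
import ast

def is_type1_clone(source_a, source_b):
    """True when two fragments differ only in layout, whitespace, and comments."""
    # ast.parse discards comments and formatting, and ast.dump leaves out
    # line/column attributes by default, so equal dumps mean equal syntax trees.
    return ast.dump(ast.parse(source_a)) == ast.dump(ast.parse(source_b))

original = '''
def keep_marked(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)  # keep matching lines
    return lines
'''

# Same code with different spacing and comments (hypothetical example).
reformatted = '''
def keep_marked (filepath, marker='---end---'):
    lines = list()  # initially empty
    with open(filepath) as report:
        for l in report:  # walk the file
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

# Renaming an identifier goes beyond what a Type 1 clone allows.
renamed = original.replace("lines", "targets")

print(is_type1_clone(original, reformatted))  # True
print(is_type1_clone(original, renamed))      # False
```

Because the comparison is exact, any renamed identifier or changed literal makes it fail, which is the gap the Type 2 category covers.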
  • 16. TYPE 2: Parameter Substituted • Structurally identical segments except for differences in identifiers, literals, layout, whitespace, and comments
  • 17. TYPE 2: Parameter Substituted (compare with the original fragment on slide 13)
        # Type 2 Clone
        def do_something_cool_in_Python(path, end='---end---'):
            targets = list()
            with open(path) as data_file:
                for t in data_file:
                    if t.endswith(end):
                        targets.append(t)  # Stores only lines that end with "marker"
            # Return the list of different lines
            return targets
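One way to test for Type 2 clones is to replace every identifier and literal with a placeholder before comparing syntax trees. This is a sketch under assumptions, not the approach of the talk: the class `PlaceholderNormalizer`, the placeholder strings `"ID"` and `"LIT"`, and the choice to leave builtin names such as `list` and `open` untouched are all illustrative.

```python
import ast
import builtins

class PlaceholderNormalizer(ast.NodeTransformer):
    """Rewrite identifiers to "ID" and literals to "LIT" (hypothetical scheme)."""

    def visit_FunctionDef(self, node):
        node.name = "ID"
        self.generic_visit(node)  # also normalize arguments and body
        return node

    def visit_arg(self, node):
        node.arg = "ID"
        return node

    def visit_Name(self, node):
        # Assumption: builtins (list, open, ...) carry meaning, so keep them.
        if not hasattr(builtins, node.id):
            node.id = "ID"
        return node

    def visit_Constant(self, node):
        node.value = "LIT"
        return node

def normalized_dump(source):
    tree = PlaceholderNormalizer().visit(ast.parse(source))
    return ast.dump(tree)

def is_type2_clone(source_a, source_b):
    """True when the fragments match after identifiers and literals are masked."""
    return normalized_dump(source_a) == normalized_dump(source_b)

original = '''
def collect(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

renamed = '''
def gather(path, end='--stop--'):
    targets = list()
    with open(path) as data_file:
        for t in data_file:
            if t.endswith(end):
                targets.append(t)
    return targets
'''

print(is_type2_clone(original, renamed))  # True
```

Method names such as `endswith` and `append` are attribute names rather than `Name` nodes, so this sketch leaves them in place; whether they should also be masked is a design choice.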
  • 18. TYPE 3: Structure Substituted • Similar segments with further modifications such as changed, added (or deleted) statements, in addition to variations in identifiers, literals, layout and comments
  • 19-22. TYPE 3: Structure Substituted
        import os
        def do_something_with(path, marker='---end---'):
            # Check if the input path corresponds to a file
            if not os.path.isfile(path):
                return None
            bad_ones = list()
            good_ones = list()
            with open(path) as report:
                for line in report:
                    line = line.strip()
                    if line.endswith(marker):
                        good_ones.append(line)
                    else:
                        bad_ones.append(line)
            # Return the lists of different lines
            return good_ones, bad_ones
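Type 3 clones no longer match exactly, so a detector needs a similarity score and a threshold. The sketch below is only an illustration of that idea and is not the tree kernel discussed later in the talk: it assumes that the pre-order sequence of AST node type names is a usable structural fingerprint and compares two such sequences with `difflib.SequenceMatcher`. The helper names and sample fragments are hypothetical.

```python
import ast
from difflib import SequenceMatcher

def node_types(node):
    """Yield AST node type names in pre-order (identifiers are ignored)."""
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from node_types(child)

def structural_similarity(source_a, source_b):
    """Similarity in [0, 1] between the node-type sequences of two fragments."""
    seq_a = list(node_types(ast.parse(source_a)))
    seq_b = list(node_types(ast.parse(source_b)))
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()

original = '''
def collect(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)
    return lines
'''

# Renamed identifiers plus one added statement: a near-miss (Type 3) variant.
near_miss = '''
def collect(path, marker='---end---'):
    good = list()
    with open(path) as report:
        for line in report:
            line = line.strip()
            if line.endswith(marker):
                good.append(line)
    return good
'''

unrelated = "x = 1\n"

print(structural_similarity(original, near_miss))
print(structural_similarity(original, unrelated))
```

The threshold that separates "clone" from "not a clone" has to be chosen empirically; no value is suggested here.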
  • 23. TYPE 4: “Functional” Copies • Semantically equivalent segments that perform the same computation but are implemented by different syntactic variants
  • 24. TYPE 4: “Functional” Copies (compare with the original fragment on slide 13)
        def do_always_the_same_stuff(filepath, marker='---end---'):
            report = open(filepath)
            file_lines = report.readlines()
            report.close()
            # Filters only the lines ending with marker
            # (list() keeps the return type a list on Python 3 as well)
            return list(filter(lambda l: len(l) and l.endswith(marker), file_lines))
  • 31. SOURCE CODE INFORMATION [Figure: syntax tree of the function parser_compare, with PARAMS (node *left, node *right), IF-STMT, COND, BODY, CALL-STMT (parser_compare_node) and RETURN-STMT nodes]
  • 32. SOURCE CODE INFORMATION [Figure: graph with ENTRY, EXIT, FORMAL-IN, FORMAL-OUT, ACTUAL-IN, ACTUAL-OUT, BODY, CONTROL-POINT, CALL-SITE, EXPR and RETURN nodes]
  • 40. STATE OF THE ART TECHNIQUES
      • String/Token based Techniques: Pros: run very fast. Cons: too many false clones.
      • Syntax based (AST) Techniques: Pros: well suited to detect structural similarities. Cons: not properly suited to detect Type 3 clones.
      • Graph based Techniques: Pros: the only ones able to deal with Type 4 clones. Cons: performance issues.
  • 46. USE MACHINE LEARNING, LUKE
      • Provides computationally effective solutions to analyze large data sets
      • Provides solutions that can be tailored to different tasks/domains
      • Requires considerable effort in:
          • the definition of the relevant information best suited for the specific task/domain
          • the application of the learning algorithms to the considered data
  • 48. UNSUPERVISED LEARNING
      • Supervised Learning: learn from labelled samples (learn by examples)
      • Unsupervised Learning: learn (directly) from the data
          (+) No cost of labeling samples
          (-) Trade-off imposed on the quality of the data
  • 50. KERNELS FOR STRUCTURES: CODE STRUCTURES
      • Abstract Syntax Tree (AST): tree structure representing the syntactic structure of the different instructions of a program (function)
      • Program Dependencies Graph (PDG): (directed) graph structure representing the relationships among the different statements of a program
      • Kernel: computation of the dot product between (graph) structures, K(·, ·)
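Written as a formula, the "dot product between structures" is the usual kernel identity. The notation below is generic and assumed here for illustration: the feature map \phi, the node sets N(G) and the node-pair function \Delta are not symbols taken from the slides. For convolution-style kernels over trees or graphs, \phi counts substructures and the inner product can be computed as a sum over pairs of nodes.

```latex
\[
K(G_1, G_2) = \langle \phi(G_1), \phi(G_2) \rangle
\]
\[
K(G_1, G_2) = \sum_{n_1 \in N(G_1)} \; \sum_{n_2 \in N(G_2)} \Delta(n_1, n_2)
\]
```

Here \Delta(n_1, n_2) stands for the number of common substructures rooted at the two nodes, so \phi never has to be built explicitly.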
  • 52-53. KERNEL FOR CLONES [Figure: panels CODE, AST and AST KERNEL; two code fragments and their abstract syntax trees, built from while, block, if, comparison (<, >), assignment (=) and arithmetic (+, -) nodes over the variables x, y, a, b and c]
  • 59. KERNELS FOR CODE STRUCTURES: AST KERNEL FEATURES
      • Instruction Class (IC): e.g., LOOP, CALL, CONDITIONAL_STATEMENT
      • Instruction (I): e.g., FOR, IF, WHILE, RETURN
      • Context (C): i.e., Instruction Class of the closest statement node
      • Lexemes (Ls): lexical information gathered (recursively) from leaves
      [Figure: tree with a while node whose children are a < node (over x and y) and a block node]
      • while node: IC = Loop, I = while-loop, C = Function-Body, Ls = [x, y]
      • < node: IC = Conditional-Expr, I = Less-operator, C = Loop, Ls = [x, y]
      • block node: IC = Block, I = while-body, C = Loop, Ls = [x]
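The four node features (IC, I, C, Ls) can be sketched for Python code with the standard `ast` module. This is an illustration under assumptions, not the tool from the talk: the `INSTRUCTION_CLASS` table, its class names and the `"MODULE"` root context are hypothetical, and Python AST node type names stand in for the Instruction feature.

```python
import ast

# Hypothetical mapping from Python AST node types to instruction classes.
INSTRUCTION_CLASS = {
    ast.While: "LOOP",
    ast.For: "LOOP",
    ast.If: "CONDITIONAL_STATEMENT",
    ast.Compare: "CONDITIONAL_EXPR",
    ast.Call: "CALL",
    ast.Assign: "ASSIGNMENT",
    ast.Return: "RETURN",
    ast.FunctionDef: "FUNCTION_BODY",
}

def lexemes(node):
    """Ls: identifiers and literals gathered recursively from the leaves."""
    if isinstance(node, ast.Name):
        return [node.id]
    if isinstance(node, ast.Constant):
        return [repr(node.value)]
    found = []
    for child in ast.iter_child_nodes(node):
        found.extend(lexemes(child))
    return found

def node_features(node, context="MODULE"):
    """Yield (IC, I, C, Ls) for every node that has an instruction class."""
    ic = INSTRUCTION_CLASS.get(type(node))
    if ic is not None:
        yield ic, type(node).__name__, context, lexemes(node)
        # A classified statement becomes the context of the nodes nested in it.
        if isinstance(node, ast.stmt):
            context = ic
    for child in ast.iter_child_nodes(node):
        yield from node_features(child, context)

source = "while x < y:\n    x = x + 1\n"
for feature in node_features(ast.parse(source)):
    print(feature)
```

For the `x < y` comparison this yields the class `CONDITIONAL_EXPR`, the context `LOOP` and the lexemes `['x', 'y']`, which mirrors the "<" node in the slide's example.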
  • 60. CLONE DETECTION RESULTS • Comparison with another (pure) AST-based clone detector • Comparison on a system with randomly seeded clones [Figure: bar chart of Precision, Recall and F-measure (axis from 0 to 1) for CloneDigger and the Tree Kernel Tool] Results refer to clones where code fragments have been modified by adding/removing or changing code statements
  • 61. Precision, Recall and F-Measure [Figure: Precision, Recall and F1 curves (vertical axis from 0 to 1) plotted against values from 0.6 to 0.98] • Precision: How accurate are the obtained results? (Altern.) How many errors do they contain? • Recall: How complete are the obtained results? (Altern.) How many clones have been retrieved w.r.t. Total Clones?
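The three measures follow the standard definitions: precision is the share of reported clones that are real, recall is the share of real clones that were reported, and F1 is their harmonic mean. In this small sketch the clone pairs are made-up identifiers rather than results from the talk, and the function name is hypothetical.

```python
def precision_recall_f1(reported, actual):
    """Compute precision, recall and F1 for a set of reported clone pairs."""
    reported, actual = set(reported), set(actual)
    true_positives = len(reported & actual)
    precision = true_positives / len(reported) if reported else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical data: four reported pairs, three real clone pairs, two in common.
reported = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
actual = {("a", "b"), ("c", "d"), ("x", "y")}

precision, recall, f1 = precision_recall_f1(reported, actual)
print(precision)  # 0.5
print(recall)     # 2 of 3 real clones found
print(f1)         # harmonic mean of the two, 4/7
```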
  • 64. CODE STRUCTURES: PDG NODES AND EDGES
      • Two types of nodes:
          • Control nodes (dashed ones), e.g., if, for, while, function calls...
          • Data nodes, e.g., expressions, parameters...
      • Two types of edges (i.e., dependencies):
          • Control edges (dashed ones)
          • Data edges
      [Figure: PDG with while, call-site, arg and expr nodes]
  • 66. KERNELS FOR CODE STRUCTURES: PDG (GRAPH KERNELS FOR PDG)
      • Features of nodes:
          • Node Label, e.g., WHILE, CALL-SITE, EXPR, ...
          • Node Type, i.e., Data Node or Control Node
      • Features of edges:
          • Edge Type, i.e., Data Edge or Control Edge
      • Example: Node Label = WHILE, Node Type = Control Node
      [Figure: PDG with while, call-site, arg and expr nodes connected by control edges and data edges]
  • 67-70. GRAPH KERNELS FOR PDG
      • Goal: Identify common subgraphs
      • Selectors: Compare nodes to each other and explore the subgraphs of only “compatible” nodes (i.e., nodes of the same type)
      • Context: The subgraph of a node (with paths whose lengths are at most L to avoid loops)
      [Figure: two PDGs compared, one with while, call-site, arg, expr and expr nodes, the other with while, call-site, arg, expr and call-site nodes]
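The goal, selector and context ideas above can be sketched as a small walk-counting kernel. This is an assumption-laden illustration, not the kernel used in the talk: the graph encoding (a pair of dictionaries), the helper names and the two toy graphs are hypothetical, and counting labelled paths of at most `max_len` edges is one simple reading of "context".

```python
from collections import Counter

def bounded_paths(graph, start, max_len):
    """Count labelled paths of at most max_len edges that start at `start`."""
    nodes, edges = graph
    paths = Counter()
    stack = [(start, (nodes[start],), 0)]
    while stack:
        node, path, depth = stack.pop()
        paths[path] += 1
        if depth < max_len:  # the length bound keeps cyclic graphs finite
            for target, edge_type in edges.get(node, ()):
                step = (edge_type, nodes[target])
                stack.append((target, path + step, depth + 1))
    return paths

def pdg_kernel(g1, g2, max_len=2):
    """Sum, over compatible node pairs, the dot product of their path counts."""
    nodes1, nodes2 = g1[0], g2[0]
    paths1 = {n: bounded_paths(g1, n, max_len) for n in nodes1}
    paths2 = {n: bounded_paths(g2, n, max_len) for n in nodes2}
    score = 0
    for n1, (_, type1) in nodes1.items():
        for n2, (_, type2) in nodes2.items():
            if type1 != type2:  # selector: only nodes of the same type
                continue
            common = paths1[n1].keys() & paths2[n2].keys()
            score += sum(paths1[n1][p] * paths2[n2][p] for p in common)
    return score

# Toy graphs: node id -> (label, node type); node id -> [(target, edge type)].
g1 = (
    {"w": ("WHILE", "control"), "c": ("CALL-SITE", "control"),
     "a": ("ARG", "data"), "e": ("EXPR", "data")},
    {"w": [("c", "control"), ("e", "data")], "c": [("a", "data")]},
)
g2 = (
    {"w": ("WHILE", "control"), "c1": ("CALL-SITE", "control"),
     "c2": ("CALL-SITE", "control"), "a": ("ARG", "data")},
    {"w": [("c1", "control"), ("c2", "control")], "c1": [("a", "data")]},
)

print(pdg_kernel(g1, g2))
```

Because the score is a dot product of path counts, it is symmetric in its two arguments; raw scores grow with graph size, so a normalized variant would usually be compared against a threshold.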
  • 74. PROBLEM STATEMENT: (MODEL) CLONE DETECTION
      • Models: models are typically represented visually, as box-and-arrow diagrams, and the clones we are searching for are similar subgraphs of these diagrams.
      • Model Granularity: models can be represented at different levels of granularity (as with source code), corresponding to different syntactic (and semantic) units.
      • Model clones are categorized into (three) different types.
  • 76. TYPE 1 CLONES: (MODEL) CLONE DETECTION • Type 1 (exact) model clones: Identical model fragments except for variations in visual presentation, layout and formatting.
  • 77. TYPE 2 CLONES: (MODEL) CLONE DETECTION • Type 2 (renamed) model clones: Structurally identical model fragments except for variations in labels, values, types, visual presentation, layout and formatting. [Figure: model@Friction Mode Logic/Break Apart Detection and model@Friction Mode Logic/Lockup Detection/Required Friction for Lockup]
  • 78. TYPE 3 CLONES: (MODEL) CLONE DETECTION • Type 3 (near-miss) model clones: Model fragments with further modifications, such as changes in position or connection with respect to other model fragments and small additions or removals of blocks or lines, in addition to variations in labels, values, types, visual presentation, layout and formatting. [Figure: model@Speed.speed_estimation and model@Throttle.throttle_estimation]
  • 80. THANK YOU Valerio Maggio Ph.D., University of Naples “Federico II” valerio.maggio@unina.it