This document discusses unsupervised machine learning techniques for clone detection in source code. It begins by defining different types of code clones and describing current state-of-the-art clone detection tools. It then argues that machine learning approaches, such as using kernel methods to compare abstract syntax trees, can provide more computationally efficient and accurate clone detection compared to traditional text-, token-, and syntax-based techniques. The document provides examples of using kernel functions to compute similarities between code structure representations like ASTs to enable unsupervised machine learning for clone detection.
2. General Disclaimer:
All the Maths appearing in the next slides is only intended to better introduce the considered case studies. Speakers are not responsible for any possible disease or “brain consumption” caused by too many formulas.
So BEWARE; use this information at your own risk!
Its intention is solely educational. We would strongly encourage you to use this information in cooperation with a medical or health professional.
Awful Maths
3. Number one in the stink parade is duplicated code.
If you see the same code structure in more than one
place, you can be sure that your program will be better
if you find a way to unify them.
7. PROBLEM STATEMENT
CLONE DETECTION
Software clones are fragments of code that are similar according
to some predefined measure of similarity
I.D. Baxter, 1998
13. THE ORIGINAL ONE
# Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)  # Stores only lines that end with "marker"
    return lines  # Return the list of different lines
14. TYPE 1: Exact Copy
• Identical code segments except for differences in
layout, whitespace, and comments
15. # Type 1 Clone
def do_something_cool_in_Python (filepath, marker='---end---'):
    lines = list() # This list is initially empty
    with open(filepath) as report:
        for l in report: # It goes through the lines of the file
            if l.endswith(marker):
                lines.append(l)
    return lines
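A Type 1 check can be sketched by comparing token streams once comments and blank lines are dropped. This is only an illustration built on Python's standard `tokenize` module, not the detection technique discussed in these slides; the names `normalized_tokens` and `is_type1_clone` are hypothetical.

```python
import io
import tokenize

def normalized_tokens(source):
    """Return the tokens of `source` without comments and blank-line layout.

    INDENT, DEDENT and NEWLINE are kept as bare markers (without their
    whitespace) because block structure is part of a Python program.
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines never matter for Type 1
        if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
            tokens.append(tokenize.tok_name[tok.type])
        else:
            tokens.append(tok.string)
    return tokens

def is_type1_clone(source_a, source_b):
    """True when both fragments differ only in layout, whitespace, comments."""
    return normalized_tokens(source_a) == normalized_tokens(source_b)
```

Renaming a single identifier is enough to make this check fail, which is exactly the gap that Type 2 clones fall into.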
16. TYPE 2: Parameter Substituted
• Structurally identical segments except for differences in identifiers, literals,
layout, whitespace, and comments
17. # Type 2 Clone
def do_something_cool_in_Python(path, end='---end---'):
    targets = list()
    with open(path) as data_file:
        for t in data_file:
            if t.endswith(end):
                targets.append(t)  # Stores only lines that end with "end"
    # Return the list of different lines
    return targets
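One way to picture "structurally identical except for identifiers and literals" is to parse both fragments, overwrite every identifier and literal with a fixed placeholder, and compare what is left. The sketch below uses Python's standard `ast` module; the class `_Normalizer`, the helpers `structure` and `is_type2_clone`, and the placeholder strings are assumptions made for illustration, and method names such as `endswith` are deliberately left untouched.

```python
import ast

class _Normalizer(ast.NodeTransformer):
    """Overwrite identifiers and literals with fixed placeholders."""

    def visit_Name(self, node):
        node.id = "ID"            # variables, builtins, function references
        return node

    def visit_arg(self, node):
        node.arg = "ID"           # parameter names
        self.generic_visit(node)
        return node

    def visit_Constant(self, node):
        node.value = "LIT"        # numbers, strings, None, ...
        return node

    def visit_FunctionDef(self, node):
        node.name = "ID"          # the function's own name
        self.generic_visit(node)
        return node

def structure(source):
    """Dump the AST of `source` after normalization (comments are gone too)."""
    return ast.dump(_Normalizer().visit(ast.parse(source)))

def is_type2_clone(source_a, source_b):
    return structure(source_a) == structure(source_b)
```

Because the parser already discards layout and comments, every Type 1 clone also passes this check.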
18. TYPE 3: Structure Substituted
• Similar segments with further modifications such as changed, added (or deleted)
statements, in addition to variations in identifiers, literals, layout and comments
19. # Type 3 Clone
import os
def do_something_with(path, marker='---end---'):
    # Check if the input path corresponds to a file
    if not os.path.isfile(path):
        return None
    bad_ones = list()
    good_ones = list()
    with open(path) as report:
        for line in report:
            line = line.strip()
            if line.endswith(marker):
                good_ones.append(line)
            else:
                bad_ones.append(line)
    # Return the lists of different lines
    return good_ones, bad_ones
23. TYPE 4: “Functional” Copies
• Semantically equivalent segments that perform the same
computation but are implemented by different syntactic variants
24. # Original Fragment
def do_something_cool_in_Python(filepath, marker='---end---'):
    lines = list()
    with open(filepath) as report:
        for l in report:
            if l.endswith(marker):
                lines.append(l)  # Stores only lines that end with "marker"
    return lines  # Return the list of different lines

# Type 4 Clone
def do_always_the_same_stuff(filepath, marker='---end---'):
    report = open(filepath)
    file_lines = report.readlines()
    report.close()
    # Filters only the lines ending with marker
    # (list() keeps the result a list under Python 3, as in the original fragment)
    return list(filter(lambda l: len(l) and l.endswith(marker), file_lines))
40. STATE OF THE ART TECHNIQUES
• String/Token based Techniques:
  • Pros: Run very fast
  • Cons: Too many false clones
• Syntax based (AST) Techniques:
  • Pros: Well suited to detect structural similarities
  • Cons: Not properly suited to detect Type 3 Clones
• Graph based Techniques:
  • Pros: The only ones able to deal with Type 4 Clones
  • Cons: Performance issues
46. USE MACHINE LEARNING, LUKE
• Provides computationally effective solutions to analyze large data sets
• Provides solutions that can be tailored to different tasks/domains
• Requires considerable effort in:
  • the definition of the relevant information best suited for the specific task/domain
  • the application of the learning algorithms to the considered data
48. UNSUPERVISED LEARNING
• Supervised Learning:
• Learn from labelled samples
• Unsupervised Learning:
• Learn (directly) from the data
Learn by examples
(+) No cost of labeling samples
(-) Trade-off imposed on the quality of the data
50. CODE STRUCTURES
KERNELS FOR STRUCTURES
Abstract Syntax Tree (AST)
Tree structure representing the syntactic structure of the different instructions of a program (function)
Program Dependencies Graph (PDG)
(Directed) Graph structure representing the relationships among the different statements of a program
Computation of the dot product between (Graph) Structures
K(·, ·)
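To make "dot product between structures" concrete, here is a deliberately simple stand-in kernel: each program is mapped to a vector counting its AST node types, and K is the dot product of two such vectors. This is not the tree kernel presented in these slides, only a minimal sketch of the idea; `node_type_counts`, `kernel` and `normalized_kernel` are hypothetical names.

```python
import ast
from collections import Counter

def node_type_counts(source):
    """Feature map phi: how often each AST node type occurs in `source`."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def kernel(source_a, source_b):
    """K(a, b) = <phi(a), phi(b)>, the dot product of the two count vectors."""
    phi_a, phi_b = node_type_counts(source_a), node_type_counts(source_b)
    return sum(count * phi_b[label] for label, count in phi_a.items())

def normalized_kernel(source_a, source_b):
    """Scale K into [0, 1] so that a fragment compared with itself scores 1."""
    self_a = kernel(source_a, source_a)
    self_b = kernel(source_b, source_b)
    return kernel(source_a, source_b) / (self_a * self_b) ** 0.5
```

A kernel of this kind ignores how nodes are arranged, which is why structure-aware tree and graph kernels are needed for real clone detection.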
52. KERNEL FOR CLONES
[Figure: CODE and AST panels. Two code fragments and their abstract syntax trees, built from while, block and if nodes over comparison (<, >), assignment (=) and arithmetic (+, -) expressions on x, y, a, b and c.]
53. KERNEL FOR CLONES
[Figure: CODE, AST and AST KERNEL panels. The same two trees shown together with the subtree fragments extracted from them.]
59. KERNELS FOR CODE STRUCTURES: AST
KERNEL FEATURES
• Instruction Class (IC): e.g., LOOP, CALL, CONDITIONAL_STATEMENT
• Instruction (I): e.g., FOR, IF, WHILE, RETURN
• Context (C): i.e., Instruction Class of the closest statement node
• Lexemes (Ls): Lexical information gathered (recursively) from leaves
Example (AST of a while loop with condition x < y and a block body):
• while node: IC = Loop, I = while-loop, C = Function-Body, Ls = [x, y]
• < node: IC = Conditional-Expr, I = Less-operator, C = Loop, Ls = [x, y]
• block node: IC = Block, I = while-body, C = Loop, Ls = [x]
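The four features can be approximated on Python's own AST as follows. The mapping table `INSTRUCTION_CLASS`, the `"OTHER"` and `"MODULE"` defaults, and the function names are assumptions made for this sketch; the tool behind these slides may define its classes differently.

```python
import ast

# Hypothetical mapping from Python AST node types to instruction classes.
INSTRUCTION_CLASS = {
    ast.While: "LOOP",
    ast.For: "LOOP",
    ast.If: "CONDITIONAL_STATEMENT",
    ast.Call: "CALL",
    ast.Compare: "CONDITIONAL_EXPR",
    ast.FunctionDef: "FUNCTION_BODY",
}

def lexemes(node):
    """Ls: identifiers and literals gathered from the leaves below `node`."""
    found = []
    for child in ast.walk(node):
        if isinstance(child, ast.Name):
            found.append(child.id)
        elif isinstance(child, ast.Constant):
            found.append(repr(child.value))
    return found

def node_features(node, context="MODULE"):
    """Yield (IC, I, C, Ls) for `node` and for every node below it."""
    ic = INSTRUCTION_CLASS.get(type(node), "OTHER")
    yield ic, type(node).__name__.upper(), context, lexemes(node)
    # A statement node becomes the context of everything nested inside it.
    child_context = ic if isinstance(node, ast.stmt) else context
    for child in ast.iter_child_nodes(node):
        yield from node_features(child, child_context)
```

For a `while x < y:` loop inside a function, the comparison node comes out with IC `CONDITIONAL_EXPR`, context `LOOP` and lexemes `['x', 'y']`, mirroring the example on the slide.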
60. CLONE DETECTION RESULTS
• Comparison with another (pure) AST-based clone detector
• Comparison on a system with randomly seeded clones
[Figure: bar chart of Precision, Recall and F-measure (scale 0 to 1) for CloneDigger and the Tree Kernel Tool]
Results refer to clones where code fragments have been modified by adding, removing or changing code statements
61. Precision, Recall and F-Measure
[Figure: Precision, Recall and F1 (y-axis, 0 to 1) plotted over x-axis values from 0.6 to 0.98 in steps of 0.02]
Precision: How accurate are the obtained results?
(Altern.) How many errors do they contain?
Recall: How complete are the obtained results?
(Altern.) How many clones have been retrieved w.r.t. Total Clones?
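The three measures can be computed from the set of clone pairs a tool reports and the set of clones that really exist. The sketch below uses the standard definitions; the function name and the pair representation are illustrative assumptions.

```python
def precision_recall_f1(reported, actual):
    """Score a set of reported clone pairs against the set of true clones."""
    reported, actual = set(reported), set(actual)
    true_positives = len(reported & actual)
    # Precision: share of reported clones that are real clones.
    precision = true_positives / len(reported) if reported else 0.0
    # Recall: share of real clones that were reported.
    recall = true_positives / len(actual) if actual else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, a tool that reports four pairs of which two are among three real clones has precision 2/4, recall 2/3 and F1 4/7.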
64. CODE STRUCTURES: PDG
NODES AND EDGES
• Two Types of Nodes
  • Control Nodes (Dashed ones)
    • e.g., if - for - while - function calls...
  • Data Nodes
    • e.g., expressions - parameters...
• Two Types of Edges (i.e., dependencies)
  • Control edges (Dashed ones)
  • Data edges
[Figure: example PDG with nodes while, call-site, arg, expr]
66. KERNELS FOR CODE STRUCTURES: PDG
GRAPH KERNELS FOR PDG
• Features of nodes:
  • Node Label
    • e.g., WHILE, CALL-SITE, EXPR, ...
  • Node Type
    • i.e., Data Node or Control Node
• Features of edges:
  • Edge Type
    • i.e., Data Edge or Control Edge
Example: the while node has Node Label = WHILE and Node Type = Control Node
[Figure: example PDG with nodes while, call-site, arg, expr, expr, connected by control edges and data edges]
70. GRAPH KERNELS FOR PDG
[Figure: two PDGs compared side by side, one with nodes while, call-site, arg, expr, expr and one with nodes while, call-site, arg, expr, call-site]
• Goal: Identify common subgraphs
• Selectors: Compare nodes to each other and explore the subgraphs of only “compatible” nodes (i.e., nodes of the same type)
• Context: The subgraph of a node (with paths whose lengths are at most L, to avoid loops)
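The selector and context ideas can be sketched with a bounded walk-counting kernel, used here as a simple stand-in rather than the kernel actually used in this work. Each PDG is assumed to be a dictionary with `"nodes"` (node id to `(label, type)`) and `"edges"` (node id to a list of `(edge type, target id)`); this representation, the function names, and the choice to require equal label and type for compatibility are all assumptions of the sketch.

```python
def common_walks(g1, n1, g2, n2, max_len):
    """Count matching walks of at most `max_len` edges starting at (n1, n2)."""
    # Selector: only explore pairs of compatible nodes (same label and type).
    if g1["nodes"][n1] != g2["nodes"][n2]:
        return 0
    total = 1  # the matching node pair itself
    if max_len == 0:
        return total  # context bound reached; this also stops cycles
    for kind1, m1 in g1["edges"].get(n1, []):
        for kind2, m2 in g2["edges"].get(n2, []):
            if kind1 == kind2:  # a control edge only matches a control edge
                total += common_walks(g1, m1, g2, m2, max_len - 1)
    return total

def pdg_kernel(g1, g2, max_len=3):
    """Sum the matching walks over every pair of nodes of the two graphs."""
    return sum(common_walks(g1, n1, g2, n2, max_len)
               for n1 in g1["nodes"] for n2 in g2["nodes"])

# Two tiny, made-up PDGs (not the ones drawn on the slide).
G1 = {"nodes": {"w": ("WHILE", "control"), "c": ("CALL-SITE", "control"),
                "e": ("EXPR", "data")},
      "edges": {"w": [("control", "c"), ("data", "e")]}}
G2 = {"nodes": {"w": ("WHILE", "control"), "c": ("CALL-SITE", "control")},
      "edges": {"w": [("control", "c")]}}
```

The bound `max_len` plays the role of L: it limits how far the context of a node extends and guarantees termination on cyclic graphs.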
74. PROBLEM STATEMENT
(MODEL) CLONE DETECTION
Models: models are typically represented visually, as box-and-arrow diagrams, and the clones we are searching for are similar subgraphs of these diagrams.
Model Granularity: models could be represented at different levels of granularity (such as the source code) corresponding to different syntactic (and semantic) units.
Model clones are categorized into three different types
76. TYPE 1 CLONES
(MODEL) CLONE DETECTION
• Type 1 (exact) model clones: Identical model fragments except for variations in visual presentation, layout and formatting.
77. TYPE 2 CLONES
(MODEL) CLONE DETECTION
Type 2 (renamed) model clones: Structurally identical model fragments except for variations in labels, values, types, visual presentation, layout and formatting.
model@Friction Mode Logic/Break Apart Detection
model@Friction Mode Logic/Lockup Detection/Required Friction for Lockup
78. TYPE 3 CLONES
(MODEL) CLONE DETECTION
Type 3 (near-miss) model clones: Model fragments with further modifications, such as changes in position or connection with respect to other model fragments and small additions or removals of blocks or lines, in addition to variations in labels, values, types, visual presentation, layout and formatting.
model@Speed.speed_estimation
model@Throttle.throttle_estimation