SlideShare une entreprise Scribd logo
1  sur  75
Télécharger pour lire hors ligne
LINSEN
AN EFFICIENT APPROACH
TO SPLIT IDENTIFIERS AND
EXPAND ABBREVIATIONS
Anna Corazza, Sergio Di Martino, Valerio Maggio
Università di Napoli “Federico II”
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
MOTIVA
TIONS IR FOR SE
MOTIVA
TIONS IR FOR SE
IRFORNATURALLANGUAGE
1. Tokenization
IRFORNATURALLANGUAGE
1. Tokenization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
displayBox,
...
1. Tokenization
2. Normalization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
displayBox,
...
1. Tokenization
2. Normalization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
displayBox,
...
1. Change to Lower case
draws, the, are,
nullhandle,
box, r,
rectangle, g,
graphics,
box,
displaybox,
...
1. Tokenization
2. Normalization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
displayBox,
...
1. Change to Lower case
draws, the, are,
nullhandle,
box, r,
rectangle, g,
graphics,
box,
displaybox,
...2.Remove StopWords
1. Tokenization
2. Normalization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
displayBox,
...
1. Change to Lower case
draw, the, are,
nullhandl,
box, r,
rectangl, g,
graphic,
box,
displaybox,
...2.Remove StopWords
3.Apply Stemming
4. ...
Implicit assumption: The “same” words are used whenever a
particular concept is described
1. Tokenization
2. Normalization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
displayBox,
...
1. Change to Lower case
draw, the, are,
nullhandl,
box, r,
rectangl, g,
graphic,
box,
displaybox,
...2.Remove StopWords
3.Apply Stemming
4. ...
Implicit assumption: The “same” words are used whenever a
particular concept is described
1. Tokenization
2. Normalization
IRFORNATURALLANGUAGE
Draws, the, are,
NullHandle,
box, r,
Rectangle, g,
Graphics,
box,
displayBox,
...
1. Change to Lower case
draw, the, are,
nullhandl,
box, r,
rectangl, g,
graphic,
box,
displaybox,
...2.Remove StopWords
3.Apply Stemming
4. ...
1. Tokenization
IRFORSOURCECODE
2. Normalization
1.5 Identifier
Splitting
1. Tokenization
IRFORSOURCECODE
2. Normalization
1.5 Identifier
Splitting
• snake_case Splitter: r’(?<=w)_’
• display_box ==> display | box
• camelCase/PascalCase Splitter: r’(?<!^)([A-Z][a-z]+)’
• displayBox ==> display | Box
draw, the, are,
null, handl,
box, r,
rectangl, g,
graphic,
box,
display, box,
...
IRFORSOURCECODE
IRFORSOURCECODE
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
IRFORSOURCECODE
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
IRFORSOURCECODE Splitting algorithms based on naming conventions
are not robust enough
Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
IRFORSOURCECODE
Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
• rect as for Rectangle
• r as for Rectangle
IRFORSOURCECODE
Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
• rect as for Rectangle
• r as for Rectangle
IRFORSOURCECODE
1. Tokenization
IDENTIFIERMAPPING
2. Normalization
1.5 Identifier
Mapping
• SAMURAI (Enslen, et.al , 2011)
• TIDIER (Guerrouj, et.al , 2011)
• GenTest+Normalize (Lawrie and Binkley, 2011)
• AMAP (Hill and Pollock, 2008)
• ...
• LINSEN
draw, the, are,
null, handl,
box, r,
rectangl, g,
graphic,
box,
display, box,
...
LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
• Based on an efficient String Matching technique:
Baeza-Yates&Perlberg Algorithm (BYP)
LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
• Based on an efficient String Matching technique:
Baeza-Yates&Perlberg Algorithm (BYP)
• Applied on a Graph-based model
LINSEN
A L G O
R I T H M CONTRIBUTION
• Novel technique for the Identifier Mapping
• Based on an efficient String Matching technique:
Baeza-Yates&Perlberg Algorithm (BYP)
• Applied on a Graph-based model
• Able to both Split Identifiers and Expand possible occurring
abbreviations
There could be multiple and equally correct splitting or
expansion solutions
THEAMBIGUITYPROBLEM
There could be multiple and equally correct splitting or
expansion solutions
• r as for Rectangle OR red
THEAMBIGUITYPROBLEM
There could be multiple and equally correct splitting or
expansion solutions
• r as for Rectangle OR red
THEAMBIGUITYPROBLEM
•nsISupport ==> ns|IS|up|ports OR
==> ns|I|Supports
DICTIONARIES
DICTIONARIES
DICTIONARIES
Application-aware
Dictionaries
DICTIONARIES
Application-aware
Dictionaries(108,315 Entries)
DICTIONARIES
Application-aware
Dictionaries
(22,940 Entries)
(108,315 Entries)
DICTIONARIES
Application-aware
Dictionaries
(22,940 Entries)
(108,315 Entries)
(588 Entries)
GRAPHMODELModel: Weighted Directed Graph Example: drawXORRect identifier
GRAPHMODEL
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
GRAPHMODEL
• ARCS corresponds to matchings between identifier substrings
and dictionary words
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
GRAPHMODEL
• ARCS corresponds to matchings between identifier substrings
and dictionary words
• Application of the String Matching Algorithm (BYP)
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
GRAPHMODEL
• ARCS corresponds to matchings between identifier substrings
and dictionary words
• Application of the String Matching Algorithm (BYP)
• Padding Arcs to ensure the Graph always connected
• NODES correspond to characters of the current identifier
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
GRAPHMODEL
• Every Arc is Labelled with the corresponding dictionary word
• Weights represent the “cost” of each matching
• Cost function [c(“word”)] favors longest words and words coming
from the application-aware dictionaries
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
GRAPHMODEL
• The final Mapping Solution corresponds to the sequence of labels in the
path with the minimum cost (Djikstra Algorithm)
Model: Weighted Directed Graph Example: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
STRING MATCHING
• Application of the Baeza-Yates and Perlberg (BYP) Algorithm
• Signature: BYP(identifier, word, φ(word))
• identifier: target string
• word: string to match
• φ(·): Tolerance (Error) function
• Bounds the length of acceptable matchings
Advantage: Use the same algorithm for both the splitting and
the expansion step with different input Tolerance function
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
...
echo,
testing,
threading,
xpm,
xor,
....
...
abort
absolute
abstract
...
or
...
raw
...
DFile DCOMPUTER-SCIENCE DEnglish
Identifier:
drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
...
echo,
testing,
threading,
xpm,
xor,
....
...
abort
absolute
abstract
...
or
...
raw
...
0
11
1
9
4
5
7
8
2 63
10
DFile DCOMPUTER-SCIENCE DEnglish
Identifier:
drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
...
echo,
testing,
threading,
xpm,
xor,
....
...
abort
absolute
abstract
...
or
...
raw
...
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“R”,C-MAX
DFile DCOMPUTER-SCIENCE DEnglish
Identifier:
drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
...
echo,
testing,
threading,
xpm,
xor,
....
...
abort
absolute
abstract
...
or
...
raw
...
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”)
“R”,C-MAX
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
DFile DCOMPUTER-SCIENCE DEnglish
Identifier:
drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
...
abort
absolute
abstract
...
or
...
raw
...
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
DFile ...
echo,
testing,
threading,
xpm,
xor,
....
DCOMPUTER-SCIENCE DEnglish
Identifier:
drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
DFile ...
echo,
testing,
threading,
xpm,
xor,
....
DCOMPUTER-SCIENCE ...
abort
absolute
abstract
...
or
...
raw
...
DEnglish
Identifier:
drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“R”,C-MAX
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
DFile ...
echo,
testing,
threading,
xpm,
xor,
....
DCOMPUTER-SCIENCE ...
abort
absolute
abstract
...
or
...
raw
...
DEnglish
Identifier:
drawXORRect
• BYP(identifier, word, φSplit(word))
φSplit : Exact Matching (i.e., No Errors allowed)
BYPFORSPLITTING
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“R”,C-MAX
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
DFile ...
echo,
testing,
threading,
xpm,
xor,
....
DCOMPUTER-SCIENCE ...
abort
absolute
abstract
...
or
...
raw
...
DEnglish
Identifier:
drawXORRect
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BYPFOREXPANSION
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)
“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“R”,C-MAX
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
...
echo,
testing,
threading,
xpm,
xor,
....
DCOMPUTER-SCIENCE
...
abort
absolute
abstract
...
or
...
raw
...
DEnglishDFile
Identifier:
drawXORRect
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BYPFOREXPANSION
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
...
echo,
testing,
threading,
xpm,
xor,
....
DCOMPUTER-SCIENCE
...
abort
absolute
abstract
...
or
...
raw
...
DEnglishDFile
Identifier:
drawXORRect
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BYPFOREXPANSION
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
...
echo,
testing,
threading,
xpm,
xor,
....
DCOMPUTER-SCIENCE
...
abort
absolute
abstract
...
or
...
raw
...
DEnglish
“red”,c(“red”)
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
DFile
Identifier:
drawXORRect
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BYPFOREXPANSION
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
...
echo,
testing,
threading,
xpm,
xor,
....
DCOMPUTER-SCIENCE
...
abort
absolute
abstract
...
or
...
raw
...
DEnglish
“red”,c(“red”)
“rectangle”,c(“rectangle”)
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
DFile
Identifier:
drawXORRect
• BYP(identifier, word, φExp(word))
• φExp : Approximate Matching
BYPFOREXPANSION
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,C-MAX
6
“a”,C-MAX
3
“w”,C-MAX
“X”,C-MAX
“O”,C-MAX
“e”,C-MAX
10
“c”,C-MAX“t”,C-MAX
“draw”,c(“draw”) “xor”,c(“xor”)
“R”,C-MAX
...
echo,
testing,
threading,
xpm,
xor,
....
DCOMPUTER-SCIENCE
...
abort
absolute
abstract
...
or
...
raw
...
DEnglish
“red”,c(“red”)
“rectangle”,c(“rectangle”)
..
draw,
the,
are,
null,
handle,
box,
red,
rectangle,
...
DFile
Identifier:
drawXORRect
EMPIRI
CAL
E V A L U
A T I O N
RESEARCH
QUESTIONS
• RQ1:
How does LINSEN compare with state-of-the-art approaches as for the splitting of
identifiers?
EMPIRI
CAL
E V A L U
A T I O N
RESEARCH
QUESTIONS
• RQ1:
How does LINSEN compare with state-of-the-art approaches as for the splitting of
identifiers?
• RQ2:
How does LINSEN compare with state-of-the-art approaches as for the mapping
of identifiers to dictionary words?
EMPIRI
CAL
E V A L U
A T I O N
RESEARCH
QUESTIONS
• RQ1:
How does LINSEN compare with state-of-the-art approaches as for the splitting of
identifiers?
• RQ2:
How does LINSEN compare with state-of-the-art approaches as for the mapping
of identifiers to dictionary words?
• RQ3:
What is the ability of the LINSEN approach in dealing with different types of
abbreviations?
CASE STUDIES
DTW (Madani et. al 2010)
GenTest+Normalize (Lawrie and Binkley, 2011)
RQ1 and
RQ2}
RQ1 only
LUDISO Dataset
(2012)
AMAP (Hill and Pollock, 2008)
RQ3 only
15 out of 750 software systems
Covering the 58% of total identifiers
EMPIRI
CAL
E V A L U
A T I O N
EVALUATION METRICS
Comparability of Results:
Accuracy rate
Qualitative Evaluation:
Precision/Recall/F-1
[Guerrouj, et.al , 2011]
EMPIRI
CAL
E V A L U
A T I O N
EVALUATION METRICS
Comparability of Results:
Accuracy rate
Qualitative Evaluation:
Precision/Recall/F-1
[Guerrouj, et.al , 2011]
EMPIRI
CAL
E V A L U
A T I O N
EVALUATION METRICS
Comparability of Results:
Accuracy rate
• Identifier Level evaluation:
Each mapping result must be completely
correct
• Soft-word Level evaluation:
“Partial credit” given to each word correctly
mapped
Qualitative Evaluation:
Precision/Recall/F-1
[Guerrouj, et.al , 2011]
As for the comparison with
GenTest+Normalize
(Lawrie and Binkley, 2011)
EMPIRI
CAL
E V A L U
A T I O N
RQ1: SPLITTING
Accuracy Rates for the
comparison with
DTW (Madani et. al 2010)
0
0.25
0.5
0.75
1
JhotDraw 5.1 Lynx 2.8.5
DTW LINSEN DTW LINSEN
DTW
LINSEN
RE
SULTS
R Q 1
Accuracy Rates for the
comparison with GenTest
(Lawrie and Binkley, 2011)
0
0.175
0.35
0.525
0.7
which 2.20 a2ps 4.14
Identifier Level
0
0.2
0.4
0.6
0.8
which 2.20 a2ps 4.14
Soft-word Level
GenTest
LINSEN
GenTest
LINSEN
RE
SULTS
R Q 1 RQ1: SPLITTING
Accuracy Rates for the
comparison with
DTW (Madani et. al 2010)
0
0.25
0.5
0.75
1
JhotDraw 5.1 Lynx 2.8.5
DTW
LINSEN
RQ2: MAPPING
RE
SULTS
R Q 2
Accuracy Rates for the
comparison with Normalize
(Lawrie and Binkley, 2011)
0
0.15
0.3
0.45
0.6
which 2.20 a2ps 4.14
Identifier Level
0
0.225
0.45
0.675
0.9
which 2.20 a2ps 4.14
Soft-word Level
Normalize
LINSEN
Normalize
LINSEN
RE
SULTS
R Q 2 RQ2: MAPPING
Accuracy Rates for the
comparison with
AMAP (Hill and Pollock, 2008)
0
0.225
0.45
0.675
0.9
CW DL OO AC PR SL
AMAP
LINSEN
RQ3: EXPANSION
RE
SULTS
R Q 3
CW: Combination Words
DL: Dropped Letters
OO: Others
AC: Acronyms
PR: Prefix
SL: Single Letters
CONCLUSIONS
CONCLUSIONS
CONCLUSIONS
CONCLUSIONS
CONCLUSIONS
FUTURE WORKS
• Evaluation of the impact of each adopted dictionary on
the performance
• Improve or change or add dictionaries
• Improve the implementation of the prototype to speed up
the computation
• Make use of parallel computation to process each
identifier in isolation
THANK YOU
26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
Valerio Maggio
Ph.D. Student, University of Naples “Federico II”
valerio.maggio@unina.it

Contenu connexe

Plus de Valerio Maggio

Refactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeRefactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeValerio Maggio
 
Scaffolding with JMock
Scaffolding with JMockScaffolding with JMock
Scaffolding with JMockValerio Maggio
 
Unit testing with Junit
Unit testing with JunitUnit testing with Junit
Unit testing with JunitValerio Maggio
 
Design patterns and Refactoring
Design patterns and RefactoringDesign patterns and Refactoring
Design patterns and RefactoringValerio Maggio
 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven DevelopmentValerio Maggio
 
Unit testing and scaffolding
Unit testing and scaffoldingUnit testing and scaffolding
Unit testing and scaffoldingValerio Maggio
 

Plus de Valerio Maggio (8)

Refactoring: Improve the design of existing code
Refactoring: Improve the design of existing codeRefactoring: Improve the design of existing code
Refactoring: Improve the design of existing code
 
Scaffolding with JMock
Scaffolding with JMockScaffolding with JMock
Scaffolding with JMock
 
Junit in action
Junit in actionJunit in action
Junit in action
 
Unit testing with Junit
Unit testing with JunitUnit testing with Junit
Unit testing with Junit
 
Design patterns and Refactoring
Design patterns and RefactoringDesign patterns and Refactoring
Design patterns and Refactoring
 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven Development
 
Unit testing and scaffolding
Unit testing and scaffoldingUnit testing and scaffolding
Unit testing and scaffolding
 
Web frameworks
Web frameworksWeb frameworks
Web frameworks
 

Dernier

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Dernier (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

LINSEN an efficient approach to split identifiers and expand abbreviations

  • 1. LINSEN AN EFFICIENT APPROACH TO SPLIT IDENTIFIERS AND EXPAND ABBREVIATIONS Anna Corazza, Sergio Di Martino, Valerio Maggio Università di Napoli “Federico II” 26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy
  • 6. 1. Tokenization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
  • 7. 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
  • 8. 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 1. Change to Lower case draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...
  • 9. 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 1. Change to Lower case draws, the, are, nullhandle, box, r, rectangle, g, graphics, box, displaybox, ...2.Remove StopWords
  • 10. 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 1. Change to Lower case draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords 3.Apply Stemming 4. ...
  • 11. Implicit assumption: The “same” words are used whenever a particular concept is described 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 1. Change to Lower case draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords 3.Apply Stemming 4. ...
  • 12. Implicit assumption: The “same” words are used whenever a particular concept is described 1. Tokenization 2. Normalization IRFORNATURALLANGUAGE Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ... 1. Change to Lower case draw, the, are, nullhandl, box, r, rectangl, g, graphic, box, displaybox, ...2.Remove StopWords 3.Apply Stemming 4. ...
  • 14. 1. Tokenization IRFORSOURCECODE 2. Normalization 1.5 Identifier Splitting • snake_case Splitter: r’(?<=w)_’ • display_box ==> display | box • camelCase/PascalCase Splitter: r’(?<!^)([A-Z][a-z]+)’ • displayBox ==> display | Box draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
  • 17. • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ • drawXORRect ==> drawXOR | Rect • drawxorrect ==> NO SPLIT IRFORSOURCECODE
  • 18. • camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’ • drawXORRect ==> drawXOR | Rect • drawxorrect ==> NO SPLIT IRFORSOURCECODE Splitting algorithms based on naming conventions are not robust enough
  • 19. Splitting algorithms based on naming conventions are not robust enough • Heavy use of Abbreviations in the source code IRFORSOURCECODE
  • 20. Splitting algorithms based on naming conventions are not robust enough • Heavy use of Abbreviations in the source code • rect as for Rectangle • r as for Rectangle IRFORSOURCECODE
  • 21. Splitting algorithms based on naming conventions are not robust enough • Heavy use of Abbreviations in the source code • rect as for Rectangle • r as for Rectangle IRFORSOURCECODE
  • 22. 1. Tokenization IDENTIFIERMAPPING 2. Normalization 1.5 Identifier Mapping • SAMURAI (Enslen, et.al , 2011) • TIDIER (Guerrouj, et.al , 2011) • GenTest+Normalize (Lawrie and Binkley, 2011) • AMAP (Hill and Pollock, 2008) • ... • LINSEN draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
  • 23. LINSEN A L G O R I T H M CONTRIBUTION • Novel technique for the Identifier Mapping
  • 24. LINSEN A L G O R I T H M CONTRIBUTION • Novel technique for the Identifier Mapping • Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP)
  • 25. LINSEN A L G O R I T H M CONTRIBUTION • Novel technique for the Identifier Mapping • Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP) • Applied on a Graph-based model
  • 26. LINSEN A L G O R I T H M CONTRIBUTION • Novel technique for the Identifier Mapping • Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP) • Applied on a Graph-based model • Able to both Split Identifiers and Expand possible occurring abbreviations
  • 27. There could be multiple and equally correct splitting or expansion solutions THEAMBIGUITYPROBLEM
  • 28. There could be multiple and equally correct splitting or expansion solutions • r as for Rectangle OR red THEAMBIGUITYPROBLEM
  • 29. There could be multiple and equally correct splitting or expansion solutions • r as for Rectangle OR red THEAMBIGUITYPROBLEM •nsISupport ==> ns|IS|up|ports OR ==> ns|I|Supports
  • 36. GRAPHMODELModel: Weighted Directed Graph Example: drawXORRect identifier
  • 37. GRAPHMODEL • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 5 7 8 2 63 10
  • 38. GRAPHMODEL • ARCS corresponds to matchings between identifier substrings and dictionary words • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 5 7 8 2 63 10 “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”)
  • 39. GRAPHMODEL • ARCS corresponds to matchings between identifier substrings and dictionary words • Application of the String Matching Algorithm (BYP) • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 5 7 8 2 63 10 “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”)
  • 40. GRAPHMODEL • ARCS corresponds to matchings between identifier substrings and dictionary words • Application of the String Matching Algorithm (BYP) • Padding Arcs to ensure the Graph always connected • NODES correspond to characters of the current identifier Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”) “R”,C-MAX
  • 41. GRAPHMODEL • Every Arc is Labelled with the corresponding dictionary word • Weights represent the “cost” of each matching • Cost function [c(“word”)] favors longest words and words coming from the application-aware dictionaries Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”) “R”,C-MAX
  • 42. GRAPHMODEL • The final Mapping Solution corresponds to the sequence of labels in the path with the minimum cost (Djikstra Algorithm) Model: Weighted Directed Graph Example: drawXORRect identifier 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “rectangle”,c(“rectangle”) “R”,C-MAX
  • 43. STRING MATCHING • Application of the Baeza-Yates and Perlberg (BYP) Algorithm • Signature: BYP(identifier, word, φ(word)) • identifier: target string • word: string to match • φ(·): Tolerance (Error) function • Bounds the length of acceptable matchings Advantage: Use the same algorithm for both the splitting and the expansion step with different input Tolerance function
  • 44. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING .. draw, the, are, null, handle, box, red, rectangle, ... ... echo, testing, threading, xpm, xor, .... ... abort absolute abstract ... or ... raw ... DFile DCOMPUTER-SCIENCE DEnglish Identifier: drawXORRect
  • 45. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING .. draw, the, are, null, handle, box, red, rectangle, ... ... echo, testing, threading, xpm, xor, .... ... abort absolute abstract ... or ... raw ... 0 11 1 9 4 5 7 8 2 63 10 DFile DCOMPUTER-SCIENCE DEnglish Identifier: drawXORRect
  • 46. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING .. draw, the, are, null, handle, box, red, rectangle, ... ... echo, testing, threading, xpm, xor, .... ... abort absolute abstract ... or ... raw ... 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “R”,C-MAX DFile DCOMPUTER-SCIENCE DEnglish Identifier: drawXORRect
  • 47. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING ... echo, testing, threading, xpm, xor, .... ... abort absolute abstract ... or ... raw ... 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... DFile DCOMPUTER-SCIENCE DEnglish Identifier: drawXORRect
  • 48. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING ... abort absolute abstract ... or ... raw ... 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... DFile ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE DEnglish Identifier: drawXORRect
  • 49. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... DFile ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish Identifier: drawXORRect
  • 50. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... DFile ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish Identifier: drawXORRect
  • 51. • BYP(identifier, word, φSplit(word)) φSplit : Exact Matching (i.e., No Errors allowed) BYPFORSPLITTING 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... DFile ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish Identifier: drawXORRect
  • 52. • BYP(identifier, word, φExp(word)) • φExp : Approximate Matching BYPFOREXPANSION 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “raw”,c(“raw”) “draw”,c(“draw”) “or”,c(“or”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglishDFile Identifier: drawXORRect
  • 53. • BYP(identifier, word, φExp(word)) • φExp : Approximate Matching BYPFOREXPANSION 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX .. draw, the, are, null, handle, box, red, rectangle, ... ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglishDFile Identifier: drawXORRect
  • 54. • BYP(identifier, word, φExp(word)) • φExp : Approximate Matching BYPFOREXPANSION 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish “red”,c(“red”) .. draw, the, are, null, handle, box, red, rectangle, ... DFile Identifier: drawXORRect
  • 55. • BYP(identifier, word, φExp(word)) • φExp : Approximate Matching BYPFOREXPANSION 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish “red”,c(“red”) “rectangle”,c(“rectangle”) .. draw, the, are, null, handle, box, red, rectangle, ... DFile Identifier: drawXORRect
  • 56. • BYP(identifier, word, φExp(word)) • φExp : Approximate Matching BYPFOREXPANSION 0 11 1 9 4 “r”,C-MAX “d”,C-MAX 5 7 8 2 “R”,C-MAX 6 “a”,C-MAX 3 “w”,C-MAX “X”,C-MAX “O”,C-MAX “e”,C-MAX 10 “c”,C-MAX“t”,C-MAX “draw”,c(“draw”) “xor”,c(“xor”) “R”,C-MAX ... echo, testing, threading, xpm, xor, .... DCOMPUTER-SCIENCE ... abort absolute abstract ... or ... raw ... DEnglish “red”,c(“red”) “rectangle”,c(“rectangle”) .. draw, the, are, null, handle, box, red, rectangle, ... DFile Identifier: drawXORRect
  • 57. EMPIRI CAL E V A L U A T I O N RESEARCH QUESTIONS • RQ1: How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers?
  • 58. EMPIRI CAL E V A L U A T I O N RESEARCH QUESTIONS • RQ1: How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers? • RQ2: How does LINSEN compare with state-of-the-art approaches as for the mapping of identifiers to dictionary words?
  • 59. EMPIRI CAL E V A L U A T I O N RESEARCH QUESTIONS • RQ1: How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers? • RQ2: How does LINSEN compare with state-of-the-art approaches as for the mapping of identifiers to dictionary words? • RQ3: What is the ability of the LINSEN approach in dealing with different types of abbreviations?
  • 60. CASE STUDIES DTW (Madani et. al 2010) GenTest+Normalize (Lawrie and Binkley, 2011) RQ1 and RQ2} RQ1 only LUDISO Dataset (2012) AMAP (Hill and Pollock, 2008) RQ3 only 15 out of 750 software systems Covering the 58% of total identifiers EMPIRI CAL E V A L U A T I O N
  • 61. EVALUATION METRICS Comparability of Results: Accuracy rate Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011] EMPIRI CAL E V A L U A T I O N
  • 62. EVALUATION METRICS Comparability of Results: Accuracy rate Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011] EMPIRI CAL E V A L U A T I O N
  • 63. EVALUATION METRICS Comparability of Results: Accuracy rate • Identifier Level evaluation: Each mapping result must be completely correct • Soft-word Level evaluation: “Partial credit” given to each word correctly mapped Qualitative Evaluation: Precision/Recall/F-1 [Guerrouj, et.al , 2011] As for the comparison with GenTest+Normalize (Lawrie and Binkley, 2011) EMPIRI CAL E V A L U A T I O N
  • 64. RQ1: SPLITTING Accuracy Rates for the comparison with DTW (Madani et. al 2010) 0 0.25 0.5 0.75 1 JhotDraw 5.1 Lynx 2.8.5 DTW LINSEN DTW LINSEN DTW LINSEN RE SULTS R Q 1
  • 65. Accuracy Rates for the comparison with GenTest (Lawrie and Binkley, 2011) 0 0.175 0.35 0.525 0.7 which 2.20 a2ps 4.14 Identifier Level 0 0.2 0.4 0.6 0.8 which 2.20 a2ps 4.14 Soft-word Level GenTest LINSEN GenTest LINSEN RE SULTS R Q 1 RQ1: SPLITTING
  • 66. Accuracy Rates for the comparison with DTW (Madani et. al 2010) 0 0.25 0.5 0.75 1 JhotDraw 5.1 Lynx 2.8.5 DTW LINSEN RQ2: MAPPING RE SULTS R Q 2
  • 67. Accuracy Rates for the comparison with Normalize (Lawrie and Binkley, 2011) 0 0.15 0.3 0.45 0.6 which 2.20 a2ps 4.14 Identifier Level 0 0.225 0.45 0.675 0.9 which 2.20 a2ps 4.14 Soft-word Level Normalize LINSEN Normalize LINSEN RE SULTS R Q 2 RQ2: MAPPING
  • 68. Accuracy Rates for the comparison with AMAP (Hill and Pollock, 2008) 0 0.225 0.45 0.675 0.9 CW DL OO AC PR SL AMAP LINSEN RQ3: EXPANSION RE SULTS R Q 3 CW: Combination Words DL: Dropped Letters OO: Others AC: Acronyms PR: Prefix SL: Single Letters
  • 74. FUTURE WORKS • Evaluation of the impact of each adopted dictionary on the performance • Improve or change or add dictionaries • Improve the implementation of the prototype to speed up the computation • Make use of parallel computation to process each identifier in isolation
  • 75. THANK YOU 26th Sept. 2012, ICSM2012@Riva del Garda(Trento), Italy Valerio Maggio Ph.D. Student, University of Naples “Federico II” valerio.maggio@unina.it