Wcre12b.ppt

TRIS: a Fast and Accurate Identifiers
Splitting and Expansion
Algorithm
WCRE 2012, Kingston, Canada

Latifa Guerrouj, Philippe Galinier, Yann-Gaël Guéhéneuc,
Giuliano Antoniol, and Massimiliano Di Penta

Content
Research Context
Current Practices
TRIS in Essence
Case Study
Threats to Validity
Conclusion
WCRE’12, Kingston
2/25

Research Context
Researchers agree that the identifier semantics are
important:
Help program comprehension: [CT99, DP05, LFB06]…;
Improve software quality: [MPF08, LMFB06, JH06]…;
Suggest clues: [TGM96, LMFB07]...

Composed identifiers:
Camel Case: MyLocalAccount, User_Address
Contraction based: pntr ctr, usrAdrss , imagEdge
Good and possibly known to the developers:
WCRE’12, Kingston
3/25

hmmm, ixoth , pqrstuvwxyz

Current Practices
Camel Case-based approaches [LFB06]
Case[LFB06]
Samurai by Enslen [ELK09]:
Mines term frequencies in large source code bases and relies on two
assumptions:
1) A substring composing an identifier is also likely to be used in
other parts of the program or in other programs.
2) Given two possible splits of a given identifier, the split that
most likely represents the developer’s intent partitions the
identifier into terms occurring more often in the program.

WCRE’12, Kingston
4/25

Abbreviations not treated, no quantification of how
close the match is to the unknown string.

Current Practices
TIDIER [GMA11]:
[GMA11]:
Inspired by speech recognition techniques and uses Dynamic
Time Warping to match terms to domain concepts.
Exploits context in the form of specialized dictionaries and
mimics the process of transforming words via contraction
rules.

GenTest/Normalize
GenTest/Normalize [LBM10, LB11]:
GenTest generates all possible splittings and evaluates a
scoring function against each proposed splitting.
Normalize uses a machine translation technique, the
maximum coherence model.

WCRE’12, Kingston
5/25

TIDIER, GenTest and Normalize are demanding
in terms of computation time.

TRIS in Essence
TRIS assumes that developers compose identifiers:
Using terms and words reflecting domain concepts developer’s
experience, knowledge.

TRIS uses set of transformation rules:
Drop vowels, drop prefix, drop suffix, etc.

TRIS treats the problem as an optimization problem
where the cost function is:
C(wOrig w) = α*Freq(wOrig) + C(type(wOrig w))
Freq(wOrig): frequency of wOrig in the source code
WCRE’12, Kingston
6/25

C(type(wOrig

w)): cost of the transformation type

TRIS in Essence
TRIS applies a two-phase strategy:
two1) Building dictionary transformation
Computation of the frequency of dictionary words
Construction of the set of transformations
Construction of the arborescence

2) Identifier processing algorithm
Construction of the auxiliary graph of the identifier
Search for a shortest path in the auxiliary graph
WCRE’12, Kingston
7/25

TRIS in Essence
TRIS phases – Example:

WCRE’12, Kingston
8/25

Dictionary Transformations Building Information for the Identifier
callableint

TRIS in Essence

Arborescence of Transformations for the Dictionary D

WCRE’12, Kingston
9/25

Auxiliary Graph for the Identifier callableint

Case Study - Research Question

Quality focus:

Accuracy of TRIS with respect to oracles, and compared with stateof-the-art approaches: 1) Camel Case Splitter, 2) Samurai, 3)
TIDIER, and 4) GenTest.

Research Question:

What is the accuracy of the TRIS identifier splitting and expansion
approach compared with alternative state-of-the art approaches?
WCRE’12, Kingston
10/25

Case Study – Analyzed Systems
JHotDraw – Java
16 KLOC
155 files
2,348 identifiers (longer than 2 chars)
957 manually segmented identifiers
Lynx – C
174 KLOC
247 files
12,194 identifiers (longer than 2 chars)
3,085 manually segmented identifiers
Lawrie et al. Data Set
186 programs
26 MLOC of C
15 MLOC of C++
WCRE’12, Kingston
11/25

7 MLOC of Java
489 C/C++ sampled from 430 GNU projects

Case Study - Results

Precision, recall and F-measure of TRIS, Camel Case, Samurai,
and TIDIER on JHotDraw
WCRE’12, Kingston
12/25


Precision, recall and F-measure of TRIS, Camel Case, Samurai,
and TIDIER on Lynx

WCRE’12, Kingston
13/25

Approach 1
TRIS
TRIS
TRIS

Approach 2 Adj p-value
Camel Case
0.431
Samurai
0.894
TIDIER
0.024

Cliff's delta
0.041
0.001
0.043

Comparison among approaches: Results of Wilcoxon paired test
and Cliff’s Delta effect size on JHotDraw
Approach 1

Approach 2

TRIS
TRIS
TRIS

Camel Case
Samurai
TIDIER

Adj p-value

<0.001
<0.001
<0.001

Cliff's delta

0.743
0.684
0.204

Comparison among approaches: Results of Wilcoxon paired test
and Cliff’s Delta effect size on Lynx

WCRE’12, Kingston
14/25

Cliff’s delta Interpretation:
- small for 0.148 <= d <0.33
- medium for 0.33 <= d < 0.474
- large for d >= 0.474


Metric
Precision

Median
Mean
3Q
σ
0.6667
0.6368 1.0000 0.3681

1.0000

1.0000

0.8933

1.0000 0.2471

TIDIER

0.5000

0.6667

0.6496

1.0000 0.3654

TRIS

1.0000

1.0000

0.8720

1.0000 0.2606

TIDIER

0.4000

0.6667

0.6409

1.0000 0.3650

TRIS

F-measure

1Q
0.4000

TRIS
Recall

Approach
TIDIER

1.0000

1.0000

0.8790

1.0000 0.2524

Precision, recall and F-measure of TRIS and TIDIER on the
489 C Sampled Identifiers

WCRE’12, Kingston
15/25

Metric

Approach

1Q

Median Mean

3Q

σ

Precision

TRIS

1.0000 1.0000 0.9763 1.0000 0.1184

Recall

TRIS

1.0000 1.0000 0.9439 1.0000 0.1565

F-measure

TRIS

1.0000 1.0000 0.9559 1.0000 0.1358

Precision, recall and F-measure of TRIS and TIDIER on the
data set from LAWRIE et al

Approach

Identifier Splitting Correctness

Samurai
GenTest

16/25

82%

TRIS
WCRE’12, Kingston

70%

86%

Correctness of the splitting provided using the data set from
LAWRIE et al

Conclusion
TRIS maps the identifier splitting/expansion problem in a graph
optimization problem to find the optimal path in an identifier
acyclic weighted graph.
TRIS was applied on Java, C, and C++ samples and compared to
Camel Case, Samurai, TIDIER, and GenTest.
Results show that TRIS outperforms other approaches with
medium to large effect size.
TRIS is also efficient in terms of computation time: quadratic
complexity in the length of the identifier.
WCRE’12, Kingston
17/25

Finally… Questions

Thank you for your attention
WCRE’12, Kingston
18/25

References
[LB11 ] D. Lawrie and D. Binkley, “Expanding identifiers to normalize source
code vocabulary,” International Conference on Software Maintenance , 2011, pp.
113–122.
[LBM10] D. Lawrie, D. Binkley, and C. Morrell, “Normalizing source code
vocabulary,” Working Conference on Reverse Engineering, 2010, pp. 112–122.
[GMA11] L. Guerrouj, D. P. Massimiliano, A. Giuliano, and Y.-G. Guéhéneuc,
“Tidier: An identifier splitting approach using speech recognition techniques,”
Journal of Software Maintenance and Evolution: Research and Practice, 2011.
[MGD10] N. Madani, L. Guerrouj, M. Di Penta, Y. Guéhéneuc, and G. Antoniol.
“Recognizing words from source code identifiers using speech recognition
techniques”. 14th European Conference on Software Maintenance and
Reengineering, March 2010.
[ELK09] E. Enslen, E. Hill, L. Pollock, and K. Vijay-Shanker, “Mining source
code to automatically split identifiers for software analysis”. Mining Software
Repositories, International Workshop on, vol. 0, pp. 71-80, 2009.
WCRE’12, Kingston
19/25

References
[MPF08] A. Marcus, D. Poshyvanyk, and R. Ferenc. “Using the conceptual
cohesion of classes for fault prediction in object-oriented systems”. IEEE
Transactions on Software Engineering, 34(2):287-300, 2008.
LMFB07] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley.
“Effective identifier names for comprehension and memory”. Innovations in
Systems and Software Engineering, 3(4):303-318, 2007.
[LFB06] D. Lawrie, H. Feild, and D. Binkley. “Syntactic identifier conciseness
and consistency”. 6th International Workshop on Source Code Analysis and
Manipulation, pages139-148, Sept 27-29, 2006.
[LMFB06] D. Lawrie, C. Morrell, H. Feild, and D. Binkley. “What’s in a name? a
study of identifiers”. 14th International Conference on Program
Comprehension, pages 3-12, Athens, Greece, 2006.
[JH06] Z. Ming Jiang and Ahmed E. Hassan. “Examining the evolution of code
comments in postgresql”. 2006 International Workshop on Mining Software
Repositories, pages 179-180, 2006.
WCRE’12, Kingston
20/25

References
[DP05] Florian Deissenbock and Markus Pizka. “Concise and consistent
naming”. Proceedings of the International Workshop on Program
Comprehension, May 2005.
[Ian00] Ian Sommerville. Software Engineering. Addison-Wesley, sixth edition,
2000.
[TGM96] Armstrong Takang, Penny A. Grubb, and Robert D. Macredie. “The
effects of comments and identifier names on program comprehensibility: an
experiential study”. Journal of Program Languages, 4(3):143–167, 1996.

[HNey84] H. Ney, “The use of a one-stage dynamic programming algorithm for
connected word recognition”. Acoustics, Speech and Signal Processing, IEEE
Transactions on, vol. 32, no. 2, pp. 263-271, Apr 1984.

WCRE’12, Kingston
21/25

Wcre12b.ppt

Recommandé

Recommandé

Contenu connexe

En vedette

En vedette (12)

Similaire à Wcre12b.ppt

Similaire à Wcre12b.ppt (20)

Plus de Ptidej Team

Plus de Ptidej Team (20)

Dernier

Dernier (20)

Wcre12b.ppt