Comment Analysis approach for Program Comprehension (Software Engineering Workshop - Crete)
1. A Comment Analysis Approach for Program
Comprehension
José L. Freitas, Daniela da Cruz, Pedro R. Henriques
Universidade do Minho, Portugal
Software Engineering Workshop, Crete
Oct. 12-13, 2012
Freitas, Cruz, Henriques Comment Analysis for Program Comprehension
2. Context
Program Comprehension is a vital task of Software
Maintenance.
In Software Maintenance, 50% of the time is spent on
comprehending the system.
Several source code analysis approaches have been applied
to develop PC tools: program slicing, control-flow, data-flow,
etc.
3. Motivation
Most PC tools are based on the extraction of structural
information.
Example: function Y is used by function X n times.
However, they do not extract the meaning of a program
or the Problem Domain concepts related to the program.
Example: function Y calculates the amount of credit of a
banking account.
4. Motivation
Comments can be the biggest source of semantic information
in code, alongside identifiers.
/* This function receives the id number of a banking
   account and returns the available amount of credit
*/
int credit(int id) { ... }
Why not use comments to search for Problem Domain concepts,
needed to understand a program?
5. Bad and Good Comments
When is a comment bad or good?
Leaving aside the controversy around this subject, a comment is
bad at the very least when it is inconsistent with the code it
describes, because it misleads the person who reads it.
Brooks states that comments help comprehension if they
provide Problem and Program Domain information and the means to
establish bridges between those two domains.
6. Goal
Create a Program Comprehension tool that explores comments to
search for Problem Domain concepts: Darius (1).
(1) Named after King Darius I of Persia, the first man known to have built a
bridge between Europe and Asia, across the Bosphorus strait.
7. Outline
1 Darius Comment Evaluator
  Preliminary study
2 Darius Concept Locator
  Experiment
3 Conclusion
8. Darius Comment Evaluator
The first version of Darius analyzes:
Comment Quantity: number of comments, percentage of
comments, etc.
Comment Content: Use of Problem Domain and Program
Domain terms.
9. Darius Preliminary study
Comment Evaluator Modules
10. Darius GUI (1)
11. Darius Comment Extractor module
Comment Extractor
Darius extracts three types of comments:
1 Inline Comments, IC for short: // ...
2 Block Comments, BC for short: /* ... */
3 JavaDoc Comments, JD for short: /** ... */
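The three comment types can be told apart by their opening delimiter. The following is a minimal sketch in Python, not Darius's actual implementation: it assumes a regular-expression scan is sufficient, and the names `TOKEN`, `extract_comments` and `SAMPLE` are hypothetical. String and character literals are matched and skipped so that text such as "http://..." is not mistaken for an inline comment.

```python
import re

# One pattern scans the source left to right. Literals come first in the
# alternation so that comment markers inside them are ignored.
TOKEN = re.compile(
    r'"(?:\\.|[^"\\\n])*"'        # string literal, skipped
    r"|'(?:\\.|[^'\\\n])*'"        # character literal, skipped
    r"|(?P<JD>/\*\*.*?\*/)"        # JavaDoc comment: /** ... */
    r"|(?P<BC>/\*.*?\*/)"          # block comment:   /* ... */
    r"|(?P<IC>//[^\n]*)",          # inline comment:  // ...
    re.DOTALL,
)


def extract_comments(source):
    """Return a list of (type, text) pairs, with type in {"IC", "BC", "JD"}."""
    comments = []
    for match in TOKEN.finditer(source):
        kind = match.lastgroup      # None when a literal was matched
        if kind is not None:
            comments.append((kind, match.group(kind)))
    return comments


# Hypothetical Java input, used only for illustration.
SAMPLE = '''
/** Returns the available credit of an account. */
int credit(int id) {
    // look up the account
    String url = "http://bank.example"; /* kept as a block comment */
    return 0;
}
'''

for kind, text in extract_comments(SAMPLE):
    print(kind, text)
```

A real extractor would more likely rely on a Java lexer or parser; the regular expression is chosen here only to keep the sketch short.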
12. Darius Comment Extractor module
In order to discover and identify what type of source code entity is
associated with the comment, the next line after the comment is
extracted too. Darius associates comments with:
1 classes
2 interfaces
3 methods
4 conditionals (if)
5 loops (while and for)
6 switches
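The association step above can be sketched as a classification of the line that follows a comment. This is an illustrative Python sketch under stated assumptions, not the tool's real logic: the patterns are simplified heuristics, the function names are hypothetical, and blank lines between the comment and the entity are skipped (the slide only says "the next line").

```python
import re

# Ordered heuristics: the first pattern that matches wins.
ENTITY_PATTERNS = [
    ("interface", re.compile(r"\binterface\s+\w+")),
    ("class", re.compile(r"\bclass\s+\w+")),
    ("conditional", re.compile(r"^\s*(?:\}\s*)?(?:else\s+)?if\s*\(")),
    ("loop", re.compile(r"^\s*(?:while|for)\s*\(")),
    ("switch", re.compile(r"^\s*switch\s*\(")),
    # A method header: at least one modifier/type token, a name, "(" and no ";".
    ("method", re.compile(r"^\s*(?:[\w<>\[\],.]+\s+)+\w+\s*\([^;]*$")),
]


def classify(line):
    """Guess which kind of source code entity a line of Java code starts."""
    for name, pattern in ENTITY_PATTERNS:
        if pattern.search(line):
            return name
    return "other"


def entity_after(lines, comment_last_line):
    """Classify the first non-blank line after a comment (0-based line index)."""
    for line in lines[comment_last_line + 1:]:
        if line.strip():
            return classify(line)
    return "other"


lines = ["/** Account of a client. */", "", "public class Account {"]
print(entity_after(lines, 0))
```

Because a method header is recognized only when the line contains no semicolon, abstract and interface method declarations fall into "other" in this sketch.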
13. Darius Statistics Calculator module
Statistic Calculator module
Number of comments of a project (global, per type of
comment and per line of source code);
Average number of comment lines per lines of code;
Average number of lines of a non inline comment;
Average number of each type of source code entity which is
commented;
Type of comments most used (global and per source code
entity).
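Some of the measures listed above can be sketched as follows. This is a hedged Python illustration: the input format (a list of `(type, text)` pairs plus a line count) and the function name `comment_statistics` are assumptions, and the per-entity figures are left out.

```python
from collections import Counter


def comment_statistics(comments, lines_of_code):
    """Compute simple comment metrics from (type, text) pairs."""
    counts = Counter(kind for kind, _ in comments)
    # A comment spans one line more than the number of newlines it contains.
    lengths = [(kind, text.count("\n") + 1) for kind, text in comments]
    comment_lines = sum(n for _, n in lengths)
    non_inline = [n for kind, n in lengths if kind != "IC"]
    return {
        "total": len(comments),
        "per_type": dict(counts),
        "comment_lines_per_loc": comment_lines / lines_of_code,
        "avg_non_inline_lines": (
            sum(non_inline) / len(non_inline) if non_inline else 0.0
        ),
        "most_used_type": counts.most_common(1)[0][0] if counts else None,
    }


# Hypothetical data, used only for illustration.
comments = [
    ("JD", "/**\n * Returns the credit.\n */"),
    ("IC", "// look up the account"),
    ("IC", "// default value"),
    ("BC", "/* temporary */"),
]
stats = comment_statistics(comments, lines_of_code=20)
print(stats)
```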
14. Darius Words Analyzer module
Words Analyzer module
Given a list of words extracted from the ontology of the Problem
Domain, Darius computes:
Percentage and frequency of words in the list found in
comments;
Frequency of each type of comment that contains words from
the list;
Frequency of each type of source code entity commented that
contains words from the list.
15. Darius Words Analyzer module
17. Darius Preliminary study
For the preliminary study, 10 open-source software
projects written in Java were selected.
Open-source projects were chosen for two reasons:
1 The source code is freely available;
2 Open-source software projects are heavily used by the
community, which changes and manipulates the source code over
and over again.
These kinds of projects tend to be constantly updated, so
comprehension tasks are involved. Commenting can be a proper
way of helping with these tasks.
18. Darius Preliminary study
Project       Description                 Files  LoC      Classes
iText         PDF Library                 480    145666   403
ganttproject  Project Management Library  530    68945    394
gwt-dev       Google's Web Toolkit        987    192738   803
jEdit         Text Editor                 531    176006   404
vuze          Peer-to-peer client         3284   785935   2463
junit         Tests Framework             154    10926    130
jfreechart    Chart Library               989    313231   876
antlr         Grammar Framework           221    85867    212
jexcelapi     Excel Library               438    93876    166
robocode      Programming Game of Robots  571    81519    485
Total                                     8185   1954709  6336
Table: Description and size of each selected project
19. Darius Preliminary study results
Comment Quantity: 6/10 test programs ≥ 19% comments
20. Darius Preliminary study results
Type of Comments
Project       #CM     CM/LOSC  #IC    #BC    #JD
iText         13343   0.24     4930   3777   4636
ganttproject  4468    0.11     2925   814    729
gwt-dev       12969   0.16     7219   866    4884
jEdit         18986   0.21     806    14421  3759
vuze          27723   0.08     18245  2319   7159
junit         519     0.21     2      77     440
jfreechart    22516   0.27     6592   2530   13394
antlr         5292    0.14     3903   1380   9
jexcelapi     8354    0.26     2213   775    5366
robocode      5071    0.19     3108   102    1861
Total         119241  0.16     63633  13371  42237
Table: Comment frequency in the projects
22. Darius Preliminary study results
Project       If  For  While  Switch  Class  Interf.  Method
iText         IC  IC   IC     IC      JD     JD       JD
ganttproject  IC  IC   IC     IC      JD     JD       JD
gwt-dev       IC  IC   IC     IC      JD     JD       JD
jEdit         IC  IC   IC     IC      JD     JD       JD
vuze          IC  IC   IC     IC      JD     JD       JD
junit         IC  NA   NA     NA      JD     JD       JD
jfreechart    IC  IC   IC     IC      JD     JD       JD
antlr         IC  IC   IC     IC      BC     BC       BC
jexcelapi     IC  IC   BC     NA      JD     JD       JD
robocode      IC  IC   IC     IC      JD     JD       JD
Total         IC  IC   IC     IC      JD     JD       JD
Table: Most used type of comment per type of source code entity
23. Darius Preliminary study results
Comment Content: 10/10 test programs ≥ 23% Problem
and Program Domain terms
24. Darius Preliminary study results
Goal: explore the content of comments by checking whether
comments contain Problem and Program Domain information.
Information necessary to run these tests:
a list of Problem Domain terms for each of the software
projects;
a (single) list of Program Domain terms.
25. Darius Preliminary study results
Project       Problem Domain  Program Domain
iText         92.31           86.76
ganttproject  84.31           75.0
gwt-dev       56.34           86.76
jEdit         89.74           86.76
vuze          92.11           88.24
junit         81.82           67.65
jfreechart    86.36           89.71
antlr         88.24           83.82
jexcelapi     79.31           85.29
robocode      88.89           83.82
Total         82.21           83.38
Table: Percentage of domain words found
26. Darius Preliminary study results
              IC            BC            JD
Project       Prob.  Prog.  Prob.  Prog.  Prob.  Prog.
iText         13.05  13.99  6.89   14.1   14.7   17.36
ganttproject  13.76  13.52  14.78  11.58  14.67  14.2
gwt-dev       0.96   19.7   2.03   18.31  2.62   22.16
jEdit         5.1    17.15  6.44   24.76  9.28   16.69
vuze          4.6    18.02  5.14   11.38  4.29   18.89
junit         0      20.0   17.14  16.57  22.66  25.77
jfreechart    20.7   20.73  16.74  12.45  15.58  21.41
antlr         13.85  13.81  13.95  10.7   2.13   11.35
jexcelapi     10.38  16.16  17.08  12.97  24.97  17.01
robocode      17.0   14.06  16.52  12.6   25.13  12.5
Total         9.97   16.27  8.33   14.58  13.13  19.13
Table: Frequency (%) of words of each Domain per type of comment
27. Darius Preliminary study conclusion
Higher-level source code entities tend to have comments
oriented toward Problem Domain information, while comments
on lower-level entities tend to include more Program
Domain information.
29. Darius - Problem Concept Location
Goal: search for Problem Domain concepts to find the mappings of
these concepts onto the source code.
30. Darius GUI (2)
31. Darius Employed Techniques
Latent Semantic Analysis (LSA)
A natural language processing technique for analyzing
relationships between a set of documents and the terms they
contain by producing a set of concepts related to the documents
and terms.
LSA assumes that words that are close in meaning will occur close
together in text. It constructs a matrix containing word counts per
paragraph (rows represent unique words and columns represent
each paragraph).
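The matrix described above can be sketched in a few lines of Python. This covers only the construction step: LSA then applies a singular value decomposition to the matrix to obtain the concepts, which is omitted here. The function name `term_document_matrix` and the sample paragraphs are hypothetical, and each comment is assumed to play the role of a paragraph.

```python
import re


def term_document_matrix(paragraphs):
    """Build a word-count matrix: one row per unique word, one column per paragraph."""
    tokenized = [re.findall(r"[a-z]+", p.lower()) for p in paragraphs]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    matrix = [[doc.count(term) for doc in tokenized] for term in vocabulary]
    return vocabulary, matrix


# Hypothetical comment texts standing in for paragraphs.
paragraphs = [
    "creates a pdf document",
    "writes the pdf document to a stream",
]
vocabulary, matrix = term_document_matrix(paragraphs)
for term, row in zip(vocabulary, matrix):
    print(term, row)
```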
32. Darius Employed Techniques
Vector Space Model (VSM)
An algebraic model for representing text documents as vectors.
Each dimension corresponds to a separate term. If a term occurs in
the document, its value in the vector is non-zero.
A weight is used to evaluate how important a word is to a
document in a collection. The importance increases
proportionally to the number of times a word appears in the
document.
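A minimal Python sketch of this model follows. It assumes raw term counts as weights and cosine similarity for comparing a query with each comment; the slides do not state which weighting or similarity measure Darius uses, so both are assumptions, as are the function names and sample texts.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Represent a text as a sparse vector of term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical comment texts and query.
documents = [
    "Writes the PDF document to an output stream",
    "Adds a title to the document",
    "Parses a font file",
]
ranked = rank("pdf output stream", documents)
for score, text in ranked:
    print(round(score, 3), text)
```

Without stemming, a query word such as "write" would not match "writes"; a production tool would normally normalize terms first.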
34. Darius Experiment with iText
The program chosen as the subject of this test is iText.
iText contains a sufficient number of comments, and those comments
carry a sufficient amount of Problem and Program Domain
information, so the program can be explored for PC
purposes through its comments.
35. Darius Experiment with iText
36. Darius Experiment with iText
How is a PDF document created?
37. Darius Experiment with iText
How to write a PDF document into an output stream?
38. Darius Experiment with iText
How to add a title to a PDF?
39. Darius Experiment with iText
By going deeper in the searches and using the information discovered
in each executed query, the programmer can incrementally build
knowledge of the software.
The programmer should be able to figure out how every concept is
implemented in the source code, and the relations among them, by
using the information present in comments.
41. Conclusion
Do real-world programs actually contain enough meaningful
comments to justify the analysis effort and the approach proposed?
Using simple but effective queries, the process of locating
concepts using comment information is faster than the
complex task of reading the whole source code of the program.
Darius shows the potential value for comprehension that
comments possess.
As future work:
Questionnaires will be used to understand how a programmer
would deal with Darius.
Develop Darius as a plugin for an IDE (e.g. Eclipse).