Paper: A Comparison of Stemmers on Source Code Identifiers for Software
Search
Authors: Andrew Wiese, Valerie Ho, Emily Hill.
Session: ERA1 - Linguistic Analysis of Software Artifacts
1. A Comparison of Stemmers on
Source Code Identifiers for
Software Search
Andrew Wiese, Valerie Ho, Emily Hill
Montclair State University
Thursday, October 6, 2011
2. Problem: Source Code Search
• Challenge: query words may not exactly
match source code words, which can hurt search
• Example: “add item” query should match
• add, adds, adding, added
• item, items
• Stemming is used by Information Retrieval (IR)
systems to strip suffixes
• reduce all words to root form, or stem
• a.k.a. word conflation
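A minimal sketch of the "add item" example above, assuming a toy suffix stripper. The suffix list and the names `toy_stem` and `matches` are hypothetical and chosen only for illustration; this is not Porter's algorithm or any of the stemmers compared in the talk.

```python
# Toy suffix stripper: a hypothetical illustration, not a real stemmer.
SUFFIXES = ("ing", "ed", "s")  # assumed suffix list, longest first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query: str, identifier_words: list[str]) -> bool:
    """True if every stemmed query word equals some stemmed identifier word."""
    stems = {toy_stem(w) for w in identifier_words}
    return all(toy_stem(q) in stems for q in query.split())


# All four forms of "add" and both forms of "item" conflate to one stem each.
add_stems = {toy_stem(w) for w in ["add", "adds", "adding", "added"]}
item_stems = {toy_stem(w) for w in ["item", "items"]}
```

With this sketch, the query "add item" matches an identifier split into the words "adding" and "items", even though neither word appears literally in the query.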
3. What makes stemming source code
different from traditional IR?
• Word choice more restrictive in naming identifiers
than in natural language (NL) documents
• NL: stem, stems, stemmer, stemming, stemmed
• Code: stem, stemmer
• Classes that encapsulate actions have names with
nominalized verbs:
• play → player
• compile → compiler
• Traditional IR prefers light stemmers such as Porter’s
• tends not to stem across parts of speech
• E.g., noun ‘player’ will not stem to verb ‘play’
4. Stemming Challenges
• Understemming
• stemmer assigns different stems to words in the same concept
• reduces number of relevant results in search
(i.e., reduces recall)
• Overstemming
• stemmer assigns the same stem for words with different
meanings (e.g., business conflated with busy,
university with universe)
• increases number of irrelevant results (i.e., reduces precision)
• Stemmers categorized by type of error
• Light stemmers: understem
• Heavy stemmers: overstem
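The two error types can be illustrated by grouping words into conflation classes under two deliberately crude stemmers. Both `light_stem` and `heavy_stem` below are hypothetical stand-ins (they are not Porter or Paice), picked so that one understems and the other overstems on a small word list.

```python
from collections import defaultdict


def light_stem(word):
    """Hypothetical light stemmer: only strips a plural 's'."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def heavy_stem(word):
    """Hypothetical heavy stemmer: truncates every word to three letters."""
    return word[:3]


def word_classes(words, stem):
    """Group words that share a stem into conflation classes."""
    classes = defaultdict(set)
    for w in words:
        classes[stem(w)].add(w)
    return dict(classes)


WORDS = ["play", "plays", "player", "players", "busy", "business"]
light = word_classes(WORDS, light_stem)  # splits play / player (understemming)
heavy = word_classes(WORDS, heavy_stem)  # merges busy / business (overstemming)
```

Under the light stemmer, "play" and "player" land in different classes, which would lower recall for a "play" query. Under the heavy stemmer, "busy" and "business" share a class, which would lower precision.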
5. A Brief History of Stemming
• Light Stemmers (tend not to stem across parts of speech)
• Porter (1980): rule-based, simple & efficient
• Most popular stemmer in IR & SE
• Snowball (2001): minor rule improvements
• KStem (1993): morphology-based
• based on word’s structure & hand-tuned dictionary
• in experiments shown to outperform Porter’s
• Heavy Stemmers
• Lovins (1968): rule-based
• Paice (1990): rule-based
• MStem: morphological (PC-Kimmo), specialized
for source code using word frequencies
6. Our Contribution
• Compare performance of 5 stemmers on
source code identifiers
• Evaluation 1: compare conflated word classes
• started from 100 most frequently occurring
words in 9,000 open source Java programs
• analyzed by 2 human Java programmers in
terms of accuracy & completeness
• Evaluation 2: compare effect of using 5
stemmers vs not stemming on 8 search tasks
7. Stemmer Word Classes Comparison
• accurate: word class contains no unrelated words
• complete: word class not missing related words
(rely on greediness & diversity of stemmers)
• context sensitive (CS): multiple senses or disagreement
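The two criteria above can be read as set comparisons, sketched below. In the study the judgement came from human programmers; here a hypothetical `related` set stands in for that judgement, and the word classes are taken from the "element" example in this deck, where the MStem class is the one marked accurate and complete.

```python
def is_accurate(word_class: set[str], related: set[str]) -> bool:
    """Accurate: the class contains no unrelated words."""
    return word_class <= related


def is_complete(word_class: set[str], related: set[str]) -> bool:
    """Complete: the class is not missing any related word."""
    return related <= word_class


# Assumed stand-in for the human judgement of which words belong together.
related = {"element", "elemental", "elements"}

kstem_class = {"element"}
porter_class = {"element", "elemental", "elemente", "elements"}
mstem_class = {"element", "elemental", "elements"}
```

Under this assumed judgement, the KStem class is accurate but incomplete, the Porter class is complete but inaccurate (it contains "elemente"), and the MStem class satisfies both.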
[Figure: bar chart. Y-axis: "No. Accurate & Complete" (0 to 100). Bars: None, Context Sensitive (CS), Porter, Paice, Snowball, KStem, MStem. Labeled values: 58%, 53%, 37%, 32%, 29%.]
8. Word Classes Example
• Stemmer comparison for 2 examples
• Underlined words in all stemmer classes
Table I (excerpt): Stemmer word class comparisons (underlined words are in the word classes for all stemmers)
Word (A & C) | Stemmer | Word Class
element (MStem) | Porter | element, elemental, elemente, elements
element (MStem) | Snowball | element, elemental, elemente, elements
element (MStem) | KStem | element
element (MStem) | MStem | element, elemental, elements
element (MStem) | Paice | el, ela, ele, element, elemental, elementary, elemente, elementen, elements, elen, eles, eli, elif, elise, elist, ell, elle, ellen, eller, els, else, elseif, elses, elsif
import (KStem) | Porter | import, importable, importance, important, imported, importer, importers, importing, imports
import (KStem) | Snowball | import, importable, importance, important, importantly, imported, importer, importers, importing, imports
import (KStem) | KStem | import, importable, imported, importer, importers, importing, imports
import (KStem) | MStem | import, importable, importance, important, importantly, imported, importer, importers, importing, imports
import (KStem) | Paice | import, importable, importance, important, importantly, importar, imported, importer, importers, importing, imports
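The words shared by every stemmer's class (the ones underlined on the slide) can be recovered as a set intersection. The sketch below uses the "import" word classes from this slide; the variable names are hypothetical.

```python
# Word classes for "import" as listed on the slide.
core = {"import", "importable", "imported", "importer",
        "importers", "importing", "imports"}

classes = {
    "Porter": core | {"importance", "important"},
    "Snowball": core | {"importance", "important", "importantly"},
    "KStem": set(core),
    "MStem": core | {"importance", "important", "importantly"},
    "Paice": core | {"importance", "important", "importantly", "importar"},
}

# Words present in the class of every stemmer.
common = set.intersection(*classes.values())

# Words each stemmer adds beyond the shared core.
extras = {name: sorted(words - common) for name, words in classes.items()}
```

For this example the shared words are exactly the KStem class, and the `extras` mapping shows where the other stemmers conflate across parts of speech (for example "importance" and "important").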
9. Stemming and Source Code Search
• search technique: tf-idf
• search tasks: 8 with 48 queries from prior study
[Shepherd, et al. ’07]
• Paice: overstemming & understemming mistakes improved
results for 2 tasks (e.g., textfield report element)
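A minimal sketch of tf-idf search with a pluggable stemmer. The slides do not state the exact weighting, so this assumes raw term frequency and idf = log(N / df). Treating each method as a bag of identifier words is also an assumption, and `strip_s`, `rank`, and `DOCS` are hypothetical names introduced only for this example.

```python
import math
from collections import Counter


def strip_s(word):
    """Hypothetical toy stemmer: drop one trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def rank(query, docs, stem=lambda w: w):
    """Rank docs (name -> list of words) by summed tf-idf of the query terms."""
    bags = {name: Counter(stem(w) for w in words) for name, words in docs.items()}
    terms = {stem(q) for q in query.lower().split()}
    scores = {}
    for name, bag in bags.items():
        score = 0.0
        for term in terms:
            df = sum(1 for other in bags.values() if term in other)
            if df:  # skip terms that occur in no document
                score += bag[term] * math.log(len(bags) / df)
        if score > 0:
            scores[name] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Assumed toy corpus: method names mapped to their identifier words.
DOCS = {
    "addItems": ["adds", "items", "list"],
    "removeItem": ["remove", "item", "list"],
    "paint": ["paint", "screen"],
}

unstemmed = [name for name, _ in rank("add item", DOCS)]
stemmed = [name for name, _ in rank("add item", DOCS, strip_s)]
```

Without stemming, only `removeItem` is returned for "add item" because "adds" and "items" do not match the query words literally. With the toy stemmer, `addItems` is returned and ranked first.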
[Figure: chart of search effectiveness per stemmer. Y-axis: "Area Under the Curve" (0.5 to 1.0). X-axis: NoStem, Porter, Snowball, KStem, MStem, Paice.]
10. Conclusion
• Morphological stemmers appear to be more
accurate & complete than rule-based
• In search, stemming more consistently produces
relevant results than not stemming
• Heavy stemmers like MStem & Paice appear to be
more effective in searching source code than light
stemmers like Porter
• Future work: more examples (less frequent &
more domain-specific), more human judgements,
more search tasks, other SE tasks beyond search