A Comparison of Stemmers on Source Code Identifiers for Software Search
Andrew Wiese, Valerie Ho, Emily Hill
Montclair State University




Thursday, October 6, 2011
Problem: Source Code Search
                     • Challenge: Query words may not exactly
                            match source code words & can hurt search
                     •      Example: “add item” query should match
                            • add, adds, adding, added
                            • item, items
                     •      Stemming used by Information Retrieval (IR)
                            systems to strip suffixes
                            • reduce all words to root form, or stem
                            • a.k.a. word conflation
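A minimal sketch of word conflation for the “add item” example above. The `toy_stem` function and its three-suffix list are hypothetical, made up for illustration; this is not Porter's algorithm or any other published stemmer.

```python
# Toy suffix stripper (hypothetical, for illustration only).
SUFFIXES = ("ing", "ed", "s")  # checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(words):
    """Group words by stem; each group is one word class."""
    classes = {}
    for w in words:
        classes.setdefault(toy_stem(w), []).append(w)
    return classes


query_stems = [toy_stem(w) for w in "add item".split()]
code_words = ["add", "adds", "adding", "added", "item", "items"]
word_classes = conflate(code_words)
print(query_stems)
print(word_classes)
```

Because the query words and the code words reduce to the same stems, a search on the stems can match all six code words.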


What makes stemming source code
                  different from traditional IR?
                    •       Word choice more restrictive in naming identifiers
                            than in natural language (NL) documents
                            • NL: stem, stems, stemmer, stemming, stemmed
                            • Code: stem, stemmer
                    •       Classes that encapsulate actions have names with
                            nominalized verbs:
                            • play → player
                            • compile → compiler
                    •       Traditional IR prefers light stemmers such as Porter’s
                            • tends not to stem across parts of speech
                            • E.g., noun ‘player’ will not stem to verb ‘play’

Stemming Challenges
       •       Understemming
             •   stemmer assigns different stems to words in the same concept
             •   reduces number of relevant results in search
                 (i.e., reduces recall)
       •       Overstemming
             •   stemmer assigns the same stem for words with different
                 meanings (e.g., business conflated with busy,
                 university with universe)
             •   increases number of irrelevant results (i.e., reduces precision)
       •       Stemmers categorized by type of error
             •   Light stemmers: understem
             •   Heavy stemmers: overstem
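The two error types can be illustrated with a pair of deliberately crude stemmers. Both `light_stem` and `heavy_stem` are hypothetical stand-ins invented for this sketch. They are not Porter, Paice, or any stemmer evaluated in the talk. They only exaggerate the light and heavy behaviour described above.

```python
from collections import defaultdict


def light_stem(word):
    """Hypothetical light stemmer: strips only a plural 's' (but not 'ss')."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def heavy_stem(word):
    """Hypothetical heavy stemmer: keeps only the first three letters."""
    return word[:3]


def word_classes(stem, words):
    """Group words that share a stem into one word class."""
    classes = defaultdict(list)
    for word in words:
        classes[stem(word)].append(word)
    return dict(classes)


words = ["play", "player", "players", "business", "busy", "university", "universe"]
light = word_classes(light_stem, words)
heavy = word_classes(heavy_stem, words)
print(light)  # 'play' and 'player' land in different classes (understemming)
print(heavy)  # 'business' and 'busy' land in the same class (overstemming)
```

The light variant would miss `player` for a `play` query, which lowers recall. The heavy variant would return `busy` for a `business` query, which lowers precision.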


A Brief History of Stemming
              • Light Stemmers (tend not to stem across parts of speech)
               • Porter (1980): rule-based, simple & efficient
                            •   Most popular stemmer in IR & SE
                            •   Snowball (2001): minor rule improvements
                    •       KStem (1993): morphology-based
                            •   based on word’s structure & hand-tuned dictionary
                            •   shown in experiments to outperform Porter’s
              • Heavy Stemmers
               • Lovins (1968): rule-based
               • Paice (1990): rule-based
               • MStem: morphological (PC-Kimmo), specialized
                            for source code using word frequencies

Our Contribution
                 • Compare performance of 5 stemmers on
                   source code identifiers
                 • Evaluation 1: compare conflated word classes
                  • started from 100 most frequently occurring
                     words in 9,000 open source Java programs
                  • analyzed by 2 human Java programmers in
                     terms of accuracy & completeness
                 • Evaluation 2: compare effect of using 5
                        stemmers vs not stemming on 8 search tasks



Stemmer Word Classes Comparison
            •      accurate: word class contains no unrelated words
            •      complete: word class not missing related words
                   (rely on greediness & diversity of stemmers)
            •      context sensitive (CS): multiple senses or disagreement
            [Figure: bar chart of the number of accurate & complete word classes
             (y-axis: No. Accurate & Complete, 0 to 100) for None, Context Sensitive,
             Porter, Paice, Snowball, KStem, and MStem. Bar labels as positioned in
             the chart: Porter 29%, Paice 32%, Snowball 37%, KStem 53%, MStem 58%.]
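The accurate and complete criteria reduce to two subset checks once a set of related words is agreed on. The sketch below uses the Porter, KStem, and MStem classes reported for ‘element’. The `related` set is an assumption made for this illustration, not the human judges' actual annotation, and `judge` is a hypothetical helper name.

```python
def judge(word_class, related):
    """Return (accurate, complete) for one stemmer's word class.

    related: the words considered to belong to the concept (assumed here).
    """
    word_class, related = set(word_class), set(related)
    accurate = word_class <= related  # contains no unrelated words
    complete = related <= word_class  # misses no related words
    return accurate, complete


# Assumed set of related words, chosen only for illustration.
related = {"element", "elemental", "elements"}

porter = ["element", "elemental", "elemente", "elements"]
kstem = ["element"]
mstem = ["element", "elemental", "elements"]

for name, cls in [("Porter", porter), ("KStem", kstem), ("MStem", mstem)]:
    print(name, judge(cls, related))
```

Under that assumed set, the KStem class is accurate but incomplete, the Porter class is complete but inaccurate because of `elemente`, and the MStem class is both.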
Word Classes Example
            •      Stemmer comparison for 2 examples
            •      Underlined words in all stemmer classes

            Table I: Stemmer word class comparisons for 4 examples
            (underlined words are in the word classes for all stemmers)

            Word (A & C)      Stemmer   Word Class
            element (MStem)   Porter    element, elemental, elemente, elements
                              Snowball  element, elemental, elemente, elements
                              KStem     element
                              MStem     element, elemental, elements
                              Paice     el, ela, ele, element, elemental, elementary,
                                        elemente, elementen, elements, elen, eles,
                                        eli, elif, elise, elist, ell, elle, ellen, eller, els,
                                        else, elseif, elses, elsif
            import (KStem)    Porter    import, importable, importance, important,
                                        imported, importer, importers, importing, imports
                              Snowball  import, importable, importance, important,
                                        importantly, imported, importer, importers,
                                        importing, imports
                              KStem     import, importable, imported, importer,
                                        importers, importing, imports
                              MStem     import, importable, importance, important,
                                        importantly, imported, importer, importers,
                                        importing, imports
                              Paice     import, importable, importance, important,
                                        importantly, importar, imported, importer,
                                        importers, importing, imports
            add (CS)          Porter    add, adde, addes, adds
                              Snowball  add, adde, addes, adds
                              KStem     add, addable, added, addes, adding, adds
                              MStem     add, addable, added, adder, adding, addition,
                                        additional, additionally, additions, additive,
                                        additivity, adds
                              Paice     ad, ada, add, addable, adde, added, adder,
                                        addes, adding, adds, ade, ads
            name              Porter    name, named, namely, names, naming
                              Snowball  name, named, namely, names, naming
                              KStem     name, nameable, named, namer, names, [truncated]

Stemming and Source Code Search
            •      search technique: tf-idf
            •      search tasks: 8 with 48 queries from prior study
                   [Shepherd, et al. ’07]
            •      Paice: overstemming & understemming mistakes improved
                   results for 2 tasks (e.g., textfield report element)
            [Figure: Area Under the Curve (y-axis, 0.5 to 1.0) per stemmer:
             NoStem, Porter, Snowbl, KStem, MStem, Paice.]
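A rough sketch of how stemming can change a tf-idf ranking. Everything here is an assumption made for illustration: the three toy documents, the `toy_stem` stand-in stemmer, and the smoothed idf variant `log(1 + N/df)`. The talk states only that tf-idf was the search technique, so this is not the authors' implementation.

```python
import math
from collections import Counter


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips 'ing', 'ed', or 's'."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text, stem=None):
    words = text.lower().split()
    return [stem(w) for w in words] if stem else words


def tfidf_search(query, docs, stem=None):
    """Rank document names by summed tf-idf of the query terms."""
    bags = {name: Counter(tokenize(text, stem)) for name, text in docs.items()}
    n = len(bags)
    scores = {}
    for name, bag in bags.items():
        score = 0.0
        for term in set(tokenize(query, stem)):
            df = sum(1 for b in bags.values() if term in b)
            if df:
                # tf * idf, with idf = log(1 + N/df) (one common smoothed variant)
                score += bag[term] * math.log(1 + n / df)
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Made-up "methods", each described by the words in its identifiers and comments.
docs = {
    "addItem": "adds items to cart",
    "removeItem": "removing item from cart",
    "drawChart": "draws chart axes",
}

print(tfidf_search("add item", docs))                 # without stemming
print(tfidf_search("add item", docs, stem=toy_stem))  # with stemming
```

Without stemming, the query matches only the literal word `item`, so `addItem` is missed entirely. With stemming, `adds` and `items` conflate with the query words and `addItem` ranks first.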



Conclusion
                     •      Morphological stemmers appear to be more
                            accurate & complete than rule-based

                     •      In search, stemming more consistently produces
                            relevant results than not stemming

                     •      Heavy stemmers like MStem & Paice appear to be
                            more effective in searching source code than light
                            stemmers like Porter

                     •      Future work: more examples (less frequent &
                            more domain-specific), more human judgements,
                            more search tasks, other SE tasks beyond search



 
Personalised Terms Derivative- Semantic Stemming
Personalised Terms Derivative- Semantic StemmingPersonalised Terms Derivative- Semantic Stemming
Personalised Terms Derivative- Semantic Stemming
 
Story generation-Sarah Saneei
Story generation-Sarah SaneeiStory generation-Sarah Saneei
Story generation-Sarah Saneei
 

Dernier

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Dernier (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

ERA - A Comparison of Stemmers on Source Code Identifiers for Software Search

  • 1. A Comparison of Stemmers on Source Code Identifiers for Software Search • Andrew Wiese, Valerie Ho, Emily Hill • Montclair State University • Thursday, October 6, 2011
  • 2. Problem: Source Code Search • Challenge: Query words may not exactly match source code words, which can hurt search • Example: “add item” query should match • add, adds, adding, added • item, items • Stemming is used by Information Retrieval (IR) systems to strip suffixes • reduces all words to a root form, or stem • a.k.a. word conflation
  • 3. What makes stemming source code different from traditional IR? • Word choice is more restrictive in naming identifiers than in natural language (NL) documents • NL: stem, stems, stemmer, stemming, stemmed • Code: stem, stemmer • Classes that encapsulate actions have names with nominalized verbs: • play → player • compile → compiler • Traditional IR prefers the light Porter’s stemmer • tends not to stem across parts of speech • E.g., noun ‘player’ will not stem to verb ‘play’
  • 4. Stemming Challenges • Understemming • stemmer assigns different stems to words in the same concept • reduces number of relevant results in search (i.e., reduces recall) • Overstemming • stemmer assigns the same stem to words with different meanings (e.g., business conflated with busy, university with universe) • increases number of irrelevant results (i.e., reduces precision) • Stemmers categorized by type of error • Light stemmers: understem • Heavy stemmers: overstem
  • 5. A Brief History of Stemming • Light Stemmers (tend not to stem across parts of speech) • Porter (1980): rule-based, simple & efficient • Most popular stemmer in IR & SE • Snowball (2001): minor rule improvements • KStem (1993): morphology-based • based on word’s structure & hand-tuned dictionary • shown in experiments to outperform Porter’s • Heavy Stemmers • Lovins (1968): rule-based • Paice (1990): rule-based • MStem: morphological (PC-Kimmo), specialized for source code using word frequencies
  • 6. Our Contribution • Compare performance of 5 stemmers on source code identifiers • Evaluation 1: compare conflated word classes • started from 100 most frequently occurring words in 9,000 open source Java programs • analyzed by 2 human Java programmers in terms of accuracy & completeness • Evaluation 2: compare effect of using 5 stemmers vs. not stemming on 8 search tasks
  • 7. Stemmer Word Classes Comparison • accurate: word class contains no unrelated words • complete: word class not missing related words (rely on greediness & diversity of stemmers) • context sensitive (CS): multiple senses or disagreement • [Bar chart: No. Accurate & Complete (y-axis, 0 to 100) for None, Context Sensitive, Porter, Paice, Snowball, KStem, MStem; bar labels 58%, 53%, 37%, 32%, 29%]
  • 8. Word Classes Example • Stemmer comparison for 2 examples • Underlined words in all stemmer classes • Table I: Stemmer word class comparisons for 4 examples (underlined words are in the word classes for all stemmers); each word is followed in parentheses by its (A & C) column entry
      • element (MStem)
          • Porter, Snowball: element, elemental, elemente, elements
          • KStem: element
          • MStem: element, elemental, elements
          • Paice: el, ela, ele, element, elemental, elementary, elemente, elementen, elements, elen, eles, eli, elif, elise, elist, ell, elle, ellen, eller, els, else, elseif, elses, elsif
      • import (KStem)
          • Porter: import, importable, importance, important, imported, importer, importers, importing, imports
          • Snowball: import, importable, importance, important, importantly, imported, importer, importers, importing, imports
          • KStem: import, importable, imported, importer, importers, importing, imports
          • MStem: import, importable, importance, important, importantly, imported, importer, importers, importing, imports
          • Paice: import, importable, importance, important, importantly, importar, imported, importer, importers, importing, imports
      • add (CS)
          • Porter: add, adde, addes, adds
          • Snowball: add, adde, addes, adds
          • KStem: add, addable, added, addes, adding, adds
          • MStem: add, addable, added, adder, adding, addition, additional, additionally, additions, additive, additivity, adds
          • Paice: ad, ada, add, addable, adde, added, adder, addes, adding, adds, ade, ads
      • name
          • Porter: name, named, namely, names, naming
          • Snowball: name, named, namely, names, naming
          • KStem: name, nameable, named, namer, names, ...
  • 9. Stemming and Source Code Search • search technique: tf-idf • search tasks: 8 with 48 queries from prior study [Shepherd, et al. ’07] • Paice: overstemming & understemming mistakes improved results for 2 tasks (e.g., textfield report element) • [Chart: Area Under the Curve (y-axis, 0.5 to 1.0) for NoStem, Porter, Snowball, KStem, MStem, Paice]
  • 10. Conclusion • Morphological stemmers appear to be more accurate & complete than rule-based stemmers • In search, stemming more consistently produces relevant results than not stemming • Heavy stemmers like MStem & Paice appear to be more effective in searching source code than light stemmers like Porter • Future work: more examples (less frequent & more domain-specific), more human judgements, more search tasks, other SE tasks beyond search
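
The notions of word class, accurate, complete, understemming, and overstemming used in the slides can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the suffix lists, the `toy_stem` helper, and the hand-written concept sets are hypothetical and are not the rules of Porter, Snowball, KStem, MStem, or Paice, nor the evaluation procedure used in the study.

```python
from collections import defaultdict

# Hypothetical suffix lists for illustration only; NOT the rules of any real stemmer.
LIGHT_SUFFIXES = ("ing", "ed", "s")        # leaves the nominalizing "-er" alone
HEAVY_SUFFIXES = ("ing", "er", "ed", "s")  # also strips "-er" (crosses parts of speech)


def toy_stem(word, suffixes, min_stem=3):
    """Repeatedly strip a matching suffix while at least min_stem characters remain."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word


def word_classes(words, suffixes):
    """Group words that share a stem (word conflation)."""
    classes = defaultdict(set)
    for w in words:
        classes[toy_stem(w, suffixes)].add(w)
    return dict(classes)


def judge(word_class, concept):
    """Return (accurate, complete) for a word class against a hand-defined concept.

    accurate: the class contains no unrelated words
    complete: the class is not missing related words
    """
    return word_class <= concept, concept <= word_class


WORDS = ["play", "plays", "playing", "played", "player", "players",
         "corn", "corns", "corner", "corners"]

# Hand-written "human judgement" of which words belong together (an assumption).
PLAY = {"play", "plays", "playing", "played", "player", "players"}
CORN = {"corn", "corns"}

light = word_classes(WORDS, LIGHT_SUFFIXES)
heavy = word_classes(WORDS, HEAVY_SUFFIXES)

for name, classes in (("light", light), ("heavy", heavy)):
    for stem, concept in (("play", PLAY), ("corn", CORN)):
        accurate, complete = judge(classes[stem], concept)
        print(f"{name:5} {stem:4} accurate={accurate} complete={complete}")
```

In this toy setup the light variant keeps 'player' apart from 'play' (accurate but incomplete, i.e. understemming), while the heavy variant merges 'corner' into the 'corn' class (complete but inaccurate, i.e. overstemming). Real stemmers make these trade-offs with far richer rules or dictionaries.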