A reuse repository with automated synonym support and cluster generation

Laust Rud Jensen
Århus University, 2004

Outline

1. Introduction
2. Problem
3. Solution
4. Experiments
5. Conclusion


1. Introduction

• Constructed a reuse support system
• Fully functional prototype with performance
  enabling interactive use

• Usable for other applications, as it is a general
  system



2. Problem


• Reuse is a Good Thing, but not done enough
• Code reuse repository available, but needs
  search function




Java


• Platform independent
• Easy to use
• Well documented using Javadoc


Javadoc problems

• Browsing for relevant components is
  insufficient:

  • Assumes existing knowledge
  • Information overload

Simple: keyword search

• Exact word matching
• Too literal, but stemming can help
• Vocabulary mismatch: two people choose the same word
  for the same thing less than 20% of the time [Furnas, 1987]
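
The limits of exact matching can be illustrated with a minimal sketch. The `crude_stem` function is a hypothetical suffix stripper written only for this illustration (it is not a real stemming algorithm such as Porter's), and the three sample documents are invented.

```python
def crude_stem(word):
    """Toy stemmer: rewrite a few common English suffixes (illustration only)."""
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("s", "")):
        if (word.endswith(suffix)
                and not word.endswith("ss")
                and len(word) > len(suffix) + 2):
            return word[: -len(suffix)] + replacement
    return word


def keyword_search(query, documents, stem=False):
    """Return indices of documents that contain every query word."""
    normalise = crude_stem if stem else (lambda w: w)
    wanted = {normalise(w) for w in query.lower().split()}
    hits = []
    for index, text in enumerate(documents):
        words = {normalise(w) for w in text.lower().split()}
        if wanted <= words:  # all query words must be present
            hits.append(index)
    return hits


# Invented sample documents.
docs = [
    "creates the directory",
    "create directories recursively",
    "make a new folder",
]

print(keyword_search("create directory", docs))             # exact matching
print(keyword_search("create directory", docs, stem=True))  # with stemming
print(keyword_search("create folder", docs, stem=True))     # synonym problem
```

Stemming lets "create directory" match the first two documents, but no amount of stemming connects "create" with "make" or "directory" with "folder", which is the vocabulary mismatch the slide refers to.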



3. Solution

• Create a search engine
• Use automated indexing methods
• Automatic synonym handling
• Grouping search results to assist in
  information discernment



Search engine technology
• Information Retrieval method
• Vector Space Model
• Latent Semantic Indexing
• Clustering done by existing Open Source
  system



Vector Space Model

X =
             d1  d2  d3  d4  d5  d6
cosmonaut     1   0   1   0   0   0
astronaut     0   1   0   0   0   0
moon          1   1   0   0   0   0
car           1   0   0   1   1   0
truck         0   0   0   1   0   1

Figure 3.1: Example document contents with simple binary term weightings.

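
As a small sketch of the vector space model, the columns of X can be built from the term sets of each document and compared with the cosine measure. The names `TERMS`, `DOCS`, `binary_vector` and `cosine` are chosen for this illustration and do not come from the thesis.

```python
import math

TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]

# Term sets per document, read off the columns of X.
DOCS = {
    "d1": {"cosmonaut", "moon", "car"},
    "d2": {"astronaut", "moon"},
    "d3": {"cosmonaut"},
    "d4": {"car", "truck"},
    "d5": {"car"},
    "d6": {"truck"},
}


def binary_vector(words):
    """Binary term weighting: 1 if the term occurs in the document, else 0."""
    return [1 if term in words else 0 for term in TERMS]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


X_COLUMNS = {name: binary_vector(words) for name, words in DOCS.items()}

for name, column in X_COLUMNS.items():
    print(name, column)

# d2 (astronaut, moon) and d3 (cosmonaut) share no term, so the plain
# vector space model sees them as unrelated.
print(cosine(X_COLUMNS["d2"], X_COLUMNS["d3"]))
```

The zero similarity between d2 and d3, despite "astronaut" and "cosmonaut" being near-synonyms, is the gap that Latent Semantic Indexing is meant to close.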
Latent Semantic Indexing

\[
T_0 = \begin{pmatrix}
 0.44 & 0.13 & 0.48 & 0.70 & 0.26 \\
 0.30 & 0.33 & 0.51 & -0.35 & -0.65 \\
 -0.57 & 0.59 & 0.37 & -0.15 & 0.41 \\
 0.58 & 0.00 & 0.00 & -0.58 & 0.58 \\
 0.25 & 0.73 & -0.61 & 0.16 & -0.09
\end{pmatrix} \tag{3.15}
\]

Figure 3.4: The example T0 matrix resulting from SVD being performed on X from figure 3.1 on page 15.

\[
S_0 = \begin{pmatrix}
 2.16 & 0.00 & 0.00 & 0.00 & 0.00 \\
 0.00 & 1.59 & 0.00 & 0.00 & 0.00 \\
 0.00 & 0.00 & 1.28 & 0.00 & 0.00 \\
 0.00 & 0.00 & 0.00 & 1.00 & 0.00 \\
 0.00 & 0.00 & 0.00 & 0.00 & 0.39
\end{pmatrix} \tag{3.16}
\]

Figure 3.5: The example S0 matrix. The diagonal contains the singular values of X.

k = 2

Latent Semantic Indexing

\[
D_0 = \begin{pmatrix}
 0.75 & 0.29 & -0.28 & -0.00 & -0.53 \\
 0.28 & 0.53 & 0.75 & 0.00 & 0.29 \\
 0.20 & 0.19 & -0.45 & 0.58 & 0.63 \\
 0.45 & -0.63 & 0.20 & -0.00 & 0.19 \\
 0.33 & -0.22 & -0.12 & -0.58 & 0.41 \\
 0.12 & -0.41 & 0.33 & 0.58 & -0.22
\end{pmatrix} \tag{3.17}
\]

Figure 3.6: The example D0 matrix, and the shaded part is the D matrix.

X =
             d1  d2  d3  d4  d5  d6
cosmonaut     1   0   1   0   0   0
astronaut     0   1   0   0   0   0
moon          1   1   0   0   0   0
car           1   0   0   1   1   0
truck         0   0   0   1   0   1

Figure 3.1: Example document contents with simple binary term weightings.

X̂ =                                                          (3.18)
               d1      d2      d3      d4      d5      d6
cosmonaut    0.85    0.52    0.28    0.13    0.21   −0.08
astronaut    0.36    0.36    0.16   −0.21   −0.03   −0.18
moon         1.00    0.71    0.36   −0.05    0.16   −0.21
car          0.98    0.13    0.21    1.03    0.62    0.41
truck        0.13   −0.39   −0.08    0.90    0.41    0.49

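
The rank-2 approximation X̂ can be checked with a short sketch. It assumes that each term's coordinates are read down a column of the printed T0 (so cosmonaut is (0.44, 0.30) for k = 2) and each document's coordinates along a row of D0. The names `T_K`, `S_K`, `D_K` and `reconstruct` are made up for this illustration. Because the inputs are rounded to two decimals, the products agree with the printed X̂ only to within about 0.02.

```python
# Assumption: term coordinates are the first two entries of each COLUMN of the
# printed T0, i.e. the printed matrix lists one singular vector per row.
T_K = {
    "cosmonaut": (0.44, 0.30),
    "astronaut": (0.13, 0.33),
    "moon": (0.48, 0.51),
    "car": (0.70, -0.35),
    "truck": (0.26, -0.65),
}

# The k = 2 largest singular values from S0.
S_K = (2.16, 1.59)

# Document coordinates: the first two entries of each row of D0.
D_K = {
    "d1": (0.75, 0.29),
    "d2": (0.28, 0.53),
    "d3": (0.20, 0.19),
    "d4": (0.45, -0.63),
    "d5": (0.33, -0.22),
    "d6": (0.12, -0.41),
}


def reconstruct(term, doc):
    """One entry of the rank-k approximation: sum over factors of t * s * d."""
    return sum(t * s * d for t, s, d in zip(T_K[term], S_K, D_K[doc]))


# Print the approximated matrix row by row.
for term in T_K:
    row = [round(reconstruct(term, doc), 2) for doc in D_K]
    print(f"{term:<10}", row)
```

Note how the entry for cosmonaut in d2 becomes clearly positive even though the word never occurs there: the shared context "moon" pulls the two space documents together.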
Matching a query

• Query: "moon astronaut"
• Terms: [cosmonaut, astronaut, moon, car, truck]

\[
X_q = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \end{pmatrix}^T
\]

\[
D_q = \begin{pmatrix} 0.28 & 0.53 \end{pmatrix}
    = \begin{pmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 0 \end{pmatrix}^T
      \begin{pmatrix}
       0.44 & 0.13 \\
       0.30 & 0.33 \\
       -0.57 & 0.59 \\
       0.58 & 0.00 \\
       0.25 & 0.73
      \end{pmatrix}
      \begin{pmatrix} 2.16 & 0.00 \\ 0.00 & 1.59 \end{pmatrix}^{-1}
\]

Figure 3.8: Forming Xq and performing the calculations leading to the vector for the query document, Dq.

         d1      d2      d3      d4      d5      d6
X      0.41    1.00    0.00    0.00    0.00    0.00
X̂      0.75    1.00    0.94   −0.45   −0.11   −0.71

Table 3.2: Similarity between query document and original documents.

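
The query folding and the X̂ row of Table 3.2 can be reproduced with a small sketch. It assumes each term's coordinates are read down a column of the printed T0 (cosmonaut = (0.44, 0.30), astronaut = (0.13, 0.33), and so on); multiplying Xq by the first two columns of T0 exactly as printed would not give (0.28, 0.53), which is why that reading is assumed here. The similarity is taken to be the plain cosine between Dq and the first two coordinates of each row of D0. All names (`T_K`, `S_K`, `D_K`, `fold_in`, `cosine`) are hypothetical.

```python
import math

# Assumed term coordinates for k = 2 (see the assumption stated above).
T_K = {
    "cosmonaut": (0.44, 0.30),
    "astronaut": (0.13, 0.33),
    "moon": (0.48, 0.51),
    "car": (0.70, -0.35),
    "truck": (0.26, -0.65),
}
S_K = (2.16, 1.59)  # two largest singular values
D_K = {             # first two coordinates of each row of D0
    "d1": (0.75, 0.29),
    "d2": (0.28, 0.53),
    "d3": (0.20, 0.19),
    "d4": (0.45, -0.63),
    "d5": (0.33, -0.22),
    "d6": (0.12, -0.41),
}


def fold_in(query_words):
    """Project a binary query vector into the LSI space: Xq^T * T_k * S_k^-1."""
    coords = []
    for factor, singular_value in enumerate(S_K):
        total = sum(T_K[term][factor] for term in T_K if term in query_words)
        coords.append(total / singular_value)
    return tuple(coords)


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


d_q = fold_in({"moon", "astronaut"})
print("Dq =", tuple(round(c, 2) for c in d_q))
for name, coords in D_K.items():
    print(name, round(cosine(d_q, coords), 2))
```

Since the inputs are rounded to two decimals, the computed similarities can differ from the table in the last digit. The interesting result is d3: it contains only "cosmonaut", shares no word with the query, and still scores close to 1 in the reduced space.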
Clustering: Carrot

Figure 5.4: Data flow when processing in Carrot from input to output. From the manual, available from the Carrot homepage.

4. Experiments


• Performance measurement
• Tuning representation to data
• Evaluating clusters


Precision and recall

Precision and recall are the traditional measurements for gauging the performance of an IR system. Precision is the proportion of retrieved material which is actually relevant to a given query, and recall is the proportion of relevant material actually retrieved:

\[
\text{precision} = \frac{\#\,\text{relevant retrieved}}{\#\,\text{total retrieved}}
\]

\[
\text{recall} = \frac{\#\,\text{relevant retrieved}}{\#\,\text{total relevant in collection}}
\]

Since the two measurements are defined in terms of each other, what is reported is interpolated precision at recall levels of 10%, 20%, ..., 100%. These are then plotted as a graph.

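
The two definitions translate directly into code. This is a minimal sketch; the function name `precision_recall` and the relevance judgements in the example are invented for illustration and are not taken from the experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: items returned by the system
    relevant:  items judged relevant in the whole collection
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # number of relevant items retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical result list and relevance judgements.
retrieved = ["mkdir()", "mkdirs()", "isDirectory()", "delete()"]
relevant = {"mkdir()", "mkdirs()", "createNewFolder()"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here two of the four retrieved methods are relevant (precision 0.5) and two of the three relevant methods were found (recall about 0.67).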
Performance

[Figure: average precision (y-axis, 0 to 1) against number of factors (x-axis, 0 to 400); series: "Average precision normal" and "Average precision stemmed".]

Precision/Recall

[Figure: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1); series: "Interpolated recall, unstemmed" and "Interpolated recall, stemmed".]

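
A curve of this kind is usually built from interpolated precision. The sketch below uses the common IR convention that the interpolated precision at recall level r is the highest precision observed at any recall of at least r; whether the experiments used exactly this procedure is an assumption. The function name and the toy ranking are invented.

```python
def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at the given recall levels (default 10%..100%)."""
    if levels is None:
        levels = [i / 10 for i in range(1, 11)]
    relevant = set(relevant)

    # (recall, precision) measured each time a relevant item is found.
    points = []
    hits = 0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Highest precision at any recall >= level; 0.0 if that recall is never reached.
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


# Toy ranking: relevant items a, b, c are found at ranks 1, 3, 5; d is never found.
curve = interpolated_precision(["a", "x", "b", "y", "c"], {"a", "b", "c", "d"})
for level, value in zip(range(10, 101, 10), curve):
    print(f"recall {level:3d}%: precision {value:.2f}")
```

Taking the maximum over higher recall levels makes the resulting curve non-increasing, which is why interpolated precision/recall plots never rise from left to right.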
Average precision

[Figure: precision (y-axis, 0 to 0.2) against documents retrieved (x-axis, ticks at 10, 100, 1000); series: "Unstemmed" and "Stemmed".]

Evaluating clusters

Cluster   Elements   Cluster title
Lingo
L1               6   Denoted by this Abstract Pathname
L2               3   Parent Directory
L3               2   File Objects
L4               2   Attributes
L5               2   Value
L6               5   Array of Strings
L7               5   (Other)
7               23   Total listed
STC
S1              16   files, directories, pathname

                       #    Method                   Rel.   L1  L2  L3  L4  L5  L6  L7  S1  S2  S3
                       1    mkdir()                   •     •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   •
                       2    mkdirs()                  •     ◦   •   ◦   ◦   ◦   ◦   ◦   •   ◦   •
                       3    createSubcontext()        ◦     ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   •
                       4    isDirectory()             ◦     •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   ◦
                       5    setCacheDirectory()       ◦     ◦   ◦   ◦   ◦   •   ◦   ◦   •   •   •
                       6    isFile()                  ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   ◦   •
                       7    getCanonicalPath()        ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   ◦   ◦
                       8    delete()                  ◦     •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   ◦
                       9    createSubcontext()        ◦     ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦
                       10   createNewFolder()         ◦     ◦   ◦   ◦   •   ◦   ◦   •   ◦   •   •
                       11   create environment()      ◦     ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   ◦   •
                       12   listFiles()               ◦     •   ◦   ◦   ◦   ◦   •   ◦   •   •   ◦
                       13   getParentDirectory()      ◦     ◦   •   •   ◦   ◦   ◦   ◦   •   ◦   ◦
                       14   list()                    ◦     •   ◦   ◦   ◦   ◦   •   ◦   •   •   ◦
                       15   createTempFile()          ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   •   •
                       16   setCurrentDirectory()     ◦     ◦   •   ◦   ◦   ◦   ◦   ◦   •   •   ◦
                       17   length()                  ◦     •   ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦
                       18   createFileSystemRoot()    ◦     ◦   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   •
                       19   createTempFile()          ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   •   •
                       20   listRoots()               ◦     ◦   ◦   •   ◦   ◦   ◦   ◦   •   ◦   ◦
Future work
• Extensions
  • Feedback mechanism
  • Additional experiments: stop-words,
    weighting

• Applications:
  • Two-way Javadoc integration
  • Other applications; more text
Conclusion


• Fully functional prototype
• Clustering helpful but needs more work
• what else? synonymy vs. polysemy?



Slides for presentation of "A reuse repository with automated synonym support and cluster generation"

  • 1. A reuse repository with automated synonym support and cluster generation Laust Rud Jensen Århus University, 2004
  • 2. Outline 1. Introduction 2. Problem 3. Solution 4. Experiments 5. Conclusion 2
  • 3. 1. Introduction • Constructed a reuse support system • Fully functional prototype with performance enabling interactive use • Usable for other applications, as it is a general system 3
  • 4. Outline 1. Introduction 2. Problem 3. Solution 4. Experiments 5. Conclusion 4
  • 5. 2. Problem • Reuse is a Good Thing, but not done enough • Code reuse repository available, but needs search function 5
  • 6. Java • Platform independent • Easy to use • Well documented using Javadoc 6
  • 7.
  • 8.
  • 9. Javadoc problems • Browsing for relevant components is insufficient: • Assumes existing knowledge • Information overload 9
  • 10. Simple: keyword search • Exact word matching • Too literal, but stemming can help • Vocabulary mismatch, <20% agreement [Furnas, 1987] 10
  • 11. Outline 1. Introduction 2. Problem 3. Solution 4. Experiments 5. Conclusion 11
  • 12. 3. Solution • Create a search engine • Use automated indexing methods • Automatic synonym handling • Grouping search results to assist in information discernment 12
  • 13. Search engine technology • Information Retrieval method • Vector Space Model • Latent Semantic Indexing • Clustering done by existing Open Source system 13
  • 14. ourse the choice of words within the similar doc eems to be some overlap between documents. C Vector Space Model nd d6 . Apparently these documents have nothin   d1 d2 d3 d4 d5 d6  cosmonaut 1 0 1 0 0 0     astronaut 0 1 0 0 0 0  X=  moon   1 1 0 0 0 0    car 1 0 0 1 1 0  truck 0 0 0 1 0 1 Example document contents with simple binary ter 14
  • 15. he example T0 matrix resulting from SVD being performed on X   om figure 3.1 on page 15. 0.44 0.13 0.48 0.70 0.26 Latent Semantic  0.30 0.33  T0 =  −0.57 0.59  0.51 0.37 −0.35 −0.65  −0.15  0.41    CHAPTER 3.  0.58 0.00  TECHNOLOGY 0.00 −0.58 0.58   Indexing 2.16 0.00 0.00 0.00 0.00  0.00 1.59 0.00 0.00 0.00  S0  0.00 0.00 1.28 0.00 0.00   =  0.25 0.73 −0.61 (3.16) 0.16 −0.09  Figure 3.4: The 0.26 example T0 matrix resulting from SVD being p  0.00 0.13 0.44 0.00 0.48 1.00 0.00  0.00 0.70  0.30 0.33 from figure 3.1 on page 15.  0.51 −0.35 0.39 0.00 0.00 0.00 0.00 −0.65  T0 =  −0.57 0.59  0.37 −0.15 0.41  (3.15)  0.58 0.00 0.00 −0.58  0.58the   he example S0 matrix. The diagonal contains singular values 0.00 0.00 0.00 X. 0.25 0.73 −0.61 0.16 −0.09  2.16 0.00   0.00 1.59 0.00 0.00 0.00  S0 =  0.00 0.00 1.28 0.00 0.00    he example T0 matrix resulting from SVD being performed 0.00 0.00 1.00 0.00  on X    0.00 om figure 0.75on page 15. 3.1 0.29 −0.28 −0.00 −0.53 0.00 0.00 0.00 0.00 0.39  0.28 0.53 0.75 0.00 0.29     0.20 0.19 −0.45 3.5: The 0.63  S matrix. The diagonal contains the D0 =   Figure 0.58 example 0   (3.17)  0.45 −0.63 0.20 −0.000.000.19  2.16 0.00 0.00 0.00 of X.    0.00 1.59 0.00 −0.58  0.33 −0.22 −0.12 0.00 0.000.41      0.00 0.00 1.28 0.00 0.00  S0 =0.12 −0.41  0.33 0.58 −0.22  (3.16) k = 2  0.00 0.00 0.00 1.00 0.00   0.75 0.29 −0.28 −0.00 −0.53 0.00 0.00 0.00 0.00 0.3915  0.28 0.53 0.75 0.00 0.29 
  • 16. Latent Semantic Indexing • Figure 3.6: The example D0 matrix, and the shaded part is the D matrix.
    \[
    D_0 = \begin{pmatrix}
    0.75 &  0.29 & -0.28 & -0.00 & -0.53 \\
    0.28 &  0.53 &  0.75 &  0.00 &  0.29 \\
    0.20 &  0.19 & -0.45 &  0.58 &  0.63 \\
    0.45 & -0.63 &  0.20 & -0.00 &  0.19 \\
    0.33 & -0.22 & -0.12 & -0.58 &  0.41 \\
    0.12 & -0.41 &  0.33 &  0.58 & -0.22
    \end{pmatrix} \tag{3.17}
    \]
    Figure 3.1: Example document contents with simple binary term weightings.
    \[
    X = \begin{array}{l|cccccc}
                     & d_1 & d_2 & d_3 & d_4 & d_5 & d_6 \\ \hline
    \text{cosmonaut} & 1 & 0 & 1 & 0 & 0 & 0 \\
    \text{astronaut} & 0 & 1 & 0 & 0 & 0 & 0 \\
    \text{moon}      & 1 & 1 & 0 & 0 & 0 & 0 \\
    \text{car}       & 1 & 0 & 0 & 1 & 1 & 0 \\
    \text{truck}     & 0 & 0 & 0 & 1 & 0 & 1
    \end{array}
    \]
    \[
    \hat{X} = \begin{array}{l|cccccc}
                     & d_1 & d_2 & d_3 & d_4 & d_5 & d_6 \\ \hline
    \text{cosmonaut} & 0.85 &  0.52 &  0.28 &  0.13 &  0.21 & -0.08 \\
    \text{astronaut} & 0.36 &  0.36 &  0.16 & -0.21 & -0.03 & -0.18 \\
    \text{moon}      & 1.00 &  0.71 &  0.36 & -0.05 &  0.16 & -0.21 \\
    \text{car}       & 0.98 &  0.13 &  0.21 &  1.03 &  0.62 &  0.41 \\
    \text{truck}     & 0.13 & -0.39 & -0.08 &  0.90 &  0.41 &  0.49
    \end{array} \tag{3.18}
    \]
  • 17. T   0 0.44 0.13  1   0.30 0.33  Matching a query −1     2.16 0.00 Dq = 0.28 0.53 =  1      −0.57 0.59    0    0.00 1.59 0.58 0.00 0 0.25 0.73 • 3.5.“moon astronaut” CLUSTERING • [cosmonaut, astronaut, moon, car, truck] gure 3.8: Forming X and performing the calculations leading to the vecto q for the query document, D . q Xq = 0 1 1 0 0  T   0 0.44 0.13 d1 d2 d3 d4  d5 d6 X 0.41 1.00 0.00  0.00 1   0.30 0.33   0.00 0.00   2.1 Dq = ˆ 0.28 0.53 =    −0.57 0.59 1     X 0.75 1.00 0.94 −0.45  −0.11 −0.71 0.0 0   0.58 0.00  0 0.25 0.73 Table 3.2: Similarity between query document and original documents. Figure 3.8: Forming Xq and performing the calculations leadi 17
  • 18. Clustering: Carrot • Figure 5.4: Data flow when processing in Carrot from input to output. From the manual, available from the Carrot homepage.
  • 19.
  • 20. Outline 1. Introduction 2. Problem 3. Solution 4. Experiments 5. Conclusion 20
  • 21. 4. Experiments • Performance measurement • Tuning representation to data • Evaluating clusters 21
  • 22. Precision and recall • Precision and recall are the traditional measurements for gauging the performance of an IR system. Precision is the proportion of retrieved material which is actually relevant to a given query, and recall is the proportion of relevant material actually retrieved:
    \[
    \text{precision} = \frac{\#\text{relevant retrieved}}{\#\text{total retrieved}}
    \qquad
    \text{recall} = \frac{\#\text{relevant retrieved}}{\#\text{total relevant in collection}}
    \]
    Because the two measurements are defined in terms of each other, what is reported is the interpolated precision at recall levels of 10%, 20%, . . . , 100%, which are then plotted as a graph. Another measurement is …
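    The two definitions above translate directly into code. The sketch below is illustrative only: the ranking and the relevance judgements are made-up example data, the function names are hypothetical, and interpolated precision is computed with the common IR convention (the highest precision observed at any recall at or above the given level), which is assumed rather than taken from the slides.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a list of retrieved items and a set of relevant ones."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at each recall level (default 10%, 20%, ..., 100%)."""
    if levels is None:
        levels = [i / 10 for i in range(1, 11)]
    if not relevant:
        return [0.0 for _ in levels]
    # Recall and precision after each position in the ranking.
    points = []
    hits = 0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / position))
    # Highest precision reached at any recall >= level (small tolerance for floats).
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


if __name__ == "__main__":
    # Hypothetical ranking for a "create directory" style query.
    ranking = ["mkdir", "isDirectory", "mkdirs", "delete"]
    relevant = {"mkdir", "mkdirs"}
    print(precision_recall(ranking[:2], relevant))
    print(interpolated_precision(ranking, relevant))
```

    Averaging the interpolated values over a set of queries gives one curve per system configuration, which is the kind of comparison plotted for the stemmed and unstemmed runs.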
  • 23. Performance • [Plot: average precision (y-axis, 0 to 1) against number of factors (x-axis, 0 to 400). Series: average precision normal, average precision stemmed.]
  • 24. Precision/Recall • [Plot: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1). Series: interpolated recall, unstemmed; interpolated recall, stemmed.]
  • 25. Average precision • [Plot: precision (y-axis, 0 to 0.2) against documents retrieved (x-axis, 10 to 1000). Series: unstemmed, stemmed.]
  • 26. Evaluating clusters
    Cluster  Elements  Cluster title
    Lingo
    L1       6         Denoted by this Abstract Pathname
    L2       3         Parent Directory
    L3       2         File Objects
    L4       2         Attributes
    L5       2         Value
    L6       5         Array of Strings
    L7       5         (Other)
    7        23        Total listed
    STC
    S1       16        files, directories, pathname
  • 27.
     #  Method                  Rel.  L1  L2  L3  L4  L5  L6  L7  S1  S2  S3
     1  mkdir()                 •     •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   •
     2  mkdirs()                •     ◦   •   ◦   ◦   ◦   ◦   ◦   •   ◦   •
     3  createSubcontext()      ◦     ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   •
     4  isDirectory()           ◦     •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   ◦
     5  setCacheDirectory()     ◦     ◦   ◦   ◦   ◦   •   ◦   ◦   •   •   •
     6  isFile()                ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   ◦   •
     7  getCanonicalPath()      ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   ◦   ◦
     8  delete()                ◦     •   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   ◦
     9  createSubcontext()      ◦     ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦   ◦
    10  createNewFolder()       ◦     ◦   ◦   ◦   •   ◦   ◦   •   ◦   •   •
    11  create environment()    ◦     ◦   ◦   •   ◦   ◦   ◦   ◦   ◦   ◦   •
    12  listFiles()             ◦     •   ◦   ◦   ◦   ◦   •   ◦   •   •   ◦
    13  getParentDirectory()    ◦     ◦   •   •   ◦   ◦   ◦   ◦   •   ◦   ◦
    14  list()                  ◦     •   ◦   ◦   ◦   ◦   •   ◦   •   •   ◦
    15  createTempFile()        ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   •   •
    16  setCurrentDirectory()   ◦     ◦   •   ◦   ◦   ◦   ◦   ◦   •   •   ◦
    17  length()                ◦     •   ◦   ◦   ◦   •   ◦   ◦   ◦   ◦   ◦
    18  createFileSystemRoot()  ◦     ◦   ◦   ◦   ◦   ◦   ◦   ◦   •   ◦   •
    19  createTempFile()        ◦     ◦   ◦   ◦   ◦   ◦   ◦   •   •   •   •
    20  listRoots()             ◦     ◦   ◦   •   ◦   ◦   ◦   ◦   •   ◦   ◦
  • 28.
  • 29. Outline 1. Introduction 2. Problem 3. Solution 4. Experiments 5. Conclusion 29
  • 30. Future work • Extensions • Feedback mechanism • Additional experiments: stop-words, weighting • Applications: • Two-way Javadoc integration • Other applications; more text 30
  • 31. Conclusion • Fully functional prototype • Clustering helpful but needs more work • what else? synonymy vs. polysemy? 31