SlideShare une entreprise Scribd logo
1  sur  23
Télécharger pour lire hors ligne
Using Wikipedia as a reference
    for extracting semantic
    information from a text

            Andrea Prato
                  &
          Marco Ronchetti
      Università di Trento, Italy
Explicit Semantic Analysis




                             Gabrilovich
                             Markovich
                             2007
Throw away:

Stopwords
Fragment pages (<100 words)
Suffixes (stemming)
- Leukemia
                                                - Severe combined
                                                immunodeficiency
    A sample (ESA)                              - Cancer
                                                -Non-Hodgkin lymphoma
The development of T-cell leukaemia             - AIDS
   following the otherwise successful           -ICD-10 Chapter II:
   treatment of three patients with X-linked
   severe combined immune deficiency (X-
                                                Neoplasms;
   SCID) in gene-therapy trials using           -Chapter III: Diseases of the
   haematopoietic stem cells has led to a re-   blood and blood-forming
   evaluation of this approach. Using a
   mouse model for gene therapy of X-
                                                organs, and certain
   SCID, we find that the corrective             disorders involving the
   therapeutic gene IL2RG itself can act as     immune mechanism
   a contributor to the genesis of T-cell
   lymphomas, with one-third of animals
                                                - Bone marrow transplant
   being affected. Gene-therapy trials for X-   - Immunosuppressive drug
   SCID, which have been based on the           - Acute lymphoblastic
   assumption that IL2RG is minimally
   oncogenic, may therefore pose some risk
                                                leukemia
   to patients.                                 - Multiple sclerosis.
1-Glossary_of_cue_sports_terms
    A sample (ESA)                               2-Swimming,
                                                 3-Ian_Thorpe.
                                                 4-NCAA_football_bowl_games,
Being so tightly packed, Venice doesn't          2005-06,
   make an ideal place to come to practise
                                                 5-Swimming_machine,
   your favourite sport, although you'll get a
                                                 6-American_football_strategy,
   decent workout just walking around and
   up and down bridges! If you've got any        7-Contract_bridge_glossary,
   energy left for some extra exercise, try a    8-Olympic_Games,
   spot of swimming (although pools are          9-Pingu_episodes_series_6,
   rare) or even a jog. Venice is a bit of a     10-Venice.
   desert for swimmers. You can go in off        …
   the Lido (if you're game) or at one of        15 - Corruption_in_Ghana
   Venice's two public swimming pools            …
   (handily, they close in summer).              27 - Legislative_system_of_the
Lonely Planet Tourist Guide                      Peopleʼs_Republic_of_China.
Clustering
Wikipedia is hyperlinked
Swimming is clustered with Olympic Games
1-Glossary_of_cue_sports_terms
    A sample (ESA)                               2-Swimming,
                                                 3-Ian_Thorpe.
                                                 4-NCAA_football_bowl_games,
Being so tightly packed, Venice doesn't          2005-06,
   make an ideal place to come to practise
                                                 5-Swimming_machine,
   your favourite sport, although you'll get a
                                                 6-American_football_strategy,
   decent workout just walking around and
   up and down bridges! If you've got any        7-Contract_bridge_glossary,
   energy left for some extra exercise, try a    8-Olympic_Games,
   spot of swimming (although pools are          9-Pingu_episodes_series_6,
   rare) or even a jog. Venice is a bit of a     10-Venice.
   desert for swimmers. You can go in off        …
   the Lido (if you're game) or at one of        15 - Corruption_in_Ghana
   Venice's two public swimming pools            …
   (handily, they close in summer).              27 - Legislative_system_of_the
Lonely Planet Tourist Guide                      Peopleʼs_Republic_of_China.
Throw away:

Large aggregators
   Category links
   Numbers
   Pages with more than (N=100) links
After clustering:

 only 3 clusters with cardinality larger than 1.
 The first cluster, with cardinality 21, was
  automatically named Swimming.
 The second and the third both have cardinality
  equal to 2, and they are named Training and
  Venice-bucentaur.
Which one is
                          machine -generated?
Validation: Turing test


                            Classification



   Text                     Classification



                            Classification
20 texts of length
Outcome   ranging between 60
          and 200 words. Texts
          were collected from
          various sources like
          newspaper articles,
          text books, random
          web pages, MSN
          Encarta.
Further improvements
Using only nouns

Using a POS Tagger to identify syntactic
 roles in document to be classified
Keep only names (throw away the rest)


No degradation in the results!
Define Multiwords

 Lexical multiword identification approach:
 The following generative pattern is considered
 ((Adj∣Noun) + ∣((Adj∣Noun) ∗ (Noun     Prep)?)
               (Adj∣Noun)∗)Noun

  +: One or more *: Zero or more ?: Zero or one ∣: Or


Validation: A candidate multiword is valid if there
          is a Wikipedia entry related to it.
Text with multiwords:

Keep all nouns
Keep all adjectives that are part of a
 multiword
Evaluation (human inspection of
results)
100 samples (50 technical, 50 generic)
Multiword improved significanty 7 (5 technical)
It improved marginally 13
It worsened marginally 6


Overall improvement: 10/% on technical text
Work in progress
Concept-mediated mapping
among documents
How similar are two docs?
                                   Jaccard Index



           Concept 1

           Concept 2   Concept 2
  Doc 1                                  Doc 3
           Concept 3   Concept 3

                       Concept 4
Syllabi comparison
Inter
links
Mapping documents in different
  languages
   Deploying Wikipedia Interlinks
                                         Jaccard Index



          Concept 1

          Concept 2          Concept 2
Doc 1                                             Doc 3
          Concept 3          Concept 3

                  INTERLINKS Concept 4

Contenu connexe

Similaire à Using Wikipedia as a reference for extracting semantic information

kurous case neural text.pdf
kurous case neural text.pdfkurous case neural text.pdf
kurous case neural text.pdfYawarAbbas73
 
Variability, Bugs, and Cognition
Variability, Bugs, and CognitionVariability, Bugs, and Cognition
Variability, Bugs, and CognitionAndrzej Wasowski
 
DNA memories
DNA memoriesDNA memories
DNA memoriesHoda msw
 
SFSCON23 - Michele Finelli - Management of large genomic data with free software
SFSCON23 - Michele Finelli - Management of large genomic data with free softwareSFSCON23 - Michele Finelli - Management of large genomic data with free software
SFSCON23 - Michele Finelli - Management of large genomic data with free softwareSouth Tyrol Free Software Conference
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streamingc.titus.brown
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible researchYannick Wurm
 
2015 10-7-11am-reproducible research
2015 10-7-11am-reproducible research2015 10-7-11am-reproducible research
2015 10-7-11am-reproducible researchYannick Wurm
 
Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...
Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...
Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...IBM India Smarter Computing
 
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...CSCJournals
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talkc.titus.brown
 
Quality Assessment of Biomedical Metadata using Topic Modeling
Quality Assessment of Biomedical Metadata using Topic ModelingQuality Assessment of Biomedical Metadata using Topic Modeling
Quality Assessment of Biomedical Metadata using Topic ModelingStuti Nayak
 
SIGEVOlution Summer 2007
SIGEVOlution Summer 2007SIGEVOlution Summer 2007
SIGEVOlution Summer 2007Pier Luca Lanzi
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsmikaelhuss
 
Gene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KGene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KBenjamin Good
 
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERHPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERcscpconf
 

Similaire à Using Wikipedia as a reference for extracting semantic information (20)

kurous case neural text.pdf
kurous case neural text.pdfkurous case neural text.pdf
kurous case neural text.pdf
 
Variability, Bugs, and Cognition
Variability, Bugs, and CognitionVariability, Bugs, and Cognition
Variability, Bugs, and Cognition
 
DNA memories
DNA memoriesDNA memories
DNA memories
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
SFSCON23 - Michele Finelli - Management of large genomic data with free software
SFSCON23 - Michele Finelli - Management of large genomic data with free softwareSFSCON23 - Michele Finelli - Management of large genomic data with free software
SFSCON23 - Michele Finelli - Management of large genomic data with free software
 
Ismb2009
Ismb2009Ismb2009
Ismb2009
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
2015 10-7-11am-reproducible research
2015 10-7-11am-reproducible research2015 10-7-11am-reproducible research
2015 10-7-11am-reproducible research
 
Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...
Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...
Enabling next-generation sequencing applications with IBM Storwize V7000 Unif...
 
Bioinformatics.pptx
Bioinformatics.pptxBioinformatics.pptx
Bioinformatics.pptx
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
International Journal of Biometrics and Bioinformatics(IJBB) Volume (3) Issue...
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
Quality Assessment of Biomedical Metadata using Topic Modeling
Quality Assessment of Biomedical Metadata using Topic ModelingQuality Assessment of Biomedical Metadata using Topic Modeling
Quality Assessment of Biomedical Metadata using Topic Modeling
 
SIGEVOlution Summer 2007
SIGEVOlution Summer 2007SIGEVOlution Summer 2007
SIGEVOlution Summer 2007
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomics
 
Gene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KGene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2K
 
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLERHPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER
 

Dernier

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Dernier (20)

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Using Wikipedia as a reference for extracting semantic information

  • 1. Using Wikipedia as a reference for extracting semantic information from a text Andrea Prato & Marco Ronchetti Università di Trento, Italy
  • 2. Explicit Semantic Analysis Gabrilovich Markovich 2007
  • 3. Throw away: Stopwords Fragment pages (<100 words) Suffixes (stemming)
  • 4. - Leukemia - Severe combined immunodeficiency A sample (ESA) - Cancer -Non-Hodgkin lymphoma The development of T-cell leukaemia - AIDS following the otherwise successful -ICD-10 Chapter II: treatment of three patients with X-linked severe combined immune deficiency (X- Neoplasms; SCID) in gene-therapy trials using -Chapter III: Diseases of the haematopoietic stem cells has led to a re- blood and blood-forming evaluation of this approach. Using a mouse model for gene therapy of X- organs, and certain SCID, we find that the corrective disorders involving the therapeutic gene IL2RG itself can act as immune mechanism a contributor to the genesis of T-cell lymphomas, with one-third of animals - Bone marrow transplant being affected. Gene-therapy trials for X- - Immunosuppressive drug SCID, which have been based on the - Acute lymphoblastic assumption that IL2RG is minimally oncogenic, may therefore pose some risk leukemia to patients. - Multiple sclerosis.
  • 5. 1-Glossary_of_cue_sports_terms A sample (ESA) 2-Swimming, 3-Ian_Thorpe. 4-NCAA_football_bowl_games, Being so tightly packed, Venice doesn't 2005-06, make an ideal place to come to practise 5-Swimming_machine, your favourite sport, although you'll get a 6-American_football_strategy, decent workout just walking around and up and down bridges! If you've got any 7-Contract_bridge_glossary, energy left for some extra exercise, try a 8-Olympic_Games, spot of swimming (although pools are 9-Pingu_episodes_series_6, rare) or even a jog. Venice is a bit of a 10-Venice. desert for swimmers. You can go in off … the Lido (if you're game) or at one of 15 - Corruption_in_Ghana Venice's two public swimming pools … (handily, they close in summer). 27 - Legislative_system_of_the Lonely Planet Tourist Guide Peopleʼs_Republic_of_China.
  • 8. Swimming is clustered with Olympic Games
  • 9. 1-Glossary_of_cue_sports_terms A sample (ESA) 2-Swimming, 3-Ian_Thorpe. 4-NCAA_football_bowl_games, Being so tightly packed, Venice doesn't 2005-06, make an ideal place to come to practise 5-Swimming_machine, your favourite sport, although you'll get a 6-American_football_strategy, decent workout just walking around and up and down bridges! If you've got any 7-Contract_bridge_glossary, energy left for some extra exercise, try a 8-Olympic_Games, spot of swimming (although pools are 9-Pingu_episodes_series_6, rare) or even a jog. Venice is a bit of a 10-Venice. desert for swimmers. You can go in off … the Lido (if you're game) or at one of 15 - Corruption_in_Ghana Venice's two public swimming pools … (handily, they close in summer). 27 - Legislative_system_of_the Lonely Planet Tourist Guide Peopleʼs_Republic_of_China.
  • 10. Throw away: Large aggregators  Category links  Numbers  Pages with more than (N=100) links
  • 11. After clustering:  only 3 clusters with cardinality larger than 1.  The first cluster, with cardinality 21, was automatically named Swimming.  The second and the third both have cardinality equal to 2, and they are named Training and Venice-bucentaur.
  • 12. Which one is machine -generated? Validation: Turing test Classification Text Classification Classification
  • 13. 20 texts of length Outcome ranging between 60 and 200 words. Texts were collected from various sources like newspaper articles, text books, random web pages, MSN Encarta.
  • 15. Using only nouns Using a POS Tagger to identify syntactic roles in document to be classified Keep only names (throw away the rest) No degradation in the results!
  • 16. Define Multiwords  Lexical multiword identification approach:  The following generative pattern is considered ((Adj∣Noun) + ∣((Adj∣Noun) ∗ (Noun Prep)?) (Adj∣Noun)∗)Noun +: One or more *: Zero or more ?: Zero or one ∣: Or Validation: A candidate multiword is valid if there is a Wikipedia entry related to it.
  • 17. Text with multiwords: Keep all nouns Keep all adjectives that are part of a multiword
  • 18. Evaluation (human inspection of results) 100 samples (50 technical, 50 generic) Multiword improved significanty 7 (5 technical) It improved marginally 13 It worsened marginally 6 Overall improvement: 10/% on technical text
  • 20. Concept-mediated mapping among documents How similar are two docs? Jaccard Index Concept 1 Concept 2 Concept 2 Doc 1 Doc 3 Concept 3 Concept 3 Concept 4
  • 23. Mapping documents in different languages Deploying Wikipedia Interlinks Jaccard Index Concept 1 Concept 2 Concept 2 Doc 1 Doc 3 Concept 3 Concept 3 INTERLINKS Concept 4