Multilingual Named Entity Recognition
           using Wikipedia
    Laboratory for Knowledge Discovery in Databases
   Department of Computing and Information Sciences
                 Kansas State University
     http://www.kddresearch.org/tikiwiki/tiki-index.php




              Presenter: Svitlana O. Volkova
                 Instructor: William Hsu
AGENDA

I.     Project Overview
II.    Crawling Wikipedia
III.   Synonymy Discovery with Google Sets
IV.    Experiment Design
V.     Conclusions
PROJECT MILESTONES

Input: Crawler Functionality
CRAWLING WIKIPEDIA
Output: Set of Multilingual Gazetteers


      Input: Initial Gazetteer in one Language
      RELATIONSHIP DISCOVERY WITH GOOGLESETS
      Output: Extended Gazetteer with Synonyms


             Input: Extended Gazetteer with Synonyms + Content
             MULTILINGUAL NER TASK
             Output: Extracted Entities from the Content
KEY IDEA - WIKIPEDIA
 Apply Wikipedia knowledge representation for
  multilingual information extraction
             English Wiki Concepts of Interest
      …, anthrax, bovine virus, …, camelpox, surra, …




             http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis



           Russian Wiki Concepts of Interest
 …, Зоонозы, Классическая чума свиней, Лептоспироз, …
CRAWLING WIKIPEDIA



Multilingual NER
(article + category + interwiki links)


                      Wiki Category Graph and Article Graph
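
The crawl over article, category, and interwiki links could be sketched as below. This is a minimal sketch under stated assumptions, not the project's crawler: the in-memory dictionaries `subcategories`, `members`, and `interwiki` are hypothetical stand-ins for data fetched from Wikipedia, and their entries are illustrative only.

```python
from collections import deque

# Hypothetical in-memory stand-ins for crawled Wikipedia data (assumptions).
subcategories = {"Animal diseases": ["Zoonoses"], "Zoonoses": []}
members = {
    "Animal diseases": ["Camelpox", "Surra"],
    "Zoonoses": ["Anthrax", "Leptospirosis"],
}
interwiki = {
    "Leptospirosis": {"ru": "Лептоспироз", "de": "Leptospirose"},
    "Anthrax": {"de": "Milzbrand"},
}


def build_gazetteers(root, source_lang="en"):
    """Breadth-first walk of the category graph, collecting titles per language."""
    gazetteers = {source_lang: set()}
    seen, queue = {root}, deque([root])
    while queue:
        category = queue.popleft()
        for title in members.get(category, []):
            gazetteers[source_lang].add(title)
            # Interwiki links give the same concept's title in other languages.
            for lang, translated in interwiki.get(title, {}).items():
                gazetteers.setdefault(lang, set()).add(translated)
        for child in subcategories.get(category, []):
            if child not in seen:  # guard against cycles in the category graph
                seen.add(child)
                queue.append(child)
    return gazetteers


gazetteers = build_gazetteers("Animal diseases")
```

The `seen` set matters because a category graph is not guaranteed to be a tree; without it a cycle would loop forever.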
GAZETTEERS EXAMPLES IN DIFFERENT
           LANGUAGES
GAZETTEERS SIZE IN DIFFERENT LANGUAGES
[Chart: gazetteer size by language. Legend: English, Japanese, German, Russian. Data labels: 19, 37, 86, 20.]

Decision: the dictionaries are too small, so we need to find a way to extend them.
GAZETTEERS EXAMPLES:
GERMAN GOOGLE SETS OUTPUT
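
Extending a gazetteer with synonyms could look roughly like the sketch below. It does not call Google Sets: `related_terms` is a hypothetical lookup table standing in for whatever a set-expansion step returns, and the function name and sample terms are assumptions made for illustration.

```python
# Hypothetical output of a set-expansion step, keyed by seed term (assumption).
related_terms = {
    "foot and mouth disease": ["FMD", "hoof-and-mouth disease"],
    "rift valley fever": ["RVF"],
}


def extend_gazetteer(seed_terms, expansions):
    """Map every surface form (seed or synonym) to its canonical seed term."""
    extended = {}
    for canonical in seed_terms:
        extended[canonical.lower()] = canonical
        for synonym in expansions.get(canonical.lower(), []):
            # If two seeds share a synonym, keep the first canonical name seen.
            extended.setdefault(synonym.lower(), canonical)
    return extended


extended = extend_gazetteer(
    ["Foot and mouth disease", "Rift Valley Fever"], related_terms
)
```

Keying the result by lowercased surface form keeps the later lookup case-insensitive while the values preserve the canonical spelling.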
EXPERIMENT SETUP
 Purpose: to perform a named entity recognition task in a
  specific domain and report the accuracy of extraction using
  a) Wiki knowledge
  b) Extended lists with synonyms from Google Sets


 Hypothesis: the synonym extraction phase is essential
  for increasing the accuracy of the information extraction task
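
One way to compare the two conditions is sketched below. The slides speak of accuracy; this sketch swaps in exact-match precision and recall, which is a common choice for extraction tasks but an assumption here. The mention sets are hypothetical toy data, not results from the project.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of extracted mentions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotated mentions and extractor outputs (illustrative only).
gold = {"foot and mouth disease", "FMD", "anthrax"}
wiki_only = {"foot and mouth disease", "anthrax"}          # condition a)
with_synonyms = {"foot and mouth disease", "FMD", "anthrax"}  # condition b)

scores_wiki = precision_recall(wiki_only, gold)
scores_extended = precision_recall(with_synonyms, gold)
```

Under the hypothesis, the extended gazetteer should raise recall; whether precision holds up depends on how noisy the added synonyms are.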
DISEASE EXTRACTOR MODULE
                 INPUT AND OUTPUT
Input: text from file
Module: Disease Extractor
Output:
  Index of the first character
  Index of the last character
  Length of the matched text
  Matched text
  Canonical disease name
Disease Extraction Task
  The task of disease recognition can be considered an NER/information
    extraction (IE) task
  The main purpose is to retrieve tokens that match at least one term, including
    its synonyms and abbreviations, from the list of animal disease names
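
A gazetteer lookup that yields the five output fields listed above might be sketched as follows. This is not the module's actual implementation: the `Match` record, the `extract_diseases` function, the regex-based matching, and the reading of "index of the last character" as inclusive are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Match:
    start: int      # index of the first character
    end: int        # index of the last character (inclusive, by assumption)
    length: int     # length of the matched text
    text: str       # matched text as it appears in the input
    canonical: str  # canonical disease name


def extract_diseases(text, gazetteer):
    """Find gazetteer terms in text; gazetteer maps surface form -> canonical name."""
    if not gazetteer:
        return []
    # Longest terms first so a long name wins over a shorter overlapping one.
    terms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): name for term, name in gazetteer.items()}
    return [
        Match(m.start(), m.end() - 1, len(m.group()), m.group(),
              lookup[m.group().lower()])
        for m in pattern.finditer(text)
    ]


# Hypothetical gazetteer entries for illustration.
gazetteer = {
    "foot and mouth disease": "Foot-and-mouth disease",
    "FMD": "Foot-and-mouth disease",
}
matches = extract_diseases(
    "Foot and mouth disease (FMD) is highly contagious.", gazetteer
)
```

Word-boundary matching works for space-delimited languages such as those in the examples that follow; a language without whitespace word breaks, such as Japanese, would need a different tokenization step.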
CONTEXT EXAMPLES IN DIFFERENT LANGUAGES
DUTCH
    Leptospirose komt voor in alle landen, behalve het Noordpoolgebied. De incidentie is hoog.
      Meer dan de helft van de gevallen voordoet in ernstige en vereiste reanimatie.
CZECH
    Leptospiróza se vyskytuje ve všech zemích s výjimkou Arktidy. Incidence je vysoká. Více než
      polovina případů se vyskytuje v těžké a vyžaduje resuscitaci.
GERMAN
    Leptospirose tritt in allen Ländern, mit Ausnahme der Arktis. Die Inzidenz ist hoch. Mehr als
      die Hälfte der Fälle tritt in schweren und Reanimation erforderlich.
ITALIAN
     Leptospirosi si verifica in tutti i paesi, tranne l'Artico. L'incidenza è alta. Più della metà dei
      casi si verifica in rianimazione grave e richiesti.
UKRAINIAN
     Лептоспіроз відбувається в усіх країнах, за винятком Арктики. Захворюваність висока.
      Більше половини випадків відбувається в суворих і необхідність реанімації.
RUSSIAN
     Лептоспироз происходит во всех странах, за исключением Арктики. Заболеваемость
      высокая. Более половины случаев происходит в суровых и необходимости реанимации.
DISEASE EXTRACTOR MODULE DEMO
http://fingolfin.user.cis.ksu.edu:8080/diseaseextractor/
RESULTS FOR DISEASE EXTRACTOR MODULE

       INPUT A                OUTPUT A
Foot and mouth disease is
one of the most contagious
diseases of cloven-hooved
mammals…

       INPUT B                OUTPUT B
Rift Valley Fever | CDC
Special Pathogens Branch
Mission Statement Disease …
CONCLUSIONS
 Applying Wikipedia knowledge for the multilingual NER task


 Phase 1: Crawling Wiki – completed
 Phase 2: Google Sets Expansion – completed
 Phase 3: Multilingual Disease Extraction – in progress


 Novelty: Overcome Wiki limitations by applying Google Sets
  expansion approach

 In order to estimate accuracy we need to have annotated data in
  different languages
ACKNOWLEDGEMENTS

 Dr. William Hsu for meaningful guidance




 John Drouhard for building extraction architecture




 Landon Fowles for expanding gazetteers using Google Sets
