1. Multilingual Named Entity Recognition
using Wikipedia
Laboratory for Knowledge Discovery in Databases
Department of Computing and Information Sciences
Kansas State University
http://www.kddresearch.org/tikiwiki/tiki-index.php
Presenter: Svitlana O. Volkova
Instructor: William Hsu
2. AGENDA
I. Project Overview
II. Crawling Wikipedia
III. Synonymy Discovery with Google Sets
IV. Experiment Design
V. Conclusions
4. PROJECT MILESTONES
Input: Crawler Functionality
CRAWLING WIKIPEDIA
Output: Set of Multilingual Gazetteers
Input: Initial Gazetteer in one Language
RELATIONSHIP DISCOVERY WITH GOOGLESETS
Output: Extended Gazetteer with Synonyms
Input: Extended Gazetteer with Synonyms + Content
MULTILINGUAL NER TASK
Output: Extracted Entities from the Content
5. KEY IDEA - WIKIPEDIA
Apply Wikipedia knowledge representation for
multilingual information extraction
English Wiki Concepts of Interest
…, anthrax, bovine virus, …, camelpox, surra, …
http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis
Russian Wiki Concepts of Interest
…, Зоонозы (zoonoses), Классическая чума свиней (classical swine fever), Лептоспироз (leptospirosis), …
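One plausible way to connect an English concept to its counterparts in other languages is through Wikipedia's interlanguage links. The slides do not say how the crawler does this, so the following is only a sketch. It assumes a response shaped like the MediaWiki API's action=query&prop=langlinks output with formatversion=2; the function name and the sample response are hypothetical and hand-written, not crawl output.

```python
from collections import defaultdict


def gazetteers_from_langlinks(response, source_lang="en"):
    """Group page titles by language code.

    Assumes each page in the response carries a "langlinks" list of
    {"lang": ..., "title": ...} entries (formatversion=2 layout).
    """
    gazetteers = defaultdict(set)
    for page in response.get("query", {}).get("pages", []):
        # The page's own title belongs to the source-language gazetteer.
        gazetteers[source_lang].add(page["title"])
        for link in page.get("langlinks", []):
            gazetteers[link["lang"]].add(link["title"])
    return dict(gazetteers)


# Hand-written illustrative response fragment (an assumption, not real output).
sample = {
    "query": {
        "pages": [
            {
                "title": "Anthrax",
                "langlinks": [
                    {"lang": "ru", "title": "Сибирская язва"},
                    {"lang": "de", "title": "Milzbrand"},
                ],
            },
            {
                "title": "Leptospirosis",
                "langlinks": [{"lang": "ru", "title": "Лептоспироз"}],
            },
        ]
    }
}

gazetteers = gazetteers_from_langlinks(sample)
```

Each language code then maps to its own set of disease names, which is the "set of multilingual gazetteers" named as the crawler output.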
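9. GAZETTEER SIZES IN DIFFERENT LANGUAGES
[Chart: number of gazetteer entries per language. Languages shown: English, Japanese, German, Russian. Values shown: 19, 37, 86, 20. The pairing of values with languages is not recoverable from the extracted text.]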
Decision: the dictionaries are too small, so we need to find a way to
extend them.
13. EXPERIMENT SET UP
Purpose: to perform a named entity recognition task in a
specific domain and report the accuracy of extraction using
a) Wikipedia knowledge
b) Extended lists with synonyms from Google Sets
Hypothesis: the synonym extraction phase is essential
for increasing the accuracy of the information extraction task
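The slides do not specify how accuracy is measured. A common choice for extraction tasks is precision, recall, and F1 over matched spans, which is what the sketch below assumes. The spans and the two system outputs are hypothetical values chosen only to show how conditions (a) and (b) could be compared against the same annotated data.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted spans against gold spans by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (start, end) character spans of annotated disease mentions.
gold_spans = {(0, 7), (20, 33), (50, 53)}

# Hypothetical system outputs for the two experimental conditions.
wiki_only = {(0, 7)}                       # a) Wikipedia gazetteer only
with_synonyms = {(0, 7), (50, 53), (60, 65)}  # b) gazetteer extended with synonyms

score_a = precision_recall_f1(wiki_only, gold_spans)
score_b = precision_recall_f1(with_synonyms, gold_spans)
```

In this made-up case the extended gazetteer finds more of the annotated mentions (higher recall) but also produces one spurious match (lower precision), which is the trade-off the hypothesis would need to be tested against.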
14. DISEASE EXTRACTOR MODULE: INPUT AND OUTPUT
Input: Text from file
Module: Disease Extractor
Output:
  Index of the first character
  Index of the last character
  Length of the matched text
  Matched text
  Canonical disease name
Disease Extraction Task
The task of disease recognition can be considered an NER/information
extraction (IE) task
The main purpose is to retrieve tokens that match at least one term, including
synonyms and abbreviations, from the list of animal disease names
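The input/output contract above can be sketched as a small gazetteer matcher. This is a minimal illustration, not the module built for the project: the names Match, build_pattern, and extract_diseases are hypothetical, the gazetteer entries are illustrative, and the last-character index is assumed to be inclusive.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    start: int      # index of the first character
    end: int        # index of the last character (assumed inclusive)
    length: int     # length of the matched text
    text: str       # matched text as it appears in the input
    canonical: str  # canonical disease name


def build_pattern(gazetteer):
    """Compile one case-insensitive pattern over all names and variants."""
    lookup = {}
    for canonical, variants in gazetteer.items():
        for term in [canonical, *variants]:
            lookup[term.lower()] = canonical
    # Longest terms first so the alternation prefers the longest match.
    terms = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # (?<!\w) and (?!\w) keep matches from starting or ending inside a word.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)
    return pattern, lookup


def extract_diseases(text, gazetteer):
    pattern, lookup = build_pattern(gazetteer)
    matches = []
    for m in pattern.finditer(text):
        canonical = lookup[m.group(0).lower()]
        matches.append(
            Match(m.start(), m.end() - 1, m.end() - m.start(), m.group(0), canonical)
        )
    return matches


# Illustrative gazetteer: canonical name -> synonyms and abbreviations.
gazetteer = {
    "foot-and-mouth disease": ["foot and mouth disease", "FMD"],
    "Rift Valley fever": ["RVF"],
}
text = ("Foot and mouth disease is one of the most contagious "
        "diseases of cloven-hooved mammals.")
matches = extract_diseases(text, gazetteer)
```

Mapping every surface form to one canonical name is what lets synonyms and abbreviations from an extended gazetteer count as mentions of the same disease.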
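15. CONTEXT EXAMPLES IN DIFFERENT LANGUAGES
Each example reads roughly: "Leptospirosis occurs in all countries except the Arctic. The incidence is high. More than half of the cases are severe and require resuscitation."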
DUTCH
Leptospirose komt voor in alle landen, behalve het Noordpoolgebied. De incidentie is hoog.
Meer dan de helft van de gevallen voordoet in ernstige en vereiste reanimatie.
CZECH
Leptospiróza se vyskytuje ve všech zemích s výjimkou Arktidy. Incidence je vysoká. Více než
polovina případů se vyskytuje v těžké a vyžaduje resuscitaci.
GERMAN
Leptospirose tritt in allen Ländern, mit Ausnahme der Arktis. Die Inzidenz ist hoch. Mehr als
die Hälfte der Fälle tritt in schweren und Reanimation erforderlich.
ITALIAN
Leptospirosi si verifica in tutti i paesi, tranne l'Artico. L'incidenza è alta. Più della metà dei
casi si verifica in rianimazione grave e richiesti.
UKRAINIAN
Лептоспіроз відбувається в усіх країнах, за винятком Арктики. Захворюваність висока.
Більше половини випадків відбувається в суворих і необхідність реанімації.
RUSSIAN
Лептоспироз происходит во всех странах, за исключением Арктики. Заболеваемость
высокая. Более половины случаев происходит в суровых и необходимости реанимации.
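Matching the same disease across these languages raises a practical issue: accented and Cyrillic text can arrive in different Unicode forms and letter cases. A minimal sketch, assuming NFC normalization plus case folding is applied to both the gazetteer term and the context before comparison. The helper names and the decomposed Czech sample are illustrative assumptions.

```python
import unicodedata


def normalize(s):
    """Compose combining marks (NFC) and fold case for comparison."""
    return unicodedata.normalize("NFC", s).casefold()


def contains_term(text, term):
    return normalize(term) in normalize(text)


# One gazetteer term per language, taken from the examples above.
terms = {
    "Dutch": "leptospirose",
    "Czech": "leptospiróza",
    "Russian": "лептоспироз",
}

# Context openings; the Czech one deliberately uses "o" + combining acute
# (U+0301) instead of the precomposed letter, as some sources produce.
contexts = {
    "Dutch": "Leptospirose komt voor in alle landen",
    "Czech": "Leptospiro\u0301za se vyskytuje ve všech zemích",
    "Russian": "Лептоспироз происходит во всех странах",
}

hits = {lang: contains_term(contexts[lang], terms[lang]) for lang in terms}

# Without normalization, the decomposed Czech text is missed.
naive_czech_hit = terms["Czech"] in contexts["Czech"].lower()
```

This handles only the dictionary form of each name. Inflected forms, which are common in Czech, Ukrainian, and Russian, would need stemming or lemmatization on top of it.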
18. RESULTS FOR DISEASE EXTRACTOR MODULE
INPUT A: Foot and mouth disease is one of the most contagious diseases of cloven-hooved mammals…
OUTPUT A: [not preserved in the extracted slide text]
INPUT B: Rift Valley Fever | CDC Special Pathogens Branch Mission Statement Disease …
OUTPUT B: [not preserved in the extracted slide text]
20. CONCLUSIONS
Applying Wikipedia knowledge to the multilingual NER task
Phase 1: Crawling Wikipedia – completed
Phase 2: Google Sets Expansion – completed
Phase 3: Multilingual Disease Extraction – in progress
Novelty: overcome Wikipedia's limitations by applying the Google Sets
expansion approach
To estimate accuracy, we need annotated data in
different languages
22. ACKNOWLEDGEMENTS
Dr. William Hsu for meaningful guidance
John Drouhard for building the extraction architecture
Landon Fowles for expanding gazetteers using Google Sets