Multilingual NER Using Wiki

Multilingual Named Entity Recognition Using Wikipedia
Laboratory for Knowledge Discovery in Databases
Department of Computing and Information Sciences
Kansas State University
http://www.kddresearch.org/tikiwiki/tiki-index.php

Presenter: Svitlana O. Volkova
Instructor: William Hsu

AGENDA

I. Project Overview
II. Crawling Wikipedia
III. Synonymy Discovery with Google Sets
IV. Experiment Design
V. Conclusions

PROJECT MILESTONES

CRAWLING WIKIPEDIA
Input: Crawler Functionality
Output: Set of Multilingual Gazetteers

RELATIONSHIP DISCOVERY WITH GOOGLE SETS
Input: Initial Gazetteer in one Language
Output: Extended Gazetteer with Synonyms

MULTILINGUAL NER TASK
Input: Extended Gazetteer with Synonyms + Content
Output: Extracted Entities from the Content

KEY IDEA - WIKIPEDIA
• Apply Wikipedia knowledge representation for multilingual information extraction
English Wiki Concepts of Interest
…, anthrax, bovine virus, …, camelpox, surra, …

http://wiki.digitalmethods.net/Dmi/WikipediaAnalysis

Russian Wiki Concepts of Interest
…, Зоонозы, Классическая чума свиней, Лептоспироз, …

CRAWLING WIKIPEDIA

Multilingual NER (article + category + interwiki links)

Wiki Category Graph and Article Graph
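
A minimal sketch of how gazetteers could be assembled from the category graph, the article graph and interwiki links, assuming the crawl results are already held in memory as plain dictionaries. The function names `collect_articles` and `build_gazetteers`, the category names and the toy data are illustrative assumptions, not the project's actual crawler.

```python
from collections import deque

# Toy crawl results (hypothetical): category graph, article graph, interwiki links.
SUBCATEGORIES = {"Animal diseases": ["Zoonoses"]}
CATEGORY_ARTICLES = {
    "Animal diseases": ["Camelpox", "Surra"],
    "Zoonoses": ["Anthrax", "Leptospirosis"],
}
INTERWIKI = {"Leptospirosis": {"ru": "Лептоспироз", "de": "Leptospirose"}}


def collect_articles(root, subcategories, category_articles):
    """Breadth-first walk of the category graph, gathering article titles."""
    seen, queue, articles = {root}, deque([root]), set()
    while queue:
        category = queue.popleft()
        articles.update(category_articles.get(category, []))
        for sub in subcategories.get(category, []):
            if sub not in seen:  # guard against cycles in the category graph
                seen.add(sub)
                queue.append(sub)
    return articles


def build_gazetteers(articles, interwiki, source_lang="en"):
    """Map each language code to a sorted list of entity names."""
    gazetteers = {source_lang: list(articles)}
    for title in articles:
        # Follow interwiki links to the same concept in other languages.
        for lang, translated in interwiki.get(title, {}).items():
            gazetteers.setdefault(lang, []).append(translated)
    return {lang: sorted(set(names)) for lang, names in gazetteers.items()}


articles = collect_articles("Animal diseases", SUBCATEGORIES, CATEGORY_ARTICLES)
gazetteers = build_gazetteers(articles, INTERWIKI)
for lang, names in sorted(gazetteers.items()):
    print(lang, names)
```

Articles without an interwiki link in a given language simply do not appear in that language's list, which is one way the per-language gazetteers can end up with different sizes.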

GAZETTEER EXAMPLES IN DIFFERENT LANGUAGES

GAZETTEER SIZES IN DIFFERENT LANGUAGES

19

37 English
86 Japanese
German
20 Russian

Decision: the dictionaries are too small, so we need to find a way to extend them.

GAZETTEER EXAMPLES: GERMAN GOOGLE SETS OUTPUT

EXPERIMENT SETUP
• Purpose: to perform a named entity recognition task in a specific domain and report the accuracy of extraction using
a) Wiki knowledge
b) Extended lists with synonyms from Google Sets

• Hypothesis: the synonym extraction phase is essential for increasing the accuracy of the information extraction task
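
The comparison in the experiment could be scored along the following lines. This is a sketch under stated assumptions: the slides say "accuracy" without defining it, so precision, recall and F1 over character spans are an assumed choice of metric, and the text, gold spans and gazetteers below are toy data, not results from the experiment. The helper names `find_spans` and `precision_recall_f1` are hypothetical.

```python
import re


def find_spans(text, gazetteer):
    """Return (start, end) character spans of every gazetteer term in text."""
    spans = set()
    for term in gazetteer:
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            spans.add((m.start(), m.end()))
    return spans


def precision_recall_f1(predicted, gold):
    """Score predicted spans against hand-annotated gold spans (exact match)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy annotated example (assumption): three disease mentions, two abbreviated.
TEXT = "Foot and mouth disease (FMD) is contagious. FMD affects cloven-hooved mammals."
GOLD = {(0, 22), (24, 27), (44, 47)}

BASE = ["foot and mouth disease"]   # a) Wiki knowledge only
EXTENDED = BASE + ["FMD"]           # b) list extended with a synonym/abbreviation

base_scores = precision_recall_f1(find_spans(TEXT, BASE), GOLD)
extended_scores = precision_recall_f1(find_spans(TEXT, EXTENDED), GOLD)
print("base:", base_scores)
print("extended:", extended_scores)
```

In this toy case the base list misses both abbreviated mentions, so only recall changes between the two runs; whether the same holds on real data is what the experiment is meant to test.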

DISEASE EXTRACTOR MODULE
INPUT AND OUTPUT
Input: Text from file
Module: Disease Extractor Module
Output:
• Index of the first character
• Index of the last character
• Length of the matched text
• Matched text
• Canonical disease name

Disease Extraction Task
• The task of disease recognition can be considered an NER/information extraction (IE) task
• The main purpose is to retrieve tokens that match at least one term, including synonyms and abbreviations, from the list of animal disease names
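
The input and output described above can be sketched as a small dictionary-based matcher. This is an illustration, not the module's actual implementation: the `Match` class, the function `extract_diseases`, the gazetteer contents and the canonical name spelling are assumptions, as is the choice to report the last-character index inclusively.

```python
import re
from dataclasses import dataclass


@dataclass
class Match:
    start: int       # index of the first character
    end: int         # index of the last character (inclusive, by assumption)
    length: int      # length of the matched text
    text: str        # matched text as it appears in the input
    canonical: str   # canonical disease name


def extract_diseases(text, surface_to_canonical):
    """Find every gazetteer surface form in text and report its canonical name."""
    if not surface_to_canonical:
        return []
    lookup = {form.lower(): name for form, name in surface_to_canonical.items()}
    # Longest forms first so a longer name wins over a shorter one it contains.
    forms = sorted(surface_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(form) for form in forms) + r")\b",
        re.IGNORECASE,
    )
    return [
        Match(
            start=m.start(),
            end=m.end() - 1,
            length=m.end() - m.start(),
            text=m.group(0),
            # Fall back to the matched text if case folding differs from lower().
            canonical=lookup.get(m.group(0).lower(), m.group(0)),
        )
        for m in pattern.finditer(text)
    ]


# Hypothetical gazetteer: surface forms (synonyms, abbreviations) -> canonical name.
GAZETTEER = {
    "foot and mouth disease": "Foot-and-mouth disease",
    "FMD": "Foot-and-mouth disease",
}

SAMPLE = ("Foot and mouth disease is one of the most contagious "
          "diseases of cloven-hooved mammals")
matches = extract_diseases(SAMPLE, GAZETTEER)
for match in matches:
    print(match)
```

Because the pattern is built from whatever surface forms the gazetteer holds, the same function can be run unchanged over gazetteers in other languages.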

CONTEXT EXAMPLES IN DIFFERENT LANGUAGES
DUTCH
• Leptospirose komt voor in alle landen, behalve het Noordpoolgebied. De incidentie is hoog.
Meer dan de helft van de gevallen voordoet in ernstige en vereiste reanimatie.
CZECH
• Leptospiróza se vyskytuje ve všech zemích s výjimkou Arktidy. Incidence je vysoká. Více než
polovina případů se vyskytuje v těžké a vyžaduje resuscitaci.
GERMAN
• Leptospirose tritt in allen Ländern, mit Ausnahme der Arktis. Die Inzidenz ist hoch. Mehr als
die Hälfte der Fälle tritt in schweren und Reanimation erforderlich.
ITALIAN
• Leptospirosi si verifica in tutti i paesi, tranne l'Artico. L'incidenza è alta. Più della metà dei
casi si verifica in rianimazione grave e richiesti.
UKRAINIAN
• Лептоспіроз відбувається в усіх країнах, за винятком Арктики. Захворюваність висока.
Більше половини випадків відбувається в суворих і необхідність реанімації.
RUSSIAN
• Лептоспироз происходит во всех странах, за исключением Арктики. Заболеваемость
высокая. Более половины случаев происходит в суровых и необходимости реанимации.

DISEASE EXTRACTOR MODULE DEMO
http://fingolfin.user.cis.ksu.edu:8080/diseaseextractor/

RESULTS FOR DISEASE EXTRACTOR MODULE

INPUT A OUTPUT A
Foot and mouth disease is
one of the most contagious
diseases of cloven-hooved
mammals…

INPUT B OUTPUT B
Rift Valley Fever | CDC
Special Pathogens Branch
Mission Statement Disease …

CONCLUSIONS
• Applying Wikipedia knowledge to the multilingual NER task

• Phase 1: Crawling Wiki – completed
• Phase 2: Google Sets Expansion – completed
• Phase 3: Multilingual Disease Extraction – in progress

• Novelty: overcome Wiki limitations by applying the Google Sets expansion approach

• To estimate accuracy, we need annotated data in different languages

ACKNOWLEDGEMENTS

• Dr. William Hsu for meaningful guidance

• John Drouhard for building the extraction architecture

• Landon Fowles for expanding gazetteers using Google Sets
