International Journal on Computational Sciences & Applications (IJCSA) Vol.3, No.5, October 2013

HANDLING AMBIGUITIES AND UNKNOWN WORDS
IN NAMED ENTITY RECOGNITION USING
ANAPHORA RESOLUTION
Deepti Chopra and Dr. G.N. Purohit
Department of Computer Engineering, Banasthali Vidyapith, Rajasthan, INDIA

ABSTRACT
Anaphora Resolution is a method of determining what a particular noun phrase or pronoun refers to at a
given point in a text. In this paper, we discuss how Anaphora Resolution is useful in performing
computational linguistics tasks in various natural languages, including the Indian languages. We also
discuss how Anaphora Resolution helps in handling unknown words in Named Entity Recognition.

KEYWORDS
Anaphora Resolution, Named Entities, Performance Metrics

1. INTRODUCTION
Anaphora is a form of presupposition or cohesion that points back to a previous element in the text.
An anaphor is the expression that points back, and the entity to which the anaphor actually refers is
known as its antecedent. Anaphora Resolution is therefore a method of finding the antecedent of an anaphor.
Applications of Anaphora Resolution include Named Entity Recognition, Machine Translation,
Information Retrieval, Question Answering and Information Extraction.
Named Entity Recognition (NER) is considered one of the subtasks of Information Extraction, in
which named entities or proper nouns are located in a large text and classified into various
categories. Transliteration plays a crucial role in resolving the problem of unknown words in
Named Entity Recognition. If a given word has been trained using example-based learning in one
language, then this word can also be transliterated into other languages.
e.g. नुसरत/PER सेब खाती है
दीपिका/PER कानपुर/CITY में रहती है
दीपा/PER नागपुर/CITY में रहती है
Above, the Named Entities in Hindi are transliterated into English as: Nusrat, Deepika, Kanpur, Deepa
and Nagpur.

DOI:10.5121/ijcsa.2013.3504


2. RELATED WORK
Named Entity Recognition (NER) is the process in which Named Entities or proper nouns are
identified and then categorized into different Named Entity classes, e.g. Name of Person, Sport,
River, Country, State, City, Organization etc. Unknown words in Named Entity Recognition can be
handled in many ways. In one approach, an unknown word is allotted a specific tag, e.g. an unk
tag [1][2]. Capitalization information cannot be used to identify the POS tag, since capitalization
can also occur in unknown words. Moreover, capitalization information helps to identify Named
Entities only in English [3][4].

3. OUR APPROACH
In our approach, we treat the ‘OTHER’ tag as the ‘Not a Named Entity’ tag. Figure 1
depicts our approach during training in NER. We process each token of the training data and check
its tag. If the token is not tagged ‘OTHER’, we transliterate the token into different
languages and save the transliterated Named Entities along with their tags in a file. If the token
is tagged ‘OTHER’, we simply move on to the next token in the sentence. This procedure
continues until all the training sentences have been processed.

[Figure 1 flowchart: START → training data (process each token) → is the token's tag other than ‘OTHER’? If YES, transliterate the token and attach its tag; if NO, process the next token → end of sentence reached? If NO, continue with the next token; if YES, EXIT.]

Figure 1 Steps taken while performing training in Named Entity Recognition
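
The training steps of Figure 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the `token/TAG` input format taken from the paper's examples, and the tiny lookup lexicon standing in for a real transliteration component are all assumptions made for the example.

```python
# Hypothetical stand-in for a real transliteration component:
# (source token, target language code) -> transliterated form.
HYPOTHETICAL_LEXICON = {
    ("Ram", "hi"): "राम",
    ("cricket", "hi"): "क्रिकेट",
}


def lookup_transliterate(token, lang):
    """Return the transliteration of `token` in `lang` (lexicon lookup only)."""
    return HYPOTHETICAL_LEXICON[(token, lang)]


def build_transliteration_table(tagged_sentences, transliterate, languages):
    """Map every transliterated Named Entity to the tag it had in training."""
    table = {}
    for sentence in tagged_sentences:
        for token_tag in sentence.split():
            token, _, tag = token_tag.rpartition("/")
            if tag == "OTHER":
                # Not a Named Entity: skip it and move on to the next token.
                continue
            for lang in languages:
                table[transliterate(token, lang)] = tag
    return table


table = build_transliteration_table(
    ["Ram/PER plays/OTHER cricket/SPORT"], lookup_transliterate, ["hi"]
)
```

Here `table` plays the role of the transliteration file: it holds the Hindi forms of Ram and cricket with their tags, and nothing for `plays`.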


If we have the sound mapping of one language into another language, then the transliteration task
can be performed easily. For example, if ‘Ram/PER plays/OTHER cricket/OTHER’ is a training
sentence, then Ram is transliterated into other languages, since Ram is a Named Entity that
carries the ‘PER’ tag and not the ‘OTHER’ tag. ‘plays’ and ‘cricket’ are not Named Entities,
since they are tagged ‘OTHER’, so transliteration is not performed on them.
Our approach is therefore language independent: it is capable of recognizing Named Entities in
text written in any natural language, provided that the NER system has been trained in at least
one language and that sound mappings exist from which the transliterated Named Entities in the
other languages can be obtained. Figure 2 displays the steps taken while performing testing in a
Named Entity Recognition based system. All the tokens in a testing sentence are processed one at
a time. If the current token is not an unknown entity, a tag is assigned to it using the Viterbi
algorithm, assuming a Hidden Markov Model approach is used to perform Named Entity Recognition.
If the current token is an unknown entity, the transliteration file generated during training is
searched. If the unknown entity is found in the transliteration file, the corresponding Named
Entity tag is attached to it; otherwise the ‘OTHER’ tag is allotted to it. This procedure
continues until all the testing sentences have been processed.
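
To make the idea of a sound mapping concrete, the following is a minimal sketch of a character-level Devanagari-to-Latin mapping. It is not the authors' transliteration code; the mapping tables are deliberately tiny and hypothetical, and the only phonological rule applied is dropping the inherent vowel at the end of a word. Words that need medial vowel deletion (for example कानपुर) would not come out in their usual English spelling.

```python
# Hypothetical, deliberately small sound-mapping tables (Hindi -> English).
CONSONANTS = {
    "क": "k", "ग": "g", "ट": "t", "द": "d",
    "न": "n", "प": "p", "म": "m", "र": "r",
}
VOWEL_SIGNS = {
    "\u093e": "a",   # aa matra
    "\u093f": "i",   # i matra
    "\u0940": "ee",  # ii matra
    "\u0941": "u",   # u matra
    "\u0947": "e",   # e matra
}
ANUSVARA = "\u0902"  # nasal mark, rendered here simply as "n"
VIRAMA = "\u094d"    # suppresses the inherent vowel of a consonant


def transliterate_hi_to_en(word):
    """Transliterate a Devanagari word using the small tables above."""
    out = []
    pending_schwa = False  # True while a consonant still owes its inherent "a"
    for ch in word:
        if ch in CONSONANTS:
            if pending_schwa:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_schwa = True
        elif ch in VOWEL_SIGNS:
            # An explicit vowel sign replaces the inherent vowel.
            out.append(VOWEL_SIGNS[ch])
            pending_schwa = False
        elif ch == ANUSVARA:
            if pending_schwa:
                out.append("a")
            out.append("n")
            pending_schwa = False
        elif ch == VIRAMA:
            pending_schwa = False
        else:
            raise ValueError(f"no sound mapping for character {ch!r}")
    # A schwa still pending at the end of the word is dropped.
    return "".join(out).capitalize()


ram = transliterate_hi_to_en("राम")
ganga = transliterate_hi_to_en("गंगा")
```

A practical system would need a full character inventory and language-specific rules; the sketch only shows how a per-character mapping plus a small amount of state can produce forms such as Ram and Ganga.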

[Figure 2 flowchart: testing sentence (process each token) → is the token an unknown entity? If NO, attach a tag to the token; if YES, search the transliteration file → does the entity exist there? If YES, attach the Named Entity tag; if NO, attach the ‘OTHER’ tag → end of sentence? If NO, continue with the next token; if YES, EXIT.]

Figure 2 Steps taken while performing testing in Named Entity Recognition

Consider a testing sentence: राम क्रिकेट खेलता है
Suppose the training sentence is: Ram/PER plays/OTHER cricket/SPORT. During training, the
Named Entities Ram and cricket are transliterated and stored in the transliteration file along
with their tags. So, during testing, राम and क्रिकेट are given the PER and SPORT tags respectively.
The tokens खेलता and है are unknown entities, since they were neither seen during training nor
exist in the transliteration file; they are therefore allotted the ‘OTHER’ tag.
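
The testing procedure and the worked example above can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: `tag_sentence`, `known_vocab`, `model_tagger` and `translit_table` are hypothetical names, and the per-token `model_tagger` callback merely stands in for the HMM/Viterbi tagger that the paper applies to known tokens.

```python
def tag_sentence(tokens, known_vocab, model_tagger, translit_table):
    """Tag tokens, falling back to the transliteration table for unknown ones."""
    tags = []
    for token in tokens:
        if token in known_vocab:
            # Known token: defer to the trained tagger (stand-in for Viterbi).
            tags.append(model_tagger(token))
        else:
            # Unknown token: use the transliteration table, else 'OTHER'.
            tags.append(translit_table.get(token, "OTHER"))
    return tags


# Toy data mirroring the example: training was done on the English sentence.
trained_tags = {"Ram": "PER", "plays": "OTHER", "cricket": "SPORT"}
translit_table = {"राम": "PER", "क्रिकेट": "SPORT"}

test_tokens = "राम क्रिकेट खेलता है".split()
predicted = tag_sentence(
    test_tokens, set(trained_tags), trained_tags.get, translit_table
)
```

With this toy data, the two Hindi Named Entities are resolved through the transliteration table, and the two remaining tokens fall through to ‘OTHER’.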

3. RESULT
In Named Entity Recognition, handling unknown words is a crucial issue. We have
developed code that generates a list of the Named Entities used in the training file. These Named
Entities are then transliterated into different languages and stored in the respective
transliteration files. TABLE 1 displays the transliterations of Named Entities in different
languages obtained using our transliteration code.
TABLE 1 Transliteration of Named Entities in different languages
ENGLISH | HINDI | FRENCH | URDU | TELUGU | BENGALI
Ram | राम | Bélier | رام | మేషరాశి | গড্ডল
Cricket | क्रिकेट | Cricket | کرکٹ | క్రికెట్ | ক্রিকেট খেলা
India | भारत | Inde | بھارت | భారతదేశం | ভারত
Ganga | गंगा | Ganga | گنگا | గంగా | গঙ্গা

For a given testing sentence, we first check whether all the tokens exist in the word list created
during training. If training has been performed for all the words, the Viterbi algorithm can be
used to generate the optimal state sequence. If one of the words is unknown, i.e. it does not exist
in the training file, we check the transliteration file; if the word exists there, the correct tag
can be allotted to it, and the problem of the unknown word is resolved.

If training in NER was initially performed in English and a Hindi word is given during testing,
the word would be treated as an unknown word for which no training has been performed. We can
therefore generate the transliterated list of Named Entities from English into other languages and
so handle the problem of unknown words in NER. TABLE 2 displays the results obtained
while performing NER on a multilingual corpus. The training and testing files include
sentences from the following languages: English, Hindi, Marathi, Punjabi and Urdu. We
performed training on 142 tokens and testing on 212 tokens. The Recall, Precision and
F-Measure obtained are 95.80%, 96.3% and 96.04% respectively.
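
For readers unfamiliar with the Viterbi step mentioned above, here is a compact sketch of Viterbi decoding for a Hidden Markov Model tagger. The probabilities in the toy model are invented purely for illustration and are not taken from the paper's training data; the function name and the small probability floor used for unseen events are likewise assumptions.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `tokens` under an HMM."""
    if not tokens:
        return []
    # best[i][t]: log-probability of the best path that ends in tag t at i.
    best = [{
        t: math.log(start_p.get(t, floor))
        + math.log(emit_p[t].get(tokens[0], floor))
        for t in tags
    }]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + math.log(trans_p[p].get(t, floor)))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + math.log(emit_p[t].get(tokens[i], floor))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Invented toy model for the sentence "Ram plays cricket".
TAGS = ["PER", "SPORT", "OTHER"]
START_P = {"PER": 0.8, "SPORT": 0.1, "OTHER": 0.1}
TRANS_P = {
    "PER": {"PER": 0.1, "SPORT": 0.1, "OTHER": 0.8},
    "SPORT": {"PER": 0.1, "SPORT": 0.1, "OTHER": 0.8},
    "OTHER": {"PER": 0.2, "SPORT": 0.5, "OTHER": 0.3},
}
EMIT_P = {
    "PER": {"Ram": 1.0},
    "SPORT": {"cricket": 1.0},
    "OTHER": {"plays": 1.0},
}

best_tags = viterbi(["Ram", "plays", "cricket"], TAGS, START_P, TRANS_P, EMIT_P)
```

Log-probabilities are used so that long sentences do not underflow. A token never seen in training receives only the floor probability under every tag, which is why the approach described here consults the transliteration file for such tokens instead.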
TABLE 2 Unknown words handling in NER having 28 sentences in Training file
Tag set | {Person, City, Sport}
Number of Training sentences | 28 (142 tokens)
Number of Testing sentences | 41 (212 tokens)
Number of sentences tagged wrongly | 0

TAG SET | PER | City | Sport | Not-a-name
According to human, frequency of tags (Correct Answer) | 42 | 6 | 1 | 163
Observed (system-provided) frequency of tags (Experimental Results) | 41 | 3 | 1 | 167

Recall (R) | R = Correct/(Correct+Incorrect+Spurious) = 209/(209+8+1) = 95.80%
Precision (P) | P = Correct/(Correct+Incorrect+Missing) = 209/(209+8+0) = 96.3%
F-measure | F-Measure = (2*P*R)/(P+R) = (2*96.3*95.8)/(96.3+95.8) = 96.04%


TABLE 3 displays the Named Entity tags for which transliteration into different languages
is performed. The transliterated Named Entities are stored in a file along with their tags,
which can then be used to resolve the problem of unknown entities in
Named Entity Recognition.
TABLE 3 Named Entities transliterated into different languages
SNO | TAGS
1 | PER (Name of Person)
2 | OZ (Name of Organization)
3 | CN (Name of Country)
4 | MAG (Name of Magazine)
5 | WEEK (Name of Week)
6 | LOC (Name of Location)
7 | PC (Name of Personal Computer)
8 | MONTH (Name of Month)
9 | CITY (Name of City)
10 | ST (Name of State)
11 | SPORT (Name of Sport)

4. PERFORMANCE METRICS
Performance metrics are measures used to estimate the performance of an NER based system.
They can be calculated in terms of three parameters: Precision, Recall and F-Measure. The output
of an NER based system is termed the “response” and the human interpretation the “answer key” [5].
Consider the following terms:
1. Correct: the response is the same as the answer key.
2. Incorrect: the response is not the same as the answer key.
3. Missing: the answer key is tagged but the response is not.
4. Spurious: the response is tagged but the answer key is not. [6]
Hence, we define Precision, Recall and F-Measure as follows: [7][8][9]
Precision (P): Correct / (Correct + Incorrect + Missing)
Recall (R): Correct / (Correct + Incorrect + Spurious)
F-Measure: (2 * P * R) / (P + R)
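
The three formulas can be checked with a short Python sketch. The function name `ner_scores` is hypothetical, and the counts are the ones that appear in the TABLE 2 formulas (Correct = 209, Incorrect = 8, Missing = 0, Spurious = 1). The code follows the definitions exactly as written in this section; note that MUC-style scoring, as commonly described, counts Missing against Recall and Spurious against Precision, so the two quantities would swap under that convention.

```python
def ner_scores(correct, incorrect, missing, spurious):
    """Precision, Recall and F-Measure as defined in this section."""
    precision = correct / (correct + incorrect + missing)
    recall = correct / (correct + incorrect + spurious)
    f_measure = (2 * precision * recall) / (precision + recall)
    return precision, recall, f_measure


# Counts taken from the formulas shown in TABLE 2.
precision, recall, f_measure = ner_scores(
    correct=209, incorrect=8, missing=0, spurious=1
)
print(f"P = {precision:.2%}, R = {recall:.2%}, F = {f_measure:.2%}")
```

Evaluating the fractions directly gives values within about 0.1 percentage point of the figures reported in TABLE 2; the small differences presumably come from rounding in the reported numbers.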

5. CONCLUSION
In this paper, we have discussed transliteration. We have presented the results obtained while
performing Named Entity Recognition on a multilingual corpus and handling unknown words using
the transliteration approach. We have also discussed performance metrics, which are an
important measure for judging the performance of a Named Entity Recognition based system.

ACKNOWLEDGEMENT
I would like to thank all those who helped me in accomplishing this task.

REFERENCES
[1] http://www.cse.iitb.ac.in/~cs626-460-2012/seminar_ppts/ner.pdf
[2] http://staff.um.edu.mt/cabe2/publications/nfHmms.pdf
[3] http://acl.ldc.upenn.edu/acl2002/MAIN/pdfs/Main036.pdf
[4] http://www.sigkdd.org/explorations/issues/7-1-2005-06/4-Fu.pdf
[5] Shilpi Srivastava, Mukund Sanglikar & D.C. Kothari. “Named Entity Recognition System for Hindi
Language: A Hybrid Approach”. International Journal of Computational Linguistics (IJCL), Volume
(2) : Issue (1) : 2011. Available at
http://cscjournals.org/csc/manuscript/Journals/IJCL/volume2/Issue1/IJCL-19.pdf
[6] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu, Dr. A. Govardhan. “A Survey on Named Entity
Recognition in Indian Languages with particular reference to Telugu”. IJCSI International Journal of
Computer Science Issues, Vol. 8, Issue 2, March 2011.
[7] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay.
“Language Independent Named Entity Recognition in Indian Languages”. In Proceedings of the
IJCNLP-08 Workshop on NER for South and South East Asian Languages, pages 33–40, Hyderabad,
India, January 2008. Available at: http://www.mt-archive.info/IJCNLP-2008-Ekbal.pdf
[8] Darvinder Kaur, Vishal Gupta. “A survey of Named Entity Recognition in English and other Indian
Languages”. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November
2010.
[9] G.V.S. Raju, B. Srinivasu, Dr. S. Viswanadha Raju, K.S.M.V. Kumar. “Named Entity
Recognition for Telugu Using Maximum Entropy Model”.

About Authors
Deepti Chopra is working as Assistant Professor in the Department of
Computer Science at Banasthali University (Rajasthan), India. She
received her B.Tech degree in Computer Science and Engineering from Rajasthan
College of Engineering for Women, Jaipur, Rajasthan in 2011. She completed her
M.Tech in Computer Science and Engineering at Banasthali University,
Rajasthan in 2013. Her research interests include Artificial Intelligence,
Natural Language Processing, and Information Retrieval. She has published
many papers in international journals and conferences.
Dr. G. N. Purohit is a Professor in the Department of Mathematics & Statistics
at Banasthali University (Rajasthan). Before joining Banasthali University,
he was Professor and Head of the Department of Mathematics, University of
Rajasthan, Jaipur. He has been Chief Editor of a research journal and a regular
reviewer for many journals. His present interests are O.R., Discrete
Mathematics and Communication Networks. He has published around 40
research papers in various journals.

.

35

Contenu connexe

Tendances

Tendances (19)

Ay34306312
Ay34306312Ay34306312
Ay34306312
 
Complex predicate meghaditya
Complex predicate meghadityaComplex predicate meghaditya
Complex predicate meghaditya
 
C7 agramakirshnan2
C7 agramakirshnan2C7 agramakirshnan2
C7 agramakirshnan2
 
Named Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random FieldNamed Entity Recognition for Telugu Using Conditional Random Field
Named Entity Recognition for Telugu Using Conditional Random Field
 
Classification and Identification of Telugu Aksharas using Moment Invariants ...
Classification and Identification of Telugu Aksharas using Moment Invariants ...Classification and Identification of Telugu Aksharas using Moment Invariants ...
Classification and Identification of Telugu Aksharas using Moment Invariants ...
 
A Review on a web based Punjabi t o English Machine Transliteration System
A Review on a web based Punjabi t o English Machine Transliteration SystemA Review on a web based Punjabi t o English Machine Transliteration System
A Review on a web based Punjabi t o English Machine Transliteration System
 
Ai lecture 09(unit03)
Ai lecture  09(unit03)Ai lecture  09(unit03)
Ai lecture 09(unit03)
 
Hps a hierarchical persian stemming method
Hps a hierarchical persian stemming methodHps a hierarchical persian stemming method
Hps a hierarchical persian stemming method
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
NLP
NLPNLP
NLP
 
Mca4040 analysis and design of algorithm
Mca4040  analysis and design of algorithmMca4040  analysis and design of algorithm
Mca4040 analysis and design of algorithm
 
Theory of Computer Science - Post Correspondence Problem
Theory of Computer Science - Post Correspondence ProblemTheory of Computer Science - Post Correspondence Problem
Theory of Computer Science - Post Correspondence Problem
 
Ai lecture 07(unit03)
Ai lecture  07(unit03)Ai lecture  07(unit03)
Ai lecture 07(unit03)
 
Toc syllabus updated
Toc syllabus updatedToc syllabus updated
Toc syllabus updated
 
Ceis 8
Ceis 8Ceis 8
Ceis 8
 
Ai lecture 10(unit03)
Ai lecture  10(unit03)Ai lecture  10(unit03)
Ai lecture 10(unit03)
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002
 
Tamil Morphological Analysis
Tamil Morphological AnalysisTamil Morphological Analysis
Tamil Morphological Analysis
 
Reference Scope Identification in Citing Sentences
Reference Scope Identification in Citing SentencesReference Scope Identification in Citing Sentences
Reference Scope Identification in Citing Sentences
 

En vedette

Презентация тренинг - спящий город
Презентация тренинг - спящий городПрезентация тренинг - спящий город
Презентация тренинг - спящий город
sa_denegko
 

En vedette (8)

Htv2013 matra
Htv2013 matraHtv2013 matra
Htv2013 matra
 
BBFC Research
BBFC ResearchBBFC Research
BBFC Research
 
Energy efficient sensor selection in visual sensor networks based on multi ob...
Energy efficient sensor selection in visual sensor networks based on multi ob...Energy efficient sensor selection in visual sensor networks based on multi ob...
Energy efficient sensor selection in visual sensor networks based on multi ob...
 
Contents Page Analysis
Contents Page AnalysisContents Page Analysis
Contents Page Analysis
 
Презентация тренинг - спящий город
Презентация тренинг - спящий городПрезентация тренинг - спящий город
Презентация тренинг - спящий город
 
FreeNest - Integrating Open Source Software, Open Content, Open Access, and L...
FreeNest - Integrating Open Source Software, Open Content, Open Access, and L...FreeNest - Integrating Open Source Software, Open Content, Open Access, and L...
FreeNest - Integrating Open Source Software, Open Content, Open Access, and L...
 
Quickscan brancheorganisaties en verenigingen
Quickscan brancheorganisaties en verenigingenQuickscan brancheorganisaties en verenigingen
Quickscan brancheorganisaties en verenigingen
 
Guía de buenas prácticas sobre personas con discapacidad.
Guía de buenas prácticas sobre personas con discapacidad.Guía de buenas prácticas sobre personas con discapacidad.
Guía de buenas prácticas sobre personas con discapacidad.
 

Similaire à Handling ambiguities and unknown words in named entity recognition using anaphora resolution

NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODELNERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
ijnlc
 
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHODPART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
ijait
 

Similaire à Handling ambiguities and unknown words in named entity recognition using anaphora resolution (20)

Identification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian LanguagesIdentification and Classification of Named Entities in Indian Languages
Identification and Classification of Named Entities in Indian Languages
 
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov ModelNERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
NERHMM: A Tool for Named Entity Recognition Based on Hidden Markov Model
 
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODELNERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
NERHMM: A TOOL FOR NAMED ENTITY RECOGNITION BASED ON HIDDEN MARKOV MODEL
 
Pos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil TextsPos Tagging for Classical Tamil Texts
Pos Tagging for Classical Tamil Texts
 
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGDETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
 
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNINGDETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
 
P-6
P-6P-6
P-6
 
P-6
P-6P-6
P-6
 
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITIONISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
 
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHODPART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
 
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOLHIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL
HIDDEN MARKOV MODEL BASED NAMED ENTITY RECOGNITION TOOL
 
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
Implementation of Enhanced Parts-of-Speech Based Rules for English to Telugu ...
 
FIRE2014_IIT-P
FIRE2014_IIT-PFIRE2014_IIT-P
FIRE2014_IIT-P
 
A survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languagesA survey of named entity recognition in assamese and other indian languages
A survey of named entity recognition in assamese and other indian languages
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlp
 
Usage of regular expressions in nlp
Usage of regular expressions in nlpUsage of regular expressions in nlp
Usage of regular expressions in nlp
 
D3 dhanalakshmi
D3 dhanalakshmiD3 dhanalakshmi
D3 dhanalakshmi
 
Presentation1
Presentation1Presentation1
Presentation1
 
Difficulties in processing malayalam verbs
Difficulties in processing malayalam verbsDifficulties in processing malayalam verbs
Difficulties in processing malayalam verbs
 
SETSWANA PART OF SPEECH TAGGING
SETSWANA PART OF SPEECH TAGGINGSETSWANA PART OF SPEECH TAGGING
SETSWANA PART OF SPEECH TAGGING
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 

Dernier (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 

Handling ambiguities and unknown words in named entity recognition using anaphora resolution

  • 1. International Journal on Computational Sciences & Applications (IJCSA) Vol.3, No.5, October 2013 HANDLING AMBIGUITIES AND UNKNOWN WORDS IN NAMED ENTITY RECOGNITION USING ANAPHORA RESOLUTION Deepti Chopra1 Dr. G.N. Purohit2 Department of Computer Engineering, Banasthali Vidyapith, Rajasthan, INDIA ABSTRACT Anaphora Resolution is a method of determining what a particular noun phrase or pronoun at a given instance refers to. In this paper, we have discussed how Anaphora Resolution is useful in performing computation linguistic task in various Natural languages including the Indian languages. Also, we have discussed how Anaphora Resolution is conducive in handling unknown words in Named Entity Recognition. KEYWORDS Anaphora Resolution, Named Entities, Performance Metrics 1. INTRODUCTION Anaphora refers to presupposition or cohesion that refers back to the previous element. Anaphor refers to the reference that is pointing back. The entity to which anaphora actually refers is known as antecedent. So, Anaphora Resolution is a method of finding the antecedent of an anaphor. Some of the applications of Anaphora Resolution include: Named Entity Recognition, Machine Translation, Information Retrieval, Question Answering System and Information Extraction. Named Entity Recognition (NER) is considered as one of the subtask of Information Extraction in which named entities or proper nouns are searched from a huge text and classified into various categories. Transliteration plays a crucial role in resolving the problem of unknown words in Named Entity Recognition. If a training of a given word is performed using example based learning in one language, then this word can be transliterated into other languages also. e. g. नसरत/PER सेब खाती है ु दीपिका/PER कानिर/CITY में रहती है ु दीिा/PER नागिर/CITY में रहती है ु Above, Named Entities in Hindi are transliterated into English as: nusrat, deepika, Kanpur, deepa and nagpur. DOI:10.5121/ijcsa.2013.3504 29
  • 2. International Journal on Computational Sciences & Applications (IJCSA) Vol.3, No.5, October 2013 2. RELATED WORK Named Entity Recognition (NER) is the process in which Named Entities or proper nouns are identified and then categorized into different classes of Named Entity classes e.g. Name of Person, Sport, River, Country, State, city, Organization etc. The unknown words in Named Entity Recognition can be handled using many ways. In one of the approaches, unknown word can be allotted a specific tag e.g. unk tag in order to handle unknown words in NER [1][2]. Capitalization information cannot be used to identify the POS tag, since Capitalization can exists in unknown words also. Also, Capitalization information is only restricted to identify Named Entities in English [3][4]. 3. OUR APPROACH In our approach, we have considered ‘OTHER’ tag to be ‘Not a Named Entity’ tag. Figure 1 depicts our approach during training in NER. We process each token of training data and check its tag. If the token is not tagged with ‘OTHER’ tag, then we transliterate the token into different languages and save these transliterated Named Entities along with their tags in a file. If the token is attached with ‘OTHER’ tag, then we simply process the next token in a sentence and continue this procedure until all the training sentences are not processed. START TRAINING DATA (PROCESS EACH TOKEN) IF NOT OTHER TAG? NO PROCESS NEXT TOKEN YES TRANSLITERATE TOKEN AND ATTACH ITS TAG NO IF END OF SENTENCE REACHED? YES EXIT Figure 1 Steps taken while performing training in Named Entity Recognition 30
  • 3.
If we have the sound mapping of one language into another language, then the transliteration task is straightforward. For example, if ‘Ram/PER plays/OTHER cricket/OTHER’ is a training sentence, then Ram is transliterated into other languages, since Ram is a Named Entity tagged ‘PER’ rather than ‘OTHER’. ‘plays’ and ‘cricket’ are tagged ‘OTHER’, so they are not Named Entities and are not transliterated.

Our approach is therefore language independent: it can recognize Named Entities in text written in any natural language, provided that the NER system has been trained in at least one language and that sound mappings exist from which the transliterated Named Entities in the other languages can be obtained.

Figure 2 displays the steps taken while performing testing in an NER system. The tokens of a testing sentence are processed one at a time. If the current token is not an unknown entity, a tag is assigned to it, using the Viterbi algorithm if a Hidden Markov Model is used to perform NER. If the current token is an unknown entity, the transliteration file generated during training is searched. If the unknown entity is found in the transliteration file, the corresponding Named Entity tag is attached to it; otherwise the ‘OTHER’ tag is allotted to it. This procedure continues until all the testing sentences have been processed.

[Flowchart: read the testing sentence and process each token; is the token an unknown entity? If no, attach a tag to the token; if yes, search the transliteration file; does the entity exist there? If no, attach the ‘OTHER’ tag to the token; if yes, attach the Named Entity tag; has the end of the sentence been reached? If no, process the next token; if yes, EXIT.]
Figure 2 Steps taken while performing testing in Named Entity Recognition

  • 4.
Consider a testing sentence:
राम क्रिकेट खेलता है
Suppose the training sentence is: Ram/PER plays/OTHER cricket/SPORT. During training, the Named Entities Ram and cricket are transliterated and stored in the transliteration file along with their tags. During testing, राम and क्रिकेट are therefore given the PER and SPORT tags respectively. The tokens खेलता and है are unknown entities, since they were neither seen during training nor present in the transliteration file, so they are allotted the ‘OTHER’ tag.

3. RESULT
In Named Entity Recognition, handling unknown words is a crucial issue. We have developed code that generates a list of the Named Entities used in the training file. These Named Entities are then transliterated into different languages and stored in the respective transliteration files. TABLE 1 displays the transliterations of Named Entities in different languages obtained using our transliteration code.

TABLE 1 Transliteration of Named Entities in different languages
ENGLISH | HINDI | FRENCH | URDU | TELUGU | BENGALI
Ram | राम | Bélier | رام | మేషరాశి | গড্ডল
Cricket | क्रिकेट | Cricket | کرکٹ | క్రికెట్ | ক্রিকেট খেলা
India | भारत | Inde | بھارت | భారతదేశం | ভারত
Ganga | गंगा | Ganga | گنگا | గంగా | গঙ্গা

For a given testing sentence, we first check whether all of its tokens exist in the word list created during training. If training has been performed for all of the words, the Viterbi algorithm can be used to generate the optimal state sequence. If a word is unknown, i.e. it does not exist in the training file, we check the transliteration file; if the word exists there, the correct tag can be allotted to it, and the problem of the unknown word is resolved.
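The test-time procedure just described (tag known words with the trained model, look unknown words up in the transliteration file, fall back to ‘OTHER’) might look like the sketch below. It makes several assumptions: `tag_sentence`, `KNOWN_TAGS` and `TRANSLITERATION_TABLE` are hypothetical names, and a per-word dictionary lookup stands in for the HMM and Viterbi decoding, which the paper mentions but does not detail.

```python
from typing import Callable, Dict, List, Set, Tuple

RAM_HI = "राम"
CRICKET_HI = "क्रिकेट"

# Hypothetical stand-in for the trained tagger: one fixed tag per training
# word. A real system would run Viterbi decoding over an HMM instead.
KNOWN_TAGS: Dict[str, str] = {"Ram": "PER", "plays": "OTHER", "cricket": "SPORT"}

# Assumed contents of the transliteration file produced during training.
TRANSLITERATION_TABLE: Dict[str, str] = {RAM_HI: "PER", CRICKET_HI: "SPORT"}


def tag_sentence(
    tokens: List[str],
    vocabulary: Set[str],
    tag_known: Callable[[str], str],
    transliteration_table: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Tag each token, resolving unknown words via the transliteration table."""
    tagged = []
    for token in tokens:
        if token in vocabulary:
            # Known word: defer to the trained tagger.
            tagged.append((token, tag_known(token)))
        else:
            # Unknown word: use the stored tag if present, else 'OTHER'.
            tagged.append((token, transliteration_table.get(token, "OTHER")))
    return tagged


result = tag_sentence(
    [RAM_HI, CRICKET_HI, "खेलता", "है"],
    vocabulary=set(KNOWN_TAGS),
    tag_known=KNOWN_TAGS.__getitem__,
    transliteration_table=TRANSLITERATION_TABLE,
)
```

Here `result` pairs the first two tokens with PER and SPORT from the table, while the last two tokens, found in neither the vocabulary nor the table, fall back to ‘OTHER’.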
If training in NER was initially performed in English and a Hindi word is given during testing, the word is treated as an unknown word for which no training has been performed. We can therefore generate the transliterated list of Named Entities from English into other languages and use it to handle unknown words in NER. TABLE 2 displays the results obtained while performing NER on a multilingual corpus. The training and testing files include sentences from the following languages: English, Hindi, Marathi, Punjabi and Urdu. We performed training on 142 tokens and testing on 212 tokens. The Recall, Precision and F-Measure obtained are 95.80%, 96.3% and 96.04% respectively.

  • 5.
TABLE 2 Unknown words handling in NER having 28 sentences in Training file
Tag set | {Person, City, Sport}
Number of training sentences | 28 (142 tokens)
Number of testing sentences | 41 (212 tokens)
Number of sentences tagged wrongly | 0

TAG SET | According to human frequency of tags (Correct Answer) | Observed (System provided) frequency of tags (Experimental Results)
PER | 42 | 41
City | 6 | 3
Sport | 1 | 1
Not-a-name | 163 | 167

Recall (R) | R = Correct/(Correct+Incorrect+Spurious) = 209/(209+8+1) = 95.80%
Precision (P) | P = Correct/(Correct+Incorrect+Missing) = 209/(209+8+0) = 96.3%
F-measure | F-Measure = (2*P*R)/(P+R) = (2*96.3*95.8)/(96.3+95.8) = 96.04%

TABLE 3 lists the Named Entity classes on which transliteration into different languages is performed. The transliterated Named Entities are stored in a file along with their tags, which can then be used to resolve unknown entities in NER.

TABLE 3 Named Entities transliterated into different languages
SNO | TAGS
1 | PER (Name of Person)
2 | OZ (Name of Organization)
3 | CN (Name of Country)
4 | MAG (Name of Magazine)
5 | WEEK (Name of Week)
6 | LOC (Name of Location)
7 | PC (Name of Personal Computer)
8 | MONTH (Name of Month)
9 | CITY (Name of City)
10 | ST (Name of State)
11 | SPORT (Name of Sport)
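The recall, precision and F-measure arithmetic reported with TABLE 2 can be reproduced with a few lines of Python. The sketch follows the formulas exactly as written in this paper and uses the reported counts (209 correct, 8 incorrect, 1 spurious, 0 missing); the function names are illustrative.

```python
def precision(correct: int, incorrect: int, missing: int) -> float:
    """Precision as defined in this paper: Correct/(Correct+Incorrect+Missing)."""
    return correct / (correct + incorrect + missing)


def recall(correct: int, incorrect: int, spurious: int) -> float:
    """Recall as defined in this paper: Correct/(Correct+Incorrect+Spurious)."""
    return correct / (correct + incorrect + spurious)


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall: (2*P*R)/(P+R)."""
    return (2 * p * r) / (p + r)


# Counts taken from the worked figures reported with TABLE 2.
p = precision(correct=209, incorrect=8, missing=0)
r = recall(correct=209, incorrect=8, spurious=1)
f = f_measure(p, r)

print(f"Precision = {p:.2%}, Recall = {r:.2%}, F-Measure = {f:.2%}")
```

Computed without intermediate rounding, the values agree with the reported figures to within about 0.1 percentage point.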
  • 6.
4. PERFORMANCE METRICS
Performance metrics are measures used to estimate the performance of an NER system. They can be calculated in terms of three parameters: Precision, Recall and F-Measure. The output of an NER system is termed the “response” and the human interpretation the “answer key” [5]. Consider the following terms:
1. Correct: the response is the same as the answer key.
2. Incorrect: the response is not the same as the answer key.
3. Missing: the answer key is tagged but the response is not.
4. Spurious: the response is tagged but the answer key is not. [6]
Hence, we define Precision, Recall and F-Measure as follows [7][8][9]:
Precision (P): Correct / (Correct + Incorrect + Missing)
Recall (R): Correct / (Correct + Incorrect + Spurious)
F-Measure: (2 * P * R) / (P + R)

5. CONCLUSION
In this paper, we have discussed transliteration. We have presented the results obtained while performing Named Entity Recognition on a multilingual corpus and handling unknown words using the transliteration approach. We have also discussed performance metrics, which are an important measure for judging the performance of an NER system.

ACKNOWLEDGEMENT
I would like to thank all those who helped me in accomplishing this task.

REFERENCES
[1] http://www.cse.iitb.ac.in/~cs626-460-2012/seminar_ppts/ner.pdf
[2] http://staff.um.edu.mt/cabe2/publications/nfHmms.pdf
[3] http://acl.ldc.upenn.edu/acl2002/MAIN/pdfs/Main036.pdf
[4] http://www.sigkdd.org/explorations/issues/7-1-2005-06/4-Fu.pdf
[5] Shilpi Srivastava, Mukund Sanglikar & D.C. Kothari. “Named Entity Recognition System for Hindi Language: A Hybrid Approach”. International Journal of Computational Linguistics (IJCL), Volume (2) : Issue (1) : 2011. Available at http://cscjournals.org/csc/manuscript/Journals/IJCL/volume2/Issue1/IJCL-19.pdf
[6] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu, Dr. A. Govardhan. “A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu”. IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.
[7] Asif Ekbal, Rejwanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay. “Language Independent Named Entity Recognition in Indian Languages”. In Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages, pages 33–40, Hyderabad, India, January 2008. Available at: http://www.mt-archive.info/IJCNLP-2008-Ekbal.pdf
[8] Darvinder Kaur, Vishal Gupta. “A survey of Named Entity Recognition in English and other Indian Languages”. IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 6, November 2010.
[9] G.V.S. Raju, B. Srinivasu, Dr. S. Viswanadha Raju, K.S.M.V. Kumar. “Named Entity Recognition for Telugu Using Maximum Entropy Model”.

  • 7.
About Authors
Deepti Chopra is working as Assistant Professor in the Department of Computer Science at Banasthali University (Rajasthan), India. She received her B.Tech degree in Computer Science and Engineering from Rajasthan College of Engineering for Women, Jaipur, Rajasthan in 2011, and her M.Tech in Computer Science and Engineering from Banasthali University, Rajasthan in 2013. Her research interests include Artificial Intelligence, Natural Language Processing, and Information Retrieval. She has published many papers in international journals and conferences.

Dr. G. N. Purohit is a Professor in the Department of Mathematics & Statistics at Banasthali University (Rajasthan). Before joining Banasthali University, he was Professor and Head of the Department of Mathematics, University of Rajasthan, Jaipur. He has been Chief Editor of a research journal and a regular reviewer for many journals. His present interests are in O.R., Discrete Mathematics and Communication Networks. He has published around 40 research papers in various journals.