Named Entity Recognition through CRF Based Machine Learning and Language Specific Heuristics

First Author                                    Second Author
Affiliation / Address line 1                    Affiliation / Address line 1
Affiliation / Address line 2                    Affiliation / Address line 2
Affiliation / Address line 3                    Affiliation / Address line 3
e-mail@domain                                   e-mail@domain




Abstract

This paper, submitted as an entry for the NERSSEAL-2008 shared task, describes a system built for Named Entity Recognition for South and South East Asian Languages. The system has been tested on five languages: Telugu, Hindi, Bengali, Urdu and Oriya. It uses CRF (Conditional Random Fields) based machine learning, followed by post-processing that applies some heuristics or rules. The system is specifically tuned for Hindi, but we also report the results for the other four languages.

1 Introduction

Named Entity Recognition (NER) is a task that seeks to locate and classify entities ('atomic elements') in a text into predefined categories such as the names of persons, organizations, locations, expressions of time, quantities, etc. The NERSSEAL contest uses 12 categories of named entities to define a tagset. Manually tagged training and testing data is provided to the contestants.

NER has a wide range of applications. It is used in machine translation, information extraction as well as speech processing.

The task of building a named entity recognizer for South and South East Asian languages presents several problems related to their linguistic characteristics. Before going on to describe our method, we will point out some of these issues.

2 Some Linguistic Issues

2.1 Agglutinative Nature

Some of the SSEA languages have agglutinative properties. For example, a Dravidian language like Telugu has a number of postpositions attached to a stem to form a single word. An example is:

   guruvAraMwo = guruvAraM + wo
   up to Wednesday = Wednesday + up to

Most of the named entities are suffixed with several different postpositions, which increases the number of distinct words in the corpus. This in turn affects the machine learning process.

2.2 No Capitalization

All the five languages have scripts without graphical cues like capitalization, which could act as an important indicator for NER. For a language like English, an NER system can exploit this feature to its advantage.

2.3 Ambiguity

One of the properties of the named entities in these languages is the high overlap between common names and proper names. For instance, 'Kamal' can mean 'a type of flower', which is not a named entity, but it can also be a person's name, i.e., a named entity.

Among the named entities themselves, there is ambiguity between a location name, as in 'Bangalore ek badzA shaher heI' (Bangalore is a big city), and a person's surname, as in 'M. Bangalore shikshak heI' (M. Bangalore is a teacher).
2.4 Low POS Tagging Accuracy for Nouns

For English, available tools like a POS (Part of Speech) tagger can be used to provide features for machine learning. This is not very helpful for SSEA languages because the accuracy for noun and proper noun tags is quite low (PVS and G., 2006). Hence, features based on POS tags cannot be used for NER for SSEA languages.

2.5 Spelling Variation

One other important language related issue is the variation in the spellings of proper names. For instance, the same name 'Shri Ram Dixit' can be written as 'Sh. Ram Dixit', 'Shree Ram Dixit', 'Sh. R. Dixit' and so on. This increases the number of tokens to be learnt by the machine and would perhaps also require a higher level task like co-reference resolution.

A NER system can be rule-based, statistical or hybrid. A rule-based NER system uses hand-written rules to tag a corpus with named entities. A statistical NER system learns the probabilities of named entities using training data, whereas hybrid systems use both.

Developing rule-based taggers for NER can be cumbersome as it is a language specific process. Statistical taggers require a large amount of annotated data (the more the merrier) to train. Our system is a hybrid NER tagger which first uses Conditional Random Fields (CRF) as a machine learning technique, followed by some rule-based post-processing.

3 Conditional Random Fields

Conditional Random Fields (CRFs) are undirected graphical models used to calculate the conditional probability of values on designated output nodes given values assigned to other designated input nodes.

In the special case in which the output nodes of the graphical model are linked by edges in a linear chain, CRFs make a first-order Markov independence assumption, and thus can be understood as conditionally-trained finite state machines (FSMs). Let o = (o_1, o_2, o_3, o_4, ..., o_T) be some observed input data sequence, such as a sequence of words in the text of a document (the values on T input nodes of the graphical model). Let S be a set of FSM states, each of which is associated with a label l ∈ L. Let s = (s_1, s_2, s_3, s_4, ..., s_T) be some sequence of states (the values on T output nodes). By the Hammersley-Clifford theorem, CRFs define the conditional probability of a state sequence given an input sequence to be

   P(s \mid o) = \frac{1}{Z_o} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, o, t) \Big),

where Z_o is a normalization factor over all state sequences, f_k(s_{t-1}, s_t, o, t) is an arbitrary feature function over its arguments, and \lambda_k is a learned weight for each feature function. A feature function may, for example, be defined to have value 0 or 1. Higher \lambda weights make their corresponding FSM transitions more likely. CRFs define the conditional probability of a label sequence l based on the total probability over the state sequences,

   P(l \mid o) = \sum_{s : l(s) = l} P(s \mid o),

where l(s) is the sequence of labels corresponding to the states in sequence s. Note that the normalization factor Z_o (also known in statistical physics as the partition function) is the sum of the scores of all possible state sequences,

   Z_o = \sum_{s} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, o, t) \Big),

and that the number of state sequences is exponential in the length T of the input sequence. In arbitrarily-structured CRFs, calculating the partition function in closed form is intractable, and approximation methods such as Gibbs sampling or loopy belief propagation must be used. In linear-chain structured CRFs (in use here for sequence modeling), the partition function can be calculated efficiently by dynamic programming.
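As a concrete illustration of the dynamic-programming computation of Z_o mentioned above, the following sketch contrasts the forward recursion with the exponential brute-force sum. It is illustrative only and not part of the original system; the score function and the toy state set are assumptions made for this example.

```python
import math
from itertools import product

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition_forward(states, score, T):
    """Compute log Z_o for a linear-chain CRF with the forward algorithm.

    score(prev_state, state, t) returns the local log-potential
    sum_k lambda_k * f_k(prev_state, state, o, t); the observation
    sequence o is assumed to be captured inside `score`.
    """
    # alpha[s] = log of the summed scores of all prefixes ending in state s
    alpha = {s: score(None, s, 0) for s in states}
    for t in range(1, T):
        alpha = {s: log_sum_exp([alpha[p] + score(p, s, t) for p in states])
                 for s in states}
    return log_sum_exp(list(alpha.values()))

def log_partition_brute_force(states, score, T):
    """Exponential-time check: enumerate every state sequence explicitly."""
    totals = []
    for seq in product(states, repeat=T):
        val = score(None, seq[0], 0)
        for t in range(1, T):
            val += score(seq[t - 1], seq[t], t)
        totals.append(val)
    return log_sum_exp(totals)

if __name__ == "__main__":
    states = ["O", "NEP", "NEL"]              # toy tag set
    def toy_score(prev, cur, t):
        return 0.5 if prev == cur else 0.0    # toy log-potential favouring self-transitions
    print(log_partition_forward(states, toy_score, T=4))
    print(log_partition_brute_force(states, toy_score, T=4))   # same value
```

The forward pass needs on the order of T x |S|^2 score evaluations, whereas the brute-force sum grows as |S|^T, which is the intractability the section refers to.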
4 CRF Based Machine Learning

We used the CRF model to perform the initial tagging, followed by post-processing.

4.1 Statistical Tagging

In the first phase, we have used language independent features such as statistical prefixes and suffixes to build the model using CRF. The set of features used is described below.
             Precision                 Recall                    F-Measure
             Pm      Pn      Pl        Rm      Rn      Rl        Fm      Fn      Fl
Bengali      53.34   49.28   58.27     26.77   25.88   31.19     35.65   33.94   40.63
Hindi        59.53   63.84   64.84     41.21   41.74   40.77     48.71   50.47   50.06
Oriya        39.16   40.38   63.70     23.39   19.24   28.15     29.29   26.06   39.04
Telugu       10.31   71.96   65.45     68.00   30.85   29.78     08.19   43.19   40.94
Urdu         43.63   44.76   48.96     36.69   34.56   39.07     39.86   39.01   43.46

Table-1: Evaluation of the NER System for Five Languages (subscripts m, n and l denote the maximal, nested and lexical matching schemes described in the Evaluation section)
4.2 Window of the Words

Words preceding or following the target word may be useful for determining its category. Following a few trials, we found that a suitable window size is five.

4.3 Suffixes

Statistical suffixes of length 1 to 4 have been considered. These can capture information for words with the named entity (Location) tag like Hyderabad, Secunderabad, Ahmedabad etc., all of which end in 'bad'.

4.4 Prefixes

Statistical prefixes of length 1 to 4 have been considered. These can take care of the problems associated with a large number of distinct tokens. As mentioned earlier, agglutinative languages can have a number of postpositions. The use of prefixes will increase the probability of 'Hyderabad' and 'Hyderabadloo' (Telugu for 'in Hyderabad') being treated as the same token.

4.5 Start of a Sentence

There is a possibility of confusing a NEN (Named Entity Number) in a sentence with a number that appears in a numbered list. A numbered list will always have numbers at the beginning of a sentence; hence a feature that checks for this property resolves the ambiguity with an actual NEN.

4.6 Presence of Digits

Usually, the presence of digits indicates that the token is a named entity. This is taken as a feature to recognize NEN (Named Entity Number).

4.7 Presence of 4 Digits

If the token is a four-digit number, it is more likely to be a NETI (Named Entity Time). For example, 1857, 2007 etc. are most probably years.
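To make Sections 4.2-4.7 concrete, the sketch below assembles the described features for a single token. The feature names, the two-words-each-side window and the helper itself are illustrative assumptions; the actual system would have encoded these features through a CRF toolkit template (CRF++ is cited as [5]) rather than through this code.

```python
def token_features(words, i):
    """Build the per-token features outlined in Sections 4.2-4.7.

    words: tokens of one sentence; i: index of the target token.
    """
    w = words[i]
    feats = {
        "word": w,
        "contains_digits": any(ch.isdigit() for ch in w),      # Section 4.6
        "is_four_digit_number": w.isdigit() and len(w) == 4,    # Section 4.7 (likely NETI, e.g. 1857)
        "sentence_start": i == 0,                               # Section 4.5 (numbered-list heuristic)
    }
    # Sections 4.3 and 4.4: statistical suffixes and prefixes of length 1 to 4
    for n in range(1, 5):
        feats[f"prefix_{n}"] = w[:n]
        feats[f"suffix_{n}"] = w[-n:]
    # Section 4.2: a window of five words (two before, two after the target)
    for offset in (-2, -1, 1, 2):
        j = i + offset
        feats[f"word_{offset:+d}"] = words[j] if 0 <= j < len(words) else "<PAD>"
    return feats

# Example: features for the year-like token in a toy sentence
print(token_features(["yaha", "1857", "kI", "bAta", "hai"], 1))
```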
5 Heuristics Based Post Processing

After the first phase, we used rule-based post-processing to improve the performance.

5.1 Second Best Tag

It has been observed that the recall of the CRF model is low. In order to improve recall, we have used the following rule: if the best tag given by the CRF model is O (not a named entity) and the confidence of the second best tag is greater than 0.15, then the second best tag is considered as the correct tag.

We observed an increase of 7% in recall, a 3% decrease in precision and a 4% increase in the F-measure, which is a significant improvement in performance. The decrease in precision is expected, as we are taking the second tag.
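A minimal sketch of this rule follows. The per-token data structure holding the two highest-scoring tags is an assumption made for illustration; in practice the confidences would come from the CRF decoder's marginal or n-best output.

```python
def apply_second_best_rule(tokens, threshold=0.15):
    """Promote the second best tag whenever the best tag is 'O' (Section 5.1).

    tokens: list of dicts such as
        {"word": ..., "best": ("O", 0.55), "second": ("NEP", 0.30)}
    where each pair is (tag, confidence). This structure is illustrative.
    """
    final_tags = []
    for tok in tokens:
        best_tag, _ = tok["best"]
        second_tag, second_conf = tok["second"]
        if best_tag == "O" and second_conf > threshold:
            final_tags.append(second_tag)   # recover a likely missed named entity
        else:
            final_tags.append(best_tag)
    return final_tags

# The second token is rescued as NEP because its second-best confidence exceeds 0.15
sample = [
    {"word": "mohana", "best": ("NEP", 0.71), "second": ("O", 0.20)},
    {"word": "lAla",   "best": ("O", 0.55),   "second": ("NEP", 0.30)},
]
print(apply_second_best_rule(sample))   # ['NEP', 'NEP']
```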
5.2 Nested Entities

One of the important tasks in the contest was to identify nested named entities. For example, if we consider "eka kilo" (Hindi: one kilo) as NEM
(Named Entity Measure), it contains a NEN (Named Entity Number) within it.

The CRF model tags "eka kilo" as NEM, and in order to tag "eka" as NEN we have made use of other resources like a gazetteer for the list of numbers. We used such lists for four languages.
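The following sketch shows one plausible way this gazetteer step could be applied: after the CRF assigns a maximal tag such as NEM, tokens inside that span which appear in a number gazetteer are additionally marked NEN. The gazetteer contents and the function are illustrative assumptions, not the authors' actual resources.

```python
# Tiny illustrative number gazetteer (romanized Hindi); the real lists covered four languages.
NUMBER_GAZETTEER = {"eka", "do", "tIna", "cAra", "pAMca"}

def add_nested_number_tags(tokens, maximal_tags):
    """Derive nested NEN tags inside spans already tagged by the CRF.

    tokens:       words of a sentence, e.g. ["eka", "kilo"]
    maximal_tags: maximal NE tags from the CRF, e.g. ["NEM", "NEM"]
    Returns (maximal_tag, nested_tag_or_None) pairs.
    """
    nested = []
    for word, tag in zip(tokens, maximal_tags):
        if tag != "O" and word in NUMBER_GAZETTEER:
            nested.append((tag, "NEN"))   # e.g. "eka" inside the NEM "eka kilo"
        else:
            nested.append((tag, None))
    return nested

print(add_nested_number_tags(["eka", "kilo"], ["NEM", "NEM"]))
# [('NEM', 'NEN'), ('NEM', None)]
```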
         Bengali   Hindi   Oriya   Telugu   Urdu
NEP      35.22     54.05   52.22   01.93    31.22
NED      NP        42.47   01.97   NP       21.27
NEO      11.59     45.63   14.50   NP       19.13
NEA      NP        61.53   NP      NP       NP
NEB      NP        NP      NP      NP       NP
NETP     42.30     NP      NP      NP       NP
NETO     33.33     13.77   NP      01.66    NP
NEL      45.27     62.66   48.72   01.49    57.85
NETI     55.85     79.09   40.91   71.35    63.47
NEN      62.67     80.69   24.94   83.17    13.75
NEM      60.51     43.75   19.00   26.66    84.10
NETE     19.17     31.52   NP      08.91    NP

Table-2: F-Measure (Lexical) for NE Tags

Evaluation

The evaluation measures used for all the five languages are precision, recall and F-measure. These measures are calculated in three different ways:

1. Maximal Matches: The largest possible named entities are matched with the reference data.
2. Nested Matches: The largest possible as well as nested named entities are matched.
3. Lexical Item Matches: The lexical items inside the largest possible named entities are matched.
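To make the three measures concrete, the sketch below computes precision, recall and F-measure from sets of predicted and reference entity spans. Representing spans as (start, end, tag) tuples and using exact matching are simplifying assumptions; the shared task's official scorer may differ in detail.

```python
def precision_recall_f1(predicted, reference):
    """Precision, recall and F-measure over sets of (start, end, tag) spans.

    Depending on which spans are placed in the sets (maximal entities only,
    maximal plus nested entities, or individual lexical items), this yields
    the maximal, nested or lexical variant of the measures.
    """
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one of the two predicted entities matches the reference exactly
pred = {(0, 2, "NEM"), (5, 6, "NEP")}
ref  = {(0, 2, "NEM"), (7, 8, "NEL")}
print(precision_recall_f1(pred, ref))   # (0.5, 0.5, 0.5)
```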
6 Results

The results of the evaluation as explained in the previous section are shown in Table-1. The F-measures for the nested lexical match are also shown individually for each named entity tag in Table-2.

7 Unknown Words

Table-3 shows the number of unknown words present in the test data when compared with the training data. The first column shows the number of unique named entity tokens present in the test data for each language. The second column shows the number of unique known named entities present in the test data. The third column shows the percentage of unique known named entities in the test data of each language when compared to the training data.

          Num of       Num of       % of
          NE tokens    known NE     known NE
Bengali   1185         277          23.37
Hindi     1120         417          37.23
Oriya     1310         563          42.97
Telugu    1150         145          12.60
Urdu      631          179          28.36

Table-3: Unknown Words

8 Error Analyses

We can observe from the results that the maximal F-measure for Telugu is very low when compared to the lexical F-measure and the nested F-measure. The reason is that the test data of Telugu contains a large number of long named entities (around six words), which contain around four to five nested named entities each. Our system was able to tag the nested named entities correctly, unlike the maximal named entities.

We can also observe that the maximal F-measure for Telugu is very low when compared to the other languages. This is because the Telugu test data has very few known words.

One more observation is that the F-measures of NEN, NETI and NEM are relatively high. The reason would be that these classes of NE are relatively closed.

Conclusion

In this paper we have presented the results of using a two-stage hybrid approach for the task of named entity recognition for South and South East Asian languages. We have achieved decent lexical F-measures of 40.63, 50.06, 39.04, 40.94 and 43.46 for Bengali, Hindi, Oriya, Telugu and Urdu respectively, without using too many language specific resources.

References

[1] Fei Sha and Fernando Pereira. 2003. Shallow Parsing with Conditional Random Fields. In Proceedings of HLT-NAACL.

[2] Charles Sutton. An Introduction to Conditional Random Fields for Relational Learning.

[3] John Lafferty, Andrew McCallum and Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.

[4] Avinesh PVS and Karthik G. Part-Of-Speech Tagging and Chunking Using Conditional Random Fields and Transformation Based Learning. In Proceedings of the SPSAL Workshop, IJCAI-07.

[5] CRF++: Yet Another CRF Toolkit. http://crfpp.sourceforge.net/

[6] Cucerzan, S. and Yarowsky, D. 1999. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC.
