MINING NAME ENTITY FROM
- NIKHIL BAROTE
- KUNJ THAKKAR
- SHIVANI PODDAR
- ANKIT SHARMA
In many search domains, both contents and searches are
frequently tied to named entities such as a person, a
company or similar.
One challenge from an information retrieval point of view is
that a single entity can be referred to in more than one way.
In this project we describe how to use Wikipedia contents
to automatically generate a dictionary of named entities
and synonyms that are all referring to the same entity.
With this approach we can find named entities and their synonyms
with a high degree of accuracy.
Four Wikipedia features are particularly attractive as a
mining source when building a large collection of named
entities (NEs):
Generic Named Entity Recognition
Generic named entity recognition only classifies a Wikipedia entry
as an entity or not. It starts by looking at the title of the entry: most
article titles are nouns, and the only nouns we are interested in are
the proper nouns.
Category Based Named-Entity Recognition
It is a subtask of information extraction that seeks to locate and classify
elements in text into predefined categories such as the names of persons,
organizations, locations, expressions of time, quantities and monetary values.
After a set of NEs has been identified, we want to find their synonyms.
We intend to use the internal links, redirects and disambiguation pages
for this, and we can easily extract all of these once we have the NEs.
This gives us a list of captions, all used on links to a particular entity.
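The caption-collection step can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes links and redirects have already been extracted as (caption, target) pairs, and the function name and sample data are hypothetical.

```python
from collections import defaultdict

def build_synonym_dictionary(links, redirects):
    """Map each entity title to the set of captions used to refer to it."""
    synonyms = defaultdict(set)
    # Internal links: the caption (anchor text) is a candidate synonym of the target.
    for caption, target in links:
        synonyms[target].add(caption)
    # Redirects: the redirecting title is a candidate synonym of the target.
    for source_title, target in redirects:
        synonyms[target].add(source_title)
    return dict(synonyms)

# Made-up input, only to show the shape of the data.
links = [("Big Blue", "IBM"), ("IBM", "IBM"),
         ("International Business Machines", "IBM")]
redirects = [("I.B.M.", "IBM")]
synonym_dict = build_synonym_dictionary(links, redirects)
```

Disambiguation pages are left out of this sketch; they could be fed in as additional (caption, target) pairs.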
Generic Named Entity Recognition Algorithm
To classify the entries we implemented an algorithm using the
following steps when given a title, T, and the text of an entry:
1. Remove any domain suffix from T
2. Tokenize T into n units, W = w1, w2, ..., wn
3. Remove any wi from W where wi is included in S
4. Classify as an entity if any of these conditions holds
• ∑ C(wi) = n and n >= 2
• ∑ D(wi) >= 2
• ∑ E(T)/N(T) >= α
A domain suffix is the text enclosed in parentheses that follows
the title of entries with multiple senses.
They are used to disambiguate between the senses, but
since they are not part of the entity name, we must first
strip them from the title. Next we strip every wi found
in S, a list of stop words.
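Steps 1 to 3 might look like the sketch below. The stop-word list is a placeholder, since the slides do not give the contents of S, and the function names are hypothetical. Tokenization is assumed to be a plain whitespace split.

```python
import re

# Placeholder stop-word list; the actual contents of S are not given.
STOP_WORDS = {"of", "the", "and", "in", "a", "an", "for"}

def strip_domain_suffix(title):
    """Remove a trailing parenthesised suffix, e.g. 'Mercury (planet)'."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", title)

def tokenize_title(title, stop_words=STOP_WORDS):
    """Steps 1-3: strip the suffix, split on whitespace, drop stop words."""
    tokens = strip_domain_suffix(title).split()
    return [w for w in tokens if w.lower() not in stop_words]

print(tokenize_title("Mercury (planet)"))
print(tokenize_title("University of Oslo"))
```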
1. C(w) = 1 if w is capitalized (some letter li of w is in [A..Z]),
0 otherwise
2. D(w) = 1 if Q >= 2, where Q = ∑ C(li) is the number of
capital letters li in w, 0 otherwise
In words: C is a function that returns 1 if the parameter is
capitalized and 0 otherwise, while D is a function that
returns 1 if the parameter has multiple capital letters and
0 otherwise. α is a threshold used for the third condition.
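The classification in step 4 can be sketched as below, under several assumptions. "Capitalized" is read here as "starts with an uppercase letter". E(T) and N(T) are not defined on the slides, so the third condition takes a precomputed ratio as a hypothetical parameter. The default value of alpha is also an assumption.

```python
def C(word):
    """1 if the word starts with an uppercase letter (assumed reading), else 0."""
    return 1 if word[:1].isupper() else 0

def D(word):
    """1 if the word contains two or more uppercase letters, else 0."""
    return 1 if sum(1 for ch in word if ch.isupper()) >= 2 else 0

def is_entity(tokens, ratio=None, alpha=0.75):
    """Apply the three conditions to a title's tokens (stop words removed).

    ratio stands in for E(T)/N(T), which the slides leave undefined;
    alpha=0.75 is an arbitrary illustrative threshold.
    """
    n = len(tokens)
    # Condition 1: every token is capitalized and there are at least two.
    if n >= 2 and sum(C(w) for w in tokens) == n:
        return True
    # Condition 2: at least two tokens have multiple capital letters.
    if sum(D(w) for w in tokens) >= 2:
        return True
    # Condition 3: the supplied ratio reaches the threshold.
    if ratio is not None and ratio >= alpha:
        return True
    return False

print(is_entity(["Barack", "Obama"]))   # condition 1 holds
print(is_entity(["history"]))           # no condition holds
```

Note that str.isupper also accepts non-ASCII capitals, which is slightly broader than the [A..Z] range in the definition.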
First we take the unigrams, bigrams and trigrams from our query.
We look them up in our synonym database and get a
list of doc_titles and corresponding doc_ids.
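A rough sketch of this lookup step follows. The synonym database is modelled as a plain dictionary from a lower-cased synonym to a list of (doc_title, doc_id) pairs; that layout, the function names and the sample entries are all assumptions made for illustration.

```python
def ngrams(tokens, max_n=3):
    """Return all unigrams, bigrams and trigrams of the tokens as strings."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def lookup_candidates(query, synonym_db):
    """Return (ngram, doc_title, doc_id) for every n-gram found in the database."""
    hits = []
    for gram in ngrams(query.lower().split()):
        for doc_title, doc_id in synonym_db.get(gram, []):
            hits.append((gram, doc_title, doc_id))
    return hits

# Made-up database entries, only to show the shape of the data.
synonym_db = {
    "big blue": [("IBM", 101)],
    "apple": [("Apple Inc.", 202), ("Apple (fruit)", 203)],
}
hits = lookup_candidates("Big Blue buys apple", synonym_db)
```

Here the query yields both senses of "apple" as candidates, which is why a later disambiguation step is needed.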
Next we look at the words in a window centered at the current
word, and at the candidate documents and their doc_ids
(the window size is set beforehand).
We use the vector space model to match our query document
to these candidates.
We pick the candidates whose score exceeds a preset
threshold. We then look up the category of each of these entities.
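The window-and-match step could be sketched like this. It uses raw term counts with cosine similarity as one simple form of the vector space model; the slides do not say which weighting was used, so that choice, the function names and the sample texts are assumptions.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words vector with raw term counts (no tf-idf weighting)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def context_window(tokens, index, size):
    """Tokens within `size` positions on either side of tokens[index]."""
    start = max(0, index - size)
    return tokens[start:index + size + 1]

def rank_candidates(context_tokens, candidates, threshold):
    """Keep candidates scoring above the threshold, best first."""
    query_vec = Counter(t.lower() for t in context_tokens)
    scored = []
    for doc_id, text in candidates.items():
        score = cosine_similarity(query_vec, term_vector(text))
        if score > threshold:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up example: disambiguate "apple" at position 2 with a window of 2.
tokens = "the new apple laptop has a fast processor".split()
window = context_window(tokens, 2, 2)
candidates = {
    202: "apple designs the laptop and phone",
    203: "apple fruit grows on a tree",
}
ranked = rank_candidates(window, candidates, threshold=0.3)
```

With these inputs only the candidate sharing several context words with the window passes the threshold.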
Zesch et al. evaluate the usefulness of Wikipedia as a lexical
semantic resource, and compare it to more traditional
resources, such as dictionaries, thesauri, semantic wordnets, etc.
Bunescu and Paşca study how to use Wikipedia for detecting
and disambiguating NEs in open domain text.
R. C. Bunescu and M. Pasca. Using encyclopedic knowledge for
named entity disambiguation. In Proceedings of
R. Schenkel, F. M. Suchanek, and G. Kasneci. YAWN: A semantically
annotated Wikipedia XML corpus. In Proceedings of
T. Zesch, I. Gurevych, and M. Mühlhäuser. Analyzing and
accessing Wikipedia as a lexical semantic resource. In
Proceedings of Biannual Conference of the Society for
Computational Linguistics and Language Technology, 2007.
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information
Retrieval. Addison Wesley, 1999.