
Information_retrieval_and_extraction_IIIT


Mining named entities from the Wikipedia dump on the basis of three parameters: Person, Location, and Organization.

Published in: Technology, Education

  1. 1. MINING NAMED ENTITIES FROM WIKIPEDIA
     GROUP MEMBERS: NIKHIL BAROTE, KUNJ THAKKAR, SHIVANI PODDAR, ANKIT SHARMA
  2. 2. • In many search domains, both contents and searches are frequently tied to named entities such as a person, a company, or similar.
     • One challenge from an information retrieval point of view is that a single entity can have more than one way of referring to it.
     • In this project we describe how to use Wikipedia contents to automatically generate a dictionary of named entities and synonyms that all refer to the same entity.
     • With our approach we can find named entities and their synonyms with a high degree of accuracy.
  3. 3. • There are four Wikipedia features that are particularly attractive as a mining source when building a large collection of NEs:
     1. INTERNAL LINKS
     2. REDIRECT LINKS
     3. EXTERNAL LINKS
     4. CATEGORIES
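The slides do not show how these four sources appear in the dump itself; as a rough illustration, the Python sketch below pulls them out of raw MediaWiki markup with simple regular expressions. The patterns and the function name are illustrative assumptions, not part of the project.

```python
import re

# Rough sketch (not from the slides): how the four mining sources typically
# appear in raw MediaWiki markup, extracted with simple regular expressions.
INTERNAL_LINK = re.compile(r"\[\[([^\]|#]+)(?:\|([^\]]*))?\]\]")      # [[Target|caption]]
REDIRECT = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)   # #REDIRECT [[Target]]
EXTERNAL_LINK = re.compile(r"\[(https?://\S+)(?:\s+[^\]]*)?\]")       # [http://... caption]
CATEGORY = re.compile(r"\[\[Category:([^\]|]+)\]\]")                  # [[Category:Name]]

def extract_features(wikitext):
    """Return the four feature types found in one article's wikitext."""
    return {
        "internal_links": [m.group(1) for m in INTERNAL_LINK.finditer(wikitext)],
        "redirects": REDIRECT.findall(wikitext),
        "external_links": EXTERNAL_LINK.findall(wikitext),
        "categories": CATEGORY.findall(wikitext),
    }
```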
  4. 4. • Generic Named Entity Recognition: This step only classifies a Wikipedia entry as an entity or not. It starts by looking at the title of the entry since, as mentioned earlier, most article titles are nouns, and the only nouns we are interested in are proper nouns.
     • Category-Based Named Entity Recognition: A subtask of information extraction that seeks to locate and classify elements in text into predefined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.
     • Synonym Extraction: After a set of NEs has been identified, we want to find their synonyms. We intend to use the internal links, redirects, and disambiguation pages for this, and we can easily extract all of these once we have the NEs. This gives us a list of captions, all used on links to a particular entity.
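As a companion to the synonym-extraction step, here is a hedged sketch of how link captions and redirect titles might be collected into a synonym dictionary. The data shapes (pages as (title, wikitext) pairs, a set of already-recognised NE titles) are assumptions, not something the slides specify.

```python
import re
from collections import defaultdict

LINK = re.compile(r"\[\[([^\]|#]+)(?:\|([^\]]*))?\]\]")             # [[Target|caption]]
REDIRECT = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def build_synonym_dict(pages, named_entities):
    """pages: iterable of (title, wikitext); named_entities: set of NE titles.
    Returns {entity_title: set of surface forms seen for that entity}."""
    synonyms = defaultdict(set)
    for title, wikitext in pages:
        # every caption used on an internal link to a known NE is a synonym
        for target, caption in LINK.findall(wikitext):
            if target in named_entities:
                synonyms[target].add(caption or target)
        # a redirect page's own title is a synonym of its target
        match = REDIRECT.search(wikitext)
        if match and match.group(1) in named_entities:
            synonyms[match.group(1)].add(title)
    return synonyms
```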
  5. 5. • Generic Named Entity Recognition Algorithm: To classify the entries we implemented an algorithm using the following steps, given a title T and the text of an entry:
     1. Remove any domain suffix from T
     2. Tokenize T into n units w1, w2, ..., wn
     3. Remove any wi from W where wi is included in S
     4. Classify as an entity if any of these conditions holds true:
        • ∑ C(wi) = n and n ≥ 2
        • ∑ D(wi) ≥ 2
        • E(T)/N(T) ≥ α
     • A domain suffix is the text enclosed in parentheses that follows the title of entries with multiple senses.
  6. 6. • Domain suffixes are used to disambiguate between the senses, but since they are not part of the entity name, we must first strip them from the title. Next we strip all wi which are found in S, a list of stop words.
     1. C(w) = 1 if its first letter l1 ∈ [A..Z], 0 otherwise
     2. D(w) = 1 if ∑ C(li) ≥ 2 over the letters li of w, 0 otherwise
     • In other words, C is a function that returns 1 if the parameter is capitalized, and 0 otherwise, while D is a function that returns 1 if the parameter has multiple capital letters, and 0 otherwise. α is a variable used as the threshold for the third condition.
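Putting slides 5 and 6 together, a minimal Python sketch of the classification step might look like the following. The stop-word list, the value of α, and the reading of E(T)/N(T) as the fraction of capitalized in-text occurrences of the title are our assumptions, since the slides do not spell them out.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "for"}  # S: illustrative subset only
ALPHA = 0.75  # alpha threshold; the actual value is not given on the slides

def C(w):  # 1 if the first letter is in [A..Z], 0 otherwise
    return 1 if w[:1].isupper() else 0

def D(w):  # 1 if the token contains at least two capital letters, 0 otherwise
    return 1 if sum(1 for ch in w if ch.isupper()) >= 2 else 0

def capitalization_ratio(title, text):
    # assumed reading of E(T)/N(T): fraction of occurrences of the title
    # in the entry text that appear with a capitalized first letter
    occurrences = re.findall(re.escape(title), text, flags=re.IGNORECASE)
    if not occurrences:
        return 0.0
    return sum(C(o) for o in occurrences) / len(occurrences)

def is_named_entity(title, text):
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title)                 # 1. strip domain suffix
    tokens = [w for w in title.split()                             # 2. tokenize into w1..wn
              if w.lower() not in STOP_WORDS]                      # 3. drop stop words
    n = len(tokens)
    if n >= 2 and sum(C(w) for w in tokens) == n:                  # sum C(wi) = n and n >= 2
        return True
    if sum(D(w) for w in tokens) >= 2:                             # sum D(wi) >= 2
        return True
    return capitalization_ratio(title, text) >= ALPHA              # E(T)/N(T) >= alpha
```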
  7. 7. Search System
     • First we take unigrams, bigrams, and trigrams from our query document.
     • We look them up in our synonym database and get a list of doc_titles and their corresponding doc_ids.
     • Then we look at the words in a window centered at the current word and collect the candidate documents and their doc_ids (the window size is set beforehand).
     • We use a vector space model to match our query document against these candidates.
     • We pick the candidates whose score is greater than a preset threshold. Finally, we look up the category for these entities in our database.
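A hedged Python sketch of this pipeline: n-grams of the query document are looked up in the synonym database to gather candidate doc_ids, and a simple cosine-similarity vector space model scores them against a threshold. The window-centred narrowing step is omitted for brevity, and all names, data shapes, and the threshold value are illustrative assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine(a, b):
    # vector space model score between two term-frequency vectors
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query_tokens, synonym_db, doc_texts, threshold=0.2):
    """query_tokens: tokenized query document;
    synonym_db: {surface form: set of doc_ids};
    doc_texts: {doc_id: tokenized document text}."""
    candidates = set()
    for n in (1, 2, 3):                                    # unigrams, bigrams, trigrams
        for gram in ngrams(query_tokens, n):
            candidates |= synonym_db.get(gram, set())
    query_vec = Counter(query_tokens)
    scored = {doc_id: cosine(query_vec, Counter(doc_texts[doc_id])) for doc_id in candidates}
    return sorted(((d, s) for d, s in scored.items() if s >= threshold),
                  key=lambda item: -item[1])
```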
  8. 8. • Zesch et al. evaluate the usefulness of Wikipedia as a lexical semantic resource and compare it to more traditional resources such as dictionaries, thesauri, semantic wordnets, etc.
     • Bunescu and Pașca study how to use Wikipedia for detecting and disambiguating NEs in open-domain text.
  9. 9. • R. C. Bunescu and M. Pașca. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of EACL 2006, 2006.
     • R. Schenkel, F. M. Suchanek, and G. Kasneci. YAWN: A semantically annotated Wikipedia XML corpus. In Proceedings of BTW 2007, 2007.
     • T. Zesch, I. Gurevych, and M. Mühlhäuser. Analyzing and accessing Wikipedia as a lexical semantic resource. In Proceedings of the Biannual Conference of the Society for Computational Linguistics and Language Technology, 2007.
     • R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
  10. 10. THANK YOU!
