Mining Product Synonyms - Slides

Group - 5 | Project - 2
Information Retrieval and Extraction, Spring 2014
IIIT Hyderabad

Published in: Software, Technology

  1. Mining Product Synonym: Information Retrieval and Extraction project. Presented by: Vrishank Shete (201305642), Mohd. Salman Khan (201305513), Ankush Jain (201101010), Suprabh Shukla (201001082). Guided by: Priya Radhakrishnan. Computer Science and Engineering, International Institute of Information Technology.
  2. Introduction. Problem statement: given an entity query, find the canonical terms by which the entity can be distinguished. Context: forms of web queries on structured data, and the gap between how users phrase queries and how creators describe entities. E.g., a user may query "Harry Potter 6" when searching for "Harry Potter and the Half-Blood Prince".
  3. Related Works
     • String similarity measures:
       ◦ Levenshtein string similarity function
       ◦ Dice coefficient
       ◦ Jaccard string similarity function
     • "Exploiting Web Search to Generate Synonyms for Entities" by Surajit Chaudhuri, Venkatesh Ganti, Dong Xin
  4. System Components
     • Extracting IDTokenSets using documents from web search.
     • Expanding IDTokenSets using the p-window context.
     • Searching for possible canonical names in a pre-crawled list.
     • Validating canonical names against web documents.
  5. Algorithm
       1: Let Le = Pe; // all subsets of e
       2: while (Le is not empty)
       3:   Te = getnext(Le);
       4:   Submit Te to W, and retrieve W(Te);
       5:   if (corr(Te, e, W(Te)) ≥ θ) // Te is an IDTokenSet
       6:     Report Te and all its supersets as IDTokenSets;
       7:     Remove Te and all its supersets from Le;
       8:   else // Te is not an IDTokenSet
       9:     Remove Te and its subsets from Le;
       10: return.
     Here the correlation function (corr) estimates how important Te is to the current document.
  6. Algorithm
       11. After getting the substrings, we look for evidence using the string similarity measures, with thresholds taken from our data set: levenValue (<= 0.95), jaccard (> 0.10) && dice (> 0.20).
       12. After the filtering in step 3, we filter again with the correlation method described above. (Step 12 yields all mentions and all strings that match the mentions. These strings may or may not be canonical names.)
       13. For every mention obtained in step 12, we store all strings in its p-window context, taken from the search engine results already stored in steps 1-10.
       14. We count the number of times each word occurs across all strings from step 13.
       15. We take the top K words from the count hash and search for them in all the strings from step 12 (which may or may not be part of canonical names).
       16. We match the words from step 15 against the strings from step 12. The best-matching string is our canonical string and our synonym (the desired result).
  7. Block Diagram
  8. An Example
  9. Challenges
     • Web documents are highly unstructured. The query string can appear anywhere, and in any form, in a document. This case is handled using the p-window context in which the string is expected to appear.
     • Web search engines do not allow frequent automated queries at short intervals from a program. We introduce a delay of 2 seconds between two queries, which makes searching somewhat slower but serves our purpose.
  10. Cons: Time for web search. Less usable data from web search.
  11. References
     • "Exploiting Web Search to Generate Synonyms for Entities" by Surajit Chaudhuri, Venkatesh Ganti, Dong Xin.
     • "Entity Synonyms for Structured Web Search" by Tao Cheng, Hady W. Lauw, and Stelios Paparizos.
  12. Thank you
  13. Q & A
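
The three string similarity measures named on slide 3 can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code: the slides do not say whether the measures work on characters or on tokens, so the choices below (character edits for Levenshtein, character bigrams for Dice, whitespace tokens for Jaccard) and all function names are assumptions.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def char_bigrams(s: str) -> set:
    """Set of adjacent character pairs (assumed unit for the Dice coefficient)."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_coefficient(a: str, b: str) -> float:
    """2 * |X & Y| / (|X| + |Y|) over character bigram sets."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard_similarity(a: str, b: str) -> float:
    """|X & Y| / |X | Y| over whitespace-separated token sets (an assumption)."""
    x, y = set(a.lower().split()), set(b.lower().split())
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)
```

With these definitions, the query "harry potter 6" and the title "harry potter and the half blood prince" share two of eight distinct tokens, which illustrates why a low Jaccard threshold such as the 0.10 on slide 6 can still admit useful candidates.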
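
The IDTokenSet search on slide 5 might look as follows in Python. This is a sketch under stated assumptions rather than the authors' implementation: `search` stands in for the web search W, `corr` and `theta` are supplied by the caller, the empty subset is skipped, and `getnext` is taken to mean "smallest subsets first". None of these details are given in the slides.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Set

TokenSet = FrozenSet[str]


def find_id_token_sets(
    entity: str,
    search: Callable[[str], List[str]],            # hypothetical stand-in for W
    corr: Callable[[TokenSet, str, List[str]], float],
    theta: float,
) -> Set[TokenSet]:
    """Enumerate token subsets of `entity` and keep those whose correlation
    with the entity, judged on the retrieved documents, reaches `theta`."""
    tokens = list(dict.fromkeys(entity.lower().split()))  # dedupe, keep order

    # Line 1 of the pseudocode: Le = all subsets of e (non-empty ones here).
    candidates = [frozenset(c)
                  for r in range(1, len(tokens) + 1)
                  for c in combinations(tokens, r)]
    remaining = set(candidates)
    id_token_sets: Set[TokenSet] = set()

    # Assumed getnext order: smaller subsets first, ties broken alphabetically.
    for te in sorted(candidates, key=lambda s: (len(s), sorted(s))):
        if te not in remaining:
            continue  # already decided through a subset or superset
        query = " ".join(t for t in tokens if t in te)
        docs = search(query)
        if corr(te, entity, docs) >= theta:
            # Lines 6-7: te and every remaining superset are IDTokenSets.
            supersets = {s for s in remaining if te <= s}
            id_token_sets |= supersets
            remaining -= supersets
        else:
            # Line 9: te and its remaining subsets are discarded.
            remaining -= {s for s in remaining if s <= te}
    return id_token_sets
```

Because an accepted subset also settles all of its supersets, visiting small subsets first tends to save search queries, which matters given the query delay mentioned on slide 9.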
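
Steps 13 to 16 on slide 6 (p-window contexts, word counts, top K words, best match) can be illustrated with the sketch below. The slides do not define the p-window precisely, so it is assumed here to be the `p` tokens on each side of a mention, excluding the mention itself. The function names and the overlap-count scoring are also assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def p_window_contexts(document: str, mention: str, p: int) -> List[List[str]]:
    """Step 13: for each occurrence of `mention`, collect up to p tokens
    before and p tokens after it."""
    doc, m = tokenize(document), tokenize(mention)
    contexts = []
    if not m:
        return contexts
    for i in range(len(doc) - len(m) + 1):
        if doc[i:i + len(m)] == m:
            before = doc[max(0, i - p):i]
            after = doc[i + len(m):i + len(m) + p]
            contexts.append(before + after)
    return contexts


def top_k_words(contexts: List[List[str]], k: int) -> List[str]:
    """Steps 14-15: count every context word and keep the k most frequent."""
    counts = Counter(word for context in contexts for word in context)
    return [word for word, _ in counts.most_common(k)]


def best_canonical(candidates: List[str], words: List[str]) -> Optional[str]:
    """Step 16: pick the candidate string sharing the most tokens with `words`."""
    wanted = set(words)
    return max(candidates,
               key=lambda c: len(wanted & set(tokenize(c))),
               default=None)
```

A real pipeline would probably drop stop words before counting, since words such as "the" can otherwise crowd out informative context words. The slides do not say whether the project did so.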
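
The 2-second delay between queries described on slide 9 could be enforced with a small wrapper like the one below. This is a sketch only: `ThrottledSearch` and the wrapped `search` callable are hypothetical names, and the slides do not show how the project actually implemented the delay.

```python
import time
from typing import Callable, List, Optional


class ThrottledSearch:
    """Wraps a search function so that consecutive calls are spaced
    at least `min_interval` seconds apart."""

    def __init__(
        self,
        search: Callable[[str], List[str]],   # hypothetical search backend
        min_interval: float = 2.0,            # delay taken from slide 9
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._search = search
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def __call__(self, query: str) -> List[str]:
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            wait = self._min_interval - elapsed
            if wait > 0:
                self._sleep(wait)  # pause only for the remaining time
        result = self._search(query)
        self._last_call = self._clock()
        return result
```

Passing the clock and sleep functions as parameters keeps the wrapper testable without real waiting. In normal use the defaults apply, and the wrapper only sleeps for whatever part of the interval has not already elapsed.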