Web Information Retrieval and Mining

Talk based on: Ricardo Baeza-Yates and Carlos Castillo: “Web Retrieval and Mining”. Entry in “Encyclopedia of Library and Information Sciences”, third edition (to appear in 2009).

  1. Web Retrieval and Mining Overview
     Source: Ricardo Baeza-Yates and Carlos Castillo: “Web Retrieval and Mining”. Entry in “Encyclopedia of Library and Information Sciences”, third edition (to appear in 2009).
  2. Information Retrieval
     • Methods for finding information in documents
       • Started in the 1970s and 1980s
     • “Methods”
       • Algorithms and heuristics
     • “Finding”
       • Query – Document, Document – Document, etc.
     • “Documents”
       • Texts
  3. The Web is different
     • Massive
       • Thousands of millions of documents
     • Dynamic
       • Updates
       • Deletes
     • Distributed
       • Variable quality
       • Malicious behavior
  4. Web IR topics
     • Web Search
       • Crawling
       • Indexing
       • Querying
     • Web Mining
     • Adversarial Web IR
     • Distributed Web IR
     • Evaluation
  5. Web search
  6. Main goals
     • Precision
       • Relevant documents returned / Documents returned
     • Recall
       • Relevant documents returned / Relevant documents
     • Freshness
     • Performance/scalability
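
The two ratios on this slide can be computed directly from sets of document IDs. The following is a minimal sketch; the function name `precision_recall` and the document IDs are illustrative assumptions, not part of the talk.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for returned vs. relevant document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(
    {"d1", "d2", "d3", "d4"},
    {"d1", "d2", "d3", "d5", "d6", "d7"},
)
print(p, r)
```

Here precision is 3/4 and recall is 3/6. On the Web the full set of relevant documents is rarely known, so recall can usually only be estimated.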
  7. Main goals
  8. Two phases of search
     • Off-line
       • Crawling and indexing
     • On-line
       • Querying and ranking
  9. Search phases
  10. Web crawling
     • Download pages following rules
     • Applications
       • Create index for search
       • Find particular information items
       • Find/report problems
     • Constraints
       • Robot exclusion protocol and politeness
       • Deep web
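
The robot exclusion and politeness constraints can be illustrated with Python's standard `urllib.robotparser` module. This is only a sketch: the robots.txt content, the `ExampleBot` user agent, and the `example.com` URLs are assumptions made up for the example, and a real crawler would fetch the file from each site before downloading its pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (a crawler would download this per site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_policy(robots_txt):
    """Parse robots.txt text into a policy object the crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)
allowed = policy.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = policy.can_fetch("ExampleBot", "http://example.com/private/data.html")
delay = policy.crawl_delay("ExampleBot")  # seconds to wait between requests
print(allowed, blocked, delay)
```

A polite crawler would skip the disallowed URL and wait at least `delay` seconds between requests to the same host.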
  11. Web indexing
     • Logical view
       • Tokenization
       • Stopword removal
       • Stemming
     • Creation of an inverted index
  12. Inverted index
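
The indexing steps from the previous slide can be sketched as a small in-memory inverted index that maps each term to the sorted list of documents containing it. The stopword list, the sample documents, and the function names are illustrative assumptions, and stemming is left out to keep the sketch short.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "is", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each non-stopword term to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            if term not in STOPWORDS:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_and(index, query):
    """Return the documents containing every non-stopword term of the query."""
    postings = [set(index.get(t, [])) for t in tokenize(query) if t not in STOPWORDS]
    return sorted(set.intersection(*postings)) if postings else []


docs = {1: "The Web is massive", 2: "Crawling the Web", 3: "Indexing and crawling"}
index = build_inverted_index(docs)
print(index["web"], search_and(index, "crawling web"))
```

Answering a conjunctive query only touches the posting lists of the query terms, which is why the inverted index is the core structure for the on-line phase.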
  13. Challenges of indexing
     • Index compression
     • Efficiency in top-K searches
       • Sorting
     • Index distribution
       • By terms
       • By documents
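
One common way to avoid sorting every candidate when only the K best results are needed is a bounded heap. The sketch below uses the standard `heapq.nlargest`; the scores are made-up values and this is not necessarily the technique the talk had in mind.

```python
import heapq


def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical relevance scores for four documents.
scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}
best = top_k(scores, 2)
print(best)
```

For N candidates this costs roughly O(N log K) instead of the O(N log N) of a full sort, which matters when K is small and N is large.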
  14. Web querying and ranking
     • Keyword-based search is the dominant paradigm
       • No large-scale open-domain QA systems (yet)
     • Relevance
       • Vector space model and variants
     • Query expansion
     • Latent semantic indexing
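
As a rough illustration of the vector space model, documents and the query can be represented as TF-IDF weighted term vectors and ranked by cosine similarity. The weighting variant used here (raw term frequency times `log(N/df) + 1`), the whitespace tokenizer, and the sample documents are assumptions chosen for brevity; many other variants exist.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build a sparse TF-IDF vector per document and return (vectors, idf)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(tokenized)
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # assumed IDF variant
    vectors = {
        d: {t: c * idf[t] for t, c in Counter(toks).items()}
        for d, toks in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document IDs ordered by decreasing similarity to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query.lower().split()).items()}
    return sorted(docs, key=lambda d: cosine(q, vectors[d]), reverse=True)


docs = {"d1": "web search engines", "d2": "web mining", "d3": "cooking recipes"}
ranking = rank("web search", docs)
print(ranking)
```

The document sharing both query terms ranks first, the one sharing only the common term "web" second, and the unrelated one last.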
  15. Web ranking
     • Quality is the main problem
     • Link ranking
       • Hypothesis 1: Topical locality of links
       • Hypothesis 2: Link implies endorsement
     • PageRank
     • HITS
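
PageRank turns the "link implies endorsement" hypothesis into a score by simulating a random surfer over the link graph. The following is a minimal power-iteration sketch; the damping factor of 0.85, the fixed iteration count, the uniform handling of pages without out-links, and the three-page graph are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration over an adjacency dict."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Page without out-links: spread its rank over all pages.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


# Hypothetical graph: a links to b and c, b links to c, c links back to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page c collects links from both other pages and ends up with the highest score; the scores always sum to 1.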
  16. HITS
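
HITS assigns each page two scores: a hub score (it points to good authorities) and an authority score (it is pointed to by good hubs), refined by mutual reinforcement. This sketch assumes a fixed number of iterations and Euclidean normalization; the graph and page names are made up. In practice HITS is usually run on a query-dependent subgraph rather than the whole Web.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
    return {k: x / norm for k, x in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for an adjacency dict."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking to the page.
        auth = {v: 0.0 for v in nodes}
        for u, targets in links.items():
            for v in targets:
                auth[v] += hub[u]
        auth = normalize(auth)
        # Hub update: sum of authority scores of the pages it links to.
        hub = normalize({u: sum(auth[v] for v in links.get(u, [])) for u in nodes})
    return hub, auth


# Hypothetical graph: three hub pages pointing at two authority pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "h3": ["a1", "a2"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```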
  17. Rank manipulation
     • “The bubble of Web visibility”
     • Content spam
       • Keyword stuffing
       • Content hiding
     • Link spam
       • Link farms
     • Cloaking
  18. Web mining
  19. Content mining
     • Extraction of knowledge from Web pages
       • BUT ... HTML is physical formatting
       • There is information loss
  20. Information loss
  21. Aspects of content mining
     • Information extraction
       • Reverse the information loss
     • Content classification
       • Topic
       • Genre
     • Sentiment analysis
  22. Link mining
     • Scale-free networks
  23. Macroscopic view
     • Bow-tie structure
  24. Usage mining
     • Logfile analysis
     • Query logs
     • Privacy issues
  25. Emerging topics
     • Mobile Web
     • Semantic Web
     • ...
