
Web Information Extraction for the DB Research Domain

A presentation describing my final project for an engineering degree at the Hebrew University of Jerusalem - a system for extracting information from web sites into instances of an XML schema, utilizing machine learning, structural analysis of documents and a divide & conquer strategy.

  1. Web Information Extraction for the DB Research Domain
     Michael Genkin (mishagenkin@cs.huji.ac.il), Liat Kakun (liat.kakun@mail.huji.ac.il)
     School of Engineering and Computer Science
     Advisor: Dr. Sara Cohen
  2. Introduction
       • A wealth of information is available online.
           • Too much for humans to handle effectively.
           • Mostly inaccessible to computers.
       • A web information extraction project.
           • Provide a complete, domain-specific system.
           • Allow structured queries on top of web information.
           • Part of research on developing tools to support scientific policy management at the HUJI DB Group.
               • Advisor: Dr. Sara Cohen
               • Other groups are creating further components: a web crawler and a UI.
  3. Introduction
       • Extract information from the web sites of DB research projects.
           • Domain specific
           • Divide & conquer
           • Structural document analysis
           • Linguistic analysis
           • Machine learning
       • The domain is encoded in an XML schema document.
           • It contains processing instructions as well as domain semantics.
       • The result is an XML-based, queryable database.
  4. Methods – Structural Analysis #1
     Transform each input document into a structurally valid, monolithic document, using industry-standard tools such as HTML Tidy and Readability.
     [Figure: a sample document before and after the transformation.]
  5. Methods – Structural Analysis #2
       • Vertically segment each document into logical blocks.
       • Employ stack-based style analysis to identify each of the blocks.
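One way to picture the stack-based segmentation on slide 5 is the minimal sketch below. It is not the authors' implementation: the `HEADING_STYLES` set, the `BlockSegmenter` class and the rule "text inside a heading-like tag starts a new block" are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags mark "heading-like" styles that open a new block.
HEADING_STYLES = {"h1", "h2", "h3", "h4", "b", "strong"}
# Tags that never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class BlockSegmenter(HTMLParser):
    """Splits a page into blocks, using a stack of currently open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.blocks = []  # each block: {"title": str, "text": [str, ...]}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if any(t in HEADING_STYLES for t in self.stack):
            # Text styled as a heading starts a new logical block.
            self.blocks.append({"title": text, "text": []})
        elif self.blocks:
            self.blocks[-1]["text"].append(text)
        else:
            # Text before any heading goes into an untitled block.
            self.blocks.append({"title": "", "text": [text]})


def segment(html):
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `segment(page_html)` on a tidied page returns the blocks in document order. A real system would presumably also look at fonts, sizes and CSS rather than tag names alone.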
  6. Methods – Classification
     Employ multiclass classification (by vector similarity) to map the logical document blocks to the appropriate schema elements.
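Classification by vector similarity, as on slide 6, could look roughly like the sketch below. The slides do not say which features or similarity measure were used, so the bag-of-words vectors, the cosine measure and the label names in the test data are assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (assumed feature representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """examples: iterable of (block_text, schema_element) pairs.
    Returns one summed term vector per schema element."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(text, centroids):
    """Assign the block to the schema element with the most similar vector."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

Each tagged block contributes to the vector of its schema element, and a new block is mapped to whichever element's vector it resembles most.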
  7. Methods – Pattern Recognition
     Mine likely candidate blocks for patterns using the PAT tree algorithm, adjusted to find a maximum-likelihood pattern.
     Example pattern: .//bibliography/ul/li/*
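The example pattern on slide 7 happens to be valid in the limited XPath dialect of Python's `xml.etree.ElementTree`, so applying an already-mined pattern can be sketched as below. This does not implement the PAT tree mining step itself. The helper name `match_pattern` is hypothetical, and the sample XML in the usage note is invented for illustration.

```python
import xml.etree.ElementTree as ET

# The pattern shown on the slide.
PATTERN = ".//bibliography/ul/li/*"


def match_pattern(xml_text, pattern=PATTERN):
    """Return the elements of a well-formed document that match the pattern."""
    root = ET.fromstring(xml_text)
    return root.findall(pattern)
```

For a document whose `bibliography` block holds a `ul` of `li` entries, `match_pattern` returns every child element of every entry, in document order. Those elements are the candidate fields for later metadata extraction.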
  8. Methods – Metadata Extraction
     Use a CRF (conditional random field) to extract additional metadata where appropriate (e.g. from bibliographic lists).
  9. Results – Setting
       • 50 web pages of DB research projects from American and Israeli universities.
           • Chosen manually to represent a wide variety of web page styles.
       • All pages were pre-processed by our system to analyze their structure, then manually tagged for classification, patterns and metadata.
       • 20% of the dataset is randomly sampled for training purposes.
           • This is repeated 5 times and the results are averaged.
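The sampling protocol on slide 9 can be sketched as a repeated random hold-out. The slide does not say what the remaining 80% is used for, so treating it as the test set is an assumption, as are the function name, its parameters and the `evaluate` callback.

```python
import random


def repeated_holdout(items, evaluate, train_fraction=0.2, repeats=5, seed=0):
    """Average evaluate(train, test) over several random splits.

    evaluate is a caller-supplied function returning a numeric score.
    Assumption: items not sampled for training form the test set.
    """
    rng = random.Random(seed)
    items = list(items)
    scores = []
    for _ in range(repeats):
        shuffled = items[:]
        rng.shuffle(shuffled)
        cut = max(1, round(len(shuffled) * train_fraction))
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

With 50 tagged pages and the defaults, each of the 5 rounds trains on 10 pages and evaluates on the other 40.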
  10. Results – Measures
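The content of slide 10 was not recovered. Since the results are reported as precision, recall and accuracy, the sketch below assumes the standard definitions of those measures. It may not match the slide's exact formulas.

```python
def precision_recall(predicted, relevant):
    """Standard set-based precision and recall.

    predicted: items the system extracted; relevant: manually tagged items.
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def accuracy(predicted_labels, true_labels):
    """Fraction of blocks whose predicted label equals the tagged label."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels) if true_labels else 0.0
```

Precision penalizes spurious extractions and recall penalizes missed ones. Accuracy suits the classification step because every block receives exactly one label.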
  11. Results
       • Pattern recognition: precision 85%, recall 89.7%
       • Classification: accuracy 82.5%
  12. Conclusions
       • This is a feasible approach for creating a web information extraction system.
       • Good results can be achieved with a relatively small sample.
       • The modular system design allows easy adaptation to additional domains.
       • Future directions:
           • Schema generation
           • Better information integration
           • Additional modules (e.g. deep linguistic analysis)
  13. Questions?
