Information extraction is the task of pulling meaningful data out of unstructured or semi-structured text. Machine learning techniques such as rule induction can be used to generate wrappers automatically: programs that extract specific data fields from web pages. Wrappers are trained on examples to learn the format of the pages and extract the targeted information. However, wrappers are hard to maintain because websites change, and current methods for unsupervised induction and knowledge representation are limited.
1. Information Extraction from the WWW using Machine Learning Techniques Lee McCluskey, Dept of Informatics email: lee@hud.ac.uk
7. Information Extraction from The Web
   [Diagram: WEB PAGES -> WRAPPERS -> STRUCTURED DATA]
   Example structured data (column headings not present in the slide text):
   BA    red   555  sue
   MSc   red   123  dave
   PhD   grey  345  bill
   BSc   blue  664  tom
9. Example of Automated Extraction
   Source: HTML ===( wrapper )===> Destination: XML
   NB: XML + schema + recognised names

   Source (HTML):
     <h1> Residential Housing </h1>
     <ul> House For Sale
     <li> location: Hebden Bridge
     <li> agent-phone: 01422 843222
     <li> listed-price: £350,000
     <li> comments: Bijou residence on the edge of this popular little town...
     </ul>
     <hr>
     <ul> House For Sale ... </ul>
     ...

   Destination (XML):
     <residential>
       <house>
         <location>
           <city> Hebden Bridge </city>
           <county> West Yorkshire </county>
           <country> UK </country>
         </location>
         <agent-phone> 01422 843222 </agent-phone>
         <listed-price> £350,000 </listed-price>
         <comments> Bijou residence on the edge of this popular little town... </comments>
       </house>
       ...
     </residential>
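A hand-written wrapper for the housing page on slide 9 might look roughly like the sketch below. It is an illustration only, not the method from the talk: the regular expressions, the function name `wrap`, and the decision to emit `location` as a single flat element are all assumptions. Splitting the location into city, county and country, as the destination XML does, would need background knowledge that the HTML page does not contain.

```python
import re
import xml.etree.ElementTree as ET

# Sample page modelled on the slide's HTML source (one listing only).
HTML = """<h1> Residential Housing </h1>
<ul> House For Sale
<li> location: Hebden Bridge
<li> agent-phone: 01422 843222
<li> listed-price: £350,000
<li> comments: Bijou residence on the edge of this popular little town...
</ul>
<hr>
"""

# Hypothetical extraction rules: one <ul>...</ul> per house,
# one "<li> name: value" item per field.
LISTING = re.compile(r"<ul>(.*?)</ul>", re.DOTALL)
FIELD = re.compile(r"<li>\s*([\w-]+):\s*(.*?)\s*(?=<li>|$)", re.DOTALL)

def wrap(html):
    """Turn every listing in the HTML page into a <house> element."""
    root = ET.Element("residential")
    for block in LISTING.findall(html):
        house = ET.SubElement(root, "house")
        for name, value in FIELD.findall(block):
            # The field label on the page is reused as the XML tag name.
            ET.SubElement(house, name).text = value
    return root

root = wrap(HTML)
fields = {child.tag: child.text for child in root[0]}
print(ET.tostring(root, encoding="unicode"))
```

The rules are tied to the exact page layout, which is why a change to the site (for example, swapping the list for a table) would break the wrapper and why learning such rules from examples is attractive.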
Editor's Notes
04/26/10 Points to make:
1) XML is an extensible markup language for describing structured data.
2) XML is similar to HTML in that
   -- they are both markup languages, descending from SGML
   -- they both use tags
3) XML differs from HTML in that
   -- XML tags on data elements identify the meaning of the data, rather than specifying how the data should be formatted, as in HTML. XML therefore separates the three components of documents: content, structure, and presentation.
   -- relationships among data elements are provided via simple nesting
The example should make these points clear. It shows data from the same source as published in HTML and in XML. Note that even though the XML document is more verbose, it provides the information in a far more convenient and usable format from a data management perspective.
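The data management point can be seen in a minimal sketch that queries the XML version of the housing example with Python's standard `xml.etree.ElementTree` module. The document string below is a trimmed copy of the slide's XML (whitespace inside elements removed, comments field omitted), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

XML = """<residential>
  <house>
    <location>
      <city>Hebden Bridge</city>
      <county>West Yorkshire</county>
      <country>UK</country>
    </location>
    <agent-phone>01422 843222</agent-phone>
    <listed-price>£350,000</listed-price>
  </house>
</residential>"""

root = ET.fromstring(XML)

# Tags name the meaning of each value, and nesting gives the relationship
# between them, so a field is reached by its path rather than by its
# position on the page.
cities = [house.findtext("location/city") for house in root.iter("house")]
counties = [house.findtext("location/county") for house in root.iter("house")]
print(cities, counties)
```

Getting the same values from the HTML version would require page-specific pattern matching, because its tags describe only presentation.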