A Framework for Extraction Plans and Heuristics in an
      Ontology-Based Data-Extraction System




                    A Thesis Proposal
                     Presented to the
              Department of Computer Science
                Brigham Young University


                    In Partial Fulfillment
             of the Requirements for the Degree
                      Master of Science



                    By Alan Wessman
                       March 2003
I.Introduction
    The proposed thesis deals with the subject of data extraction from unstructured or semi-
    structured documents such as Web pages. Specifically, the proposed thesis is concerned
    with abstracting the extraction process and related heuristics for an ontology-based data-
    extraction system. As such, it is closely connected with the Ontos research project
    currently underway in the Data Extraction Group at Brigham Young University.

    The fundamental tasks of data extraction are to identify and extract relevant information
    within a semi-structured or unstructured document, and to map the identified data to a
    more formal structure to permit easier manipulation of the data [3,4,6,9,18,24]. Data
    extraction is particularly important for obtaining information from the World Wide Web,
    which, although entirely unpredictable in how its documents present useful data, is also
    the largest and most accessible repository of information.

    Perhaps the most common solution to the data-extraction problem is the construction and
    use of wrappers [1,2,3,9,12,14,19,24], which are essentially templates that use clues in a
    document’s structure and syntax to locate useful data. Once filled with data, the “holes”
    or slots in the wrapper are mapped to fields of a common semantic model, such as a
    data model in a relational database. But wrappers suffer from an important weakness:
    being based on the structure and syntax of the document, they become invalid when the
    document structure or presentation changes, even if the information contained in the
    document is still relevant. This is of course a major shortcoming in Web-based data
    extraction, since Web documents are notorious for their tendency to change suddenly and
    drastically. Even without the problem of evolving pages, though, wrappers are still
    specific to a narrow set of documents, rather than applying to all documents belonging to
    an application domain. Thus an application may require the ongoing construction of new
    wrappers as additional data sources are discovered. Roadrunner [3] is one approach that
    seeks to mitigate this problem by automating the process of wrapper construction.
    However, the Roadrunner method only works for fairly regular data sources.

    The BYU-Ontos approach [6] is a data-extraction solution that automatically
    compensates for structural variation across data sources, as well as variation over time
    within a single data source. In
    contrast with the Roadrunner approach, BYU-Ontos also works for highly unstructured
    data sources. It is centered on a human-designed ontology that describes the domain of
    interest and provides certain general sets of patterns (termed “data frames”, from [5]) for
    recognizing elements of the domain. Independent of any particular document format or
    syntax, the recognition patterns of the ontology detect strings and symbols using lexicons
    and regular expressions. Recognized tokens are mapped to elements of the ontology, or
    are discarded using ontology constraints and a set of built-in heuristics. BYU-Ontos
    (hereafter “Ontos”) has been shown to be quite successful in extracting data for several
    ontologically narrow applications, such as used car advertisements and obituaries. Its
    precision and recall results are competitive with those of other approaches, and it is
    resilient in the face of document change and data source variety, since its rules are
    constructed independent of any single document or format.
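As a sketch of how such format-independent recognition might look, a data-frame value recognizer can be reduced to a regular expression applied to raw document text. The class name and pattern below are hypothetical illustrations, not the actual Ontos code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a data-frame value recognizer: a regular
// expression applied to raw document text, independent of any
// particular document format or syntax.
class DataFrameRecognizer {
    private final Pattern valuePattern;

    DataFrameRecognizer(String regex) {
        this.valuePattern = Pattern.compile(regex);
    }

    // Return every substring of the input matching the value pattern.
    List<String> recognize(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = valuePattern.matcher(text);
        while (m.find())
            matches.add(m.group());
        return matches;
    }
}
```

For instance, a simplified price recognizer for used-car advertisements might be constructed as `new DataFrameRecognizer("\\$\\d{1,3}(,\\d{3})*")`, matching values such as `$4,500` wherever they appear, regardless of the surrounding markup.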

    But Ontos in its current incarnation has several limitations. One notable problem is that
    several discrepancies exist between the Ontos specification and its implementation. In
    particular, the Ontos implementation is sensitive to the order in which object sets and


relationship sets are given in the domain ontology; such a dependency does not exist in
the specification. This results in errors such as the one depicted in Figure 1 below. In
this case, the extracted date matches with two object sets: Funeral Date and
Marriage Date. Despite the nearby keyword indicating that the date is a Funeral
Date, the system extracts it as a Marriage Date. This error is due to the one-by-one
processing of object sets in the order of their appearance in the ontology. Since
Marriage Date was processed first and yielded a match on the date value, it accepted
the value and excluded it from the correct Funeral Date object set.
Simply changing the order of the object sets in the ontology will not solve the general
problem, for it is easy to conceive of a scenario in which one record has only a funeral
date and another has only a marriage date. In such a case, the ordering will be wrong
for one of the records.




Figure 1: Example of errors due to object-set order dependencies.
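To make the desired behavior concrete, the following hypothetical sketch shows how keyword proximity, rather than object-set order, could decide which object set claims an ambiguous value. All names and data are illustrative:

```java
// Hypothetical illustration of the fix suggested by Figure 1: let
// keyword proximity, not object-set order, decide which object set
// claims an ambiguous value.
class KeywordProximity {
    // Character distance between the first occurrence of a keyword and
    // the first occurrence of a value; "infinite" if either is absent.
    static int distance(String text, String keyword, String value) {
        int k = text.indexOf(keyword);
        int v = text.indexOf(value);
        if (k < 0 || v < 0) return Integer.MAX_VALUE;
        return Math.abs(k - v);
    }

    // The object set whose keyword lies closest to the value wins,
    // regardless of the order in which object sets are listed.
    static String closestObjectSet(String text, String value,
                                   String[] objectSets, String[] keywords) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (int i = 0; i < objectSets.length; i++) {
            int d = distance(text, keywords[i], value);
            if (d < bestDistance) { bestDistance = d; best = objectSets[i]; }
        }
        return best;
    }
}
```

Here the date maps to Funeral Date whenever the "Funeral" keyword is the nearest indicator, even if Marriage Date appears first in the ontology.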



Figure 2 provides additional insight into the difficulties posed by the heuristics
specification and implementation. The figure first lists the specified set of heuristics. It
then gives pseudocode for the actual implementation of the heuristics. It is difficult to
identify from the pseudocode any of the five specified heuristics. Hence verifying
program correctness or making modifications to the heuristics involves a great deal of
attention and effort, and changes to the heuristics are likely to introduce bugs into the
code.




Specified Heuristics
     •      Keyword proximity
     •      Subsumed/overlapping constants
     •      Functional relationships
     •      Nonfunctional relationships
     •      First occurrence without constraint violation

Heuristics Pseudocode
Method 1:
FOR EACH object set IN the ontology:
  IF a keyword exists for the object set THEN
    Find the closest constant to the keyword
  END IF
  WHILE keyword exists AND constant is not found DO
    Find the next keyword
    Find the closest constant to the keyword
  END WHILE
  IF keyword doesn’t exist THEN
    Find the first constant matching this object set
  END IF
  IF constant exists THEN
    Update data structures:
       -     add the constant to the output table
       -     delete subsumed data lines
       -     delete other data lines for this object set
  ELSE
    IF this object set is required THEN
       Abort
    ELSE
       Continue with next object set
    END IF
  END IF
END FOR

Method 2:
FOR EACH relational table DO
  FOR EACH keyword in dataline list DO
    IF keyword corresponds to an attribute of this table THEN
       Find context surrounding this keyword (look backward for first preceding keyword, then look forward for
          next succeeding keyword)
       Clear this table’s template tuple
       Fill in object places of the template with object(s)
       FOR EACH constant in this context DO
          IF constant corresponds to an attribute in template THEN
             IF attribute slot is empty AND adding constant would not violate any constraints for this table
                THEN
                Fill slot with constant
                Mark the constant as used (delete from dataline list, remove other overlapping constants)
                IF this slot is a nonlexical object generator AND the corresponding nonlexical object isn’t yet
                   generated THEN
                   Generate nonlexical object
                END IF
                IF the tuple is now complete THEN
                   Add it to the rows of this table
                   Re-initialize template tuple
                END IF
             ELSE IF slot is full THEN
                IF slot is functionally determined by another slot THEN
                   Replace the slot’s contents with this constant
                   Mark the constant as used
                ELSE
                   IF template tuple is sufficiently complete to write THEN
                      Add template tuple to the table
                      Replace the slot’s contents with this constant
                      Mark the constant as used
                   ELSE
                      Ignore the constant
                   END IF
                END IF
             ELSE
                Ignore the constant
             END IF
          END IF
       END FOR
       IF template tuple is sufficiently complete to write THEN
          Add template tuple to the table
       END IF
    END IF
  END FOR
END FOR

Figure 2: Specification and implementation of Ontos heuristics.




If these were the only limitations of Ontos, it would be a straightforward matter to define
the heuristics more carefully and rework the code. But there are a number of
data-extraction tasks that Ontos simply cannot handle at all, tasks that represent
important advances and opportunities in data-extraction research. Aside from fixing
broken code, there is a great need to provide Ontos with the capability to support various
data-extraction techniques with such modularity and flexibility that new features and
heuristics can be “plugged in” without requiring extensive redesign of the Ontos code.
We list several examples of desired new Ontos functionality below.

Full support of object set and data frame specialization (inheritance). Ontos
currently has limited support of inheritance of object sets and data frames, but lacks such
important features as overriding and hiding of inherited extraction rules. These features
are needed to implement heuristics that correctly assign extracted values to object sets,
since a specialized extraction rule should cause matching values to be mapped to the
specializing object set.
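A minimal sketch of the intended semantics, with hypothetical class names and patterns (the real rules would live in data frames, not hard-coded regular expressions):

```java
import java.util.regex.Pattern;

// Hypothetical sketch of object-set specialization with rule
// overriding: FuneralDate specializes Date and overrides the inherited
// extraction rule, so values matching the specialized rule map to the
// specializing object set.
class SpecializationSketch {
    static class DateObjectSet {
        Pattern rule() { return Pattern.compile("\\d{1,2} \\w+ \\d{4}"); }
        String name() { return "Date"; }
    }

    // The specialization overrides (rather than merely inherits) the
    // extraction rule, requiring a nearby "funeral" keyword.
    static class FuneralDate extends DateObjectSet {
        @Override Pattern rule() {
            return Pattern.compile("(?i)funeral\\D{0,20}\\d{1,2} \\w+ \\d{4}");
        }
        @Override String name() { return "FuneralDate"; }
    }

    // Try the most specialized rule first; fall back to the generalization.
    static String classify(String text) {
        DateObjectSet specialized = new FuneralDate();
        if (specialized.rule().matcher(text).find()) return specialized.name();
        if (new DateObjectSet().rule().matcher(text).find()) return "Date";
        return "none";
    }
}
```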

Access to and use of previously extracted data. A major area of research that has
hitherto been unexplored in Ontos is learning-enabled data extraction. We would like the
performance of Ontos to improve over time as more data is recognized and mapped to the
ontology. This necessitates a feedback cycle wherein previously extracted values can
influence decisions to be made regarding new data. It follows that access to previously
extracted data is a necessary condition for learning-enabled data extraction. At present,
Ontos emits extracted values in the form of SQL insert statements. To access this data,
the SQL must be parsed and mapped back to the ontology. This process is inefficient. In
contrast, populating the ontology directly with extracted values would permit immediate
use of the values in the heuristics, and would also permit the construction of translation
code to traverse the ontology structure and emit the values in whatever format is desired.
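A sketch of what a directly populated ontology might look like; the class and method names are hypothetical, but the structure shows both uses described above, immediate feedback for heuristics and SQL emission as a traversal:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of populating the ontology directly with
// extracted values: heuristics can consult earlier values immediately,
// and the SQL output becomes a simple traversal of this structure.
class PopulatedOntology {
    private final Map<String, List<String>> valuesByObjectSet = new HashMap<>();

    void addValue(String objectSet, String value) {
        valuesByObjectSet.computeIfAbsent(objectSet, k -> new ArrayList<>()).add(value);
    }

    // Feedback for heuristics: values already extracted in this run.
    List<String> valuesOf(String objectSet) {
        return valuesByObjectSet.getOrDefault(objectSet, Collections.emptyList());
    }

    // Translation code: traverse the populated ontology and emit the
    // values in a desired format, here SQL insert statements.
    List<String> toSqlInserts() {
        List<String> statements = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : valuesByObjectSet.entrySet())
            for (String value : entry.getValue())
                statements.add("INSERT INTO " + entry.getKey() + " VALUES ('" + value + "');");
        return statements;
    }
}
```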

Support of complex document structures. Researchers have recently investigated
applying ontology-based data extraction to differing document structures, such as tabular
data or HTML input forms [7,16]. These reflect an expansion of the research from the
original assumption that the input documents contain a relatively flat, linear multi-record
structure. Unfortunately, Ontos has had to be retooled for each different document
structure. We desire to provide Ontos with the ability to recognize different documents
and choose a structure parser that will break the documents down into appropriate
sections for more accurate processing by Ontos. The goal is that Ontos can extract from
complex documents such as a university course catalog, which is rich both in terms of
data as well as structural complexity.
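A sketch of the intended recognizer-and-parser arrangement; the two parsers and their structure tests are illustrative stand-ins, not actual Ontos components:

```java
// Hypothetical sketch of document structure recognition: a recognizer
// inspects each input document and selects a parser that breaks the
// document into sections for more accurate processing.
class StructureRecognizerSketch {
    interface StructureParser {
        String[] split(String document);
    }

    // Splits an HTML table into rows.
    static class TableParser implements StructureParser {
        public String[] split(String document) { return document.split("<tr>"); }
    }

    // Splits flat, linear multi-record text into blank-line-separated records.
    static class RecordParser implements StructureParser {
        public String[] split(String document) { return document.split("\n\n"); }
    }

    // Recognize the document's structure and choose a parser for it.
    static StructureParser recognize(String document) {
        return document.contains("<table") ? new TableParser() : new RecordParser();
    }
}
```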

Declarative specification of extraction plans. The extraction plan specifies the order in
which a data-extraction engine processes documents and target structure elements during
the extraction phase, as well as how it reconciles the heuristics results. The pseudocode
for the Ontos heuristics represents an example of an extraction plan. It is doubtful,
however, that the current Ontos implementation applies the optimal extraction
plan in all cases. We hypothesize instead that different extraction algorithms are better
for different applications. Indeed, even to measure the utility of different extraction plans
requires the ability to contrast them in controlled experiments. We wish therefore to
allow an ontology researcher or designer to specify a particular extraction plan
declaratively for a given ontology. Declarative extraction plans facilitate
experimentation and dynamic tailoring of extraction plans for various applications.
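One possible shape for such a declarative plan, sketched with hypothetical names: the plan is data (an ordered list of heuristic names, e.g. parsed from the ontology document) resolved against a registry of available implementations, rather than hard-coded control flow:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a declaratively specified extraction plan:
// the plan is data, so different plans can be swapped and compared
// in controlled experiments without changing the engine code.
class DeclarativePlan {
    interface Heuristic {
        String name();
    }

    // Resolve a declarative specification against a registry of
    // available heuristic implementations.
    static List<Heuristic> buildPlan(List<String> spec, Map<String, Heuristic> registry) {
        List<Heuristic> plan = new ArrayList<>();
        for (String name : spec) {
            Heuristic h = registry.get(name);
            if (h == null)
                throw new IllegalArgumentException("unknown heuristic: " + name);
            plan.add(h);
        }
        return plan;
    }
}
```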



Parameterization of heuristics. The Ontos heuristics produce, at each step, a Boolean
    decision on membership of a value in a particular object set or relationship set. Once a
    value has been accepted, it is no longer available for assignment to a different set. As the
    obituaries example above demonstrated, this decision can sometimes be incorrect. We
    thus desire the heuristics to produce real-numbered confidence values for each value
    mapping. This will permit research into how to weight certain heuristics and otherwise
    combine confidence values to achieve a final set of Boolean decisions for proposed
    value mappings. It will also permit the use of probabilistic and vector-based
    approaches, such as a hidden Markov model (HMM) learner for classification
    [8,10,15,17,20,22,23].

    One of the factors motivating the proposed work is the considerable effort spent to
    develop a hidden Markov model learner for the current Ontos implementation. In the
    process of attempting to integrate the learner into Ontos, it was discovered that the
    heuristics could not work with such a different paradigm. In particular, the Boolean
    decisions of the heuristics could not easily be reconciled with the probabilities output by
    the HMM. The HMM work helped to demonstrate how flexible the Ontos code needs to
    be to accommodate such novel approaches.

    The examples given above are a subset of the many possible ideas and techniques
    applicable to ontology-based data extraction that have not been achievable with the
    existing Ontos code. Combined with the errors and general rigidity of the current
    codebase, it is clear that a more flexible, understandable, modular, and powerful design
    for Ontos is required. The proposed thesis addresses these problems by defining an
    abstract object-oriented framework that permits powerful extraction plans and heuristics
    to be developed, configured, and tuned for an ontology-based data-extraction system.

II.Thesis Statement
    A generalized framework of interfaces and abstract classes written in Java can decouple
    the value-mapping heuristics from the rest of the Ontos data-extraction code. The
    framework will be sufficiently flexible both to allow the current heuristics to be re-
    implemented under the framework and to enable new heuristics to be created and
    incorporated into the Ontos system without requiring significant rewrites of the non-
    heuristics code.

III.Methods
    This section describes the steps involved in accomplishing the work for the proposed
    thesis. The work may be broken down into five major steps, which are listed below:
        1. Analysis of data-extraction functionality
        2. Construction of the framework
        3. Implementation of modules under the framework
        4. Upgrades to existing tools
        5. Evaluation of the new framework

    We provide details in the following sections.

    A.Analysis of Data Extraction Functionality
    The first step in constructing a framework for data extraction is to determine common
    functionality at each level of abstraction. The abstract class DataExtractionEngine,


which encompasses any data-extraction system (not just those that are ontology-based),
represents the highest level of abstraction.

A survey of current data-extraction research [4,6,15,18] provides the basis for defining
the functionality at the highest levels of abstraction. Examination of the research
centered on Ontos [6,7,16] supplies insight into abstracting ontology-based data-
extraction systems. Much of this work has already been done for the proposed thesis. As
an example of the work done, the following paragraph describes the abstract functionality
of data-extraction systems [25]:

     In general, a data-extraction engine is initialized with a corpus of source documents
     (often mediated by an information retrieval system), a set of extraction rules to apply to
     the documents, and a target data structure to which extracted values are mapped. After
     initialization the engine executes an extraction plan, yielding a mapping between
     information in the documents and elements of the target data structure. Following the
     extraction step, the engine performs finalization tasks such as canonicalization of the
     extracted data or queries against the target data structure.
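The lifecycle described above can be sketched as an abstract class. The operation names follow the framework design, but the bodies here are illustrative only, not the actual Ontos code:

```java
// Sketch of the abstract engine lifecycle; method names follow the
// framework design, but the bodies are illustrative only.
abstract class DataExtractionEngine {
    protected ExtractionPlan extractionPlan;

    // Initialization: corpus of source documents, extraction rules,
    // and the target data structure for extracted values.
    public abstract void initialize();

    // Execute the extraction plan, yielding a mapping between document
    // information and elements of the target data structure.
    public void doExtraction() {
        extractionPlan.execute();
    }

    // Finalization tasks such as canonicalization of extracted data or
    // queries against the target structure. (Named finish() here only
    // because finalize() collides with java.lang.Object.finalize().)
    public void finish() { }
}

interface ExtractionPlan {
    void execute();
}
```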

Figure 3 shows a UML diagram of a partial framework that supports the functionality
described above for the Ontos system. Subclasses of ExtractionPlan implement the
execute() method, accessing methods of the DocumentRetriever,
ValueMappingHeuristic, Ontology, and DocumentStructureRecognizer
classes to perform the extraction.
[Figure 3 content: a UML class diagram. DataExtractionEngine declares
#extractionPlan : ExtractionPlan and the operations
+setInitializationParameters(), +initialize(), +doExtraction(), and
+finalize(); ExtractionPlan declares +execute(). Ontos specializes
DataExtractionEngine and adds +setExtractionPlan(in plan :
OntosExtractionPlan). OntosExtractionPlan specializes ExtractionPlan
and holds #documentRetriever : DocumentRetriever, #documentRecognizer :
DocumentStructureRecognizer, and #ontology : Ontology; it uses the
ValueMappingHeuristic, Ontology, DocumentRetriever,
DocumentStructureRecognizer, and DocumentStructureParser classes.
Ontology contains ObjectSet and RelationshipSet.]
Figure 3: UML diagram of partial Ontos framework.

There is still some work to be done in determining at what level to abstract certain
features. Most data-extraction systems make a number of assumptions regarding how
they obtain documents to process, how the documents are structured, what language
they are written in, and so forth. It is therefore a significant challenge to decide how



to allow these assumptions to be made explicit when implementing the abstraction. This
work will be completed for the proposed thesis.

B.Construction of a Framework
The main piece of work to be performed for the proposed thesis is development of the
framework. The framework will be implemented in the Java language [13], which is the
implementation language for the existing Ontos system and provides the object-oriented
features necessary for abstraction of data-extraction functionality.

The framework will comprise a set of abstract classes and interfaces. In Java, abstract
classes may contain instance attributes and method implementations, while interfaces
may only contain static (class) attributes and method declarations. On the other hand, a
Java class may implement multiple interfaces but inherit from only one parent class.
Thus, most of the framework will be contained in interfaces to avoid problems with
having to inherit from multiple abstract classes.
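A minimal illustration of this trade-off; the two interfaces and the class below are hypothetical, not part of the actual framework:

```java
// Hypothetical illustration of the design choice: a framework role is
// declared as an interface because a Java class may implement several
// interfaces but extend only one class.
interface ValueRecognizer {
    boolean recognizes(String token);
}

interface KeywordAware {
    String[] keywords();
}

// One module can take on both framework roles at once via interfaces,
// which would be impossible with two abstract parent classes.
class DateHeuristic implements ValueRecognizer, KeywordAware {
    public boolean recognizes(String token) {
        return token.matches("\\d{4}");
    }
    public String[] keywords() {
        return new String[] { "funeral", "married" };
    }
}
```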

A key part of designing an abstract framework is specifying the data structures that are
published by parts of the interface. For example, a domain ontology is a data structure
that would presumably be passed into the initialization portion of a data-extraction
engine. It is reasonable therefore to declare an Ontology abstract class or interface as
part of the framework. Other data types will be defined to represent documents, extracted
values, value-mapping heuristics, data frames, and so forth.

For data extraction using OSM ontologies, the OSM grammar must be extended to
accommodate new capabilities such as declaring a reference to an existing extraction
plan. The grammar must also be implemented in a data structure with an underlying
concrete representation so that ontologies may be read from and written to a file. XML
has been chosen for the concrete representation of OSM ontologies, and an XML Schema
defining the grammar, termed OSM-X, has been designed. Work is currently underway
to wrap the XML structure within interfaces that translate XML nodes to OSM ontology
elements.
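A sketch of the wrapping idea: XML nodes are translated into ontology elements behind a small interface, so the rest of the system never manipulates raw XML. The element name "ObjectSet" used below is illustrative, not the actual OSM-X schema:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical sketch of reading an OSM-X concrete representation and
// exposing ontology elements instead of raw XML nodes.
class OsmXReader {
    // Count the object sets declared in an OSM-X document; returns -1
    // if the XML cannot be parsed.
    static int countObjectSets(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("ObjectSet").getLength();
        } catch (Exception e) {
            return -1;
        }
    }
}
```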

C.Implementing Modules under the Framework
The proposed thesis will not only entail construction of the framework, but will also
involve incorporation of the current Ontos code into the new framework. This will
require reconstruction of the heuristics code as an ExtractionPlan subclass and
individual Heuristic subclasses. Unintended features such as the object-set order
dependency will be corrected to harmonize with the heuristics specification.

In addition to wrapping the current functionality of Ontos into the framework, a few
additional features (namely, full generalization/specialization support and access to
previously extracted data) will be newly implemented under the framework as a
demonstration of its capabilities. However, many of the desired new features will be left
for implementation by future researchers. These may include various document structure
recognizers and parsers, specialized heuristics, learning-enabled extraction plans, and so
forth.




D.Upgrades to Existing Tools
In addition to the core Ontos data-extraction system, a graphical ontology editor tool [11]
exists for use in constructing ontologies visually rather than textually. This tool will need
to be rewritten to a sufficient degree to accommodate the new framework, particularly the
changes to the OSM grammar. We will perform the work required to upgrade the
ontology editor, coordinating with other researchers insofar as they seek to contribute to
the tool upgrade.

E.Evaluation of the Framework
We will demonstrate the improved performance and capability of our new framework by
means of a series of experiments that measure precision and recall in several data-
extraction domains.

Support for existing functionality. Ontos under the new framework must be no less
accurate than the present implementation. A series of experiments will
exercise the features of both the new and old Ontos, and the results will show whether the
new framework is capable of performing the same tasks with comparable (or better)
accuracy.

To achieve this, we will select three different application domains—obituaries, used-car
advertisements, and campground listings—and a corpus of documents on which to test
the old and new Ontos implementations. We will process the documents and compare
the extracted data to values extracted manually by a domain expert. This comparison will
provide precision and recall figures for each implementation. We will then compare the
precision and recall of the two implementations to gauge the performance of Ontos under
the new framework.
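The comparison reduces to the standard measures: precision is the fraction of extracted values that are correct, and recall is the fraction of the expert's values that the system extracted. Stated as code:

```java
// Standard evaluation measures for data extraction: precision is
// correct / extracted, recall is correct / relevant (expert-identified).
class PrecisionRecall {
    static double precision(int correct, int extracted) {
        return extracted == 0 ? 0.0 : (double) correct / extracted;
    }

    static double recall(int correct, int relevant) {
        return relevant == 0 ? 0.0 : (double) correct / relevant;
    }
}
```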

Despite the weaknesses in the current implementation of Ontos, its performance has been
quite good in certain domains. We therefore face the prospect of achieving lower
precision and recall figures from the new heuristics code in certain cases, since the
original code’s order dependencies may have serendipitously resulted in correct
classification of extracted values. We anticipate, however, that such differences are
likely to be minor and counterbalanced by the new code’s greater reliability and
faithfulness to theoretical underpinnings. Also, with the ability to reorder and tune the
heuristics beyond the scope of the original ontology-based extraction specification, we
expect to see overall improvements in precision and recall over the current
implementation.

Support for new functionality. We will demonstrate that our new framework will be
capable of more functionality than the current Ontos by successfully applying the new
version of Ontos to problems that the old version was incapable of handling. Because the
old version cannot even handle the problems, we will have no accuracy results to
compare against.

The first test of new functionality will focus on support of generalization/specialization
semantics. We will construct an ontology that contains a significant
generalization/specialization relationship between some of its object sets. The selected
application domain is product descriptions for digital display devices—i.e., monitors,
projectors, etc. This ontology will be applied to a set of documents in the application



domain to determine the effectiveness of the extraction. We will obtain precision and
    recall figures in a manner similar to that described earlier.

    A second test of new functionality will be centered on use of previously extracted data.
    We will construct an extraction plan that will treat previously extracted data as an “on-
    the-fly” lexicon for recognizing related values later in the document. Our application
    domain will be a university catalog, in which we will seek out course codes to comprise a
    lexicon for identifying prerequisites or cross-listings. We will again calculate precision
    and recall figures as described above.
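The on-the-fly lexicon idea can be sketched as a two-pass process; the course-code pattern and names below are illustrative, not the actual extraction plan:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of an "on-the-fly" lexicon: course codes
// extracted earlier in the catalog become a lexicon for recognizing
// prerequisite and cross-listing references later in the document.
class OnTheFlyLexicon {
    private static final Pattern COURSE = Pattern.compile("[A-Z]{2,5} \\d{3}");

    // First pass: harvest course codes into the lexicon.
    static Set<String> buildLexicon(String document) {
        Set<String> lexicon = new LinkedHashSet<>();
        Matcher m = COURSE.matcher(document);
        while (m.find()) lexicon.add(m.group());
        return lexicon;
    }

    // Later pass: a prerequisite mention is recognized only if it names
    // a course already in the lexicon.
    static List<String> prerequisites(String sentence, Set<String> lexicon) {
        List<String> prereqs = new ArrayList<>();
        Matcher m = COURSE.matcher(sentence);
        while (m.find())
            if (lexicon.contains(m.group())) prereqs.add(m.group());
        return prereqs;
    }
}
```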

IV.Contribution to Computer Science
    The thesis will contribute the following to the field of data extraction:
       • An abstracted approach to data extraction, especially ontology-based data
            extraction
       • The concept of extraction plans for data-extraction systems, comparable to query
            execution plans in relational database systems
       • An object-oriented framework that permits extensive experimentation with
            different data-extraction techniques
       • A revised Ontos codebase and ontology editor to aid in data-extraction research
       • More powerful Ontos data extraction with at least the same accuracy as before
       • The capability to process applications previously beyond the ability of Ontos

V.Delimitations of the Thesis
    The following will not be addressed by the proposed thesis:
       • Implementation of information retrieval modules
       • Implementation of document structure recognizers and parsers
       • Implementation of non-ontology-based data-extraction systems under the
            framework
       • Support for full declarative specification of extraction plans as part of the
            ontology definition
       • Canonicalization of extracted data

VI.Thesis Outline
        1. Introduction (5-10 pages)
        2. Related Work (2-4 pages)
        3. Scenarios in Data Extraction (15-20 pages)
               a. Value Ambiguity
               b. Data Frame Inheritance
               c. Learning Capability
               d. Complex Documents
               e. Tuning
        4. Data Extraction Engines, Plans, and Heuristics (15-20 pages)
               a. General Data Extraction
               b. Ontologies and Data Frames
               c. Extraction Plans
               d. Value-Mapping Heuristics
        5. A Framework for Extraction Plans and Heuristics (10-15 pages)
               a. Design Patterns and Frameworks


                    b. Plan-Level Classes and Interfaces
                    c. Heuristics-Level Classes and Interfaces
            6.   Implementation of Heuristics and Extraction Plans (2-5 pages)
            7.   Experimental Results (5-8 pages)
                    a. Comparison of Existing Functionality
                    b. Evaluation of New Functionality
            8.   Conclusions and Future Work (3-5 pages)
            9.   Appendices
                    a. Documentation on Framework
                    b. OSM-X Definition

VII.Thesis Schedule
        Literature Search and Reading             September 2001 – March 2003
        Design and Coding                         March 2003 – June 2003
        Chapter 3                                 April 2003 – May 2003
        Chapter 4                                 May 2003 – June 2003
        Chapter 5                                 June 2003 – July 2003
        Chapter 6                                 July 2003
        Chapter 7                                 August 2003
        Chapters 1,2 & 8                          August 2003 – September 2003
        Thesis Revision and Defense               September 2003


VIII.Bibliography
[1]   David Buttler, Ling Liu, Calton Pu. A fully automated object extraction system for the
      World Wide Web. In Proceedings of the 2001 International Conference on Distributed
      Computing Systems, 2001.
       This paper reports excellent precision and recall results for a fully-automated data-
       extraction system for HTML documents. Its primary limitations are its requirement for
       the documents to be well formed and its dependence on document structure for locating
       tokens of interest. It is also highly reliant upon a limited set of built-in heuristics that
       decide how to identify token separator tags.

[2]   Mary Elaine Califf and Raymond J. Mooney. Relational learning of pattern-match rules for
      information extraction. Working Notes on AAAI Spring Symposium on Applying Machine
      Learning to Discourse Processing, 1998.
        Rapier, the system described in this paper, is a relational learner of pattern matching
        rules. It is more general than most wrapper approaches since its patterns are based on
        parts of speech and lexicons of related terms, instead of specific delimiters found in a
        document. However, it requires text that is grammatical enough to be tagged for parts of
        speech.

[3]   Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. Roadrunner: toward automatic
      data extraction from large Web sites. The VLDB Journal, 2001.
        The authors introduce a system for fully-automated wrapper generation to extract data
        from HTML pages. Their system is novel in several ways: it does not rely on user
        interaction, it has no a priori knowledge about the page contents, and it works with two
        pages at a time instead of just one.




[4]   Line Eikvil. Information extraction from World Wide Web: A survey. Technical Report
      945, Norwegian Computing Center, 1999.
        A general survey of Web-based information extraction, this paper explains the history,
        metrics, and nature of the information extraction problem. It focuses primarily on
        wrapper generation and systems centered on that approach.

[5]   David W. Embley. Programming with data frames for everyday data items. AFIPS
      Conference Proceedings, 1980.
       The author introduces the concept of data frames, which are similar to abstract data
       types and classes. Data frames define value recognizers, context keywords, internal
       representations, operators, and so forth. This work has influenced the design and use of
       data frames in Ontos.

[6]   D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, D.W. Lonsdale, Y.-K. Ng, and R.D.
      Smith. Conceptual-model-based data extraction from multiple-record Web pages. Data &
      Knowledge Engineering, 31, 1999.
       This paper introduces the notion of ontology-based data extraction and describes the
       BYU-Ontos system. A domain ontology is a conceptual model for a particular subject,
       and may be composed of object sets, relationship sets, constraints, and various other
       elements. A data-extraction system can use ontologies together with lexicons and
       pattern-matching rules to recognize and extract portions of a document related to the
       domain of interest. This paper is the foundation for most of the work of the Data
       Extraction Group at BYU.

[7]   D.W. Embley, C. Tao, S.W. Liddle. Automatically extracting ontologically specified data
      from HTML tables with unknown structure. Proceedings of the 21st International
      Conference on Conceptual Modeling, Tampere, Finland, 2002.
        The paper describes an ontology-based approach to extracting tabular data from HTML
        by interpreting how values appear in HTML table structures. It represents a departure
        from the sequential-record assumption of earlier ontology-based data extraction work at
        BYU.

[8]   Shai Fine, Yoram Singer, and Naftali Tishby. The hierarchical hidden Markov model: analysis
      and applications. Machine Learning, 32, 1998.
        This paper describes hierarchical hidden Markov models and algorithms for using them.
        A hierarchical HMM is a recursive generalization of a classical HMM: it is an HMM
        with one or more hidden states replaced by “nested” HMMs, which themselves may
        contain other HMMs in place of any of their states. The hierarchical nature of
        the HHMMs may provide a natural fit to the relational aspect of domain ontologies.

[9]   Dayne Freitag. Information extraction from HTML: application of a general machine
      learning approach. AAAI-98, 1998.
        This paper presents SRV, a wrapper-related information extraction system whose
        extraction rule learner is based on the FOIL algorithm. Its learning algorithm is
        domain- and syntax-independent, but the rules it learns are specific to the documents
        from which they were learned.

[10] Dayne Freitag and Andrew Kachites McCallum. Information extraction with HMMs and
     shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information
     Extraction, 1999.
        The authors provide a good explanation of using HMMs in information extraction.
        Using HMMs with certain constraints, such as manually-defined topologies, they apply a
        statistical smoothing technique called “shrinkage” to the model parameters in order to
        make the accuracy of the model more robust while requiring fewer training instances.

[11] Kimball Hewitt. An integrated ontology development environment for data extraction.
     Master’s thesis, Brigham Young University, 2000.
      The thesis describes an implementation of a graphical ontology editor still in use by the
      Data Extraction Group for construction of ontologies and data frames.

[12] Chun-Nan Hsu. Initial results on wrapping semistructured Web pages with finite-state
     transducers and contextual rules. AAAI-98 Workshop on AI and Information Integration,
     1998.
       This paper introduces SoftMealy, a wrapper induction algorithm. It constructs a finite
       state transducer for each type of item to extract, based on the delimiters encountered
        during document scanning. The resulting finite-state transducers deal flexibly with
        missing attributes and with attributes appearing in varying orders.

[13] The Java tutorial. At http://java.sun.com/docs/books/tutorial/, 2003.
      This site explains features of the Java programming language, the primary
      implementation language of Ontos and the proposed framework.

[14] Nicholas Kushmerick, Daniel S. Weld, and Robert Doorenbos. Wrapper induction for
     information extraction. International Joint Conference on Artificial Intelligence (IJCAI),
     1997.
       The system presented in this paper follows the wrapper generation orientation of early
        information extraction work. It uses PAC-learning principles and syntactic analysis to
        induce wrappers for specific documents.

[15] T.R. Leek. Information extraction using hidden Markov models. Master’s thesis, UC San
     Diego, 1997.
       The thesis introduces the concept of using a hidden Markov model to extract factual
       information from a corpus of English prose. It includes a sound justification for using
       HMMs in information extraction, focusing on the technique’s independence from human
       engineering and prior knowledge, as well as the usefulness of computing a probability
       for class labels.

[16] S.W. Liddle, D.W. Embley, D.T. Scott, and S.H. Yau. Extracting data behind Web forms.
     Proceedings of the Workshop on Conceptual Modeling Approaches for e-Business,
     Tampere, Finland, 2002.
       The paper describes an approach to document retrieval and data extraction for the
       “hidden Web” (data available only via HTML form requests). The approach is an
       example of the need for access to previously extracted data, which in this case can be
       used to provide values for a form request.




[17] Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum entropy Markov models
     for information extraction and segmentation. In Machine Learning: Proceedings of the
     Seventeenth International Conference (ICML 2000), 2000.
        The authors present an alternative to hidden Markov models that retains many of their
        advantages. Maximum entropy Markov models allow observations to be treated as
        arbitrary overlapping features. They also represent the probability of a sequence more
        accurately by modeling the conditional probability of reaching a state given an
        observation and the previous state, rather than the joint probability that maximizes the
        likelihood of the observation sequence(s) used to train the model.

[18] Ion Muslea. Extraction patterns for information extraction tasks: a survey. The AAAI-99
     Workshop on Machine Learning for Information Extraction, 1999.
       Muslea surveys the existing systems for rule-based information extraction and
       categorizes them based on the type of text (free, structured, semi-structured) targeted for
       extraction. The survey does not encompass BYU-Ontos.

[19] Ion Muslea, Steve Minton, and Craig Knoblock. A hierarchical approach to wrapper induction.
     Proceedings of the Third International Conference on Autonomous Agents (Agents ’99),
     1999.
        The authors present an intriguing variation on wrapper induction that accommodates
        richly structured documents. They view a document as a collection of nested lists and
        therefore construct a hierarchical wrapper to match that structure.

[20] L.R. Rabiner and B.H. Juang. An introduction to hidden Markov models. IEEE ASSP
     Magazine, 3 (1), 1986.
       This article is an oft-cited overview of hidden Markov models, the three primary
       problems of interest associated with them, and algorithms for solving these problems.
       HMMs consist of a finite set of hidden states that transition to each other according to a
       probability distribution; a finite set of symbols that may be emitted from any hidden state
       according to another probability distribution; and a probability distribution that
       determines the initial state of the model. Efficient algorithms exist for determining the
       likelihood of a particular sequence of symbols given a model, or the most likely sequence
       of hidden states that generated a particular sequence.
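
        As a concrete illustration of the first problem mentioned above (the likelihood of a
        symbol sequence given a model), the forward algorithm can be sketched as follows.
        This is an illustrative sketch only; the class name, method name, and the toy
        two-state model are hypothetical and not drawn from any cited system.

```java
// Sketch of the forward algorithm for computing the likelihood of an
// observation sequence under a discrete HMM. All names and the toy
// model below are illustrative, not taken from any cited system.
public class HmmForward {

    // pi[i]: initial-state distribution; a[i][j]: transition probability
    // from state i to state j; b[i][k]: probability that state i emits
    // symbol k; obs: the observed symbol sequence.
    static double likelihood(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length;
        double[] alpha = new double[n];
        // Initialization: probability of starting in state i and
        // emitting the first observed symbol.
        for (int i = 0; i < n; i++) {
            alpha[i] = pi[i] * b[i][obs[0]];
        }
        // Induction: sum over all paths reaching state j at time t.
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    sum += alpha[i] * a[i][j];
                }
                next[j] = sum * b[j][obs[t]];
            }
            alpha = next;
        }
        // Termination: total probability of the observation sequence.
        double total = 0.0;
        for (double v : alpha) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] pi = {0.6, 0.4};
        double[][] a = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] b = {{0.9, 0.1}, {0.2, 0.8}};
        int[] obs = {0, 1, 0};
        System.out.println(likelihood(pi, a, b, obs));
    }
}
```

        A production implementation would work in log space (or rescale alpha at each
        step) to avoid underflow on long sequences; the plain products above suffice
        only for short toy sequences.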

[21] Ellen Riloff. Automatically constructing a dictionary for information extraction tasks.
     National Conference on Artificial Intelligence, 1993.
        Riloff’s paper describes AutoSlog, a system that uses selective concept extraction
        in a one-shot machine learning approach to build a dictionary of extraction terms
        for a given domain. This paper provides an example of earlier and alternative
        approaches to machine-learned information extraction. Of particular note is the set
        of core heuristics upon which AutoSlog relies; this dependence on heuristics
        resembles BYU-Ontos’ use of heuristics for its class labeler.

[22] Tobias Scheffer, Christian Decomain, and Stefan Wrobel. Active hidden Markov models for
     information extraction. In Proceedings of the International Symposium on Intelligent Data
     Analysis, 2001.
       This paper gives an algorithm for training a hidden Markov model on partially labeled
       data using active learning methods. The research examines how to train a model when
       the observation labels are not fully specified, although this paper focuses more on
       computing the error of estimated labels in order to present the more difficult
       observations interactively for a user to label.


[23] Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld. Learning hidden Markov model
     structure for information extraction. In Proceedings of the AAAI Workshop on Machine
     Learning for Information Extraction, 1999.
       This paper focuses on learning the underlying topology of a hidden Markov model for
       information extraction purposes, since the optimal topology is often not known a priori.
       An important result is that a topology containing only one hidden state per class label
       may be suboptimal. The paper also proposes some techniques for learning with
       distantly-labeled data, which is labeled data from another domain whose labels partially
       overlap those of the target domain.

[24] Stephen Soderland. Learning information extraction rules for semi-structured and free text.
     Machine Learning 34, 1999.
       Soderland presents WHISK, a rule-based extraction approach. WHISK learns to detect
       information based on delimiting strings in the document, as well as tagged grammatical
       elements. The system's main contribution is its ability to learn extraction rules for free,
       semi-structured, and structured texts.

[25] Alan Wessman. Unpublished design document, 2003.
      This document contains working specifications for the framework proposed for the thesis.


IX.Artifacts
        Aside from the thesis itself, the following will be produced as artifacts:
           • An XML Schema defining the underlying data structures for Ontos
           • The proposed framework, realized in Java
           • An extraction plan and heuristics implemented under the new framework
           • Revision of a graphical ontology editor and the Ontos system using the
                framework
           • Documentation on using the framework to implement extraction plans and
                heuristics
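
        To make the first artifact concrete, a fragment of what such an XML Schema might
        look like is sketched below. The element and attribute names (Ontology, ObjectSet,
        RelationshipSet, lexical) are hypothetical placeholders chosen to reflect the
        ontology elements described in [6]; they are not the actual Ontos schema.

```xml
<!-- Hypothetical sketch only: names are illustrative, not the actual
     Ontos schema. An ontology contains object sets and relationship
     sets, per the conceptual model described in [6]. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Ontology">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="ObjectSet" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="name" type="xsd:string" use="required"/>
            <xsd:attribute name="lexical" type="xsd:boolean" default="false"/>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="RelationshipSet" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="name" type="xsd:string" use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```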




X.Signatures
This thesis proposal by Alan Wessman is accepted in its present form by the Department
of Computer Science of Brigham Young University as satisfying the thesis proposal
requirement for the degree of Master of Science.


______________________________             _______________________________
Date                                       David W. Embley, Committee Chair



                                           _______________________________
                                           Stephen W. Liddle, Committee Member



                                           _______________________________
                                           Thomas W. Sederberg, Committee Member



                                           _______________________________
                                           David W. Embley, Graduate Coordinator





Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Framework for Extraction Plans and Heuristics in Ontology-Based Data Extraction

document structure or presentation changes, even if the information contained in the document is still relevant. This is of course a major shortcoming in Web-based data extraction, since Web documents are notorious for their tendency to change suddenly and drastically. Even without the problem of evolving pages, though, wrappers are still specific to a narrow set of documents, rather than applying to all documents belonging to an application domain.
Thus an application may require the ongoing construction of new wrappers as additional data sources are discovered. Roadrunner [3] is one approach that seeks to mitigate this problem by automating the process of wrapper construction. However, the Roadrunner method only works for fairly regular data sources.

The BYU-Ontos approach [6] is a data-extraction solution that automatically compensates for structural variations across data sources, as well as variations over time in a single data source, by taking a novel approach to the problem of data extraction. In contrast with the Roadrunner approach, BYU-Ontos also works for highly unstructured data sources. It is centered on a human-designed ontology that describes the domain of interest and provides certain general sets of patterns (termed “data frames”, from [5]) for recognizing elements of the domain. Independent of any particular document format or syntax, the recognition patterns of the ontology detect strings and symbols using lexicons and regular expressions. Recognized tokens are mapped to elements of the ontology, or are discarded using ontology constraints and a set of built-in heuristics.

BYU-Ontos (hereafter “Ontos”) has been shown to be quite successful in extracting data for several ontologically narrow applications, such as used-car advertisements and obituaries. Its precision and recall results are competitive with those of other approaches, and it is resilient in the face of document change and data-source variety, since its rules are constructed independently of any single document or format.

But Ontos in its current incarnation has several limitations. One notable problem is that several discrepancies exist between the Ontos specification and its implementation. In particular, the Ontos implementation is sensitive to the order in which object sets and
relationship sets are given in the domain ontology; such a dependency does not exist in the specification. This results in errors such as the one depicted in Figure 1 below. In this case, the extracted date matches two object sets: Funeral Date and Marriage Date. Despite the nearby keyword indicating that the date is a Funeral Date, the system extracts it as a Marriage Date. This error is due to the one-by-one processing of object sets in the order of their appearance in the ontology. Since Marriage Date was processed first and yielded a match on the date value, it accepted the value and excluded it from inclusion in the correct Funeral Date object set. Simply changing the order of the object sets in the ontology will not solve the general problem, for it is easy to conceive of a scenario in which one record has only a funeral date and another has only a marriage date. In such a case, the ordering will be wrong for one of the records.

Figure 1: Example of errors due to object-set order dependencies.

Figure 2 provides additional insight into the difficulties posed by the heuristics specification and implementation. The figure first lists the specified set of heuristics. It then gives pseudocode for the actual implementation of the heuristics. It is difficult to identify any of the five specified heuristics in the pseudocode. Hence verifying program correctness or making modifications to the heuristics involves a great deal of attention and effort, and changes to the heuristics are likely to introduce bugs into the code.
Specified Heuristics
• Keyword proximity
• Subsumed/overlapping constants
• Functional relationships
• Nonfunctional relationships
• First occurrence without constraint violation

Heuristics Pseudocode

Method 1:
FOR EACH object set IN the ontology:
    IF a keyword exists for the object set THEN
        Find the closest constant to the keyword
    END IF
    WHILE keyword exists AND constant is not found DO
        Find the next keyword
        Find the closest constant to the keyword
    END WHILE
    IF keyword doesn’t exist THEN
        Find the first constant matching this object set
    END IF
    IF constant exists THEN
        Update data structures:
        - add the constant to the output table
        - delete subsumed data lines
        - delete other data lines for this object set
    ELSE
        IF this object set is required THEN
            Abort
        ELSE
            Continue with next object set
        END IF
    END IF
END FOR

Method 2:
FOR EACH relational table DO
    FOR EACH keyword in dataline list DO
        IF keyword corresponds to an attribute of this table THEN
            Find context surrounding this keyword (look backward for first
                preceding keyword, then look forward for next succeeding keyword)
            Clear this table’s template tuple
            Fill in object places of the template with object(s)
            FOR EACH constant in this context DO
                IF constant corresponds to an attribute in template THEN
                    IF attribute slot is empty AND adding constant would not
                            violate any constraints for this table THEN
                        Fill slot with constant
                        Mark the constant as used (delete from dataline list,
                            remove other overlapping constants)
                        IF this slot is a nonlexical object generator AND the
                                corresponding nonlexical object isn’t yet generated THEN
                            Generate nonlexical object
                        END IF
                        IF the tuple is now complete THEN
                            Add it to the rows of this table
                            Re-initialize template tuple
                        END IF
                    ELSE IF slot is full THEN
                        IF slot is functionally determined by another slot THEN
                            Replace the slot’s contents with this constant
                            Mark the constant as used
                        ELSE IF template tuple is sufficiently complete to write THEN
                            Add template tuple to the table
                            Replace the slot’s contents with this constant
                            Mark the constant as used
                        ELSE
                            Ignore the constant
                        END IF
                    END IF
                ELSE
                    Ignore the constant
                END IF
            END FOR
            IF template tuple is sufficiently complete to write THEN
                Add template tuple to the table
            END IF
        END IF
    END FOR
END FOR

Figure 2: Specification and implementation of Ontos heuristics.
If these were the only limitations of Ontos, it would be a straightforward matter to define the heuristics more carefully and perform some reworking of the code. But there are a number of data-extraction tasks that Ontos simply cannot handle at all, tasks that represent important advances and opportunities in data-extraction research. Aside from fixing broken code, there is a great need to provide Ontos with the capability to support various data-extraction techniques with such modularity and flexibility that new features and heuristics can be “plugged in” without requiring extensive redesign of the Ontos code. We list several examples of desired new Ontos functionality below.

Full support of object set and data frame specialization (inheritance). Ontos currently has limited support for inheritance of object sets and data frames, but lacks such important features as overriding and hiding of inherited extraction rules. These features are needed to implement heuristics that correctly assign extracted values to object sets, since a specialized extraction rule should cause matching values to be mapped to the specializing object set.

Access to and use of previously extracted data. A major area of research that has hitherto been unexplored in Ontos is learning-enabled data extraction. We would like the performance of Ontos to improve over time as more data is recognized and mapped to the ontology. This necessitates a feedback cycle wherein previously extracted values can influence decisions to be made regarding new data. It follows that access to previously extracted data is a necessary condition for learning-enabled data extraction. At present, Ontos emits extracted values in the form of SQL insert statements. To access this data, the SQL must be parsed and mapped back to the ontology. This process is inefficient.
In contrast, populating the ontology directly with extracted values would permit immediate use of the values in the heuristics, and would also permit the construction of translation code to traverse the ontology structure and emit the values in whatever format is desired.

Support of complex document structures. Researchers have recently investigated applying ontology-based data extraction to differing document structures, such as tabular data or HTML input forms [7,16]. These efforts reflect an expansion of the research from the original assumption that input documents contain a relatively flat, linear, multi-record structure. Unfortunately, Ontos has had to be retooled for each different document structure. We desire to provide Ontos with the ability to recognize different documents and choose a structure parser that will break the documents down into appropriate sections for more accurate processing by Ontos. The goal is that Ontos can extract from complex documents such as a university course catalog, which is rich both in data and in structural complexity.

Declarative specification of extraction plans. The extraction plan specifies the order in which a data-extraction engine processes documents and target structure elements during the extraction phase, as well as how it reconciles the heuristics’ results. The pseudocode for the Ontos heuristics represents an example of an extraction plan. There is doubt, however, that the Ontos implementation can be claimed to apply the optimal extraction plan for all cases. We hypothesize instead that different extraction algorithms are better for different applications. Indeed, even to measure the utility of different extraction plans requires the ability to contrast them in controlled experiments. We wish therefore to allow an ontology researcher or designer to specify a particular extraction plan declaratively for a given ontology.
Declarative extraction plans facilitate experimentation and dynamic tailoring of extraction plans for various applications.
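One way such declarative selection could work is for the ontology to name its plan and for the engine to resolve that name against a registry of available implementations. The sketch below is illustrative only; the class and method names are assumptions, not part of the current Ontos code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** One pass of extraction logic; implementations order the work differently. */
interface ExtractionPlan {
    String describe();
}

/** Hypothetical registry: an ontology names its plan, and the engine looks it up here. */
class ExtractionPlanRegistry {
    private final Map<String, Supplier<ExtractionPlan>> plans = new HashMap<>();

    /** Make a plan available under the name an ontology may refer to. */
    void register(String name, Supplier<ExtractionPlan> factory) {
        plans.put(name, factory);
    }

    /** Resolve the plan name found in an ontology to a fresh plan instance. */
    ExtractionPlan forName(String name) {
        Supplier<ExtractionPlan> factory = plans.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown extraction plan: " + name);
        }
        return factory.get();
    }
}
```

Under this arrangement, swapping plans for a controlled experiment would require only a change to the plan name carried in the ontology, not a change to the engine code.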
Parameterization of heuristics. The Ontos heuristics produce, at each step, a Boolean decision on membership of a value in a particular object set or relationship set. Once a value has been accepted, it is no longer available for assignment to a different set. As the obituaries example above demonstrated, this decision can sometimes be incorrect. We thus desire the heuristics to produce real-valued confidence scores for each value mapping. This will permit research into how to weight certain heuristics and otherwise combine confidence values into a final set of Boolean decisions for proposed value mappings. It will also permit the use of probabilistic and vector-based approaches, such as a hidden Markov model (HMM) learner for classification [8,10,15,17,20,22,23].

One of the factors motivating the proposed work is the considerable effort spent to develop a hidden Markov model learner for the current Ontos implementation. In the process of attempting to integrate the learner into Ontos, it was discovered that the heuristics could not work with such a different paradigm. In particular, the Boolean decisions of the heuristics could not easily be reconciled with the probabilities output by the HMM. The HMM work helped to demonstrate how flexible the Ontos code needs to be to accommodate such novel approaches.

The examples given above are a subset of the many possible ideas and techniques applicable to ontology-based data extraction that have not been achievable with the existing Ontos code. Combined with the errors and general rigidity of the current codebase, it is clear that a more flexible, understandable, modular, and powerful design for Ontos is required. The proposed thesis addresses these problems by defining an abstract object-oriented framework that permits powerful extraction plans and heuristics to be developed, configured, and tuned for an ontology-based data-extraction system.
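To illustrate the parameterization idea described above, a heuristic could return a confidence score rather than a Boolean, with a combiner weighting several such scores into a final decision. This is a sketch under assumed names; the particular weighting scheme shown is one possibility among the open research questions, not the proposed design:

```java
import java.util.List;

/** Scores a candidate mapping instead of accepting or rejecting it outright. */
interface ValueMappingHeuristic {
    /** Confidence in [0, 1] that the value belongs to the named object set. */
    double confidence(String value, String objectSetName);
}

/** Weights several heuristic scores into one confidence and a final Boolean decision. */
class WeightedCombiner {
    private final List<ValueMappingHeuristic> heuristics;
    private final double[] weights;   // one per heuristic; assumed to sum to 1
    private final double threshold;   // the mapping is accepted above this score

    WeightedCombiner(List<ValueMappingHeuristic> heuristics, double[] weights, double threshold) {
        this.heuristics = heuristics;
        this.weights = weights;
        this.threshold = threshold;
    }

    /** Weighted sum of the individual heuristic confidences. */
    double combinedConfidence(String value, String objectSetName) {
        double score = 0.0;
        for (int i = 0; i < heuristics.size(); i++) {
            score += weights[i] * heuristics.get(i).confidence(value, objectSetName);
        }
        return score;
    }

    /** The final Boolean decision is deferred until all heuristics have voted. */
    boolean accept(String value, String objectSetName) {
        return combinedConfidence(value, objectSetName) > threshold;
    }
}
```

Under such a scheme, the Funeral Date/Marriage Date ambiguity of Figure 1 becomes a comparison of scores: a keyword-proximity heuristic can raise the confidence of the Funeral Date mapping above that of the Marriage Date mapping, instead of the outcome being preempted by processing order.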
II.Thesis Statement

A generalized framework of interfaces and abstract classes written in Java can decouple the value-mapping heuristics from the rest of the Ontos data-extraction code. The framework will be sufficiently flexible both to allow the current heuristics to be re-implemented under the framework and to enable new heuristics to be created and incorporated into the Ontos system without requiring significant rewrites of the non-heuristics code.

III.Methods

This section describes the steps involved in accomplishing the work for the proposed thesis. The work may be broken down into five major steps, which are listed below:

1. Analysis of data-extraction functionality
2. Construction of the framework
3. Implementation of modules under the framework
4. Upgrades to existing tools
5. Evaluation of the new framework

We provide details in the following sections.

A.Analysis of Data Extraction Functionality

The first step in constructing a framework for data extraction is to determine common functionality at each level of abstraction. The abstract class DataExtractionEngine,
which encompasses any data-extraction system (not just those that are ontology-based), represents the highest level of abstraction.

A survey of current data-extraction research [4,6,15,18] provides the basis for defining the functionality at the highest levels of abstraction. Examination of the research centered on Ontos [6,7,16] supplies insight into abstracting ontology-based data-extraction systems. Much of this work has already been done for the proposed thesis. As an example of the work done, the following paragraph describes the abstract functionality of data-extraction systems [25]:

    In general, a data-extraction engine is initialized with a corpus of source documents (often mediated by an information retrieval system), a set of extraction rules to apply to the documents, and a target data structure to which extracted values are mapped. After initialization the engine executes an extraction plan, yielding a mapping between information in the documents and elements of the target data structure. Following the extraction step, the engine performs finalization tasks such as canonicalization of the extracted data or queries against the target data structure.

Figure 3 shows a UML diagram of a partial framework that supports the functionality described above for the Ontos system. Subclasses of ExtractionPlan implement the execute() method, accessing methods of the DocumentRetriever, ValueMappingHeuristic, Ontology, and DocumentStructureRecognizer classes to perform the extraction.
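The lifecycle in the quoted paragraph reads naturally as a template method: initialization, plan execution, and finalization are fixed steps, while the extraction plan supplies the variable behavior. A minimal Java sketch follows; method names echo Figure 3, except that finalize() is renamed finalizeExtraction() here to avoid colliding with Object.finalize(), and the logging exists only for illustration:

```java
/** The plan supplies the variable part of the engine lifecycle. */
interface ExtractionPlan {
    void execute();
}

/** Skeleton of the engine lifecycle described in the quoted paragraph. */
abstract class DataExtractionEngine {
    protected ExtractionPlan extractionPlan;
    protected final StringBuilder log = new StringBuilder(); // illustration only

    public void setExtractionPlan(ExtractionPlan plan) {
        this.extractionPlan = plan;
    }

    /** Load source documents, extraction rules, and the target structure. */
    protected void initialize() { log.append("initialize;"); }

    /** Run the configured plan to map document values to the target structure. */
    protected void doExtraction() {
        log.append("extract;");
        extractionPlan.execute();
    }

    /** Canonicalize extracted data, run queries against the target, and so forth. */
    protected void finalizeExtraction() { log.append("finalize;"); }

    /** Template method: the fixed lifecycle, with the plan as the variable step. */
    public final String run() {
        initialize();
        doExtraction();
        finalizeExtraction();
        return log.toString();
    }
}
```

Keeping the lifecycle in the abstract superclass and the plan behind an interface is what allows new plans to be plugged in without touching the engine.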
[Figure 3 (UML class diagram): DataExtractionEngine holds an ExtractionPlan (#extractionPlan) and declares setInitializationParameters(), initialize(), doExtraction(), and finalize(); ExtractionPlan declares execute(). Ontos specializes DataExtractionEngine, holds a DocumentRetriever, DocumentStructureRecognizer, and Ontology, and accepts an OntosExtractionPlan via setExtractionPlan(). OntosExtractionPlan specializes ExtractionPlan and uses the DocumentRetriever, DocumentStructureRecognizer, ValueMappingHeuristic, and Ontology classes; supporting classes include DocumentStructureParser, ObjectSet, and RelationshipSet.]

Figure 3: UML diagram of partial Ontos framework.

There is still some work to be done in determining at what level to abstract certain features. Most data-extraction systems make a number of assumptions regarding how they obtain documents to process, how the documents are structured, what language the document is written in, and so forth. It is therefore a significant challenge to decide how
to allow these assumptions to be made explicit when implementing the abstraction. This work will be completed for the proposed thesis.

B.Construction of a Framework

The main piece of work to be performed for the proposed thesis is development of the framework. The framework will be implemented in the Java language [13], which is the implementation language for the existing Ontos system and provides the object-oriented features necessary for abstraction of data-extraction functionality.

A set of abstract classes and interfaces will comprise the framework. In Java, abstract classes may contain instance attributes and method implementations, while interfaces may only contain static (class) attributes and method declarations. On the other hand, a Java class may implement multiple interfaces but inherit from only one parent class. Thus, most of the framework will be contained in interfaces to avoid problems with having to inherit from multiple abstract classes.

A key part of designing an abstract framework is specifying the data structures that are published by parts of the interface. For example, a domain ontology is a data structure that would presumably be passed into the initialization portion of a data-extraction engine. It is reasonable therefore to declare an Ontology abstract class or interface as part of the framework. Other data types will be defined to represent documents, extracted values, value-mapping heuristics, data frames, and so forth.

For data extraction using OSM ontologies, the OSM grammar must be extended to accommodate new capabilities such as declaring a reference to an existing extraction plan. The grammar must also be implemented in a data structure with an underlying concrete representation so that ontologies may be read from and written to a file. XML has been chosen for the concrete representation of OSM ontologies, and an XML Schema defining the grammar, termed OSM-X, has been designed.
Work is currently underway to wrap the XML structure within interfaces that translate XML nodes to OSM ontology elements.

C.Implementing Modules under the Framework

The proposed thesis will not only entail construction of the framework, but will also involve incorporation of the current Ontos code into the new framework. This will require reconstruction of the heuristics code as an ExtractionPlan subclass and individual Heuristic subclasses. Unintended features such as the object-set order dependency will be corrected to harmonize with the heuristics specification.

In addition to wrapping the current functionality of Ontos into the framework, a few additional features (namely, full generalization/specialization support and access to previously extracted data) will be newly implemented under the framework as a demonstration of its capabilities. However, many of the desired new features will be left for implementation by future researchers. These may include various document structure recognizers and parsers, specialized heuristics, learning-enabled extraction plans, and so forth.
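The XML-wrapping idea mentioned above amounts to an adapter: framework code sees only an Ontology interface, while the implementation walks the underlying DOM. The sketch below assumes hypothetical element and attribute names (ObjectSet, name), since the actual OSM-X grammar is not reproduced here:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Minimal ontology view; the real OSM-X schema is far richer than this sketch. */
interface Ontology {
    List<String> objectSetNames();
}

/** Adapter exposing an OSM-X-style XML document through the Ontology interface. */
class XmlOntology implements Ontology {
    private final Document doc;

    XmlOntology(String xml) throws Exception {
        doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Translate each <ObjectSet> node to the name of an OSM object set. */
    @Override
    public List<String> objectSetNames() {
        List<String> names = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("ObjectSet");
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(((Element) nodes.item(i)).getAttribute("name"));
        }
        return names;
    }
}
```

Because callers depend only on the interface, the concrete XML representation can evolve with the OSM-X schema without disturbing the extraction code.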
D.Upgrades to Existing Tools

In addition to the core Ontos data-extraction system, a graphical ontology editor tool [11] exists for use in constructing ontologies visually rather than textually. This tool will need to be rewritten to a sufficient degree to accommodate the new framework, particularly the changes to the OSM grammar. We will perform the work required to upgrade the ontology editor, coordinating with other researchers insofar as they seek to contribute to the tool upgrade.

E.Evaluation of the Framework

We will demonstrate the improved performance and capability of our new framework by means of a series of experiments that measure precision and recall in several data-extraction domains.

Support for existing functionality. The capabilities of Ontos under the new framework must be no less accurate than its present capabilities. A series of experiments will exercise the features of both the new and old Ontos, and the results will show whether the new framework is capable of performing the same tasks with comparable (or better) accuracy. To achieve this, we will select three different application domains (obituaries, used-car advertisements, and campground listings) and a corpus of documents on which to test the old and new Ontos implementations. We will process the documents and compare the extracted data to values extracted manually by a domain expert. This comparison will provide precision and recall figures for each implementation. We will then compare the precision and recall of the two implementations to gauge the performance of Ontos under the new framework.

Despite the weaknesses in the current implementation of Ontos, its performance has been quite good in certain domains. We therefore face the prospect of achieving lower precision and recall figures from the new heuristics code in certain cases, since the original code’s order dependencies may have serendipitously resulted in correct classification of extracted values.
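The comparison against expert-extracted values reduces to standard set arithmetic. A small sketch follows; it treats extracted values as plain strings, which simplifies away the value-matching issues the real evaluation must address:

```java
import java.util.HashSet;
import java.util.Set;

/** Precision and recall against a hand-built answer key, as used in the evaluation. */
class ExtractionScore {
    /** Fraction of the system's extracted values that the expert also extracted. */
    static double precision(Set<String> extracted, Set<String> expected) {
        if (extracted.isEmpty()) return 0.0;
        Set<String> correct = new HashSet<>(extracted);
        correct.retainAll(expected);  // values both the system and the expert found
        return (double) correct.size() / extracted.size();
    }

    /** Fraction of the expert's values that the system managed to find. */
    static double recall(Set<String> extracted, Set<String> expected) {
        if (expected.isEmpty()) return 0.0;
        Set<String> correct = new HashSet<>(expected);
        correct.retainAll(extracted);
        return (double) correct.size() / expected.size();
    }
}
```

Precision penalizes spurious extractions, recall penalizes missed ones, and comparing both figures across the old and new implementations yields the evaluation described above.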
We anticipate, however, that such differences are likely to be minor and counterbalanced by the new code’s greater reliability and faithfulness to its theoretical underpinnings. Also, with the ability to reorder and tune the heuristics beyond the scope of the original ontology-based extraction specification, we expect to see overall improvements in precision and recall over the current implementation.

Support for new functionality. We will demonstrate that the new framework is capable of more functionality than the current Ontos by successfully applying the new version of Ontos to problems that the old version was incapable of handling. Because the old version cannot even handle these problems, we will have no accuracy results to compare against.

The first test of new functionality will focus on support of generalization/specialization semantics. We will construct an ontology that contains a significant generalization/specialization relationship between some of its object sets. The selected application domain is product descriptions for digital display devices (monitors, projectors, and so on). This ontology will be applied to a set of documents in the application
domain to determine the effectiveness of the extraction. We will obtain precision and recall figures in a manner similar to that described earlier.

A second test of new functionality will be centered on use of previously extracted data. We will construct an extraction plan that will treat previously extracted data as an “on-the-fly” lexicon for recognizing related values later in the document. Our application domain will be a university catalog, in which we will seek out course codes to comprise a lexicon for identifying prerequisites or cross-listings. We will again calculate precision and recall figures as described above.

IV.Contribution to Computer Science

The thesis will contribute the following to the field of data extraction:
• An abstracted approach to data extraction, especially ontology-based data extraction
• The concept of extraction plans for data-extraction systems, comparable to query execution plans in relational database systems
• An object-oriented framework that permits extensive experimentation with different data-extraction techniques
• A revised Ontos codebase and ontology editor to aid in data-extraction research
• More powerful Ontos data extraction with at least the same accuracy as before
• The capability to process applications previously beyond the ability of Ontos

V.Delimitations of the Thesis

The following will not be addressed by the proposed thesis:
• Implementation of information retrieval modules
• Implementation of document structure recognizers and parsers
• Implementation of non-ontology-based data-extraction systems under the framework
• Support for full declarative specification of extraction plans as part of the ontology definition
• Canonicalization of extracted data

VI.Thesis Outline

1. Introduction (5-10 pages)
2. Related Work (2-4 pages)
3. Scenarios in Data Extraction (15-20 pages)
   a. Value Ambiguity
   b. Data Frame Inheritance
   c. Learning Capability
   d. Complex Documents
   e. Tuning
4.
Data Extraction Engines, Plans, and Heuristics (15-20 pages)
   a. General Data Extraction
   b. Ontologies and Data Frames
   c. Extraction Plans
   d. Value-Mapping Heuristics
5. A Framework for Extraction Plans and Heuristics (10-15 pages)
   a. Design Patterns and Frameworks
   b. Plan-Level Classes and Interfaces
   c. Heuristics-Level Classes and Interfaces
6. Implementation of Heuristics and Extraction Plans (2-5 pages)
7. Experimental Results (5-8 pages)
   a. Comparison of Existing Functionality
   b. Evaluation of New Functionality
8. Conclusions and Future Work (3-5 pages)
9. Appendices
   a. Documentation on Framework
   b. OSM-X Definition

VII.Thesis Schedule

Literature Search and Reading    September 2001 – March 2003
Design and Coding                March 2003 – June 2003
Chapter 3                        April 2003 – May 2003
Chapter 4                        May 2003 – June 2003
Chapter 5                        June 2003 – July 2003
Chapter 6                        July 2003
Chapter 7                        August 2003
Chapters 1, 2 & 8                August 2003 – September 2003
Thesis Revision and Defense      September 2003

VIII.Bibliography

[1] David Buttler, Ling Liu, Calton Pu. A fully automated object extraction system for the World Wide Web. In Proceedings of the 2001 International Conference on Distributed Computing Systems, 2001.

This paper reports excellent precision and recall results for a fully automated data-extraction system for HTML documents. Its primary limitations are its requirement that the documents be well formed and its dependence on document structure for locating tokens of interest. It is also highly reliant upon a limited set of built-in heuristics that decide how to identify token separator tags.

[2] Mary Elaine Califf and Raymond J. Mooney. Relational learning of pattern-match rules for information extraction. Working Notes of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, 1998.

Rapier, the system described in this paper, is a relational learner of pattern-matching rules. It is more general than most wrapper approaches since its patterns are based on parts of speech and lexicons of related terms, instead of specific delimiters found in a document. However, it requires text that is grammatical enough to be tagged for parts of speech.

[3] Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo.
Roadrunner: toward automatic data extraction from large Web sites. The VLDB Journal, 2001.

The authors introduce a system for fully automated wrapper generation to extract data from HTML pages. Their system is novel in several ways: it does not rely on user interaction, it has no a priori knowledge about the page contents, and it works with two pages at a time instead of just one.
[4] Line Eikvil. Information extraction from World Wide Web: A survey. Technical Report 945, Norwegian Computing Center, 1999.

A general survey of Web-based information extraction, this paper explains the history, metrics, and nature of the information-extraction problem. It focuses primarily on wrapper generation and systems centered on that approach.

[5] David W. Embley. Programming with data frames for everyday data items. AFIPS Conference Proceedings, 1980.

The author introduces the concept of data frames, which are similar to abstract data types and classes. Data frames define value recognizers, context keywords, internal representations, operators, and so forth. This work has influenced the design and use of data frames in Ontos.

[6] D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, D.W. Lonsdale, Y.-K. Ng, R.D. Smith. Conceptual-model-based data extraction from multiple-record Web pages. Data & Knowledge Engineering, 31, 1999.

This paper introduces the notion of ontology-based data extraction and describes the BYU-Ontos system. A domain ontology is a conceptual model for a particular subject, and may be composed of object sets, relationship sets, constraints, and various other elements. A data-extraction system can use ontologies together with lexicons and pattern-matching rules to recognize and extract portions of a document related to the domain of interest. This paper is the foundation for most of the work of the Data Extraction Group at BYU.

[7] D.W. Embley, C. Tao, S.W. Liddle. Automatically extracting ontologically specified data from HTML tables with unknown structure. Proceedings of the 21st International Conference on Conceptual Modeling, Tampere, Finland, 2002.

The paper describes an ontology-based approach to extracting tabular data from HTML by interpreting how values appear in HTML table structures. It represents a departure from the sequential-record assumption of earlier ontology-based data-extraction work at BYU.
[8] Shai Fine, Yoram Singer, Naftali Tishby. The hierarchical hidden Markov model: analysis and applications. Machine Learning, 32, 1998. This paper describes hierarchical hidden Markov models and algorithms for using them. A hierarchical HMM is a recursive generalization of a classical HMM: it is an HMM in which one or more hidden states are replaced by “nested” HMMs, which may themselves contain other HMMs in place of some of their states. The hierarchical nature of HHMMs may provide a natural fit to the relational aspect of domain ontologies.

[9] Dayne Freitag. Information extraction from HTML: application of a general machine learning approach. AAAI-98, 1998. This paper presents SRV, a wrapper-related information-extraction system. SRV is an extraction-rule learner based on the FOIL algorithm. Its learning algorithm is domain- and syntax-independent, but the rules it learns are specific to the documents from which it learned them.

[10] Dayne Freitag, Andrew Kachites McCallum. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.
The authors provide a good explanation of using HMMs in information extraction. Using HMMs with certain constraints, such as manually defined topologies, they apply a statistical smoothing technique called “shrinkage” to the model parameters in order to make the model’s accuracy more robust while requiring fewer training instances.

[11] Kimball Hewitt. An integrated ontology development environment for data extraction. Master’s thesis, Brigham Young University, 2000. This thesis describes an implementation of a graphical ontology editor still used by the Data Extraction Group for constructing ontologies and data frames.

[12] Chun-Nan Hsu. Initial results on wrapping semistructured Web pages with finite-state transducers and contextual rules. AAAI-98 Workshop on AI and Information Integration, 1998. This paper introduces SoftMealy, a wrapper-induction algorithm. It constructs a finite-state transducer for each type of item to extract, based on the delimiters encountered during document scanning. The finite-state transducers are more flexible in dealing with missing attributes and with various permutations of attributes.

[13] The Java tutorial. At http://java.sun.com/docs/books/tutorial/, 2003. This site explains features of the Java programming language, the primary implementation language of Ontos and the proposed framework.

[14] Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos. Wrapper induction for information extraction. International Joint Conference on Artificial Intelligence (IJCAI), 1997. The system presented in this paper follows the wrapper-generation orientation of early information-extraction work. Using PAC-learning principles and syntactic analysis, it induces wrappers for specific documents.

[15] T.R. Leek. Information extraction using hidden Markov models. Master’s thesis, UC San Diego, 1997. This thesis introduces the concept of using a hidden Markov model to extract factual information from a corpus of English prose.
It includes a sound justification for using HMMs in information extraction, focusing on the technique’s independence from human engineering and prior knowledge, as well as on the usefulness of computing a probability for class labels.

[16] S.W. Liddle, D.W. Embley, D.T. Scott, S.H. Yau. Extracting data behind Web forms. Proceedings of the Workshop on Conceptual Modeling Approaches for e-Business, Tampere, Finland, 2002. This paper describes an approach to document retrieval and data extraction for the “hidden Web” (data available only via HTML form requests). The approach is an example of the need for access to previously extracted data, which in this case can be used to supply values for a form request.
[17] Andrew McCallum, Dayne Freitag, Fernando Pereira. Maximum entropy Markov models for information extraction and segmentation. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML 2000), 2000. The authors present an alternative to hidden Markov models that retains many of their advantages. Maximum entropy Markov models allow observations to be treated as arbitrary overlapping features, and they represent the probability of a sequence more accurately by modeling the conditional probability of reaching a state given an observation and the previous state, rather than the joint probability that maximizes the likelihood of the observation sequences used to train the model.

[18] Ion Muslea. Extraction patterns for information extraction tasks: a survey. The AAAI-99 Workshop on Machine Learning for Information Extraction, 1999. Muslea surveys existing systems for rule-based information extraction and categorizes them by the type of text (free, structured, or semi-structured) targeted for extraction. The survey does not encompass BYU-Ontos.

[19] Ion Muslea, Steve Minton, Craig Knoblock. A hierarchical approach to wrapper induction. Proceedings of the Third International Conference on Autonomous Agents (Agents ’99), 1999. The authors present an intriguing variation on wrapper induction that accommodates richly structured documents. They view a document as a collection of nested lists and therefore construct a hierarchical wrapper to accommodate them.

[20] L.R. Rabiner, B.H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3 (1), 1986. This article is an oft-cited overview of hidden Markov models, the three primary problems of interest associated with them, and algorithms for solving those problems.
HMMs consist of a finite set of hidden states that transition among themselves according to a probability distribution; a finite set of symbols that may be emitted from any hidden state according to another probability distribution; and a probability distribution that determines the initial state of the model. Efficient algorithms exist for determining the likelihood of a particular symbol sequence given a model, and for finding the most likely sequence of hidden states that generated a particular sequence.

[21] Ellen Riloff. Automatically constructing a dictionary for information extraction tasks. National Conference on Artificial Intelligence, 1993. Riloff’s paper describes AutoSlog, a system that uses selective concept extraction in a one-shot machine-learning approach to extract terms for building a dictionary for a given domain. The paper provides an example of earlier and alternative approaches to machine-learned information extraction. Of particular note is the set of core heuristics on which AutoSlog relies; this dependence on heuristics resembles BYU-Ontos’ use of heuristics for its class labeler.

[22] Tobias Scheffer, Christian Decomain, Stefan Wrobel. Active hidden Markov models for information extraction. In Proceedings of the International Symposium on Intelligent Data Analysis, 2001. This paper gives an algorithm for training a hidden Markov model on partially labeled data using active-learning methods. The research examines how to train a model when the observation labels are not fully specified, although the paper focuses more on computing the error of estimated labels in order to present the more difficult observations interactively for a user to label.
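The HMM “evaluation” problem surveyed in [20] — computing the likelihood of an observed symbol sequence given a model — can be illustrated with the forward algorithm. The sketch below, in Java (the implementation language of Ontos), is purely illustrative: the class name, parameter layout, and example probabilities are hypothetical and are not taken from any cited system.

```java
public class HmmForward {

    /**
     * trans[i][j] : probability of moving from hidden state i to state j
     * emit[i][k]  : probability that hidden state i emits symbol k
     * init[i]     : probability that the model starts in hidden state i
     * obs         : the observed sequence, encoded as symbol indices
     */
    public static double likelihood(double[][] trans, double[][] emit,
                                    double[] init, int[] obs) {
        int n = init.length;
        double[] alpha = new double[n];
        // Initialization: probability of starting in state i and
        // emitting the first observed symbol there.
        for (int i = 0; i < n; i++) {
            alpha[i] = init[i] * emit[i][obs[0]];
        }
        // Induction: advance the forward probabilities one symbol at a time.
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    sum += alpha[i] * trans[i][j];
                }
                next[j] = sum * emit[j][obs[t]];
            }
            alpha = next;
        }
        // Termination: total likelihood is the sum over all final states.
        double total = 0.0;
        for (double a : alpha) {
            total += a;
        }
        return total;
    }

    public static void main(String[] args) {
        // A toy two-state, two-symbol model with made-up parameters.
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit  = {{0.9, 0.1}, {0.2, 0.8}};
        double[]   init  = {0.5, 0.5};
        int[] obs = {0, 1, 0};
        System.out.println("P(observations | model) = "
                + likelihood(trans, emit, init, obs));
    }
}
```

The decoding problem — recovering the most likely hidden-state sequence — has the same dynamic-programming structure (the Viterbi algorithm), with the inner sum replaced by a maximization and back-pointers recorded at each step.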
[23] Kristie Seymore, Andrew McCallum, Ronald Rosenfeld. Learning hidden Markov model structure for information extraction. In Proceedings of the AAAI Workshop on Machine Learning for Information Extraction, 1999. This paper focuses on learning the underlying topology of a hidden Markov model for information-extraction purposes, since the optimal topology is often not known a priori. An important result is that a topology containing only one hidden state per class label may be suboptimal. The paper also proposes techniques for learning with distantly labeled data: labeled data from another domain whose labels partially overlap those of the target domain.

[24] Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34, 1999. Soderland presents WHISK, a rule-based extraction approach. WHISK learns to detect information based on delimiting strings in the document, as well as on tagged grammatical elements. The system’s main contribution is its ability to learn extraction rules for free, semi-structured, and structured texts.

[25] Alan Wessman. Unpublished design document, 2003. This document contains working specifications for the framework proposed for the thesis.

IX.Artifacts
Aside from the thesis itself, the following artifacts will be produced:
• An XML Schema defining the underlying data structures for Ontos
• The proposed framework, realized in Java
• An extraction plan and heuristics implemented under the new framework
• Revisions of the graphical ontology editor and the Ontos system to use the framework
• Documentation on using the framework to implement extraction plans and heuristics
X.Signatures
This thesis proposal by Alan Wessman is accepted in its present form by the Department of Computer Science of Brigham Young University as satisfying the thesis proposal requirement for the degree of Master of Science.

______________________________
Date

_______________________________
David W. Embley, Committee Chair

_______________________________
Stephen W. Liddle, Committee Member

_______________________________
Thomas W. Sederberg, Committee Member

_______________________________
David W. Embley, Graduate Coordinator