The document discusses ontology-based information extraction (OBIE), which uses an ontology to guide the extraction of information from unstructured text and presents the extracted information in an ontology. It defines OBIE, outlines common OBIE methods such as linguistic rules, gazetteers, and classification techniques, and discusses the technologies, datasets, and evaluation used in OBIE implementations. While OBIE shows potential for creating Semantic Web content, the document notes that there are no agreed-upon OBIE methods yet and that open areas remain, such as developing better extractors and cross-lingual OBIE.
Ontology-based information extraction in the DERI Reading Group
1. The DERI Reading Group
Ontology-based information extraction:
An Overview & Survey
(2010, Wimalasuriya and Dou)
Tobias Wunner, UNLP Group
Copyright 2010 Digital Enterprise Research Institute. All rights reserved, Paul Buitelaar
2. Definition - Motivation
a) Create content for the Semantic Web
convert existing websites into ontologies
b) Improve quality of existing ontologies
Test criterion: OBIE task
OBIE good => ontology good
4. Overview
Access to information…
Ontology-based Information Extraction (OBIE):
“A system that processes unstructured or semi-structured natural language text guided by an ontology and presents the output in an ontology.”
6. Problem – two scenarios
1. Text only:
Extract conceptualization and instances
[Figure: labels County, Building, "building with café and football table" (is-a), and Galway DERI building, annotated as 1. conceptualization and 2. instances]
7. Problem – two scenarios
1. Text only:
conceptualization can be too specific / generic
wrong conceptualization
10. Definition – key characteristics
a) Process unstructured / semi-structured text
b) “guided” by an ontology
c) Present output in ontology
[Figure: Text Source, Information Extractor, Ontology; the Information Extractor is guided by the Ontology]
11. Definition – ontology learning or population?
Ontology population ⊂ OBIE
“OBIE is Open information extraction” (Etzioni)
alternative: semantics given by ontology!
extractors can be inside / outside ontology
12. Methods
Information extractors
1. Linguistic rules
2. Gazetteer lists
3. Classification (classical / structure-aware)
4. Partial parse trees
5. Structured data analyzers
6. Web querying
13. Linguistic Rules - Methods
Regular expressions
<COMPANY> .* revenue <Number> <currency>
“Tesco’s revenue in 2009 was 3.4 billion GBP.”
Extraction ontologies
combination of ontology and lexicon
(Mädche, Embley, Buitelaar)
manual construction
High precision
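The regular-expression rule on this slide can be sketched in Python as follows. This is only an illustration: the `COMPANIES` and `CURRENCIES` lists, the `extract_revenue` helper, and the output keys are assumptions made for the example, not part of the surveyed systems, which would take such lexical entries from the ontology or its lexicon.

```python
import re

# Hypothetical mini-lexicon (assumption); a real rule set would draw
# these entries from the ontology or an attached lexicon.
COMPANIES = ["Tesco", "Siemens", "SAP", "Vestas"]
CURRENCIES = ["GBP", "EUR", "USD"]

# Rough equivalent of: <COMPANY> .* revenue <Number> <currency>
PATTERN = re.compile(
    r"(?P<company>" + "|".join(re.escape(c) for c in COMPANIES) + r")"
    + r".*?\brevenue\b.*?"
    + r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>million|billion)?\s*"
    + r"(?P<currency>" + "|".join(CURRENCIES) + r")\b"
)

def extract_revenue(sentence):
    """Return the revenue fact matched by the rule, or None if it does not fire."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "company": match.group("company"),
        "value": float(match.group("value")),
        "unit": match.group("unit"),          # None when no scale word is present
        "currency": match.group("currency"),
    }

result = extract_revenue("Tesco’s revenue in 2009 was 3.4 billion GBP.")
print(result)
```

The year 2009 is skipped because the rule requires the number to be followed by a currency, which is the kind of hand-built constraint that gives such rules their high precision.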
14. 2. Gazetteer lists
Phrases / words instead of patterns
Named-Entity Recognition
Requirements:
1) Specify what is being extracted
2) Specify sources and avoid manual creation
Gazetteer Methods
Example gazetteer (industry): Semantic Web, Software, Energy, Supermarket, …
The software giant SAP…
Tesco a UK supermarket …
Siemens energy revenue…
… wind energy company Vestas
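A gazetteer lookup over the example sentences might look like the sketch below. The `INDUSTRY_GAZETTEER` list and the `find_industries` helper are illustrative names (assumptions); the matching is plain case-insensitive phrase lookup, which is only one of several ways a gazetteer can be applied.

```python
import re

# The industry terms shown on the slide, used as a tiny gazetteer.
INDUSTRY_GAZETTEER = ["Semantic Web", "Software", "Energy", "Supermarket"]

def find_industries(sentence, gazetteer=INDUSTRY_GAZETTEER):
    """Return the gazetteer entries that occur as whole phrases in the sentence."""
    hits = []
    for term in gazetteer:
        # Word boundaries avoid matching inside longer words.
        if re.search(r"\b" + re.escape(term) + r"\b", sentence, flags=re.IGNORECASE):
            hits.append(term)
    return hits

for text in [
    "The software giant SAP…",
    "Tesco a UK supermarket …",
    "Siemens energy revenue…",
    "… wind energy company Vestas",
]:
    print(text, "->", find_industries(text))
```

Unlike the pattern-based rule, the lookup needs no sentence structure, but it says nothing about which entity the matched industry belongs to.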
15. 3. Classification techniques
Break down the IE task into a set of binary tasks
Classification Methods
[Figure: features (pos, semTag) feed a Classifier with outputs c1, c2, .., cn]
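The decomposition into binary tasks can be illustrated with a short sketch. The `to_binary_tasks` helper, the toy examples, and the feature values (`NNP`, `org`, and so on) are assumptions for illustration; only the feature names pos and semTag and the class names c1, c2 come from the slide.

```python
def to_binary_tasks(examples, classes):
    """Turn one multi-class data set into one yes/no data set per class."""
    return {
        c: [(features, label == c) for features, label in examples]
        for c in classes
    }

# Toy tokens described by the two feature names on the slide.
# The feature values and labels are made up for this example.
EXAMPLES = [
    ({"pos": "NNP", "semTag": "org"}, "c1"),
    ({"pos": "NNP", "semTag": "loc"}, "c2"),
    ({"pos": "NN", "semTag": "none"}, "c1"),
]

tasks = to_binary_tasks(EXAMPLES, ["c1", "c2"])
for cls, data in tasks.items():
    print(cls, [is_member for _, is_member in data])
```

Each resulting data set can then be handed to any binary classifier; how the per-class decisions are combined is left open here.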
16. Classical
Classification Methods
Misclassification does not consider structure! (equal cost 1/6)
[Figure: concept labels Location, Country, City, Industry, SW, Energy; instance labels Galway, Germany, DERI, Siemens, GE, Ireland, Munich, CITEC, Tesco, Claddagh]
17. Structure aware
Classification Methods
Classifier should consider taxonomy structure!
[Figure: instance labels Galway, Germany, Siemens, GE, Ireland, Munich, CITEC, Tesco, Claddagh, DERI; example misclassification weight W_{1,6} = 3]
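The difference between a flat and a structure-aware misclassification cost can be sketched as below. The taxonomy layout (a hypothetical root `Thing` above Location and Industry, with Country and City under Location and SW and Energy under Industry) and the use of path length as the weight are assumptions; the slides do not show how the weights W are actually computed or how the classes are numbered.

```python
# Assumed taxonomy: child -> parent. "Thing" is a hypothetical root.
PARENT = {
    "Location": "Thing", "Industry": "Thing",
    "Country": "Location", "City": "Location",
    "SW": "Industry", "Energy": "Industry",
}

def path_to_root(concept):
    """List the concept followed by its ancestors up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def flat_cost(predicted, actual):
    """Classical view: every mistake costs the same."""
    return 0 if predicted == actual else 1

def tree_distance(predicted, actual):
    """Structure-aware view: number of edges between the two concepts."""
    up = path_to_root(predicted)
    depth_in_other = {c: i for i, c in enumerate(path_to_root(actual))}
    for steps, concept in enumerate(up):
        if concept in depth_in_other:
            return steps + depth_in_other[concept]
    raise ValueError("concepts share no ancestor")

print(flat_cost("Country", "City"), flat_cost("Country", "Energy"))
print(tree_distance("Country", "City"), tree_distance("Country", "Energy"))
```

Under the flat cost, confusing Country with City is as bad as confusing Country with Energy; under the path-length cost, the second mistake is penalised more because the concepts sit in different branches.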
18. 4. Partial parse trees
TACITUS, SMES, LTAG
5. Analyze structured data
Wikipedia infoboxes
6. Web querying
C-PANKOW
“Towards the self-annotating web”
Other methods
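As an illustration of analyzing structured data, the sketch below reads attribute-value pairs from infobox-style wikitext. The sample text, the company name, and the `parse_infobox` helper are made up for the example (assumptions), and the parser handles only the simple one-field-per-line layout, not nested templates or multi-line values.

```python
import re

# One "| key = value" field per line, as in simple infobox wikitext.
INFOBOX_FIELD = re.compile(r"^\s*\|\s*(?P<key>[^=|]+?)\s*=\s*(?P<value>.*?)\s*$")

def parse_infobox(wikitext):
    """Collect non-empty key/value fields from infobox-style wikitext."""
    fields = {}
    for line in wikitext.splitlines():
        match = INFOBOX_FIELD.match(line)
        if match and match.group("value"):
            fields[match.group("key")] = match.group("value")
    return fields

# Hypothetical sample, not taken from a real Wikipedia page.
SAMPLE = """{{Infobox company
| name     = Example Corp
| industry = Software
| location = Galway
}}"""

print(parse_infobox(SAMPLE))
```

The extracted pairs could then be mapped to ontology properties, which is where the ontology guidance would come in.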
20. Data sets & evaluation
Data sets (corpora)
1) Message Understanding Conference (MUC-7)
2) Automatic Content Extraction (ACE)
=> more on classical IR, IE, NLP tracks
=> no data set with given semantics (ontology)
Evaluation
Precision & recall
Only used for population task
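For the population task, precision and recall can be computed by comparing extracted facts with a gold standard, as in the sketch below. The triples, including the property names `locatedIn` and `industry`, are toy data invented for the example (assumptions), not results from any OBIE system.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted facts against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Toy (instance, property, value) triples, made up for illustration.
EXTRACTED = {
    ("DERI", "locatedIn", "Galway"),
    ("Tesco", "industry", "Supermarket"),
    ("SAP", "industry", "Energy"),        # wrong extraction
}
GOLD = {
    ("DERI", "locatedIn", "Galway"),
    ("Tesco", "industry", "Supermarket"),
    ("SAP", "industry", "Software"),
    ("Vestas", "industry", "Energy"),     # missed by the extractor
}

precision, recall = precision_recall(EXTRACTED, GOLD)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the three extracted facts are correct, and two of the four gold facts are found.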
21. Recent Open IE argument
Con: Weikum, From Information to Knowledge -
Harvest Web Resources for IE
Disambiguation
NL relations are not well defined (well defined
arguments)
Pro: Weld, Using Wikipedia to Bootstrap Open IE
Relation targeted:
learn extractor per relation -> lower recall
Structural targeted:
general extraction engine -> lower precision
22. Conclusion and Outlook
No established / agreed methods yet
Is OBIE also ontology learning?
Data sets
Methods for best extractors
Semantic Web contribution?
e.g. gazetteers from DBpedia
Cross-lingual OBIE -> CLOBIE
23. References
[1] Wimalasuriya, Dou, Ontology-based Information Extraction: An Introduction and Survey of Current Approaches, Journal of Information Science, June 2010
[2] Buitelaar et al., Towards linguistically grounded ontologies, ESWC, Springer, 200
[3] Weikum et al., From Information to Knowledge: Harvesting Entities and Relationships from Web Sources, Principles of Database Systems (PODS), 2010
[4] Weld et al., Using Wikipedia to bootstrap open information extraction, SIGMOD Record, 2008