Use of ontologies in natural language processing

Athman Hajhamou
Computer and Modeling Laboratory –
USMBA- FSDM – Fès

Summary
 Limitations of classical approaches
 Use of Ontology
 State of the Art.

Limitations of classical approaches
 The huge number of documents available on the Web makes finding relevant ones a challenging task. Full-text search, still the most popular form of search and the one offered by the most widely used services such as Google, is very useful for retrieving documents, but it is usually not suited to finding relevant documents on a specific topic that the user has not yet seen.

 The major reasons why purely text-based search fails to find some of the relevant documents are the following:

 Vagueness of natural language: synonyms, homographs and inflected word forms can all fool algorithms that see search terms only as sequences of characters.

 High-level, vague concepts: abstract, vaguely defined concepts such as the Kosovo conflict, the Industrial Revolution or the Iraq War are often not mentioned explicitly in relevant documents, so present search engines cannot find those documents.

 Semantic relations: relations such as partOf cannot be exploited. For example, users who search for the Great Maghreb will not find relevant documents that mention only Rabat or Morocco.
 Time dimension: keyword matching is not adequate for handling time specifications. If we search for documents about the “XX century” using exactly this phrase, relevant resources that contain only character sequences such as 1945 or 1956 will not be found by simple keyword matching.
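
The two failures described here can be reproduced on a toy corpus. The sketch below is only an illustration: the three documents, the partOf facts and all function names are invented, and the "ontology" is reduced to a child-to-parent dictionary and a century-to-year-range rule.

```python
import re

# Toy corpus and part-of facts, invented for illustration.
DOCS = {
    "d1": "Rabat hosted a regional trade summit",
    "d2": "Morocco gained independence in 1956",
    "d3": "A history of the Great Maghreb",
}
PART_OF = {"Rabat": "Morocco", "Morocco": "Great Maghreb"}  # child -> parent


def keyword_search(query: str) -> set[str]:
    """Plain keyword matching: the query must occur literally in the text."""
    needle = query.lower()
    return {doc_id for doc_id, text in DOCS.items() if needle in text.lower()}


def parts_of(entity: str) -> set[str]:
    """Return the entity plus everything that is transitively part of it."""
    found = {entity}
    changed = True
    while changed:
        changed = False
        for child, parent in PART_OF.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def part_of_search(query: str) -> set[str]:
    """Keyword search repeated for every known part of the queried entity."""
    hits: set[str] = set()
    for name in parts_of(query):
        hits |= keyword_search(name)
    return hits


def century_search(century: int) -> set[str]:
    """Match any four-digit year that falls inside the given century."""
    first, last = (century - 1) * 100 + 1, century * 100
    hits: set[str] = set()
    for doc_id, text in DOCS.items():
        years = (int(y) for y in re.findall(r"\b\d{4}\b", text))
        if any(first <= year <= last for year in years):
            hits.add(doc_id)
    return hits
```

With this data, keyword_search("Great Maghreb") finds only d3 and keyword_search("XX century") finds nothing, whereas part_of_search("Great Maghreb") also reaches d1 and d2, and century_search(20) finds d2 through the year 1956.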

 Although most present systems can successfully handle the various inflected forms of words using stemming algorithms, the many heuristics and ranking formulas based on text statistics that classical IR research has developed over the last decades do not seem able to master the other issues mentioned. One reason is that term co-occurrence, which most statistical methods use to measure the strength of the semantic relation between words, is not valid from a linguistic and semantic point of view.

 Besides statistics based on term co-occurrence, another way to improve search effectiveness is to incorporate background knowledge into the search process. So far the IR community has concentrated on background knowledge expressed in the form of thesauri. A thesaurus defines a set of standard terms that can be used to index and search a document collection (a controlled vocabulary) and a set of linguistic relations between those terms. Thesauri thus promise a solution to the vagueness of natural language and, partially, to the problem of high-level concepts.

 While intuitively one would expect significant gains in retrieval effectiveness from the use of thesauri, experience shows that this is usually not the case.

 One of the major causes is the “noise” in the relations between thesaurus terms. Linguistic relations such as synonymy normally hold only between specific meanings of two words, but thesauri represent those relations on a syntactic level.
 Another big problem is that creating thesauri manually and annotating documents with thesaurus terms is very expensive. As a result, annotations are often incomplete or erroneous, which decreases search performance.
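
The noise problem can be made concrete with a small sketch. The thesaurus entries and sense labels below are invented; the point is only that synonyms attached to a character string mix meanings, while synonyms attached to a specific meaning do not.

```python
# Word-level thesaurus: synonyms are attached to a character string
# (invented entries, for illustration only).
WORD_SYNONYMS = {"bank": {"shore", "lender"}}

# Sense-level alternative: synonyms are attached to a (word, meaning) pair.
SENSE_SYNONYMS = {
    ("bank", "finance"): {"lender"},
    ("bank", "river"): {"shore"},
}


def expand_by_word(term: str) -> set[str]:
    """Query expansion on the syntactic level: every synonym of the string."""
    return {term} | WORD_SYNONYMS.get(term, set())


def expand_by_sense(term: str, sense: str) -> set[str]:
    """Query expansion for one specific meaning of the term."""
    return {term} | SENSE_SYNONYMS.get((term, sense), set())
```

A user interested in finance who expands "bank" on the word level also searches for "shore", which is exactly the kind of noise described above; the sense-level expansion avoids it but requires knowing which meaning is intended.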

Use of Ontology
 Ontologies form the basic infrastructure of the Semantic Web.
 By ontology we mean any formalism with a well-defined mathematical interpretation that can represent at least a subconcept taxonomy, concept instances and user-defined relations between concepts.
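
As a rough illustration of these three minimum ingredients, the following sketch stores a subconcept taxonomy, concept instances and named relations in plain Python containers. It is not a real ontology language: the class name, its fields and the example facts are hypothetical, and the taxonomy is assumed to be free of cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # concept -> its direct superconcept (the subconcept taxonomy)
    parent: dict[str, str] = field(default_factory=dict)
    # instance -> the concept it belongs to
    instance_of: dict[str, str] = field(default_factory=dict)
    # user-defined relations as (subject, relation name, object) triples
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def is_subconcept(self, concept: str, ancestor: str) -> bool:
        """True if concept equals ancestor or lies below it in the taxonomy."""
        current = concept
        while current is not None:
            if current == ancestor:
                return True
            current = self.parent.get(current)
        return False

    def instances(self, concept: str) -> set[str]:
        """All instances of the concept, including those of its subconcepts."""
        return {
            inst
            for inst, cls in self.instance_of.items()
            if self.is_subconcept(cls, concept)
        }


geo = Ontology(
    parent={"City": "Place", "Country": "Place"},
    instance_of={"Rabat": "City", "Morocco": "Country"},
    relations={("Rabat", "partOf", "Morocco")},
)
```

Here geo.instances("Place") returns both Rabat and Morocco because the taxonomy is followed upwards, which a flat word list cannot do.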

 Such formalisms allow a much more sophisticated representation of background knowledge than classical thesauri. They represent knowledge on the semantic level, i.e., they contain semantic entities (concepts, relations and instances) instead of simple words, which eliminates the noise mentioned above from the relations.
 They make it possible to specify custom semantic relations between entities and to store well-known facts and axioms about a knowledge domain (including temporal information).

 On this basis, ontologies theoretically solve all of the problems of full-text search mentioned above. Unfortunately, ontologies and the semantic annotations that use them are hardly ever perfect, for the same reasons that were described for thesauri. Indeed, good-quality ontologies and semantic annotations are at present a very scarce resource.

State Of the Art

Ontologies as Background Knowledge
to Explore Document Collections

Nathalie Aussenac-Gilles & Josiane Mothe

Institut de Recherche en Informatique de Toulouse

 An alternative way to go beyond bags of words is to organise indexing terms into a structure more complex than a "bag", such as a hierarchy or an ontology. Texts would then be indexed by concepts that reflect their meaning rather than by words treated as character strings, with all the ambiguity that these convey.

 Aussenac-Gilles and Mothe promote an approach in which information search and exploration take place in a domain-dependent semantic context. This context is described through its controlled vocabulary, organised along hierarchies that are all extracted from a single, unifying domain ontology. Each hierarchy reveals a given point of view on the domain, that is to say a dimension.

 In this approach, the ontology and the derived hierarchies provide the query language for users. Users can browse the concept hierarchies and select the terms they want to add to their query, and they can also explore the information space from different points of view, through the domain vocabulary and its structure.

 Given a domain, a user defines his or her own information space. It is composed of a selection of hierarchies, or dimensions, from the set of possible ones. This selection reflects the user's focus of interest and leads to the identification of the associated documents.

 Dimensions and their visualization offer a novel way to provide users with global views and knowledge of the document collection. A key component of this approach is that the domain ontology makes it possible to define a visual presentation of the entire collection, or of a sub-collection, based on multi-dimensional analysis, as is done in OLAP systems.
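
A multi-dimensional view of this kind can be approximated by a simple cross-tabulation. The sketch below is a simplification under stated assumptions: each document is indexed by at most one concept per dimension, and the dimension names, concepts and documents are invented.

```python
from collections import Counter

# Hypothetical index: document -> concept selected in each dimension.
DOC_INDEX = {
    "d1": {"period": "20th century", "region": "Maghreb"},
    "d2": {"period": "20th century", "region": "Maghreb"},
    "d3": {"period": "19th century", "region": "Balkans"},
    "d4": {"period": "20th century"},  # not indexed along "region"
}


def cross_tab(index: dict[str, dict[str, str]], dim_a: str, dim_b: str) -> Counter:
    """Count documents for every pair of concepts taken from two dimensions."""
    counts: Counter = Counter()
    for concepts in index.values():
        if dim_a in concepts and dim_b in concepts:
            counts[(concepts[dim_a], concepts[dim_b])] += 1
    return counts


table = cross_tab(DOC_INDEX, "period", "region")
```

Each cell of the resulting table is the number of documents sharing a pair of concepts, which is the kind of global overview an OLAP-style presentation would display; a real system would also roll counts up along each hierarchy.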

 Strengths :
 with the help of the ontology, users should express
their needs more easily.
 documents can be seen under many dimensions (or
points of view) that could be used in order to extract
some knowledge from their content.
For the document categorization task, a concept from an ontology can be viewed as a category.
 Weaknesses :
building an ontology is a complex and time-
consuming task: experts (domain and ontology
experts) often manually do it.
the evolution of domain knowledge is problematic, for
example new terms appear, other terms are no longer
used.

State Of the Art

Ontological Profiles as Semantic
Domain Representations

Geir Solskinnsbakk & Jon Atle Gulla

Norwegian University of Science and Technology

 Using ontologies for query disambiguation or reformulation seems more promising, though there is a fundamental problem in comparing ontology concepts with query or document terms. Concepts are abstract notions that are not necessarily linked to a particular term. Sometimes several terms refer to the same concept, and sometimes a specific term is a realization of different concepts depending on the context.
 Using conceptual structures to index or retrieve document text requires something that bridges the conceptual world and the real world.
 Research indicates that ontologies are of little use if they are not aligned with the documents indexed by the search application.

 Solskinnsbakk and Gulla present an ontology enrichment approach that both bridges the conceptual world and the real world and ensures that the ontology is well adapted to the documents at hand.

 The idea is to provide contextual concept characterizations that reveal how the concepts are referred to semantically in the document collection.

 An ontological profile is an extension of a domain
ontology. The ontology is extended with semantically
related terms. These terms are added as vectors for
each of the concepts of the ontology.

 This means that in the ontological profile each concept
is associated with a vector of semantically related terms
(concept vector). The terms are given weights to reflect
the importance of the semantic relation between the
concept and the terms.

 The concept vectors typically contain terms that are
synonyms to the concept.

 The construction of these ontological profiles is based on three different aspects of the content of the documents used.
The first is statistical: the frequency of the terms in the documents is counted. Terms that co-occur with a concept more frequently are hypothesized to be more relevant to that concept than terms that co-occur with it less frequently.
The second is linguistic: stemming is applied to collapse certain terms into a single form.
The third is a proximity analysis of the text. The underlying assumption is that the closer together terms are found in the text, the more semantically related they are.

 The highest weight is given to terms found in the same sentence as the concept name phrase (the highest semantic coherence). Terms found in the same paragraph as the concept are given a lower weight than sentence terms but a higher weight than terms found only in the same document.

 The basis for the weight calculation is the term frequency of each term found in the relevant documents.

 Applying the familiar tf*idf score to these frequencies brings us closer to the final representation of the vectors. The idf factor gives more importance to terms that are found in few documents across the document collection.

 The weighted term frequency of term i in concept vector j is

\[ vf_{i,j} = a \sum_{k \in D} f_{i,k} + b \sum_{k \in P} f_{i,k} + c \sum_{k \in S} f_{i,k} \]

where vf_{i,j} is the term frequency for term i in concept vector j, f_{i,k} is the term frequency for term i in document vector k, D, P, and S are the possibly empty sets of relevant documents, paragraph documents and sentence documents assigned to j, and a = 0.1, b = 1.0, and c = 10.0 are the constant modifiers for documents, paragraph documents, and sentence documents, respectively.

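
The weighting of document, paragraph and sentence evidence can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the weighted frequency is a sum over the three sets, reads the modifiers as 0.1, 1.0 and 10.0, tokenizes by whitespace, and omits stemming; all function names are made up.

```python
from collections import Counter, defaultdict

# Constant modifiers, read here as 0.1 / 1.0 / 10.0 (an assumption).
A_DOCUMENT, B_PARAGRAPH, C_SENTENCE = 0.1, 1.0, 10.0


def term_counts(text: str) -> Counter:
    """Raw term frequencies of one text unit (whitespace tokens, lowercased)."""
    return Counter(text.lower().split())


def concept_term_frequencies(
    documents: list[str], paragraphs: list[str], sentences: list[str]
) -> dict[str, float]:
    """Weighted term frequencies for one concept.

    The three lists hold the documents, paragraphs and sentences that were
    assigned to the concept; each list may be empty.
    """
    weighted: dict[str, float] = defaultdict(float)
    for modifier, units in (
        (A_DOCUMENT, documents),
        (B_PARAGRAPH, paragraphs),
        (C_SENTENCE, sentences),
    ):
        for unit in units:
            for term, count in term_counts(unit).items():
                weighted[term] += modifier * count
    return dict(weighted)


vector = concept_term_frequencies(
    documents=["ontology search engine"],
    paragraphs=["ontology search"],
    sentences=["ontology"],
)
```

In this example "ontology" occurs at all three levels and ends up with a far larger weight than "engine", which occurs only at document level.
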
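 The tf*idf score of term i in concept vector j is

\[ tfidf_{i,j} = \frac{vf_{i,j}}{\max_{l} vf_{l,j}} \cdot \log \frac{N}{n_i} \]

where tfidf_{i,j} is the tfidf score for term i in concept vector j, vf_{i,j} is the term frequency for term i in concept vector j, \max_{l} vf_{l,j} is the frequency of the most frequently occurring term in concept vector j, N is the number of concept vectors, and n_i is the number of concept vectors containing term i.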
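
The normalisation and idf step can be sketched in a few lines. The sketch assumes the usual max-normalised tf multiplied by log(N / n) and uses the natural logarithm, since the base is not stated; the function name and the example vectors are hypothetical.

```python
import math


def tfidf_vectors(
    concept_vectors: dict[str, dict[str, float]]
) -> dict[str, dict[str, float]]:
    """Turn weighted term frequencies per concept into tf*idf scores."""
    n_vectors = len(concept_vectors)

    # n: in how many concept vectors does each term appear?
    containing: dict[str, int] = {}
    for vector in concept_vectors.values():
        for term in vector:
            containing[term] = containing.get(term, 0) + 1

    scored: dict[str, dict[str, float]] = {}
    for concept, vector in concept_vectors.items():
        max_freq = max(vector.values(), default=0.0)
        if not max_freq:
            scored[concept] = {}
            continue
        scored[concept] = {
            term: (freq / max_freq) * math.log(n_vectors / containing[term])
            for term, freq in vector.items()
        }
    return scored


profiles = tfidf_vectors(
    {
        "Car": {"car": 4.0, "wheel": 2.0},
        "Boat": {"boat": 3.0, "wheel": 3.0},
    }
)
```

Because "wheel" appears in both concept vectors, its idf factor is log(2/2) = 0 and it drops out of both profiles, while "car" and "boat" keep a positive score.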

 Strengths :
 The ontological profile is used as a tool for the semantic reformulation of queries on top of a standard vector-space search engine (Apache Lucene): the reformulated query is submitted to the index as an ordinary query. The system can thus hide from users the fact that an ontology is used, and users only have to enter familiar keyword queries.
 Weaknesses :
 Documents are assigned to a concept by running the concept name as a phrase query against the three indexes (documents, paragraphs and sentences); all documents containing the phrase are assigned to the concept as relevant. This is a challenge, because some concept names are artificial in their construction or are not used in the text in the form given in the ontology. As a result, many of the concepts are not found when documents are assigned to concepts.

State Of the Art

An Ontology-Based Information
Retrieval Model

David Vallet, Miriam Fernández & Pablo Castells

Universidad Autónoma de Madrid

 Vallet, Fernández and Castells propose an ontology-based retrieval model meant for the exploitation of full-fledged domain ontologies and knowledge bases (KBs) to support semantic search in document repositories. In contrast to Boolean semantic search systems, full documents, rather than specific ontology values from a KB, are returned in response to user information needs. The search system takes advantage of both the detailed instance-level knowledge available in the KB and topic taxonomies for classification.

 The approach includes an ontology-based scheme for the semi-automatic annotation of documents, and a retrieval system. The retrieval model is an adaptation of the classic vector-space model and includes an annotation weighting algorithm and a ranking algorithm.

 The system requires that the knowledge base be constructed from three main base classes: DomainConcept, Taxonomy, and Document.
DomainConcept should be the root of all domain classes that can be used (directly or after subclassing) to create instances describing specific entities referred to in the documents.
Document is used to create instances that act as proxies for the documents of the information source to be searched.
Taxonomy is the root of class hierarchies that are used merely as classification schemes and are never instantiated. These taxonomies are expected to be used as a terminology for annotating documents and concept classes, by using their classes as values of dedicated properties.

 The predefined base ontology classes described above are complemented by an annotation ontology that provides the basis for the semantic indexing of documents with non-embedded annotations.

 Documents are annotated with concept instances from the KB by creating instances of the Annotation class, which is provided for this purpose. Annotation has two relational properties, instance and document, through which concepts and documents are related to each other. Reciprocally, DomainConcept and Document have a multivalued annotation property.

 Annotations can be created manually by a domain expert, or semi-automatically. The subclasses ManualAnnotation and AutomaticAnnotation are used for these two cases, respectively.

 DomainConcept instances use a label property to store the most usual text form of the concept class or instance. This property is multivalued, since instances may have several lexical variants.

 Whenever the label of an instance is found in a document, an annotation is created between the instance and the document. Documents can be annotated with classes as well, by assigning labels to concept classes.

 The annotations are used by the retrieval and ranking module.
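
Label-based annotation can be sketched as below. The sketch is a simplification: the Annotation class keeps only the instance and document properties named above and replaces the ManualAnnotation and AutomaticAnnotation subclasses with a boolean flag, matching is case-insensitive on whole words, and the instance identifiers and labels are invented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    instance: str           # identifier of a DomainConcept instance
    document: str           # identifier of a Document proxy
    automatic: bool = True  # stands in for the two annotation subclasses


# Hypothetical multivalued label property: instance -> lexical variants.
LABELS = {
    "kb:Rabat": ["Rabat"],
    "kb:UnitedNations": ["United Nations", "UN"],
}


def annotate(doc_id: str, text: str, labels: dict[str, list[str]]) -> list[Annotation]:
    """Create one annotation per instance whose label occurs in the text."""
    lowered = text.lower()
    found = []
    for instance, variants in labels.items():
        patterns = (r"\b" + re.escape(v.lower()) + r"\b" for v in variants)
        if any(re.search(pattern, lowered) for pattern in patterns):
            found.append(Annotation(instance=instance, document=doc_id))
    return found


annotations = annotate("doc1", "The United Nations met in Rabat.", LABELS)
```

The word-boundary check keeps a short label such as "UN" from matching inside a longer word like "unusual"; because matching ignores case, it would still match a standalone lowercase "un", which is one reason such annotations remain imperfect.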

 In the classic vector-space model, keywords appearing in a document are assigned weights that reflect the fact that some words are better than others at discriminating between documents.

 Similarly, in this approach annotations are assigned a weight that reflects how relevant the instance is considered to be to the meaning of the document.

 Weights are computed automatically by an adaptation of the TF-IDF algorithm, based on the frequency of occurrence of the instances in each document.

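 The weight of an annotation is computed as

\[ w_{ij} = \frac{freq_{ij}}{\max_{k} freq_{kj}} \cdot \log \frac{N}{n_i} \]

where w_{ij} is the weight of instance I_i for document D_j, freq_{ij} is the number of occurrences of I_i in D_j, \max_{k} freq_{kj} is the frequency of the most repeated instance in D_j, n_i is the number of documents annotated with I_i, and N is the total number of documents in the search space.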
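
A small worked computation of these weights is sketched below. It assumes the TF-IDF adaptation has the usual form (occurrences normalised by the most repeated instance, times log(N / n)) with a natural logarithm; the function name and the occurrence data are hypothetical.

```python
import math
from collections import Counter


def annotation_weights(occurrences: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Weight of every (instance, document) annotation.

    occurrences maps a document id to the list of instance ids found in it,
    with one list entry per occurrence. Documents without annotations still
    count towards N, the size of the search space.
    """
    n_docs = len(occurrences)
    counts = {doc: Counter(found) for doc, found in occurrences.items()}
    # Number of documents annotated with each instance.
    docs_with = Counter(inst for doc_counts in counts.values() for inst in doc_counts)

    weights: dict[tuple[str, str], float] = {}
    for doc, doc_counts in counts.items():
        if not doc_counts:
            continue
        max_freq = max(doc_counts.values())
        for inst, freq in doc_counts.items():
            weights[(inst, doc)] = (freq / max_freq) * math.log(n_docs / docs_with[inst])
    return weights


weights = annotation_weights(
    {"d1": ["Rabat", "Rabat", "UN"], "d2": ["UN"], "d3": []}
)
```

In this example Rabat is the most repeated instance of d1 and annotates only one of the three documents, so it receives the highest weight, log 3; UN appears in two documents and is weighted lower.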

 The system takes a formal RDQL query as input. This query could be generated from a keyword query, a natural language query, a form-based interface where the user explicitly selects ontology classes and enters property values, or a more sophisticated search interface.

 The RDQL query is executed against the knowledge base, which returns a list of instance tuples that satisfy the query. The documents annotated with these instances are then retrieved, ranked, and presented to the user.

 Strengths :
Better recall when querying for class
instances and using class hierarchies and
rules.
Better precision by using query weights and
structured semantic queries.
 Weaknesses :
The degree of improvement of this semantic
retrieval model depends on the completeness
and quality of the ontology, the KB, and the
concept labels.

State Of the Art

Improving information retrieval
effectiveness by using domain
knowledge stored in ontologies

Gabor Nagypal

University of Karlsruhe, Germany

 The quality of the results that traditional full-text search engines provide is still not optimal for many types of user queries. In particular, the vagueness of natural language, abstract concepts, semantic relations and temporal issues are handled inadequately by full-text search. Ontologies and semantic metadata can provide a solution to these problems.

 The goal of this thesis is to examine and validate whether and how ontologies can help improve retrieval effectiveness in information systems, taking into account the inherent imperfection of ontology-based domain models and annotations.

 The work examines how ontologies can be optimally exploited during the information retrieval process, and proposes a general framework based on ontology-supported semantic metadata generation and ontology-based query expansion.

 This research evaluates the following hypotheses:

Ontologies make it possible to store domain knowledge in a much more sophisticated form than thesauri. It is therefore assumed that a significant gain in retrieval effectiveness can be measured when ontologies are used in IR systems.
The better (more precisely) an ontology models the application domain, the greater the gain in retrieval effectiveness.
The negative effect of ontology imperfection on search results can be diminished by combining different ontology-based heuristics during the search process.
There is a well-known trade-off between algorithm complexity and performance, and this also holds for ontologies. Still, the assumption of this approach is that combining ontologies with traditional IR methods makes it possible to provide results with acceptable performance.

 Background knowledge stored in the form of ontologies can be used at practically every step of the IR process.

 This work therefore provides solutions for ontology-based query extension, ontology-supported query formulation and ontology-supported metadata generation (indexing).

 This leads to a conceptual system architecture in which the Ontology Manager component plays a central role and is used extensively by the Indexer, Search Engine and GUI components.

 The information model defines how documents and the user query are represented in the system. The model used in this work represents the content of a resource as a weighted set of instances (a bag of ontology instances) from a suitable domain ontology (the conceptual part), together with a weighted set of temporal intervals (the temporal part).

 The representation of the conceptual part is practically identical to the information model used by classical IR engines built on the vector space model, except that the vector terms are ontology instances instead of words of a natural language.

 Time, as a continuous phenomenon, has different characteristics from the discrete conceptual part of the information model. The first question concerning time is how to define similarity between weighted sets of time intervals.

 A possible solution under consideration is the temporal vector space model. Its main idea is that if a discrete time representation is chosen, the granules of the lowest level can be viewed as terms, so the vector space model is also applicable to the time dimension.
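
The idea of treating time granules as vector terms can be sketched as follows. The choice of days as the lowest-level granule, the uniform weight per day and the use of cosine similarity are assumptions made for the illustration, as are the function names.

```python
import math
from datetime import date, timedelta


def day_granules(start: date, end: date, weight: float = 1.0) -> dict[date, float]:
    """Expand the closed interval [start, end] into one weighted term per day."""
    n_days = (end - start).days + 1
    return {start + timedelta(days=offset): weight for offset in range(n_days)}


def cosine(u: dict[date, float], v: dict[date, float]) -> float:
    """Cosine similarity between two sparse granule vectors."""
    dot = sum(weight * v[granule] for granule, weight in u.items() if granule in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


document_time = day_granules(date(1956, 1, 1), date(1956, 1, 4))
query_time = day_granules(date(1956, 1, 3), date(1956, 1, 6))
similarity = cosine(document_time, query_time)  # two shared days out of four each

twentieth_century = day_granules(date(1901, 1, 1), date(2000, 12, 31))
```

The two four-day intervals share two days, giving a similarity of 0.5. The last line shows the cost of the representation: a single century already expands to 36,525 day terms.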

 During query formulation, the ontology is used only to disambiguate queries specified in textual form. A classical full-text search is run on the ontology labels, so users only have to choose the proper interpretation of each term.

 Query processing applies various ontology-based heuristics one by one to create separate queries, which are executed independently using a traditional full-text engine. The ranked results are then combined to form the final ranked result list. The combination is based on the belief network model, which allows various pieces of evidence to be combined using Bayesian inference.
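
The slides do not spell out the belief network itself, so the sketch below substitutes a much simpler stand-in: a noisy-OR combination, in which each heuristic's score is treated as an independent probability that the document is relevant. The function name, the scores and the requirement that scores lie in [0, 1] are all assumptions of this illustration.

```python
def combine_noisy_or(result_lists: list[dict[str, float]]) -> list[tuple[str, float]]:
    """Merge several scored result lists into one ranked list.

    Each input maps a document id to a score in [0, 1] produced by one
    heuristic. A document's combined belief is 1 - prod(1 - score), so every
    additional piece of evidence can only raise it.
    """
    disbelief: dict[str, float] = {}
    for results in result_lists:
        for doc, score in results.items():
            disbelief[doc] = disbelief.get(doc, 1.0) * (1.0 - score)
    combined = {doc: 1.0 - value for doc, value in disbelief.items()}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


ranked = combine_noisy_or(
    [
        {"d1": 0.5, "d2": 0.2},  # e.g. results of the plain full-text query
        {"d1": 0.5, "d3": 0.9},  # e.g. results of an ontology-expanded query
    ]
)
```

Document d1 is found by both queries and its belief rises from 0.5 to 0.75, yet d3 still ranks first because one heuristic is highly confident about it.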

 Strengths :
 This work validates that the proposed solution significantly improves the retrieval effectiveness of information systems, and thus provides a strong motivation for developing ontologies and semantic metadata.
 The gradual approach described allows a smooth transition from classical text-based systems to ontology-based ones.

 Weaknesses :
 A problem with the temporal vector space approach is the potentially huge number of time granules that are generated for long time intervals. For example, representing the existence time of a concept such as the Middle Ages may require many tens of thousands of terms if days are used as granules.
