Semantic Interpretation of User Query for Question Answering on Interlinked Data
1. +
Semantic Interpretation of User Queries for
Question Answering on Interlinked Data
Saeedeh Shekarpour
Supervisor: Prof. Dr. Sören Auer
EIS research group - Bonn University, 7 January 2015
2. +
Search engines can answer queries
which match certain templates
3. +
Search engines still lack the ability to
answer more complex queries
4. +
Evolution of the Web
• Web of Documents
• Semantic Web
• Web of Data
5. +
RDF model
RDF is a standard for describing Web resources.
The RDF data model expresses statements about Web resources in
the form of subject-predicate-object triples.
The statement “Jack knows Alice” is represented as:
Jack (subject) -- knows (predicate) --> Alice (object)
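As a minimal sketch of the triple model, the statement can be held as a (subject, predicate, object) tuple and serialized as one N-Triples line. The `ex:` prefix, the `http://example.org/` base IRI and the helper name `to_ntriples` are illustrative assumptions, not part of the slides.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# "Jack knows Alice" as a triple; the ex: names are hypothetical.
statement = Triple("ex:Jack", "ex:knows", "ex:Alice")


def to_ntriples(triple, base="http://example.org/"):
    """Serialize a triple of prefixed names as one N-Triples line."""
    def iri(term):
        # Drop the prefix and wrap the local name in a full IRI.
        return f"<{base}{term.split(':', 1)[1]}>"
    return f"{iri(triple.subject)} {iri(triple.predicate)} {iri(triple.obj)} ."


print(to_ntriples(statement))
```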
6. +
The growth of Linked Open Data
May 2007: 12 datasets
August 2014: 570 datasets, more than 74 billion triples
7. +
How to retrieve data from Linked Data?
Linked Data characteristics:
• Wide range of topical domains
• Variety in vocabularies
• Interlinked data
SPARQL queries:
• Knowledge about the ontology
• Proficiency in formulating formal queries
• Explicit and unambiguous semantics
Text queries (either keyword or natural language):
• Simple retrieval approach
• Implicit and ambiguous semantics
• Popular
8. +
Comparison of search approaches
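Figure: search approaches positioned by data-semantics awareness (unaware to aware) and by query type (keyword-based query to natural language query).
• Information Retrieval Systems: data-semantics unaware, keyword-based queries
• Question Answering Systems: data-semantics aware, natural language queries
• Our approach: SINA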
9. +
Objective: transformation from
textual query to formal query
Which television shows were created by Walt Disney?
SELECT * WHERE
{ ?v0 a dbo:TelevisionShow .
  ?v0 dbo:creator dbr:Walt_Disney . }
10. +
Test bed datasets
One single dataset: DBpedia.
Three interlinked datasets from life science:
1. Drugbank: contains information about drugs, drug targets (i.e. proteins), interactions and enzymes.
2. Diseasome: contains information about diseases and genes associated with these diseases.
3. Sider: contains information about drugs and their side effects.
12. +
Query Segmentation
Definition: query segmentation is the process of identifying the right
segments of data items that occur in the keyword queries.
Input query: What are the side effects of drugs used for Tuberculosis?
Sequence of keywords: (side, effect, drug, Tuberculosis)
Two segmentations:
• side effect | drug | Tuberculosis
• side effect drug | Tuberculosis
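To illustrate the search space behind segmentation, the following sketch enumerates every way of cutting a keyword sequence into contiguous segments. It is only an illustration: the function name `segmentations` is an assumption, and a real system would still have to validate each candidate segment against the knowledge base, which is not shown here.

```python
from itertools import combinations


def segmentations(keywords):
    """Return all splits of the keyword list into contiguous segments."""
    n = len(keywords)
    results = []
    for k in range(n):
        # Choose k cut positions between adjacent keywords.
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            results.append(
                [" ".join(keywords[a:b]) for a, b in zip(bounds, bounds[1:])]
            )
    return results


candidates = segmentations(["side", "effect", "drug", "Tuberculosis"])
for candidate in candidates:
    print(" | ".join(candidate))
```

A sequence of n keywords has 2^(n-1) such segmentations, so the four keywords above yield eight candidates, including the two shown on the slide.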
13. +
Resource Disambiguation
Definition: resource disambiguation is the process of recognizing the
suitable resources in the underlying knowledge base.
Input query: What are the side effects of drugs used for Tuberculosis?
Ambiguous resources:
• diseasome:Tuberculosis
• sider:Tuberculosis
Input query: Who produced films starring Natalie Portman?
Ambiguous resources:
• dbpedia/ontology/film
• dbpedia/property/film
16. +
Bootstrapping the model parameters
1. Emission probability is defined based on the similarity of the label of each
state with a segment. This similarity is computed from string similarity
and Jaccard similarity.
2. Semantic relatedness is the basis for the transition probability and the initial
probability. Intuitively, it is based on two values: distance and connectivity
degree. We transform these two values into hub and authority values using the
weighted HITS algorithm.
3. HITS is a link analysis algorithm that was originally developed for
ranking Web pages. It assigns a hub and an authority value to each Web page.
4. Initial probability and transition probability are defined as a uniform
distribution over the hub and authority values.
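The hub and authority computation can be sketched as a plain power iteration over a weighted, directed graph. This is a generic weighted HITS, not necessarily the exact variant used in this work: the edge weights, the node names and the function name `hits` are assumptions made for illustration.

```python
import math


def hits(edges, iterations=50):
    """Weighted HITS. edges maps (source, target) to a positive weight."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: weighted sum of the hub scores of incoming neighbours.
        auth = {n: sum(w * hub[s] for (s, t), w in edges.items() if t == n)
                for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: weighted sum of the authority scores of outgoing neighbours.
        hub = {n: sum(w * auth[t] for (s, t), w in edges.items() if s == n)
               for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical graph: a and b both point to c, a also points to d.
EDGES = {("a", "c"): 1.0, ("b", "c"): 1.0, ("a", "d"): 1.0}
hub, auth = hits(EDGES)
print(hub, auth)
```

In this toy graph, node c collects the highest authority value because two nodes point to it, and node a is the strongest hub because it points to both authorities.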
Query Segmentation & Resource Disambiguation
17. +
Evaluation of bootstrapping
The accuracy of the bootstrapped transition probability using different
distribution functions, i.e., Normal, Zipfian and uniform distributions.
18. +
Output of the model after running the Viterbi algorithm
Sequence of keywords: (television show creat Walt Disney)
Paths:
0.0023 dbo:TelevisionShow dbo:creator dbr:Walt_Disney
0.0014 dbo:TelevisionShow dbo:creator dbr:Category:Walt_Disney
0.000589 dbr:TelevisionShow dbo:creator dbr:Walt_Disney
0.000353 dbr:TelevisionShow dbo:creator dbr:Category:Walt_Disney
0.0000376 dbp:television dbp:show dbo:creator dbr:Category:Walt_Disney
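How a most likely path of resources is obtained can be sketched with a textbook Viterbi decoder over a tiny hidden Markov model. The states reuse resource names from the slide, but all probabilities below are made-up illustration values, not the bootstrapped parameters of the model, and the decoder returns only the single best path rather than a ranked list.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0), p) for p in states
            )
            row[s] = (prob * emit_p[s].get(obs, 0.0), prev)
        best.append(row)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for row in reversed(best[1:]):
        path.append(row[path[-1]][1])
    path.reverse()
    return best[-1][last][0], path


# Hypothetical parameters for the segments of the example query.
SEGMENTS = ["television show", "creat", "Walt Disney"]
STATES = ["dbo:TelevisionShow", "dbr:TelevisionShow", "dbo:creator",
          "dbr:Walt_Disney", "dbr:Category:Walt_Disney"]
START = {"dbo:TelevisionShow": 0.4, "dbr:TelevisionShow": 0.2,
         "dbo:creator": 0.2, "dbr:Walt_Disney": 0.1,
         "dbr:Category:Walt_Disney": 0.1}
TRANS = {"dbo:TelevisionShow": {"dbo:creator": 0.6},
         "dbr:TelevisionShow": {"dbo:creator": 0.3},
         "dbo:creator": {"dbr:Walt_Disney": 0.6,
                         "dbr:Category:Walt_Disney": 0.4},
         "dbr:Walt_Disney": {}, "dbr:Category:Walt_Disney": {}}
EMIT = {"dbo:TelevisionShow": {"television show": 0.9},
        "dbr:TelevisionShow": {"television show": 0.8},
        "dbo:creator": {"creat": 0.7},
        "dbr:Walt_Disney": {"Walt Disney": 0.9},
        "dbr:Category:Walt_Disney": {"Walt Disney": 0.6}}

probability, path = viterbi(SEGMENTS, STATES, START, TRANS, EMIT)
print(probability, path)
```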
19. +
Query Expansion
Definition: query expansion is a way of reformulating the input query
in order to overcome the vocabulary mismatch problem.
Input query: Wife of Barack Obama
Reformulated query: Spouse of Barack Obama
21. +
Linguistic features
WordNet is a popular data source for expansion.
Linguistic features extracted from WordNet are:
1. Synonyms: words having a similar meaning to the input keyword.
2. Hyponyms: words representing a specialization of the input keyword.
3. Hypernyms: words representing a generalization of the input keyword.
22. +
Semantic features from Linked Data
1. SameAs: deriving resources using owl:sameAs.
2. SeeAlso: deriving resources using rdfs:seeAlso.
3. Equivalence class/property: deriving classes or properties using
owl:equivalentClass and owl:equivalentProperty.
4. Super class/property: deriving all super classes/properties by following the
rdfs:subClassOf or rdfs:subPropertyOf property.
5. Sub class/property: deriving resources by following the rdfs:subClassOf or
rdfs:subPropertyOf property paths ending with the input resource.
6. Broader concepts: deriving concepts using the SKOS vocabulary properties skos:broader and
skos:broadMatch.
7. Narrower concepts: deriving concepts using skos:narrower and skos:narrowMatch.
8. Related concepts: deriving concepts using skos:closeMatch, skos:mappingRelation
and skos:exactMatch.
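As an illustration of the super class feature (item 4), the sketch below follows rdfs:subClassOf links transitively over an in-memory list of triples. The `ex:` resources and the function name `super_classes` are hypothetical; against a real endpoint the same traversal would typically be expressed as a query instead.

```python
def super_classes(triples, resource, predicate="rdfs:subClassOf"):
    """Collect every class reachable from resource via the given predicate."""
    found = set()
    frontier = [resource]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == predicate and o not in found:
                found.add(o)
                frontier.append(o)
    return found


# Hypothetical toy class hierarchy.
TRIPLES = [
    ("ex:TelevisionShow", "rdfs:subClassOf", "ex:Work"),
    ("ex:Work", "rdfs:subClassOf", "ex:Thing"),
    ("ex:Film", "rdfs:subClassOf", "ex:Work"),
]

print(super_classes(TRIPLES, "ex:TelevisionShow"))
```

The sub class feature (item 5) is the same traversal in the opposite direction, matching on the object and collecting subjects.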
23. +
Exemplary expansion graph of the word movie
Figure: expansion graph linking "movie" to derived words: home movie, production, film, motion picture show, video, telefilm.
24. +
Objective of experiment
How effectively do linguistic as well as semantic features perform?
How well does a linear weighted combination of features perform?
25. +
Benchmark creation
We created a benchmark extracted from QALD1 and QALD2.
The benchmark contains all keywords that have a vocabulary mismatch
problem, together with their corresponding matches.
26. +
Accuracy results of prediction function based on linguistic as well as semantic features
Features    Weighting mechanism             Precision  Recall  F-score
Linguistic  SVM                             0.730      0.650   0.620
Semantic    SVM                             0.680      0.630   0.600
Linguistic  Decision Tree/Information Gain  0.588      0.579   0.568
Semantic    Decision Tree/Information Gain  0.755      0.684   0.661
27. +
Statistics over the number of the derived words and matches
Feature               #derived words  #matches
synonym               503             23
hyponym               2703            10
hypernym              657             14
sameAs                2332            12
seeAlso               49              2
equivalence           2               0
super class/property  267             4
sub class/property    2166            4
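The gap between derived words and matches can be made explicit by computing, per feature, the share of derived words that are actual matches. The counts below are copied from the table; the names `STATS`, `match_rate` and `ranked` are illustrative.

```python
# (derived words, matches) per feature, as listed in the table.
STATS = {
    "synonym": (503, 23),
    "hyponym": (2703, 10),
    "hypernym": (657, 14),
    "sameAs": (2332, 12),
    "seeAlso": (49, 2),
    "equivalence": (2, 0),
    "super class/property": (267, 4),
    "sub class/property": (2166, 4),
}


def match_rate(derived, matches):
    """Fraction of derived words that are matches (0.0 if nothing was derived)."""
    return matches / derived if derived else 0.0


rates = {feature: match_rate(*counts) for feature, counts in STATS.items()}
ranked = sorted(rates, key=rates.get, reverse=True)

total_derived = sum(derived for derived, _ in STATS.values())
total_matches = sum(matches for _, matches in STATS.values())

for feature in ranked:
    print(f"{feature:22s} {rates[feature]:.4f}")
print("overall", total_matches / total_derived)
```

Every feature stays below a 5% match rate, which is why expansion without further filtering risks producing mostly irrelevant words.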
28. +
Automatic query expansion
Input query
External Data source
Data extraction and
preparation
Heuristic method
Reformulated query
29. +
Expansion set for each segment
The expansion set contains:
• Original segment
• Lemmatized segment
• Derived words from WordNet: synonyms, hyponyms, hypernyms
30. +
Reformulating query using hidden Markov model
Input query: wife of Barack Obama
Figure: hidden Markov model reached from a Start state, whose states are the words of the expansion sets, e.g. Barack, Obama, Barack Obama, wife, spouse, first lady, woman.
31. +
Triple-based co-occurrence
In a given triple t = (s, p, o), two words w1 and w2 are co-occurring
if they appear in the labels (rdfs:label) of at least two resources.
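One possible reading of this definition, sketched below, is that w1 and w2 co-occur in a triple when they appear in the labels of two different resources of that triple, for example w1 in the subject label and w2 in the predicate label. This reading, the lower-cased whitespace tokenization and the function name `co_occur` are assumptions.

```python
def co_occur(triple_labels, w1, w2):
    """triple_labels: the rdfs:label strings of (s, p, o) of one triple."""
    token_sets = [set(label.lower().split()) for label in triple_labels]
    # w1 and w2 must be found in the labels of two different resources.
    return any(
        w1 in first and w2 in second
        for i, first in enumerate(token_sets)
        for j, second in enumerate(token_sets)
        if i != j
    )


# Hypothetical labels of one triple.
LABELS = ("Barack Obama", "spouse", "Michelle Obama")
print(co_occur(LABELS, "barack", "spouse"))
print(co_occur(LABELS, "barack", "wife"))
```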
32. +
Goals of evaluation
How effective is our method with regard to a correct reformulation of queries that
have a vocabulary mismatch problem?
How robust is the method for queries that do not have a vocabulary mismatch
problem?
Sample of our benchmark
Query                   Mismatch word  Match word  #derived words
Movies with Tom Cruise  movie          film        77
Altitude of Everest     altitude       elevation   16
Soccer clubs in Spain   -              -           19
Employees of Google     -              -           10
33. +
Rank (R) and Cumulative rank
(CR) for the test queries
34. +
Formal Query Construction
Definition: Once the resources are detected, a connected subgraph
of the knowledge base graph, called the query graph, has to be
determined which fully covers the set of mapped resources.
Disambiguated resources:
• sider:sideEffect
• diseasome:possibleDrug
• diseasome:1154
SPARQL query:
SELECT ?v3 WHERE {
  diseasome:1154 diseasome:possibleDrug ?v1 .
  ?v1 owl:sameAs ?v2 .
  ?v2 sider:sideEffect ?v3 . }
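Once a query graph is known, turning it into a conjunctive SPARQL query is mostly string assembly. The sketch below serializes a list of triple patterns into a SELECT query; the helper name `build_select` is an assumption, and prefix declarations are left out, as on the slide.

```python
def build_select(patterns, target):
    """Serialize triple patterns into a SELECT query projecting one variable."""
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in patterns)
    return f"SELECT {target} WHERE {{\n{body}\n}}"


# Triple patterns of the query graph for the Tuberculosis example.
PATTERNS = [
    ("diseasome:1154", "diseasome:possibleDrug", "?v1"),
    ("?v1", "owl:sameAs", "?v2"),
    ("?v2", "sider:sideEffect", "?v3"),
]

query = build_select(PATTERNS, "?v3")
print(query)
```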
35. +
Data Fusion on Linked Data
The answer to a question may be spread among different datasets
employing heterogeneous schemas.
Constructing a federated query needs to exploit links between
the different datasets on the schema and instance levels.
36. +
Two different approaches
Template-based query construction
Forward chaining based query construction
Federated Query
Construction
37. +
Forward chaining based query construction
Query: What are the side effects of drugs used for Tuberculosis?
1. Set of resources:
• diseasome:1154 (type instance)
• diseasome:possibleDrug (type property)
• sider:sideEffect (type property)
2. Incomplete query graph:
• Graph 1 (Template 1): 1154 -- possibleDrug --> ?v0
• Graph 2 (Template 2): ?v1 -- sideEffect --> ?v2
3. Query graph: a single graph combining 1154 -- possibleDrug --> ?v0 and ?v1 -- sideEffect --> ?v2
38. +
Evaluation
Goals of the experiment:
1. Performance of the disambiguation method, measured using Mean Reciprocal Rank (MRR).
2. Performance of the forward chaining query construction method, measured using precision
and recall.
Benchmarks:
1. 25 queries on the 3 interlinked datasets from life science.
2. QALD1 and QALD3 benchmarks over DBpedia.
3. QALD2 was used for bootstrapping.
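For reference, the evaluation measures named above can be computed as in the following sketch. The input conventions (a 1-based rank of the first correct result per query, or None when no correct result is returned) and the function names are assumptions; the example values are made up and are not results from this work.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first correct result per query, or None."""
    return sum(1.0 / rank for rank in ranks if rank) / len(ranks)


def precision_recall(returned, expected):
    """Precision and recall of a returned answer set against the gold answers."""
    returned, expected = set(returned), set(expected)
    hits = len(returned & expected)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Made-up example: four queries, one of them without a correct result.
print(mean_reciprocal_rank([1, 2, None, 4]))
print(precision_recall(["a", "b", "c"], ["a", "b", "d", "e"]))
```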
39. +
Runtime
Parallelization over three components:
1. Segment validation
2. Resource retrieval
3. Query construction
42. +
Conclusion
We researched and addressed a number of challenges.
The results of the evaluation confirm the feasibility and high
accuracy of the approach, for instance:
1. Query segmentation and resource disambiguation achieved an
MRR from 86% to 96%.
2. Query construction achieved a precision of 32% on the DBpedia QALD3 benchmark
and 95% on the life-science datasets.
We learnt that:
1. In Linked Data, the structure as well as the topology of data can be leveraged for
inference and heuristics.
2. Using structure as well as topology, without any deep text analysis, Linked Data
can enhance the power of question answering.
43. +
Future work
This was the first step in a long agenda.
We plan to:
1. use supervised learning to enhance the parameters of the model.
2. extend our benchmark to carry out further evaluations.
3. employ a larger number of interlinked datasets to identify the challenges
of scalability.
4. extend different aspects of each approach.
We are going to target new challenges, e.g. query cleaning.
We will continue this work with students joining us through the Marie Curie
ITN network.
Google is the most widely used search engine.
Recently, Google has extended its functionality so as to provide direct answers to queries which match certain templates.
These limitations are due to the inherently unstructured nature of information in the Web of Documents.
We cannot consolidate information from different resources.
We cannot reason on data.
The Semantic Web initiative responds to these challenges by introducing standards such as RDF,
RDF-Schema and OWL for publishing information in machine-readable formats.
As a result of the Semantic Web vision and, more importantly, the publishing of large amounts of structured data
on the Web, the concept of the Web of Data emerged.
The Web of Data refers to the set of knowledge bases published according to the Linked Data principles.
Since its creation in 2007, the Linked Data Web has been growing at an astounding rate.
In 2007 it contained 12 datasets whereas in 2014 570 datasets were published.
We encounter a huge amount of published data that are interlinked and that come from a wide range of topical domains and various vocabularies.
We have SPARQL, which is an RDF query language.
As a formal query language it has explicit and unambiguous semantics, but the end user needs proficiency in formulating formal queries as well as knowledge about the ontology.
To enable common users to access the Data Web, we have to simplify the access by providing search interfaces that resemble the search interfaces commonly used on the
document-based Web.
The most common approach is the textual query (either as a natural language query or as a keyword-based query).
Textual queries are a simple and popular approach for retrieving data, but on the other hand their semantics are implicit and ambiguous.
While various search approaches differ in their details, they can all be positioned on this
spectrum: On one end of the spectrum are simple keyword search systems that rely on traditional
information retrieval approaches. They do not take the semantics
of the data into consideration. The main advantage of such approaches is that they scale
well as they can make use of the results of decades of research carried out in the field of information
retrieval.
On the other end of the spectrum, we find question answering systems, which assume a
natural-language query as input and convert this query into a formal query. These systems
rely on natural-language processing tools such as Part of Speech (POS) tagging and dependency parsers
to detect the relations between the elements of the query. The detected relations are then mapped to
SPARQL constructs.
The basic idea behind our work is to devise a data-semantics-aware keyword search
approach, which stands in the middle of the spectrum.
Our approach, which is
called SINA, aims to achieve maximal flexibility by being able to generate SPARQL queries from both
natural-language queries and keyword queries. Several challenges need to be addressed to devise such an
approach.
We address six main challenges.
Query segmentation is the process of identifying the right segments of data items that occur in the
keyword queries. Regarding example 1, the input query ‘What are the side effects of drugs used for
Tuberculosis?’ is transformed to the 4-keyword tuple (side, effect, drug, Tuberculosis). This tuple can be
segmented into (‘side effect drug’, ‘Tuberculosis’) or (‘side effect’, ‘drug’, ‘Tuberculosis’). Similarly,
the query of example 2 can be segmented to (‘produce’, ‘film star’, ‘Natalie’, ‘Portman’) or (‘produce’,
‘film’, ‘star’, ‘Natalie Portman’). Note that in both cases, the second segmentation is more likely to lead
to a query that contains the results intended by the user.
In addition to detecting the right segments for a given input query, we also have to map each of
these segments to a suitable resource in the underlying knowledge base. This step is dubbed entity
disambiguation and is of increasing importance since the size of knowledge bases and the heterogeneity
of schemas on the Linked Data Web grow steadily. With respect to example 1, the segment
‘Tuberculosis’ is ambiguous when querying both Sider and Diseasome because it may refer
to the resource diseasome:Tuberculosis describing the disease Tuberculosis or to the resource
sider:Tuberculosis being the side effect caused by some drugs. Regarding example 2, the segment
‘film’ is ambiguous because it may refer to the class dbo:Film (the class of all movies in DBpedia)
or to the properties dbo:film or dbp:film (which relate festivals and the films shown during these
festivals). In this step, we aim to map the input keywords to a suitable set of entity identifiers, i.e.
resources $R = \{r_1, r_2, \ldots, r_m\}$. Note that several adjacent keywords can be mapped to a single resource,
i.e. $m \le n$, where $n$ is the number of keywords. Thus, for each segment, a suitable resource has to be determined.
A state represents a knowledge base resource.
The state space contains all resources in the knowledge base.
In practice, we prune the state space by excluding irrelevant states.
We add an unknown entity state comprising all resources that are no longer available in the pruned state space.
Extension of the state space with reasoning: the state space is extended by including resources inferred from lightweight owl:sameAs reasoning.
The hub value estimates the value of a page's links to other pages, and the authority value estimates the value of its content.
In case of a
vocabulary mismatch, schema-aware search systems are unable to retrieve data. For instance, consider
the input query altitude of Everest. The keyword altitude should be matched to the keyword elevation,
because the relevant property resource has the label elevation and not altitude. Therefore, query expansion
can be a crucial step in a question answering or keyword search pipeline. A naive way of automatic
query expansion adds words derived from linguistic resources. In this regard, expansions are synonyms,
hyponyms and hypernyms of the input keywords. In practice, this naive approach fails because of high
retrieval cost and substantially decreasing precision. Regarding Linked Data, a research question arising
here is whether interlinked data and vocabularies provide features which can be taken into account for
query expansion, and how effective those new semantic features are in comparison to traditional linguistic
ones.
WordNet is a popular data source for expansion.
This table shows the accuracy results for two settings. In each setting, either linguistic or semantic features are taken into account.
Interestingly, the setting with only semantic features results in an accuracy at least as high as the setting with only linguistic features.
Generally, when applying expansion methods, there is a risk of yielding a large set of irrelevant words, which can have a negative impact on further processing.
This table shows the statistics over the number of derived words and the number of matches per feature.
You can see that the number of derived words is considerably high in comparison to the number of matches.
The SKOS features yield no derived words, so we exclude them from the later experiments.
In general, when applying expansion methods, there is a high risk
of yielding a large set of irrelevant words, which will have a negative
impact on the runtime and the accuracy of the question answering
system.
An external data source such as WordNet, a lemon lexicon or a query log is employed for query expansion.
Once the resources are detected, adequate formal queries (i.e. SPARQL queries) have to be generated.
In order to generate a formal query (here: a conjunctive query), a connected subgraph $G' = (V', E')$ of the
knowledge base graph $G = (V, E)$, called the query graph, has to be determined. The intuition behind
constructing such a query graph is that it has to fully cover the set of mapped resources $R = \{r_1, \ldots, r_m\}$.
In Linked Data, mapped resources $r_i$ may belong to different graphs $G_i$. Thus, the query construction
algorithm must be able to traverse the links between datasets at both schema and instance levels. With
respect to example 1, after applying disambiguation on the identified resources, we would obtain the
following resources from different datasets: