Semantic Interpretation of User Query for Question Answering on Interlinked Data

+
Semantic Interpretation of User Queries for
Question Answering on Interlinked Data
Saeedeh Shekarpour
Supervisor: Prof. Dr. Sören Auer
EIS research group - Bonn University, 7 January 2015

+
Search engines can answer queries
which match certain templates

+
Search engines still lack the ability to
answer more complex queries

+
Evolution of Web
Web of Documents
Semantic Web
Web of Data

+
RDF model
• RDF is a standard for describing Web resources.
• The RDF data model expresses statements about Web resources in
the form of subject-predicate-object (triple).
• The statement “Jack knows Alice” is represented as:
(subject: Jack, predicate: know, object: Alice)
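The triple above can be sketched in Python as a small record type. This is only an illustration: the `ex:` prefix and the `Triple` class are assumptions made for the example and do not come from the slides.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# "ex:" is a made-up prefix for this example, not a vocabulary from the slides.
statement = Triple("ex:Jack", "ex:knows", "ex:Alice")


def as_statement(triple: Triple) -> str:
    """Render the triple in a Turtle-like one-line form."""
    return f"{triple.subject} {triple.predicate} {triple.obj} ."


print(as_statement(statement))
```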

+
The growth of Linked Open Data
May 2007: 12 datasets
August 2014: 570 datasets, more than 74 billion triples

+
How to retrieve data from Linked Data?
Linked Data characteristics:
• Wide range of topical domains
• Variety in vocabularies
• Interlinked data
SPARQL queries:
• Knowledge about the ontology
• Proficiency in formulating formal queries
• Explicit and unambiguous semantics
Text queries (either keyword or natural language):
• Simple retrieval approach
• Implicit and ambiguous semantics
• Popular

+
Comparison of search approaches
[Figure: search approaches compared along two axes, data-semantic unaware vs. data-semantic aware and keyword-based query vs. natural language query. The figure places Information Retrieval Systems, Question Answering Systems, and our approach, SINA.]

+
Objective: transformation from
textual query to formal query
Which television shows were created by Walt Disney?
SELECT * WHERE
{ ?v0 a dbo:TelevisionShow.
?v0 dbo:creator dbr:Walt_Disney. }

+
Test bed datasets
• One single dataset: DBpedia.
• Three interlinked datasets from life-science:
1. Drugbank: contains information about drugs, drug target (i.e. protein) information, interactions and enzymes.
2. Diseasome: contains information about diseases and genes associated with these diseases.
3. Sider: contains information about drugs and their side effects.

+
The addressed challenges
• Query Segmentation
• Resource Disambiguation
• Query Expansion
• Formal Query Construction
• Data Fusion on Linked Data

+
Query Segmentation
• Definition: query segmentation is the process of identifying the right
segments of data items that occur in the keyword queries.
Input Query: What are the side effects of drugs used for Tuberculosis?
Sequence of keywords: (side, effect, drug, Tuberculosis)
Two segmentations:
1. side effect | drug | Tuberculosis
2. side effect drug | Tuberculosis
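To make the idea of candidate segmentations concrete, here is a minimal Python sketch that enumerates every way of cutting a keyword sequence into contiguous segments. The function name `segmentations` is an assumption for illustration; how valid segments are actually selected is not shown here.

```python
from itertools import combinations


def segmentations(keywords):
    """Return every split of the keyword sequence into contiguous segments."""
    n = len(keywords)
    result = []
    # Choose k cut positions among the n - 1 gaps between keywords.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            result.append(
                [" ".join(keywords[a:b]) for a, b in zip(bounds, bounds[1:])]
            )
    return result


candidates = segmentations(["side", "effect", "drug", "Tuberculosis"])
for candidate in candidates:
    print(" | ".join(candidate))
```

A sequence of n keywords has 2^(n-1) such segmentations, so the four keywords above yield eight candidates, including the two shown on the slide.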

+
Resource Disambiguation
• Definition: resource disambiguation is the process of recognizing the
suitable resources in the underlying knowledge base.
Input query
• What are the side effects of drugs used for Tuberculosis?
Ambiguous Resources
• diseasome:Tuberculosis
• sider:Tuberculosis
Input query
• Who produced films starring Natalie Portman?
Ambiguous Resources
• dbpedia/ontology/film
• dbpedia/property/film

+
Concurrent approach
Query Segmentation
Resource Disambiguation

+
Modeling using hidden
Markov model
[Figure: hidden Markov model with a Start state, hidden states 1 to 9 plus an Unknown Entity state, and the observations Keyword 1 to Keyword 4.]
Query Segmentation & Resource Disambiguation

+
Bootstrapping the model
parameters
1. Emission probability is defined based on the similarity of the label of each
state with a segment. This similarity is computed based on string similarity
and Jaccard similarity.
2. Semantic relatedness is the basis for the transition probability and the initial
probability. Intuitively, it is based on two values: distance and connectivity
degree. We transform these two values into hub and authority values using
a weighted HITS algorithm.
3. The HITS algorithm is a link analysis algorithm that was originally developed for
ranking Web pages. It assigns a hub and an authority value to each Web page.
4. Initial probability and transition probability are defined as a uniform
distribution over the hub and authority values.
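The two building blocks named above, Jaccard similarity and HITS, can be sketched as follows. This is a minimal illustration under stated assumptions: Jaccard is computed over lowercased word tokens, the edge weights are made-up numbers, and the way the slides derive weights from distance and connectivity degree is not reproduced.

```python
import math


def jaccard(label, segment):
    """Jaccard similarity over lowercased word tokens (tokenization is an assumption)."""
    a, b = set(label.lower().split()), set(segment.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def _normalize(scores):
    """Scale scores to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def weighted_hits(edges, iterations=50):
    """edges maps (source, target) to a weight; returns (hub, authority) scores."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: weighted sum of the hub scores of nodes pointing here.
        auth = {node: 0.0 for node in nodes}
        for (source, target), weight in edges.items():
            auth[target] += weight * hub[source]
        auth = _normalize(auth)
        # Hub: weighted sum of the authority scores of nodes pointed to.
        hub = {node: 0.0 for node in nodes}
        for (source, target), weight in edges.items():
            hub[source] += weight * auth[target]
        hub = _normalize(hub)
    return hub, auth


# Hypothetical toy graph; node names and weights are illustrative only.
toy_edges = {("a", "c"): 1.0, ("b", "c"): 1.0, ("a", "b"): 0.5}
hub, auth = weighted_hits(toy_edges)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```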
Query Segmentation & Resource Disambiguation

+
Evaluation of bootstrapping
The accuracy of the bootstrapped transition probability using different
distribution functions, i.e., Normal, Zipfian and uniform distributions.
Query Segmentation & Resource Disambiguation

+
Output of the model after
running the Viterbi algorithm
Sequence of keywords: (television show creat Walt Disney)
Paths:
0.0023     dbo:TelevisionShow dbo:creator dbr:Walt_Disney
0.0014     dbo:TelevisionShow dbo:creator dbr:Category:Walt_Disney
0.000589   dbr:TelevisionShow dbo:creator dbr:Walt_Disney
0.000353   dbr:TelevisionShow dbo:creator dbr:Category:Walt_Disney
0.0000376  dbp:television dbp:show dbo:creator dbr:Category:Walt_Disney
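As a rough illustration of how such a path is found, the sketch below runs a textbook Viterbi search over a toy hidden Markov model. All probabilities are hypothetical and do not reproduce the values above, and the sketch returns only the single most likely path, whereas the slide lists several ranked paths.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, state path) of the most likely state sequence."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p].get(s, 0.0), p) for p in states
            )
            row[s] = (prob * emit_p[s].get(obs, 0.0), prev)
        best.append(row)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for row in reversed(best[1:]):
        path.append(row[path[-1]][1])
    path.reverse()
    return probability, path


# Hypothetical model parameters, chosen only to make the example small.
states = ["dbo:TelevisionShow", "dbr:TelevisionShow", "dbo:creator",
          "dbr:Walt_Disney", "dbr:Category:Walt_Disney"]
start_p = {"dbo:TelevisionShow": 0.4, "dbr:TelevisionShow": 0.2,
           "dbo:creator": 0.2, "dbr:Walt_Disney": 0.1,
           "dbr:Category:Walt_Disney": 0.1}
trans_p = {"dbo:TelevisionShow": {"dbo:creator": 0.8},
           "dbr:TelevisionShow": {"dbo:creator": 0.5},
           "dbo:creator": {"dbr:Walt_Disney": 0.6,
                           "dbr:Category:Walt_Disney": 0.4},
           "dbr:Walt_Disney": {},
           "dbr:Category:Walt_Disney": {}}
emit_p = {"dbo:TelevisionShow": {"television show": 1.0},
          "dbr:TelevisionShow": {"television show": 1.0},
          "dbo:creator": {"creat": 1.0},
          "dbr:Walt_Disney": {"Walt Disney": 1.0},
          "dbr:Category:Walt_Disney": {"Walt Disney": 1.0}}

probability, path = viterbi(["television show", "creat", "Walt Disney"],
                            states, start_p, trans_p, emit_p)
print(probability, path)
```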
Query Segmentation & Resource Disambiguation

+
Query Expansion
Definition: query expansion is a way of reformulating the input query
in order to overcome the vocabulary mismatch problem.
Input query
• Wife of Barack Obama
Reformulated query
• Spouse of Barack Obama

+
Query Expansion
• Analysis of linguistic features vs. semantic features
• A method for automatic query expansion

+
Linguistic features
• WordNet is a popular data source for expansion.
• Linguistic features extracted from WordNet are:
1. Synonyms: words having a similar meaning to the input keyword.
2. Hyponyms: words representing a specialization of the input keyword.
3. Hypernyms: words representing a generalization of the input keyword.
Query Expansion

+
Semantic features from
Linked Data
1. SameAs: deriving resources using owl:sameAs.
2. SeeAlso: deriving resources using rdfs:seeAlso.
3. Equivalence class/property: deriving classes or properties using
owl:equivalentClass and owl:equivalentProperty.
4. Super class/property: deriving all super classes/properties of the input resource by following the
rdfs:subClassOf or rdfs:subPropertyOf property.
5. Sub class/property: deriving resources by following the rdfs:subClassOf or
rdfs:subPropertyOf property paths ending with the input resource.
6. Broader concepts: deriving using the SKOS vocabulary properties skos:broader and
skos:broadMatch.
7. Narrower concepts: deriving concepts using skos:narrower and skos:narrowMatch.
8. Related concepts: deriving concepts using skos:closeMatch, skos:mappingRelation
and skos:exactMatch.
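Several of the features above amount to following one property through the graph, either forwards or backwards. The sketch below shows that idea on a toy list of triples. The `ex:` resources and the `follow` helper are assumptions for illustration, not data or code from the slides.

```python
def follow(triples, start, predicate, inverse=False):
    """Resources reachable from start via one or more `predicate` edges.

    With inverse=True the edges are walked from object to subject.
    """
    found, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for s, p, o in triples:
            if p != predicate:
                continue
            source, target = (o, s) if inverse else (s, o)
            if source == node and target not in found:
                found.add(target)
                frontier.append(target)
    return found


# Hypothetical toy graph using a made-up "ex:" namespace.
toy_triples = [
    ("ex:Telefilm", "rdfs:subClassOf", "ex:Film"),
    ("ex:Film", "rdfs:subClassOf", "ex:Work"),
    ("ex:Film", "owl:sameAs", "ex:Movie"),
]

super_classes = follow(toy_triples, "ex:Telefilm", "rdfs:subClassOf")
sub_classes = follow(toy_triples, "ex:Work", "rdfs:subClassOf", inverse=True)
same_as = follow(toy_triples, "ex:Film", "owl:sameAs")
print(super_classes, sub_classes, same_as)
```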
Query Expansion

+
Exemplary expansion graph of the
word movie
[Figure: expansion graph centered on the word "movie"; the extracted node labels are home movie, production, film, motion picture show, video and telefilm.]
Query Expansion

+
Objective of experiment
• How effectively do linguistic as well as semantic features perform?
• How well does a linear weighted combination of features perform?
Query Expansion

+
Benchmark creation
• We created a benchmark extracted from QALD1 and QALD2.
• The benchmark contains all keywords that have a vocabulary mismatch
problem, together with their corresponding matches.
Query Expansion

+
Accuracy results of prediction
function based on linguistic as
well as semantic features
Features    Weighting Mechanism             Precision  Recall  F-score
Linguistic  SVM                             0.730      0.650   0.620
Semantic    SVM                             0.680      0.630   0.600
Linguistic  Decision Tree/Information Gain  0.588      0.579   0.568
Semantic    Decision Tree/Information Gain  0.755      0.684   0.661
Query Expansion

+
Statistics over the number of the
derived words and matches
Feature               #derived words  #matches
synonym               503             23
hyponym               2703            10
hypernym              657             14
sameAs                2332            12
seeAlso               49              2
equivalence           2               0
super class/property  267             4
sub class/property    2166            4
Query Expansion

+
Automatic query expansion
Input query
External Data source
Data extraction and
preparation
Heuristic method
Reformulated query
Query Expansion

+
Expansion set for each
segment
The expansion set contains:
• Original segment
• Lemmatized segment
• Derived words from WordNet: synonym, hyponym, hypernym
Query Expansion

+
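Reformulating query using
hidden Markov model
Input query: wife of Barack Obama
[Figure: hidden Markov model for the input query. Beginning at a Start state, the states are labeled with segments of the query (such as Barack, Obama, wife, Barack Obama and Barack Obama wife) and with derived words (spouse, first lady, woman).]
Query Expansion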

+
Triple-based co-occurrence
In a given triple t = (s, p, o), two words w1 and w2 are co-occurring
if they appear in the labels (rdfs:label) of at least
two resources.
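The definition above can be read in more than one way. The sketch below assumes one reading: w1 appears in the label of one resource of the triple and w2 appears in the label of a different resource of the same triple. The labels, the example triple and the `cooccur` helper are hypothetical.

```python
from itertools import permutations


def cooccur(w1, w2, triple, labels):
    """True if w1 and w2 appear in the labels of two different resources of the triple."""
    # One set of lowercased label words per position (subject, predicate, object).
    words = [set(labels.get(resource, "").lower().split()) for resource in triple]
    return any(w1 in a and w2 in b for a, b in permutations(words, 2))


# Hypothetical rdfs:label values for the three resources of one triple.
toy_labels = {
    "dbr:Barack_Obama": "Barack Obama",
    "dbo:spouse": "spouse",
    "dbr:Michelle_Obama": "Michelle Obama",
}
toy_triple = ("dbr:Barack_Obama", "dbo:spouse", "dbr:Michelle_Obama")

print(cooccur("obama", "spouse", toy_triple, toy_labels))
print(cooccur("spouse", "film", toy_triple, toy_labels))
```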
Query Expansion

+
Goals of evaluation
• How effective is our method with regard to a correct reformulation of queries which
have a vocabulary mismatch problem?
• How robust is the method for queries which do not have a vocabulary mismatch
problem?
Sample of our benchmark
Query                   Mismatch word  Match word  #derived words
Movies with Tom Cruise  movie          film        77
Altitude of Everest     altitude       elevation   16
Soccer clubs in Spain   -              -           19
Employees of Google     -              -           10
Query Expansion

+
Rank (R) and Cumulative rank
(CR) for the test queries
Query Expansion

+
Formal Query Construction
• Definition: Once the resources are detected, a connected subgraph
of the knowledge base graph, called the query graph, has to be
determined which fully covers the set of mapped resources.
Disambiguated resources:
• sider:sideEffect
• diseasome:possibleDrug
• diseasome:1154
SPARQL query:
SELECT ?v3 WHERE {
diseasome:1154 diseasome:possibleDrug ?v1 .
?v1 owl:sameAs ?v2 .
?v2 sider:sideEffect ?v3 . }
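Once a query graph is fixed, turning it into query text is mechanical. The following minimal sketch renders a list of triple patterns as a SPARQL SELECT string. The `to_sparql` helper is an assumption for illustration, and PREFIX declarations are left out for brevity.

```python
def to_sparql(patterns, select):
    """Render triple patterns as a SPARQL SELECT query string (no PREFIX lines)."""
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in patterns)
    return f"SELECT {select} WHERE {{\n{body}\n}}"


# Query graph for the side-effect question, written as triple patterns.
query_graph = [
    ("diseasome:1154", "diseasome:possibleDrug", "?v1"),
    ("?v1", "owl:sameAs", "?v2"),
    ("?v2", "sider:sideEffect", "?v3"),
]

query = to_sparql(query_graph, "?v3")
print(query)
```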

+
Data Fusion on Linked Data
• The answer to a question may be spread among different datasets
employing heterogeneous schemas.
• Constructing a federated query needs to exploit links between
the different datasets on the schema and instance levels.

+
Two different approaches
Template-based query construction
Forward chaining based query construction
Federated Query Construction

+
Forward chaining based query
construction
Query: What are the side effects of drugs used for Tuberculosis?
1. Set of resources
• diseasome:1154 (type instance)
• diseasome:possibleDrug (type property)
• sider:sideEffect (type property)
2. Incomplete query graph
• Graph 1: 1154 possibleDrug ?v0
• Graph 2: ?v1 sideEffect ?v2
• Template 1: 1154 possibleDrug ?v0
• Template 2: ?v1 sideEffect ?v2
3. Query graph
• 1154 possibleDrug ?v0 and ?v1 sideEffect ?v2, combined into one graph
Federated Query Construction

+
Evaluation
• Goals of the experiment:
1. Performance of the disambiguation method using Mean Reciprocal Rank (MRR).
2. Performance of the forward chaining query construction method using precision
and recall.
• Benchmarks:
1. 25 queries on the 3 interlinked datasets from life-science.
2. QALD1 and QALD3 benchmarks over DBpedia.
3. QALD2 was used for bootstrapping.
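For reference, the three measures named above can be computed as in the sketch below. The sample ranks and answer sets are made up for the example and are not results from the evaluation.

```python
def mean_reciprocal_rank(ranks):
    """ranks holds the 1-based rank of the first correct result per query, or None."""
    return sum(1.0 / rank for rank in ranks if rank) / len(ranks)


def precision_recall(returned, expected):
    """Precision and recall of a returned answer set against the expected answers."""
    returned, expected = set(returned), set(expected)
    hits = len(returned & expected)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical values: four queries, one of which found no correct result.
mrr = mean_reciprocal_rank([1, 2, None, 4])
precision, recall = precision_recall(["a", "b", "c"], ["a", "b", "d", "e"])
print(mrr, precision, recall)
```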
Federated Query Construction

+
Runtime
• Parallelization over three components:
1. Segment validation
2. Resource retrieval
3. Query construction
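The slide only names the parallelized components, so the following is a generic sketch of the idea for the first one, segment validation, using a thread pool. `KNOWN_LABELS` and `is_valid_segment` are hypothetical stand-ins for a lookup against the knowledge base.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an index of resource labels.
KNOWN_LABELS = {"side effect", "drug", "tuberculosis"}


def is_valid_segment(segment):
    """A segment counts as valid here if it matches a known label."""
    return segment.lower() in KNOWN_LABELS


def validate_parallel(segments, workers=4):
    """Check all segments concurrently and keep the valid ones in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(is_valid_segment, segments))
    return [segment for segment, ok in zip(segments, flags) if ok]


valid = validate_parallel(["side effect", "effect drug", "drug", "Tuberculosis"])
print(valid)
```

Threads suit this kind of step when each check is dominated by waiting on a remote lookup; for CPU-bound checks a process pool would be the more usual choice.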

+
SINA architecture
[Figure: SINA architecture. A Client sends a query to the Server and receives the result. The Server components are QueryPreprocessing, SegmentValidation, QueryExpansion, ResourceRetrieval, Disambiguation, QueryConstruction and Representation. The data passed between them is labeled keywords, valid segments, reformulated query, mapped resources, tuple of resources and SPARQL queries. The figure also names OWL API, http client, Stanford CoreNLP and the Underlying Interlinked Knowledge Bases.]

+
Demo

+
Conclusion
• We researched and addressed a number of challenges.
• The result of the evaluation confirms the feasibility and high
accuracy, for instance:
1. Query segmentation and resource disambiguation achieved an
MRR from 86% to 96%.
2. Query construction achieved a precision of 32% on the DBpedia QALD3 benchmark
and 95% on the life-science datasets.
• We learnt that:
1. In Linked Data, structure as well as topology of data can be leveraged for any
inference and heuristic.
2. Using structure as well as topology without any deep text analysis, Linked Data
can enhance the power of question answering.

+
Future work
• This was the first step in a long agenda.
• We plan to:
1. use supervised learning to enhance the parameters of the model.
2. extend our benchmark to make further evaluations.
3. employ a larger number of interlinked datasets to figure out the challenges
of scalability.
4. extend different aspects of each approach.
• We are going to target new challenges, e.g. query cleaning.
• We will continue this work with students joining us for the Marie Curie
ITN network.

+
Questions?
