The demand to access large amounts of heterogeneous structured data is emerging as a trend for many users and applications. However, the effort involved in querying heterogeneous and distributed third-party databases can create major barriers for data consumers. At the core of this problem is the semantic gap between the way users express their information needs and the representation of the data. This work aims to provide a natural language interface and an associated semantic index to support an increased level of vocabulary independence for queries over Linked Data/Semantic Web datasets, using a distributional-compositional semantics approach. Distributional semantics focuses on the automatic construction of a semantic model based on the statistical distribution of co-occurring words in large-scale texts. The proposed query model targets the following features: (i) a principled semantic approximation approach with low adaptation effort (independent from manually created resources such as ontologies, thesauri or dictionaries), (ii) comprehensive semantic matching supported by the inclusion of large volumes of distributional (unstructured) commonsense knowledge into the semantic approximation process and (iii) expressive natural language queries. The approach is evaluated using natural language queries on an open domain dataset and achieves an average recall of 0.81, a mean average precision of 0.62 and a mean reciprocal rank of 0.49.
1. Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach
André Freitas and Edward Curry
Insight Centre for Data Analytics
International Conference on Intelligent User Interfaces
Haifa, 2014
4. Shift in the Database Landscape
Heterogeneous, complex and large-scale databases.
Very large and dynamic “schemas”.
circa 2000: 10s-100s attributes
circa 2014: 1,000s-1,000,000s attributes
5. Databases for a Complex World
How do you query data in this scenario?
6. Vocabulary Problem for Databases
Query: Who is the daughter of Bill Clinton married to?
Semantic Gap
Possible representations
Semantic approximation
= Commonsense Knowledge
7. Semantics for a Complex World
Formal World
Real World
Distributional Semantics
Query Approach
16. Distributional Semantics
“Words occurring in similar (linguistic) contexts are
semantically related.”
If we can equate meaning with context, we can simply
record the contexts in which a word occurs in a
collection of texts (a corpus).
This record of contexts can then be used as a surrogate for the
word's semantic representation.
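The idea of recording contexts can be sketched in a few lines. The example below is a minimal illustration, not the model used in this work: it assumes a toy three-sentence corpus, a fixed context window of two words, raw co-occurrence counts and cosine similarity as the relatedness measure. The function names `context_vectors` and `cosine` are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """For each word, count the words occurring within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus (assumption): real distributional models use large-scale text.
corpus = [
    "the daughter of the president married a banker",
    "the child of the president married a lawyer",
    "the university awarded a degree in law",
]
vecs = context_vectors(corpus)

related = cosine(vecs["daughter"], vecs["child"])
unrelated = cosine(vecs["daughter"], vecs["university"])
print(related > unrelated)
```

In this toy corpus "daughter" and "child" appear in identical contexts, so their vectors are closer to each other than either is to "university".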
24. Search and Composition Operations
Instance search
- Proper nouns
- String similarity + node cardinality
Class (unary predicate) search
- Nouns, adjectives and adverbs
- String similarity + Distributional semantic relatedness
Property (binary predicate) search
- Nouns, adjectives, verbs and adverbs
- Distributional semantic relatedness
Navigation
Extensional expansion
- Expands the instances associated with a class.
Operator application
- Aggregations, conditionals, ordering, position
Disjunction & Conjunction
Disambiguation dialog (instance, predicate)
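As an illustration of the instance search operation listed above, the following sketch ranks candidate instance labels by string similarity and uses node cardinality as a secondary signal. The slides do not say how the two signals are combined, so the weighted sum, the `weight` parameter, the log scaling and the candidate counts are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from math import log


def instance_search(term, cardinality, weight=0.8):
    """Rank instance labels for a query term.

    `cardinality` maps an instance label to the number of triples attached
    to that node. The weighted sum below is an assumed combination.
    """
    max_log = max(log(1 + c) for c in cardinality.values()) or 1.0
    scored = []
    for label, count in cardinality.items():
        readable = label.lower().replace("_", " ")
        sim = SequenceMatcher(None, term.lower(), readable).ratio()
        popularity = log(1 + count) / max_log
        scored.append((weight * sim + (1 - weight) * popularity, label))
    return [label for _, label in sorted(scored, reverse=True)]


# Hypothetical candidates and triple counts, for illustration only.
candidates = {
    "Bill_Clinton": 900,
    "Bill_Clinton_(baseball)": 12,
    "Hillary_Clinton": 800,
}
ranking = instance_search("Bill Clinton", candidates)
print(ranking[0])
```

Node cardinality acts here as a popularity prior: when two labels match the query term about equally well, the node with more attached triples is preferred.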
25. Core Principles
Minimize the impact of Ambiguity, Vagueness, Synonymy.
Address the simplest matchings first (heuristics).
Semantic Relatedness as a primitive operation.
Distributional semantics as commonsense knowledge.
26. Question Analysis
Transform natural language queries into triple patterns.
“Who is the daughter of Bill Clinton married to?”
PODS: Bill Clinton, daughter, married to
Query Features: (INSTANCE), (PREDICATE), (PREDICATE)
27. Query Plan
Map query features into a query plan.
A query plan contains a sequence of core operations.
(INSTANCE)
(PREDICATE)
(PREDICATE)
(1) INSTANCE SEARCH (Bill Clinton)
(2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)
(3) e1 <- NAVIGATE (Bill Clinton, p1)
(4) p2 <- SEARCH PREDICATE (e1, married to)
(5) e2 <- NAVIGATE (e1, p2)
Query Features
Query Plan
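The five-step plan above can be traced on a toy graph. This is a sketch under stated assumptions rather than the system's implementation: the graph is a small hand-written dictionary, instance search is reduced to an exact label lookup, and the relatedness scores in `TOY_SEM_REL` are invented stand-ins for a distributional model. All function names are hypothetical.

```python
# Toy graph (assumption): subject -> {property: object}.
GRAPH = {
    "Bill_Clinton": {
        "child": "Chelsea_Clinton",
        "religion": "Baptists",
        "almaMater": "Yale_Law_School",
    },
    "Chelsea_Clinton": {
        "spouse": "Marc_Mezvinsky",
        "almaMater": "Stanford_University",
    },
}

# Invented relatedness scores standing in for the distributional model.
TOY_SEM_REL = {
    ("daughter", "child"): 0.9,
    ("daughter", "religion"): 0.1,
    ("married to", "spouse"): 0.9,
    ("married to", "almaMater"): 0.1,
}


def toy_sem_rel(term, prop):
    return TOY_SEM_REL.get((term, prop), 0.0)


def instance_search(term):
    """Simplified instance search: exact match on the node label."""
    key = term.replace(" ", "_")
    return key if key in GRAPH else None


def search_predicate(entity, term, sem_rel):
    """Pick the property of `entity` most related to the query term."""
    props = GRAPH.get(entity, {})
    return max(props, key=lambda p: sem_rel(term, p), default=None)


def navigate(entity, prop):
    """Follow `prop` from `entity` to the next node."""
    return GRAPH[entity][prop]


def run_plan(instance_term, predicate_terms, sem_rel):
    entity = instance_search(instance_term)           # step (1)
    for term in predicate_terms:
        prop = search_predicate(entity, term, sem_rel)  # steps (2), (4)
        entity = navigate(entity, prop)                 # steps (3), (5)
    return entity


answer = run_plan("Bill Clinton", ["daughter", "married to"], toy_sem_rel)
print(answer)
```

Each predicate search only considers the properties attached to the current entity, which is why the plan alternates search and navigation instead of matching the whole query at once.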
30. Predicate Search
Query:
Bill Clinton
daughter
married to
Which properties are semantically related to “daughter”?
Linked Data:
:Bill_Clinton :child :Chelsea_Clinton, sem_rel(daughter, child)=0.054
:Bill_Clinton :religion :Baptists, sem_rel(daughter, religion)=0.004
:Bill_Clinton :almaMater :Yale_Law_School, sem_rel(daughter, alma mater)=0.001
...
35. Conclusions
The compositional-distributional model supports a schema-agnostic natural language query mechanism over a large
schema (open domain) database.
Comprehensive and accurate semantic matching
- Avg. recall=0.81, MAP=0.62, MRR=0.49
Medium-high expressivity
- 80% of queries answered
Interactive query execution time
- Avg. 1.52 s per query (simple queries) to 8.53 s per query (all queries)
Better recall and query coverage compared to baselines with
equivalent precision
Low adaptation effort for new datasets
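For readers unfamiliar with the reported measures, the sketch below gives the standard per-query definitions of recall, average precision and reciprocal rank; MAP and MRR are the means of the last two over all queries. Whether the evaluation used exactly these formulations is an assumption, and the ranked list and relevant set are made-up examples.

```python
def recall(ranked, relevant):
    """Fraction of relevant answers that appear in the ranked list."""
    return len(set(ranked) & relevant) / len(relevant) if relevant else 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k holding a relevant answer."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant answer, or 0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Made-up example: four returned answers, three relevant ones.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d", "e"}

print(recall(ranked, relevant))
print(average_precision(ranked, relevant))
print(reciprocal_rank(ranked, relevant))
```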