An increasing number of research and industrial initiatives have focused on publishing Linked Open Data, but little attention has been paid to helping consumers better understand existing data sets. In this paper we discuss how an ontology-driven data abstraction model supports the extraction and representation of summaries of linked data sets. The proposed summarization model is the backbone of the ABSTAT framework, which aims at helping users understand big and complex linked data sets. Our framework is evaluated by showing that it is capable of unveiling information that is not explicitly represented in underspecified ontologies and that is valuable to users, e.g., helping them in the formulation of SPARQL queries.
ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization
1. ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization
Blerina Spahiu, Riccardo Porrini, Matteo Palmonari, Anisa Rula, Andrea Maurino
University of Milano-Bicocca (surname@disco.unimib.it)
blerina.spahiu@disco.unimib.it
2. Outline
Motivation
Dataset Understanding
State of the Art
Summarization Framework
Abstract Knowledge Patterns (AKPs)
Pattern Minimalization
Summary extraction, storage and presentation
Evaluation
Compactness
Informativeness
User Study
Conclusion and Future Work
3. Introduction
What types of resources are there in a data set?
How are they described?
What types of resources are linked by a certain property and how frequently?
4. Motivation
Understanding the content of data sets is challenging
Looking at the ontology is not enough:
Ontologies may be large and underspecified
• DBpedia 2015-04: 2795 properties, domain not specified for 259 properties, range not specified for 187 properties
• No information about the usage
Explorative queries are too expensive
Significant server overload
High response time/timeout
5. State of the Art
Relevance Based Summarization
Identify subsets of data sets or ontologies that are considered to be more relevant
Troullinou et al. 2015; Zhang et al. 2007
Pattern Based Approaches
Aim at extracting knowledge patterns for a complete representation of the data set
Mihindukulasooriya et al. 2015; Presutti et al. 2011; M. Jarrar and M. Dikaiakos, 2012
Schema Induction
Induce a schema from the data and aim at extracting stronger assertions
Völker and Niepert, 2011
Statistics about the Dataset
Aim at reporting statistics about the usage of different vocabularies, properties and types in the data
Konrath et al. 2012; Langegger and W. Wöß, 2009; Auer et al. 2012; Linked Open Vocabularies (http://lov.okfn.org/)
7. ABSTAT
ABSTAT (http://abstat.disco.unimib.it) is an ontology-driven linked data summarization framework
A summary provides a complete but compact schema-level representation of a data set, consisting of:
A set of Abstract Knowledge Patterns (AKPs)
Statistics
Example: an AKP represents the fact that there are instances of type Person linked with instances of type Settlement by the property birthplace
Pattern statistics: how many times does this pattern occur in the data set?
Type and property statistics: how many times does a certain type occur as a minimal type, and how many times does the property occur in the data set?
8. Abstract Knowledge Patterns (AKPs)
ABSTAT adopts a minimalization mechanism based on
minimal type patterns
Minimalization is based on a subtype graph which represents
the data ontology
Abstract Knowledge Patterns (AKPs) are abstract
representations of Knowledge Patterns
An AKP is a triple (C, P, D) such that C and D are types and P is a property
ABSTAT does not represent every AKP occurring in the data set, only those built from minimal types
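The minimalization idea on this slide can be illustrated with a small Python sketch. It is not ABSTAT's implementation: the subtype graph, the instance data, and the function names (`ancestors`, `minimal_types`, `summarize`) are hypothetical, the graph is assumed to be acyclic, and untyped instances are assumed to fall back to owl:Thing.

```python
from collections import Counter

# Hypothetical toy subtype graph: type -> set of direct supertypes.
SUBTYPE_OF = {
    "City": {"Settlement"},
    "Settlement": {"Place"},
    "Place": set(),
    "Person": set(),
}

def ancestors(t):
    """Return all strict supertypes of t, following the subtype graph transitively."""
    seen, stack = set(), list(SUBTYPE_OF.get(t, ()))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(SUBTYPE_OF.get(s, ()))
    return seen

def minimal_types(types):
    """Keep only the types that are not a strict supertype of another asserted type."""
    if not types:
        return {"owl:Thing"}  # assumption: untyped instances default to owl:Thing
    non_minimal = set()
    for t in types:
        non_minimal |= ancestors(t)
    return set(types) - non_minimal

def summarize(typing, relations):
    """Count AKPs (C, P, D) built from the minimal types of subject and object."""
    akps = Counter()
    for s, p, o in relations:
        for c in minimal_types(typing.get(s, set())):
            for d in minimal_types(typing.get(o, set())):
                akps[(c, p, d)] += 1
    return akps

# Hypothetical typing and relational assertions.
typing = {
    "alice": {"Person"},
    "bob": {"Person"},
    "milan": {"City", "Settlement", "Place"},
    "monza": {"Settlement", "Place"},
}
relations = [("alice", "birthplace", "milan"), ("bob", "birthplace", "monza")]

summary = summarize(typing, relations)
for pattern, count in sorted(summary.items()):
    print(pattern, count)
```

In this toy data, only two patterns are kept, (Person, birthplace, City) and (Person, birthplace, Settlement), whereas enumerating every asserted type would also produce a pattern ending in Place.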
13. ABSTAT User Interfaces
ABSTAT homepage
(http://abstat.disco.unimib.it)
ABSTATBrowse
(http://abstat.disco.unimib.it/browse)
ABSTATSearch
(http://abstat.disco.unimib.it/search)
SPARQL Endpoint
(http://abstat.disco.unimib.it/sparql)
14. Experimental Evaluation
Summary compactness
Number of patterns in the summary vs. number of triples in the
data set
Comparison with a similar approach without minimalization
Summary informativeness
Insights about the semantics of the properties
Small-scale user study
15. Compactness
Data sets and summaries statistics
Dataset | Relational | Typing | Assertions | Types (Ext.) | Properties (Ext.) | Patterns
DBpedia Core 2014 | 40.5M | 29.7M | 70.1M | 869 (85) | 1439 (15) | 171340
DBpedia 3.9 Infobox | 96.3M | 19.7M | 116.4M | 821 (58) | 62572 (14) | 732418
Linked Brainz | 180.1M | 39.6M | 221.7M | 21 (9) | 33 (0) | 161
Reduction Rate = Number of patterns / Number of assertions in the data set
Reduction rate
Dataset | ABSTAT | LOUPE
DBpedia Core 2014 | 0.002 | 0.01
Linked Brainz | 6.72 × 10^-7 | 7.1 × 10^-7
LOUPE: similar to ABSTAT without minimalization
Minimalization produces more compact summaries
The advantage of minimalization is more observable for data sets with richer subtype graphs and typing assertions
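As a worked example of the reduction rate (number of patterns divided by number of assertions in the data set), the following Python sketch plugs in the DBpedia Core 2014 figures from this slide. The function name `reduction_rate` is hypothetical, and the assertion count is the rounded 70.1M value, so the result only approximates the reported rate.

```python
def reduction_rate(num_patterns: int, num_assertions: float) -> float:
    """Ratio between the patterns in a summary and the assertions in the data set."""
    if num_assertions <= 0:
        raise ValueError("num_assertions must be positive")
    return num_patterns / num_assertions

# DBpedia Core 2014 figures as reported on the slide (assertions rounded to 70.1M).
rate = reduction_rate(171_340, 70.1e6)
print(f"{rate:.4f}")  # roughly 0.002, in line with the reported ABSTAT value
```

A lower value means a more compact summary relative to the size of the data set.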
16. Informativeness
ABSTAT summaries provide useful insights about the semantics
of properties, based on their usage within a data set
Dataset | Missing Domain (%) | Missing Range (%) | Missing Domain & Range (%)
DBpedia Core 2014 | 259 (18%) | 187 (13%) | 48 (3.3%)
DBpedia 3.9 Infobox | 61368 (98%) | 61309 (98%) | 61161 (97%)
Linked Brainz | 13 (39%) | 15 (45%) | 13 (39%)
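One way a pattern-based summary can give insights about a property whose domain or range is missing from the ontology is to aggregate the types observed in its patterns. The Python sketch below illustrates this under stated assumptions: the pattern counts are invented toy data, and `observed_domain_and_range` is a hypothetical helper, not part of ABSTAT.

```python
from collections import defaultdict

# Hypothetical AKP occurrence counts: (subject type, property, object type) -> count.
akp_counts = {
    ("Person", "birthplace", "Settlement"): 120,
    ("Person", "birthplace", "Country"): 30,
    ("Athlete", "birthplace", "Settlement"): 50,
    ("Film", "director", "Person"): 80,
}

def observed_domain_and_range(akp_counts, prop):
    """Rank the subject and object types seen with `prop` by pattern occurrences."""
    subjects, objects = defaultdict(int), defaultdict(int)
    for (c, p, d), n in akp_counts.items():
        if p == prop:
            subjects[c] += n
            objects[d] += n

    def rank(counts):
        # Most frequent first; ties broken alphabetically for a stable order.
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

    return rank(subjects), rank(objects)

domain_usage, range_usage = observed_domain_and_range(akp_counts, "birthplace")
print("subject types:", domain_usage)
print("object types:", range_usage)
```

The ranked lists describe how the property is actually used in the data; they are usage evidence, not declared domain or range axioms.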
18. User Study: Setup
Can ABSTAT be useful to support query formulation?
Queries to DBpedia 3.9 Infobox from the Question Answering over Linked Data (QALD) benchmark
5 queries of increasing length (1 of length 1, 2 of length 2 and 2 of length 3)
20 participants, 2 groups:
abstat group uses ABSTAT (after 20 min of training)
control group does not use ABSTAT
Measures:
Time needed to formulate the query
Accuracy of the answer
20. User Study: Results
Query | Group | Avg. Completion Time (s) | Accuracy
Query 1 (length 1): How many employees does Google have? | abstat | 358.9 | 0.9
Query 1 | control | 380.6 | 0.8
Query 2 (length 2): Give me all people that were born in Vienna and died in Berlin. | abstat | 356.3 | 1
Query 2 | control | 346.9 | 0.8
Query 3 (length 2): Which professional surfers were born in Australia? | abstat | 476.6 | 0.6
Query 3 | control | 234.24 | 0.7
Query 4 (length 3): In which films directed by Garry Marshall was Julia Roberts starring? | abstat | 333.4 | 0.9
Query 4 | control | 445.6 | 0.9
Query 5 (length 3): Give me all books by William Goldman with more than 300 pages. | abstat | 233.4 | 1
Query 5 | control | 569.8 | 0.7
An independent-samples t-test showed a significant difference between the two groups in answering Q5 correctly: t(16) = 10.32, p < .005
21. User Study: Results Analysis
abstat group users benefit from the ABSTAT summary in terms of average completion time, accuracy, or both
Increasing accuracy with increasing difficulty, while performing the tasks faster
The exception is query 3, because the individual Surfing is classified with no type other than owl:Thing
Participants from the control group used two strategies to answer the queries:
Directly accessing the public web pages describing the DBpedia named individuals mentioned in the query
Submitting explorative SPARQL queries to the endpoint (very few participants)
22. Conclusion and Future Work
ABSTAT: ontology-driven summarization with minimalization
Significant reduction rate and promising results about the informativeness of the summary
Currently extending the user study
Apply relevance-oriented summarization methods based on connectivity analysis
ABSTAT summaries should consider the inheritance of properties to produce even more compact summaries
We envision a complete analysis of the most important data sets available in the LOD cloud (20+ data sets available)
APIs available soon
23. Thank you for your attention!