ABSTAT: Ontology-driven Linked Data Summaries with Pattern Minimalization

Blerina Spahiu, Riccardo Porrini, Matteo Palmonari, Anisa Rula, Andrea Maurino
University of Milano-Bicocca (surname@disco.unimib.it)
blerina.spahiu@disco.unimib.it

Outline
• Motivation
• Dataset Understanding
• State of the Art
• Summarization Framework
  • Abstract Knowledge Patterns (AKPs)
  • Pattern Minimalization
  • Summary extraction, storage and presentation
• Evaluation
  • Compactness
  • Informativeness
  • User Study
• Conclusion and Future Work

Introduction
What types of resources are there in a data set?
How are they described?
What types of resources are linked by a certain property, and how frequently?

Motivation
• Understanding the content of data sets is challenging
• Looking at the ontology is not enough:
  • Ontologies may be large and underspecified
    • DBpedia 2015-04: 2795 properties; domain not specified for 259 properties, range not specified for 187 properties
    • No information about the usage
• Explorative queries are too expensive
  • Significant server overload
  • High response times or timeouts

State of the Art
• Relevance-based summarization (Troullinou et al. 2015; Zhang et al. 2007)
  • Identifies subsets of data sets or ontologies that are considered to be more relevant
• Pattern-based approaches (Mihindukulasooriya et al. 2015; Presutti et al. 2011; M. Jarrar and M. Dikaiakos, 2012)
  • Aim at extracting knowledge patterns for a complete representation of the data set
• Schema induction (Völker and Niepert, 2011)
  • Induces a schema from the data and aims at extracting stronger assertions
• Statistics about the data set (Konrath et al. 2012; Langegger and W. Wöß, 2009; Auer et al. 2012; Linked Open Vocabularies, http://lov.okfn.org/)
  • Aim at reporting statistics about the usage of different vocabularies, properties and types in the data

ABSTAT
• ABSTAT (http://abstat.disco.unimib.it) is an ontology-driven linked data summarization framework
• A summary provides a complete but compact schema-level representation of a data set, consisting of:
  • A set of Abstract Knowledge Patterns (AKPs)
  • Statistics
• An AKP represents, for example, the fact that there are instances of type Person linked with instances of type Settlement by the property birthplace
• The statistics report how many times each pattern occurs in the data set, how many times a type occurs as a minimal type, and how many times a property occurs in the data set

Abstract Knowledge Patterns (AKPs)
• ABSTAT adopts a minimalization mechanism based on minimal type patterns
• Minimalization is based on a subtype graph, which represents the data ontology
• Abstract Knowledge Patterns (AKPs) are abstract representations of Knowledge Patterns
• An AKP is a triple (C, P, D) such that C and D are types and P is a property
• ABSTAT represents only a subset of the AKPs occurring in the data set: those whose types C and D are minimal types
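
The notion of a minimal type can be illustrated with a small sketch. The subtype edges below are an assumption built from the type names used in the example slides; the function names are hypothetical and this is not ABSTAT's actual implementation. A type asserted for an instance is treated as minimal when no other asserted type is one of its subtypes.

```python
from typing import Dict, Set

# Assumed subtype graph: each type maps to its direct supertypes.
SUBTYPE_OF: Dict[str, Set[str]] = {
    "Person": set(),
    "Sportist": {"Person"},
    "FootballPlayer": {"Sportist"},
    "Lawyer": {"Person"},
    "Artist": {"Person"},
}


def ancestors(t: str, graph: Dict[str, Set[str]]) -> Set[str]:
    """Return all strict supertypes of t, following edges transitively."""
    seen: Set[str] = set()
    stack = list(graph.get(t, ()))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(graph.get(s, ()))
    return seen


def minimal_types(types: Set[str], graph: Dict[str, Set[str]]) -> Set[str]:
    """Keep only the types that are not a supertype of another given type."""
    dominated: Set[str] = set()
    for t in types:
        dominated |= ancestors(t, graph)
    return set(types) - dominated


# An instance typed as Person, Sportist and FootballPlayer has a single
# minimal type; an instance with two unrelated leaf types keeps both.
print(minimal_types({"Person", "Sportist", "FootballPlayer"}, SUBTYPE_OF))
print(minimal_types({"Person", "FootballPlayer", "Artist"}, SUBTYPE_OF))
```

Under these assumptions an instance can have more than one minimal type, in which case each of them can take part in a pattern.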

An example of how AKPs are extracted
[Figure: example graph with the types Person, Sportist, FootballPlayer, Lawyer and Artist connected by subclassOf edges; the instances Jim Brown, George Clooney and Amal Clooney connected to types by type edges; the literal "1936-02-17" of datatype XMLSchema#Date; and the properties hasWife and birthDate.]
The (minimal-type) patterns extracted by ABSTAT are:
<Artist, hasWife, Lawyer>
<FootballPlayer, birthDate, XMLSchema#Date>

[Figure: the same example graph.]
Redundant patterns excluded by the summary:
<Person, hasWife, Person>
<Sportist, birthDate, XMLSchema#Date>
<Person, birthDate, XMLSchema#Date>
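
One way to read why these patterns are redundant: each of them can be obtained from a minimal-type pattern by replacing a type with one of its supertypes. The sketch below enumerates such generalizations. The subclassOf edges are an assumption reconstructed from the example's labels, and the helper names are hypothetical.

```python
from itertools import product

# Assumed direct-supertype edges for the example ontology.
SUPERTYPES = {
    "Sportist": ["Person"],
    "FootballPlayer": ["Sportist"],
    "Lawyer": ["Person"],
    "Artist": ["Person"],
}

# The two minimal-type patterns listed for the example.
MINIMAL_PATTERNS = [
    ("Artist", "hasWife", "Lawyer"),
    ("FootballPlayer", "birthDate", "XMLSchema#Date"),
]


def generalizations(t):
    """Return the type itself plus every supertype reachable from it."""
    result = [t]
    for parent in SUPERTYPES.get(t, []):
        for g in generalizations(parent):
            if g not in result:
                result.append(g)
    return result


def implied_patterns(pattern):
    """Patterns obtained by generalizing the subject and/or object type."""
    c, p, d = pattern
    combos = product(generalizations(c), generalizations(d))
    return {(c2, p, d2) for c2, d2 in combos} - {pattern}


redundant = set().union(*(implied_patterns(m) for m in MINIMAL_PATTERNS))
for pattern in sorted(redundant):
    print(pattern)
```

With these assumed edges the three redundant patterns listed above are all among the implied ones, together with mixed generalizations such as (Artist, hasWife, Person).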

[Figure: the same example graph with one additional type edge, annotated with the pattern <Artist, birthDate, XMLSchema#Date>.]

ABSTAT User Interfaces
ABSTAT homepage
(http://abstat.disco.unimib.it)
ABSTATBrowse
(http://abstat.disco.unimib.it/browse)
ABSTATSearch
(http://abstat.disco.unimib.it/search)
SPARQL Endpoint
(http://abstat.disco.unimib.it/sparql)
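
As a minimal sketch of how the SPARQL endpoint could be addressed, the snippet below builds a request URL following the standard SPARQL Protocol convention of passing the query text in a `query` parameter. The probe query is generic and assumes nothing about the vocabulary of the summaries; whether the endpoint is currently reachable is not verified here, and no request is sent.

```python
from urllib.parse import urlencode

# Endpoint URL as given on the slide.
ENDPOINT = "http://abstat.disco.unimib.it/sparql"

# Generic probe query; it does not rely on any ABSTAT-specific vocabulary.
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL with the query URL-encoded."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```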

Experimental Evaluation
• Summary compactness
  • Number of patterns in the summary vs. number of triples in the data set
  • Comparison with a similar approach without minimalization
• Summary informativeness
  • Insights about the semantics of the properties
• Small-scale user study
Compactness
Data sets and summaries statistics

Dataset              Relational  Typing  Assertions  Types (Ext.)  Properties (Ext.)  Patterns
DBpedia Core 2014    40.5M       29.7M   70.1M       869 (85)      1439 (15)          171340
DBpedia 3.9 Infobox  96.3M       19.7M   116.4M      821 (58)      62572 (14)         732418
Linked Brainz        180.1M      39.6M   221.7M      21 (9)        33 (0)             161

Reduction rate = number of patterns / number of assertions in the data set

Dataset              ABSTAT        LOUPE
DBpedia Core 2014    0.002         0.01
Linked Brainz        6.72 × 10^-7  7.1 × 10^-7

LOUPE is similar to ABSTAT without minimalization.
• Minimalization produces more compact summaries
• The advantage of minimalization is more observable for data sets with richer subtype graphs and typing assertions
Informativeness
ABSTAT summaries provide useful insights about the semantics of properties, based on their usage within a data set

Dataset              Missing Domain (%)  Missing Range (%)  Missing Domain & Range (%)
DBpedia Core 2014    259 (18%)           187 (13%)          48 (3.3%)
DBpedia 3.9 Infobox  61368 (98%)         61309 (98%)        61161 (97%)
Linked Brainz        13 (39%)            15 (45%)           13 (39%)

Inferred domain and range for DBpedia Core 2014
[Figure: bar chart of the number of minimal types (y-axis, 0 to 160) extracted as domain and as range for DBpedia ontology properties, including dbo:type, dbo:religion, dbo:division, dbo:picture, dbo:isPartOf, dbo:position, dbo:series and others. Series: extracted minimal types (domain), extracted minimal types (range).]
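
A rough sketch of how such usage-based domains and ranges could be read off a summary: for each property, collect the minimal types seen in subject position and in object position of its patterns. The pattern list below reuses the patterns from the extraction example, not a real ABSTAT summary, and the function name is hypothetical.

```python
from collections import defaultdict

# Patterns (C, P, D) taken from the extraction example, for illustration only.
PATTERNS = [
    ("Artist", "hasWife", "Lawyer"),
    ("FootballPlayer", "birthDate", "XMLSchema#Date"),
    ("Artist", "birthDate", "XMLSchema#Date"),
]


def observed_domain_range(patterns):
    """Group subject types (domain) and object types (range) by property."""
    domains = defaultdict(set)
    ranges = defaultdict(set)
    for c, p, d in patterns:
        domains[p].add(c)
        ranges[p].add(d)
    return domains, ranges


domains, ranges = observed_domain_range(PATTERNS)
for prop in sorted(domains):
    # Counts per property correspond to the quantity plotted in the chart.
    print(prop, len(domains[prop]), len(ranges[prop]))
```

For a property with no declared domain or range in the ontology, these sets give an indication of how the property is actually used.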

User Study: Setup
Can ABSTAT be useful to support query formulation?
• Queries to DBpedia 3.9 Infobox from the Question Answering over Linked Data (QALD) benchmark
  • 5 queries of increasing length (1 of length 1, 2 of length 2 and 2 of length 3)
• 20 participants, 2 groups:
  • abstat group uses ABSTAT (after 20 min of training)
  • control group does not use ABSTAT
• Measures:
  • Time needed to formulate the query
  • Accuracy of the answer

User Study: Questionnaire

User Study: Results
Group    Avg. Completion Time (s)  Accuracy

Query 1 (length 1): How many employees does Google have?
abstat   358.9                     0.9
control  380.6                     0.8

Query 2 (length 2): Give me all people that were born in Vienna and died in Berlin.
abstat   356.3                     1
control  346.9                     0.8

Query 3 (length 2): Which professional surfers were born in Australia?
abstat   476.6                     0.6
control  234.24                    0.7

Query 4 (length 3): In which films directed by Garry Marshall was Julia Roberts starring?
abstat   333.4                     0.9
control  445.6                     0.9

Query 5 (length 3): Give me all books by William Goldman with more than 300 pages.
abstat   233.4                     1
control  569.8                     0.7

An independent t-test showed a significant difference between the two groups in answering Q5 correctly: t(16) = 10.32, p < .005
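
The per-query averages above can be compared directly. The sketch below copies the reported values and lists, for each query, whether the abstat group was faster, more accurate, or both; the variable and function names are hypothetical.

```python
# (abstat time, abstat accuracy, control time, control accuracy),
# copied from the results reported above.
RESULTS = {
    1: (358.9, 0.9, 380.6, 0.8),
    2: (356.3, 1.0, 346.9, 0.8),
    3: (476.6, 0.6, 234.24, 0.7),
    4: (333.4, 0.9, 445.6, 0.9),
    5: (233.4, 1.0, 569.8, 0.7),
}


def benefits(row):
    """Return the measures on which the abstat group did strictly better."""
    a_time, a_acc, c_time, c_acc = row
    gains = []
    if a_time < c_time:
        gains.append("time")
    if a_acc > c_acc:
        gains.append("accuracy")
    return gains


summary = {query: benefits(row) for query, row in RESULTS.items()}
for query, gains in summary.items():
    print(f"Query {query}: {', '.join(gains) or 'no improvement'}")
```

On these numbers the abstat group improves on at least one measure for every query except query 3.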

User Study: Results Analysis
• Users in the abstat group benefit from the ABSTAT summary in terms of average completion time, accuracy, or both
  • Accuracy increases with query difficulty, and the tasks are performed faster
  • The exception is query 3, because the individual Surfing is classified with no type other than owl:Thing
• Participants in the control group used two strategies to answer the queries:
  • Directly accessing the public web pages describing the DBpedia named individuals mentioned in the query
  • Submitting explorative SPARQL queries to the endpoint (very few participants)

Conclusion and Future Work
• ABSTAT: ontology-driven summarization with minimalization
• Significant reduction rate and promising results about the informativeness of the summary
• Currently extending the user study
• Apply relevance-oriented summarization methods based on connectivity analysis
• ABSTAT summaries should consider the inheritance of properties to produce even more compact summaries
• We envision a complete analysis of the most important data sets available in the LOD cloud (20+ data sets available)
• APIs available soon

Thank you for your attention!

www.abstat.unimib.it
