Presentation at the Ontology Engineering Group at UPM on Linked Data Quality and the work done in the Enterprise Information Systems (EIS) Group at Universität Bonn
1. Linked Data Quality Assessment
– daQ and Luzzu
Jeremy Debattista
University of Bonn
Presentation at the Ontology Engineering
Group (UPM)
2. …who am I?
• B.Sc (Hons) in Computer Science – University of
Malta
– Thesis: Collaborative Editing and Expert Finding
• M.App Sc in Computer Science – DERI, National
University of Ireland, Galway
– Thesis: Ontology-based rules for User-Controlled
Support in Ubiquitous Environments
• PhD Candidate – University of Bonn
3. … my PhD – the big picture
• Work related to Data Quality (in LD)
– representing quality metadata (daQ)
– assessing data quality (Luzzu)
– identifying new metrics from standard
vocabularies (like PROV-O)
4. … the need for Quality Metadata
• Convincing data consumers to use our
published data
• Filtering datasets
• Poor Quality Perspective – Big Data Veracity
7. … the daQ vocabulary
• Metadata as Named Graphs
• Usage of abstract class concept
• Metric assessment as Observations
• Preserving Provenance information
9. … daQ Applications
• daQ validator – Validates quality metric
schemas extending the daQ (will be online
soon)
– e.g. checking that each dimension is in exactly one category… (see the validation sketch after this slide)
• Luzzu – next slides
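As an illustration of the kind of check such a validator might perform, below is a minimal sketch in Java (Apache Jena) of a SPARQL query that flags dimensions linked to zero or to more than one category. The daq: terms and the instance-level modelling assumed here are illustrative only and should be checked against the published daQ vocabulary and the conventions of the schema being validated.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class DaqSchemaCheckSketch {

    // Finds dimensions that are linked from zero or from more than one category.
    // Assumes dimensions are typed as daq:Dimension (directly or via inference)
    // and that categories point to them via daq:hasDimension -- illustrative only.
    static final String QUERY =
        "PREFIX daq: <http://purl.org/eis/vocab/daq#>\n" +
        "SELECT ?dimension (COUNT(DISTINCT ?category) AS ?categories)\n" +
        "WHERE {\n" +
        "  ?dimension a daq:Dimension .\n" +
        "  OPTIONAL { ?category daq:hasDimension ?dimension . }\n" +
        "}\n" +
        "GROUP BY ?dimension\n" +
        "HAVING (COUNT(DISTINCT ?category) != 1)";

    public static void main(String[] args) {
        // Load the user's quality metric schema (an extension of daQ).
        Model schema = RDFDataMgr.loadModel(args[0]);

        try (QueryExecution exec = QueryExecutionFactory.create(QUERY, schema)) {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.printf("Dimension %s is in %d categories (expected exactly 1)%n",
                        row.getResource("dimension"),
                        row.getLiteral("categories").getInt());
            }
        }
    }
}
```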
10. … Luzzu – QA Framework
• A comprehensive QA framework
– assesses LD quality using user-provided metrics (we already have a number of LOD metrics; a sketch of a metric follows after this slide) in a scalable manner
– provides queryable metadata (daQ)
– provides quality reports which can be used for cleaning
• Java-based with Maven integration
• http://eis-bonn.github.io/Luzzu
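As a rough, self-contained sketch of what a user-provided, streaming-style metric could look like in Java, consider the snippet below. The StreamingQualityMetric interface here is an assumption made for illustration and is not the actual Luzzu metric interface, which should be taken from the project documentation; the example metric itself (share of non-blank-node subjects) is likewise only a toy.

```java
import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.Quad;

// Illustrative interface: the framework pushes quads one by one while streaming
// the dataset and asks for the final value at the end.
interface StreamingQualityMetric {
    void compute(Quad quad);   // called once per quad
    double metricValue();      // final (normalised) metric value in [0, 1]
}

// Toy metric: proportion of statements whose subject is an IRI (not a blank node).
public class NoBlankNodeSubjectsMetric implements StreamingQualityMetric {

    private long total = 0;
    private long iriSubjects = 0;

    @Override
    public void compute(Quad quad) {
        total++;
        Node subject = quad.getSubject();
        if (subject.isURI()) {
            iriSubjects++;
        }
    }

    @Override
    public double metricValue() {
        return total == 0 ? 1.0 : (double) iriSubjects / total;
    }
}
```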
13. …what’s missing in Luzzu
• Make Luzzu work better on Big Data Platforms
– We already have a Spark processor
– How can metrics be scaled on different cores?
Something like map-reduce maybe?
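The slide leaves this as an open question; the sketch below only illustrates one possible map-reduce-style answer for counting-type metrics, using Java parallel streams. The Partial record and the way the data is partitioned are hypothetical and not part of Luzzu.

```java
import java.util.List;

// Map-reduce-style scaling sketch: each partition of the data produces a partial
// result, and the partials are merged into the final metric value.
public class ParallelMetricSketch {

    // Partial result of a counting metric over one data partition.
    record Partial(long matching, long total) {
        Partial merge(Partial other) {
            return new Partial(matching + other.matching, total + other.total);
        }
    }

    public static void main(String[] args) {
        // Hypothetical partitions of subject identifiers (in practice: chunks of quads).
        List<List<String>> partitions = List.of(
                List.of("http://example.org/a", "_:b1"),
                List.of("http://example.org/c", "http://example.org/d"));

        Partial result = partitions.parallelStream()           // map: one partial per partition
                .map(part -> new Partial(
                        part.stream().filter(s -> s.startsWith("http")).count(),
                        part.size()))
                .reduce(new Partial(0, 0), Partial::merge);     // reduce: merge partials

        System.out.printf("metric value = %.2f%n",
                (double) result.matching() / result.total());
    }
}
```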
15. … quality metrics
• Traditional naïve way
• Probabilistic Techniques (A paper was
presented at ESWC this year)
16. … probabilistic technique hypothesis
Probabilistic approximation techniques would:
(H1) drastically improve computational time
(H2) give close to accurate results
17. … probabilistic techniques used
• Reservoir Sampling – Dereferenceability, Links to External Data Providers
• Bloom Filters – Extensional Conciseness
• Clustering Coefficient Estimation – Clustering Coefficient of a Network
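To illustrate the first of these techniques, below is a minimal, generic sketch of reservoir sampling (Algorithm R): instead of assessing every item (e.g. dereferencing every IRI), only a fixed-size uniform sample of the stream is kept and assessed. This is not Luzzu's implementation; the class and capacity handling are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Reservoir sampling (Algorithm R): keeps a uniform random sample of fixed size
// from a stream of unknown length, in a single pass.
public class ReservoirSamplerSketch<T> {

    private final List<T> reservoir = new ArrayList<>();
    private final int capacity;
    private long seen = 0;

    public ReservoirSamplerSketch(int capacity) {
        this.capacity = capacity;
    }

    // Offer each streamed item exactly once.
    public void add(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(item);
        } else {
            // Keep the new item with probability capacity / seen,
            // replacing a uniformly chosen element of the reservoir.
            long index = ThreadLocalRandom.current().nextLong(seen);
            if (index < capacity) {
                reservoir.set((int) index, item);
            }
        }
    }

    public List<T> sample() {
        return List.copyOf(reservoir);
    }

    public static void main(String[] args) {
        ReservoirSamplerSketch<Integer> sampler = new ReservoirSamplerSketch<>(10);
        for (int i = 0; i < 1_000_000; i++) {
            sampler.add(i);   // e.g. one IRI per add() while streaming the dataset
        }
        System.out.println("sample of 10 items: " + sampler.sample());
    }
}
```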
21. … what am I working on
• Large-scale (Web of Data scale) evaluation journal paper
– assessing the quality of LOD Cloud datasets
• daQ (Journal Paper)
22. … what do we do at Bonn
• Open Government Data – Publishing and
Consumption
– Data Value Chains, Value Creation, Budgeting
• Portal for publication and consumption of open
data
– Lowering of semantic data to shallower domain-specific formats (RDB, CSV, etc.)
• RDF Visualisations and Recommendations
23. … what do we do at Bonn
• Dataset Change Detection
• Collaborative Authoring and Open Educational
Content
• Low-threshold agile methodology for
collaborative vocabulary development
• Mapping of AutomationML to RDF
There are various reasons why a dataset should contain quality metadata:
Convincing data consumers: is the published data fit for the user's needs?
Filtering datasets: if the publisher does not care about their data, then why should a consumer use it?
Poor quality perspective: LD is a good use case for veracity in Big Data, but it is often overlooked due to the perception of its poor quality. If the Big Data community is convinced otherwise, LD might be used more often on bigger platforms. Therefore we have to start by assessing data quality and stamping our datasets with quality metadata in a machine-readable format.
Quality metadata is represented in named graphs that can be attached to datasets.
Category, Dimension and Metric are abstract classes; these are only conceptual, and more concrete classes should be represented as sub-classes.
A dataset can be assessed by multiple metrics. Each metric can be assessed over the dataset any number of times, with each new value represented as an observation.
Each observation is also a provenance entity, enabling the representation of concepts such as the activity and agent, and how a metric was executed (for example, parameter settings in reservoir sampling techniques).
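As a sketch of how such an observation might be materialised, the Java (Apache Jena) snippet below writes one metric observation, together with minimal provenance, into a named quality graph. The daq: and prov: terms used here (daq:Observation, daq:metric, daq:computedOn, daq:value, prov:wasGeneratedBy) are meant to illustrate the modelling and should be checked against the published vocabularies.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class DaqObservationSketch {

    // Illustrative namespaces; check the published daQ vocabulary for exact terms.
    static final String DAQ  = "http://purl.org/eis/vocab/daq#";
    static final String PROV = "http://www.w3.org/ns/prov#";
    static final String EX   = "http://example.org/";

    public static void main(String[] args) {
        Dataset dataset = DatasetFactory.create();

        // Quality metadata lives in its own named graph, attached to the assessed dataset.
        Model qualityGraph = ModelFactory.createDefaultModel();
        qualityGraph.setNsPrefix("daq", DAQ);
        qualityGraph.setNsPrefix("prov", PROV);

        Resource metric = qualityGraph.createResource(EX + "DereferenceabilityMetric");

        // One assessment run of a metric over the dataset becomes one observation.
        Resource observation = qualityGraph.createResource(EX + "obs/1")
                .addProperty(RDF.type, qualityGraph.createResource(DAQ + "Observation"))
                .addProperty(qualityGraph.createProperty(DAQ, "metric"), metric)
                .addProperty(qualityGraph.createProperty(DAQ, "computedOn"),
                        qualityGraph.createResource(EX + "myDataset"))
                .addLiteral(qualityGraph.createProperty(DAQ, "value"), 0.75);

        // The observation is also a provenance entity, so agent/activity details can be attached.
        observation.addProperty(RDF.type, qualityGraph.createResource(PROV + "Entity"))
                .addProperty(qualityGraph.createProperty(PROV, "wasGeneratedBy"),
                        qualityGraph.createResource(EX + "assessmentActivity/1"));

        dataset.addNamedModel(EX + "qualityMetadata", qualityGraph);
        dataset.getNamedModel(EX + "qualityMetadata").write(System.out, "TURTLE");
    }
}
```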
The general architecture
The processing workflow
We identified the data quality lifecycle, which could be part of a bigger lifecycle like the LOD Stack, or even of bigger, more generic processes like data value chains:
Metric Identification and Definition – choosing the right metrics for a dataset and the task at hand;
Assessment – assessing the dataset based on the metrics chosen;
Data Repairing and Cleaning – ensuring that, following a quality assessment, a dataset is curated in order to improve its quality;
Storage, Cataloguing and Archiving – updating the improved dataset on the cloud whilst making the quality metadata available to the public;
Exploration and Ranking – finally, data consumers can explore cleaned datasets according to their quality metadata.
Our hypothesis is that probabilistic approximation techniques would drastically improve computational time when compared with the naïve implementations, which give 100% accurate results. Having said that, the probabilistic techniques should still give close-to-accurate results given the right parameter settings.
Therefore, to sum up the metrics using the reservoir sampling technique:
The dereferenceability metric gave around 75% precision, whilst the speed-up can easily exceed two orders of magnitude even for small datasets of 1M triples.
The links to external data providers metric gave us 100% precision, whilst the difference in time becomes easily noticeable as datasets grow larger.
From the results we saw that the precision was on average 97%, whilst the improvement in computational time was more than 3 orders of magnitude in most cases.