Presentation at the LWDM workshop at EDBT 2012.
The Web of Linked Data grows rapidly and already contains data originating from hundreds of data sources. The quality of data from those sources is very diverse, as values may be out of date, incomplete or incorrect. Moreover, data sources
may provide conflicting values for a single real-world object. In order for Linked Data applications to consume data from this global data space in an integrated fashion, a number of challenges have to be overcome. One of these challenges is to rate and to integrate data based on their quality.
However, quality is a very subjective matter, and finding a canonical judgement that is suitable for each and every task is not feasible.
To simplify the task of consuming high-quality data, we present Sieve, a framework for flexibly expressing quality assessment methods as well as fusion methods. Sieve is integrated into the Linked Data Integration Framework (LDIF), which handles Data Access, Schema Mapping and Identity
Resolution, all crucial preliminaries for quality assessment and fusion.
We demonstrate Sieve in a data integration scenario importing data from the English and Portuguese versions of DBpedia, and discuss how we increase completeness, conciseness and consistency through the use of our framework.
1. Sieve
Linked Data
Quality Assessment
and Fusion
Pablo N. Mendes
Hannes Mühleisen
Christian Bizer
With contributions from:
Andreas Schultz, Andrea Matteini, Christian Becker, Robert Isele
2. “sieve”
“A sieve, or sifter, separates wanted elements
from unwanted material using a woven screen
such as a mesh or net.”
Source: http://en.wikipedia.org/wiki/Sieve
3. What is Linked Data?
• Raw data (RDF)
• Accessible on the Web
• Data can link to other data sources
[Diagram: five data sources A–E, each holding Things, connected to one another by data links]
• Benefits: Ease of access and re-use; enables discovery
5. Linked Data Challenges
• Data providers have different intentions, experience/knowledge
• data may be inaccurate, outdated, spam etc.
• Data sources that overlap in content may use…
• ... different RDF schemata
• ... different identifiers for the same real-world entity
• …conflicting values for properties
• Integrating public datasets with internal databases poses the
same problems
7. LDIF – Linked Data Integration Framework
1 Collect data: Managed download and update
2 Translate data into a single target vocabulary
3 Resolve identifier aliases into local target URIs
4 Assess quality, filter bad results, resolve conflicts
5 Output
• Open source (Apache License, Version 2.0)
• Collaboration between Freie Universität Berlin and mes|semantics
8. LDIF Pipeline – 1 Collect data
Supported data sources:
• RDF dumps (various formats)
• SPARQL Endpoints
• Crawling Linked Data
9. LDIF Pipeline – 2 Translate data
Data sources use a wide range of different RDF vocabularies.
[Diagram: R2R maps dbpedia-owl:City, schema:Place and fb:location.citytown to local:City]
• Mappings expressed in RDF (Turtle)
• Simple mappings using OWL / RDFS statements (x rdfs:subClassOf y)
• Complex mappings with SPARQL expressivity
• Transformation functions
10. LDIF Pipeline – 3 Resolve identities
Data sources use different identifiers for the same entity.
[Diagram: Silk matches the alias "Berlin" against "Berlin, Germany", "Berlin, CT", "Berlin, MD", "Berlin, NJ" and "Berlin, MA", resolving it to "Berlin, Germany"]
• Profiles expressed in XML
• Supports various comparators and transformations
11. LDIF Pipeline – 4 Filter and fuse
Sources provide different values for the same property.
[Diagram: four sources report Total Area as 891.85 km², 891.82 km², 891.82 km² and 891.85 km²; the Quality Sieve fuses them into a single value, 891.85 km²]
• Profiles expressed in XML
• Supports various scoring and fusion functions
12. LDIF Pipeline – 5 Output
• Output options:
  • N-Quads
  • N-Triples
  • SPARQL Update Stream
• Provenance tracking using Named Graphs
14. Data Fusion
“fusing multiple records representing the same
real-world object into a single, consistent, and
clean representation”
(Bleiholder & Naumann, 2008)
15. Conflict resolution strategies
• Independent of quality assessment metrics
• Pick most frequent (democratic voting)
• Average, max, min, concatenation
• Within interval
• Based on task-specific quality assessment
• Keep highest scored
• Keep all that pass a threshold
• Trust some sources over others
• Weighted voting
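The strategies above can be made concrete in a few lines. This is an illustrative Python sketch with made-up values and scores, not Sieve's actual implementation (Sieve is written in Scala and configured in XML):

```python
from collections import Counter

def most_frequent(values):
    """Democratic voting: pick the value reported by most sources."""
    return Counter(values).most_common(1)[0][0]

def keep_highest_scored(values, scores):
    """Quality-based: keep the value from the best-scored source."""
    return max(zip(scores, values))[1]

def keep_above_threshold(values, scores, threshold):
    """Quality-based: keep all values whose source score passes a threshold."""
    return [v for v, s in zip(values, scores) if s >= threshold]

# Hypothetical Total Area values and per-source quality scores:
areas  = [891.85, 891.82, 891.82, 891.85]
scores = [0.9, 0.4, 0.4, 0.8]
print(most_frequent(areas))                       # ties broken arbitrarily here
print(keep_highest_scored(areas, scores))         # 891.85
print(keep_above_threshold(areas, scores, 0.5))   # [891.85, 891.85]
```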
16. Data Fusion
• Input:
• (Potentially) conflicting data
• Quality metadata describing input
• Execution:
• Use existing or custom FusionFunctions
• Output:
• Clean data, according to user’s definition of clean
18. Sieve: Quality Assessment
• Quality as “fitness for use”:
• Subjective:
• good for me might not be enough for you
• Task dependent:
• temperature: planning a weekend vs biology experiment
• Multidimensional:
• even correct data may be outdated or not available
• Requires task-specific quality assessment.
19. Data Quality - Conceptual Framework
Dimension
Accuracy
Consistency
Objectivity
Timeliness
Validity
Believability
Completeness
Understandability
Relevancy
Reputation
Verifiability
Amount of Data
Interpretability
Rep. Conciseness
Rep. Consistency
Availability
Response Time
Security
20. Configuration: Quality Assessment
• Quality Assessment Metrics are composed of:
  • a ScoringFunction (generically applicable to given data types)
  • a Quality Indicator as input (adaptable to the use case)
• Output: a score in [0;1] describing the input within a quality dimension, according to the user's definition of quality
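The mapping from quality indicator to a [0;1] score can be sketched for a timeliness-style metric: the indicator is a last-update date taken from provenance, and the score falls linearly with age. This is an illustrative Python sketch, not one of Sieve's built-in scoring functions; the function name and the 365-day range are assumptions.

```python
from datetime import date

def time_closeness(last_update: date, today: date, range_days: int) -> float:
    """Map a last-update date (the quality indicator) to a score in [0, 1]:
    1.0 for data updated today, falling linearly to 0.0 at range_days."""
    age = (today - last_update).days
    return max(0.0, 1.0 - age / range_days)

today = date(2012, 3, 30)
print(time_closeness(date(2012, 3, 30), today, 365))  # 1.0 (updated today)
print(time_closeness(date(2011, 3, 31), today, 365))  # 0.0 (a year old)
```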
22. More about Sieve
• Software: Open Source, Apache V2
• Scoring Functions and Fusion Functions can be extended
• Scala/Java interface, methods score/fuse and fromXML
• Quality scores can be stored and shared with other
applications
• Website: http://sieve.wbsg.de
• Documentation, examples, downloads, support
23. Use Case
[Diagram: multiple complementary, heterogeneous data sources with conflicting values; multidimensional, task-dependent quality indicators; conflict resolution strategies driven by user configuration produce the integrated result. Voilà!]
24. Evaluating Quality of Data Integration
• Completeness
• How many cities did we find?
• How many of the properties did we fill with values?
• Conciseness
• How much redundancy is there in the object identifiers?
• How much redundancy is there in the property values?
• Consistency
• How many conflicting values are there?
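The three questions above can each be phrased as a simple ratio. This is an illustrative Python sketch with hypothetical counts for an integrated city dataset, not the evaluation numbers from the paper:

```python
def completeness(filled_values: int, expected_values: int) -> float:
    """Fraction of expected property slots that actually have a value."""
    return filled_values / expected_values

def conciseness(unique_values: int, total_values: int) -> float:
    """1.0 means no redundancy among identifiers or property values."""
    return unique_values / total_values

def consistency(conflict_free: int, total_objects: int) -> float:
    """Fraction of objects left with no conflicting property values."""
    return conflict_free / total_objects

# Hypothetical counts:
print(completeness(9000, 10000))  # 0.9
print(conciseness(9000, 12000))   # 0.75
print(consistency(9500, 10000))   # 0.95
```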
26. Linked Data Application Architecture
My view on this data space can also be shared and reused.
We can "pay as we go".
27. THANK YOU!
• Twitter: @pablomendes
• E-mail: pablo.mendes@fu-berlin.de
• Website: http://sieve.wbsg.de
• Google Group: http://bit.ly/ldifgroup
Supported in part by:
Vulcan Inc. as part of its Project Halo
EU FP7 projects:
- LOD2 – Creating Knowledge out of Interlinked Data
- PlanetData – A European Network of Excellence on Large-Scale Data Management