Using Semantic Web Resources for Data Quality Management
1. Using Semantic Web Resources for Data Quality Management
Christian Fürber and Martin Hepp
christian@fuerber.com, mhepp@computer.org
Presentation at the 17th International Conference on
Knowledge Engineering and Knowledge Management,
October 10-15, 2010, Lisbon, Portugal
2. Purpose of Data
[Figure: DATA (shown as binary digits) at the center, serving four purposes: Measurement, Information & Knowledge, Automation, Decisions]
C. Fürber, M. Hepp: 2
Using SemWeb Resources for DQM
3. Data Quality in Practice
Reference: http://www.heise.de/newsticker/meldung/Comdirect-Bank-macht-Kunden-zu-Billiardaeren-996088.html
4. The Web of Messy Data?
Retrieved from http://dbpedia.org/sparql on July 20th
Which one is the correct population?
5. The Web of Messy Data?
Retrieved from http://dbpedia.org/sparql on July 20th
Places with negative population?!?
6. Risk of Failure
[Figure: DATA (shown as binary digits) at the center, connected to Measurement, Information & Knowledge, Automation, and Decisions]
7. Data Quality Problem Types
• Inconsistent duplicates
• Approximate duplicates
• Invalid characters
• Invalid substrings
• Character alignment violation
• Word transpositions
• Mistyping / misspelling errors
• Missing classification
• Incorrect classification
• Incorrect reference
• Referential integrity violation
• Cardinality violation
• Unique value violation
• Functional dependency violation
• Missing values
• Misfielded values
• False values
• Out of range values
• Imprecise values
• Meaningless values
• Outdated values
• Untyped literals
• Existence of homonyms
• Existence of synonyms
• Contradictory relationships
• Outdated conceptual elements
Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
8. Goals
• Use Semantic Web data to identify data quality problems at the instance level
• Support the Data Quality Management (DQM) process
9. Total Data Quality Management for and based on the Semantic Web
[Figure: DQ cycle with the four phases Define, Measure, Analyze, Improve]
• Define: Define what's good and/or what's poor data quality
• Measure: Develop and apply SPARQL queries based on the DQ definition
Reference: Richard Wang (1998)
10. How can the Semantic Web support Data Quality Management?
Availability of FREE Data Quality Knowledge, e.g. for the identification of…
• Legal value violations
• Functional dependency violations
11. Using Trusted References
[Figure: DQ constraints compare a tested knowledge base (local:Location) against a trusted reference (tref:Location); example values shown: Las Vegas / France and Las Vegas / USA]
13. Basic Characteristics of SPIN
• Allows definition of generalized SPARQL query templates
• Constraint checking based on SPARQL
• Definition of inferencing rules via SPARQL
http://spinrdf.org/
14. Generic Data Quality Constraints Library for Easy DQ-Definition
• Mandatory properties & literals
• Legal values*
• Legal value ranges
• Functional dependencies*
• Legal syntaxes
• Uniqueness
* Designed to use trusted references
available @ http://semwebquality.org/ontologies/dq-constraints#
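The constraint types that need no trusted reference can be illustrated outside SPARQL. The following is a minimal Python sketch, not the dq-constraints library itself: triples are modelled as plain tuples, and every name in it (TRIPLES, the ex: identifiers, and the four function names) is a hypothetical stand-in chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample data: (subject, property, value) triples.
TRIPLES = [
    ("ex:c1", "ex:name", "A-Town"),
    ("ex:c1", "ex:population", 1000),
    ("ex:c1", "ex:zip", "12345"),
    ("ex:c2", "ex:name", "B-Town"),
    ("ex:c2", "ex:population", -5),
    ("ex:c2", "ex:zip", "85577"),
    ("ex:c3", "ex:name", "A-Town"),
    ("ex:c3", "ex:zip", "ABC"),
]


def subjects(triples):
    """All distinct subjects, in first-seen order."""
    return list(dict.fromkeys(s for s, _, _ in triples))


def missing_property(triples, prop):
    """Mandatory property: subjects that lack `prop` or only have an empty literal."""
    ok = {s for s, p, v in triples if p == prop and v not in (None, "")}
    return [s for s in subjects(triples) if s not in ok]


def out_of_range(triples, prop, low, high):
    """Legal value range: subjects whose `prop` value lies outside [low, high]."""
    return [s for s, p, v in triples if p == prop and not (low <= v <= high)]


def syntax_violations(triples, prop, pattern):
    """Legal syntax: subjects whose `prop` value does not fully match `pattern`."""
    regex = re.compile(pattern)
    return [s for s, p, v in triples if p == prop and not regex.fullmatch(str(v))]


def duplicate_values(triples, prop):
    """Uniqueness: values of `prop` that are shared by more than one subject."""
    owners = defaultdict(set)
    for s, p, v in triples:
        if p == prop:
            owners[v].add(s)
    return {v: sorted(ss) for v, ss in owners.items() if len(ss) > 1}


print(missing_property(TRIPLES, "ex:population"))
print(out_of_range(TRIPLES, "ex:population", 0, 10**8))
print(syntax_violations(TRIPLES, "ex:zip", r"\d{5}"))
print(duplicate_values(TRIPLES, "ex:name"))
```

Each function returns the violating subjects (or values) rather than a boolean, which mirrors how a constraint query returns the offending instances.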
15. Definition of Data Quality Constraints based on SPIN
17. Legal Value Constraints
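Return all instances of class vcard:Address whose value for property
vcard:country-name has no matching value for property tref:country in the
trusted reference (prefix declarations are omitted on the slide):
SELECT ?s
WHERE {
  ?s a vcard:Address .
  ?s vcard:country-name ?value .
  # Look for a trusted location with the same country string
  OPTIONAL {
    ?s2 a tref:Location .
    ?s2 tref:country ?value1 .
    FILTER(str(?value1) = str(?value))
  }
  # Keep only addresses for which no match was found
  FILTER(!bound(?value1))
}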
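The query above can be read as an anti-join: keep every tested value that has no equal in the trusted reference. A minimal Python sketch of that logic follows, assuming both sides have already been loaded into memory. The function name illegal_values and the sample data are hypothetical; the string comparison mirrors the str() comparison in the query.

```python
def illegal_values(tested, trusted_values):
    """Return subjects whose value has no string-equal match in the trusted set.

    tested: iterable of (subject, value) pairs, e.g. vcard:country-name values.
    trusted_values: iterable of legal values, e.g. tref:country values.
    """
    legal = {str(v) for v in trusted_values}
    return [s for s, v in tested if str(v) not in legal]


# Hypothetical sample data.
addresses = [("ex:a1", "Germany"), ("ex:a2", "Portugal"), ("ex:a3", "Germny")]
trusted = ["Germany", "Portugal", "France"]

violations = illegal_values(addresses, trusted)
print(violations)
```

Like the query, the sketch only inspects subjects that actually carry a value; a missing country name would have to be caught by a separate mandatory-property check.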
18. Functional Dependency Constraints
Return all instances of gr:LocationOfSalesOrServiceProvisioning whose
vcard:ADR address has a city-country combination without a matching pair
in instances of gn:Location.
SELECT ?s
WHERE {
  ?s a gr:LocationOfSalesOrServiceProvisioning .
  ?s vcard:ADR ?node .
  ?node vcard:city ?value1 .
  ?node vcard:country ?value2 .
  # SPARQL 1.1 negation: no trusted location has the same city and country
  FILTER NOT EXISTS {
    ?s2 a gn:Location .
    ?s2 gn:asciiname ?value1 .
    ?s2 gn:country ?value2 .
  }
}
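The difference from a legal value check is that the pair is tested as a whole. A minimal Python sketch under that reading follows; the function name fd_violations and the sample data are hypothetical, and the trusted pairs stand in for what a reference such as gn:Location would supply.

```python
def fd_violations(tested, trusted_pairs):
    """Return subjects whose (city, country) pair is absent from the trusted reference.

    tested: iterable of (subject, city, country) tuples.
    trusted_pairs: iterable of (city, country) tuples considered correct.
    """
    known = set(trusted_pairs)
    return [s for s, city, country in tested if (city, country) not in known]


# Hypothetical sample data.
tested = [
    ("ex:loc1", "Las Vegas", "USA"),
    ("ex:loc2", "Las Vegas", "France"),
]
trusted = [("Las Vegas", "USA"), ("Paris", "France")]

violations = fd_violations(tested, trusted)
print(violations)
```

In this sample, "Las Vegas" and "France" would each pass a legal value check on their own, because both appear somewhere in the trusted data; only the combined lookup flags ex:loc2.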
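19. Acquisition of Semantic Web Sources for DQM
(1) Replication of relevant knowledge bases
(2) On the fly via federated SPARQL queries:
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# The default prefix ":" is not declared on the slide; it must point to the local vocabulary
SELECT *
WHERE {
  ?s1 :location_CITY ?city .
  # Look up a DBpedia city with the same English label;
  # the local ?city literal must carry the @en tag for the join to match
  OPTIONAL {
    SERVICE <http://dbpedia.org/sparql> {
      ?s2 a dbo:City .
      ?s2 rdfs:label ?city .
      FILTER (lang(?city) = "en")
    }
  }
  # Keep only local cities for which no match was found
  FILTER(!bound(?s2))
}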
20. Limitations
• High degree of uncertainty about quality of Semantic
Web resources
• Risk for data quality problem proliferation
• Lack of Semantic Web resources for certain domains
• Flexible design of RDF and structural heterogeneity
complicate definition of generic DQ constraints
• Scalability on large data sets
• DQ constraints close the world, i.e. they apply a closed-world assumption to open-world Semantic Web data
21. Contributions
• Data quality control for Semantic Web data
• Identification of potential inconsistencies
between Semantic Web Resources
• Reduction of effort for the definition of functional
dependency rules and legal value rules
• Reuse of shared data quality rules on a Web
scale
22. Future Work
• Semantic Web information quality assessment
framework (SWIQA) with computation of KPIs
• Analysis and identification of useful "trusted
references" based on SWIQA
• Application on multi-source master data of
information systems
• Evaluation on large data sets
23. Data Quality Constraints Library for SPIN @
http://semwebquality.org/ontologies/dq-constraints#
Christian Fürber
Researcher
E-Business & Web Science Research Group
Werner-Heisenberg-Weg 39
85577 Neubiberg
Germany
skype c.fuerber
email christian@fuerber.com
web http://www.unibw.de/ebusiness
homepage http://www.fuerber.com
twitter http://www.twitter.com/cfuerber
Paper available at http://bit.ly/c5v6TM