This document proposes using a database approach to monitor the quality of information stored in RDF databases. It discusses representing semantic properties of databases using integrity constraints and maintaining correctness through integrity enforcement and truth maintenance. It also discusses modeling constraints like marriage being between one man and one woman in a relational database using techniques beyond basic checks like primary and foreign keys.
A database approach to monitoring the quality of information in RDF stores
1. A DATABASE APPROACH TO MONITORING THE
QUALITY OF INFORMATION IN RDF STORES
Alexandre Rademaker and Edward Hermann
Wednesday, November 30, 11
2. NOTES
This is not a research report, this is a research
proposal!
Let us start by looking at results from database
researchers.
3. WHAT IS (ENSURE) DATA QUALITY?
Semantic properties of databases can be represented
by integrity constraints!
Integrity enforcement means maintaining the correctness
of the database. Truth maintenance!
Hendrik, 2011
4. HENDRIK DECKER
http://web.iti.upv.es/~hendrik/
Universidad Politécnica de Valencia
5. EXAMPLE
A marriage is between one man and one woman only.
How can we model such a constraint in a relational
DB?
We are talking about more than check constraints,
foreign keys and primary keys.
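One way to read the question above: the rule is a condition over several rows and tables, so it can be written as a query that returns the violating rows. The sketch below uses Python's built-in sqlite3 module. The schema (`person`, `marriage`, `spouse_a`, `spouse_b`) and the sample rows are assumptions made for illustration; they are not from the talk.

```python
import sqlite3

def marriage_violations(conn):
    """Return the ids of marriages that break 'one man and one woman only'."""
    query = """
        SELECT m.id
        FROM marriage AS m
        JOIN person AS a ON a.id = m.spouse_a
        JOIN person AS b ON b.id = m.spouse_b
        WHERE a.sex = b.sex                      -- not one man and one woman
           OR EXISTS (                           -- a spouse is in another marriage
                SELECT 1 FROM marriage AS o
                WHERE o.id <> m.id
                  AND (o.spouse_a IN (m.spouse_a, m.spouse_b)
                    OR o.spouse_b IN (m.spouse_a, m.spouse_b)))
        ORDER BY m.id
    """
    return [row[0] for row in conn.execute(query)]

# Hypothetical schema and data: keys and CHECK alone accept all three marriages.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY,
                         sex TEXT CHECK (sex IN ('M', 'F')));
    CREATE TABLE marriage (
        id INTEGER PRIMARY KEY,
        spouse_a INTEGER REFERENCES person(id),
        spouse_b INTEGER REFERENCES person(id));
    INSERT INTO person VALUES (1,'M'), (2,'F'), (3,'M'), (4,'F'), (5,'M');
    INSERT INTO marriage VALUES (10, 1, 2), (11, 3, 5), (12, 3, 4);
""")
violations = marriage_violations(conn)
```

Marriage 11 joins two men, and person 3 appears in both 11 and 12, so the query reports 11 and 12 while 10 passes.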
6. DB THEORY USES DATALOG
Datalog is more expressive than (non-recursive) SQL:
it can express transitive closure
SQL is FOL (decidable over finite models)
SELECT X WHERE Y (give me the bindings that satisfy
the clauses)
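The transitive-closure remark can be made concrete with the two usual Datalog rules, `path(X, Y) :- edge(X, Y).` and `path(X, Z) :- path(X, Y), edge(Y, Z).` The sketch below evaluates them bottom-up until nothing new is derived. The `edges` relation is made-up sample data, and the naive fixpoint loop is only one possible evaluation strategy.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of
         path(X, Y) :- edge(X, Y).
         path(X, Z) :- path(X, Y), edge(Y, Z).
    """
    path = set(edges)                      # first rule: every edge is a path
    while True:
        # second rule: extend every known path by one edge
        derived = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if derived <= path:                # fixpoint reached, nothing new
            return path
        path |= derived

# Hypothetical sample relation
edges = {("a", "b"), ("b", "c"), ("c", "d")}
closure = transitive_closure(edges)
```

The loop repeats the same join an unbounded number of times, which is what a single first-order query without recursion cannot do.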
7. TWO WAYS TO ENFORCE INTEGRITY
At each update, check whether any integrity constraint is
violated. (Not always checked rigorously, due to the
performance penalty.)
Repair extant constraint violations. (Accumulation
of inconsistency is inevitable.)
Hendrik, 2011
8. INCONSISTENCY-TOLERANT METHODS
The rigorous way is to eliminate all inconsistency: repair
the whole database.
Relaxation... partial (flexible) repairs!
Absolute consistency is out of the question
due to its intractability!
Hendrik, 2011
9. FLEXIBILITY OF PARTIAL INCONSISTENCY
Flexibility served in two ways:
Integrity enforcement is more flexible. It doesn’t have to
be done all at once. (Constraint violations can be
tolerated and resolved at an appropriate moment.)
Some inconsistency may be unknown at update time.
A total approach would fail in such a situation.
But...
Hendrik, 2011
10. PARTIAL REPAIRS
Absolute consistency is out of the question due to its intractability.
But naive inconsistency-tolerant repairs can be data-destructive.
For a rational flexible repair strategy, one needs criteria
(expressed in terms of metrics).
Only admit repairs that are integrity-preserving! That is, the total
amount of integrity violation must not increase after the repair.
Hendrik, 2011
11. FORMAL DEFINITIONS
For an update U (inserts, deletes) of a database D, we
denote by $D^U$ the updated database.
D = database
IC = integrity theory
I = constraint
U = update
D(F) = true if F evaluates to true in D
D(I) = true if I is satisfied in D
D(IC) = true if every I in IC is
satisfied in D
Hendrik, 2011
12. FORMAL DEFINITIONS
Let $\preceq$ be an ordering that is antisymmetric, reflexive and transitive.
For two elements A and B of a lattice, $A \oplus B$ is their least upper bound.
Hendrik, 2011
13. FORMAL DEFINITIONS
We say that $(\mu, \preceq)$ is an inconsistency metric if
$\mu$ maps tuples (D, IC) to some lattice that is partially ordered by $\preceq$.
A simple example of a metric is given by $\beta(D, IC) = D(IC)$,
with the natural order $true \prec false$ on the range of $\beta$.
That is, integrity satisfaction, D(IC) = true,
means lower inconsistency than integrity violation,
D(IC) = false.
Non-trivial examples are given by comparing or
counting violated constraints.
Hendrik, 2011
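The two kinds of metric mentioned above can be sketched in a few lines: the binary one that just returns D(IC), and one that counts violated constraints. The toy database (a set of (husband, wife) pairs), the two constraints and all function names are assumptions made for illustration.

```python
def satisfies(db, constraints):
    """D(IC): True iff every constraint in IC holds in D (the binary metric)."""
    return all(holds(db) for holds in constraints.values())

def beta_leq(a, b):
    """Order on the range of the binary metric: true precedes false."""
    return a or not b

def count_metric(db, constraints):
    """Number of violated constraints, ordered by the usual <= on integers."""
    return sum(1 for holds in constraints.values() if not holds(db))

# Hypothetical constraints over a set of (husband, wife) pairs
def no_self_marriage(db):
    return all(h != w for h, w in db)

def monogamy(db):
    people = [p for pair in db for p in pair]
    return len(people) == len(set(people))

IC = {"no_self_marriage": no_self_marriage, "monogamy": monogamy}
clean = {("adam", "eve")}
dirty = {("adam", "eve"), ("adam", "lilith"), ("bob", "bob")}

clean_is_lower = beta_leq(satisfies(clean, IC), satisfies(dirty, IC))
dirty_count = count_metric(dirty, IC)
```

The binary metric only says that `dirty` is inconsistent; the counting metric also says how much, which is what makes partial repairs comparable.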
14. INCONSISTENCY METRICS
Inconsistency metrics are used to decide whether an update preserves
integrity, that is, doesn’t create an integrity violation that
didn’t exist before the update.
Intuitively, an update preserves integrity if it doesn’t increase
the measured inconsistency.
For a metric $(\mu, \preceq)$, an update U of a database D
with integrity theory IC is integrity-preserving with
regard to $(\mu, \preceq)$ if $\mu(D^U, IC) \preceq \mu(D, IC)$.
Hendrik, 2011
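A minimal sketch of the integrity-preservation test, assuming a metric whose values are sets of violating individuals ordered by set inclusion. The database of (husband, wife) pairs, the single monogamy constraint baked into `mu`, and the function names are hypothetical.

```python
from collections import Counter

def apply_update(db, inserts=(), deletes=()):
    """D^U: the database D after removing `deletes` and adding `inserts`."""
    return (set(db) - set(deletes)) | set(inserts)

def mu(db):
    """People who appear in more than one marriage fact.
    Values live in a power-set lattice ordered by set inclusion."""
    counts = Counter(p for pair in db for p in pair)
    return {p for p, n in counts.items() if n > 1}

def preserves_integrity(db, inserts=(), deletes=()):
    """True iff mu(D^U) is below or equal to mu(D), i.e. a subset of it."""
    return mu(apply_update(db, inserts, deletes)) <= mu(db)

# D is already inconsistent: adam is married twice.
D = {("adam", "eve"), ("adam", "lilith"), ("carl", "dora")}
ok_insert = preserves_integrity(D, inserts={("fred", "gina")})
bad_insert = preserves_integrity(D, inserts={("carl", "hana")})
partial_repair = preserves_integrity(D, deletes={("adam", "lilith")})
```

The first insert is accepted even though D is inconsistent, because it adds no new violation; the second is rejected because it makes carl a new violator; the delete is a repair that lowers the measured inconsistency.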
15. AND MORE...
Inconsistency-tolerant integrity checking
Repairs
Computing and checking partial repairs
Computing integrity-preserving repairs
Hendrik, 2011
16. WHY ARE WE TALKING ABOUT IT?
17. WHY ARE WE TALKING ABOUT IT?
Lattes@FGV Project (a unified KB of FGV research
publications, researchers, skills etc.), http://dck092.fgv.br/
The Semantic Web brings RDF, description logics, linked data etc.
Our research topics include logics and knowledge
representation.
RDF is the key concept of the Semantic Web
A relational DB has a fixed model (the TBox of an ontology)
18. TOPOS: THEORETICAL PART
scratching the surface!
A topos (plural topoi or toposes) is a category with a quite expressive internal logic.
The category of graphs and graph homomorphisms can be viewed as a topos.
This topos already has a Heyting algebra that is used as the truth basis of its internal logic.
A Heyting algebra is a lattice with additional properties. This topos-theoretic view of RDF
stores can be investigated as a natural foundation for
partial repairs in RDF stores.
Besides that, if we view traditional DBs as finite first-order logical structures, the category
of (finite) first-order structures and homomorphisms between them has its own internal
logic. This internal logic can also be investigated with regard to partial repairs.
22. LATTES@FGV: THE RDF KB
http://dck092.fgv.br:10035/repositories/fgv (800k triples)
23. LATTES@FGV
480 Lattes CVs plus data collected from other sources (Qualis,
Digital Library etc.) in one triple store
Lots of errors (inconsistencies) for different reasons: poor user
interface for data input, misinterpretation etc.
How to identify the errors? (in a non-ad-hoc manner)
How to fix what can be fixed automatically?
24. INTEGRITY CONSTRAINTS IN RDF
We can consider extending what was discussed so far to
non-SQL stores
A KR/DB can be viewed as a graph
The query language of RDF-based stores, SPARQL, can be
used to provide semantics to the store.
25. EXAMPLES
An article referenced by a CV
must have the author of this CV as
one of its authors!
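This constraint can be checked mechanically once the store is seen as a set of (subject, predicate, object) triples. The sketch below does so in plain Python, without a triple store. The predicate names `ex:owner` and `ex:references` and the sample resources are hypothetical, since the slides do not give the Lattes@FGV vocabulary.

```python
# Hypothetical predicate names, not taken from the Lattes@FGV store
CV_OWNER = "ex:owner"            # CV -> researcher
REFERENCES = "ex:references"     # CV -> article
CREATOR = "dc:creator"           # article -> author

def cv_author_violations(triples):
    """(cv, article) pairs where the CV's owner is not an author of the article."""
    owners = {s: o for s, p, o in triples if p == CV_OWNER}
    authors = {}
    for s, p, o in triples:
        if p == CREATOR:
            authors.setdefault(s, set()).add(o)
    return sorted(
        (cv, art)
        for cv, p, art in triples
        if p == REFERENCES and owners.get(cv) not in authors.get(art, set())
    )

# Made-up sample data
triples = {
    ("cv1", CV_OWNER, "ana"),
    ("cv1", REFERENCES, "art1"),
    ("cv1", REFERENCES, "art2"),
    ("art1", CREATOR, "ana"),
    ("art1", CREATOR, "bruno"),
    ("art2", CREATOR, "bruno"),
}
violations = cv_author_violations(triples)
```

Returning the violating pairs, rather than a yes/no answer, is what allows violations to be counted or compared by an inconsistency metric.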
26. EXAMPLES
If two resources were identified by
reference to the same article, every
author of the first one should also
be related to the second one!
27. IN THE LAST EXAMPLE
Of course, two publications cannot be considered
the same by comparing only their titles!
We need entity alignment, similarity checker...
Suppose we have identified all resources that
represent the same real “entity” using
owl:sameAs, then ...
ask {
  ?p1 owl:sameAs ?p2 ;
      dc:creator ?c .
  OPTIONAL {
    ?p2 ?rel ?c .
  }
  FILTER( !bound(?rel) )
}
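To make the logic of the query explicit, here is a plain-Python emulation over a set of triples. It is not a SPARQL engine and it applies no owl:sameAs reasoning; it only matches triples as written. The sample resources and the function names are made up.

```python
SAME_AS = "owl:sameAs"
CREATOR = "dc:creator"

def missing_creator_links(triples):
    """(p1, p2, c) with p1 owl:sameAs p2 and p1 dc:creator c,
    where no triple at all links p2 to c (the unbound ?rel case)."""
    linked = {(s, o) for s, _, o in triples}
    return sorted(
        (p1, p2, c)
        for p1, pred, p2 in triples if pred == SAME_AS
        for s, pred2, c in triples if pred2 == CREATOR and s == p1
        if (p2, c) not in linked
    )

def ask(triples):
    """Like the ASK query: True iff at least one violation exists."""
    return bool(missing_creator_links(triples))

# Made-up sample data: pub2 lacks bruno, who is a creator of pub1
triples = {
    ("pub1", SAME_AS, "pub2"),
    ("pub1", CREATOR, "ana"),
    ("pub1", CREATOR, "bruno"),
    ("pub2", CREATOR, "ana"),
}
found = missing_creator_links(triples)
repaired = triples | {("pub2", CREATOR, "bruno")}
```

Adding the missing triple, as in `repaired`, is one possible repair: it removes the violation without deleting any data.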