1. Live Linked Data
Making Linked Data writable with eventual consistency guarantees
Luis-Daniel Ibáñez, Nantes University, GDD team
Luis-Daniel Ibáñez, Hala Skaf-Molli, Pascal Molli, and Olivier Corby. Synchronizing semantic stores with commutative replicated data types. In Semantic Web Collaborative Spaces (SWCS) @ WWW'12
Luis-Daniel Ibáñez, Hala Skaf-Molli, Pascal Molli, and Olivier Corby. Live Linked Data: Synchronizing semantic stores with Commutative Replicated Data Types. Special Issue on Resource Discovery, IJMSO (to appear)
2. Linked Data
Autonomous participants agree on a set of principles for publishing RDF data, allowing its querying and browsing across distributed servers.
In 2011, more than 30 billion RDF triples distributed between ~700 autonomous participants¹.
¹ Billion Triple Challenge + Linked Open Data metadata
T. Käfer et al., Towards a Dynamic Linked Data Observatory, LDOW @ WWW'12
5. Here is a list of links created by LinkedMDB that relate the movie "The Shining" to other open datasets:
owl:sameAs to "The_Shining_(film)" on DBpedia
owl:sameAs to "The_Shining_(film)" on YAGO
dbpedia:hasPhotoCollection to the movie's photos on flickr™ wrappr
movie:relatedBook to the movie's related book on RDF Book Mashup
owl:sameAs from one of the music composers of the movie, Béla Bartók, to MusicBrainz: http://zitgist.com/music/artist/fd14da1b-3c2d-4cc8-9ca6-fc8c62ce6988
rdfs:seeAlso from one of the writers of the movie, Stanley Kubrick, to RDF Book Mashup: http://www4.wiwiss.fu-berlin.de/bookmashup/persons/Stanley+Kubrick
foaf:based_near to the locations of the movie in the US and GB: http://sws.geonames.org/2635167/ and http://sws.geonames.org/6252001/
movie:language to the language of the movie on lingvoj.org: http://www.lingvoj.org/lingvo/en
Source: http://wiki.linkedmdb.org/Main/Interlinking
6. Querying Linked Data
Freshness vs. Performance
[Figure: "My Super Cool Query Answerer about Venezuela" drawing on Dbpedia English, Dbpedia Spanish, Freebase, and Wolfram. Local copying → not fresh. Distributed querying → performance and reliability issues.]
7. Querying Linked Data
Data Quality
[Figure: "What is the area and population of Girardot Municipality?" asked of five sources. All different =S … All wrong!!!
Dbpedia English: 302 km², 439,947 hab.
Dbpedia Spanish: 311 km², 407,577 hab.
Freebase: unknown, 439,947 hab.
Wolfram: 302 km², 396,095 hab.
Open Data Venezuela: 314 km², 407,109 hab.]
9. Live Linked Data Approach (1)
SPARQL Update 1.1 allows updating RDF data locally.
Update feeds provide streams of updates (e.g. DBpedia Live) → allow data freshness.
Processing is local.
Writing is "indirect": only if the others decide to consume my stream can I update their datasets.
10. Live Linked Data
« Follow your change »: Open Data Venezuela publishes its curation as a SPARQL Update stream ("Would you follow my changes? They are official… ;)") and the other stores consume it ("Great!"):

DELETE {
  Dbpediaen/Girardot area ?a1 .
  Dbpediaen/Girardot popul ?p1 .
  ?ent area ?a2 .
  ?ent popul ?p2 . }
WHERE { ?ent sameAs Dbpediaen/Girardot } ;
INSERT DATA {
  ODVenezuela/Girardot area 314 .
  ODVenezuela/Girardot popul 407109 .
  ODVenezuela/Girardot sameAs Dbpediaen/Girardot .
  ODVenezuela/Girardot sameAs Freebase/Girardot … }

[Figure: the stores involved:
Dbpedia English: Dbpediaen/Girardot area 302 ; Dbpediaen/Girardot popul 439947 ; Dbpediaen/Girardot sameAs Dbpediaes/Girardot ; Dbpediaen/Girardot sameAs Freebase/Girardot …
Dbpedia Spanish: Dbpediaes/Girardot area 311 ; Dbpediaes/Girardot popul 407577
Freebase: Freebase/Girardot area 308 ; Freebase/Girardot area "null" ; Freebase/Girardot popul 439947 ; ODVenezuela/Girardot sameAs Dbpediaen/Girardot ; ODVenezuela/Girardot sameAs Freebase/Girardot …
Wolfram: Wolfram/Girardot area 302 ; Wolfram/Girardot popul 396095
Open Data Venezuela: ODVenezuela/Girardot area 314 ; ODVenezuela/Girardot popul 407109]

SPARQL Update stream consumption → Freshness
Local processing → Performance
SPARQL Update stream publishing → Write enabling
12. Live Linked Data Overview
A social network of Linked Data participants based on a « follow your change » relation.
Makes Linked Data a consistent « read/write » space: from Linked Data 1.0 to Linked Data 2.0.
Creates assemblies of datasets and enables a « synchronize and search » paradigm: between the warehousing approach and the distributed-queries approach.
13. The Forgotten Issue: Consistency
Synchronizing semantic sources is not new…
RDFSync
sparqlPuSH (PubSubHubbub)
ResourceSync…
But none of them addresses consistency when concurrent operations or out-of-order network delivery occur.
Example: Source1 executes Insert DATA {a b c} followed by Delete DATA {a b c}, while Source2 concurrently executes Insert DATA {a b c}. Depending on the order in which each source receives the others' operations, one ends with {a b c} and the other without it: the output depends on the order of reception.
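The divergence above can be reproduced with plain sets (a minimal Python sketch; the operation encoding is mine, for illustration only):

```python
# Two replicas of a naive triple set receive the same three updates in
# different orders. With plain set semantics the final states differ,
# so the outcome depends on the order of reception.

def apply(state, op):
    """Apply an insert/delete of a ground triple to a plain set."""
    kind, triple = op
    return state | {triple} if kind == "insert" else state - {triple}

ops = [("insert", ("a", "b", "c")),   # Source1: Insert DATA {a b c}
       ("delete", ("a", "b", "c")),   # Source1: Delete DATA {a b c}
       ("insert", ("a", "b", "c"))]   # Source2: Insert DATA {a b c}

source1 = set()
for op in ops:                        # receives the delete in the middle
    source1 = apply(source1, op)

source2 = set()
for op in (ops[0], ops[2], ops[1]):   # receives the delete last
    source2 = apply(source2, op)

print(source1)  # {('a', 'b', 'c')}
print(source2)  # set()
```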
14. Keeping Consistent
Consistency = convergence + intention preservation.
Strong consistency:
Requires blocking. Expensive.
Eventual consistency:
Apply operations without blocking.
If a conflict is detected, resolve it.
Conflict resolution can be very expensive at large scale in the network's size.
If the type we are dealing with is conflict-free, no worries about conflict detection and resolution.
15. Conflict-Free Replicated Data Types¹
An abstract data type where the application of all operations over any state commutes:
S ∘ opi ∘ opj = S ∘ opj ∘ opi ⇒ convergence.
Caveat 1: intention preservation must be checked separately.
Caveat 2: should be useful, by itself or by corresponding to the concurrent semantics of a useful type (set, graph…).
Caveat 2.1: should be affordable.
Enables massive optimistic (eventual) replication.
¹ M. Shapiro, N. Preguiça, C. Baquero, M. Zawirski. Conflict-free Replicated Data Types. SSS 2011.
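A toy illustration of the commutativity condition, using the simplest possible CRDT, a grow-only set of triples (example mine, not from the paper):

```python
# In a grow-only set the only operation is insert, and inserts commute:
# applying the same operations in every possible order over the same
# initial state yields exactly one final state, i.e. convergence.
import itertools

def insert(state, element):
    return state | {element}

ops = [("s", "p", "o1"), ("s", "p", "o2"), ("s", "p", "o3")]

results = set()
for order in itertools.permutations(ops):
    state = frozenset()
    for triple in order:
        state = frozenset(insert(state, triple))
    results.add(state)

print(len(results))  # 1 -> all six orders converge to the same state
```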
16. Enabling LLD with CRDTs
Construct a CRDT for RDF graph stores with SPARQL Update 1.1 operations, with minimal overhead under Live Linked Data conditions:
Unknown but steadily growing number of autonomous participants.
No central control (no coordination, primary replicas, or small core of datacenters).
The « follow your change » relation induces the network.
Dynamicity parameters:
Degree of change / change frequency varies between producers.
Growth rate is positive: knowledge grows.
17. RDF Datasets
Assume there are pairwise disjoint infinite sets I, B, and L (IRIs, blank nodes, and literals, respectively). A triple (s,p,o) ∈ (I∪B)×I×(I∪B∪L) is called an RDF triple. In this tuple, s is the subject, p the predicate, and o the object¹.
An RDF graph is a set of RDF triples¹.
An RDF dataset is a set { G, (<u1>, G1), (<u2>, G2), …, (<un>, Gn) } where G and each Gi are graphs, and each <ui> is a distinct IRI. G is called the "default graph" and the pairs (<ui>, Gi) are called "named graphs"².
1 J. Pérez, M. Arenas and C. Gutiérrez, Semantics and Complexity of SPARQL, ACM
Transactions on DB Systems, 2009
2 http://www.w3.org/TR/sparql11-query/#sparqlDataset
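These definitions map directly onto basic data structures; a minimal Python encoding (names and the dict layout are mine, for illustration only):

```python
# A triple is an (s, p, o) tuple, an RDF graph is a set of triples, and
# an RDF dataset is a default graph plus named graphs keyed by their
# (necessarily distinct) IRIs.

triple = ("ex:Girardot", "ex:area", "314")      # an RDF triple (s, p, o)
default_graph = {triple}                        # an RDF graph: a set of triples

dataset = {
    "default": default_graph,                   # the default graph G
    "named": {                                  # the pairs (<u_i>, G_i)
        "http://example.org/population": {
            ("ex:Girardot", "ex:popul", "407109"),
        },
    },
}

# Dict keys guarantee that each named-graph IRI <u_i> is distinct.
print(len(dataset["named"]))  # 1
```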
18. SPARQL Update Queries
Syntax → formal operation:

INSERT DATA QuadData
→ Dataset-UNION(GS, Dataset(QuadData, {}, GS, GS))

DELETE DATA QuadData
→ Dataset-DIFF(GS, Dataset(QuadData, {}, GS, GS))

DELETE QuadPatternDEL INSERT QuadPatternINS WHERE GroupGraphPattern
→ Dataset-UNION(Dataset-DIFF(GS, Dataset(QuadPatternDEL, GroupGraphPattern, DS, GS)), Dataset(QuadPatternINS, GroupGraphPattern, DS, GS))

DELETE QuadPatternDEL WHERE GroupGraphPattern
→ Dataset-UNION(Dataset-DIFF(GS, Dataset(QuadPatternDEL, GroupGraphPattern, DS, GS)), Dataset({}, GroupGraphPattern, DS, GS))

INSERT QuadPatternINS UsingClause* WHERE GroupGraphPattern
→ Dataset-UNION(Dataset-DIFF(GS, Dataset({}, GroupGraphPattern, DS, GS)), Dataset(QuadPatternINS, GroupGraphPattern, DS, GS))

LOAD (SILENT)? IRIref
→ Dataset-UNION(GS, { graph(documentIRI) })

CLEAR (SILENT)? DEFAULT
→ {{}} union {(irii, Gi) | 1 ≤ i ≤ n}

CREATE (SILENT)? IRIref
→ OpCreate(GS, iri) = GS union {(iri, {})} if iri not in graphNames(GS); otherwise GS

DROP (SILENT)? IRIref
→ OpDrop(GS, irij) = {DG} union {(irii, Gi) | i ≠ j and 1 ≤ i ≤ n} if irij in graphNames(GS); otherwise GS

http://www.w3.org/TR/sparql11-update/#formalModel
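As a sanity check, the first two rows reduce to set union and set difference over ground triples; a simplified single-graph sketch (ignoring named graphs; function names are mine):

```python
# INSERT DATA is a set union with the ground triple data and DELETE DATA
# a set difference, a single-graph reading of Dataset-UNION/Dataset-DIFF.

def insert_data(graph, ground_triples):
    return graph | ground_triples     # Dataset-UNION(GS, ...)

def delete_data(graph, ground_triples):
    return graph - ground_triples     # Dataset-DIFF(GS, ...)

g = set()
g = insert_data(g, {("ex:Girardot", "ex:area", "314"),
                    ("ex:Girardot", "ex:popul", "407109")})
g = delete_data(g, {("ex:Girardot", "ex:area", "314")})
print(g)  # {('ex:Girardot', 'ex:popul', '407109')}
```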
21. Enabling Intention Preservation in LLD
The observed effect of SPARQL update queries at generation time should be preserved at re-execution time…
If a SPARQL query inserts a triple at generation time, this triple will be added at all sites, whatever the state of the store at re-execution time…
Intentions can be impossible to preserve (non-deterministic operations) or impossible without more order constraints (re-execution-time evaluation):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
WITH <http://example/addresses>
DELETE { ?person foaf:givenName 'Bill' }
INSERT { ?person foaf:givenName 'William' }
WHERE { ?person foaf:givenName 'Bill' }

PREFIX dc: <http://purl.org/dc/elements/1.1/>
INSERT DATA {
  _:book1 dc:title "A new book" ;
          dc:creator "A.N.Other" .
}
22. A CRDT for SPARQL Update
Rewrite SPARQL Update operations to another type that does commute, while preserving the original semantics.
Graph update operations work over RDF graphs, and can be expressed in terms of the basic INSERT and DELETE of ground triples.
RDF graphs are sets of RDF triples, and CRDTs for sets with insert and delete already exist…
23. SU-Set: A SPARQL Update CRDT
One id per operation: the same id for all triples inserted together; unicity is preserved, and insertions are less expensive in communication.
In exchange, the (triple, id) pairs must be sent when deleting, making deletions more expensive in communication.
Better for LLD, as knowledge grows (insertions dominate)…
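A sketch of the SU-Set idea in its "one id per operation" variant (class and method names are mine, not the paper's, and the example assumes, as in the paper's setting, that a delete is delivered after the insert it observed):

```python
# INSERT tags all triples of one operation with a single fresh id,
# DELETE ships the (triple, id) pairs observed locally, and the
# downstream applications are plain union/difference, hence commute.
import uuid

class SUSet:
    def __init__(self):
        self.pairs = set()                       # set of (triple, id) pairs

    # Source-local halves: build the payload broadcast to followers.
    def prepare_insert(self, triples):
        op_id = uuid.uuid4().hex                 # same fresh id for the whole op
        return {(t, op_id) for t in triples}

    def prepare_delete(self, triples):
        # every (triple, id) pair currently observed for the deleted triples
        return {(t, i) for (t, i) in self.pairs if t in triples}

    # Downstream halves: commutative set union / difference.
    def apply_insert(self, payload):
        self.pairs |= payload

    def apply_delete(self, payload):
        self.pairs -= payload

    def lookup(self):
        return {t for (t, _) in self.pairs}

t = ("Girardot", "area", "314")
a, b = SUSet(), SUSet()

ins_a = a.prepare_insert({t}); a.apply_insert(ins_a)   # A inserts t
ins_b = b.prepare_insert({t}); b.apply_insert(ins_b)   # B concurrently inserts t
b.apply_insert(ins_a)                                  # A's insert reaches B
del_a = a.prepare_delete({t}); a.apply_delete(del_a)   # A deletes: sees only its own pair

# The remaining remote operations commute: B applies A's delete, A
# applies B's concurrent insert, and both converge with B's insert alive.
b.apply_delete(del_a)
a.apply_insert(ins_b)
print(a.pairs == b.pairs, a.lookup())  # True {('Girardot', 'area', '314')}
```

Note the concurrent-insert semantics: A's delete names only the ids it observed, so B's concurrent insertion of the same triple survives at both replicas.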
25. Implementation
SU-Set embedded into Corese, a reference implementation of RDF stores and SPARQL Update by the Wimmics team.
Rewrite SPARQL Update into SU-Set operations, execute, and log.
« Follow your change » implemented with anti-entropy over logs and version vectors.
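The anti-entropy step can be sketched as follows (hypothetical minimal structures, not the actual LLD-Corese code): each site logs operations as (site, counter) entries, and a follower's version vector tells the publisher which entries it still misses.

```python
# Ship a follower every log entry whose per-site counter exceeds what
# the follower's version vector records as already seen.

def missing_entries(publisher_log, follower_vv):
    """publisher_log: list of (site, counter, op);
    follower_vv: dict mapping site -> highest counter seen."""
    return [(site, n, op) for (site, n, op) in publisher_log
            if n > follower_vv.get(site, 0)]

log = [("dbpedia-en", 1, "insert ..."),
       ("dbpedia-en", 2, "delete ..."),
       ("odvenezuela", 1, "insert ...")]
vv = {"dbpedia-en": 1}      # follower has seen dbpedia-en up to counter 1

print(missing_entries(log, vv))
# [('dbpedia-en', 2, 'delete ...'), ('odvenezuela', 1, 'insert ...')]
```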
26. How Much Do We Need to Pay for Eventual Consistency?
Time overhead:
Adding an id to each element is linear.
Selection and lookup are not affected by many pairs with the same triple.
Rounds and number-of-messages overhead:
Convergence after one round, one message per operation → optimal.
27. Validation – Space Overhead
32 bytes per 1 billion triples =
32 GB → 1 Ipod
Two UUIDs, 16 bytes each
Semantic Stores already use
an internal id → Reuse it
(UUID1,UUID2)
Extra pairs produced by
concurrent insertions could
cause problems...
Counter
Site identifier
A version vector for each
(monotonically
(Useful for authoring too)
site: Max size= number of
increasing)
participants (~300-800).
28. Dynamicity: DBpedia Live
A framework to register changes in DBpedia in near real-time.
Generates one file with inserted triples and one with deleted triples approximately every 10 seconds.
No pattern operations logged → no overhead here.
Many more insertions than deletions → good for SU-Set.
Many triples inserted per operation → even better for SU-Set.
29. SU-Set Communication Overhead on DBpedia Live
7 days of streaming; no concurrent insertions.
Id size¹: 0.096 KB. Average triple size¹: 0.155 KB.

Operation | # of operations | # of triples | Size without ids (MB) | 1 id per triple (MB) | 1 id per operation (MB)
Inserts   | 21957 | 21762190 | 3294.08 | 5334.29 | 3296.6
Deletes   | 21957 | 1755888  | 265.78  | 164.61  | 430.4
Overhead  |       |          |         | 54.47%  | 4.68%

¹ Measured with 'wc -c' over DBpedia Live log files.
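The overhead row can be re-derived from the size columns: total 7-day traffic (inserts + deletes) with ids versus without ids. The per-operation figure lands near the reported 4.68%, the small gap presumably coming from rounded inputs.

```python
# Relative communication overhead of each id scheme over the raw stream.

without_ids   = 3294.08 + 265.78    # inserts + deletes, MB
id_per_triple = 5334.29 + 164.61    # MB
id_per_op     = 3296.6  + 430.4     # MB

overhead_triple = (id_per_triple - without_ids) / without_ids * 100
overhead_op     = (id_per_op     - without_ids) / without_ids * 100

print(round(overhead_triple, 2))   # ~54.47
print(round(overhead_op, 2))       # ~4.7
```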
30. Conclusion
Live Linked Data makes Linked Data "writable" and allows a new query paradigm.
SU-Set is a CRDT for RDF graphs updated with SPARQL Update 1.1 that ensures eventual consistency (convergence + intentions) on Live Linked Data at a reasonable cost.
31. Future Work
Finish and release LLD-Corese.
Write the composed CRDT for RDF datasets.
What other benefits does LLD have for:
Querying (SyncAndSearch vs. federations, connection with LAV/GUN, with adaptivity…?)
Discovery…?
Community detection…?
Provenance…?
Traces…?
Editor's notes
Moreover, the Linked Data cloud has been growing steadily; for example, here we see the 2007 picture of the Linked Open Data cloud, compared to the much bigger picture in 2011.
Linked Data's main rule is that your data needs to reference or "link" to existing knowledge in other sources; for example, LinkedMDB relates its entity for the movie "The Shining" to DBpedia, YAGO, and GeoNames.
Let's see this applied to our previous example: now the connections are streams that I subscribe to, and from which I pull the information I'm interested in (like this…). Notice that with this I can solve the freshness issue, as whenever someone updates their data I can react immediately (example given…). I also ease the performance issue, since I do local processing. As in the other approaches, I can solve the data-curation issue by incorporating a source and editing locally (example given…). This curation also generates an operation stream that I publish and advertise to others; people will start to follow my changes, and I would also solve the write-enabling issue.
So, our expectation is that other data publishers will also start to open their updates, favoring the creation of mashups by means of the "follow your change" relation, creating another network on top of the actual Linked Data cloud. Of course, these new mashups are not static: they can perform value-adding operations, and if this value is high enough, the original publisher can also start following these changes and, voilà, we have the writing that we want.
Before jumping into that, we need to learn a little about what RDF is and what its operations are.
SPARQL operations have a very well-defined formal semantics: one set of operations to operate on graphs (all here but the last two) and another to operate on datasets.
Read, and then… For example, pattern operations are state-dependent, so I cannot preserve their intention; the same goes for blank nodes, which are like existential variables, and the SPARQL specification leaves it to each store how to name them.
To implement the pattern operations, we chose to compute the affected triples at the local state and send those to the other peers.
Read…
The main interest of Linked Data is to perform queries on multiple datasets. To do that, and assuming we have already discovered the relevant datasets, the first way to proceed is to make a local copy of them and run the query on it; for example, if I ask what the largest city of Vicente Fernández's country is, I ask MusicBrainz for the country and Wikipedia for the largest city. But if the original datasets get modified, I lose freshness.
On the other hand, instead of bringing the data to me, I can go to it, running the query in a distributed fashion. The drawback is that I depend on the reliability and performance of the endpoints: if one is down, my query will fail.
So, how does this look on real Live Linked Data? We already mentioned that DBpedia has started publishing its updates: so, th
So an issue that we can raise here is: what happens when I find a mistake in a dataset that is not mine, which I don't know how, or don't have the rights, to modify? Put more generally, how do I make this Linked Data network "writable"?
So, we know from CRDT theory that if all operations commute, the object is conflict-free… but, bas