17. The oldest data model
is a simple table.
(figure: a simple table, with header, row, and column labelled)
van Hooland, S. and Verborgh, R.
“Linked Data for Libraries, Archives and Museums” (Facet, 2014)
18. Tables do not cope well
with changes in data or schema.
Title                | Artist     | Born | Died
The Thrill is Gone   | B. B. King | 1925 | 2015
Riding with the King | John Hiatt | 1952 |
Riding with the King | B. B. King | 1925 |
…                    | …          | …    | …
19. Relational databases provide
a multi-dimensional table model.
(figure: the relational model, with header, row, relation, key column, attributes, and table/entity labelled)
20. Databases cope with data changes
but schema changes are harder.
Title                | Artist
The Thrill is Gone   | 1
Riding with the King | 2
Riding with the King | 1
…                    | …

ID | Name       | Born | Died
1  | B. B. King | 1925 | 2015
2  | John Hiatt | 1952 |
…  | …          | …    | …
21. There is no interoperability
with other databases.
Title                | Artist
The Thrill is Gone   | 1
Riding with the King | 2
Riding with the King | 1
…                    | …

(figure: how would this table link to Wikipedia?)
22. XML allows reuse of schemas
and identifiers.
(figure: an XML tree, with root, parent, child, and siblings labelled)
23. XML schema evolution
remains a tough nut to crack.
(figure: the four data models compared.
Tabular data: each data item is structured as a line of field values; fields are the same for all items; a header line can indicate their name.
Relational model: data are structured as tables, each of which has its own set of attributes; records in one table can relate to others by referencing their key column.
Meta-markup languages: XML documents have a hierarchical structure, which gives them a tree-like appearance.
RDF: each fact about a data item is expressed as a triple, which connects a subject to an object through a precise relationship.)
24. The RDF data model is flexible
for changes in data and schema.
(figure: the RDF model, with subject, property, and object labelled)
25. RDF involves a trade-off
between flexibility and reuse.
(figure: a spectrum from a custom ontology, giving a perfect match, to reused ontologies, giving perfect interoperability)
26. So far for change within models…
what about change between them?
(figure: the four data models, tabular data, relational model, meta-markup languages, and RDF, shown in succession)
27. There’s no ultimate model.
They co-exist. Change is inherent.
(figure: the four data models shown side by side; none replaces the others)
29. Even if your data doesn’t change,
technology does.
What happens to your data?
new software versions
new software manufacturers
30. Is your software
holding your data hostage?
Is your software the owner of your data, or are you?
Intentional or unintentional vendor lock-in?
Can you get your data out at any moment you want?
31. The Cooper-Hewitt Design Museum
had trouble getting their own data.
Data in The Museum System:
flexible, but complex relational design,
and no export button.
The website had more flexible demands:
complex manual queries to liberate data,
and a parallel CMS to drive the website.
33. The Web has been designed
with change in mind.
Individual links are allowed to break
so the entire Web does not.
—Tim Berners-Lee
34. The Web is in rapid evolution
but keeps on working.
What year is it? Then your users need…
1995 – HTML 2.0
2000 – XML
2008 – JSON
2012 – HTML 5
2015 – RDF ?
2017 – … ?
35. At least HTML seems constant,
so the human Web is safe.
http://bib.org/books/978-1-85604-964-1/
around 2005: made in HTML 4
around 2015: made in HTML 5
Markup changes, the identifier does not.
Tim Berners-Lee called these “Cool URIs”.
36. Web APIs for machines suffer
from changes on many levels.
http://api.bib.org/v2/viewBookDetails.php?
id=978-1-85604-964-1&format=json
&apikey=WSDGU56VP
How does this identifier cope with change?
How long does this identifier work unchanged?
38. Plenty of excuses exist
to change machine interfaces.
But our new server does it faster!
But our new API has different features!
But XML is obsolete now so we need JSON!
39. Even funnier are the excuses
for requiring API keys.
But we need to rate limit!
But we need to track automated access!
But we need to protect our data!
40. Once and for all:
API keys do not help with these.
Your HTML interface is still open!
JSON is a convenience, not a necessity.
Anybody can still do whatever they want
by scraping HTML pages with the same data.
Protect your data, not just one interface.
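A few lines of standard-library Python make the point concrete: any data the open HTML interface shows can be extracted without a key. The page markup and class name below are hypothetical, a minimal sketch rather than a real bib.org page:

```python
from html.parser import HTMLParser

# Hypothetical HTML page exposing the same data as the keyed API.
HTML_PAGE = """
<ul>
  <li class="book">The Thrill is Gone</li>
  <li class="book">Riding with the King</li>
</ul>
"""

class BookScraper(HTMLParser):
    """Collects the text of every <li class="book"> element."""
    def __init__(self):
        super().__init__()
        self.in_book = False
        self.books = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "book") in attrs:
            self.in_book = True

    def handle_data(self, data):
        if self.in_book and data.strip():
            self.books.append(data.strip())
            self.in_book = False

scraper = BookScraper()
scraper.feed(HTML_PAGE)
print(scraper.books)  # → ['The Thrill is Gone', 'Riding with the King']
```

No API key, no rate limit: whatever protection the keyed interface claims to add, the open HTML interface gives away.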
42. Yet other possible changes
still appear to be a concern.
Do identifiers remain constant if your server changes?
Do they remain constant if your API changes?
Do they remain constant if data models change?
47. Constants allow clients
to establish a shared meaning.
subject: http://bib.org/books/978-1-85604-964-1/
object: http://bib.org/authors/7356/
property: http://purl.org/dc/terms/creator
48. Human semantics are in concepts
and their meaning to the world.
subject: a book
object: a person
property: written by
49. Machine semantics are in symbols
and their structural interrelations.
subject: http://digybe.wpq/dgjyj-dgu7945
object: http://aole.wqq/mobd1.tihz
property: http://yudgy.jdu/DHH8DHBtkixhj
50. We need to be very careful
about our choice of symbols.
subject: http://bib.org/books/978-1-85604-964-1/
object: http://bib.org/authors/7356/
property: http://purl.org/dc/terms/creator
51. We need to be very careful
about our choice of symbols.
http://bib.org/books/978-1-85604-964-1/
Is this a book or a description of a book?
  :printDate "2014-06-11"
  :lastModified "2015-11-25"

http://bib.org/authors/7356/
Is this a person or a document?
  :birthDate "1987-02-28"
  :size "17kB"
52. Although designed for machines,
the example only works for humans.
subject: http://bib.org/books/978-1-85604-964-1/
object: http://bib.org/authors/7356/
property: http://purl.org/dc/terms/creator
53. Because, somehow, Web APIs
make machine access different.
subject: http://api.bib.org/v2/viewBookDetails.php?id=978-1-85604-964-1&format=json&apikey=WSDGU56VP
object: http://api.bib.org/v2/viewAuthorProfile.php?id=7356&format=json&apikey=WSDGU56VP
property: http://purl.org/dc/terms/creator
54. That’s why it’s a problem if
machines need different identifiers.
subject: http://api.bib.org/v2/viewBookDetails.php?id=978-1-85604-964-1&format=json&apikey=WSDGU56VP
object: http://api.bib.org/v2/viewAuthorProfile.php?id=7356&format=json&apikey=WSDGU56VP
property: http://purl.org/dc/terms/creator
55. Only this triple is a global constant.
The other is volatile and local.
subject: http://bib.org/books/978-1-85604-964-1/
object: http://bib.org/authors/7356/
property: http://purl.org/dc/terms/creator
57. Fortunately, we don’t have to
pick all the constants ourselves.
Ontologies provide identifiers of concepts
that are designed to be reused.
They are necessary to make RDF work.
They are necessary to create queries,
especially over multiple data sources.
58. Of course, we get the benefits
only if we actually reuse.
Why have our own my:writtenBy property
when dc:creator already exists?
Maybe we have a more specific meaning?
We can still relate both properties with RDF.
But if we all use derivatives of the constants,
what is the value of these constants?
59. Authors are not always in control:
external semantic drift happens.
foaf:knows was bidirectional…
spec: “some level of reciprocity”
An foaf:knows Pete ⇔ Pete foaf:knows An
…until somebody modeled Twitter followers:
Pete follows Angela Merkel ⇒ Pete foaf:knows Angela
Yet Angela doesn’t know Pete…
60. Getting close to Derrida…
but we’re not philosophers.
There are only two hard things
in Computer Science:
cache invalidation and naming things.
—Phil Karlton
62. The constants you can touch
are the constants you can trust.
No matter how much technology changes,
the books we describe remain the same.
Any mechanism of identification
should be based on domain resources,
not on inevitably changing technology.
63. The “success” story
of the Web API community.
(figure: a chart of the number of indexed Web APIs in ProgrammableWeb, growing from 186 in 2005 to 12,559 in 2015, alongside a paper excerpt noting that more than 12,000 different micro-protocols exist between clients and servers over HTTP, and that each different API requires a different client)
64. Just imagine we had
15,000 different data models.
(figure: the same ProgrammableWeb chart of indexed Web APIs, 186 in 2005 to 12,559 in 2015)
65. Find resources in your domain
and assign them an identifier.
http://bib.org/books/978-1-85604-964-1/
http://bib.org/authors/7356/
66. It’s just like building a web site.
When a user comes, serve HTML.
http://bib.org/books/978-1-85604-964-1/
(figure: a user issues GET on the URI and receives HTML)
67. It’s just like building a web site.
When a client comes, serve JSON.
http://bib.org/books/978-1-85604-964-1/
(figure: a client issues GET on the URI and receives JSON)
68. It’s just like building a web site.
When a client comes, serve RDF.
http://bib.org/books/978-1-85604-964-1/
(figure: a client issues GET on the URI and receives RDF)
69. Content negotiation has existed
in HTTP for a long time.
http://bib.org/books/978-1-85604-964-1/
(figure: a client issues GET on the resource and receives an RDF representation; the resource is distinct from its representations)
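The mechanism can be sketched in a few lines; the media types and bodies here are illustrative placeholders, and real negotiation also handles quality values and wildcards:

```python
# One resource, several representations, selected by the Accept header.
REPRESENTATIONS = {
    "text/html": "<h1>Book 978-1-85604-964-1</h1>",
    "application/json": '{"isbn": "978-1-85604-964-1"}',
    "text/turtle": '<http://bib.org/books/978-1-85604-964-1/> '
                   '<http://purl.org/dc/terms/title> "A book" .',
}

def negotiate(accept_header, default="text/html"):
    """Return (media type, body) for the first supported type listed."""
    for candidate in accept_header.split(","):
        media_type = candidate.split(";")[0].strip()  # drop q-values
        if media_type in REPRESENTATIONS:
            return media_type, REPRESENTATIONS[media_type]
    return default, REPRESENTATIONS[default]

print(negotiate("text/turtle,application/json;q=0.9")[0])  # → text/turtle
```

The URI stays constant; only the representation served behind it varies.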
70. This allows constant URIs
even with future changes.
http://bib.org/books/978-1-85604-964-1/
(figure: a client issues GET on the same URI and receives a future format, “RDF 2.0”)
71. It enables different users and
machines to talk about things.
http://bib.org/books/978-1-85604-964-1/
(figure: users and clients refer to the same URI)
72. The best API is no API.
Your website is already an API.
Developers like to build complicated APIs.
API keys are especially cool to build.
Every feature and change comes with a high cost.
If you ask for an API, you’ll get one.
Ask for new representations
of your resources instead.
76. The Semantic Web promised
data on the Web.
85,567,007,302 triples from 3,426 datasets (LODStats)
38,606,408,765 triples from 657,896 entries (LOD Laundromat)
77. How much of this data
can we readily access?
data dumps
Linked Data documents
SPARQL endpoints
78. A data dump means downloading
everything and querying locally.
When was the last time
you downloaded the full Wikipedia
just because you had one question?
80. Dumps are not Web querying.
It’s kind of like giving up.
Semantic Web or Semantic Basement?
What advantage do we have
compared to Big Data?
Still the RDF data model…
but the major difference is the Web.
82. Linked Data documents
allow you to traverse a dataset.
That’s similar to what we also do:
consume information on Wikipedia
by following links.
83. Much Linked Data is available
using the well-known principles.
Servers publish a light-weight interface.
Clients follow their nose
to retrieve information.
84. Linked Data documents allow
query evaluation on the Web.
# Other books by the same author
SELECT DISTINCT ?book WHERE {
books:85604 dc:creator ?author.
?book dc:creator ?author.
}
85. Some queries are hard
or impossible to evaluate.
# Books about Hamburg
SELECT DISTINCT ?book ?author WHERE {
?book dc:subject dbpedia:Hamburg.
?book dc:creator ?author.
}
87. SPARQL endpoints allow you
to ask any question you want.
When was the last time
you expected Wikipedia to answer
specific questions automatically for you?
88. A public SPARQL endpoint
happily answers this query.
# Other books by the same author
SELECT DISTINCT ?book WHERE {
books:85604 dc:creator ?author.
?book dc:creator ?author.
}
89. A public SPARQL endpoint also
happily answers this query.
# Books about Hamburg
SELECT DISTINCT ?book ?author WHERE {
?book dc:subject dbpedia:Hamburg.
?book dc:creator ?author.
}
91. There’s a price to pay for being
the most expressive HTTP interface.
The majority of public SPARQL endpoints
have less than 95% uptime.
That means they are unreachable
for more than 1.5 days each month.
This means we cannot rely on them
to build Linked Data applications.
Buil-Aranda – Hogan – Umbrich – Vandenbussche
SPARQL Web-Querying Infrastructure: Ready for Action?
93. The main promise of Linked Data
is integration, preserving semantics.
(figure: the RDF model, with subject, property, and object labelled)
94. Integration is the promise.
But does it work on the Web?
data dumps
Linked Data documents
SPARQL endpoints
95. With data dumps, we just
build a bigger basement.
How far do we go?
How do we keep data up to date?
96. With Linked Data documents,
we keep on following our nose.
There are no dataset boundaries.
Some queries will remain hard.
97. With public SPARQL endpoints,
problems become worse.
1 endpoint has 95% availability.
1.5 days down each month
2 endpoints have 90% availability.
3 days down each month
3 endpoints have 85% availability.
4.5 days down each month
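These figures follow from multiplying the per-endpoint availabilities: assuming endpoints fail independently, which is a simplification, n endpoints are jointly available 0.95^n of the time (the slide rounds the results):

```python
# Joint availability of n endpoints, each up 95% of the time,
# assuming independent failures (a simplifying assumption).
def combined_availability(per_endpoint: float, n: int) -> float:
    return per_endpoint ** n

def downtime_days(availability: float, days_per_month: int = 30) -> float:
    """Expected days per 30-day month during which a query fails."""
    return days_per_month * (1 - availability)

for n in (1, 2, 3):
    a = combined_availability(0.95, n)
    print(f"{n} endpoint(s): {a:.1%} available, "
          f"{downtime_days(a):.1f} days down per month")
```

A federated query over three such endpoints already fails more than four days a month, before any individual query even times out.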
99. Can we think differently
about Linked Data on the Web?
(figure: data dump, Linked Data documents, and SPARQL endpoint on a spectrum: from low server cost, high availability, high bandwidth, out-of-date data, and high client cost at the dump end, to high server cost, low availability, low bandwidth, live data, and low client cost at the endpoint end)
100. Can we think differently
about Linked Data on the Web?
(figure: the same spectrum, asking which interfaces could fill the gaps between data dump, Linked Data documents, and SPARQL endpoint)
101. Let us combine the lessons on
changes, constants, and promises.
An interface that withstands change:
simple enough so it doesn’t break,
complex enough to query.
102. Let us combine the lessons on
changes, constants, and promises.
Data dumps contain too much.
SPARQL endpoint results are too specific.
Linked Data documents are unidirectional.
103. Each interface divides a dataset
into Linked Data Fragments.
Data dumps: 1 huge fragment
SPARQL endpoints: ∞ specific fragments
Linked Data: 1 fragment per subject
104. Can we find a new interface
with a sustainable balance?
Triple Pattern Fragments:
1 fragment per subject / predicate / object
107. Triple Pattern Fragments extend
Linked Data documents with forms.
That’s even more similar to what we do:
consume information on Wikipedia
by following links and using forms.
108. Machines solve complex queries
by breaking them down.
# Other books by the same author
SELECT DISTINCT ?book WHERE {
books:85604 dc:creator ?author.
?book dc:creator ?author.
}
109. Machines solve complex queries
by breaking them down.
# Books about Hamburg
SELECT DISTINCT ?book ?author WHERE {
?book dc:subject dbpedia:Hamburg.
?book dc:creator ?author.
}
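A rough sketch of that decomposition, over an in-memory list of triples with made-up identifiers; a real client would instead fetch the matches for each triple pattern as a fragment over HTTP:

```python
# Toy dataset of (subject, predicate, object) triples.
TRIPLES = [
    ("books:85604", "dc:creator", "authors:7356"),
    ("books:90210", "dc:creator", "authors:7356"),
    ("books:11111", "dc:creator", "authors:9999"),
]

def match(s=None, p=None, o=None):
    """All triples matching one pattern; None acts as a variable."""
    return [t for t in TRIPLES
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Other books by the same author":
# 1. bind ?author from the most selective pattern,
# 2. substitute each binding into the second pattern.
books = {t[0]
         for (_, _, author) in match(s="books:85604", p="dc:creator")
         for t in match(p="dc:creator", o=author)}
books.discard("books:85604")
print(sorted(books))  # → ['books:90210']
```

The server only ever answers single triple patterns; the join logic runs entirely on the client.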
110. Promises can be kept, because
the interface is intelligently light.
Publishing Linked Data
that can be queried on the Web
is realistic because the workload is divided.
The server doesn’t even need a triplestore.
Since the client is in charge,
querying multiple sources is easy.
111. Promises are negotiated contracts
so they always involve trade-offs.
Querying will be slower.
clients send many requests to answer a query
Query times are more consistent.
0.3 secs with a SPARQL endpoint… 95% of time
3 secs with Triple Pattern Fragments… 99.9% of time
Experiment with more complex interfaces.
112. Make your Linked Data
queryable on the Web.
Several open-source implementations:
linkeddatafragments.org/software/
Query one or multiple sources online:
client.linkeddatafragments.org
Example: bit.ly/harvard-hamburg