1. Introduction to Linked Data
Linked Data: what cataloguers need to know #cigld
CILIP Cataloguing and Indexing Group (CIG)
20 February 2015
Thomas Meehan
tom@aurochs.org @orangeaurochs
3. linked open DATA
245 00 $a Models for decision :
$b a conference under the auspices of the United Kingdom
Automation Council organised by the British Computer
Society and the Operational Research Society /
$c edited by C.M. Berners-Lee.
260 __ $a London :
$b English Universities Press,
$c 1965.
300 __ $a x, 149 p. :
$b ill. ;
$c 23 cm.
504 __ $a Includes bibliographical references.
700 1_ $a Berners-Lee, C. M.
5. …LINKED open data
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr id="bib-author-row">
<th>Author:</th>
<td id="bib-author-cell">
<a href="/search?q=au%3ABerners-Lee%2C+C.+M.&qt=hot_author" title="Search for more by this
author">C M Berners-Lee</a>;
<a href="/search?q=au%3ABritish+Computer+Society.&qt=hot_author" title="Search for more by
this author">British Computer Society.</a>;
<a href="/search?q=au%3AInstitution+of+Electrical+Engineers.&qt=hot_author" title="Search for
more by this author">Institution of Electrical Engineers.</a>;
<a href="/search?q=au%3AOperational+Research+Society.&qt=hot_author" title="Search for more
by this author">Operational Research Society.</a>
</td>
</tr>
<tr id="bib-publisher-row">
<th>Publisher:</th>
<td id="bib-publisher-cell">London : English Universities Press, [1965]</td>
</tr>
…
</table>
7. The Web of Data
• Use URIs as names for things
• Use HTTP URIs so that people can look up
those names.
• When someone looks up a URI, provide useful
information, using the standards (RDF,
SPARQL)
• Include links to other URIs so that they can
discover more things.
Tim Berners-Lee (2006)
32. Introduction to Linked Data
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix bibo: <http://purl.org/ontology/bibo/> .
# <#cigld_intro> is a relative IRI standing in for this presentation's own URI.
<#cigld_intro> dct:creator _:bnode001 ;
dct:created "2015" ;
dct:title "Introduction to Linked Data" ;
dct:isPartOf _:bnode003 .
_:bnode001 a foaf:Person ;
foaf:name "Thomas Meehan" ;
foaf:mbox <mailto:tom@aurochs.org> ;
foaf:account _:bnode002 .
_:bnode002 a foaf:OnlineAccount ;
foaf:accountServiceHomepage <https://twitter.com/> ;
foaf:accountName "@orangeaurochs" .
_:bnode003 a bibo:Series ;
dct:title "Linked data: what cataloguers need to know" .
33. References
• WorldCat record for Models for decision / C. Berners-Lee.
http://www.worldcat.org/title/models-for-decision-a-conference-under-the-auspices-of-the-united-kingdom-automation-council-organised-by-the-british-computer-society-and-the-operational-research-society/oclc/221944758
• What is open data? / The Open Data Institute.
http://theodi.org/guides/what-open-data
• Linked Data : design issues / Tim Berners-Lee.
http://www.w3.org/DesignIssues/LinkedData
Editor's notes
WHY LEARN ABOUT LINKED DATA
Likely to be replacement for MARC (Bibframe?)
Even if not, is being used to openly publish bibliographic data on the web
Being used by eg search engines for semantic results
Because cataloguers can take some part in the discussion!
Term often taken to mean linked open data:
Data: not just text like HTML which is a marked up document.
Linked: not just text strings, eg hypertext, you can find out more by clicking on links
Open: Freely available, licensed, re-usable, re-purposable
Structured
Labelled
In a recognised format
But
No links: all the data is in text strings. Arguably, the 700 is a link if you follow a recognised authority scheme. However, it's not an actionable link like an 856 field. You cannot follow the contents of that and find out more. It would, in fact, be hard to construct a URL from that which would go to anything meaningful. If you want to find out anything more about these things, you have to get out of the system and search Google or the LC authorities site.
Not (necessarily) open, or at least not easy to get at. Record sharing is common in library cataloguing, but licences are rare and access is through a Z39.50 gateway or reconstructed web pages.
Not (arguably) even data as such but actually a record and largely textual. None of the bits make sense in isolation. I'll talk a little more about MARC in particular this afternoon.
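To see how textual the record is, here is a minimal Python sketch that splits one field of the display above into tag, indicators and subfields. The parse_field helper is hypothetical, and it assumes the mnemonic layout shown on the slide (tag, indicators, $-coded subfields) rather than the real MARC transmission format.

```python
import re


def parse_field(line):
    """Split a display-style MARC line into (tag, indicators, subfields).

    Assumes the mnemonic layout used on the slide, e.g.
    "700 1_ $a Berners-Lee, C. M."; this is not a real MARC parser.
    """
    tag, indicators, rest = line.split(" ", 2)
    # Each subfield is a "$" plus a one-character code, then text up to the next "$".
    subfields = re.findall(r"\$(\w)\s*([^$]*)", rest)
    return tag, indicators, [(code, value.strip()) for code, value in subfields]


added_entry = parse_field("700 1_ $a Berners-Lee, C. M.")
publication = parse_field("260 __ $a London : $b English Universities Press, $c 1965.")

print(added_entry)
print(publication)
```

Every value that comes back is a plain string, punctuation included. Nothing in added_entry identifies the person or can be followed anywhere.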
This is linked in that it's hypertext: you can find out more by clicking on links, although only internally in this case.
These links are still aimed at people: difficult for a computer, e.g. a search engine, to assess value. If we look at the source for this…
This is a snippet of the HTML from the previous page, specifically the part listing the authors and the publisher. It is all document based. The table is a means of display only. The th for Author is merely for human readability. The links go to other searches. The publisher information is wholly textual and there is no attempt to even split the elements.
Someone like Mr Google could attempt to extract meaning from a page like this but it is unreliable at best. This is the battle that Google has been fighting since it started: how to extract meaning from web pages. This is one reason why Google are keen on linked data!
Furthermore, the links that there are merely perform another search.
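As a rough illustration of what a program can actually get out of that HTML, here is a sketch using only Python's standard library. SNIPPET is a trimmed copy of two of the author links from the slide, and LinkCollector is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

# Two of the author links from the slide, trimmed for brevity.
SNIPPET = (
    '<td id="bib-author-cell">'
    '<a href="/search?q=au%3ABerners-Lee%2C+C.+M.&qt=hot_author">C M Berners-Lee</a>; '
    '<a href="/search?q=au%3ABritish+Computer+Society.&qt=hot_author">British Computer Society.</a>'
    "</td>"
)


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


collector = LinkCollector()
collector.feed(SNIPPET)

# Decode the search string each link would run.
queries = [parse_qs(urlparse(href).query)["q"][0] for href, _ in collector.links]
for (href, text), query in zip(collector.links, queries):
    print(text, "->", query)
```

All the program recovers is a display name and the search string the link would run. Neither is an identifier for the person or the organisation.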
Looking at the webpage from the previous slide, you'll also note the copyright notice at the bottom!
Here is an example of some beautiful linked data but I can't let you see it, search it, or use it. We can discuss terms later.
Open means several things:
Freely available
Licensed to minimise restriction (see the whole open access question). A lot of the Cambridge linked data work revolved around this.
Re-usable. If you can't reuse other people's data, the whole idea of linked data falls down, even if you can search it.
Re-distributable
Re-purposable. You can use it for purposes beyond its original intention.
"Open data is information that is available for anyone to use, for any purpose, at no cost.
Open data has to have a licence that says it is open data. Without a licence, the data can’t be reused. The licence might also say:
that people who use the data must credit whoever is publishing it (this is called attribution)
that people who mix the data with other data have to also release the results as open data (this is called share-alike)"—The Open Data Institute.
When people use the phrase Linked Data they are actually referring to a Web of Data, as opposed to a web of documents, built on specific principles, i.e. open data in RDF.
URI: URIs can be URLs or URNs. URLs can be http, ftp, etc. URNs are not web-actionable.
HTTP: I.e. over the web. If you don't have http, you cannot easily go and look up more information.
Useful info: Basically a description, something about the thing. Just as on a web page you'd provide information in HTML, in linked data you provide information in RDF (of which more in a second). You can search it using SPARQL (of which more from Owen later).
Links: Crucial. You can find out more from other URIs, much as links on a web page allow references and explanations, and further information to be explored.
Note: All this is independent of libraries and proceeds rather from the W3C. Linked data is not a formal W3C standard, but RDF is, like HTML. The Web of Data is the basis of a semantic web, in which data carries meaning as well as text, so that computers can make sense of it and act on it. Indeed linked data is sometimes described as a re-branding or re-launch of the semantic web. Then, of course, there is the web of things: fridges that can catalogue cheese using MARC, central heating that understands Bibframe, or FRBR-compliant toasters.
Understanding RDF is important to understanding linked data and that's what I'm going to concentrate on for the rest of this session.
This is a simple English sentence.
It has a subject "Brideshead revisited" (the book)
It has an object "Evelyn Waugh"
It has a predicate "was written by"
This is text. How can we turn this into data?
We'll start with dividing it into Entities and relationships, similar to the modelling behind FRBR:
In this case, a Work, a Person and a relationship.
These are still English text, ambiguous, unidentified, and not linked.
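The split into two entities and a relationship can be sketched in Python as a three-part structure. The Statement type is made up for this example. At this stage all three parts are still just English strings.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One sentence split into its three parts (hypothetical helper type)."""

    subject: str
    predicate: str
    obj: str


sentence = Statement(
    subject="Brideshead Revisited",  # the Work
    predicate="was written by",      # the relationship
    obj="Evelyn Waugh",              # the Person
)

print(sentence)
```

Nothing here tells a computer which Brideshead Revisited or which Evelyn Waugh is meant, which is the gap that URIs fill.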
Next, we can start to replace textual names of things with URIs.
First, we'll give the book a URI. The book Brideshead Revisited is a resource, not in the RDA sense but anything that can be given an identity. We identify it in RDF using a URI.
Second, the author Evelyn Waugh, the object.
Lastly we can add a URI for the predicate, the relationship.
In linked data, even the relationships are established or authorised, not just names and works and subjects. Everything.
Third, the creator relationship. This is now what is called an RDF triple and, with some punctuation, a valid piece of RDF!
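A minimal sketch of that "some punctuation", in Python. The predicate is the real Dublin Core Terms creator URI. WORK and PERSON are hypothetical example.org placeholders, not the URIs used on the slide.

```python
# The Dublin Core Terms "creator" property.
DCT_CREATOR = "http://purl.org/dc/terms/creator"

# Hypothetical placeholder URIs, for illustration only.
WORK = "http://example.org/work/brideshead-revisited"
PERSON = "http://example.org/person/evelyn-waugh"


def ntriple(subject, predicate, obj):
    """Write three URIs as one N-Triples statement: angle brackets and a full stop."""
    return f"<{subject}> <{predicate}> <{obj}> ."


line = ntriple(WORK, DCT_CREATOR, PERSON)
print(line)
```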
However, it is split over three lines. To make this easier to read, especially when there are lots of triples, we can write this out in a different way: Turtle.
The first two lines are actually an attempt to make this easier to read. Each starts with the word @prefix, followed by the prefix itself (which could be anything: it's just a label to make things easier to read) and lastly the base of a URI.
The triple itself now all fits more easily on one line.
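What a prefix does can be sketched in a few lines of Python: a prefixed name is just shorthand for the base URI plus the local part. The dct and foaf bases are the real ones. The ex prefix and the expand helper are assumptions made for this example.

```python
PREFIXES = {
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ex": "http://example.org/",  # hypothetical base, for illustration only
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as "dct:creator" into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {name!r}")
    return prefixes[prefix] + local


short_form = ("ex:brideshead-revisited", "dct:creator", "ex:evelyn-waugh")
triple = tuple(expand(term) for term in short_form)
print(triple)
```

The short form and the expanded form say exactly the same thing. The prefix label itself carries no meaning.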
This is a Triple, which is the foundation of RDF. It is an assertion (not necessarily a fact).
Subject
Predicate
Object
It is important and significant that this lone triple stands alone as a piece of data. It doesn't need to be part of a record as such. We can follow the URIs to find out more.
Triples can get quite a lot more complicated. There are ways of saying more nuanced things. In fact, one problem some have with efforts like Bibframe is that the abstraction goes too far. There are also some obvious problems with this:
Provenance. Without a record, how can we demonstrate who said this, how reliable it is. There are ways round this and initiatives to do so.
Complexity.
Redundancy. What if DC and LC shut up shop or change their minds?
I promise not to put a lot of XML in this presentation but I will mention triples a lot in this format as they really are fundamental to RDF.
A fuller description using DCMI terms.
Keep thinking of it being in three columns. The subject column is not repeated here because the subject is the same for all the triples. The predicates and objects change.
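The "subject column is not repeated" idea can be sketched in Python as follows. The turtle_block helper and the ex: names are hypothetical. It simply writes the subject once and joins the predicate and object pairs with semicolons, as Turtle does.

```python
def turtle_block(subject, pairs):
    """Write several triples sharing one subject in Turtle's ";" shorthand."""
    if not pairs:
        raise ValueError("at least one predicate/object pair is needed")
    first_predicate, first_object = pairs[0]
    # The subject appears once, on the first line only.
    lines = [f"{subject} {first_predicate} {first_object}"]
    # Later lines carry only the predicate and object columns.
    lines += [f"    {predicate} {obj}" for predicate, obj in pairs[1:]]
    return " ;\n".join(lines) + " ."


block = turtle_block(
    "ex:brideshead-revisited",  # hypothetical prefixed name
    [
        ("dct:title", '"Brideshead Revisited"'),
        ("dct:creator", "ex:evelyn-waugh"),
    ],
)
print(block)
```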
This says…. {run through triples}
Now, each of those URIs can of course be followed, and I've picked one out to follow.
If we follow the link (or screenshot), we can see what information the URI gives us.
As we are using a browser, the URI resolves to an HTML page. If we were a computer programme we could request this page as data.
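One common way for a program to ask for data rather than HTML is HTTP content negotiation. The sketch below only builds the request and does not send it. The URI is a hypothetical placeholder, and whether a given server honours the Accept header is an assumption to check against that service.

```python
from urllib.request import Request


def data_request(uri, media_type="text/turtle"):
    """Build (but do not send) a GET request asking for RDF instead of HTML."""
    return Request(uri, headers={"Accept": media_type})


# Hypothetical URI, for illustration only.
req = data_request("http://example.org/authorities/names/n00000000")

print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

Passing req to urllib.request.urlopen would actually perform the lookup. Asking for "application/n-triples" instead would request N-Triples where the server offers it.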
We can do so through the web page too by clicking on one of the links at the bottom. We'll get a page in what is called N-Triples.
LC Authorities Linked Data in HTML (screenshot)
LC Authorities Linked Data in N-Triples (screenshot)
This is an excerpt of the LC Name Authority linked data converted to Turtle.
N-Triples:
Simple.
Easy to see the triples and how many there are.
Hard to read each element.
Hard to fit it on a page!
Turtle:
Easy to read and fit on a screen, although longer for snippets due to the prefixes. Also, syntax can get more complicated to cope with abbreviation.
Doubly useful to learn as it's essentially the same syntax as SPARQL, which is itself a really useful way of getting to know and understanding linked data.
RDF/XML:
Originally the only RDF format
Often confused with RDF itself
Easy for computers to read
Very hard for people to read!!
Often used in Bibframe documentation examples.
JSON (JavaScript Object Notation) is increasingly favoured by programmers. It uses the same data structures as JavaScript so can be dropped easily into a programme. It is also easy for other programming languages to use and is not limited to RDF or even JavaScript.
{Curly brackets for Objects}
[Square brackets for arrays]
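A small sketch of how easily another language picks JSON up, here using Python's standard json module. The values come from the MARC example earlier. The key names "title" and "contributors" are made up for this illustration.

```python
import json

# Curly brackets delimit an object, square brackets an array.
text = """
{
  "title": "Models for decision",
  "contributors": ["Berners-Lee, C. M.", "British Computer Society"]
}
"""

record = json.loads(text)

# The object becomes a dict and the array becomes a list.
print(type(record).__name__, type(record["contributors"]).__name__)
print(record["contributors"][0])
```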
JSON-LD is superficially similar, uses JSON, but is also quite different.
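To give a flavour of the difference, here is a hedged sketch of a tiny JSON-LD document built in Python. The @context and @id keywords are part of JSON-LD. The example.org URIs are hypothetical placeholders.

```python
import json

doc = {
    # @context maps the short "dct" prefix to a base URI, much like a Turtle @prefix.
    "@context": {"dct": "http://purl.org/dc/terms/"},
    # @id names the subject with a URI (hypothetical here).
    "@id": "http://example.org/work/brideshead-revisited",
    # The object is a resource, so it is given as an @id and not as a plain string.
    "dct:creator": {"@id": "http://example.org/person/evelyn-waugh"},
}

serialised = json.dumps(doc, indent=2)
print(serialised)
```

It is still ordinary JSON, so any JSON tool can read it, but the keys and values now resolve to URIs and so express a triple.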
I.e. for embedding into HTML pages
Linked, e.g. hypertext: you can find out more by clicking on links, although only internally.
These links are still aimed at people: difficult for a computer, e.g. a search engine, to assess value.
Copyright at the bottom!
Here are the N-triples…
The viaf id for Berners-Lee is followable: click on it and you get the VIAF record. Choose the RDF view to see the underlying RDF.
These triples or pairs of triples all make the same assertion but use different vocabularies and different URIs to do so. The sixth one asserts the creator's name as a string. This can't be done directly, as dc:creator needs a resource (a URI), not a literal string.
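The "pair of triples" pattern for a name can be sketched in Python as below: one triple links the work to a resource for the creator, and a second attaches the name string to that resource. The predicate URIs are the real Dublin Core Terms and FOAF ones. WORK and PERSON are hypothetical placeholders, not the URIs in the WorldCat data.

```python
DCT_CREATOR = "http://purl.org/dc/terms/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical placeholder URIs, for illustration only.
WORK = "http://example.org/work/models-for-decision"
PERSON = "http://example.org/person/c-m-berners-lee"


def resource(uri):
    """A URI term: something that can be followed."""
    return f"<{uri}>"


def literal(text):
    """A plain string term: a label, not a link."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def statement(subject, predicate, obj):
    return f"{subject} {predicate} {obj} ."


pair = [
    # The creator is a resource...
    statement(resource(WORK), resource(DCT_CREATOR), resource(PERSON)),
    # ...and the name is a string hung off that resource.
    statement(resource(PERSON), resource(FOAF_NAME), literal("Berners-Lee, C. M.")),
]
print("\n".join(pair))
```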