Linked data and Voyager

  1. Ed Chamberlain Systems Development Librarian Cambridge University Library
  2. The Linking Open Data cloud diagram - http://richard.cyganiak.de/2007/10/lod/
  4. What else?
  5. Beyond bibliographic: Bibliographic, Holdings, FAST subject headings, Libraries, Transactions, Special collections, Archives, Creator / entity, Place of publication, LCSH subject headings, Course lists, Language, Librarians

Editor's notes

  1. Thank you /// So hello. As an introduction, I’m very much a Systems Librarian – management of the LMS is my bread and butter – but this was a great chance to do something very different …
  2. Just a Systems Librarian – this time last year my understanding of linked data was limited to Talis demonstrations … Been to a lot of conferences and seen a lot of guys in black polo necks telling me that this is the future … Apologies if you see the semantic web as up there with quantum mechanics … This will contain some techy stuff. Not actually that much on Voyager, although we will talk about what a next (current?) generation LMS could do in terms of RDF publishing.
  3. Semantic = meaning explained (to machines) – so we would see a 245 and know to display it as a title. We would need to program a computer specifically to work that out; with Marc21 the data has no real way to describe itself. That is the purpose of semantic or ‘self-describing’ data. Hyperlinked = meaning contextualised elsewhere – use a common set of descriptions. For machines as much as people – that is the grand theory: web pages for people, semantic data for machines.
  4. So after about 10 years of various approaches, an approach referred to as Linked Data looks like the emerging framework for the semantic web, with some heavyweights behind it … Use URIs as names for things. Use HTTP URIs so that people can look up those names. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL). Include links to other URIs, so that they can discover more things.
  5. Conversely, here is a description of RDF – Resource Description Framework. It’s a set of metadata encapsulation standards. It does not need to be linked, but often is. (Go back.) Note that RDF is not mentioned except in brackets. Other data can be linked, as long as it follows these conventions. We can have unlinked RDF data and linked non-RDF data … So RDF is not the be-all and end-all of linked data, but it’s the most commonly used mechanism right now, so I’ll probably use both terms interchangeably, which might annoy some types. But we did linked RDF (or attempted to …), and for the purposes of this presentation the two terms are somewhat interchangeable …
  6. So let’s take a look at some. The N-Triples notation format is the simplest means of expressing RDF triples; each triple is described in one line. RDF in its purest form is data described as triples, each a statement in three parts: a subject (the person or thing being described), a predicate (affirming what feature of the subject is being described) and an object (the descriptive term itself).
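The three-part structure can be sketched in a few lines. This is a minimal, illustrative example: the triples describe a hypothetical bib record (the `example.org` URIs and VIAF identifier are made up), using Dublin Core terms, and the parser handles only simple well-formed lines.

```python
# Each N-Triples line is: subject, predicate, object, terminated by " ."
record = """\
<http://example.org/bib/123> <http://purl.org/dc/terms/title> "A Study of Linked Data" .
<http://example.org/bib/123> <http://purl.org/dc/terms/creator> <http://viaf.org/viaf/99999999> .
<http://example.org/bib/123> <http://purl.org/dc/terms/issued> "2011" .
"""

def parse_ntriples(text):
    """Very naive N-Triples reader for simple, well-formed lines."""
    triples = []
    for line in text.strip().splitlines():
        # Strip the trailing " ." then split into exactly three parts;
        # maxsplit=2 keeps a quoted literal with spaces intact.
        subject, predicate, obj = line.rstrip(" .").split(" ", 2)
        triples.append((subject, predicate, obj))
    return triples

for s, p, o in parse_ntriples(record):
    print(p, "->", o)
```

Note how every statement repeats the subject URI: the "record" is just the set of triples that share it.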
  7. Combine several triples and we can see what looks like a bib record. But the data extends beyond the record as a composite entity, through links to other sets of triples.
  8. Big growth in cultural heritage
  9. No talk on linked data is complete without this image …
  10. Respond to academic / national demand for Open Data – University of Southampton. Bibliography is a key stepping stone into teaching, learning and research: reading lists, personal bibliographies, group research. Getting this data openly available in standardised formats is a real-world win. Tax-payer value for money – they pay for this data to be created; let’s give it back in some other form than an OPAC. CUL already provides public APIs, so this seems like a natural progression. Gain in-house experience of RDF – see how hard it can be. A lot of RDF projects tend to outsource to people like Talis; I really wanted to chart the in-house learning process. Move library services forward – we need to be in the world of linked data however it ends up, and we also need greater flexibility in record re-use as we go forward with new formats. These arguments have been well made …
  11. RDF works best with a permissive license – it’s now generally accepted that permissive licenses and bib data are a good thing. Paraphrasing Paul Ayris: “open bibliographic data offers the chance for anyone to re-use the data to build innovative services”. CC0 or the Public Domain Data License. Non-commercial licenses are not suitable – look at university and research funding now: what is non-commercial exactly? No NC license defines commercial activity; it actually creates more doubt and puts people off. Could building a free website based on our data and then running ads alongside it be seen as commercial? Probably. Would it deprive us of revenue or users? Probably not. The consensus is not to go there with data … A permissive approach creates potential conflict with record vendors – not outright conflict. RLUK and OCLC are valid partners; the UL’s cataloguing team of 100+ could not work if either organisation folded, so it’s in our interests to keep them alive. For context, RLUK were behind the JISC Discovery programme, and OCLC acted as an invaluable partner with us on the project.
  12. See if there were any express contractual clauses saying we could not redistribute.
  13. Where does a record come from? Practically, quite hard to determine … There are several places in Marc21 where this data could be held … Logic for examination: an attempt at scripted analysis – list bib_ids by record vendor.
  14. All on the project blog along with some comprehensive explanation of methodology …
  15. Most vendors are happy with a permissive license for ‘non-Marc21’ formats. The non-Marc thing is not an issue in this context; no one outside of library land cares about a load of binary-encoded numbers … We are re-purposing Marc-originated data for a wider audience. RLUK / BL BNB – PDDL. OCLC – ODC-By attribution license. No good reason not to re-publish – just need the right license!
  16. What did we learn? Marc actually made it really difficult, hence the diagram. Better container formats could have sewn this up. With a national / international mandate to open up data, we need a better container format than Marc to go forward. No good reason not to re-publish – just need the right license!
  17. Several attempts at conversion – settled on SQL extracts based on lists of bib_ids. Used Perl scripting to ‘munge’ the data – quite dirty, nasty coding around Marc files. You can try this at home! Scripts are available for Voyager SQL extraction and standalone batch file conversion.
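The shape of that ‘munge’ step can be sketched as follows. This is a hedged illustration, not the project’s actual Perl logic: the field-to-predicate mapping, the `example.org` URI pattern and the flat `{tag: value}` input are all assumptions made for the sake of the example.

```python
# Illustrative mapping from MARC-style field/subfield tags to
# Dublin Core predicates (an assumption, not the project's mapping).
FIELD_MAP = {
    "245a": "http://purl.org/dc/terms/title",
    "100a": "http://purl.org/dc/terms/creator",
    "260c": "http://purl.org/dc/terms/issued",
}

def marc_to_ntriples(bib_id, fields):
    """Turn a flat dict of MARC-ish fields into N-Triples lines."""
    subject = f"<http://example.org/bib/{bib_id}>"
    lines = []
    for tag, value in fields.items():
        predicate = FIELD_MAP.get(tag)
        if predicate:  # fields with no mapping are simply dropped
            lines.append(f'{subject} <{predicate}> "{value}" .')
    return "\n".join(lines)

print(marc_to_ntriples("123", {"245a": "Linked data and Voyager",
                               "100a": "Chamberlain, Ed"}))
```

The real conversion is messier precisely because Marc21 records are not flat dicts – repeated fields, indicators and subfield ordering all have to be handled.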
  18. Marc21 – data rich, semantically poor. Designed to print out cards for display, and it never really got past that. Hard to generate granular items of data for linking (i.e. triples); a lot is lost. Data is binary encoded and hard to transfer via modern web services – it needs specialised code libraries to crack; XML and JSON are the way forward here. Numbers as field names – why do we need this in 2011? It’s a dark art that bears no relation to the rest of the real world, which makes it very hard for external developers to come in and do this kind of work – it needs specialised knowledge and is developer-unfriendly. Bad characters – the bane of any software developer; XML encoding and validation would deal with this problem. Replication – as we’ve seen above, four fields serving effectively the same purpose. Over one hundred notes fields? Come on …
  19. RDF allows you to freely mix vocabularies – choices of fields to describe your data. There is an emerging consensus on bibliographic description – thankfully no one is attempting to recreate Marc; mainly a use of Qualified Dublin Core and FOAF. Our conversion script is CSV-customisable. The BL and others are leading the way on vocabulary choice – they did some great data modelling, which we steered clear of.
  20. PHP script to match text against LOC subject headings – enrich with the LOC GUID. FAST / VIAF enrichment courtesy of OCLC. FAST – next-generation subject headings – very exciting. VIAF – Virtual International Authority File. OCLC want to develop these as linked services and are keen to help.
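The matching step works roughly like this sketch (in Python rather than the project’s PHP). The HTTP call is omitted; `sample` mimics the OpenSearch-style JSON array that id.loc.gov’s suggest service returns – an assumed response shape, with the heading and identifier included purely as an illustration.

```python
import json

# Stand-in for the body of a request like
#   http://id.loc.gov/authorities/subjects/suggest/?q=Semantic+Web
# (assumed shape: [query, [labels], [descriptions], [uris]]).
sample = json.dumps([
    "Semantic Web",
    ["Semantic Web"],
    ["1 result"],
    ["http://id.loc.gov/authorities/subjects/sh2002000569"],
])

def best_match_uri(response_text):
    """Return the URI of the first suggested heading, or None."""
    query, labels, descriptions, uris = json.loads(response_text)
    return uris[0] if uris else None

print(best_match_uri(sample))
```

Once a heading resolves to a stable LOC URI like this, the bib record can link to it instead of (or as well as) carrying the text string.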
  21. Marc / AACR2 cannot translate well to semantically rich formats. Need better container / transfer standards (not necessarily RDF).
  22. Scaling issues
  23. Triplestores are cumbersome. SPARQL alone does not do the trick – need faster, easier indexes covering the data. One of the advantages of the Talis platform is that they can do this … The high entry barrier to RDF is partly a result of these accompanying technologies – as much as the confusion and complexity around the data.
  24. Building whole systems around RDF is not really a good idea – thankfully they are not doing this Need the flexibility to do this by dropping Marc21 as an internal storage format – Thankfully they are doing this - plenty of other ways to get at data RDF works best on the side, as a separate machine friendly view of data. Ensure any RDF publishing capacity is flexible (as ours is) RDF capability for Primo ?
  25. Standalone RDF is just fiddly Dublin Core, so … Create HTTP URIs for things so they have a permanent name on the web – really exciting. Link it to something useful (LOC, FAST, VIAF). Don’t limit to the bibliographic – if records describe music or film, link to IMDB, Wikipedia or some domain-specific authority … We have a chance to break out of the library bubble here …
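Those two steps – mint a permanent HTTP URI, then point it outward – can be sketched like this. The base URI, path pattern and VIAF target are illustrative assumptions; `rdfs:seeAlso` is used here as a deliberately generic linking predicate.

```python
def mint_uri(base, bib_id):
    """Give a record a stable, dereferenceable name on the web
    (the /id/entity/ path pattern is an assumption)."""
    return f"{base}/id/entity/{bib_id}"

def outbound_links(record_uri, targets):
    """Emit N-Triples linking our URI to external authorities."""
    seealso = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
    return [f"<{record_uri}> <{seealso}> <{t}> ." for t in targets]

uri = mint_uri("http://data.example.ac.uk", "123")
for line in outbound_links(uri, ["http://viaf.org/viaf/99999999"]):
    print(line)
```

The point is that the links need not stay inside the library domain: the same pattern reaches Wikipedia, IMDB or anything else with stable URIs.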
  26. Use URIs as names for things. Use HTTP URIs so that people can look up those names. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL). Include links to other URIs, so that they can discover more things. Away from RDF, triplestores, Marc21 035s and licensing issues, these four points are conceptually the right approach for linked data, or any data that exists on the web.