Lodlam presentation v1.0 final al20151104

Discoverability and the Web
Getting PROV ready for the semantic web

What do I hope you'll take away from this
presentation?
•The web is moving from a web of documents to a web of data
•Making web content machine readable is important for discoverability
•We can also use APIs, web mark-up, and LOD to make our web resources more
discoverable and reusable
•This isn't a fad or a fantasy: it's happening all over the globe right now and we can be
part of it, if we want to, at low cost and in a timely fashion

Before we get into the heavy stuff
Some Big Bang Theory c/o Google
https://www.youtube.com/watch?v=mmQl6VGvX-c

Who needs all this data and who works with
it?
•Researchers who are after data as fine-grained as possible on a given topic, i.e. basically anyone who
isn’t satisfied with just a web page of interpretation (a document) about something
https://en.wikipedia.org/wiki/Vida_Goldstein but would rather supplement this with the granular
details about that thing or person and browse to related data across the web
http://dbpedia.org/page/Vida_Goldstein
•Any organisation that wants to make its web resources as discoverable and usable as possible e.g.
the BBC, the Smithsonian, the Getty, UK National Archives, Digital NZ, National Archives of Korea,
SLNSW, TROVE, SRNSW, Auckland Museum or just check out http://bit.ly/1OGYZYJ
•Anyone who wants to help annotate content on the web for the social good. Think of TROVE or
our own WIKI in which 92 tags have been used 457,750 times across more than 50,000 pages!
Here’s just one example http://wiki.prov.vic.gov.au/index.php/Property:Has_keywords
•Software developers who want to build new applications out of this data to make it more
accessible and engaging. (We’ll look at some real life examples in just a moment).
•Anyone who wants to ask, or allow to be asked, sophisticated questions like “Show me all 20th
Century painters who were born near Timaru”, “Who were Colin McCahon's contemporaries and
let me see a chronology of their major paintings”, “Show me all the Works in Harvard Library by
Swedish Nobel Prize winners”, “How many people died from tuberculosis in Victoria from 1840 to
1940?”, “List all Parish Plans showing allotments purchased by person X from 1900-1915 for up to
300 pounds only”.

Imagine...
Imagine... a researcher in 10 years time who
wants to use research data about the Eureka
Stockade from the Life Sciences, the Humanities
and the decorative arts to examine the
consequences of the event for Victoria’s
economy, environment and art trends from 1854
to 1870. Imagine they have access to a range of
documents but also statistical and other data
from a range of institutions that allows them to
carry this out.

Is this just a fantasy driven by a select few for a niche audience?
In a word...NO. Next slide please...

2014 Linked Open Data
Cloud
http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/#toc2
1014 organisations
About 183 gov’t

The Semantic Web
From a web of documents to a web of data
http://dbpedia.org/page/Jerilderie_Letter

PART 1: EXPLAINING SOME KEY CONCEPTS OF
THE SEMANTIC WEB
Linked Open Data refers to the way in which we have moved from the ability to link web pages and
documents over the web to the ability to link data within those web pages and documents over
the web to related data and documents.
What? Here’s a page in Wikipedia about the Jerilderie Letter
https://en.wikipedia.org/wiki/Jerilderie_Letter and here’s the Wikipedia data behind that
page/topic with links to related data http://dbpedia.org/page/Jerilderie_Letter
LODLAM is the acronym for this Linked Open Data process within Libraries, Archives and Museums.
Let’s start with an example to see how this happens...

A real life example!
http://lodlive.it/LodLive-wiki.prov.vic.gov.au/app_en.html?http://wiki.prov.vic.gov.au/index.php/Special:
This graph is showing us all the metadata contained within the PROV wiki page on the famous
Jerilderie Letter http://wiki.prov.vic.gov.au/index.php/Jerilderie_Letter
The LodLive application was written by a developer based in Italy whom I met at the LODLAM 2015
Conference in Sydney recently. It is a piece of software that he wrote to help humans browse the
linked open data universe in a visual way. The beauty of it is that you can literally follow your nose
through all the connections between resources on the internet via their metadata. Or as in this
example you can just explore the metadata within the wiki page itself. So this is the machine
readable view of the wiki page, which begs the question...

What's the value in machine readable
metadata?
It simply means that as developers come up with new presentation environments for our content
we will be ready to make it accessible to them in a form they can actually use!
Here's a human readable page of the PROV wiki relating to the copy of the famous Ned Kelly
Jerilderie Letter we have in our collection http://wiki.prov.vic.gov.au/index.php/Jerilderie_Letter

Okay, so how do we create machine readable
metadata?
Well, we don’t need to. The beauty of the platform that the PROV wiki is built on (Semantic
MediaWiki) is that it automatically creates it for us for every single page we have created metadata
for. The metadata is turned into a standard data model for machines to read called RDF. It’s the
lingua franca of the semantic web and fortunately there are a lot of smart people out there who
have developed software to transform other data types and models into RDF.
Because our wiki makes all of its contents machine readable using RDF, we also offer developers
a machine readable version of the same wiki page
for them to consume in whatever applications they build for browsing the semantic web.
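To make "machine readable" a little more concrete, here is a minimal Python sketch of how a developer's software might pull subject / predicate / object statements (triples) out of an RDF/XML document. The sample document, the ex: property names and the example.org addresses are made up for illustration; they are not the wiki's actual output, and the parser only copes with the simplest flat RDF/XML.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical RDF/XML: the ex: vocabulary and all values are illustrative only.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/property/">
  <rdf:Description rdf:about="http://example.org/page/Jerilderie_Letter">
    <ex:hasKeyword>Ned Kelly</ex:hasKeyword>
    <ex:relatedTo rdf:resource="http://dbpedia.org/resource/Jerilderie_Letter"/>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples from simple RDF/XML.

    Only handles flat rdf:Description blocks whose properties are either
    plain text values or rdf:resource links.
    """
    triples = []
    root = ET.fromstring(rdf_xml)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree tags look like "{namespace}local"; join them into one URI.
            namespace, local = prop.tag[1:].split("}")
            predicate = namespace + local
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


triples = extract_triples(SAMPLE_RDF)
for triple in triples:
    print(triple)
```

Each printed tuple is one statement about the page, which is the form a graph browser or other application can follow from resource to resource.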

Does that mean we’re reliant on the Wiki?
No, we can actually turn all of PROV’s Function, Agency and Series metadata into Linked Open
Data because we have something very magical called an API!
An AP what? With the help of the developer behind http://metadata.prov.vic.gov.au/provisualizer
we can use the PROV API developed by Kaz and David Fowler to gather A1 metadata
(http://metadata.prov.vic.gov.au/oai/query?verb=ListSets) consisting of 139 Functions, 2579 Agencies
and 15212 Series and turn that into Linked Open Data. It will be inexpensive and fast, and will take
our ACM into the Semantic Web, much as we have already done with the PROV wiki
http://wiki.prov.vic.gov.au/rdf/Public_Record_Office_Victoria_Semantic_Wiki.rdf
When we make Item level data accessible through our API we’ll be able to create Linked Open
Data for it as well, associating it with the archives ontology we deem most appropriate.
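As a rough sketch of what "turning API output into Linked Open Data" could involve, the Python below builds an OAI-PMH request URL for the endpoint quoted above, reads a ListSets-style response and writes each set out as a simple RDF statement. The sample response, the set names and the example.org identifiers are assumptions for illustration only; the real API's output has not been reproduced here, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_URL = "http://metadata.prov.vic.gov.au/oai/query"  # endpoint quoted in the slide
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical response: set names are illustrative, not real API output.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>function</setSpec><setName>Functions</setName></set>
    <set><setSpec>agency</setSpec><setName>Agencies</setName></set>
    <set><setSpec>series</setSpec><setName>Series</setName></set>
  </ListSets>
</OAI-PMH>"""


def build_request(verb, **params):
    """Build an OAI-PMH request URL such as ...?verb=ListSets."""
    return BASE_URL + "?" + urlencode({"verb": verb, **params})


def parse_sets(xml_text):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    return [
        (s.findtext("oai:setSpec", namespaces=OAI_NS),
         s.findtext("oai:setName", namespaces=OAI_NS))
        for s in root.findall("oai:ListSets/oai:set", OAI_NS)
    ]


def to_ntriples(sets, base="http://example.org/set/"):
    """Write each set as one N-Triples line (base URI is a placeholder).

    Names containing quotes or backslashes would need escaping first.
    """
    return [f'<{base}{spec}> <{RDFS_LABEL}> "{name}" .' for spec, name in sets]


request_url = build_request("ListSets")
sets = parse_sets(SAMPLE_RESPONSE)
lines = to_ntriples(sets)
print(request_url)
print("\n".join(lines))
```

A real conversion would map each Function, Agency and Series record to whichever archives ontology is chosen; the label statement here just shows the shape of the step.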

We’re not the first archive to do or think
about this!
http://www.archivesnext.com/?p=3450
•Archives Hub (UK): The Archives Hub provides a gateway to thousands of the UK’s richest archives,
representing over 220 institutions across the country. (http://archiveshub.ac.uk/introduction/)
•Linked Jazz (Pratt Institute): a research project investigating the application of Linked Open Data (LOD)
technologies to digital cultural heritage materials. (https://linkedjazz.org/about-the-project/)
•SNAC (University of Virginia): an aggregate of biographical information about people, both individuals and
groups, who created or are documented in historical resources. Users can search for names of individual
people, organizations, and families; browse featured descriptions; and discover and locate connected
historical resources. Search results can be filtered by occupation and subject.
(http://socialarchive.iath.virginia.edu/snac/search)
•Conal Tuohy (Brisbane-based independent software developer): “I’ve spent a bit of time just recently poking
at the new Web API of Museum Victoria Collections, and making a Linked Open Data service based on their
API. I’m writing this up as an example of one way — a relatively easy way — to publish Linked Data off the
back of some existing API. I hope that some other libraries, archives, and museums with their own API will
adopt this approach and start publishing their data in a standard Linked Data style, so it can be linked up with
the wider web of data.” (http://conaltuohy.com/blog/lod-from-custom-web-api/).
And here is an example of the Linked Open Data he created for 1 item from the MV API http://bit.ly/1Zjge5P
And these are the item details from the MV website http://collections.museumvictoria.com.au/items/1411018
Just one of 93,817 man-made objects they have in their collection http://collections.museumvictoria.com.au/
all accessible through their API http://collections.museumvictoria.com.au/api
See more applications pertaining to documentary heritage here http://summit2015.lodlam.net/
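To give a feel for how Linked Data can sit "off the back of" an existing API, here is a small Python sketch that takes a JSON record of the kind a collection API might return and wraps it as JSON-LD by giving it a web identifier and mapping its field names to shared vocabulary terms. This is not Conal Tuohy's implementation, and the record, its field names and the example.org address are hypothetical; only the Dublin Core term URIs are real.

```python
import json

# Hypothetical API record: field names are illustrative, not the Museum Victoria API's.
api_record = {"id": "items/123", "title": "Example object", "maker": "Example Maker"}

# Map local field names to shared vocabulary terms (Dublin Core Terms).
CONTEXT = {
    "title": "http://purl.org/dc/terms/title",
    "maker": "http://purl.org/dc/terms/creator",
}


def to_json_ld(record, base_uri):
    """Wrap a plain API record as JSON-LD with a resolvable @id."""
    doc = {"@context": CONTEXT, "@id": base_uri + record["id"]}
    # Only carry across the fields we know how to describe.
    doc.update({key: value for key, value in record.items() if key in CONTEXT})
    return doc


linked = to_json_ld(api_record, "http://example.org/")
print(json.dumps(linked, indent=2))
```

The record's content is unchanged; what is new is that every field now points at a definition other software can look up and match against other collections.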

Time to Re-Cap and Breathe
Way back in our first example of the Jerilderie Letter you'll notice that we serve up some really
useful metadata including image URLs, georeferencing data etc. that can all be consumed by
software (i.e. ‘intelligent agents’ as first communicated by Sir Tim Berners-Lee).
http://lodlive.it/LodLive-wiki.prov.vic.gov.au/app_en.html?http://wiki.prov.vic.gov.au/index.php/Special:URIResolver/Jerilderie_Letter
So, increased access to and awareness of our cultural collection by other cultural collections and
links to significant datasets such as DBpedia (the semantic database version of Wikipedia) carries
with it the benefits of increased item count / usage (metrics that feed directly into BP3 stats) and
ultimately continued funding for all the very important work we all continue to do in storing,
preserving and making accessible the State archives to the people of Victoria. However, that’s a
very inward looking view.
The flip side of this is something Richard Lehane from SRNSW touched on in a blog post on APIs in
October 2011, which argues that making their search tool open and accessible to developers via
their API means they can draw on the work of others and inform their own choices around mobile
application development etc., though that’s probably worth a talk by itself for another time
http://data.records.nsw.gov.au/?p=248

More breathing space 
LODLAM is all about promoting the free and open use of collection metadata between cultural
institutions around the world in a way that software can parse and use in various applications that
generate increased value for the user and the organisation re access, usage, interoperability. It’s
still in its early stages but has made significant progress in just a few short years. Tim Berners-Lee’s
vision for a web of data as opposed to a web of documents is not impossible to imagine as really
big organisations across the globe get on board.
So how might this work for PROV?
Well imagine a situation where a researcher has access to related content across all the archives in
Australia or Australasia simply because that content has been annotated with metadata in a
shared language (i.e. RDF) which means software can parse it and make the necessary connections
for search engines to deliver rich results to complex questions. Not only can a researcher explore
related material within a single archival collection but can broaden this out to multiple collections.
And then what if it is then possible to bring in related content from Libraries, Galleries and
Museums as well, all the time filtering out the irrelevant material you don’t want to see?
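The idea of software "making the necessary connections" can be sketched in a few lines of Python. Two institutions publish statements as triples using the same identifier for the same event; once the two sets are merged, one question can be answered across both. All identifiers and data below are hypothetical, and a real system would use an RDF store and a query language such as SPARQL rather than Python sets.

```python
# Hypothetical triples from two collections that share the identifier ex:EurekaStockade.
archive_triples = {
    ("ex:EurekaStockade", "ex:documentedIn", "archive:series/1"),
    ("archive:series/1", "ex:heldBy", "ex:ArchiveA"),
}
museum_triples = {
    ("ex:EurekaStockade", "ex:depictedIn", "museum:object/42"),
    ("museum:object/42", "ex:heldBy", "ex:MuseumB"),
}


def match(graph, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def holders_of_related(graph, topic):
    """Institutions holding anything directly linked from the topic."""
    holders = set()
    for _, _, resource in match(graph, s=topic):
        for _, _, holder in match(graph, s=resource, p="ex:heldBy"):
            holders.add(holder)
    return sorted(holders)


# Because both collections use the same data model, merging is a set union.
merged = archive_triples | museum_triples
print(holders_of_related(merged, "ex:EurekaStockade"))
```

Asked of the archive's data alone, the same question returns only the archive; asked of the merged data, it returns both institutions.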
This cross-collection discovery is the vision of the LODLAM community, which brings me to Part 2 of
this presentation. Don’t worry, it’s going to be brief compared to Part 1.

PART 2: LODLAM SYDNEY 29-30 JUNE 2015
http://summit2015.lodlam.net/about/
100 people from around the world meet up for 2 days every year to try and work out how best to
make LODLAM work. It’s a heady mix of digital humanists, developers, data wranglers, curators,
geeks and people with an interest such as me! I first attended a LODLAM Conference in San
Francisco in 2011, funded by the Internet Archive and the Sloan philanthropic foundation (all you
had to do was apply). This year it was held in Sydney and PROV very kindly paid my registration fee
of US$100.00. It’s run according to the unconference format
“where attendees propose sessions on the first day, start those sessions then on the 2nd day the
same thing happens with a degree of socialising, tweeting etc around the discussions. There are no
keynote speakers. The meeting is based on the two primary principles of passion and
responsibility: passion to jump in and play an active role; and responsibility to lead, and follow
through with action. No papers will be submitted or read, no plenaries given, and everyone will
participate.”( https://en.wikipedia.org/wiki/Open_Space_Technology )

So what did I do?
I tried to get to as many sessions as possible but it was hard and there was so much on offer!
https://docs.google.com/spreadsheets/d/19mfLBoztvaaaik20-P2syANn2fjURzokE-
xILLMOVQ0/pubhtml
I particularly enjoyed
• A pre-conference presentation by Rachel Frick, Digital Public Library of America
g:provaccess managementprojectslodlamfricksydney.pdf
• So you've got a collection API, now what? merged with How to add LOD publication
functions to existing collection management systems. Lightweight, plug-in approaches
• LODlive graph browser. Diego Valerio Camarda
• archive.schema.org. Richard Wallis
I’ll try and give you a brief overview of what I learned:

What is the DPLA?
“The Digital Public Library of America (DPLA) is an all-digital library that aggregates metadata — or information
describing an item — and thumbnails for millions of photographs, manuscripts, books, sounds, moving images,
and more from libraries, archives, and museums around the United States. DPLA brings together the riches of
America’s libraries, archives, and museums, and makes them freely available to the world.”
It is very much about creating a portal for developers to use the metadata to build tools:
http://dp.la/info/developers/
The DPLA uses a number of hubs that reach out to content partners. These hubs facilitate content migration,
providing guidance and support around content/rights and technical issues that might appear.
The DPLA provides a beautiful segue into the role that APIs play in exposing collection metadata to the world
and allowing others to use it to build tools useful to the collecting organisation and the researcher community.

What is an API?
Basically a way into an organisation’s metadata via a programmatic interface. If you want a really
great definition check http://data.records.nsw.gov.au/?p=248
PROV has 2 APIs that I know of, the ANDS API and the PROV wiki API. The first one feeds directly
into Research Data Australia and uses the metadata schema RIF-CS. The second one is a little more
accessible, e.g. http://www.culturevictoria.com/collection-search/, delivering item level content
using the OpenSearch protocol first developed by Amazon.
While this isn’t LOD, both are a step in the right direction of improving our discoverability.
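For a sense of what "programmatic" means with OpenSearch, a search service publishes a URL template with placeholders in curly braces, such as {searchTerms}, and client software fills them in (a placeholder ending in ? is optional and becomes empty if not supplied). The Python sketch below fills such a template. The template address is hypothetical and is not the real address of either PROV API.

```python
import re
from urllib.parse import quote

# Hypothetical OpenSearch URL template; example.org is a placeholder address.
TEMPLATE = "http://example.org/search?q={searchTerms}&page={startPage?}&format=rss"


def fill_template(template, **values):
    """Fill {name} and optional {name?} placeholders in an OpenSearch URL template."""

    def substitute(found):
        name, optional = found.group(1), found.group(2) == "?"
        if name in values:
            # Percent-encode the value so spaces and symbols are URL-safe.
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{(\w+)(\??)\}", substitute, template)


url = fill_template(TEMPLATE, searchTerms="Jerilderie Letter")
print(url)
```

Any developer's application can build a search this way without ever touching the human-facing search page.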

What is mark-up? Schema.org
An initiative launched on 2 June 2011 by Bing, Google and Yahoo! (the operators of the then
world's largest search engines) to create and support a common set of schemas for structured data
mark-up on web pages. At LODLAM in Sydney, Richard Wallis proposed the creation of a
working group to develop an extension to Schema.org to encompass mark-up of web pages
relating directly to archives. An initial model of this has recently been created and, as I
understand it, the NAA will be marking up their pages in the near future. Zoe D’Arcy from the NAA
will keep me informed as to their experience after doing this.
Why bother?
Search engines can deliver richer, more relevant results if they can ‘see’ the context behind web
pages e.g. a mention of Public Record Office Victoria on our website refers to an archive as
described by the Schema.org extension the working group is developing, as opposed to a string of
characters that could be the name of a rock band or all manner of things!

What might the extension look like?
This diagram shows the basic relationship between the proposed main archive, specific types plus
relevant Schema types in the model.

And how might a web page be ‘marked up’?
@prefix schema: <http://schema.org/> .
# An Archive (Organization)
<http://archive.example.com>
    a schema:Archive ;
    schema:name "The Example Archive" ;
    schema:address "The Old Archive, City Square, Anytown" ;
    schema:email "info@archive.example.com" ;
    schema:owns [
        a schema:OwnershipInfo ;
        schema:ownedFrom "1957" ;
        schema:typeOfGood <http://archive.example.com/boolarchive> ;
        schema:ownershipType schema:HasCustodyOwnership
    ] .
# An ArchiveCollection
<http://archive.example.com/boolarchive>
    a schema:ArchiveCollection ;
    schema:name "The Boolean Papers Collection" ;
    schema:creator "Sir Binary Boolean" ;
    schema:accessAndUse "Public view, in archive location, no image reproductions" ;
    schema:itemLocation <http://archive.example.com> .
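The description above is written in Turtle, which is handy for reading but is not what usually gets embedded in a web page. One common way to carry Schema.org mark-up in a page is a JSON-LD script block, and the Python sketch below generates one for the example archive. It assumes the proposed Archive type from the example above, which is part of the draft extension and may change, and it leaves out the ownership details for brevity.

```python
import json

# Same example archive as above; "Archive" is the proposed (draft) type.
archive = {
    "@context": "http://schema.org/",
    "@id": "http://archive.example.com",
    "@type": "Archive",
    "name": "The Example Archive",
    "address": "The Old Archive, City Square, Anytown",
    "email": "info@archive.example.com",
}


def as_script_tag(data):
    """Serialise the data as a JSON-LD script element for a page's HTML."""
    # Escape "</" so that a text value cannot close the script element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


html_snippet = as_script_tag(archive)
print(html_snippet)
```

The visible page stays exactly as it is; the script block is extra context that search engines can read alongside it.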

Conclusions: Yes it’s the end!
•We all want to make the archives as discoverable as possible
•As long as we’re on the net we might as well be on it well (clumsy I know but you get
the gist)
•There are many pieces to the puzzle...APIs, Linked Open Data, non proprietary
software, marking up web pages for Search Engines e.g. Schema.org
• We have the ability to become highly discoverable right now at low cost and in a way
that is scalable.
•What will the Access the Collection of the future look like?
• All it will take is the ability to join the dots. Many others around the world have
already done this so we’re not alone. We are lucky to have some brilliant minds with
exceptional skills in our own back yard so let’s use them.

2014 Linked Open Data
Cloud
1014 organisations

What do I hope you’ve taken away from this
presentation?
•The web is moving from a web of documents to a web of data
•Making web content machine readable is important for discoverability
•We can also use APIs, web mark-up, and LOD to make our web resources more
discoverable and reusable
•This isn't a fad or a fantasy: it's happening all over the globe right now and we can be
part of it, if we want to, at low cost and in a timely fashion
