2. OUTLINE
1. Reminder: what are ARKs
2. 8 years of implementing ARKs at BnF
3. Considerations about evolving the ARK standard
3. REMINDER: WHAT ARE ARKS
A maintaining institution
A specification: http://tools.ietf.org/pdf/draft-kunze-ark-18.pdf
A user registry: http://www.cdlib.org/uc3/naan_registry.txt
A discussion list: http://groups.google.com/group/arks-forum
4. REMINDER, 2: ARK ANATOMY
http://www.flickr.com/photos/jenwaller/2207918246/
http://gallica.bnf.fr/ark:/12148/bpt6k103039f/f26.thumbnail
- http://gallica.bnf.fr/ : Name mapping authority (delivery service)
- ark: : Scheme
- 12148 : Name assigning authority number (NAAN) (the naming authority)
- bpt6k103039f : Name (the resource)
- /f26.thumbnail : Qualifiers (page, variant)
the world > the naming authority > the resource > page > variant
ASSIGN IDENTIFIERS
RESOLVE IDENTIFIERS
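The anatomy above can be illustrated with a minimal Python sketch that splits the example ARK into its labelled parts. The names `ArkParts`, `ARK_PATTERN` and `split_ark` are made up for this illustration, and the pattern assumes that the name ends at the first “/”, “.” or “?”; it is not a full implementation of the ARK specification.

```python
import re
from typing import NamedTuple, Optional


class ArkParts(NamedTuple):
    nma: Optional[str]  # name mapping authority (delivery service), if present
    naan: str           # name assigning authority number
    name: str           # the name assigned to the resource
    qualifiers: str     # everything after the name, e.g. "/f26.thumbnail"


# Assumption for this sketch: the name runs up to the first "/", "." or "?".
ARK_PATTERN = re.compile(
    r"^(?P<nma>https?://[^/]+/)?"
    r"ark:/(?P<naan>[^/]+)/"
    r"(?P<name>[^/.?]+)"
    r"(?P<qualifiers>.*)$"
)


def split_ark(text: str) -> ArkParts:
    """Split an ARK (with or without its delivery service) into its parts."""
    match = ARK_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an ARK: {text!r}")
    return ArkParts(match["nma"], match["naan"], match["name"], match["qualifiers"])


parts = split_ark("http://gallica.bnf.fr/ark:/12148/bpt6k103039f/f26.thumbnail")
print(parts.nma, parts.naan, parts.name, parts.qualifiers)
```

The same function accepts a bare `ark:/12148/bpt6k103039f`, in which case `nma` is `None` and `qualifiers` is empty.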
6. RISKS IN PRACTICE: WHAT OCCURRED?
Originally: ARKs for
- digitized items
- bibliographic records from the main catalogue
New objects
- finding aids
- illuminations
- museographic descriptions
- born digital documents
- virtual exhibitions
New applications
- for new objects
- for existing objects: preservation repository, linked data services
Existing apps, additional features
- full text OCR
- full text search
- audio rendering
Changing domain names
Changing technical environment
Changing organization
7. WHAT CONCLUSIONS?
Anything can happen in 8 years, especially unforeseen cases
The only feasible response is organizational:
Documentation
Commitment
Internal advocacy
Internal “ARK master” task force
8. LESSONS LEARNT
DON’T reassign identifiers!
Only reveal what is meaningful for the end user to cite
Be consistent
Keep it simple
Stay in touch with ARK users
Document procedures
Address needs without overkill
10. SEMANTIC WEB: NEW ARK QUALIFIERS?
Semantic web best practices:
Distinguish the document from the described object
<URI-web-page> <URI-person>
One way to do it:
<URI-123456> <URI-123456#classifier>
This is not compatible with the ARK spec:
An ARK name can only be followed by “/”, “.” or “?”
This could change, provided nobody has used “#” in ARK names
http://data.bnf.fr/ark:/12148/cb11908252k
http://data.bnf.fr/ark:/12148/cb11908252k#foaf:Person
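As a small illustration of the document/thing pattern, the following Python sketch uses the standard library's `urllib.parse.urldefrag` to separate the two URIs above. It only shows how the “#” suffix relates the thing to its document; the variable names are chosen for this example, and nothing here is taken from the data.bnf.fr implementation.

```python
from urllib.parse import urldefrag

# URI of the described thing (a person), as shown on the slide.
thing_uri = "http://data.bnf.fr/ark:/12148/cb11908252k#foaf:Person"

# Dropping the fragment yields the URI of the document describing the thing.
document_uri, fragment = urldefrag(thing_uri)

print(document_uri)
print(fragment)
```

Because the fragment is handled on the client side, both URIs lead to a request for the same ARK, which keeps a single identifier to resolve for the record and the person it describes.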
11. ARK INFLECTIONS
Inflections take A (an ARK) to different things
A by itself (no inflection) means get the named thing
A? means get the thing’s metadata
A?? means get its commitment statement
Possible landing page debate stopper?
A/ means get the thing’s landing page (if any)
A./ means get the preferred payload (if any)
mikebaird on flickr
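The inflection rules on this slide can be sketched as a small Python function. The function name `classify_inflection` and the returned labels are made up for this illustration, and the “/” and “./” inflections are proposals under discussion, so a provider may well ignore them.

```python
from typing import Tuple

# Longer suffixes come first, because "??" ends with "?" and "./" ends with "/".
INFLECTIONS = (
    ("??", "commitment statement"),
    ("?", "metadata"),
    ("./", "preferred payload"),
    ("/", "landing page"),
)


def classify_inflection(ark: str) -> Tuple[str, str]:
    """Return the ARK without its inflection, and what the inflection asks for."""
    for suffix, meaning in INFLECTIONS:
        if ark.endswith(suffix):
            return ark[: -len(suffix)], meaning
    return ark, "the named thing"


for example in (
    "ark:/12148/bpt6k103039f",
    "ark:/12148/bpt6k103039f?",
    "ark:/12148/bpt6k103039f??",
    "ark:/12148/bpt6k103039f/",
    "ark:/12148/bpt6k103039f./",
):
    print(example, "->", classify_inflection(example))
```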
12. WHAT KIND OF PERSISTENCE IS PROMISED?
ARKs need metadata to express things like:
Persistent and unchanging content (rare)
Persistent but dynamic content (eg, NLM home page)
Persistent but correctable (eg, most curated content)
Persistent but growing (eg, streaming data, journal)
And who are you to promise that?
Your organizational mission
Your private/public/non-profit status
Any inspectable track record (eg, link uptime stats)
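To make the idea of persistence metadata concrete, here is a purely hypothetical Python sketch of what such a statement might carry. No such vocabulary exists in the ARK specification: the class names, field names, labels and the example ARK are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class PersistenceKind(Enum):
    """The four kinds of persistence listed on the slide (labels are hypothetical)."""
    UNCHANGING = "unchanging"
    DYNAMIC = "dynamic"
    CORRECTABLE = "correctable"
    GROWING = "growing"


@dataclass(frozen=True)
class PersistenceStatement:
    ark: str
    kind: PersistenceKind
    organization: str         # who is making the promise
    organization_status: str  # e.g. "public", "private", "non-profit"

    def summary(self) -> str:
        return (
            f"{self.ark}: persistent, {self.kind.value} content, "
            f"promised by {self.organization} ({self.organization_status})"
        )


statement = PersistenceStatement(
    ark="ark:/99999/example",  # made-up ARK, for illustration only
    kind=PersistenceKind.GROWING,
    organization="Example Library",  # made-up organization
    organization_status="public",
)
print(statement.summary())
```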
13. GOING FORWARD
Discussion about evolving the ARK specification
Sharing best practices and implementation experiences
Interested? Stay tuned on
http://groups.google.com/group/arks-forum
14. THANK YOU FOR YOUR ATTENTION!
BnF implementation
sebastien.peyrard AT bnf.fr
jean-philippe.tramoni AT bnf.fr
ARKs at CDL:
john.kunze AT ucop.edu
Editor's notes
Back in 2006 we laid the foundations: we chose ARK as our persistent identifier scheme.
Now we have around 20 million ARK identifiers assigned.
What changed over time?
In an ideal world, “nothing”, because it is persistent.
In practice, almost everything changed.
That is the point of this short feedback section, which is kind of stating the obvious: we do not have to see persistence as eternity. Eternity is paralysing.
We need to find a workable time span in which the real questions occur. Looking back at 8 years of implementation and moving forward is a way to do this.
The risk is increased complexity:
New objects: the risk is to multiply identifier assignment procedures.
New applications: as the person responsible for ARKs, you have a growing number of apps to watch over as they evolve, to make sure there is no regression in resolving ARKs.
Existing objects, new applications: e.g. our ARKs for catalogue data are displayed by our main MARC catalogue AND by our linked data service, data.bnf.fr, but you need to be clear that you are talking about the same thing. I will elaborate upon this in the second part of the presentation.
Existing apps, additional features: the people who maintain and evolve the apps kind of own the apps. They find ARKs work pretty well, but they tended to define their own qualifiers each time there was a new service -> qualifier proliferation. Good news in a sense: ARKs became business as usual. It works! So what?
Changing technical environment: among other things, a hugely increasing flow of incoming requests on our resolver, and an increasing number of applications: we needed to evolve the resolver architecture and make it more modular.
Changing organization: 8 years ago, a team of 7 experts from 2 departments. Now: only one person from the original team remains, and one of the departments no longer exists! Now the audience is less “pioneer”, less technical. ARKs became business as usual (curators who use ARKs for citations, web application managers, linked data experts, …).
Build an organization around our persistent identifier implementation that is responsive to changes and emerging risks, which means adapting the documentation and communication to non-experts, so that people can understand the key requirements and what is at stake!
-> This must be solid but lightweight at the same time: 2 people, one from the librarian side and one from the IT side, who act as internal consultants on ARKs. We set up an internal communication task force with around 40 people who use ARKs at several levels, to explain the basics and to make sure the 2 “ARK masters” are known.
Another thing is that it is much easier to start with what you should NOT do than with what you should do.
So we start with the don’ts when we communicate with people, to set clear limits (apart from that, everything is open to discussion and negotiation).
Qualifiers: any technical parameters, like the search keywords for digitized books, tend to be stuffed into the URI as ARK qualifiers. Tell people we do not need that, because end users do not want to cite the searched words: they want to cite the page, or the digitized book.
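As an illustration of “only reveal what is meaningful for the end user to cite”, the following Python sketch reduces a URL to its citable part. It assumes, based on the “/f26” example from the anatomy slide, that a page qualifier has the form “/f” followed by digits; the function name `citable_ark`, the pattern and the `keyword` parameter in the sample URL are hypothetical and not BnF's actual rules.

```python
import re

# Assumptions for this sketch: the name ends at the first "/", "." or "?",
# and a page qualifier looks like "/f26".
CITABLE = re.compile(r"^(?P<doc>.*?ark:/[^/]+/[^/.?]+)(?P<page>/f\d+)?")


def citable_ark(url: str, keep_page: bool = True) -> str:
    """Keep the ARK (and optionally the page); drop every other qualifier."""
    match = CITABLE.match(url)
    if match is None:
        raise ValueError(f"no ARK found in {url!r}")
    page = match["page"] or ""
    return match["doc"] + (page if keep_page else "")


# The "keyword" parameter below is invented to stand for a search term.
url = "http://gallica.bnf.fr/ark:/12148/bpt6k103039f/f26.thumbnail?keyword=example"
print(citable_ark(url))                   # cite the page
print(citable_ark(url, keep_page=False))  # cite the digitized book
```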
The previous part was about practice.
What follows are considerations, either grounded in BnF use cases or initiated by CDL, that we believe could translate into useful evolutions of the standard.
There is a debate in the research data citation community about requiring the default behavior for persistent ids to take you to a landing page. On the one hand, a landing page can give you all the context you need to find out more about the dataset, such as newer versions. On the other hand, landing pages are not machine actionable, so you cannot link persistently to, say, an inline image or a CSV file. Requiring only one behavior or the other would be a hard choice.
The default behavior for ARKs is not specified, but one way out is to permit a user or a provider to construct or publish ARKs with an indicator of what to expect. With a random ARK found in the wild, the user cannot know in advance whether a landing page exists, but the user might still request a landing page experience if there is one. Similarly the user could request the canonical (provider preferred) “immersive” experience of the object. The provider would be free to ignore the “./” and “/” inflections (as if they hadn’t been supplied) or to support them.
Persistence isn’t either/or, on/off – it’s nuanced.
The ARK spec describes different kinds of persistence but has no metadata vocabulary for providers to express it to users.
Unchanging content is rare – political pressure usually trumps preservation except where legal requirements hold. Some institutions, for example, are required to hold certain content unchanged for a period of time, and to delete it after that.
Dynamic content is common – probably every national library in the world will claim that their home page is persistent and persistently thematically relevant, but all of those home pages are dynamic.
Most deliberately curated content is correctable – with responsibility comes political pressure to make sure it’s right, non-impinging, safe, respectable, etc.
A lot of the debate about datasets that are appended to every 6 seconds, and so appear highly dynamic and unsuitable for citing, would suddenly stop if we could assure people that any content, once written, will be persistent, perhaps even unchanging, and that the only change to expect is new content showing up at the end of the dataset.
Anyone can promise anything. What is the nature of your organization?