This document provides an introduction to persistent identifiers (PIDs) and their use in the EUDAT system. It defines PIDs as globally unique identifiers that can be used to persistently identify digital objects. The document discusses why PIDs are useful, describing problems with URLs like link rot. It then covers different PID systems like Handle and DOI, as well as EUDAT's use of Handle through the B2HANDLE service. The document also discusses PID policies, use cases, and the B2HANDLE Python library for programmatic PID management.
Introduction to Persistent Identifiers| www.eudat.eu |
1. www.eudat.euEUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065
Introduction to Persistent
Identifiers
PIDs in EUDAT
Version 2
July 2017
This work is licensed under the Creative
Commons CC-BY 4.0 licence
2. Content
What are persistent identifiers?
Why use persistent identifiers?
Different persistent identifier systems
The HANDLE system
EPIC PID system
B2HANDLE
Policies
Use cases
4. Science Data
Data generation is getting easier/cheaper
Complexity-shift from data generation to data processing &
analysis
The amount of data output is increasing, quality is getting
better
How to stimulate reuse and enable reproducibility?
Data needs to be
ReusableAccessible
Findable
Interoperable
5. Briefly, what are PIDs?
Pointers to data resources
Data files, metadata files, documents …
Globally unique
With infinite lifespan
Can be used to identify and retrieve resources
Can be resolved to the resource
Examples: ISBN, DOIs, PURLs, Handles…
PID Training
7. What is the Problem? Why not use simple
URLs?
The URL specifies the
location, on a
particular server, from
which the resource
could be retrieved.
Strictly network
locations for digital
resources.
domain may change
resource may be
relocated
link may change
B2SAFE Training
BUT
URLs a year or two later, often no longer workin the long term
In the long term URLs a year later, often no longer work
“link rot”
8. Persistent over time
today … ... ... 2030
11839/abc123 11839/abc123
11100
00100
01111
11100
00100
01111
http://www.example.com/ http://www.moved.com/
Supports access to resource as it moves from one location to another.
.. by design
9. Why can Persistent Identifiers help?
A Persistent Identifier is
distinct from a URL
not strictly bound to a specific server or filename
“A persistent identifier (PID) is a long-
lasting reference to a digital object—a
single file or set of files.“
https://en.wikipedia.org/wiki/Persistent_identifier
11100
00100
01111
11839 / abc123
resolution
prefix suffix
Identifier points to a resource with no actual
knowledge of the resource
Responsibility of the PID owner to keep it up-to-date
when the resource changes
10. Structure of a Persistent Identifier
points to a resource Is globally unique
11100
00100
01111
11839 / abc123
resolution
prefix sufffix
11839 / abc123
prefix sufffix
Once the PID is created, the
resource is globally addressable.
Data
Metadata
Document
Code
Prefix: designates administrative
domain, comes from an issuing
instance
Suffix: unique in the realm of the
prefix
11. Persistent over time
today … ... 2030
11839/abc123 11839/abc123
11100
00100
01111
11100
00100
01111
http://www.example.com/ http://www.moved.com/
.. by design
Update information
Redirection
Stable
12. PID Benefits
Persistent Identity via Indirection
Static references into fluid systems over time
Data on networks moves
Ownership/responsibility change
Formats change
Embedded IDs
For data object in hand – current state data
Updates
New related entities
Networks of Persistent Links
Data / metadata links
Provenance chains
13. PID Costs
Extra level of effort / cost on creation
Analysis – what to identify (granularity)
Folders, files
Single measurements in a time series experiment
Coordination across organisations
Maintain resolution system
Persistence requires sustained effort
Organisational discipline
Technology necessary but not sufficient
Analyse cost/benefit ratio
Don’t start unless it is worthwhile
Is your data worth it?
15. Persistent Identifier structure
Every persistent identifier consists of two parts: its prefix and
a unique local name under the prefix known as its suffix
Prefix - designates administrative domain, is generated by
an issuer, which makes sure that all prefixes are unique
Suffix - local name must be unique under its prefix.
The uniqueness of a prefix and the local name under that
prefix ensure that any identifier is globally unique within the
context of the System.
PID Training
< PREFIX > / < SUFFIX >
(e.g. 11111/123456745)
16. PID Systems
Persistent URLs (PURLs)a
Cost: no
Metadata: No additional metadata
purl: GPO/gpo46189
EPIC Systemb
Cost: $50 annual fee per prefix
Metadata: Associate any metadata
hdl:11210/123
Digital Object Identifier (DOI)d
Cost: fee per DOI + annual fee
Metadata: The INDECS schema,
stored in separate
database
DOI: 10.1000/182
Archival Resource Key (ARK)c
Cost: no
Metadata: ERC (Electronic Resource
Citation) metadata
ark: /12025/654xz321
Based on: Handle System
17. PID system Requirements
Attach multiple URLs to a PID
Allow part identifiers for complex
objects. Granularity issue
Allow attaching of extra metadata
to the PID (MD5 check, etc)
Actionable (i.e. converted to URL)
PIDs
HTTP proxy for resolving (use port
80 only)
Controlled by community
Programmable interface for
administration of PIDs from
applications
Delegation of PID administration
to other organisations
Distributed, robust, highly-
available, scalable
No single-point of failure,
distributed system with mirroring
Acceptable non-commercial
business model
PID Training
18. Identifier String Requirements
Not based on any
changeable attributes of
the entity, e.g.:
Location
Ownership
Any other attribute that
may change without
changing identity
Unique
Avoid conflicts and
referential uncertainty
A good PID system should
not allow you to use the
same suffix twice
Opaque, preferably a
“dumb number”
A well known pattern
invites assumptions that
may be misleading
Meaningful semantics
invite IP wars, language
problems
Nice to have
Human-readable
Cut-able, paste-able
Fits common systems, e.g.
URI specification
PID Training
that contribute to persistence
19. PIDs in EUDAT
EUDAT has adopted
Handle-based persistent
identifiers
A combined solution of
Handle system and EPIC
service
Employing the latest Handle
v.8
EUDAT developed a library
to interact with Handle v.8
B2HANDLE
PID Training
21. The Handle System
The Handle System is a technology specification for
assigning, managing, and resolving persistent
identifiers for digital objects and other resources.
The protocols specified enable a distributed computer
system to store identifiers (names, known as Handles)
of digital resources and resolve those Handles to the
information necessary to locate, access, and otherwise
make use of the resources.
That information can be changed as needed to reflect
the current state or location of the identified
resource without changing the Handle.
PID Training
22. The Handle System
The main goal of the Handle system is to contribute to
persistence.
The Handle system is:
reliable
scalable
flexible
trusted
built on open architecture
transparent
PID Training
23. A Handle Record
Handle Data
Type/KEY
Index Handle data Timestamp
10232/1234 URL 1 https://www.eudat.eu/ex1 2014-04-
09 12:46:53Z
DOMAIN 2 EUDAT 2014-04-
09 12:46:53Z
HS_ADMIN 100 eudat/user1 2014-04-
09 12:46:53Z
PID Training
PID – handle: 10232/1234
Actionable PID (URL/resolving): http://hdl.handle.net/10232/1234
24. Resolving Handle Record
PID Training
Global Registry
E.g. Handle
system
3. Client gets request
to resolve hdl:10232/1234
1. Client sends request to Global to resolve
0.NA/10232 (prefix handle for 10232/1234)
2. Global Responds with Service
Information for 10232
#1
#1
#2
#3
Secondary Site A
Secondary Site B
Local Service
#1 #2
Primary Site
4. Server responds with
handle data
Service Information
Local Handle Service
IP xc xc xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
xc
..
..
..
xc
xc
xc
..
..
..
xc
xc
xc
..
..
..
...
xcccxv
xccx
xccx
xcccxv
xccx
xccx
xcccxv
xccx
xccx
25. HANDLE Record Types
Common types
URL: the location the
HANDLE should resolve to
HS_ADMIN: special record
encoding the permissions
configured for this HANDLE
10320/LOC: supports
multiple locations based on
intelligent decision.
Custom EUDAT types
EUDAT/CHECKSUM: Useful for
integrity verification
EUDAT/ROR: Repository of
Records, the ID of the community
repository.
EUDAT/FIO: PID to first ingested
object in the EUDAT domain.
EUDAT/PARENT: PID associated
with the source object in a
replication chain.
EUDAT/REPLICA: List of PIDs
pointing to replicas
27. PID System: How does it work?
PID Service
generate and manage
PIDs for digital objects
according to policies
Example: B2HANDLE
python library (next
section)
PID Training
PID Replication (as one
of the EPIC policies)
replicate the database of
Handles to partners in
EPIC to guarantee robust
and highly available PID
resolution
Resolution Service
EPIC uses the distributed
network provided by
Handle and extends it
with own local Handle
servers.
Global Handle Mirror
A mirror of the Global
Handle in Europe
Handle system and EPIC
28. Resolution Service
The web address for the Handle resolution service that
EUDAT uses is http://hdl.handle.net.
PID Training
29. EUDAT options for PIDs
In order to access a data object stored in EUDAT, an
associated persistent identifier is needed.
EUDAT requires integration of Handle in your
infrastructure. Before your community or data centre can
create PIDs you need a prefix. There are two options:
you can run your own Handle system; or
you can pass the details to EUDAT partners to
manage it on your behalf.
additional benefit of using the EUDAT systems is
access to a Python library to manage your PID
Handles
PID Training
31. What is B2HANDLE?
B2HANDLE is EUDAT’s PID service based on
Handle as technology
EPIC as federation
B2HANDLE offers:
Assignment of prefix via one of the EUDAT partners
Hosting of PIDs, i.e. operation and maintenance of Handle
servers and technical services
Replication and safe-keeping of PIDs via the EPIC
federation
Resolution mechanism based on Handle
Easy maintenance and programmatic resolving of PIDs by
the B2HANDLE Python library for general interaction with
Handle servers
PID Training
32. B2HANDLE in other EUDAT services
In the EUDAT ecosystem, EUDAT services make use
of B2HANDLE to:
guarantee data access
provide long lasting references to data and
facilitate data publishing.
PID Training
B2SAFE and B2SHARE use the service to create and
manage PIDs for their hosted data objects.
B2FIND and B2STAGE use the resolving mechanism of
B2HANDLE to retrieve and refer to objects.
33. The B2HANDLE Python library
b2handle: A Python library for interaction with EUDAT
Handle services (Handle version 8)
Setup tools-enabled Python package easy
installation
Can be employed by end-users to programmatically
resolve handles
Credentials to one of the EUDAT Handle servers are
required for creation and maintenance of PIDs
Stable state; official release of v1.0 also for use by
EUDAT user communities
36. B2HANDLE library features
Methods to read, create, modify Handles and their
records
Queries against native Handle REST interface
Support for multiple locations per object (10320/loc
entries)
Automatic management of Handle value indexes
Support for Handle reverse-lookup via additional Java
servlet
Support for resolving any Handle from any issuing
instance
38. How may I use a PID
When you have a PID use it:
To cite the data behind the PID:
In publications
On web-pages
Include actionable PIDs in linked data
Retrieve the data:
By using the corresponding resolver
Via the actionable PID
E.g. http://hdl.handle.net/11239/GRNET
PID Training
39. Policy Document
When to use Persistent Identifiers?
What should the PID resolve to?
There is no “one-size fits all” strategy for
implementing PIDs!
Create a Policy Document of What & When
Analyze the use of PIDs, create a policy for the
management
What to register
When it enters the data management life cycle
PID Training
analysis and thought
40. Policy Document
Simple Questions
Which data objects need a PID (collections, files, metadata
records)?
What kinds of data are likely to stay online long enough?
What kinds of data are likely to be linked to your PIDs?
What kinds of data are likely to be analysed/processed with
tools?
What will happen after data goes off-line?
etc..
PID Training
analysis and thought
41. PID Policies for EUDAT services
Each Service follows its own Policy for managing PIDs.
One of the main policies they all follow is the non-
deletion policy:
Once a PID is generated it is not allowed to delete
it.
E.g. B2SAFE and B2SHARE use the service to
create and manage PIDs for their hosted data
objects. They both create their own PID types (Keys)
in the PID record.
PID Training
43. Example 1: B2SHARE
The persistent identifier for
files, download single files
Cite the whole data publication
44. Example 2: B2SAFE
B2SAFE employs PIDs to keep track and link replicas of
data in the EUDAT network
45. Example 3: Enable data flows
PID Training
Link directly to the data (?locatt=id:n)
Optionally include a (mime) type in the Handle record –
can be used to select appropriate tooling
46. Summary
Persistent Identifiers provide a solution to the “link rot”
problem by providing an extra layer of indirection
Several systems are available with different conditions
PIDs do it yourself: Use a Policy Document
The HANDLE system - via EPIC policies - is the foundation
for EUDAT’s B2HANDLE service:
Low cost, only a flat annual fee
Robust, scalable and performing
Flexible, allows addition of any metadata
Provides a global resolver
48. www.eudat.eu
Authors Contributors
This work is licensed under the Creative Commons CC-BY 4.0 licence
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures.
Contract No. 654065
Themis Zamani, GRNET
Willem Elbers, CLARIN
Christine Staiger, SURFsara
Ellen Leenarts, DANS
Kostas Kavoussanakis, EPCC
Thank you
Notes de l'éditeur
Data generation is getting easier/cheaper.
At the same time there is a shift from data generation to data processing & analysis. A new way to do science.
As a result the number of data output is increasing. A new data world with science data.
One of the grand challenges of science data is to facilitate knowledge discovery by assisting humans and machines in data access, integration and analysis.
So as to make the data world a better place for science we must have some data principles in mind.
The idea of these data principles is:
Data should be Findable
Findable – Easy to find by both humans and computer systems Metadata
Data should be Accessible
Accessible – Stored for long term, accessed and/or downloaded with well-defined license and access
Data should be Interoperable
Interoperable – Ready to be combined with other datasets by humans as well as computer systems;
Data should be Re-usable.
Reusable – Ready to be used for future research and to be processed further using computational methods.
PIDs can help to identify and locate data. In some cases PID systems can also be used for keeping metadata that is vital for making data interoperable on the technical side.
What is actually the problem we are trying to solve? For this we have to look at the data creation process.
This life cycle applies to any online digital object.
analysis & enrichment. Data analysis is the process of applying data analysis techniques to a large amount of data, typically in big data repositories. It uses specialized algorithms, systems and processes to review, analyze and present information in a form that is more meaningful for organizations or end users. At the same time data enrichment is a general term that refers to processes used to enhance, refine or otherwise improve raw data. This idea and other similar concepts contribute to making data a valuable asset. Some examples are:
interpret data
derive data
produce research outputs
author publications
prepare data for preservation
registration & preservation: Data preservation, or more specifically, digital data preservation, refers to the series of managed activities necessary to ensure continued access to digital objects for as long as necessary. Long-term preservation can be defined as the ability to provide continued access to digital objects, or at least to the information contained in them, indefinitely.
All theses steps produce:
Temporary data
Referable data
Citable data
Identifiers of the kind we are discussing are themselves digital objects; so they are subject to the same life cycle.
PIDs can help to keep track of generated data and its relations.
Let’s see the use of Data URL as a means to find and access data.
The URL specifies the location, on a particular server, from which the resource could be retrieved. Strictly network locations for digital resources.
Are URLs persistent?
Suppose you want to publish online your research outputs. The transitional way to store data is to upload to a site, a repository, a directory. In order to access it you bookmark or share this URL. So you
Publish it online at some address http://www.test.com/test.html.
Other users may cite, access, re-use this url
As long as nothing changes about the way the data is accessed, this works fine. But one day you decide to move the resource to another location. So
relocate the resource at http://www.example.com/
Other users are not informed about this relocation and when they are trying to access the resource - at the first location – they always get a Page Not Found response.
Apart from this administrative change, you may have experienced:
domain may change (like the example)
resource may be relocated: The directory structure is rearranged: subdirectories are created for each collection.
link may change: The researcher decided to use a different platform, with different url queries to retrieve the resource.
You will always get “a Page Not Found” response.
In the long term however, URLs a year later, often no longer work. So this arrangement has proven to be fragile.
We could say that the current need for persistent identifiers came out.
This is what we want: a string that always resolves to the data.
Persistent over time. By design. Even if the real location of the data changes.
A Persistent Identifier is
distinct from a URL
not strictly bound to a specific server or filename.
An identifier is a unique name, identity applied to a digital object so that this object can be easily referenced. It is a reference to the digital object.
“A persistent identifier (PI) is a long-lasting reference to a digital object—a single file or set of files. “
The identifier points to a resource with no actual knowledge of the resource. It is the Responsibility of the owner to keep the PID up-to-date when the resource changes.
We are going to talk about it in our examples.
points to a resource
Identifier points to a resource, with no actual knowledge of the resource.
The resource is a black box. The type of the URL doesn't matter. It may be a file, a metadata record, a code collection.
Is globally unique
You won’t find an identifier with the same name that points to another resource. The system ensures that by design.
Once it is created, the resource is globally addressable.
PIDs solve these problems by introducing a “Redirection layer”
The user has an opaque string, which is resolved to a URL.
PID points to a URL which points to the digital object.
If the digital object is moved, its ownership is changed, or the organization of the objects is changed, the URL is often changed. With PIDs you can easily update this information, while the user can still employ the PID to refer and retrieve the data.
PIDs make use of a redirection layer bridging the stable and unstable worlds at the cost of some administrative responsibilities. The PID can be updated to point to the new URL.
PIDs introduce a stable layer of redirection on top of more unstable identifiers such as URLs.
PIDs provide a layer of redirection
PID points to a URL
The URL is unstable
The PID is stable
Update procedures need to be defined and thought through to enable stable referencing.
Static references into fluid systems over time
Data = digital object. A digital object may be moved, removed or renamed for many reasons.
It can move to other servers or even to other organisations.
It may even be changed to another format. (ex from xls to zip) .
Persistent identifiers are there to continue to provide access to this resource, so the digital object gets a Static reference into fluid systems over time.
Embedded IDs
Apart from the main information (ID, location of the object), a number of related data could be stored in the PID record.
Some ideas of these IDs are a) the version of the item or b) new related items
This means that the user always knows the current and latest state of the data
Networks of Persistent Links
Persistent identifiers may contain info about other digital objects (Data / Metadata links). This may create a chain of links between digital objects.
There are some costs when using PIDs
Extra level of effort / cost on creation: when you decide to use PIDs, data and PIDs are strictly connected. In the management life cycle you must include a new task "managing the persistent identifier for the data”.
Analysis – what to identify / granularity: Analyse the need for PID in your system. Which digital objects must have a PID? Do you need to assign a PID to all your files?
Coordination across organisations: Often, an institution already cooperates with other institutions that deal with a similar environment. So a coordination across organisations is needed
Maintain resolution system: There is a cost for maintaining a resolution system
Persistence requires sustained effort
Organisational discipline: An Organisational discipline must be followed to achieve persistence
Technology necessary but not sufficient.
Analyse cost/benefit ratio. Checklist and questions are mentioned in this presentation.
Don’t start unless it is worthwhile
Is your data worth it?
Let’s start by looking at the persistent identifier string.
Every identifier consists of two parts: its prefix and a unique local name under the prefix known as its suffix
Any suffix - local name must be unique under its local namespace. The uniqueness of a prefix and a local name under that prefix ensures that any identifier is globally unique within the context of the System.
Lets see some popular persistent identifier systems
PURLs
PURLs s are URLs which redirect to the location of the requested web resource using standard HTTP status codes. A PURL is thus a permanent web address which contains the command to redirect to another page, one which can change over time.
(Persistent Uniform Resource Locators) are Web addresses that act as permanent identifiers in the face of a dynamic and changing Web infrastructure. Instead of resolving directly to Web resources, PURLs provide a level of indirection that allows the underlying Web addresses of resources to change over time without negatively affecting systems that depend on them. This capability provides continuity of references to network resources that may migrate from machine to machine for business, social or technical reasons.
Cost: no
Metadata: Does not support additional metadata
Handle System
The Handle System is a technology specification for assigning, managing, and resolving persistent identifiers for digital objects and other resources on the Internet.
Cost: $50 annual fee per prefix
Metadata: Associate any metadata
ARK
is an identifier scheme conceived by the California Digital Library (CDL), aiming to identify objects in a persistent way. The scheme was designed on the basis that persistence "is purely a matter of service and is neither inherent in an object nor conferred on it by a particular naming syntax". An Archival Resource Key (ARK) is a Uniform Resource Locator (URL) that is a multi-purpose identifier for information objects of any type. An ARK contains the label ark: after the URL's hostname.
Cost: no
Metadata: ERC (Electronic Resource Citation) metadata
DOI
The Digital Object Identifier (DOI) was conceived as a generic framework for managing identification of content over digital networks, recognising the trend towards digital convergence and multimedia availability. A DOI name is an identifier (not a location) of an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks. It is based on the Handle Server for the resolution service.
Cost: fee per DOI + annual fee
Metadata: The INDECS schema
(as per slide)
(as per slide)
As we have already mentioned, a persistent identifier is a long-lasting reference to a digital object. EUDAT data domain handles registered data and each digital object should have a persistent identifier.
This persistent identifier is used for
- Replica identification
- Identification of the repository of record (in the case of replication)
- Querying of additional information
- Checksum (time stamped)
A persistent Identifier helps you
- access
- use and re-use
- verify
your data
EUDAT has adopted Handle-based persistent identifiers
A combined solution of handle system and EPIC service
So let’s discuss the Handle System
EUDAT has adopted the Handle system and EPIC. We will now dive into the technical details on Handle.
(as per slide)
The Handle system is:
reliable (using redundancy, no single points of failure, and fast enough to not appear broken);
scalable (higher loads simply managed with more computers);
flexible (can adapt to changing computing environments; useful to new applications):
trusted (both resolution and administration have technical trust methods; an operating organization is committed to the long term);
builds on open architecture (benefits from effort of a community in building applications on the infrastructure);
transparent (users need not know the infrastructure details).
PIDs such as Handles, are actually records in a database.
One can convert the PID to a URL by appending it to the address of the resolver.
When creating a Handle, the fields URL and HS_ADMIN are mandatory.
The URL field is the actual location of the data and, when it’s an HTTP URL, clicking (resolving) the Actionable PID will redirect to what is written in the URL field.
Each Handle may have a set of other values assigned, like in this example, the field “DOMAIN”, which denotes the domain where the data lives and who is responsible for the data. This is defined by the community or the domain the handle belongs το.
(It is just an example not a real PID key that is used by EUDAT!)
These Handle values use a common data structure for their data. For example, each Handle value has a unique index number that distinguishes it from other values in the value set. They also have a specific data type (or Key) that defines the syntax and semantics of the data in its data field.
Besides these, each handle value contains a set of administrative information such as TTL and permissions.
How does the resolving of a Handle work when you click on an actionable PID?
For any HTTP request, for example
http://hdl.handle.net/10232/1234
one of the proxy servers will query for the Handle, take the URL in the Handle record (or if there are multiple URLs in the Handle record it will select one, and that selection is in no particular order) and send an HTTP redirect to that URL to the user's web browser. If there is no URL value, the proxy will display the handle record.
Now let us inspect some data types or keys that are 1) predefined or 2) service or user specific.
The Handle record supports a number of record types.
Every Handle value must have a data type specified in its <type> field.
The <type> field identifies the data type that defines the syntax and semantics of data in the next <data> field. The data type may be registered with the Handle System to avoid potential conflicts.
Each field in a Handle record is timestamped.
Some common types are:
URL: one location referenced by this HANDLE
HS_ADMIN: special record encoding the permissions configured for this HANDLE. Each handle has one or more administrators. Any administrative operation (e.g., add, delete or modify handle values) can only be performed by the Handle administrator with adequate privilege. Handle administrators are defined in terms of HS_ADMIN values.
10320/LOC: supports multiple locations based on intelligent decision.
Some custom types used by EUDAT are:
Checksum: Useful for integrity verification
EUDAT/ROR: EUDAT specific for B2SAFE. ROR: (Repository of Records), the repository where data was stored first.
EUDAT/PPID: EUDAT specific for B2SAFE. the PID associated to the source object in a replication chain. If the chain has only two elements, the master copy and the first replica, then the PPID = ROR.
Anything else you like
What is it that EUDAT uses exactly and how?
As we have already mentioned, EUDAT supports a combined solution of Handle system and EPIC service, i.e. not all PIDs in EUDAT are maintained in the EPIC federation, but all PIDs in EUDAT are Handles and can be supported by the same hardware and software.
We had a closer look at the Handle system and what it provides technically. We will see now what EPIC adds to the pure technology.
ePIC provides many highly reliable, redundant and performant services to the scientific research community.
PID Service: The first service which is publicly visible is the PID service. The PID Service is the main interface to register and manage persistent identifiers in ePIC . It is implemented as a RESTful web service and it is continuously being developed by ePIC. We will see how to use the PID service in the section B2HANDLE.
Resolution: It is responsible for forwarding the user to the current location of the object (such as author or expiration date). ePIC utilizes the Handle System to achieve a redundant and load-balanced setup between the data centers.
Replication: Currently, five European data centers work together to replicate each other’s persistent identifiers. When a data center is temporarily not available, the other ePIC centers still resolve the PIDs.
Mirror: The ePIC system is based on a worldwide hierarchy with the Global Handle Systems on the top of it .These systems are registries where the most important information of the prefixes is stored. One of EPIC founders (GWDG) runs a mirror so as to assure the resolution of prefixes in Europe, even if other parts of the global network is temporarily not available.
Most of the them are hidden, except from the one responsible for PID registration, which is publicly visible.
Based on the Handle resolution mechanism we saw earlier.
Resolution: It is responsible for forwarding the user to the current location of the object.
The PID Resolution system of ePIC is responsible for forwarding the users to the current location of an identified object. In addition to the current location, other information about the object (such as author or expiration date) can also be provided. ePIC utilizes the Handle System to achieve a redundant and load-balanced setup between the data centers. ePIC replicates the PID databases to guarantee high availability of the PID resolution. The resolution services of ePIC are also included into the worldwide Handle infrastructure to guarantee a highly reliable and performant resolution of PIDs issued by ePIC .
(as per slide)
Now we will dive into the subject, what do you need to arrange for on the organisational side, when you want to use PIDs in your project or at your institute.
When you are a user you can use any PID
For publishing and referencing data in your papers and on web pages
You can include them as online resource in linked data
You can fetch the data by resolving the PID via the resolver or via an actionable PID
Now that I know how to use a PID when should I create – mint a PID?
“When to use persistent identifiers”?
It should be noted that among all the concepts which have been introduced there is no ‘one size fits all’ strategy for implementing persistent identifiers. Although the basic problems to be solved are the same, each of the systems addresses them in its own way on different administrative and technical levels. It is not possible to formulate one single recommendation for all.
Create a Policy Document of What & When
Analyze the use of PIDs, create a policy for the management
What to register
When it the data management life cycle
Determining
First of all, one must carefully analyse its current use of identifiers in general. In most cases, where data is collected, it is identified in some way. If it is data about other data or objects – metadata – it will often contain an identifier for the referred item.
Answer these simple questions
Which data objects need a PID (collections, files, metadata records)?
What kinds of data are likely to stay online long enough?
What kinds of data are likely to be linked to your PIDs?
What kinds of data are likely to be analysed/processed with tools?
What will happen after data goes off-line?
Lets see a few use cases .
B2SHARE is a user-friendly, reliable and trustworthy way for researchers, scientific communities and citizen scientists to store and share small-scale research data from diverse contexts. All B2SHARE artifacts are associated with a PID.
B2SHARE creates several PIDs:
1) A Handle for the deposit, that resolves to the specific landing page in B2SHARE
A DOI, that also resolves to the landing page but is also indexed in DataCite
2) A DOI for each uploaded file, that resolves directly to the file and thus enables programmatic and specific downloads of certain data.
B2SAFE creates a PID using its own Handle prefix that refers to the original in the community’s repository
B2SAFE uses B2HANDLE, i.e. it interacts with the Handle server via the B2HANDLE python library. Upon creation of the PID, B2SAFE also stores some extra information in the Handle entry:
The URL pointing to the real location of the data object
Checksum for integrity checks
An optional identifier that is internally used by the community
The data object is replicated to an EUDAT site
Here B2SAFE creates a new PID for the replica, with the Handle prefix for that site
The interaction is again done with the B2HANDLE python library
The Handle entry of the replica uses the PID from the community centre to refer to the original data, here PID1x is used as value for the First ingested Object (FIO), as direct parent of the replica (PARENT) and also as link to the original community repository (ROR). At the same time the PID of the data at the Community centre is updated. It holds now a link to the replica (REPLICA)
For Handles with multiple URL values, the proxy server (or web browser plug-in) simply selects the first URL value in the list of values returned by the Handle resolution. Because the order of that list is non-deterministic, there is no intelligent selection of a URL to which the client would be redirected. The 10320/loc Handle attribute was developed to improve the selection of specific resource URLs. Type “10320/loc” specifies an XML-formatted Handle value that contains a list of locations.
Locatt attribute:
If someone constructs a link as hdl:123/456?locatt=id:0 then the resolver will return the locations that have an "id" attribute of 0 (i.e., the first location).
If there is only one location element, it is returned as a redirect. If there are more than one, then you can select the one you want by adding the ?locatt=id:n attribute.
Persistent Identifiers provide a solution to the “link rot” problem by providing a layer of indirection. PIDs act as individual names for the objects with some extra information, like an ID card for people containing the address and some more information.
The indirection helps to account for changes e.g. the address changes but users can access the data while still employing the same reference.
PIDs also make it easier to keep digital objects accessible in the long term, despite changes of technology, organizations, and people, as they are independent of the data and their changes in publishing systems, transfer, and evolution of technology.
Requirements: An identifier must be globally unique, and should also be actionable - that is, a persistent identifier should provide a persistent link to the resource identified.
Several systems are available; some offer additional functionality in the form of support for storing additional metadata, providing a global resolver, etc.
Policy Document: How to use persistent identifiers in your repository requires some analysis and thought Persistence Needs Preservation. Huge amounts of useful material have been lost due to lack of resources, or explicitly assigned responsibility. There must be a policy in place to prevent this from happening. Among all the concepts which have been introduced there is no ‘one size fits all’ strategy for implementing persistent identifiers. A policy will have to be adapted to suit the needs of the individual organization
EUDAT uses Handle as technology low costs, robust setup, allows for flexible creation of PIDs
Via the EPIC federation EUDAT makes sure that PIDs are mirrored and kept safe