the UPS protoproto project

herbert van de sompel, michael nelson, thomas krichel

UPS 1 Meeting
Santa Fe - October 21st 1999

project description

demo the UPS protoproto

dex the data exchange framework

project why a protoproto?
•  UPS: enable cross-archive end-user services
•  protoproto:
–  facilitate discussions
–  identify issues involved in creating cross-archive services
–  experiment with digital object concepts for archive
material
–  does not claim to be a solution
•  protoproto is multi-disciplinary
–  a special instance of cross-archive
–  there is a market
–  promotional value

project who?

•  coordination: herbert van de sompel, michael
nelson, thomas krichel
•  involvement of:
– Old Dominion U & NASA Langley
– U of Surrey
– U of Ghent
– Los Alamos National Laboratory - Library
– Russian Academy of Science - Siberian branch

project sponsors

•  Los Alamos National Laboratory - Research Library
•  JISC eLib WoPEc project

project datasets
–  metadata only
–  full text remains at archives
–  static dumps obtained ca. July 99

archive       objects   full-text   !organization
the arXiv      85,223      85,223          17,983
CogPrints         742         659              14
NACA            3,036       3,036             100
NCSTRL         29,184       9,084              93
NDLTD           1,590         951               1
RePEc          73,367      13,582           2,453

Total         193,142     112,535
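
The totals in the table can be checked with a few lines of arithmetic. The sketch below uses only the figures listed above. The variable names are illustrative, the "full-text" column is read as the number of records with full text (an assumption), and the full-text share is derived here rather than stated on the slide.

```python
# (objects, full-text) per archive, copied from the table above.
# Reading "full-text" as "records with full text" is an assumption.
datasets = {
    "arXiv": (85_223, 85_223),
    "CogPrints": (742, 659),
    "NACA": (3_036, 3_036),
    "NCSTRL": (29_184, 9_084),
    "NDLTD": (1_590, 951),
    "RePEc": (73_367, 13_582),
}

total_objects = sum(objects for objects, _ in datasets.values())
total_full_text = sum(full_text for _, full_text in datasets.values())

print(total_objects)    # 193142
print(total_full_text)  # 112535
# Share of records with full text (derived, not stated on the slide)
print(f"{total_full_text / total_objects:.1%}")
```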

project metadata formats

archive      format
the arXiv    internal
CogPrints    internal
NACA         Refer
NCSTRL       RFC1807
NDLTD        MARC
RePEc        ReDIF

project metadata extraction

•  Getting metadata out of archives
–  not all archives support metadata extraction
•  some archives have undocumented metadata
extraction procedures
–  not all archives support rich criteria for
extraction
•  single dump concept only
•  Intellectual property and use rights not
always clear

project metadata quality

•  Metadata has problems with:
–  record duplication
–  crucial missing fields
–  internal errors
–  ambiguous references to people, places, and
publications
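
Two of the problems listed above, missing fields and record duplication, lend themselves to simple automated checks. The following is a minimal sketch, not the procedure used in the project: the set of "crucial" fields, the record layout, and the sample records are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical choice of "crucial" fields; the slides do not list them.
REQUIRED_FIELDS = ("title", "authors", "date")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]


def duplicate_groups(records):
    """Group records whose normalized title and first author coincide."""
    groups = defaultdict(list)
    for rec in records:
        # Lowercase and collapse whitespace so trivial variants match.
        title = " ".join(rec.get("title", "").lower().split())
        authors = rec.get("authors") or [""]
        groups[(title, authors[0].lower())].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Invented sample records, for illustration only.
records = [
    {"id": "a1", "title": "Example  Paper One", "authors": ["Doe"], "date": "1999"},
    {"id": "a2", "title": "example paper one", "authors": ["doe"], "date": ""},
    {"id": "a3", "title": "Example Paper Two", "authors": [], "date": "1998"},
]

for rec in records:
    print(rec["id"], missing_fields(rec))
print(duplicate_groups(records))
```

A check like this only flags candidates. Ambiguous references to people or places still need human judgment or an identifier scheme.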

project metadata conversion

•  all datasets converted to ReDIF:
•  essential to have a single format for the creation
of services
•  supply by archives in a single format was not
realistic
•  no downgrading of data

•  data enhancements:
•  creation of unique identifier
•  addition of raw subject-classification
•  normalization of publication types

project re-creation of archives

•  creation of archives for ReDIF-ed metadata
•  using intelligent digital objects : “buckets”

[Figure: archive labels RePEc, arXiv, NCSTRL]

project buckets
•  Buckets were chosen to study the implications
of using rich, intelligent objects in UPS
•  Buckets are:
–  DL protocol / system independent
–  self-contained and mobile
–  handle their own display, enforcement of terms and
conditions, and dissemination of their contents
–  designed for bundling multiple data representations and
data instance types
•  The aggregative nature of buckets is well
suited to adding value-added services at the
object level

project creation of end-user service

•  NCSTRL+ digital library service
•  indexing buckets in archives by requesting their
metadata
•  enhanced user-interface
•  NCSTRL+ search results point at buckets
•  buckets auto-display
•  buckets provide link to full-text in native archive

project scaling problems

•  UPS contains 193K objects
–  using buckets consumed inodes (~60 inodes per
bucket)
•  filesystem reformatted with more generous amount
of inodes
–  Solaris and Dienst conflict
•  Dienst wants each object in a publishing authority
to be in a single directory
•  Solaris has a hard limit of 32K objects in a directory
•  resolution: use many (100+) authorities for UPS
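
The numbers behind these two problems can be worked out directly. The sketch below assumes that "32K" means 32,768 and uses a hash to spread objects over authorities. The slides do not say how objects were actually assigned, so `authority_for` and the `ups###` naming are hypothetical.

```python
import math
import zlib

OBJECTS = 193_142          # total from the datasets slide
INODES_PER_BUCKET = 60     # approximate figure from the slide
DIR_LIMIT = 32 * 1024      # "32K" read as 32,768 (assumption)
AUTHORITIES = 100          # the slide says "100+"

# Inodes needed if every object becomes one bucket.
print(OBJECTS * INODES_PER_BUCKET)      # 11588520
# Fewest authorities that keep every directory under the limit.
print(math.ceil(OBJECTS / DIR_LIMIT))   # 6


def authority_for(object_id, n=AUTHORITIES):
    """Pick an authority by hashing the object identifier (hypothetical scheme)."""
    return f"ups{zlib.crc32(object_id.encode('utf-8')) % n:03d}"


print(authority_for("example-object-1"))
```

With 100 authorities, each directory would hold about 1,900 objects on average, well below the limit.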

project addition of linking service

•  integrate the archives with the traditional
communication mechanism
•  context-sensitive linking to deliver extended
services via SFX technology

project SFX linking service

[Figure: SFX linking diagram; labels: system A, system B, metadata, evaluate metadata, extended services]

project addition of linking service

•  buckets for arXiv, NCSTRL and RePEc are SFX-
aware
•  CogPrints, NACA, NDLTD not SFX-aware
•  SLAC/SPIRES is SFX-aware
•  linking services for preprint metadata + for
published version
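
Context-sensitive linking means that the services offered depend on the metadata at hand. The sketch below illustrates that idea only. It is not the SFX implementation, and the field names and service descriptions are hypothetical.

```python
def extended_services(metadata):
    """List the services worth offering for one record, given its metadata."""
    services = []
    if metadata.get("authors"):
        services.append("other works by these authors")
    if metadata.get("title"):
        services.append("search for this title in other archives")
    # Only records that identify a journal can lead to a published version.
    if metadata.get("journal"):
        services.append("look up the published version")
    return services


# Invented sample records, for illustration only.
preprint = {"title": "Example", "authors": ["Doe"]}
published = {"title": "Example", "authors": ["Doe"], "journal": "Example Journal"}

print(extended_services(preprint))
print(extended_services(published))
```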

demo the UPS protoproto

•  will be available from the beginning of November
•  UPS list will be notified
•  disclaimer “not a production system”

http://ups.cs.odu.edu:8000/

http://ups.cs.odu.edu

dex some issues (I)

• data exchange framework
• data provision vs. data implementation
• central searching, distributed archives
•  need for a framework by which archives can
describe themselves:
•  content
•  terms and conditions
•  protocols, criteria supported to extract (meta)data
•  metadata scheme, subject classification scheme,
material-type scheme, ...

dex some issues (II)

•  need for an identifier scheme for archives and
archive objects
• (cf. ISSN, ISBN, DOI)
•  metadata quality obstructs the creation of services
•  desirable to extend metadata with citation
information
•  smart objects
•  archived objects that are active, not passive

dex providing vs. implementing data

•  Providing data:
–  publishing into an archive
–  providing methods for metadata “harvesting”
•  provide non-technical context for sharing
information also
•  Implementing Data:
–  harvest metadata from providers
–  implement user interface to data
•  Even if provided by the same DL, these are
distinct functions
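
The split between the two roles can be sketched as two small classes. This is an illustration under assumptions, not a protocol from the project: the class names follow the slide, while the method names, the record layout, and the `since` parameter are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """An archive that accepts submissions and exposes a harvesting method."""
    name: str
    records: list = field(default_factory=list)

    def publish(self, record):
        """Publishing into the archive."""
        self.records.append(record)

    def harvest(self, since=None):
        """Return metadata records accessioned on or after `since`."""
        return [r for r in self.records
                if since is None or r["accessioned"] >= since]


class Implementor:
    """A service that harvests from providers and offers a search interface."""

    def __init__(self):
        self.index = []

    def harvest_from(self, provider, since=None):
        for record in provider.harvest(since):
            self.index.append({**record, "source": provider.name})

    def search(self, term):
        return [r for r in self.index if term.lower() in r["title"].lower()]


# Invented sample data, for illustration only.
archive = Provider("archive-a")
archive.publish({"title": "Example Paper One", "accessioned": "1999-06-01"})
archive.publish({"title": "Example Paper Two", "accessioned": "1999-07-15"})

service = Implementor()
service.harvest_from(archive, since="1999-07-01")
print(service.search("paper"))
```

The provider never needs to know which services exist, and the implementor only depends on the harvesting method, which is the separation the slide argues for.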


[Figure: two Provider configurations. One has only an input interface and a native end-user interface: "No machine based way to extract metadata…". The other adds a native harvesting interface: "Machine and user interfaces for extracting metadata….".]

[Figure: an Implementor with a native end-user interface (input and harvesting interfaces optional) and two Providers, each with an input interface and a native harvesting interface; the Providers' native end-user interface is optional (e.g., RePEc).]

dex self-describing archives

•  Much of the learning about the constituent
UPS archives occurred out of band…
•  Given an unknown archive, we should be
able to algorithmically determine the
archive’s metadata...
[Figure: a Provider with an input interface, a native harvesting interface, and a native end-user interface]
Where possible, the harvesting interface should provide the same
criteria as the end-user interface

dex self-describing archives

•  Recommended criteria for metadata
extraction:
–  subject classification
–  accession date
–  publication date
•  Criteria for archive description
–  metadata formats employed
–  contact information for archive
–  publication type scheme
–  identifier scheme
–  subject classification scheme

dex identifiers

•  Useful in:
–  reference linking
–  can be used in citations
–  resolving duplications
•  UPS duplications were removed by hand
–  tracking publication lifecycle
•  Need the ability for an object to have
multiple unique identifiers
–  organization, discipline, etc.

dex smart objects
•  Premise: Objects are more important than the
archives that hold them
•  SODA: Smart Objects, Dumb Archives
•  Objects should be the canonical authority for
•  metadata
•  contents
•  use
•  Objects should be able to grow and change
•  correct metadata
•  add new formats
•  add new services
•  reflect the lifecycle of the object

dex smart objects

•  It would be beneficial if the archived
objects could be heterogeneous:
•  with their own “look-and-feel”
•  unique functionality / services
–  e.g., the data archiving needs of an atmospheric scientist
can be different from those of a computer scientist, engineer
or medical researcher

•  yet maintain a standard API for:
•  extracting metadata
•  content retrieval
•  resource discovery on the object
•  terms and conditions

dex lessons learned

•  A strong distinction between the provision
of data, and the implementation of data
–  also, a socio-legal context for sharing metadata
•  Open, “self-describing” archives
•  A universal, unique identifier name space
•  Archived objects with more intelligence and
flexibility
