the UPS protoproto project
1. herbert van de sompel, michael nelson, thomas krichel
UPS 1 Meeting
Santa Fe - October 21st 1999
2. project description
demo the UPS protoproto
dex the data exchange framework
3. project why a protoproto?
• UPS: enable cross-archive end-user services
• protoproto:
– facilitate discussions
– identify issues involved in creating cross-archive services
– experiment with digital object concepts for archive
material
– does not claim to be a solution
• protoproto is multi-disciplinary
– a special instance of cross-archive
– there is a market
– promotional value
4. project who?
• coordination: herbert van de sompel, michael
nelson, thomas krichel
• involvement of:
– Old Dominion U & NASA Langley
– U of Surrey
– U of Ghent
– Los Alamos National Laboratory - Library
– Russian Academy of Science - Siberian branch
5. project sponsors
• Los Alamos National Laboratory - Research Library
• JISC eLib WoPEc project
6. project datasets
– metadata only
– full text remains at archives
– static dumps obtained ca. July 99
             objects   full-text   !organization
the arXiv     85,223      85,223          17,983
CogPrints        742         659              14
NACA           3,036       3,036             100
NCSTRL        29,184       9,084              93
NDLTD          1,590         951               1
RePEc         73,367      13,582           2,453
Total        193,142     112,535
7. project metadata formats
archive      format
the arXiv    internal
CogPrints    internal
NACA         Refer
NCSTRL       RFC1807
NDLTD        MARC
RePEc        ReDIF
8. project metadata extraction
• Getting metadata out of archives
– not all archives support metadata extraction
• some archives have undocumented metadata
extraction procedures
– not all archives support rich criteria for
extraction
• single dump concept only
• Intellectual property and use rights not
always clear
9. project metadata quality
• Metadata has problems with:
– record duplication
– crucial missing fields
– internal errors
– ambiguous references to people and places,
publications
10. project metadata conversion
• all datasets converted to ReDIF:
• essential to have a single format for the creation
of services
• supply by archives in a single format was not
realistic
• no downgrading of data
• data enhancements:
• creation of unique identifier
• addition of raw subject-classification
• normalization of publication types
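The conversion and enhancement steps above can be pictured with a small sketch. It is illustrative only and not the project's conversion scripts: the input record layout, the `PUBLICATION_TYPES` mapping, the `ups:` identifier layout and the two `X-` field names are assumptions made for this example. Only the general tag-colon-value shape and the `Template-Type`, `Author-Name`, `Title` and `Handle` fields are taken from ReDIF.

```python
# Hypothetical mapping from archive-specific publication types to a
# normalized vocabulary (assumed values, not the ones used in UPS).
PUBLICATION_TYPES = {
    "techreport": "report",
    "tech report": "report",
    "preprint": "preprint",
    "e-print": "preprint",
    "thesis": "thesis",
    "etd": "thesis",
}


def make_identifier(archive: str, local_id: str) -> str:
    """Build a unique identifier from the archive name and its local id.

    The "ups:" prefix is an assumption for this sketch.
    """
    return f"ups:{archive.lower()}:{local_id}"


def to_redif(archive: str, record: dict) -> str:
    """Render one source record as a ReDIF-style template."""
    raw_type = record.get("type", "").strip().lower()
    lines = ["Template-Type: ReDIF-Paper 1.0"]
    for author in record.get("authors", []):
        lines.append(f"Author-Name: {author}")
    lines.append(f"Title: {record['title']}")
    # Enhancement: carry the raw subject classification through unchanged.
    if record.get("subject"):
        lines.append(f"X-Subject-Raw: {record['subject']}")  # hypothetical field
    # Enhancement: normalized publication type; unknown types pass through.
    normalized = PUBLICATION_TYPES.get(raw_type, raw_type)
    lines.append(f"X-Publication-Type: {normalized}")  # hypothetical field
    # Enhancement: unique identifier.
    lines.append(f"Handle: {make_identifier(archive, record['id'])}")
    return "\n".join(lines)
```

Passing unknown publication types through unchanged mirrors the "no downgrading of data" bullet: nothing is dropped when the mapping has no entry.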
11. project re-creation of archives
• creation of archives for ReDIF-ed metadata
• using intelligent digital objects : “buckets”
[figure: re-created archives for RePEc, arXiv and NCSTRL]
12. project buckets
• Buckets were chosen to study the implications
of using rich, intelligent objects in UPS
• Buckets are:
– DL protocol / system independent
– self-contained and mobile
– handle their own display, enforcement of terms and
conditions, and dissemination of their contents
– designed for bundling multiple data representations and
data instance types
• The aggregative nature of buckets is well
suited for adding value-added services at the
object level
13. project creation of end-user service
• NCSTRL+ digital library service
• indexing buckets in archives by requesting their
metadata
• enhanced user-interface
• NCSTRL+ search results point at buckets
• buckets auto-display
• buckets provide link to full-text in native archive
14. project scaling problems
• UPS contains 193K objects
– using buckets consumed inodes (~60 inodes per
bucket)
• filesystem reformatted with more generous amount
of inodes
– Solaris and Dienst conflict
• Dienst wants each object in a publishing authority
to be in a single directory
• Solaris has a hard limit of 32K objects in a directory
• resolution: use many (100+) authorities for UPS
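The two scaling constraints can be checked with a little arithmetic. The sketch below is a rough illustration: it takes "32K" as 32,000, uses the approximate 60 inodes per bucket from the slide, and the authority naming in `authority_for` is hypothetical.

```python
import math

OBJECTS = 193_142        # total from the datasets slide
INODES_PER_BUCKET = 60   # approximate figure from the slide
DIR_LIMIT = 32_000       # "32K" taken as 32,000 for this sketch


def inodes_needed(objects: int, per_bucket: int = INODES_PER_BUCKET) -> int:
    """Total inodes consumed when every object is stored as a bucket."""
    return objects * per_bucket


def min_authorities(objects: int, limit: int = DIR_LIMIT) -> int:
    """Fewest authorities (one directory each) that stay under the limit."""
    return math.ceil(objects / limit)


def authority_for(index: int, prefix: str, limit: int = DIR_LIMIT) -> str:
    """Assign the index-th object (0-based) of an archive to a numbered authority."""
    return f"{prefix}{index // limit + 1}"


print(inodes_needed(OBJECTS))    # 11588520
print(min_authorities(OBJECTS))  # 7
```

Under these assumptions the collection needs roughly 11.6 million inodes, and seven directories would be the bare minimum. The 100+ authorities reported on the slide are well above that lower bound; the sketch does not try to account for that choice.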
15. project addition of linking service
• integrate the archives with the traditional
communication mechanism
• context-sensitive linking to deliver extended
services via SFX technology
16. project SFX linking service
[figure: SFX linking between system A and system B; labels: metadata, evaluate metadata, extended services]
18. project addition of linking service
• buckets for arXiv, NCSTRL and RePEc are SFX-
aware
• CogPrints, NACA, NDLTD not SFX-aware
• SLAC/SPIRES is SFX-aware
• linking services for preprint metadata + for
published version
19. demo the UPS protoproto
• will be available from the beginning of November
• UPS list will be notified
• disclaimer “not a production system”
http://ups.cs.odu.edu:8000/
http://ups.cs.odu.edu
20. dex some issues (I)
• data exchange framework
• data provision vs. data implementation
• central searching, distributed archives
• need for a framework by which archives can
describe themselves:
• content
• terms and conditions
• protocols, criteria supported to extract (meta)data
• metadata scheme, subject classification scheme,
material-type scheme, ...
21. dex some issues (II)
• need for an identifier scheme for archives and
archive objects
• (cf. ISSN, ISBN, DOI)
• metadata quality obstructs the creation of services
• desirable to extend metadata with citation
information
• smart objects
• archived objects that are active, not passive
22. dex providing vs. implementing data
• Providing data:
– publishing into an archive
– providing methods for metadata “harvesting”
• also provide a non-technical context for sharing
information
• Implementing data:
– harvest metadata from providers
– implement user interface to data
• Even if provided by the same DL, these are
distinct functions
23. dex providing vs. implementing data
[figure, left panel: a provider with an input interface and a native end-user interface. No machine-based way to extract metadata…]
[figure, right panel: a provider with an input interface, a native end-user interface and a native harvesting interface. Machine and user interfaces for extracting metadata…]
25. dex self-describing archives
• Much of the learning about the constituent
UPS archives occurred out of band…
• Given an unknown archive, we should be
able to algorithmically determine the
archive’s metadata...
[figure: a provider with an input interface, a native end-user interface and a native harvesting interface]
Where possible, the harvesting interface should provide the same
criteria as the end-user interface
26. dex self-describing archives
• Recommended criteria for metadata
extraction:
– subject classification
– accession date
– publication date
• Criteria for archive description
– metadata formats employed
– contact information for archive
– publication type scheme
– identifier scheme
– subject classification scheme
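One way to picture a self-describing archive is as a small machine-readable record plus a harvest request built from the recommended criteria. The sketch below is an assumption-laden illustration, not a format proposed in these slides: the class names, field names and record layout are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ArchiveDescription:
    """What an archive would publish about itself (hypothetical layout)."""
    name: str
    contact: str
    metadata_formats: list[str]
    publication_type_scheme: str
    identifier_scheme: str
    subject_classification_scheme: str
    # Criteria the archive supports for metadata extraction.
    extraction_criteria: list[str] = field(
        default_factory=lambda: ["subject", "accession_date", "publication_date"]
    )


@dataclass
class HarvestRequest:
    """A request using the three recommended extraction criteria."""
    subject: Optional[str] = None
    accessioned_since: Optional[date] = None
    published_since: Optional[date] = None


def matches(record: dict, request: HarvestRequest) -> bool:
    """Return True if a metadata record satisfies every criterion that is set."""
    if request.subject is not None and record.get("subject") != request.subject:
        return False
    if request.accessioned_since and record["accession_date"] < request.accessioned_since:
        return False
    if request.published_since and record["publication_date"] < request.published_since:
        return False
    return True
```

With a description like this, a service could check `extraction_criteria` before harvesting instead of learning an archive's capabilities out of band.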
27. dex identifiers
• Useful in:
– reference linking
– can be used in citations
– resolving duplications
• UPS duplications were removed by hand
– tracking publication lifecycle
• Need the ability for an object to have
multiple unique identifiers
– organization, discipline, etc.
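The slide notes that UPS duplicates were removed by hand. As an illustration of how multiple identifiers per object could support automated duplicate resolution, the sketch below groups records that share at least one identifier. The record layout and the identifier strings in the usage example are hypothetical.

```python
def group_by_shared_identifier(records: list[dict]) -> list[list[str]]:
    """Group record keys that share at least one identifier.

    Each record is a dict with a "key" and a list of "identifiers".
    """
    owner: dict[str, int] = {}  # identifier -> index of the group holding it
    groups: list = []           # group index -> record keys (None once merged away)
    for rec in records:
        hits = sorted({owner[i] for i in rec["identifiers"] if i in owner})
        if hits:
            # The record links one or more existing groups: merge them.
            target = hits[0]
            for other in hits[1:]:
                groups[target].extend(groups[other])
                groups[other] = None
            for ident, group in list(owner.items()):
                if group in hits[1:]:
                    owner[ident] = target
        else:
            target = len(groups)
            groups.append([])
        groups[target].append(rec["key"])
        for ident in rec["identifiers"]:
            owner[ident] = target
    return [g for g in groups if g is not None]


records = [
    {"key": "a", "identifiers": ["arxiv:x1"]},
    {"key": "b", "identifiers": ["repec:y1"]},
    {"key": "c", "identifiers": ["arxiv:x1", "repec:y1"]},
    {"key": "d", "identifiers": ["ncstrl:z1"]},
]
print(group_by_shared_identifier(records))  # [['a', 'b', 'c'], ['d']]
```

Record `c` carries both an organization-style and a discipline-style identifier, which is what lets `a` and `b` be recognized as the same object.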
28. dex smart objects
• Premise: Objects are more important than the
archives that hold them
• SODA: Smart Objects, Dumb Archives
• Objects should be the canonical authority for
• metadata
• contents
• use
• Objects should be able to grow and change
• correct metadata
• add new formats
• add new services
• reflect the lifecycle of the object
29. dex smart objects
• It would be beneficial if the archived
objects could be heterogeneous:
• with their own “look-and-feel”
• unique functionality / services
– e.g., the data archiving needs of an atmospheric scientist
can differ from those of a computer scientist, engineer
or medical researcher
• yet maintain a standard API for:
• extracting metadata
• content retrieval
• resource discovery on the object
• terms and conditions
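The four functions listed above could be expressed as a common interface that heterogeneous objects implement in their own way. The sketch below is a minimal illustration and not the actual bucket API: `SmartObject`, `SimpleBucket` and all method names are hypothetical.

```python
from abc import ABC, abstractmethod


class SmartObject(ABC):
    """Standard API that every archived object would expose (hypothetical)."""

    @abstractmethod
    def metadata(self, fmt: str) -> str:
        """Extract metadata in the requested format."""

    @abstractmethod
    def content(self, name: str) -> bytes:
        """Retrieve one piece of content held by the object."""

    @abstractmethod
    def list_contents(self) -> list[str]:
        """Resource discovery: list what the object holds."""

    @abstractmethod
    def terms_and_conditions(self) -> str:
        """Return the terms and conditions governing use of the object."""


class SimpleBucket(SmartObject):
    """In-memory object; a real one could add its own look-and-feel and services."""

    def __init__(self, metadata: dict[str, str], contents: dict[str, bytes], terms: str):
        self._metadata = metadata
        self._contents = contents
        self._terms = terms

    def metadata(self, fmt: str) -> str:
        return self._metadata[fmt]

    def content(self, name: str) -> bytes:
        return self._contents[name]

    def list_contents(self) -> list[str]:
        return sorted(self._contents)

    def terms_and_conditions(self) -> str:
        return self._terms
```

A service written against `SmartObject` would not need to know which discipline's object it is talking to, which is the point of keeping the API standard while letting the objects differ.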
30. dex lessons learned
• A strong distinction between the provision
of data, and the implementation of data
– also, a socio-legal context for sharing metadata
• Open, “self-describing” archives
• A universal, unique identifier name space
• Archived objects with more intelligence and
flexibility