In the January 1994 issue of The Cornell Veterinarian, editor Maurice E. White wrote:
THIS is the last issue of "The Cornell Veterinarian". The "Cornell Vet" has a proud history, dating back to June, 1911... (p.1)
This presentation will describe Cornell University Library efforts to provide an "afterlife" to The Cornell Veterinarian by leveraging a number of disparate initiatives and metadata sources. While attempting to build article-level linking to full text in HathiTrust (functionality currently unavailable), limitations in the metadata captured during the scanning process were uncovered. The speaker will delineate these metadata findings and provide strategies (some scalable, others highly labor-intensive) for gathering the metadata needed to create direct links to articles found in HathiTrust.
Presenter:
Steven Folsom
Cornell University
Steven Folsom is a metadata librarian overseeing the creation and management of metadata for various Cornell University Library digital platforms. He strategizes on the integration of metadata across systems with the ultimate goal of improving discovery of and access to information resources.
1. Wrangling Metadata from
HathiTrust and PubMed:
Providing Full-text Linking to The Cornell Veterinarian
Photo credit: http://www.walls.com/ Steven Folsom, NASIG Annual Conference 2014
2. Cornell Library Digital Consulting
and Production Services
A single point of service for those wishing to create digital
collections
A virtual group that spans multiple departments within
the Library (Digital Scholarship and Preservation Services,
Cornell Library IT and Metadata Librarians from Library
Technical Services)
Approaches digital collection building holistically, and
addresses the entire life cycle management of a project
3. The Cornell Veterinarian Project
Participants
Client:
Cornell Flower-Sprecher Veterinary Library
DCAPS Involvement:
Jaron Porciello, Digital Scholarship Initiatives Coordinator
Michelle Paolillo, Project Manager/Business Analyst (CUL’s
HathiTrust Liaison)
John Cline, Cornell Library Programmer
Steven Folsom, Metadata Librarian
4. HathiTrust Digital Library
Digital Library consisting of the Google Books project,
Internet Archive digitization initiatives, and content
digitized locally by libraries
Committed to preserving content with stable access and
distributed/coordinated cost of storage
Centralized technical framework that allows for the
creation of tools and services
8. Google Books:
Contributions from Cornell Library
Participation in the Google Books Library Project since
2008
Google focuses on materials that they have not already
digitized
Using OCLC holdings information, they compose a
Cornell candidate list
12. Hathifiles
Tab-delimited full files of the Hathi Digital Library and
incremental updates (Full file is currently over 2.5 GB
uncompressed)
Light Bibliographic data
Includes some administrative metadata, e.g. rights
information, the originating institution for the scanned
copy
13. Select Hathifile Record Elements
Hathi Volume ID: mdp.39015076694507
Access: allow [Notes on mapping for rights attributes where
contextual user data would affect access]
Rights: pd [public domain]
HathiTrust record number: 000529434
Enumeration/Chronology: v.33 no.11 1900
Source: MIU
Title: The Chicago medical times
OCLC number: 1554176
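As a rough illustration of working with this kind of record, the sketch below reads tab-delimited lines and pulls out the volume IDs and enumeration/chronology values for one catalog record. The column names and their order are assumptions that simply follow the elements listed on this slide; a real Hathifile carries more columns, so the current field list should be checked before reuse. The function names are hypothetical.

```python
import csv
import io

# Assumed column subset and order, mirroring the slide for illustration only.
# A real Hathifile has more columns in a documented order of its own.
ASSUMED_COLUMNS = [
    "htid", "access", "rights", "ht_bib_key",
    "description", "source", "title", "oclc_num",
]

def read_hathifile(handle, columns=ASSUMED_COLUMNS):
    """Yield one dict per tab-delimited line, keyed by the given column names."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield dict(zip(columns, row))

def volumes_for_record(handle, ht_bib_key, columns=ASSUMED_COLUMNS):
    """Return (volume id, enumeration/chronology) pairs for one catalog record."""
    return [
        (rec["htid"], rec["description"])
        for rec in read_hathifile(handle, columns)
        if rec.get("ht_bib_key") == ht_bib_key
    ]

# One line built from the values shown on the slide.
sample = io.StringIO(
    "mdp.39015076694507\tallow\tpd\t000529434\tv.33 no.11 1900\tMIU\t"
    "The Chicago medical times\t1554176\n"
)
matches = volumes_for_record(sample, "000529434")
print(matches)
```

Because the full file is over 2.5 GB uncompressed, the reader is written as a generator so rows are filtered as they stream past rather than loaded all at once.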
14. HathiTrust Bibliographic API
Meant for retrieving information about small numbers of
items at a time
Returns bibliographic, rights, and volume information when
given one or more standard identifiers (ISBN, LCCN,
OCLC, etc.); overlaps with the Hathifile data
Brief example:
http://catalog.hathitrust.org/api/volumes/brief/oclc/424023.json
Full example:
http://catalog.hathitrust.org/api/volumes/full/oclc/424023.json
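The URL pattern in the two examples above can be sketched as a small helper. This is a minimal sketch that only generalizes from those two URLs; the function names are hypothetical, and the shape of the JSON that comes back is not assumed here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE = "http://catalog.hathitrust.org/api/volumes"

def bib_api_url(detail, id_type, value):
    """Build a single-identifier request URL following the slide's examples."""
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    return f"{BASE}/{detail}/{id_type}/{quote(str(value), safe='')}.json"

def fetch_record(url):
    """Fetch and decode the JSON response (network call, not run below)."""
    with urlopen(url) as response:
        return json.load(response)

brief_url = bib_api_url("brief", "oclc", 424023)
full_url = bib_api_url("full", "oclc", 424023)
print(brief_url)
print(full_url)
```

Since the API is meant for small numbers of items, a loop over many identifiers would be better served by the Hathifiles.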
15. Hathi Metadata Recap
• Administrative data about scans and corresponding volumes
• Uses Hathi IDs to link to bibliographic data
• Bulk bibliographic data
• Some administrative data, e.g. rights information
• Small requests for bibliographic data retrieved using standard identifiers (ISBN, LCCN, OCLC…)
16. What we thought was the solution….
Use Hathi Data API to find Table of Contents for each
Volume
Gather the related OCR
Parse out article citation values from the OCR (Hopefully in a
mostly automated way)
Use the pagination data from TOC to build links by mapping
to pagination in the METS files.
What couldn’t be automated would be done manually
(with the projected outcome being a citation index with Hathi
URLs that could be used to build an interface or given to an
index like PubMed)
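To give a sense of what the "mostly automated" parsing step might have looked like, here is a sketch that pulls titles and start pages out of table-of-contents OCR. It assumes each entry sits on one line as a title, a run of dot leaders, and a page number; real OCR is rarely that clean, and the sample text is invented for illustration.

```python
import re

# Assumed line shape: title, dot leaders (with or without spaces), start page.
TOC_LINE = re.compile(r"^(?P<title>.+?)\s*(?:\.\s*){2,}(?P<page>\d+)\s*$")

def parse_toc(ocr_text):
    """Return (title, start_page) pairs for lines that look like TOC entries."""
    entries = []
    for line in ocr_text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append((match.group("title").strip(), int(match.group("page"))))
    return entries

# Hypothetical OCR text, not taken from the journal.
sample_ocr = """CONTENTS
A hypothetical first article ........ 1
A hypothetical second article . . . . 27
Index .... 301
"""
toc = parse_toc(sample_ocr)
print(toc)
```

Lines that do not fit the pattern are skipped silently, which is where the manual clean-up described above would come in.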
21. A Path for Automation
For each citation already in PubMed for which HathiTrust has one
volume:
1. Search the PubMed <Volume> AND the Hathi Catalog id (000535347)
for The Cornell Veterinarian against the Hathi File to get the
corresponding Hathi object id from the METS
2. Use the METS object id AND the PubMed start page (the numeric
value before the '-' in each PubMed article citation) to find the
<ORDERLABEL> and get the <Order> number from the METS file
3. Create the URL to be added to the PubMed XML. The Hathi METS
object id and <Order> number are used to create the URL: the
sequence number in the URL equals the <Order> number, and the
id in the URL equals the METS id, e.g.
http://babel.hathitrust.org/cgi/pt?id=coo.31924051143075;view=1up;seq=11
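Steps 2 and 3 can be sketched roughly as follows. The sketch assumes that ORDER and ORDERLABEL appear as attributes on METS div elements in the volume's structural map, and the embedded METS fragment is a made-up sample rather than the real file for this volume. The helper names are hypothetical; only the URL pattern and object id come from the slide.

```python
import xml.etree.ElementTree as ET

METS_NS = "{http://www.loc.gov/METS/}"

def start_page(pubmed_pagination):
    """Take the value before the '-' in a PubMed pagination string."""
    return pubmed_pagination.split("-", 1)[0].strip()

def order_for_page(mets_xml, page_label):
    """Find the div whose ORDERLABEL matches the page and return its ORDER."""
    root = ET.fromstring(mets_xml)
    for div in root.iter(f"{METS_NS}div"):
        if div.get("ORDERLABEL") == str(page_label):
            return int(div.get("ORDER"))
    return None  # page label not present in this volume

def page_url(object_id, seq):
    """Build the page-turner URL in the form shown on the slide."""
    return f"http://babel.hathitrust.org/cgi/pt?id={object_id};view=1up;seq={seq}"

# Hypothetical METS fragment: printed page 1 is the 11th scanned image.
SAMPLE_METS = """<METS:mets xmlns:METS="http://www.loc.gov/METS/">
  <METS:structMap TYPE="physical">
    <METS:div TYPE="volume">
      <METS:div ORDER="10" TYPE="page"/>
      <METS:div ORDER="11" ORDERLABEL="1" TYPE="page"/>
      <METS:div ORDER="12" ORDERLABEL="2" TYPE="page"/>
    </METS:div>
  </METS:structMap>
</METS:mets>"""

seq = order_for_page(SAMPLE_METS, start_page("1-9"))
url = page_url("coo.31924051143075", seq)
print(url)
```

The lookup returns the first matching label, so volumes with repeated page numbers (for example, separately paginated issues bound together) would need extra handling.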
22. NCBI’s LinkOut Program
A service that allows third parties to link specific NCBI
database records to relevant web-accessible resources
The relevant journal/publication must already have gone
through the Medline selection process
Document Type Definition (DTD) for contributing links in XML
23. PubMed Citation Data Requirements
PubMed DTD specifies how the data should be
formatted
Data Tags (R = Required, O = Optional, O/R = Optional or
Required). Required tags must be included; optional tags
must be included only if the data requested appears in
the print or electronic article. Optional or Required tags
depend on the use of other tags
Tag names are case sensitive
25. In an Ideal World…
Photo credit: http://www.priefert.com/
26. The metadata that got away…
Pre-1945 issues not indexed by PubMed
Supplemental volumes*
What we hope to do about it:
Manually capture the Hathi URLs for the supplemental
volumes and provide them to PubMed using their linking
format
Manually capture citation data for pre-1945 articles using the
OCR files, and send to PubMed using their indexing format.
27. Project Outcomes
Soft:
Better understanding of what’s possible with Hathi APIs
Better understanding of PubMed’s metadata/URL contribution
requirements
Increased desire within the Cornell Library to consider greater return on our
HathiTrust investment
Concrete:
The Cornell Veterinarian should soon be available via PubMed for the years
already indexed
Manually capturing the complete backfile for The Cornell Veterinarian to
contribute to PubMed
28. Future Considerations
Potential for improved access to other titles currently
lacking full-text linking in PubMed [if in HathiTrust]
Investigations into other (non)full-text indexes and full-text
repositories
New Services for interacting with HathiTrust Digital Library
Potential improvements to the Hathi workflows.