Script away!!: APIs, XSLT, and linked data sets for creating and enriching bibliographic data / Lucas Mak, Devin Higgins, Autumn Faulkner, Joshua Barton
The document describes three projects at Michigan State University Libraries that leverage APIs and linked data to enrich metadata:
1) Geographic Area Code enrichment of catalog records using LC linked data and XSLT processing.
2) Enhancement of brief records for the Rovi Music Collection using the MusicBrainz and Discogs APIs to add genres, names, and other metadata.
3) Creation of "worksets" from the Google Books dataset stored at MSU and linking bibliographic metadata to external linked data sources about authors to provide more context.
1. Script away!!
APIs, XSLT, and linked data sets for creating and
enriching bibliographic data
Lucas Mak | Devin Higgins | Autumn Faulkner | Joshua Barton
Michigan State University Libraries
2. Three Projects:
• Geographic Area Code Enrichment
• Applying XSLT
• Rovi Music Collection
• Metadata enhancement via API
• Google Books Dataset
• Applying Python, leveraging APIs
3. PART 1: Geographic Area Code Enrichment
https://www.loc.gov/marc/geoareas/
4. Geographic Area Code Enrichment
• Objective
• Area studies librarians want to analyze collections in their respective areas by
Geographic Area Codes (GAC) in catalog records
• Problems
• Catalog records lacking GAC
• GACs contained in catalog record may not be comprehensive
• Until 2010, the Library of Congress (LC) assigned a maximum of three GACs to any one
bibliographic record
5. Geographic Area Codes
• Library of Congress Subject Heading Manual
• Appendix E governs the assignment of GAC
• Heading is tagged 651 or contains a geographic subdivision ($z)
• Location of individual named entities
• Events, exhibitions, movements, etc.
• And many more …
6. Solution
• Inserting GAC based on subject headings
• MARC 651$a
• Geographic subdivisions ($z)
• Government bodies subordinate to a jurisdictional heading (610 10$a)
• Geographic qualifier of conference headings (610$c & 611$c)
• Exclusions
• Ethnic groups, nationalities, civilizations, etc.
• False match possibility & matching efficiency
• Non-jurisdictional corporate bodies (610 20)
• Difficult to determine location of the corporate bodies
7. Conversion Table
• Source of data
• id.loc.gov
• LC has published the GAC data as linked data (http://id.loc.gov/vocabulary/geographicAreas)
• Bulk download available (http://id.loc.gov/download/)
• RDF/XML, N-Triples, and Turtle
• Data conversion
• RDF/XML file into XML tables
• Name-Code conversion table
• United States → n-us---
• Deprecated-Current Code conversion table
• e.g. a-hk--- → a-cc-hk
10. Default Processing Logic
• 1st geographic subdivisions ($z) matches as is
• $z Ohio $z Cleveland → Matches “Ohio” to the conversion table
• MARC 651$a, 610 10$a, 610$c, and 611$c
• Does not have qualifier (e.g. Ohio), matches as is
• Has qualifier (e.g. Cleveland (Ohio) or Cleveland, Ohio), extract qualifier for matching
• Output
• Converts deprecated GACs into current GACs (e.g. a-hk--- → a-cc-hk)
• Keeps existing GACs in 043 if current
• Dedups newly generated GACs against the existing 043
• Only outputs unique GACs
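The project implemented this logic in XSLT; purely as an illustration, the same default processing can be sketched in Python. The conversion tables below are tiny hypothetical stand-ins for the full tables derived from id.loc.gov; only the control flow mirrors the slides.

```python
import re

NAME_TO_GAC = {              # Name-Code conversion table (illustrative excerpt)
    "Ohio": "n-us-oh",
    "United States": "n-us---",
    "Hong Kong": "a-hk---",
}
DEPRECATED_TO_CURRENT = {    # Deprecated-Current Code conversion table
    "a-hk---": "a-cc-hk",
}

def heading_to_gac(heading):
    """Match a heading (or its parenthesized qualifier) to a current GAC."""
    m = re.search(r"\(([^)]+)\)\s*$", heading)
    key = m.group(1) if m else heading          # use qualifier if present, else as-is
    code = NAME_TO_GAC.get(key)
    return DEPRECATED_TO_CURRENT.get(code, code) if code else None

def enrich_043(headings, existing_043):
    """Keep current existing GACs, convert deprecated ones, add new unique codes."""
    out = [DEPRECATED_TO_CURRENT.get(c, c) for c in existing_043]
    for h in headings:
        code = heading_to_gac(h)
        if code and code not in out:            # dedup against existing 043
            out.append(code)
    return out
```

For example, `enrich_043(["Cleveland (Ohio)", "Hong Kong"], ["n-us---"])` keeps the existing code, adds `n-us-oh` from the qualifier, and converts the deprecated Hong Kong code to `a-cc-hk`.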
11. Issues in Name-Code Conversion
• Special Patterns in Geographic qualifiers
• Abbreviated state/provincial names (RDA Appendix B.11)
• Vancouver (B.C.) in 651$a, “Portland, Me.” in 611$c
• Multiple country/state/provincial names
• Cumberland River (Ky. and Tenn.)
• Type of place/jurisdiction
• Chignik Lagoon (Alaska : Bay), Addison (Ohio : Township)
• Intermediate place name
• Albany (Berks County, Pa.)
13. Issues in Name-Code Conversion
• Inconsistent Practices
• Australia
• Has codes down to State level, e.g. Victoria (u-at-vi)
• MARC 651 has states as geographic qualifier, e.g. Sydney (N.S.W.)
• Geographic subdivision ($z) has Australia as the 1st $z and followed by local place name with state
name as geographic qualifier in the 2nd $z
• $z Australia $z Sydney (N.S.W.)
• PRC China
• Has codes down to Province or Municipal level, e.g. Shanxi Sheng (a-cc-sh), Beijing (a-cc-pe)
• MARC 651 has
• Provinces & country as qualifier, e.g. Taiyuan (Shanxi Sheng, China)
• Country as qualifier, e.g. Beijing (China)
• Geographic subdivision ($z)
• $z China $z Taiyuan (Shanxi Sheng)
• $z China $z Beijing
14. Issues in Name-Code Conversion
• Malaysia (SHM H810)
• Only has code down to Country level (a-my---)
• MARC 651 has
• Provinces & country as qualifier, e.g. Kuching (Sarawak, Malaysia)
• Geographic subdivision ($z)
• $z Malaysia $z Kuching (Sarawak)
• Korea (South) vs Korea (North)
• Qualifier is dropped when qualifying a local place name (LC-PCC PS 16.2.2.4)
• Seoul (Korea) -- $z Korea (South) $z Seoul
• P'yŏngyang (Korea) -- $z Korea (North) $z P'yŏngyang
• Needs an exhaustive list of Korean place names for matching
• From LCNAF & LCSH files
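The special qualifier patterns above can be handled by normalizing the qualifier before table lookup. This is an illustrative Python sketch (the project used XSLT): drop the place type after " : ", keep the last comma-separated element to skip intermediate place names, and split "X and Y" qualifiers into multiple candidates. Abbreviated forms like "B.C." would still need a separate RDA Appendix B.11 expansion table.

```python
import re

def qualifier_keys(heading):
    """Return candidate lookup keys for a heading's geographic qualifier.

    Handles the patterns above: place type ("Alaska : Bay"), intermediate
    place names ("Berks County, Pa."), and multiple jurisdictions
    ("Ky. and Tenn."). Illustrative only.
    """
    m = re.search(r"\(([^)]+)\)\s*$", heading)
    if not m:
        return [heading]                      # no qualifier: match as-is
    qual = m.group(1)
    qual = qual.split(" : ")[0]               # drop type of place/jurisdiction
    qual = qual.split(", ")[-1]               # drop intermediate place name
    return [q.strip() for q in qual.split(" and ")]
```

For example, `qualifier_keys("Cumberland River (Ky. and Tenn.)")` yields both `"Ky."` and `"Tenn."` for matching.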
18. Metadata Enhancement by API
• Rovi Music Collection
• Spans mid-1980s to 2014
• American and some international markets
• 681,000 CDs
19. Rovi Music Collection:
• Increased physical music holdings by over 42 times
• Very basic metadata (included UPC)
• Required automation
20. Phased Cataloging Process
• Phase 1 – Local Holdings Lookup
• [Flow diagram: UPCs → HTTP query → MSU OPAC XML Server → if found → item records for Rovi holdings in MSU OPAC]
21. Phased Cataloging Process
• Phase 2 – Locating Copy Records
• [Flow diagram: remaining UPCs from Phase 1 → SRU query → if found → copy records downloaded via Sierra API]
22. Phased Cataloging Process
• Phase 3 – Brief Record Generation (Music)
• [Flow diagram: remaining UPCs from Phase 2 + metadata from donor → brief records loaded into Sierra]
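The three phases form a simple fallthrough: each UPC is tried against local holdings, then against copy records, and anything left gets a brief record. A minimal sketch, where the two lookup functions are hypothetical stand-ins for the OPAC and SRU queries:

```python
def phased_catalog(upcs, local_lookup, copy_lookup):
    """Route each UPC through the three phases described above.

    local_lookup / copy_lookup are hypothetical callables returning a
    truthy value when the UPC is found in that source.
    """
    found_local, found_copy, needs_brief = [], [], []
    for upc in upcs:
        if local_lookup(upc):          # Phase 1: existing MSU holdings
            found_local.append(upc)
        elif copy_lookup(upc):         # Phase 2: copy record located
            found_copy.append(upc)
        else:                          # Phase 3: generate brief record
            needs_brief.append(upc)
    return found_local, found_copy, needs_brief
```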
23. Limitations of brief records
• Rovi data do not differentiate personal names from corporate names
• Broad genre terms mapped from Rovi proprietary terms
• For classical music, only performers are listed
25. Discogs & MusicBrainz
• Discogs.com
• Crowdsourced music database with more than 7.4 million entries
• Users contribute entries for sound recordings
• Controlled list of “Style” terms
• MusicBrainz.org
• Open content music database
• Entries are maintained by volunteer editors
• Differentiation between personal name & corporate body name
• Links to external services, e.g. VIAF, Wikidata, Discogs.com, etc.
• Uncontrolled keywords for genre
26. Application Program Interface (API)
• “A software tool…which performs a particular computational function…APIs act
as building blocks allowing software developers to create new applications
without having to code every function from scratch.”*
* Daniel Chandler and Rod Munday, “Application Programming Interface” in A Dictionary of Media and Communication (Oxford University Press, 2011)
[Flow diagram: Query → API → Database → Result]
27. Discogs & MusicBrainz API details
• MusicBrainz.org
• Non-commercial use of the web service is free
• Data in MusicBrainz Database is licensed under CC0
• Query result available in XML and JSON (beta) formats
• Searchable by UPC and many other typical data points
• Documentation https://musicbrainz.org/doc/Development
• Discogs.com
• Data is licensed under the “CC0 No Rights Reserved” license
• Query result available in JSON format only
• Not searchable by UPC, though the UPC may be present in the returned JSON
• Documentation https://www.discogs.com/developers/
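The two lookups above can be sketched as URL construction in Python. The endpoint shapes follow the public documentation linked on the slides (MusicBrainz release search by `barcode`, Discogs database search with a personal access token), but should be verified against the current docs before use; note also that MusicBrainz expects a meaningful User-Agent header on requests.

```python
from urllib.parse import urlencode

def musicbrainz_release_url(upc):
    """Search MusicBrainz releases by barcode, requesting JSON output."""
    qs = urlencode({"query": "barcode:" + upc, "fmt": "json"})
    return "https://musicbrainz.org/ws/2/release/?" + qs

def discogs_search_url(upc, token):
    """Discogs database search by barcode; requires an access token."""
    qs = urlencode({"barcode": upc, "token": token})
    return "https://api.discogs.com/database/search?" + qs
```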
34. Outcomes
• Benefits of the process
• More granular genre terms from Discogs.com
• Possible authorized forms of name from LC
• Correct tagging (700 vs 710) of names
• Limitations
• 1 query/sec. allowed in both APIs
• “503 Service unavailable” HTTP error
• Hard to dedup lists of names from two sources
• UPC lookup failure in MusicBrainz.org
• Failure to retrieve record from Discogs.com even if record is available
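The 1-query/sec limit and intermittent 503s call for throttling with retry. A minimal sketch, where `do_query` is any hypothetical callable that raises on a "503 Service Unavailable" response:

```python
import time

def throttled(do_query, items, delay=1.0, retries=3):
    """Run do_query over items, spacing requests and retrying on failure."""
    results = []
    for item in items:
        for attempt in range(retries):
            try:
                results.append(do_query(item))
                break
            except RuntimeError:                   # e.g. 503 Service Unavailable
                time.sleep(delay * (attempt + 1))  # simple linear backoff
        time.sleep(delay)                          # stay under 1 query/sec
    return results
```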
36. Google Dataset at MSU
• All public domain, Google-digitized books:
• OCR text (not page-images)
• 3 million volumes
• 3 TB zipped text
• 12 GB MarcXML metadata (aka catalog records)
• Remotely synced with HathiTrust
37. Full-text not intended for reading/public display, but for “Non-consumptive research” (aka text mining)
41. Accessible → Usable
• Stored in a “Pairtree” directory structure:
• Unique ID = miua.0048030.1838.001
• Path to Item = /miua/pairtree_root/00/48/03/0,/18/38/,0/01/
• Not intuitive for human access, but adds stability to the file system and quick access for machines
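The mapping above can be sketched directly: the namespace prefix becomes the top directory, '.' in the remaining ID is encoded as ',' (per the Pairtree specification's character mapping), and the result is split into two-character path segments. This covers the example ID only; the full spec encodes several other special characters.

```python
def pairtree_path(object_id):
    """Map an object ID like miua.0048030.1838.001 to its Pairtree path."""
    namespace, _, rest = object_id.partition(".")
    encoded = rest.replace(".", ",")                   # Pairtree char mapping
    pairs = [encoded[i:i + 2] for i in range(0, len(encoded), 2)]
    return "/{}/pairtree_root/{}/".format(namespace, "/".join(pairs))
```

Usage: `pairtree_path("miua.0048030.1838.001")` reproduces the path shown above.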
42. “Workset” Creation
• Subset of the larger dataset built around specific
features:
• Publication Date
• Language
• Author
• Literary Form (poetry, fiction, etc.)
• Bibliographic Level (monograph, serials)
• Content Type (text, map, musical score, etc.)
• Nature of Contents (theses, catalogs, etc.)
43. Sample Workset Query
• “19th-Century French Fiction”
• Publication Date: Between 1800 and 1899
• Language: French
• Literary Form: Fiction+Novel+Short Stories
• Bibliographic Level: Monograph [maybe]
• Search Results: 966 volumes
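Since selected fields were indexed in Solr (slide 47), the sample workset above could be expressed as a Solr-style filter query. The field names and language code here are hypothetical; the actual index schema is not shown in the slides.

```python
# "19th-Century French Fiction" workset as a Solr-style query (illustrative)
filters = {
    "pub_date": "[1800 TO 1899]",
    "language": "fre",
    "literary_form": "(fiction OR novel OR short_stories)",
    "bib_level": "monograph",
}
query = " AND ".join("{}:{}".format(field, value) for field, value in filters.items())
```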
46. • Download full results or a random sample
• Download text, bibliographic, and/or technical metadata
• Download zipped or unzipped volumes
• Download ID list for use with the HathiTrust API (or to make an email request)
47. Working with Metadata
• Python scripts to parse MarcXML metadata
• “Streaming” parser because metadata files were too large to hold in memory
• Stored all MARC data in a relational database (MySQL)
• Additionally, processed selected fields for indexing with Solr
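A streaming parse can be sketched with `xml.etree.ElementTree.iterparse`: each record element is processed and then cleared, so the full 12 GB of MarcXML never sits in memory. The sample below omits the MARC21/slim namespace and real field selection for brevity; it is illustrative, not the project's actual script.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<collection>
  <record><datafield tag="245"><subfield code="a">Representative men :</subfield>
  </datafield></record>
</collection>"""

def titles(stream):
    """Collect $a subfield values one record at a time from a MarcXML stream."""
    out = []
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "record":
            for sf in elem.iter("subfield"):
                if sf.get("code") == "a":
                    out.append(sf.text)
            elem.clear()               # free this record before the next one
    return out
```

Usage: `titles(io.BytesIO(SAMPLE))` yields the single 245 $a value from the sample.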
48. Limitations of Current Tool
• Subsetting by bibliographic data only
• Not able to answer:
• How would I gather works by 16th-century women?
By 19th-century men?
• Works by displaced/exiled authors during WWII
50. <datafield tag="100">
<subfield code="a">Emerson, Ralph Waldo,</subfield>
<subfield code="d">1803-1882.</subfield>
</datafield>
<datafield tag="245">
<subfield code="a">Representative men :</subfield>
<subfield code="b">seven lectures.</subfield>
</datafield>
<datafield tag="650">
<subfield code="a">English language</subfield>
<subfield code="x">Rhetoric.</subfield>
</datafield>
Book Metadata → Linked Data
MarcXML Data
http://viaf.org/viaf/27079964/
• Language: EN - English
• Nationality: US - United States
• Gender: Male
http://dbpedia.org/resource/Ralph_Waldo_Emerson
• Philosophical School: Transcendentalism
• Influenced by: Hegel, Montaigne, Kant...
• Notable Ideas: Over-Soul, Self-Reliance
• Influenced: Musil, Thoreau, Proust...
• Founder of: The Atlantic
• Subjects: American diarists, American Unitarians, 1803 births, Mystics...
• Birthplace: Boston, Massachusetts
• Death Place: Concord, Massachusetts
51. Linked Data
• Compiling URIs via:
• WorldCat Identities API (author last name and OCLC
number search)
• LC Linked Data Service (author name match on
authoritative name)
• Queried multiple sources to cross-check results
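For the LC Linked Data Service, id.loc.gov offers a "known label" lookup: requesting `.../label/<authorized form>` redirects to the matching authority URI when the label matches exactly. This sketch only builds the URL; confirm the endpoint behavior against the current id.loc.gov documentation before relying on it.

```python
from urllib.parse import quote

def lc_name_label_url(authorized_name):
    """Build an id.loc.gov known-label lookup URL for an LCNAF heading."""
    return ("https://id.loc.gov/authorities/names/label/"
            + quote(authorized_name))
```

For example, the authorized form from the MarcXML sample above, "Emerson, Ralph Waldo, 1803-1882", produces a lookup URL that should redirect to his name authority.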
52. Linked Data Implementation Scenario 1
• Store and index data points locally
• Store URIs and retrieved contextual data as text
• Fast search & retrieval
• Regular refresh of data required to capture
new/updated data points
53. Linked Data Implementation Scenario 2
• Store harvested URIs locally
• Retrieve data points from remote data stores using
harvested URIs
• Most up-to-date data
• Have to overcome system performance and result
normalization issues
54. Provide Author Context
• Display information about author on the fly
following author search
• Using the stored URI, query DBpedia for author context
• Show thumbnail, etc. to the user
• Include link to Wikipedia and other info stores
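The on-the-fly DBpedia lookup can be sketched as a SPARQL request against the public endpoint, keyed by the stored resource URI. The property choices (abstract, thumbnail) are illustrative, and full ontology URIs are used to avoid depending on endpoint-defined prefixes; only the URL is built here, no request is made.

```python
from urllib.parse import urlencode

def author_context_url(resource_uri):
    """Build a DBpedia SPARQL request for an author's abstract and thumbnail."""
    sparql = """
    SELECT ?abstract ?thumb WHERE {{
      <{uri}> <http://dbpedia.org/ontology/abstract> ?abstract .
      OPTIONAL {{ <{uri}> <http://dbpedia.org/ontology/thumbnail> ?thumb }}
      FILTER (lang(?abstract) = "en")
    }}""".format(uri=resource_uri)
    qs = urlencode({"query": sparql, "format": "application/sparql-results+json"})
    return "https://dbpedia.org/sparql?" + qs
```

Usage: `author_context_url("http://dbpedia.org/resource/Ralph_Waldo_Emerson")` yields a request URL whose JSON response could feed the author display described above.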
55. Wrap Up
• These projects draw on different but
complementary skillsets
• Use similar data sources
• For example: LC Linked Data Service
• Resulting expertise informs other projects, which
involve some of the same key players
56. Wrap Up
• Example future projects:
• Linked Data Cross-Divisional Team
• Experimenting with transforming MARC bibliographic data to
BIBFRAME
• Linked Data enrichment in digital collections
• API development to harvest third-party metadata for
partial cataloging automation