1. Biodiversity Heritage Library:
A Mass Scanning Mix of Metadata
Bianca Crowley, Collections Coordinator
Biodiversity Heritage Library
Smithsonian Libraries
Jun-13
3. BHL Overview
• http://biodiversitylibrary.org
• New user interface launched in March
• Search by title, author, article, subjects and
scientific names
• Various download options, even high
resolution
• Taxonomic name finding algorithm
• Machine-to-machine services
6. Metadata
1. Titles vs. Items vs. Segments
2. Metadata we need:
• MARC for book and journal titles
• Volume information
• Page data
BHL Term | Library Term | Meaning | Metadata
Title | Book or Journal Titles | Conceptual Unit | MARC record
Item | Volume, Piece | Object | Derived from holdings + created @ digitization
Segment | Article, Book Chapter, Part | Section of consecutive pgs | Harvested from BioStor.org or created post digitization
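The Title / Item / Segment hierarchy can be sketched as a small data model. This is only an illustration of the three levels described on this slide, not the actual BHL database schema; every class name, field name, and sample value below is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """Article, chapter, or other run of consecutive pages inside an item."""
    label: str
    first_page: int
    last_page: int
    source: str = "post-digitization"  # or "BioStor" when harvested


@dataclass
class Item:
    """A scanned volume or piece (the physical object)."""
    volume: str
    page_count: int
    segments: List[Segment] = field(default_factory=list)

    def add_segment(self, segment: Segment) -> None:
        # A segment must be a valid page range that fits inside the item.
        if not (1 <= segment.first_page <= segment.last_page <= self.page_count):
            raise ValueError("segment pages fall outside the item")
        self.segments.append(segment)


@dataclass
class Title:
    """Conceptual unit (book or journal) described by one MARC record."""
    name: str
    marc_record_id: str  # hypothetical identifier for the source record
    items: List[Item] = field(default_factory=list)


# Made-up sample data: one journal title, one volume, one article.
journal = Title(name="Example Journal of Natural History", marc_record_id="example-0001")
vol = Item(volume="v.1 (1893)", page_count=320)
vol.add_segment(Segment("An example article", 15, 42, source="BioStor"))
journal.items.append(vol)
```

A title holds many items, and an item holds many segments, which mirrors where each kind of metadata comes from: the catalog record, the holdings and scanning step, and post-digitization work.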
8. Metadata Challenges
• BHL collection aggregates metadata from 15
member library catalogs
• Also aggregating metadata from a couple
hundred Internet Archive contributors
• Default page metadata created at time of
scanning lacks detail esp. for plates, figures, etc.
• Taxonomic name finding algorithm only as good
as optical character recognition (OCR)
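To see why name finding depends on OCR quality, consider a toy name finder that looks up candidate genus names in a known list. This is not the algorithm BHL uses; the lookup list, the regular expression, and the sample sentences are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny lookup list of genus names.
KNOWN_GENERA = {"Opuntia", "Quercus"}

# Candidate binomial: a capitalized word followed by a lowercase word.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_names(text):
    """Return 'Genus species' strings whose genus is in KNOWN_GENERA."""
    return [f"{genus} {species}"
            for genus, species in BINOMIAL.findall(text)
            if genus in KNOWN_GENERA]


clean_ocr = "The type of Opuntia humifusa was collected in 1820."
noisy_ocr = "The type of Opuntla humifusa was collected in 1820."  # 'i' misread as 'l'

print(find_names(clean_ocr))  # the name is found
print(find_names(noisy_ocr))  # the name is missed
```

A single misread character is enough to hide the name from an exact lookup, so every OCR error on a page is a potential missed index entry.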
13. Impact
• “BHL came to the rescue when a planned trip to work in the Mertz Library at The New
York Botanical Garden had to be cancelled due to Hurricane Sandy. Thanks to the online
resources available through BHL I was able to source most of the key works I needed,
with their supporting bibliographic information. Further use of BHL occurred when
building work at the Linnean Society of London limited access to some of the book I had
been able to use from that collection.”
• “I would like thank you all very much for invaluable work and support you do. I just got a
pdf-file from more than century old (1893) journal paper (regional naturalist society
paper, published in Finland), to get copy I should take 500 mile drive to our university
library. Now I am got it fastly in high-quality pdf-copy. Cordial thanks and all success in
continuing your highly valuable mission.” [conservation biologist from Estonia]
• “You are a wonderful resource. I maintain a Website that describes the plant genus
Opuntia (prickly pear cacti). There is no way I could maintain such a site without access to
literature from 100-200 years ago. Most of the cactus species were discovered long ago; I
find it invaluable to put up PDF files to document each species in the literature as I
document them photographically. I am a botanist, but I work in the pharmaceutical field
(not so many botanical jobs out there). Your library makes it possible for me to continue
working with plants in a meaningful and scientific manner.”
14.
• Repackaging content in new ways for new
audiences via:
– flickr, Facebook, Twitter, & Pinterest
– iTunes U & iBooks
• Open data & APIs
– Put content where users are already working
(Encyclopedia of Life-EOL.org, Int’l Plant Names Index-
IPNI.org, Tropicos.org)
– Gets power users to work for us (for free!) e.g.
BioStor.org, synynyms.com
• A free & open access digital library for biodiversity literature and primary source materials (field books)
• A consortium of 15 libraries working together to run a virtual library branch
• A collection of content from the 15-member BHL consortium and other Internet Archive contributors
• Anyone is free to access & download BHL materials
• SIL employees work to scan SIL content
• SIL also hosts the BHL Secretariat: BHL Program Director, BHL Program Manager, BHL Collections Coordinator
• Nancy Gwinn = BHL Executive Chair
• Federal support received for the past 2 years and ongoing!
• The 15 BHL member libraries work together to select unique content for scanning
• Then we send the books from our shelves and the metadata from our library catalogs to the Internet Archive for scanning
• The Internet Archive does the heavy lifting: digitization, derivative file creation, and packaging all image and metadata files together for storage
• We harvest the files from the IA database into our BHL database, managed by the Missouri Botanical Garden in St. Louis
• Overview of types of metadata we need
• Metadata flows from our library catalogs to the Internet Archive and then to BHL
• We derive the metadata we display in the BHL website from the original MARC record of the contributing library
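Deriving display metadata from a MARC record can be sketched roughly as follows. The field tags are standard MARC 21 (245 title statement, 100 main entry personal name, 260 publication information), but the dict layout, the function name, and the sample record are invented for illustration and say nothing about how the BHL database actually does this.

```python
def display_metadata(marc):
    """Build a small display record from a MARC-like dict {tag: {subfield: value}}."""
    def sub(tag, code):
        value = marc.get(tag, {}).get(code, "")
        # Catalog records carry trailing ISBD punctuation; drop it for display.
        return value.strip(" /:;,")

    title_parts = [p for p in (sub("245", "a"), sub("245", "b")) if p]
    return {
        "title": ": ".join(title_parts),
        "author": sub("100", "a"),
        "publisher": sub("260", "b"),
        "year": sub("260", "c").rstrip("."),
    }


# Made-up record, not taken from any real catalog.
example = {
    "100": {"a": "Doe, Jane,"},
    "245": {"a": "An example flora :", "b": "with notes on cacti /"},
    "260": {"a": "London :", "b": "Example Press,", "c": "1893."},
}

print(display_metadata(example))
```

The original record stays untouched in the backend; only the derived copy is cleaned up for the website.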
• Example of the original MARC record in SIRIS and in the backend BHL database vs. the metadata derived from the original MARC and displayed on the BHL website
• Notice also the differences in the volume information. This is because AMNH contributed some volumes in addition to the SIL-contributed volumes.
• It is often the case that multiple BHL member libraries need to work together to complete a series
• We don’t have the time to standardize volume metadata coming from different libraries at the time of scanning, but we can modify this information after it appears in our collection
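Post-scan cleanup of volume statements from different libraries might look something like the sketch below. The input formats, the target "v.N (YYYY)" form, and the function name are assumptions for illustration only; real volume strings are far more varied, and curation at BHL happens through the administrative interface, not through this code.

```python
import re

# Assumed patterns: "v. 3", "Vol. 3", "volume 3", plus an optional 4-digit year.
VOLUME = re.compile(r"\b(?:v|vol|volume)\.?\s*(\d+)", re.IGNORECASE)
YEAR = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")


def normalize_volume(raw):
    """Rewrite a volume statement as 'v.N (YYYY)'; leave it alone if unrecognized."""
    vol = VOLUME.search(raw)
    if vol is None:
        return raw.strip()
    out = f"v.{int(vol.group(1))}"
    year = YEAR.search(raw)
    if year:
        out += f" ({year.group(1)})"
    return out


# Hypothetical strings as two libraries might record the same volume.
for raw in ["Vol. 3, 1901", "v.3 (1901)", "volume 3", "Bd. 3"]:
    print(raw, "->", normalize_volume(raw))
```

Anything the patterns do not recognize is passed through unchanged, so a human curator can still review it.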
• Curating the BHL collection = critical piece of post-digitization workflow
• Requires login
• Web-based administrative interface to access the backend BHL database so that we can make corrections to our collection as necessary
• With over 60,000 titles and 114,000 volumes – how do we manage our curation activities?!
• User feedback is key; we rely on the many eyes of the crowd to help us direct our curation activities to the content people are actually using
• Users can let us know if they find a problem with something in our collection through our general feedback form, and place a request for something to be scanned through our scanning request form
• BHL uses an issue tracking system, known as Gemini, to manage the feedback we receive from users
• Nearly all consortium member libraries participate in responding to user feedback via this system
• Essential to BHL day-to-day work
• Key to communicating at the level of granularity we need
• Excellent documentation tool
• The majority of the content in the BHL collection is public domain
• However, we have agreements to provide access to over 270 in-copyright titles under a Creative Commons Attribution-NonCommercial-ShareAlike license
• As part of the volume metadata, we include data about copyright status and licensing if applicable – 3 different tiers
• As an open access project it is critical that we manage our copyright metadata; focus on managing in-copyright as well as “due diligence” volumes
• Open data available under a Creative Commons Zero license = public domain dedication