More information than you require about the Smithsonian Libraries' mass digitization program. Presentation given to Smithsonian staff (and some others) for a day-long symposium on rapid capture methodology. Focuses on SIL's workflow for scanning books.
1. “MORE”
More information on the SIL digitization
program than you require
Keri Thompson
Smithsonian Institution Libraries
SPIN Rapid Capture Workshop February 16, 2012
2. Boutique Digitization
Boutique
One-offs
Item-based workflow
Tailored metadata
Hand-crafted data, much user intervention
Opportunistic staffing
Project-specific grants
Illustration by A.E. Marty (1882-1974)
Gazette du Bon Genre, July 1920
Smithsonian Institution Libraries
3. Mass Digitization
Prêt à lire ("ready to read")
Standardization
Format-based workflow and metadata model
Automate as much as possible
Assigned staff
Funding stream
New York Millinery and Supply Co., 1901
Smithsonian Institution Libraries
4. Ramping Up
Find your niche
Secure Funding
Hire Staff
Purchase Equipment
Standardize on metadata, processes
Automate!
i.e., find magic automation wizard
5. Our Little Corner of the Web
10 original partner institutions
Digitizing legacy literature of taxonomy
Over 50,000 titles, over 100,000 items, almost 38 million pages
6. Numbers!
[Chart: Digitization at SI Libraries, 1999-present, total items per year, categorized as "not rapid," "rapid," and "too rapid." Storage estimates: at Internet Archive >10TB; locally >7.5TB.]
7. Funding
Multiple grants
Over multiple years
Lather, rinse, repeat
Kalamazoo Tank & Silo Co.
Catalog, ca. 1909
Smithsonian Institution Libraries
8. Human Resources
Started in 2008 with
2 FTE technicians (Grant)
.7 FTE manager
.5 FTE cataloger
Vendor scanning only
And a host of others!
In 2012 have
1 FTE technician (Grant)
2 FTE librarians (Grant)
.3 FTE manager
1 scanning technician (Grant)
And a host of others!
International Time Recording Co., Time Recording Card Clocks, 1914, p. 12
Smithsonian Institution Libraries
10. In-House Scanning
P65, 60.5MP camera
Strobe lights
Image capture
Filenaming
Crop, rotate
No post-processing
Convert to .tiff
11. Process(es)(es)
[Diagram: overlapping processes and data flows — data sources, website presentation, "gap-fills," vendor scanning, requests, in-house use (exhibitions, brochures), special projects, storage.]
12. Workflow
[Diagram: an item is selected in the workflow DB, checked out, deduped, and shipped for scanning; after QC it is checked in and marked as scanned; JP2000s + metadata go to the Internet Archive/BHL and are harvested to a local repository; a link is added in IA/BHL and URLs are written into the title-level MARC record in SIRIS; the item becomes available again.]
Generalized workflow
13. Standardize Process and Data
Common staging area
Metadata Model
Title level (MARC) metadata
Item level metadata
volume, issue, date, barcode
Page level metadata
sequence, page number, page type
Common storage area
Common presentation area
Ericsson LM, Can Efficiency be Measured?
Stockholm, Sweden, 1946
Smithsonian Institution Libraries
14. Automate Metadata Capture & Transformation
Extract title level metadata
MARC → MARCXML
Extract item level metadata
From SIRIS → SQL db → XML file
Page level metadata
Interface for easy data entry
File creation and conversion
Upload to staging area
National Cash Register, Annual Report, 1953
Smithsonian Institution Libraries
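The MARC → MARCXML step on this slide can be sketched in Python using only the standard library. This is a hedged illustration: in the real pipeline the records come from SIRIS via Z39.50 and Macaw does the transformation; the function name, tags, and values here are made up.

```python
# Sketch: serialize already-extracted title-level fields as minimal MARCXML.
# The field dict, tags, and values are illustrative, not real SIRIS records.
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"

def fields_to_marcxml(fields: dict) -> str:
    """fields maps a MARC tag to its subfields, e.g. {'245': {'a': 'Title'}}."""
    ET.register_namespace("", MARCXML_NS)
    record = ET.Element(f"{{{MARCXML_NS}}}record")
    for tag, subfields in sorted(fields.items()):
        df = ET.SubElement(record, f"{{{MARCXML_NS}}}datafield",
                           {"tag": tag, "ind1": " ", "ind2": " "})
        for code, value in subfields.items():
            sf = ET.SubElement(df, f"{{{MARCXML_NS}}}subfield", {"code": code})
            sf.text = value
    return ET.tostring(record, encoding="unicode")
```

A real implementation would build the record from a parsed MARC binary (e.g. with a library such as pymarc) rather than a hand-made dict.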
15. Workflow
[Diagram: an item is selected in the workflow DB, checked out, deduped, and scanned; Macaw creates item-level metadata from the DB "bucket" and transforms title-level MARC from SIRIS; the scanned .tiffs go to Macaw, which creates derivatives, adds page-level metadata, backs up temporarily to NAS, and packages the files (JP2000s + metadata) for transfer to the Internet Archive; after QC the item is checked in, marked as scanned, a link is added in IA/BHL, URLs are written into the MARC record, and the item becomes available.]
In-house workflow with Macaw
History: scanning since 1999. We created "digital editions," whole books delivered via the website: scan with a BetterLight back, save TIFFs, convert to JPG. Store on gold CDs! And Tivoli. Metadata was entered via cut-and-paste into spreadsheets. In the beginning, HTML pages, one per book page! Then database-driven pages. Each book scanned was a unique project; some projects had grant funding, some didn't. The end result was not stored in a content or collections management system, just on the website.
To increase volume, you must standardize: what metadata is collected, and so on. Try to accommodate most things you'll scan, but inevitably one size won't fit all; figure out what you're willing to compromise on and live with. Format-based means books one way, photos another, audio another. Automation for efficiency and speed; staffing for consistency, quality control, and speed. You don't necessarily need one huge funding source, but you do need a stream of funding: more than project-based, but not necessarily the whole enchilada. Leverage that as proof of concept for funding for other parts of the collection, or for funding additional services/features. Overlapping grants, creative redeployment of existing resources, project-within-a-project funding.
SIL's rapid capture methodology is based on one large project (BHL) and its needs; we then extend the model from there. Just one way of approaching it. We had an initial grant for digitization, supplemented with two more; more will need to come. We use funding primarily for STAFF, then for vendor/outsourced scanning, then for equipment/software. The process has taken a couple of years to standardize, and we couldn't have standardized and sped up the process without the automation.
The catalyst for our ramp-up came in 2008 (or thereabouts), when Smithsonian Libraries and MoBot spearheaded the creation of BHL. The primary audience was the international taxonomic community, and we had plenty of collections that were relevant. We are primarily scanning from our natural history collections, as well as the Cullman rare book collections. Those make up only n% of the total SIL collections, but a significant % of our public domain holdings. The ramp-up was necessitated by the terms of the grant!
Over 14,500 items and 5.8M images scanned since 2008, mostly via Internet Archive (BHL only). Our other scanning project, running since 2010: over 1,900 items and 600,000 pages. We ramped up VERY QUICKLY, sending 200 items a week for scanning. We needed to spend out funds, BUT quality suffered: shipments started failing QC, so we scaled back. Fewer problems now. Rapidity is a function of non-destructive scanning and care with fragile/rare material; QC TAKES A LONG TIME, but it saves rescanning later. Averaging ~4,000 images/month locally; IA averages 104,000 images/month. Storage (est. 600MB per package: zipped, compressed, lossy JP2s, etc.) at IA = over 10TB (8.3TB BHL + 1.2TB SI?). Storage locally since 2011: avg. package size is 23.4GB, more than 4.5TB total. Saving TIFFs and JP2s.
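The storage figures in these notes can be sanity-checked with quick arithmetic. All inputs are the estimates quoted above; TB here is decimal (1 TB = 1,000,000 MB).

```python
# Back-of-envelope check of the storage figures quoted in the notes.
# ~600 MB per zipped JP2 package at Internet Archive, 14,500+ items.
items_at_ia = 14_500
mb_per_ia_package = 600
ia_tb = items_at_ia * mb_per_ia_package / 1_000_000  # MB -> TB (decimal)
# ~8.7 TB, the same order as the 8.3 + 1.2 TB quoted above.

# Local packages (TIFFs + JP2s) average 23.4 GB each; >4.5 TB total
# implies on the order of a couple hundred packages since 2011.
local_tb = 4.5
gb_per_local_package = 23.4
local_packages = local_tb * 1_000 / gb_per_local_package

print(f"IA estimate: ~{ia_tb:.1f} TB")           # prints "IA estimate: ~8.7 TB"
print(f"Local packages: ~{local_packages:.0f}")  # prints "Local packages: ~192"
```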
You don't necessarily need one huge funding source, but you do need a stream of funding: overlapping grants, creative redeployment of existing resources, project-within-a-project funding. Initial BHL digitization costs were paid from the MacArthur grant to EOL/BHL, which only covers scanning: <$500,000 (will scan approx. 17,000 books, out of over 50,000 likely to be scanned for that project). A rough calculation figured the total cost to scan the entire (BHL) collection (by IA, which is cheap) would be over $2.5M. Funding of personnel and equipment comes from multi-year, overlapping Seidell grants ($1.5M over 7 years). We're expanding scanning to other parts of the collection by setting aside special purpose funds (director's discretionary) for both people and scanning. Future...? Gradually incorporate tasks into permanent staff duties and refill positions judiciously. Seek specific grants for special parts of the collection or special use cases.
Most important use of funds: full-time staff. Feed the beast. They manage and coordinate the workflow, and also do QC and post-scanning maintenance of the online collection. The BHL project is evolving and the workflow is more settled now; we need librarians, not techs. Note that the librarians do more than manage the digitization: metadata issues are now usually routed through our contract cataloging process, which also uses grant funds.
IA: quick, cheap, open access. Downsides: size limit, public domain only, spotty quality. In-house: quality, control. Downsides: slower, more expensive, STORAGE. Speed may be less of a factor once the NEW CAMERA comes online.
Gory details: shoot a target at the beginning of the book only; calibrate (mostly white balance) once per book. Always shoot at greater than 300ppi, relative to the size of the book. Shoot in 16-bit color, Adobe 1998 RGB color space. When images are converted to .tiff, downsample to 8-bit color and standardize on 300ppi (space issues). Apply auto-contrast and auto-levels but no other image editing in CaptureOne, maybe some sharpening if needed. CaptureOne does filenaming, crop, rotate, and conversion to TIFF. QC is done as a first pass right after scanning for all items, by the scanner operator. A second QC is done by other staff on a selected number of items, based on a formula (NISO standard!). QC looks only for 'major' errors such as missing pages, thumbs in the picture, or cut-off text: anything that would adversely affect the OCR. We are concerned only with the "content," since this is an ACCESS copy, not the book as ARTIFACT. After scanning, the operator manually moves the files onto the Macaw server, into the directory already created (the naming convention is the barcode, same as the filenames).
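The downsample-to-8-bit, 300ppi TIFF step could look roughly like this with the Pillow imaging library. This is a sketch only: in production CaptureOne performs the conversion, and the function name and paths here are hypothetical.

```python
# Sketch of the post-capture conversion described above: force 8-bit
# color and tag the saved file as 300 ppi. No sharpening, levels, or
# contrast adjustments are applied here; CaptureOne handles those.
from PIL import Image

def to_access_tiff(src_path: str, dst_path: str) -> None:
    with Image.open(src_path) as img:
        if img.mode not in ("RGB", "L"):
            # Downsample 16-bit / RGBA / other modes to 8-bit RGB.
            img = img.convert("RGB")
        img.save(dst_path, format="TIFF", dpi=(300, 300))
```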
Digitization can happen anywhere: multiple vendors, in house, legacy stuff you scanned way back, small grants, special projects, the main mass-digi stream, extraction of pretty pictures for reuse. The bulk for us is done by IA (cost- and grant-driven for BHL), but they can't do everything. All the various workflows = BLUE SPAGHETTI BARF. Hard to track, stuff everywhere, doesn't scale (duh); we need to refine processes, standardize, and harmonize small-scale projects with the large-scale project.
Basic workflow. Key elements: an item-level metadata & workflow tracking DB; SIRIS as the official metadata repository; IA as the staging (and temporary storage) area.
We use IA as staging for convenience: it's already used by the BHL project, there's plenty of storage space, they do OCR and create derivatives for us, and, as a plus, everything is available to everyone on IA. We accept a common basic metadata model (for the book format) based on the BHL/IA model; it suits most things. Still to solve: storage, presentation, non-IA-compatible stuff (e.g., in copyright). However, creating metadata and uploading to IA would be a time-intensive manual process; to be efficient you must AUTOMATE. Local scanning needed a tool to upload to IA and create metadata = Macaw.
Use & reuse data you already have; find protocols to extract the data you have. HOW? Through MACAW! For us, title-level MARC data can come from SIRIS via Z39.50. Item-level data is not as accessible, so we extracted it in bulk and stored it in a separate DB that we use for workflow; Macaw then automatically harvests it from that DB when necessary. Macaw transforms the harvested data to XML. Descriptive page-level data is still entered by hand, but technical metadata (image size) is extracted automatically, and the transformation to XML is automated. We also automate the TIFF->JP2 transformation and the bundling and uploading of locally created files to the IA staging area (easier said than done).
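The "bundle and upload" step can be sketched with the standard library: zip the derivative JP2s together with the metadata files, ready for transfer to the IA staging area. The directory layout and function name are hypothetical, the JP2 conversion is assumed to have already happened, and the actual upload (done by Macaw) is out of scope here.

```python
# Sketch of the packaging step described above: zip every derivative
# .jp2 and metadata .xml for one item into a single <barcode>.zip,
# ready for transfer to the Internet Archive staging area.
import os
import zipfile

def package_item(item_dir: str, barcode: str) -> str:
    """Zip every .jp2 and .xml under item_dir into <barcode>.zip."""
    zip_path = os.path.join(item_dir, f"{barcode}.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(item_dir)):
            if name.endswith((".jp2", ".xml")):
                zf.write(os.path.join(item_dir, name), arcname=name)
    return zip_path
```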
When a book is selected for scanning in the workflow database, Macaw (which checks it every couple of hours) imports the item-level data (barcode, volume, etc.) and creates a directory on its server to hold the metadata and scans. It then imports the MARC record from SIRIS via Z39.50 and converts it to MARCXML, saved in a file. The item-level data is stored in a database. When the scanner operator moves the scanned images to the directory, Macaw creates thumbnails for use in the interface.
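Macaw's import cycle, as described in this paragraph, might be sketched like so. Everything here is an assumption for illustration: table and column names, statuses, and the SQLite backend are made up, and the Z39.50 MARC fetch is only stubbed out in a comment.

```python
# Sketch of the Macaw import cycle described above: poll the workflow
# DB for newly selected items, create a directory per barcode, and
# mark the item as staged. Schema and statuses are hypothetical.
import os
import sqlite3

def import_new_items(db_path: str, staging_root: str) -> list:
    """Return the barcodes of items newly staged for scanning."""
    staged = []
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT barcode, volume FROM items WHERE status = 'selected'"
        ).fetchall()
        for barcode, volume in rows:
            # One directory per item; the naming convention is the barcode.
            os.makedirs(os.path.join(staging_root, barcode), exist_ok=True)
            # Real Macaw would also fetch the MARC record from SIRIS via
            # Z39.50 here and save it as MARCXML alongside the scans.
            conn.execute(
                "UPDATE items SET status = 'staged' WHERE barcode = ?",
                (barcode,),
            )
            staged.append(barcode)
    return staged
```

In practice this would run on a schedule (every couple of hours, per the notes) rather than on demand.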
The operator scans the barcode for an item and is taken to the editing page to add page-level metadata (page type, page number) and structure (page sequence), which is stored in an XML file. It's an easy-to-use GUI, with shortcuts for common operations, like selecting alternate pages to apply recto/verso and page-type descriptions. You can re-order pages, especially useful if you've scanned all the rectos and then all the versos. It contains extra fields that we can use locally for other projects: captions, notes, a flag for 'interestingness' (e.g., for a blog post or the like). Once a book is "finished," unless it is flagged for QC by other staff, Macaw creates the page-level XML file, converts the .tiffs to lossy compressed JP2s, zips the compressed JP2s, and sends the entire metadata + scans package up to the Internet Archive, which is a lot easier to say than it is to do. It is also copied locally to NAS for temporary storage.
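The "alternate pages" shortcut mentioned above can be expressed as a tiny function. This is a sketch only: Macaw's real interface is a web GUI, and the data shape here (a list of page-metadata dicts) is an assumption.

```python
# Sketch of the alternate-page shortcut described above: apply one
# page-type label to every other page, starting at a chosen index.
def label_alternate(pages, start, page_type):
    """pages is a list of page-metadata dicts ordered by scan sequence."""
    for i, page in enumerate(pages):
        if i >= start and (i - start) % 2 == 0:
            page["page_type"] = page_type
    return pages

# e.g. mark every second page starting from index 0 as a recto
pages = [{"sequence": i} for i in range(4)]
label_alternate(pages, 0, "recto")
```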
IA scanning is for "access" only, not preservation: managing expectations. Color and calibration are issues. Current equipment is still slow to set up and handle oversize materials. We're not embedding descriptive metadata in the page images; we need to automate this. Send to DAMS/other.
Throughput: the new camera should help. MSS (manuscripts) and un-cataloged items need software tweaks for the metadata; we also need to develop auto-export to local storage, a.k.a. the DAMS. We're starting to repurpose images already (importing directly into our Galaxy of Images collection) but hope to integrate into an online exhibition workflow involving the DAMS and ? Who knows. Output to METS for storage; not thinking about PREMIS just yet. Islandora for storage and/or delivery of METS-based docs. We need to harvest back scans and metadata from multiple locations so we can manage corrections, storage (fault lines!), and possible replication of the BHL corpus. The interface will be part of the new digital library.