More information than you require about the Smithsonian Libraries' mass digitization program. Presentation given to Smithsonian staff (and some others) for a day-long symposium on rapid capture methodology. Focuses on SIL's workflow for scanning books.
1. “MORE”
More information on the SIL digitization
program than you require
Keri Thompson
Smithsonian Institution Libraries
SPIN Rapid Capture Workshop February 16, 2012
2. Boutique Digitization
Boutique
One-offs
Item-based workflow
Tailored metadata
Hand-crafted data, much user intervention
Opportunistic staffing
Project-specific grants
Illustration by A.E. Marty (1882-1974)
Gazette du Bon Genre, July 1920
Smithsonian Institution Libraries
3. Mass Digitization
Prêt à lire ("ready to read")
Standardization
Format-based workflow and metadata model
Automate as much as possible
Assigned staff
Funding stream
New York Millinery and Supply Co., 1901
Smithsonian Institution Libraries
4. Ramping Up
Find your niche
Secure Funding
Hire Staff
Purchase Equipment
Standardize on metadata, processes
Automate!
i.e., find magic automation wizard
5. Our Little Corner of the Web
10 original partner institutions
Digitizing legacy literature of taxonomy
Over 50,000 titles, over 100,000 items, almost 38 million pages
6. Numbers!
[Chart: Digitization at SI Libraries, 1999-present, total items per year, categorized as "not rapid," "rapid," and "too rapid." Storage estimates: at Internet Archive >10TB; locally >7.5TB.]
7. Funding
Multiple grants
Over multiple years
Lather, rinse, repeat
Kalamazoo Tank & Silo Co.
Catalog, ca. 1909
Smithsonian Institution Libraries
8. Human Resources
Started in 2008 with
2 FTE technicians (Grant)
.7 FTE manager
.5 FTE cataloger
Vendor scanning only
And a host of others!
In 2012 have
1 FTE technician (Grant)
2 FTE librarians (Grant)
.3 FTE manager
1 scanning technician (Grant)
And a host of others!
International Time Recording Co., Time Recording Card Clocks, 1914, p. 12
Smithsonian Institution Libraries
10. In-House Scanning
P65, 60.5MP camera
Strobe lights
Image capture
Filenaming
Crop, rotate
No post-processing
Convert to .tiff
11. Process(es)(es)
[Diagram: overlapping processes and data flows — data sources, website presentation, "gap-fills," vendor scanning, requests, in-house use (exhibitions, brochures), special projects, storage.]
12. Workflow
[Diagram: an item is selected in the workflow DB, checked out, deduped, and shipped for scanning; after QC it is checked in and marked as scanned; JP2000s + metadata go to the Internet Archive/BHL and are harvested to a local repository; a link is added in IA/BHL and URLs are written into the title-level MARC record in SIRIS; the item becomes available again.]
Generalized workflow
13. Standardize Process and Data
Common staging area
Metadata Model
Title level (MARC) metadata
Item level metadata
volume, issue, date, barcode
Page level metadata
sequence, page number, page type
Common storage area
Common presentation area
Ericsson LM, Can Efficiency be Measured?
Stockholm, Sweden, 1946
Smithsonian Institution Libraries
14. Automate Metadata Capture & Transformation
Extract title level metadata
MARC → MARCXML
Extract item level metadata
From SIRIS → SQL db → XML file
Page level metadata
Interface for easy data entry
File creation and conversion
Upload to staging area
National Cash Register, Annual Report, 1953
Smithsonian Institution Libraries
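The MARC → MARCXML step on this slide can be sketched in Python using only the standard library. This is a hedged illustration: in the real pipeline the records come from SIRIS via Z39.50 and Macaw does the transformation; the function name, tags, and values here are made up.

```python
# Sketch: serialize already-extracted title-level fields as minimal MARCXML.
# The field dict, tags, and values are illustrative, not real SIRIS records.
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"

def fields_to_marcxml(fields: dict) -> str:
    """fields maps a MARC tag to its subfields, e.g. {'245': {'a': 'Title'}}."""
    ET.register_namespace("", MARCXML_NS)
    record = ET.Element(f"{{{MARCXML_NS}}}record")
    for tag, subfields in sorted(fields.items()):
        df = ET.SubElement(record, f"{{{MARCXML_NS}}}datafield",
                           {"tag": tag, "ind1": " ", "ind2": " "})
        for code, value in subfields.items():
            sf = ET.SubElement(df, f"{{{MARCXML_NS}}}subfield", {"code": code})
            sf.text = value
    return ET.tostring(record, encoding="unicode")
```

A real implementation would build the record from a parsed MARC binary (e.g. with a library such as pymarc) rather than a hand-made dict.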
15. Workflow
[Diagram: an item is selected in the workflow DB, checked out, deduped, and scanned; Macaw creates item-level metadata from the DB "bucket" and transforms title-level MARC from SIRIS; the scanned .tiffs go to Macaw, which creates derivatives, adds page-level metadata, backs up temporarily to NAS, and packages the files (JP2000s + metadata) for transfer to the Internet Archive; after QC the item is checked in, marked as scanned, a link is added in IA/BHL, URLs are written into the MARC record, and the item becomes available.]
In-house workflow with Macaw
History: scanning since 1999. We created "digital editions," whole books delivered via the website: scan with a BetterLight back, save TIFFs, convert to JPG. Store on gold CDs! And Tivoli. Metadata was entered via cut-and-paste into spreadsheets. In the beginning, HTML pages, one per book page! Then database-driven pages. Each book scanned was a unique project; some projects had grant funding, some didn't. The end result was not stored in a content or collections management system, just on the website.
To increase volume, you must standardize: what metadata is collected, and so on. Try to accommodate most things you'll scan, but inevitably one size won't fit all; figure out what you're willing to compromise on and live with. Format-based means books one way, photos another, audio another. Automation for efficiency and speed; staffing for consistency, quality control, and speed. You don't necessarily need one huge funding source, but you do need a stream of funding: more than project-based, but not necessarily the whole enchilada. Leverage that as proof of concept for funding for other parts of the collection, or for funding additional services/features. Overlapping grants, creative redeployment of existing resources, project-within-a-project funding.
SIL's rapid capture methodology is based on one large project (BHL) and its needs; we then extend the model from there. Just one way of approaching it. We had an initial grant for digitization, supplemented with two more; more will need to come. We use funding primarily for STAFF, then for vendor/outsourced scanning, then for equipment/software. The process has taken a couple of years to standardize, and we couldn't have standardized and sped up the process without the automation.
The catalyst for our ramp-up came in 2008 (or thereabouts), when Smithsonian Libraries and MoBot spearheaded the creation of BHL. The primary audience was the international taxonomic community, and we had plenty of collections that were relevant. We are primarily scanning from our natural history collections, as well as the Cullman rare book collections. Those make up only n% of the total SIL collections, but a significant % of our public domain holdings. The ramp-up was necessitated by the terms of the grant!
Over 14,500 items and 5.8M images scanned since 2008, mostly via Internet Archive (BHL only). Our other scanning project, running since 2010: over 1,900 items and 600,000 pages. We ramped up VERY QUICKLY, sending 200 items a week for scanning. We needed to spend out funds, BUT quality suffered: shipments started failing QC, so we scaled back. Fewer problems now. Rapidity is a function of non-destructive scanning and care with fragile/rare material; QC TAKES A LONG TIME, but it saves rescanning later. Averaging ~4,000 images/month locally; IA averages 104,000 images/month. Storage (est. 600MB per package: zipped, compressed, lossy JP2s, etc.) at IA = over 10TB (8.3TB BHL + 1.2TB SI?). Storage locally since 2011: avg. package size is 23.4GB, more than 4.5TB total. Saving TIFFs and JP2s.
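The storage figures in these notes can be sanity-checked with quick arithmetic. All inputs are the estimates quoted above; TB here is decimal (1 TB = 1,000,000 MB).

```python
# Back-of-envelope check of the storage figures quoted in the notes.
# ~600 MB per zipped JP2 package at Internet Archive, 14,500+ items.
items_at_ia = 14_500
mb_per_ia_package = 600
ia_tb = items_at_ia * mb_per_ia_package / 1_000_000  # MB -> TB (decimal)
# ~8.7 TB, the same order as the 8.3 + 1.2 TB quoted above.

# Local packages (TIFFs + JP2s) average 23.4 GB each; >4.5 TB total
# implies on the order of a couple hundred packages since 2011.
local_tb = 4.5
gb_per_local_package = 23.4
local_packages = local_tb * 1_000 / gb_per_local_package

print(f"IA estimate: ~{ia_tb:.1f} TB")           # prints "IA estimate: ~8.7 TB"
print(f"Local packages: ~{local_packages:.0f}")  # prints "Local packages: ~192"
```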
You don't necessarily need one huge funding source, but you do need a stream of funding: overlapping grants, creative redeployment of existing resources, project-within-a-project funding. Initial BHL digitization costs were paid from the MacArthur grant to EOL/BHL, which only covers scanning: <$500,000 (will scan approx. 17,000 books, out of over 50,000 likely to be scanned for that project). A rough calculation figured the total cost to scan the entire (BHL) collection (by IA, which is cheap) would be over $2.5M. Funding of personnel and equipment comes from multi-year, overlapping Seidell grants ($1.5M over 7 years). We're expanding scanning to other parts of the collection by setting aside special purpose funds (director's discretionary) for both people and scanning. Future...? Gradually incorporate tasks into permanent staff duties and refill positions judiciously. Seek specific grants for special parts of the collection or special use cases.
Most important use of funds: full-time staff. Feed the beast. They manage and coordinate the workflow, and also do QC and post-scanning maintenance of the online collection. The BHL project is evolving and the workflow is more settled now; we need librarians, not techs. Note that the librarians do more than manage the digitization: metadata issues are now usually routed through our contract cataloging process, which also uses grant funds.
IA: quick, cheap, open access. Downsides: size limit, public domain only, spotty quality. In-house: quality, control. Downsides: slower, more expensive, STORAGE. Speed may be less of a factor once the NEW CAMERA comes online.
Gory details: shoot a target at the beginning of the book only; calibrate (mostly white balance) once per book. Always shoot at greater than 300ppi, relative to the size of the book. Shoot in 16-bit color, Adobe 1998 RGB color space. When images are converted to .tiff, downsample to 8-bit color and standardize on 300ppi (space issues). Apply auto-contrast and auto-levels but no other image editing in CaptureOne, maybe some sharpening if needed. CaptureOne does filenaming, crop, rotate, and conversion to TIFF. QC is done as a first pass right after scanning for all items, by the scanner operator. A second QC is done by other staff on a selected number of items, based on a formula (NISO standard!). QC looks only for 'major' errors such as missing pages, thumbs in the picture, or cut-off text: anything that would adversely affect the OCR. We are concerned only with the "content," since this is an ACCESS copy, not the book as ARTIFACT. After scanning, the operator manually moves the files onto the Macaw server, into the directory already created (the naming convention is the barcode, same as the filenames).
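The downsample-to-8-bit, 300ppi TIFF step could look roughly like this with the Pillow imaging library. This is a sketch only: in production CaptureOne performs the conversion, and the function name and paths here are hypothetical.

```python
# Sketch of the post-capture conversion described above: force 8-bit
# color and tag the saved file as 300 ppi. No sharpening, levels, or
# contrast adjustments are applied here; CaptureOne handles those.
from PIL import Image

def to_access_tiff(src_path: str, dst_path: str) -> None:
    with Image.open(src_path) as img:
        if img.mode not in ("RGB", "L"):
            # Downsample 16-bit / RGBA / other modes to 8-bit RGB.
            img = img.convert("RGB")
        img.save(dst_path, format="TIFF", dpi=(300, 300))
```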
Digitization can happen anywhere: multiple vendors, in house, legacy stuff you scanned way back, small grants, special projects, the main mass-digi stream, extraction of pretty pictures for reuse. The bulk for us is done by IA (cost- and grant-driven for BHL), but they can't do everything. All the various workflows = BLUE SPAGHETTI BARF. Hard to track, stuff everywhere, doesn't scale (duh); we need to refine processes, standardize, and harmonize small-scale projects with the large-scale project.
Basic workflow. Key elements: an item-level metadata & workflow tracking DB; SIRIS as the official metadata repository; IA as the staging (and temporary storage) area.
We use IA as staging for convenience: it's already used by the BHL project, there's plenty of storage space, they do OCR and create derivatives for us, and, as a plus, everything is available to everyone on IA. We accept a common basic metadata model (for the book format) based on the BHL/IA model; it suits most things. Still to solve: storage, presentation, non-IA-compatible stuff (e.g., in copyright). However, creating metadata and uploading to IA would be a time-intensive manual process; to be efficient you must AUTOMATE. Local scanning needed a tool to upload to IA and create metadata = Macaw.
Use & reuse data you already have; find protocols to extract the data you have. HOW? Through MACAW! For us, title-level MARC data can come from SIRIS via Z39.50. Item-level data is not as accessible, so we extracted it in bulk and stored it in a separate DB that we use for workflow; Macaw then automatically harvests it from that DB when necessary. Macaw transforms the harvested data to XML. Descriptive page-level data is still entered by hand, but technical metadata (image size) is extracted automatically, and the transformation to XML is automated. We also automate the TIFF->JP2 transformation and the bundling and uploading of locally created files to the IA staging area (easier said than done).
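The "bundle and upload" step can be sketched with the standard library: zip the derivative JP2s together with the metadata files, ready for transfer to the IA staging area. The directory layout and function name are hypothetical, the JP2 conversion is assumed to have already happened, and the actual upload (done by Macaw) is out of scope here.

```python
# Sketch of the packaging step described above: zip every derivative
# .jp2 and metadata .xml for one item into a single <barcode>.zip,
# ready for transfer to the Internet Archive staging area.
import os
import zipfile

def package_item(item_dir: str, barcode: str) -> str:
    """Zip every .jp2 and .xml under item_dir into <barcode>.zip."""
    zip_path = os.path.join(item_dir, f"{barcode}.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(item_dir)):
            if name.endswith((".jp2", ".xml")):
                zf.write(os.path.join(item_dir, name), arcname=name)
    return zip_path
```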
When a book is selected for scanning in the workflow database, Macaw (which checks it every couple of hours) imports the item-level data (barcode, volume, etc.) and creates a directory on its server to hold the metadata and scans. It then imports the MARC record from SIRIS via Z39.50 and converts it to MARCXML, saved in a file. The item-level data is stored in a database. When the scanner operator moves the scanned images to the directory, Macaw creates thumbnails for use in the interface.
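Macaw's import cycle, as described in this paragraph, might be sketched like so. Everything here is an assumption for illustration: table and column names, statuses, and the SQLite backend are made up, and the Z39.50 MARC fetch is only stubbed out in a comment.

```python
# Sketch of the Macaw import cycle described above: poll the workflow
# DB for newly selected items, create a directory per barcode, and
# mark the item as staged. Schema and statuses are hypothetical.
import os
import sqlite3

def import_new_items(db_path: str, staging_root: str) -> list:
    """Return the barcodes of items newly staged for scanning."""
    staged = []
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT barcode, volume FROM items WHERE status = 'selected'"
        ).fetchall()
        for barcode, volume in rows:
            # One directory per item; the naming convention is the barcode.
            os.makedirs(os.path.join(staging_root, barcode), exist_ok=True)
            # Real Macaw would also fetch the MARC record from SIRIS via
            # Z39.50 here and save it as MARCXML alongside the scans.
            conn.execute(
                "UPDATE items SET status = 'staged' WHERE barcode = ?",
                (barcode,),
            )
            staged.append(barcode)
    return staged
```

In practice this would run on a schedule (every couple of hours, per the notes) rather than on demand.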
The operator scans the barcode for an item and is taken to the editing page to add page-level metadata (page type, page number) and structure (page sequence), which is stored in an XML file. It's an easy-to-use GUI, with shortcuts for common operations, like selecting alternate pages to apply recto/verso and page-type descriptions. You can re-order pages, especially useful if you've scanned all the rectos and then all the versos. It contains extra fields that we can use locally for other projects: captions, notes, a flag for 'interestingness' (e.g., for a blog post or the like). Once a book is "finished," unless it is flagged for QC by other staff, Macaw creates the page-level XML file, converts the .tiffs to lossy compressed JP2s, zips the compressed JP2s, and sends the entire metadata + scans package up to the Internet Archive, which is a lot easier to say than it is to do. It is also copied locally to NAS for temporary storage.
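The "alternate pages" shortcut mentioned above can be expressed as a tiny function. This is a sketch only: Macaw's real interface is a web GUI, and the data shape here (a list of page-metadata dicts) is an assumption.

```python
# Sketch of the alternate-page shortcut described above: apply one
# page-type label to every other page, starting at a chosen index.
def label_alternate(pages, start, page_type):
    """pages is a list of page-metadata dicts ordered by scan sequence."""
    for i, page in enumerate(pages):
        if i >= start and (i - start) % 2 == 0:
            page["page_type"] = page_type
    return pages

# e.g. mark every second page starting from index 0 as a recto
pages = [{"sequence": i} for i in range(4)]
label_alternate(pages, 0, "recto")
```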
IA scanning is for "access" only, not preservation: managing expectations. Color and calibration are issues. Current equipment is still slow to set up and handle oversize materials. We're not embedding descriptive metadata in the page images; we need to automate this. Send to DAMS/other.
Throughput: the new camera should help. MSS (manuscripts) and un-cataloged items need software tweaks for the metadata; we also need to develop auto-export to local storage, a.k.a. the DAMS. We're starting to repurpose images already (importing directly into our Galaxy of Images collection) but hope to integrate into an online exhibition workflow involving the DAMS and ? Who knows. Output to METS for storage; not thinking about PREMIS just yet. Islandora for storage and/or delivery of METS-based docs. We need to harvest back scans and metadata from multiple locations so we can manage corrections, storage (fault lines!), and possible replication of the BHL corpus. The interface will be part of the new digital library.