“MORE”
More information on the SIL digitization
             program than you require
                                                          Keri Thompson
                                         Smithsonian Institution Libraries
                          SPIN Rapid Capture Workshop February 16, 2012
Boutique Digitization

                                          Boutique
                                             One-offs
                                             Item-based workflow
                                             Tailored metadata
                                             Hand-crafted data,
                                              much user intervention
                                             Opportunistic staffing
                                             Project specific grants

 Illustration by A.E. Marty (1882-1974)
 Gazette du Bon Genre, July 1920
 Smithsonian Institution Libraries
Mass Digitization

Prêt à lire

   Standardization
   Format-based workflow
    and metadata model
   Automate as much as
    possible
   Assigned staff
   Funding stream

                            New York Millinery and Supply Co. , 1901
                            Smithsonian Institution Libraries
Ramping Up
   Find your niche
   Secure Funding
   Hire Staff
   Purchase Equipment
   Standardize on metadata, processes
   Automate!
       i.e., find magic automation wizard
Our Little Corner of the Web



                         10 original partner
                          institutions
                         Digitizing legacy
                          literature of taxonomy
                         Over 50,000 titles, over
                          100,000 items, almost
                          38 million pages
Numbers!
   [Chart: Digitization at SI Libraries, 1999-present. Total items, axis 0 to 14,000; phases labelled "not rapid", "rapid", "too rapid"]
   Storage estimates
       At Internet Archive >10TB
       Locally >7.5TB
Funding
   Multiple grants
   Over multiple years
   Lather, rinse, repeat




                            Kalamazoo Tank & Silo Co.
                            Catalog, ca. 1909
                            Smithsonian Institution Libraries
Human Resources
   Started in 2008 with
       2 FTE technicians (Grant)
       .7 FTE manager
       .5 FTE cataloger
       Vendor scanning only
       And a host of others!


   In 2012 have
       1 FTE technician (Grant)
       2 FTE librarians (Grant)
       .3 FTE manager
       1 scanning technician (Grant)
       And a host of others!

                                        International Time Recording Co.
                                        Time Recording Card Clocks , 1914 , p.12
                                        Smithsonian Institution Libraries
Equipment
   Canon 5D MkII, Biblio
   PhaseOne P65, CaptureOne
   BC100, CaptureOne
In-House Scanning

                       P65, 60.5MP camera
                       Strobe lights
                       Image capture
                       Filenaming
                       Crop, rotate
                       No post-processing
                       Convert to .tiff
Process(es)(es)
   [Diagram labels: Data sources; Vendor; “gap-fills”; Requests; Special projects; In-house use (exhibitions, brochures); Website; presentation; storage]
Workflow: Generalized workflow
   [Diagram labels]
   DB: Initiate workflow; Item level metadata; Mark as scanned
   SIRIS: Title level MARC; URLs in MARC record
   Steps: Select & Dedupe; Check out and Ship; Scanning; Check in and QC; Item available in IA/BHL; Check in, Add link
   Internet Archive: JP2000s + metadata
   Harvest to Local Repository
Standardize Process and Data

   Common staging area
   Metadata Model
       Title level (MARC) metadata
       Item level metadata
           volume, issue, date, barcode
       Page level metadata
           sequence, page number, page type
   Common storage area
   Common presentation area
                                               Ericsson LM, Can Efficiency be Measured?
                                               Stockholm, Sweden, 1946
                                               Smithsonian Institution Libraries
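
A minimal sketch of the three-level metadata model on this slide, written as Python dataclasses. Only the levels and the listed fields (volume, issue, date, barcode; sequence, page number, page type) come from the slide. The class names, the `marc_record_id` field, and the `missing_sequences` helper are illustrative assumptions, not part of any SIL system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    # Page level metadata: sequence, page number, page type
    sequence: int                 # position of the image within the item
    page_number: Optional[str]    # printed page number, if any
    page_type: str                # free-text type; allowed values are not given in the slide


@dataclass
class Item:
    # Item level metadata: volume, issue, date, barcode
    barcode: str
    volume: Optional[str] = None
    issue: Optional[str] = None
    date: Optional[str] = None
    pages: List[Page] = field(default_factory=list)


@dataclass
class Title:
    # Title level metadata lives in the MARC record; this holds only a reference
    marc_record_id: str           # hypothetical identifier for the MARC record
    items: List[Item] = field(default_factory=list)


def missing_sequences(item: Item) -> List[int]:
    """Return sequence numbers absent from 1..max, a simple completeness check."""
    present = {p.sequence for p in item.pages}
    if not present:
        return []
    return [n for n in range(1, max(present) + 1) if n not in present]
```

Keeping pages under items and items under titles mirrors the slide's nesting, so a check such as `missing_sequences` can run per item without touching the title-level record.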
Automate Metadata Capture & Transformation

   Extract title level metadata
       MARC -> MARCXML
   Extract item level metadata
       From SIRIS -> SQL db -> xml file
   Page level metadata
       Interface for easy data entry
   File creation and conversion
   Upload to staging area

National Cash Register
Annual Report, 1953
Smithsonian Institution Libraries
Workflow: In-house workflow with Macaw
   [Diagram labels]
   DB: Initiate workflow; Item level metadata; Mark as scanned
   SIRIS: Title level MARC; URLs in MARC record
   Steps: Select & Dedupe; Check out and Ship; Scanning; Check in and QC; Item available in IA/BHL; Check in, Add link
   Macaw: Creates metadata “Bucket”; Transforms images, creates derivatives; Page level metadata added; Packages files for transfer
   Scanning output: .tiffs; Temp. backup to NAS
   To Internet Archive: JP2000s + metadata
Metadata Collection and Workflow (Macaw)
Room for Improvement
   Quality
   Speed
   Embed metadata




        Kenwood Bicycle Mfg. Co.
        Catalogue for 1895 , 1895
        Smithsonian Institution Libraries
Future

   Increase throughput
   Scan non-book items (MSS)
   Scan un-cataloged items
   Frictionless repurposing
   Output to METS
   Islandora
   Local delivery interface



                                Collier’s, October 18, 1952
                                Smithsonian Institution Libraries
Thank You!
THAT IS ALL.
   Keri Thompson
thompsonk@si.edu
    @DigiKeri_SIL

Editor's notes

  1. History: scanning since 1999. Created “digital editions”: whole books delivered via the website, scanned using a BetterLight, saved as TIFFs, converted to JPG. Stored on gold CDs! And Tivoli. Metadata entered via cut and paste into spreadsheets. In the beginning: HTML pages, one per book page! Then database-driven pages. Each book scanned was a unique project. Some projects had grant funding, some didn’t. The end result was not stored in a content or collections management system, just on the website.
  2. To increase volume, you must standardize: what metadata is collected, etc. Try to accommodate most things you’ll scan, but inevitably one size won’t fit all. Figure out what you’re willing to compromise on and live with. Format-based = books one way, photos another, audio another. Automation for efficiency and speed; staffing for consistency, quality control, and speed. You don’t necessarily need one huge funding source, but you do need a stream of funding. More than project-based, but not necessarily the whole enchilada. Leverage that as proof of concept for funding for other parts of the collection, OR for funding additional services/features. Overlapping grants, creative redeployment of existing resources, project-within-a-project funding.
  3. SIL’s rapid capture methodology is based on one large project (BHL) and its needs. We then extend the model from there. Just one way of approaching it. We had an initial grant for digitization, supplemented with two more. More will need to come. We use funding primarily for STAFF, then for vendor/outsource, then for equipment/software. The process has taken a couple of years to standardize. Couldn’t have standardized and rapidized the process without the automation.
  4. Catalyst for our ramp-up came in 2008 (or thereabouts), when Smithsonian Libraries and MoBot spearheaded the creation of BHL. The primary audience was the international taxonomic community, and we had plenty of collections that were relevant. We are primarily scanning from our NH collections, as well as the Cullman rare book collections. Those make up only n% of the total SIL collections, but it’s a significant % of our public domain holdings. Ramp-up was necessitated by the terms of the grant!
  5. Over 14,500 items and 5.8m images scanned since 2008, mostly via Internet Archive (BHL only). Our other scanning project, since 2010: over 1900 items and 600,000 pages. Ramped up VERY QUICKLY, sending 200 items a week for scanning. Needed to spend out funds, BUT quality suffered. Shipments started failing QC, so we scaled back. Fewer problems now. Rapidity is a function of non-destructive scanning and care with fragile/rare items. QC TAKES A LONG TIME, but saves rescanning later. Averaging ~4000 images/month locally; IA averages 104,000 images/month. Storage (est. 600MB per package: zipped, compressed lossy jp2s, etc.) at IA = over 10TB (? 8.3TB BHL + 1.2TB? SI). Storage locally since 2011: avg. package size is 23.4GB, more than 4.5TB. Saving tiffs, jp2s.
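
The storage figures in this note can be roughly cross-checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1TB = 1000GB = 1,000,000MB) and one package per scanned item, neither of which the note states, and the function name is illustrative.

```python
def total_tb(packages: int, avg_size_mb: float) -> float:
    """Estimated total storage in TB for `packages` packages of `avg_size_mb` MB each."""
    return packages * avg_size_mb / 1_000_000  # MB -> TB, decimal units (assumption)


# Over 14,500 items at an estimated 600MB per package
ia_estimate_tb = total_tb(14_500, 600)

# Number of 23.4GB local packages implied by a 4.5TB total
local_packages = 4.5 * 1000 / 23.4
```

Under these assumptions, 14,500 packages of 600MB come to about 8.7TB, and 4.5TB of local storage corresponds to roughly 190 packages of 23.4GB each.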
  6. You don’t necessarily need one huge funding source, but you do need a stream of funding. Overlapping grants, creative redeployment of existing resources, project-within-a-project funding. Initial BHL digitization costs paid from MacArthur grant to EOL/BHL: only covers scanning, <$500,000 (will scan approx. 17,000 books, out of over 50,000 likely to scan for that project). Rough calculation figured the total cost to scan the entire (BHL) collection (by IA, which is cheap) would be over $2.5m. Funding of personnel and equipment from multi-year overlapping Seidell grants (1.5m over 7 years). Expanding scanning to other parts of the collection by setting aside special purpose funds (director’s discretionary) for both people and scanning. Future…? Gradually incorporate tasks into permanent staff tasks/refill positions judiciously. Seek specific grants for special parts of the collection or special use cases.
  7. Most important use of funds. Full time. Feed the beast. Manage and coordinate workflow, also do QC and post-scanning maintenance of the online collection. BHL project evolving, workflow more settled now, need librarians not techs. Note that the librarians do more than manage the digitization. Metadata issues are now usually routed through our contract cataloging process and also use grant funds.
  8. IA: quick, cheap, open access. Downsides: size limit, public domain only, quality spotty. In-house: quality, control. Downsides: slower, more expensive, STORAGE. Speed may be less of a factor once the NEW CAMERA comes online.
  9. Gory details: shoot target at beginning of the book only, calibrate (mostly white balance) once per book > always shoot greater than 300ppi, relative to the size of the book > shoot in 16-bit color, Adobe RGB (1998) color space > when images are converted to .tiff, downsample to 8-bit color and standardize on 300ppi (space issues) > apply auto-contrast and auto-levels but no other image editing in CaptureOne, maybe some sharpening if needed. CaptureOne does filenaming, crop, rotate, and convert to tiff. QC done as first pass right after scanning for all items, by the scanner. Second QC is done by other staff on a selected number of items, based on a formula (NISO standard!). QC is looking only for ‘major’ errors such as missing pages, thumbs in picture, cut-off text: anything that would adversely affect the OCR. We are concerned only with the “content” since this is an ACCESS copy, not book as ARTIFACT. After scanning, the operator manually moves the files onto the Macaw server, into the directory already created (naming convention is barcode, same as filenames).
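
The “greater than 300ppi, relative to the size of the book” rule is simple arithmetic: effective resolution is the number of pixels along an edge divided by the physical length of that edge. A small sketch, with hypothetical function names. The pixel counts passed in are whatever the camera actually produces along the measured edge; no specific sensor dimensions are assumed here.

```python
def effective_ppi(pixels_along_edge: int, inches_along_edge: float) -> float:
    """Pixels per inch achieved when an edge of the given length fills the frame."""
    return pixels_along_edge / inches_along_edge


def exceeds_minimum(pixels_along_edge: int, inches_along_edge: float,
                    min_ppi: float = 300.0) -> bool:
    """True if the capture is strictly above the minimum ppi, as the note requires."""
    return effective_ppi(pixels_along_edge, inches_along_edge) > min_ppi


def max_edge_inches(pixels_along_edge: int, min_ppi: float = 300.0) -> float:
    """Physical edge length at which the capture drops to exactly `min_ppi`."""
    return pixels_along_edge / min_ppi
```

For example, `effective_ppi(9000, 20)` is 450.0, so a 20-inch edge captured across 9000 pixels clears the threshold, while a 36-inch edge at the same pixel count (250.0 ppi) does not.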
  10. Digitization can happen anywhere: multiple vendors, in-house, legacy stuff you scanned way back. Small grants, special projects, main mass-digitization stream, extraction of pretty pictures for reuse. Bulk for us is done by IA (cost- and grant-driven for BHL), but they can't do everything. All the various workflows = BLUE SPAGHETTI BARF. Hard to track, stuff everywhere, doesn't scale (duh). Need to refine processes, standardize, and harmonize small-scale projects with the large-scale project.
  11. Basic workflow. Key elements: item-level metadata and workflow tracking db; SIRIS as official metadata repository; IA as staging (and temporary storage) area.
  12. Use IA as staging for convenience: already used by the BHL project. Plenty of storage space, they do OCR and create derivatives for us, plus everything is available to everyone on IA. Accept a common basic metadata model (for book format) based on the BHL/IA model. Suits most things. Still to solve: storage, presentation, non-IA-compatible stuff (e.g. in copyright). However, creating metadata and uploading to IA would be a time-intensive manual process. To be efficient, must AUTOMATE. Local scanning needed a tool to upload to IA and create metadata = Macaw.
  13. Use and reuse data you already have. Find protocols to extract it. HOW? Through MACAW! For us, we can get title-level MARC data from SIRIS via Z39.50. Item-level data is not as accessible, so we extracted it in bulk and stored it in a separate db that we use for workflow; Macaw then automatically harvests it from that db when necessary. Macaw transforms harvested data to XML. Descriptive page-level data is still entered by hand, but technical metadata (image size) is extracted automatically, and the transformation to XML is automated. Also automated: transformation TIFF -> JP2, bundling, and uploading of locally created files to the IA staging area. (Easier said than done.)
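As a rough illustration of the "transform harvested data to XML" step, the sketch below turns a list of page records into a small XML document using Python's standard library. The element and attribute names (`item`, `page`, `file`, `width`, `height`) and the sample values are invented for this example; they are not Macaw's actual schema, and Macaw itself may work quite differently.

```python
import xml.etree.ElementTree as ET


def pages_to_xml(barcode, pages):
    """Serialize page-level technical metadata to an XML string.

    Hypothetical schema for illustration only.
    """
    root = ET.Element("item", {"barcode": barcode})
    for sequence, page in enumerate(pages, start=1):
        page_el = ET.SubElement(root, "page", {"sequence": str(sequence)})
        ET.SubElement(page_el, "file").text = page["file"]
        ET.SubElement(page_el, "width").text = str(page["width"])
        ET.SubElement(page_el, "height").text = str(page["height"])
    return ET.tostring(root, encoding="unicode")


# Made-up sample data: filenames follow the barcode convention from the notes.
sample_pages = [
    {"file": "39088000000001_0001.tif", "width": 1800, "height": 2700},
    {"file": "39088000000001_0002.tif", "width": 1800, "height": 2700},
]
xml_text = pages_to_xml("39088000000001", sample_pages)
print(xml_text)
```

The point is only that once the technical values are in a structured form, writing the XML is mechanical and needs no operator time.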
  14. When a book is selected for scanning in the workflow database, Macaw (which checks it every couple of hours) imports the item-level data (barcode, volume, etc.) and creates a directory on its server to hold the metadata and scans. It then imports the MARC record from SIRIS via Z39.50 and converts it to MARCXML, saved in a file. The item-level data is stored in a database. When the scanner moves the scanned images to the directory, Macaw creates thumbnails for use in the interface.
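The "check the workflow database, then create a directory per item" step might look something like the sketch below. This is an assumption-laden illustration, not Macaw's code: the function name, the shape of `selected_items`, and the idea of passing the selected items in as a list (rather than querying a real database) are all hypothetical. The only detail taken from the notes is that the directory is named after the barcode.

```python
from pathlib import Path


def prepare_item_directories(selected_items, scan_root):
    """Create one directory per newly selected item, named by barcode.

    `selected_items` stands in for rows read from the workflow database
    (hypothetical shape: dicts with at least a "barcode" key).
    Returns the barcodes for which a directory was newly created.
    """
    scan_root = Path(scan_root)
    created = []
    for item in selected_items:
        item_dir = scan_root / item["barcode"]
        if item_dir.exists():
            continue  # already imported on an earlier pass
        item_dir.mkdir(parents=True)
        created.append(item["barcode"])
    return created
```

Because existing directories are skipped, the function can be run on a schedule (every couple of hours, say) without redoing work for items already imported.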
  15. Operator scans the barcode for the item and is taken to the editing page. Add page-level metadata (page type, page number) and structure (page sequence). Stored in an XML file. Easy-to-use GUI, with shortcuts to common operations, like selecting alternate pages to apply recto/verso and page type descriptions. Can re-order pages, especially useful if you've scanned all the rectos then all the versos. (Click.) Contains extra fields that we can use locally for other projects: add captions, notes, flag for 'interestingness', e.g. for a blog post. Once a book is "finished", unless it is flagged for QC by other staff, Macaw creates the page-level XML file, converts the .tiffs to lossy compressed JP2s, zips the compressed JP2s, and sends the entire metadata+scans package up to Internet Archive, which is a lot easier to say than it is to do. Also copied locally to NAS for temporary storage.
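The re-ordering case mentioned above (all rectos scanned first, then all versos) comes down to interleaving two halves of a list. A minimal sketch follows, assuming both runs were shot in ascending page order and that the book starts on a recto; if the versos were shot back to front, that half would need reversing first. The function name is hypothetical and this is not how the Macaw GUI implements it.

```python
def interleave_rectos_versos(scans):
    """Put a rectos-then-versos scan sequence into reading order.

    Assumes the first half (rounded up) of `scans` holds the rectos and
    the remainder holds the versos, each in ascending page order.
    """
    n_rectos = (len(scans) + 1) // 2
    rectos, versos = scans[:n_rectos], scans[n_rectos:]
    ordered = []
    for i, recto in enumerate(rectos):
        ordered.append(recto)
        if i < len(versos):  # the last recto may have no verso scanned
            ordered.append(versos[i])
    return ordered


print(interleave_rectos_versos(["r1", "r2", "r3", "v1", "v2", "v3"]))
```

An odd number of scans is treated as one extra recto at the end, which is an assumption worth checking against the actual book before trusting the result.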
  16. IA scanning is for "Access" only. Not Preservation. Managing expectations. Color and calibration. Current equipment still slow to set up and to handle oversize materials. Not embedding descriptive metadata in page images. Need to automate this. Send to DAMS/other.
  17. Throughput: new camera should help. MSS and uncataloged items need software tweaks for the metadata; also need to develop auto-export to local storage, aka DAMS. Starting to repurpose images already (import directly into our Galaxy of Images collection) but hope to integrate into an online exhibition workflow involving DAMS and ? Who knows. Output to METS for storage. Not thinking about PREMIS just yet. Islandora for storage and/or delivery of METS-based docs. Need to harvest back scans and metadata from multiple locations so we can manage corrections, storage (fault lines!), and possible replication of the BHL corpus. Interface as part of the new digital library.