Lance Stuchell and Kat Hagedorn,
University of Michigan Library
The Great Migration: Moving First
Generation Digital Texts to HathiTrust
July 22, 2014 - Digital Preservation 2014
#digpres14
Overview
We’ve been migrating our older collections to
HathiTrust for 5 years now (since 2009).
Recently, we’ve been taking a closer
look at correctly preserving this content.
We’ve been doing that for the past 2 years
(since early 2012).
Overview
DLXS: (Digital Library eXtension Service)
●  Includes content from 1995 to present
●  Images, texts, bibliographies, finding aids
HathiTrust:
●  Consortium of research institutions and libraries
●  Content includes material created through Google
digitization project
Two repositories
Some things are similar:
●  File formats (TIFF, JPEG2000), structure of
repository, repository management staff
●  Content to be migrated (printed volumes)
However:
●  HathiTrust has stricter technical requirements
●  HathiTrust has a more formal ingest process
Our team
Our team at U-M includes:
John Weise - Head of the Digital Library Production Service (DLPS)
Cory Snavely - Head of the Core Services (infrastructure) unit
Aaron Elkiss - Systems Programmer in Core Services
Chris Powell - Coordinator for Text Collections in DLPS
Matt LaChance - Data Processing Automation Programmer in DLPS
Lance Stuchell - Digital Preservation Librarian
Kat Hagedorn - Project Manager for Digital Projects in DLPS
The Weasley House effect
When we started, we were balancing
pragmatism with best practices and a
thoughtful approach.
We didn’t have good validation tools,
because they didn’t exist.
We got smarter as we went along, and we
got better tools to help us.
But it means we don’t have consistency
across all our collections.
The pain point
95% of our materials can be ingested easily -
they pass current validation with few or no
problems.
However, that remaining 5%...
It’s not really broken
Problem | Details | Solution
--- | --- | ---
Bitonal image resolution | Resolution value is incorrect or zero | Manually fix the images
Invalid but well-formed bitonal images | Images are viewable, but the image metadata is incorrect | Batch fix those that can be automated, manual intervention for the rest
Unexpected contone image dimensions | Some images in a volume are a suspicious size relative to the other pages | Automate discovery, but in the end inspect and annotate manually
“Funny” filenames | Filenames that met a previous spec don’t meet current spec | Automate discovery and fix automatically (pattern matching)
Bad/ambiguous dates | e.g., 4-5-98, 040506 | Automate discovery and fix of patterns, but some will need manual fixing
Legitimate sequence skips | Skips in the numbering or nomenclature of a sequence (not the page) | Automate discovery and most fixes, manually rename some
It’s wrong to call these broken or unbroken - they don’t meet our standards of preservation as-is.
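As a rough illustration of the "automate discovery" step for bad or ambiguous dates, the sketch below flags date strings shaped like the two examples in the table. The patterns, the function name `classify_date`, and the choice of an ISO-style `YYYY-MM-DD` target are assumptions made for this example. The slides do not describe the actual validation rules.

```python
import re

# Hypothetical patterns: the slides do not say which formats the real checks use.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")                # e.g. 1998-04-05
SHORT_DATE = re.compile(r"^\d{1,2}[-/]\d{1,2}[-/]\d{2}$")    # e.g. 4-5-98
SIX_DIGITS = re.compile(r"^\d{6}$")                          # e.g. 040506


def classify_date(value: str) -> str:
    """Return 'ok', 'ambiguous', or 'unrecognized' based on the string's shape only."""
    value = value.strip()
    if ISO_DATE.match(value):
        return "ok"
    if SHORT_DATE.match(value) or SIX_DIGITS.match(value):
        # Day/month order and the century cannot be inferred from the string alone.
        return "ambiguous"
    return "unrecognized"


if __name__ == "__main__":
    for sample in ["4-5-98", "040506", "1998-04-05", "circa 1998"]:
        print(sample, "->", classify_date(sample))
```

Strings reported as ambiguous or unrecognized would still need a person to decide the correct value, which matches the table's note that some dates need manual fixing.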
It’s really broken
Problem | Details | Solution
--- | --- | ---
Blank contone images | Contone is a blank image | Need to reload/rescan the volume
Contone images don’t match bitonals | Misaligned sequences of contones and bitonals within the volume | Manual work to discover the correct matches, then make those changes
Skipped pages | Illegitimate skips | Need to reload/rescan the volume
OCR but no images | OCR for certain sequences, but no matching images | After discovery, may need to rescan some images, but some images may not be needed
Character errors | e.g., em-dashes, control characters not recognizable by XML | After discovery, remove/replace those characters
Non-viewable, malformed images | Cannot open, view or discover the problems with certain images | “Cut bait” (rescan the entire volume)
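For the character errors row, a minimal sketch of discovery and removal might look like the following. It finds characters that fall outside the `Char` production of XML 1.0 and strips or replaces them. The function names are hypothetical and this is not the tooling used in the project. An em-dash is itself a legal XML character, so the em-dash errors mentioned above presumably stem from encoding problems, which this sketch does not address.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text: str) -> list:
    """Return (offset, code point in hex) for each character XML 1.0 disallows."""
    return [(m.start(), hex(ord(m.group()))) for m in INVALID_XML_CHARS.finditer(text)]


def strip_invalid_xml_chars(text: str, replacement: str = "") -> str:
    """Remove (or replace) every character XML 1.0 disallows."""
    return INVALID_XML_CHARS.sub(replacement, text)


if __name__ == "__main__":
    ocr_text = "first page\x0csecond page"  # form feed between pages
    print(find_invalid_xml_chars(ocr_text))
    print(strip_invalid_xml_chars(ocr_text, " "))
```

Reporting offsets before changing anything keeps discovery separate from the fix, so a reviewer can decide whether to delete or replace each character.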
The “note from mom” tool
We needed a way to flag volumes that we were ingesting as-is
because a thoughtful analysis showed they didn’t require a fix,
and a method for recording that analysis prior to ingest, e.g.,
●  legitimate printed volume errors (missing pages)
●  those unexpected contone image dimensions (if they look good on
manual inspection)
We built an ingest utility - aka the “note from mom” - to
ingest these volumes.
“Final” result?
Of the 167,102 volumes in our text collections,
we have validated and migrated 25,339 of them
to HathiTrust (15%).
●  18 collections (mostly) ingested (20%), 31 next on the docket
●  41 cannot be currently migrated
It took us most of the past 2 years to validate
and ingest these volumes.
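The percentages on this slide can be checked with a few lines of arithmetic. The sketch below assumes that the three collection counts (18, 31, and 41) together make up all collections, which is consistent with the 20% figure but is not stated outright on the slide.

```python
# Figures taken from the slide.
total_volumes = 167_102
migrated_volumes = 25_339

# Assumption: these three groups cover every collection (18 + 31 + 41 = 90).
collections = {"ingested": 18, "next_on_docket": 31, "cannot_migrate": 41}


def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, expressed as a percentage."""
    return 100 * part / whole


volume_share = percent(migrated_volumes, total_volumes)
collection_share = percent(collections["ingested"], sum(collections.values()))

if __name__ == "__main__":
    print(f"volumes migrated: {volume_share:.1f}%")
    print(f"collections ingested: {collection_share:.1f}%")
```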
Full disclosure
Categories that need special help:
●  Some volumes we deemed okay to ingest, but they need to be
revisited (reduced validation).
●  Those collections with permission or ownership questions.
●  Those collections that we built to contain contones we couldn’t
include in our text collections.
●  Those collections that contain more highly-encoded volumes (level
4 TEI).
Some collections can’t be migrated because they are licensed, don’t
have page images, etc.
Hard, but intriguing
We have been:
●  intrigued - by the extent of certain problems
●  annoyed - by image reloads that seem to make no sense
●  surprised - at the differences between earlier years and today in
terms of preservation tools and expertise
But in the end, this is fun! It just takes a lot of
time, so we encourage you to start now.
Opportunity for reflection
DLXS collections were digitized with
preservation in mind
Migration project offers an
opportunity to evaluate how we did,
where we have gone, and where we
are now
Pensive_John by DiscourseMarker
How did we do?
●  Small percentage of files are “broken,” non-assessable,
or require rescanning
●  Most errors happened in digitization and/or
package creation - we’re not sure which
o  Not caused by degrading in the repository
o  Not caused by anything inherent in the file formats
●  Most errors discovered during migration
Way back in 1995...
●  Content was created with long term
preservation in mind, but…
o  Ingest validation was minimal
o  Relied on vendors creating correct
content [pause for laughter]
o  Metadata that was to become the
engine of management at scale
was inconsistent or incorrect
Road from then to now
●  Shift from “hand-crafted” to mass digitization led to
unprecedented scale
●  Further reliance on tools facilitating automated
processes like ingest (JHOVE, etc.)
●  Increased use of standards (METS, PREMIS, adoption
of OAIS and TRAC concepts)
●  Effort to move away from anecdotal knowledge of
issues with content
Are we better now?
●  Processes allow for consistent metadata
creation and content validation at scale
●  Use of standards accommodates large scale
repository changes (METS & PREMIS uplift)
●  Using PREMIS and local metadata for more
documentation within the repository
●  Content is much more consistent
Future issues: The outliers
●  Necessity of documenting
heterogeneous material
increases along with scale
o  “Note from mom”
●  How to document complicated
special cases?
o  Shift from anecdotal
knowledge is not complete
outlier by Robert S. Donovan
Future issues: tool dependence
●  Some significant properties do not
have perfect validation methods
●  Scale and access demands make
manual review impossible
●  There are a variety of methods to
detect and fix these after ingest,
including re-scanning as a “last resort”
Future issues: tool dependence
●  What happens in cases without
rescanning as a “last resort”?
o  Born digital
o  File format migration
o  Degraded A/V material
●  How does the community prioritize tool
development?
The punch line (for me)
Preservation is iterative, not static
●  Formats have (so far) been very stable
●  Metadata and supporting elements evolve
o  Tools to facilitate automated processes
o  METS and PREMIS uplifts
o  “note from mom,” etc.
●  Constant decision making to balance real-world
constraints with preservation ideals

The Great Migration: Moving First Generation Digital Texts to HathiTrust: Digital Preservation 2014

  • 1. Lance Stuchell and Kat Hagedorn, University of Michigan Library The Great Migration: Moving First Generation Digital Texts to HathiTrust July 22, 2014 - Digital Preservation 2014 #digpres14
  • 2. Overview We’ve been migrating our older collections to HathiTrust for 5 years now (since 2009). Recently, we’ve been taking a closer look at correctly preserving this content. We’ve been doing that for the past 2 years (since early 2012). #digpres14
  • 3. Overview DLXS: (Digital Library eXtension Service) ●  Includes content from 1995 to present ●  Images, texts, bibliographies, finding aids HathiTrust: ●  Consortium of research institutions and libraries ●  Content includes material created through Google digitization project #digpres14
  • 4. Two repositories Some things are similar: ●  File formats (TIFF, JPEG2000), structure of repository, repository management staff ●  Content to be migrated (printed volumes) However: ●  HathiTrust has stricter technical requirements ●  HathiTrust has a more formal ingest process #digpres14
  • 5. Our team John Weise - Head of the Digital Library Production Service (DLPS) Cory Snavely - Head of the Core Services (infrastructure) unit Aaron Elkiss - Systems Programmer in Core Services Chris Powell - Coordinator for Text Collections in DLPS Matt LaChance - Data Processing Automation Programmer in DLPS Lance Stuchell - Digital Preservation Librarian Kat Hagedorn - Project Manager for Digital Projects in DLPS Our team at U-M includes: #digpres14
  • 6. The Weasley House effect When we started, we were balancing pragmatism with best practices and a thoughtful approach. We didn’t have good validation tools, because they didn’t exist. We got smarter as we went along, and we got better tools to help us. But it means we don’t have consistency across all our collections. #digpres14
  • 7. The pain point 95% of our materials can be ingested easily - they pass current validation with little or no problems. However, that remaining 5%... #digpres14
  • 8. It’s not really broken Problem Details Solution Bitonal image resolution Resolution value is incorrect or zero Manually fix the images Invalid but well-formed bitonal images Images are viewable, but the image metadata is incorrect Automation for those can batch fix, manual intervention for the rest Unexpected contone image dimensions Some images in a volume are suspiciously not a reasonable size, in relation to the other pages Automate discovery mechanism, but in the end manually inspecting and annotating “Funny” filenames Filenames that met a previous spec don’t meet current spec Automate discovery, and automatically fixed (pattern matching) Bad/ambiguous dates e.g., 4-5-98, 040506 Automate discovery and fix of patterns, but some will need manual fixing Legitimate sequence skips Skips in the numbering or nomenclature of a sequence (not the page) Automate discovery and most fixes, some manually rename It’s wrong to call these broken or unbroken - they don’t meet our standards of preservation as-is.
  • 9. It’s really broken
       Problem | Details | Solution
       Blank contone images | Contone is a blank image | Need to reload/rescan the volume
       Contone images don’t match bitonals | Misaligned sequences of contones and bitonals within the volume | Manual work to discover the correct matches, then make those changes
       Skipped pages | Illegitimate skips | Need to reload/rescan the volume
       OCR but no images | OCR for certain sequences, but no matching images | After discovery, may need to rescan some images, but some images may not be needed
       Character errors | e.g., em-dashes, control characters not recognizable by XML | After discovery, remove/replace those characters
       Non-viewable, malformed images | Cannot open, view or discover the problems with certain images | “Cut bait” (rescan the entire volume)
  • 10. The “note from mom” tool We needed a way to ingest volumes for which a thoughtful analysis showed that no fix was required, along with a method for recording that analysis prior to ingest. e.g., ●  legitimate printed volume errors (missing pages) ●  those unexpected contone image dimensions (if they look good on manual inspection) We built an ingest utility - aka the “note from mom” - to ingest these volumes. #digpres14
  • 11. “Final” result? Of the 167,102 volumes in our text collections, we have validated and migrated 25,339 of them to HathiTrust (15%). ●  18 collections (mostly) ingested (20%), 31 next on the docket ●  41 cannot be currently migrated It took us most of the past 2 years to validate and ingest these volumes. #digpres14
  • 12. Full disclosure Categories that need special help: ●  Some volumes we deemed okay to ingest, but they need to be revisited (reduced validation). ●  Those collections with permission or ownership questions. ●  Those collections that we built to contain contones we couldn’t include in our text collections. ●  Those collections that contain more highly-encoded volumes (level 4 TEI). Some collections can’t be migrated because they are licensed, don’t have page images, etc. #digpres14
  • 13. Hard, but intriguing We have been: ●  intrigued - by the extent of certain problems ●  annoyed - by image reloads that seem to make no sense ●  surprised - at the differences between earlier years and today in terms of preservation tools and expertise But in the end, this is fun! It just takes a lot of time, so we encourage you to start now. #digpres14
  • 14. Opportunity for reflection DLXS collections were digitized with preservation in mind Migration project offers an opportunity to evaluate how we did, where we have gone, and where we are now Pensive_John by DiscourseMarker #digpres14
  • 15. How did we do? ●  Small percentage of files are “broken,” non-assessable, or require rescanning ●  Most errors happened in digitization and/or package creation - we’re not sure o  Not caused by degrading in the repository o  Not caused by anything inherent in the file formats ●  Most errors discovered during migration #digpres14
  • 16. Way back in 1995... ●  Content was created with long term preservation in mind, but… o  Ingest validation was minimal o  Relied on vendors creating correct content [pause for laughter] o  Metadata that was to become engine of management at scale was inconsistent or incorrect #digpres14
  • 17. Road from then to now ●  Shift from “hand-crafted” to mass digitization led to unprecedented scale ●  Further reliance on tools facilitating automated processes like ingest (JHOVE, etc.) ●  Increased use of standards (METS, PREMIS, adoption of OAIS and TRAC concepts) ●  Effort to move away from anecdotal knowledge of issues with content #digpres14
  • 18. Are we better now? ●  Processes allow for consistent metadata creation and content validation at scale ●  Use of standards accommodates large scale repository changes (METS & PREMIS uplift) ●  Using PREMIS and local metadata for more documentation within the repository ●  Content is much more consistent #digpres14
  • 19. Future issues: The outliers ●  Necessity of documenting heterogeneous material increases along with scale o  “Note from mom” ●  How to document complicated special cases? o  Shift from anecdotal knowledge is not complete outlier by Robert S. Donovan #digpres14
  • 20. Future issues: tool dependence ●  Some significant properties do not have perfect validation methods ●  Scale and access demands make manual review impossible ●  There are a variety of methods to detect and fix these after ingest, including re-scanning as a “last resort” #digpres14
  • 21. Future issues: tool dependence ●  What happens in cases without rescanning as a “last resort”? o  Born digital o  File format migration o  Degraded A/V material ●  How does community prioritize tool development? #digpres14
  • 22. The punch line (for me) Preservation is iterative, not static ●  Formats have (so far) been very stable ●  Metadata and supporting elements evolve o  Tools to facilitate automated processes o  METS and PREMIS uplifts o  “note from mom,” etc. ●  Constant decision making to balance real-world constraints with preservation ideals #digpres14
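
Slide 8 mentions automated discovery of “funny” filenames, sequence skips, and ambiguous dates. The sketch below shows one way such discovery checks might look. It is not the team's actual tooling: the 8-digit filename rule, the extension list, and the function names `find_sequence_gaps` and `is_ambiguous_date` are assumptions made for illustration.

```python
import re

# Hypothetical naming rule (an assumption, not the real ingest spec):
# an 8-digit sequence number plus an extension, e.g. "00000001.tif".
SEQ_NAME = re.compile(r"^(\d{8})\.(tif|jp2|txt)$")


def find_sequence_gaps(filenames):
    """Return (nonconforming_names, missing_sequence_numbers)."""
    nonconforming = []
    seen = set()
    for name in filenames:
        match = SEQ_NAME.match(name)
        if match:
            seen.add(int(match.group(1)))
        else:
            # Filename does not meet the assumed current spec.
            nonconforming.append(name)
    if not seen:
        return nonconforming, []
    # Every number between the lowest and highest seen should be present.
    expected = range(min(seen), max(seen) + 1)
    missing = [n for n in expected if n not in seen]
    return nonconforming, missing


# Date shapes like "4-5-98" or "040506" whose day/month/year order is unclear.
AMBIGUOUS_DATE = re.compile(r"^(\d{1,2}[-/]\d{1,2}[-/]\d{2}|\d{6})$")


def is_ambiguous_date(value):
    """Flag a date string for review; this does not try to fix it."""
    return bool(AMBIGUOUS_DATE.match(value.strip()))
```

A check like this only flags candidates. Whether a skip is legitimate (pages missing from the printed volume) or a real scanning error remains a human decision, which is the case the “note from mom” utility is meant to record.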

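The “character errors” row on slide 9 (control characters not recognizable by XML) can be illustrated with a short filter. This is a minimal sketch that assumes the OCR text is already decoded to a Python string and that the target is XML 1.0. The names `find_illegal_chars` and `strip_illegal_chars` are hypothetical, and mis-encoded em-dashes would need separate handling.

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_ILLEGAL = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_chars(text):
    """Discovery step: list (offset, code point) for each illegal character."""
    return [(m.start(), hex(ord(m.group()))) for m in XML_ILLEGAL.finditer(text)]


def strip_illegal_chars(text, replacement=""):
    """Fix step: remove or replace characters an XML 1.0 parser would reject."""
    return XML_ILLEGAL.sub(replacement, text)
```

Running the discovery step first and keeping its report matches the “after discovery, remove/replace” order on the slide, and leaves a record of what was changed in each file.
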
Editor's notes

  1. Skip over this fast, say Lance will say much more in his section.
  2. Briefly go over each of these. Contones and bitonals were created at different times - because we didn’t have the technology at the time (wholly separate workflow).
  3. Don’t forget the last one we call “hella crazy”
  4. But we never worked on this full-time, not even half-time!
  5. Photo: https://www.flickr.com/photos/kris3198/2399334942/ CC-BY-NC-ND
  6. Image: CC-BY https://flic.kr/p/dVzKGh