Introduction to preservation workflow, formats and risks
This section by Steve Hitchcock, KeepIt project

For JISC KeepIt course on Digital Preservation Tools for Repository Managers
Module 3, Primer on preservation workflow, formats and characterisation
Westminster-Kingsway College, London, 2 March 2010
Overview of session
• Some terminology

• Preservation workflow

• Repository format profiles

• Format risks: a group task

• Some thoughts about formats
Representation information:
connecting data with what we see
Open Archival Information System
   (OAIS) reference model
The 3-stage repository model
[Diagram: three stages, Get Content (Ingest), Manage Content and Serve Content. Steps shown: Ingest, Appraise & Select, Store, Index, Locate, Retrieve, Preservation (Check, Analyse, Action), Dispose]
Preservation workflow

Check
• Format identification, versioning
• File validation
• Virus check
• Bit checking and checksum calculation
Tools, e.g. DROID, JHOVE, FITS

Analyse
• Preservation planning
• Characterisation: significant properties and technical characteristics, provenance, format, risk factors
• Risk analysis
Tools: Plato (Planets), PRONOM (TNA), P2 risk registry (KeepIt), INFORM (U Illinois)

Action
• Migration
• Emulation
• Storage selection
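
The bit checking and checksum calculation step in the Check stage can be illustrated with a short sketch. This is not how any of the tools named above work internally; it only assumes that a repository records a SHA-256 digest at ingest and recomputes it later. The function names sha256_of and is_intact are hypothetical, chosen for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, recorded_digest: str) -> bool:
    """Compare a freshly computed digest with the one recorded at ingest."""
    return sha256_of(path) == recorded_digest
```

A mismatch only shows that the bits have changed since the digest was recorded; it says nothing about whether the format can still be rendered.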
Rosenthal on:
doing less for preservation
   “I believe that "it became necessary
     to change the content in order to
     preserve it" is a very bad idea; we
     should preserve what's out there
     without adding cost and losing
     information by preemptively migrating
     to a format we believe (normally
     without evidence) is less doomed.”

     Are format specifications important for
     preservation? (January 4, 2009)
     http://blog.dshr.org/2009/01/are-format-specifications-important-for.html
Rosenthal on:
aggressive vs relaxed preservation
“In the long run, all digital formats
become obsolete. Broadly, reactions to this
dismal prospect have taken two forms:
- The aggressive form has been to do as
much work as possible as soon as possible
- The relaxed form has been to postpone
doing anything until it is absolutely
essential.”
Format Obsolescence: the Prostate Cancer of
Preservation (May 7, 2007)
http://blog.dshr.org/2007/05/format-obsolescence-prostate-cancer-of.html
Rosenthal on:
the Prostate Cancer of Preservation
 “format obsolescence is the prostate cancer of digital
 preservation. It is a serious and ultimately fatal problem. But it is
 highly likely that something else will kill you first

 “A risk-based approach would surely prefer the "relaxed"
 approach, minimizing up-front and storage costs, and thereby
 freeing up resources to preserve more, and higher-risk content.

 “The best example of a "relaxed" ingest pipeline is the Internet
 Archive, which has so far ingested over 85 billion web pages with
 minimal human intervention.”
 Format Obsolescence: the Prostate Cancer of Preservation (May 7, 2007)
 http://blog.dshr.org/2007/05/format-obsolescence-prostate-cancer-of.html
Repository format profile:
       an example




      Originally from Registry of Open Access Repositories (ROAR)
ROAR format profiles today




This profile for Australian Research Online repository
                         To access a format profile:
                         Find chosen repository in ROAR, open [Record Details]
                         Format profiles not available for all repositories in ROAR
                         ROAR disclaimer: Full-text formats is based on automatic
                         file-format identification and is prone to errors
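
A format profile is essentially a count of files per identified format. The sketch below is not ROAR's method or DROID's signature matching; it assumes a small, hand-picked table of leading "magic" bytes and a local directory of deposited files. The names SIGNATURES, identify and format_profile are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Leading "magic" bytes for a few common formats (a tiny, illustrative subset).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),
    (b"MM\x00*", "TIFF"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify(data: bytes) -> str:
    """Return a format name based on the first bytes of a file."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def format_profile(root: Path) -> Counter:
    """Count the files under root by identified format."""
    profile = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            with path.open("rb") as handle:
                profile[identify(handle.read(16))] += 1
    return profile
```

The sketch also shows why automatic identification is prone to errors: Word .docx and ODF files are ZIP packages, so a signature-only check reports them as a ZIP container rather than as documents.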
Accepted repository formats:
      a recent survey
What file formats do you accept? Do you convert any to a
different format?
• All accept any format.
• Two convert everything to PDF, but store the source files in
the background for preservation reasons.
• Four mention specifically converting Word to PDF: one
seeks permission from the author to do this, and uploads as
Word if permission is not granted.
• One mentions converting ZIP files to PDF.
Sue Ashby
University of Portsmouth Library
                              Summary of responses to IR questionnaire
                                 JISC-REPOSITORIES, 18 February 2010
Format risks
1000 Ubiquity: degree of adoption of the format
1001 Support: number of tools available which can access the format
1002 Disclosure: extent to which the format documentation is publicly
disclosed
1003 Document Quality: completeness of the available documentation
1004 Stability: speed and backwards-compatibility of version change
1005 Ease of Identification: ease with which the format can be identified
1006 Ease of validation: ease with which the format can be validated
1007 Lossiness: does the format use lossy compression
1008 Intellectual Property Rights: whether or not the format is
encumbered by IPR
1009 Complexity: degree of content or behavioural complexity supported

              From PRONOM documentation (The National Archives), July 2008
A group task on format risks
1. Choose two formats to compare (e.g. Word vs
   PDF, Word vs ODF, PDF vs XML, TIFF vs JPEG)
2. By working through the (surviving) list of format
   risks select a winner (or a draw) between your
   chosen formats for each risk category (1 point for
   win)
3. Total the scores to find an overall winning format
4. Suggest one reason why the winning format using
   this method may not be the one you would
   choose for your repository
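
The scoring in steps 2 and 3 can be sketched in a few lines. The category names come from the PRONOM list of format risks; the functions tally and overall_winner are hypothetical, and the judgements in the usage note are invented placeholders, not an assessment of any real format.

```python
RISK_CATEGORIES = [
    "Ubiquity", "Support", "Disclosure", "Document Quality", "Stability",
    "Ease of identification", "Ease of validation", "Lossiness",
    "Intellectual property rights", "Complexity",
]


def tally(judgements: dict, format_a: str, format_b: str) -> dict:
    """Give 1 point per category to the named winner; a draw scores nothing."""
    scores = {format_a: 0, format_b: 0}
    for category in RISK_CATEGORIES:
        winner = judgements.get(category, "draw")
        if winner in scores:
            scores[winner] += 1
    return scores


def overall_winner(scores: dict) -> str:
    """Return the format with the higher total, or 'draw' if they tie."""
    (name_a, points_a), (name_b, points_b) = scores.items()
    if points_a == points_b:
        return "draw"
    return name_a if points_a > points_b else name_b
```

For example, tally({"Ubiquity": "Format A", "Lossiness": "Format B", "Support": "Format A"}, "Format A", "Format B") gives Format A two points and Format B one. Every category carries equal weight in this method, which is one reason the winner may not be the format a repository would actually choose (step 4).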
Some thoughts about formats
Free vs open source vs open standard:

• MS Office – XML – open standard
• Open Office – free – XML – open standard
• PDF: page representation
• XML: generic Web format, computational
Rosenthal on:
why we can relax about preservation
      “Historically, the open source community
         has developed rendering software for
         almost all proprietary formats that achieve
         wide use

         “Even the formats which pose the
         greatest problems for preservation, those
         protected by DRM technology, typically
         have open source renderers”
         Format Obsolescence: Scenarios (April 29, 2007)
         http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html
Work with, not against, your
     authors and contributors
• “Preservation begins with the author”
• U. Rochester (USA) has written its own repository software IR+ to
give its authors a Web-based authoring workspace
• But which applications are widely used and popular among your
authors? Digital content authoring tools are typically chosen on
the basis of purpose, utility and familiarity (what is provided or
supported by Information Systems?). They are rarely chosen for
format or preservation.
• Authors will craft their output in the chosen application, but will
often throw away that craft if asked to convert to another format
• One approach that builds on popular formats is ICE: Integrated
Content Environment, which converts formats from popular
content authoring tools
An image format comparison:
        TIFF vs JPEG 2000?
Studies and user reports claim JPEG 2000 to be – or at least will
become – the next archiving format for digital images
The format offers new possibilities, such as streaming, and reduces
storage consumption through lossless and lossy compression.
Another often claimed advantage of JPEG 2000 is that the master
image can possibly serve as the access copy as well, and thus
replace derived compressed, low resolution access copies.
Preservation Planning at the Bavarian State Library Using a Collection of Digitized
16th Century Printings, D-Lib Magazine, Vol. 15, No. 11/12, Nov/Dec 2009
http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
TIFF vs JPEG 2000?
   Who’s for JPEG? The major players line up
1. The National Library of the Netherlands evaluated JPEG 2000
   against uncompressed TIFF (currently used) for storage
   capacity, image quality, long-term sustainability, functionality.
   JPEG 2000 is recommended as future archive format.
2. The British Library recently moved forward to migrate their
   80-terabyte newspaper collection from TIFF to JPEG 2000
3. The Wellcome Library announced they will use JPEG 2000 for
   their upcoming digitization projects
Preservation Planning at the Bavarian State Library Using a Collection of Digitized
16th Century Printings, D-Lib Magazine, Vol. 15, No. 11/12, Nov/Dec 2009
http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
TIFF vs JPEG 2000?
            What does Plato say?
“At this point in time not migrating the TIFF v6
images is the best alternative.

“However, in one year we'll look at this plan again
to see if there are more tools available and whether
or not the ones we considered in this year's
evaluation have been improved.”
Preservation Planning at the Bavarian State Library Using a
Collection of Digitized 16th Century Printings, D-Lib
Magazine, Vol. 15, No. 11/12, Nov/Dec 2009
http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
Further reading on formats and risk
   • Malcolm Todd, File Formats for Preservation, DPC Technology Watch Report Series, Report 09-02, 2 December 2009
   http://www.dpconline.org/newsroom/file-formats-for-preservation-technology-watch-report.html
   • Judith Rog and Caroline van Wijk, Evaluating File Formats for Long-term Preservation, Koninklijke Bibliotheek, February 2008
   http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf
   • See also Preserv project bibliography for many more papers on file formats
   http://preserv.eprints.org/Preserv-bibliography.html

KeepIt Course 3: preservation workflow

  • 1. Introduction to preservation workflow, formats and risks This section by Steve Hitchcock, KeepItproject For JISC KeepItcourse on Digital Preservation Tools for Repository Managers Module 3, Primer on preservation workflow, formats and characterisation Westminster-Kingsway College, London, 2 March 2010
  • 2. Overview of session • Some terminology • Preservation workflow • Repository format profiles • Format risks: a group task • Some thoughts about formats
  • 4. Open Archival Information System (OAIS) reference model
  • 5. The 3-stage repository model Get Content Manage Content Serve Content (Ingest) Appraise & Select Store Retrieve Index Locate Ingest Preservation - Check Preservation - Analyse Preservation - Action Dispose
  • 6. Preservation workflow Check Analyse Action •Format Preservation planning •Migration identification, version Characterisation: • Emulation ing Significant properties and • Storage selection • File validation technical • Virus check characteristics, provenance, for • Bit checking and mat, risk factors checksum calculation Risk analysis Tools Tools e.g. DROID Plato (Planets) JHOVE PRONOM (TNA) FITS P2 risk registry (KeepIt) INFORM (U Illinois)
  • 7. Rosenthal on: doing less for preservation “I believe that "it became necessary to change the content in order to preserve it" is a very bad idea; we should preserve what's out there without adding cost and losing information by preemptively migrating to a format we believe (normally without evidence) is less doomed.” Are format specifications important for preservation? (January 4, 2009) http://blog.dshr.org/2009/01/are-format- specifications-important-for.html
  • 8. Rosenthal on: aggressive vs relaxed preservation “In the long run, all digital formats become obsolete. Broadly, reactions to this dismal prospect have taken two forms: - The aggressive form has been to do as much work as possible as soon as possible - The relaxed form has been to postpone doing anything until it is absolutely essential” Format Obsolescence: the Prostate Cancer of Preservation (May 7, 2007) http://blog.dshr.org/2007/05/format-obsolescence-prostate-cancer-of.html
  • 9. Rosenthal on: the Prostate Cancer of Preservation “format obsolescence is the prostate cancer of digital preservation. It is a serious and ultimately fatal problem. But it is highly likely that something else will kill you first.” “A risk-based approach would surely prefer the "relaxed" approach, minimizing up-front and storage costs, and thereby freeing up resources to preserve more, and higher-risk content.” “The best example of a "relaxed" ingest pipeline is the Internet Archive, which has so far ingested over 85 billion web pages with minimal human intervention.” Format Obsolescence: the Prostate Cancer of Preservation (May 7, 2007) http://blog.dshr.org/2007/05/format-obsolescence-prostate-cancer-of.html
  • 10. Repository format profile: an example Originally from Registry of Open Access Repositories (ROAR)
  • 11. ROAR format profiles today. This profile is for the Australian Research Online repository. To access a format profile: find the chosen repository in ROAR, open [Record Details]. Format profiles are not available for all repositories in ROAR. ROAR disclaimer: Full-text formats is based on automatic file-format identification and is prone to errors
  • 12. Accepted repository formats: a recent survey. What file formats do you accept? Do you convert any to a different format? • All accept any format. • Two convert everything to PDF, but store the source files in the background for preservation reasons. • Four mention specifically converting Word to PDF: one seeks permission from the author to do this, and uploads as Word if permission is not granted. • One mentions converting ZIP files to PDF. Sue Ashby, University of Portsmouth Library, Summary of responses to IR questionnaire, JISC-REPOSITORIES, 18 February 2010
  • 13. Format risks 1000 Ubiquity: degree of adoption of the format 1001 Support: number of tools available which can access the format 1002 Disclosure: extent to which the format documentation is publicly disclosed 1003 Document Quality: completeness of the available documentation 1004 Stability: speed and backwards-compatibility of version change 1005 Ease of Identification: ease with which the format can be identified 1006 Ease of validation: ease with which the format can be validated 1007 Lossiness: does the format use lossy compression 1008 Intellectual Property Rights: whether or not the format is encumbered by IPR 1009 Complexity: degree of content or behavioural complexity supported From PRONOM documentation (The National Archives), July 2008
  • 14. Format risks 1000 Ubiquity: degree of adoption of the format 1001 Support: number of tools available which can access the format 1002 Disclosure: extent to which the format documentation is publicly disclosed 1003 Document Quality: completeness of the available documentation 1004 Stability: speed and backwards-compatibility of version change 1005 Ease of identification: ease with which the format can be identified 1006 Ease of validation: ease with which the format can be validated 1007 Lossiness: does the format use lossy compression 1008 Intellectual property rights: whether or not the format is encumbered by IPR 1009 Complexity: degree of content or behavioural complexity supported From PRONOM documentation (The National Archives), July 2008
  • 15. A group task on format risks 1. Choose two formats to compare (e.g. Word vs PDF, Word vs ODF, PDF vs XML, TIFF vs JPEG) 2. By working through the (surviving) list of format risks select a winner (or a draw) between your chosen formats for each risk category (1 point for win) 3. Total the scores to find an overall winning format 4. Suggest one reason why the winning format using this method may not be the one you would choose for your repository
  • 16. Some thoughts about formats. Free vs open source vs open standard: • MS Office: XML, open standard • Open Office: free, XML, open standard • PDF: page representation • XML: generic Web format, computational
  • 17. Rosenthal on: why we can relax about preservation “Historically, the open source community has developed rendering software for almost all proprietary formats that achieve wide use” “Even the formats which pose the greatest problems for preservation, those protected by DRM technology, typically have open source renderers” Format Obsolescence: Scenarios (April 29, 2007) http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html
  • 18. Work with, not against, your authors and contributors • “Preservation begins with the author” • U. Rochester (USA) has written its own repository software IR+ to give its authors a Web-based authoring workspace • But which applications are widely used and popular among your authors? Digital content authoring tools are typically chosen on the basis of purpose, utility, familiarity (what is provided, supported by Information Systems?) Rarely are they chosen for format or preservation. • Authors will craft their output in the chosen application, but will often throw away that craft if asked to convert to another format • One approach that builds on popular formats is ICE: Integrated Content Environment, which converts formats from popular content authoring tools
  • 19. An image format comparison: TIFF vs JPEG 2000? Studies and user reports claim JPEG 2000 to be – or at least will become – the next archiving format for digital images. The format offers new possibilities, such as streaming, and reduces storage consumption through lossless and lossy compression. Another often claimed advantage of JPEG 2000 is that the master image can possibly serve as the access copy as well, and thus replace derived compressed, low resolution access copies. Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings, D-Lib Magazine, Vol. 15 No. 11/12, Nov/Dec 2009 http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
  • 20. TIFF vs JPEG 2000? Who’s for JPEG? The major players line up 1. The National Library of the Netherlands evaluated JPEG 2000 against uncompressed TIFF (currently used) for storage capacity, image quality, long-term sustainability, functionality. JPEG 2000 is recommended as future archive format. 2. The British Library recently moved forward to migrate their 80-terabyte newspaper collection from TIFF to JPEG 2000. 3. The Wellcome Library announced they will use JPEG 2000 for their upcoming digitization projects. Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings, D-Lib Magazine, Vol. 15 No. 11/12, Nov/Dec 2009 http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
  • 21. TIFF vs JPEG 2000? What does Plato say? “At this point in time not migrating the TIFF v6 images is the best alternative.” “However, in one year we'll look at this plan again to see if there are more tools available and whether or not the ones we considered in this year's evaluation have been improved.” Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings, D-Lib Magazine, Vol. 15 No. 11/12, Nov/Dec 2009 http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
  • 22. Further reading on formats and risk • Malcolm Todd, File Formats for Preservation, DPC Technology Watch Report Series, Report 09-02, 2 December 2009 http://www.dpconline.org/newsroom/file-formats-for-preservation-technology-watch-report.html • Judith Rog and Caroline van Wijk, Evaluating File Formats for Long-term Preservation, Koninklijke Bibliotheek, February 2008 http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf • See also Preserv project bibliography for many more papers on file formats http://preserv.eprints.org/Preserv-bibliography.html
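
The scoring method in the group task on slide 15 (one point to the winner of each risk category, then total the points) could be sketched roughly as below. This is only an illustration: the function names `score_formats` and `overall_winner` are made up for this sketch, and the verdicts for "Format A" and "Format B" are placeholders, not real assessments of any format. The category names follow the PRONOM list on slides 13 and 14.

```python
# Hypothetical sketch of the slide 15 scoring method. The verdicts used
# below are placeholders for illustration, not real format assessments.

# Risk categories as listed on slides 13 and 14 (PRONOM, July 2008).
RISKS = [
    "Ubiquity", "Support", "Disclosure", "Document quality", "Stability",
    "Ease of identification", "Ease of validation", "Lossiness",
    "Intellectual property rights", "Complexity",
]


def score_formats(format_a, format_b, verdicts):
    """Give 1 point per risk category to the winner; a draw (None) scores nothing."""
    scores = {format_a: 0, format_b: 0}
    for risk, winner in verdicts.items():
        if risk not in RISKS:
            raise ValueError(f"unknown risk category: {risk}")
        if winner is None:  # draw in this category
            continue
        if winner not in scores:
            raise ValueError(f"{winner} is not one of the compared formats")
        scores[winner] += 1
    return scores


def overall_winner(scores):
    """Return the format with the higher total, or None if the totals are equal."""
    (name_a, total_a), (name_b, total_b) = scores.items()
    if total_a == total_b:
        return None
    return name_a if total_a > total_b else name_b


# Placeholder verdicts: only the categories the group chose to score.
example_verdicts = {
    "Ubiquity": "Format A",
    "Support": "Format A",
    "Disclosure": "Format B",
    "Stability": None,  # judged a draw
}

if __name__ == "__main__":
    totals = score_formats("Format A", "Format B", example_verdicts)
    print(totals, "->", overall_winner(totals))
```

Because every category carries the same weight, a format can win on points while losing the one category that matters most to a given repository, which is one way to approach step 4 of the task.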