This presentation introduces preservation workflow, a process to manage the risk associated with file formats of different digital objects. It was given as part of module 3 of a 5-module course on digital preservation tools for repository managers, presented by the JISC KeepIt project. For more on this and other presentations in this course look for the tag 'KeepIt course' in the project blog http://blogs.ecs.soton.ac.uk/keepit/
1. Introduction to preservation workflow, formats and risks
This section by Steve Hitchcock, KeepIt project
For JISC KeepIt course on Digital Preservation Tools for Repository Managers
Module 3, Primer on preservation workflow, formats and characterisation
Westminster-Kingsway College, London, 2 March 2010
2. Overview of session
• Some terminology
• Preservation workflow
• Repository format profiles
• Format risks: a group task
• Some thoughts about formats
5. The 3-stage repository model
[Diagram: the 3-stage repository model]
Stages: Get Content (Ingest), Manage Content, Serve Content
Functions shown: Appraise & Select, Ingest, Store, Index, Preservation - Check, Preservation - Analyse, Preservation - Action, Dispose, Locate, Retrieve
6. Preservation workflow
Check
• Format identification, versioning
• File validation
• Virus check
• Bit checking and checksum calculation
Tools: e.g. DROID, JHOVE, FITS
Analyse
• Preservation planning
• Characterisation: significant properties and technical characteristics, provenance, format, risk factors
• Risk analysis
Tools: Plato (Planets), PRONOM (TNA), P2 risk registry (KeepIt), INFORM (U Illinois)
Action
• Migration
• Emulation
• Storage selection
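To make the Check stage concrete, here is a minimal Python sketch of two of its steps: format identification and checksum calculation. It is an illustration only, not how any of the tools named above are implemented. The short signature table and the function names (identify_format, checksum, check_object) are assumptions made for this example; tools such as DROID draw on the far larger PRONOM signature registry.

```python
import hashlib
from typing import Optional

# Illustrative byte signatures only (an assumption for this sketch);
# a real identification tool uses a far larger signature registry.
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"PK\x03\x04", "ZIP container (also used by ODF and OOXML)"),
]


def identify_format(data: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for signature, name in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return name
    return "unknown"


def checksum(data: bytes) -> str:
    """Calculate a SHA-256 checksum as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def check_object(data: bytes, recorded_checksum: Optional[str] = None) -> dict:
    """Run the two checks and compare against a stored checksum, if any."""
    digest = checksum(data)
    return {
        "format": identify_format(data),
        "sha256": digest,
        # With no recorded checksum there is nothing to compare against yet.
        "fixity_ok": recorded_checksum is None or digest == recorded_checksum,
    }
```

On first ingest the checksum would be recorded; on later checks the stored value is passed back in as recorded_checksum, so that any change to the bits shows up as fixity_ok being False.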
7. Rosenthal on:
doing less for preservation
“I believe that "it became necessary
to change the content in order to
preserve it" is a very bad idea; we
should preserve what's out there
without adding cost and losing
information by preemptively migrating
to a format we believe (normally
without evidence) is less doomed.”
Are format specifications important for
preservation? (January 4, 2009)
http://blog.dshr.org/2009/01/are-format-specifications-important-for.html
8. Rosenthal on:
aggressive vs relaxed
preservation
“In the long run, all digital formats
become obsolete. Broadly, reactions to this
dismal prospect have taken two forms:
- The aggressive form has been to do as
much work as possible as soon as possible
- The relaxed form has been to postpone
doing anything until it is absolutely
essential.”
Format Obsolescence: the Prostate Cancer of
Preservation (May 7, 2007)
http://blog.dshr.org/2007/05/format-obsolescence-prostate-cancer-of.html
9. Rosenthal on:
the Prostate Cancer of Preservation
“format obsolescence is the prostate cancer of digital
preservation. It is a serious and ultimately fatal problem. But it is
highly likely that something else will kill you first.”
“A risk-based approach would surely prefer the "relaxed"
approach, minimizing up-front and storage costs, and thereby
freeing up resources to preserve more, and higher-risk content.”
“The best example of a "relaxed" ingest pipeline is the Internet
Archive, which has so far ingested over 85 billion web pages with
minimal human intervention.”
Format Obsolescence: the Prostate Cancer of Preservation (May 7, 2007)
http://blog.dshr.org/2007/05/format-obsolescence-prostate-cancer-of.html
11. ROAR format profiles today
This profile for Australian Research Online repository
To access a format profile:
Find chosen repository in ROAR, open [Record Details]
Format profiles not available for all repositories in ROAR
ROAR disclaimer: Full-text formats is based on automatic
file-format identification and is prone to errors
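A format profile is, in essence, a count of the files in a repository grouped by format. The Python sketch below shows the idea under a strong simplifying assumption: it groups by file extension, which is not how ROAR produces its profiles (the disclaimer above refers to automatic file-format identification). The function name format_profile is hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def format_profile(filenames):
    """Count files per extension, most common first.

    Extension-based grouping is an assumption made for this sketch;
    it is cruder than signature-based format identification.
    """
    counts = Counter()
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        counts[suffix or "(no extension)"] += 1
    return counts.most_common()
```

Like the automatic identification behind the ROAR profiles, a count of this kind is prone to errors: a file with a misleading or missing extension will land in the wrong group.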
12. Accepted repository formats:
a recent survey
What file formats do you accept? Do you convert any to a
different format?
• All accept any format.
• Two convert everything to PDF, but store the source files in
the background for preservation reasons.
• Four mention specifically converting Word to PDF: one
seeks permission from the author to do this, and uploads as
Word if permission is not granted.
• One mentions converting ZIP files to PDF.
Sue Ashby
University of Portsmouth Library
Summary of responses to IR questionnaire
JISC-REPOSITORIES, 18 February 2010
13. Format risks
1000 Ubiquity: degree of adoption of the format
1001 Support: number of tools available which can access the format
1002 Disclosure: extent to which the format documentation is publicly
disclosed
1003 Document Quality: completeness of the available documentation
1004 Stability: speed and backwards-compatibility of version change
1005 Ease of Identification: ease with which the format can be identified
1006 Ease of validation: ease with which the format can be validated
1007 Lossiness: does the format use lossy compression
1008 Intellectual Property Rights: whether or not the format is
encumbered by IPR
1009 Complexity: degree of content or behavioural complexity supported
From PRONOM documentation (The National Archives), July 2008
14. Format risks
1000 Ubiquity: degree of adoption of the format
1001 Support: number of tools available which can access the format
1002 Disclosure: extent to which the format documentation is publicly
disclosed
1003 Document Quality: completeness of the available documentation
1004 Stability: speed and backwards-compatibility of version change
1005 Ease of identification: ease with which the format can be identified
1006 Ease of validation: ease with which the format can be validated
1007 Lossiness: does the format use lossy compression
1008 Intellectual property rights: whether or not the format is
encumbered by IPR
1009 Complexity: degree of content or behavioural complexity supported
From PRONOM documentation (The National Archives), July 2008
15. A group task on format risks
1. Choose two formats to compare (e.g. Word vs
PDF, Word vs ODF, PDF vs XML, TIFF vs JPEG)
2. Working through the (surviving) list of format
risks, select a winner (or a draw) between your
chosen formats for each risk category (1 point for
a win)
3. Total the scores to find an overall winning format
4. Suggest one reason why the winning format using
this method may not be the one you would
choose for your repository
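The scoring in steps 2 and 3 can be sketched in a few lines of Python. This is one possible reading of the task, not a prescribed method: the function name score_formats is hypothetical, and the verdicts in the example are arbitrary placeholders for "Format A" and "Format B", not assessments of any real format.

```python
# Risk categories as named in the PRONOM list on the preceding slides.
RISK_CATEGORIES = {
    "Ubiquity", "Support", "Disclosure", "Document quality", "Stability",
    "Ease of identification", "Ease of validation", "Lossiness",
    "Intellectual property rights", "Complexity",
}


def score_formats(format_a, format_b, verdicts):
    """Total one point per category win; a 'draw' scores nothing.

    verdicts maps a risk category to the winning format name or 'draw'.
    Only the categories the group chose to keep need to be present.
    """
    scores = {format_a: 0, format_b: 0}
    for category, winner in verdicts.items():
        if category not in RISK_CATEGORIES:
            raise ValueError(f"unknown risk category: {category}")
        if winner == "draw":
            continue
        if winner not in scores:
            raise ValueError(f"unknown format: {winner}")
        scores[winner] += 1
    if scores[format_a] == scores[format_b]:
        overall = "draw"
    else:
        overall = max(scores, key=scores.get)
    return scores, overall


# Placeholder verdicts for illustration only.
example_verdicts = {
    "Ubiquity": "Format A",
    "Disclosure": "Format B",
    "Stability": "draw",
    "Lossiness": "Format A",
}
example_scores, example_winner = score_formats(
    "Format A", "Format B", example_verdicts
)
print(example_scores, example_winner)
```

Every category carries equal weight here, which is one reason (step 4) why the winner under this method may not be the format you would choose for your repository.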
16. Some thoughts about formats
Free vs open source vs open standard:
• MS Office – XML – open standard
• Open Office – free – XML – open standard
• PDF: page representation
• XML: generic Web format, computational
17. Rosenthal on:
why we can relax about preservation
“Historically, the open source community
has developed rendering software for
almost all proprietary formats that achieve
wide use.”
“Even the formats which pose the
greatest problems for preservation, those
protected by DRM technology, typically
have open source renderers”
Format Obsolescence: Scenarios (April 29, 2007)
http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html
18. Work with, not against, your
authors and contributors
• “Preservation begins with the author”
• U. Rochester (USA) has written its own repository software, IR+, to
give its authors a Web-based authoring workspace
• But which applications are widely used and popular among your
authors? Digital content authoring tools are typically chosen on
the basis of purpose, utility and familiarity (what is
provided and supported by Information Systems?). They are rarely
chosen for format or preservation.
• Authors will craft their output in the chosen application, but will
often throw away that craft if asked to convert to another format
• One approach that builds on popular formats is ICE: Integrated
Content Environment, which converts formats from popular
content authoring tools
19. An image format comparison:
TIFF vs JPEG 2000?
Studies and user reports claim JPEG 2000 to be – or at least will
become – the next archiving format for digital images
The format offers new possibilities, such as streaming, and reduces
storage consumption through lossless and lossy compression.
Another often claimed advantage of JPEG 2000 is that the master
image can possibly serve as the access copy as well, and thus
replace derived compressed, low resolution access copies.
Preservation Planning at the Bavarian State Library Using a Collection of Digitized
16th Century Printings, D-Lib Magazine, Vol. 15, No. 11/12, Nov/Dec 2009
http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
20. TIFF vs JPEG 2000?
Who’s for JPEG? The major players line up
1. The National Library of the Netherlands evaluated JPEG 2000
against uncompressed TIFF (currently used) for storage
capacity, image quality, long-term sustainability, functionality.
JPEG 2000 is recommended as future archive format.
2. The British Library recently moved forward to migrate their
80-terabyte newspaper collection from TIFF to JPEG 2000
3. The Wellcome Library announced they will use JPEG 2000 for
their upcoming digitization projects
Preservation Planning at the Bavarian State Library Using a Collection of Digitized
16th Century Printings, D-Lib Magazine, Vol. 15, No. 11/12, Nov/Dec 2009
http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
21. TIFF vs JPEG 2000?
What does Plato say?
“At this point in time not migrating the TIFF v6
images is the best alternative.
“However, in one year we'll look at this plan again
to see if there are more tools available and whether
or not the ones we considered in this year's
evaluation have been improved.”
Preservation Planning at the Bavarian State Library Using a
Collection of Digitized 16th Century Printings, D-Lib
Magazine, Vol. 15, No. 11/12, Nov/Dec 2009
http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
22. Further reading on formats and risk
• Malcolm Todd, File Formats for Preservation, DPC
Technology Watch Report Series, Report 09-02, 2
December 2009
http://www.dpconline.org/newsroom/file-formats-for-preservation-technology-watch-report.html
• Judith Rog and Caroline van Wijk, Evaluating File Formats
for Long-term Preservation, Koninklijke Bibliotheek, February 2008
http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf
• See also Preserv project bibliography for many more
papers on file formats
http://preserv.eprints.org/Preserv-bibliography.html