Preservation Planning: Choosing a suitable digital preservation strategy
1. Preservation Planning:
Choosing a suitable preservation
approach
Long-term Archiving Perspectives of
European Union Publications meeting
Office for Official Publications of the European Communities
Luxembourg, November 10-11, 2011
Gareth Knight
Centre for e-Research
2. Preservation Objectives
Authentic - it is what it purports to be
Understandability – what does this information mean?
Content
preservation
Bitstream
preservation
Priscilla Caplan's revised Preservation Pyramid
3. Identity
• The exact sameness of things.
• Leibniz's law indicates that 2 items that share
common attributes are not only similar, but are the
same thing
• Can two things be the same? “ultimately nothing is
the same as something else” (Paskin, 2003)
A painting of Leibniz
Questions:
• Both images are a pictorial representation of Leibniz
• Image A is constructed using paint on a canvas
• Image B is constructed as 0s and 1s
• Do they share the same identity?
• Is it necessary for all object attributes to be the same, or is
it acceptable to have some degree of granularity?
• How much is identity based upon ability to measure
attributes?
Scanned copy of painting
5. Is integrity maintained = Yes/No
• Linked to notions of consistency, wholeness and truth
• There has not been deliberate or accidental damage/change
that has caused meaning to be altered or lost, in part or
entirety.
• Checksum algorithm applied to a file generates a distinct
(possibly unique) alphanumeric value
• Commonly used to check for accidental/deliberate data
change/corruption
• Generate checksum on October 1st
• Generate checksum on October 14th & compare to Oct 1st value –
are they the same? YES/NO
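The Yes/No check above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function names `file_checksum` and `integrity_maintained` are hypothetical, and SHA-256 is an assumed choice of algorithm (the slide does not name one).

```python
import hashlib

def file_checksum(path, algorithm="sha256", block_size=65536):
    """Return the hex digest of the file at `path`, read block by block."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def integrity_maintained(path, recorded_checksum, algorithm="sha256"):
    """Yes/No answer: does the file still match the checksum recorded earlier?"""
    return file_checksum(path, algorithm) == recorded_checksum
```

In use, the value returned by `file_checksum` on October 1st would be stored, then passed as `recorded_checksum` on October 14th. A single changed byte gives the answer No, with no indication of how much of the file is affected.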
6. Is integrity maintained = 0-100%
If one chunk became corrupted, the hashes for other chunks,
which hadn't changed, could be used to prove its integrity.
Piecewise hashing:
• Divides an input file into sections and checksums each chunk
separately.
• Intended to measure integrity of disk images (dcfldd).
• However, an insertion or deletion changes all subsequent hashes
• Rolling hash:
Looks at each point of file in semi-random order
Depends only on last few bytes
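A minimal sketch of piecewise hashing, assuming fixed-size chunks and SHA-256. The helper names and the tiny 4-byte chunk size are illustrative only and are not taken from dcfldd. It shows how a percentage answer can be derived, and why an inserted byte defeats the approach.

```python
import hashlib

def piecewise_hashes(data, chunk_size=4):
    """Split `data` into fixed-size chunks and hash each chunk separately."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

def percent_intact(original_hashes, current_hashes):
    """Percentage of original chunks whose hash is unchanged at the same position."""
    if not original_hashes:
        return 100.0
    matches = sum(
        1 for old, new in zip(original_hashes, current_hashes) if old == new
    )
    return 100.0 * matches / len(original_hashes)

original = b"AAAABBBBCCCCDDDD"
corrupted = b"AAAABBBBCxCCDDDD"   # one byte overwritten in the third chunk
shifted = b"xAAAABBBBCCCCDDDD"    # one byte inserted at the start

print(percent_intact(piecewise_hashes(original), piecewise_hashes(corrupted)))  # 75.0
print(percent_intact(piecewise_hashes(original), piecewise_hashes(shifted)))    # 0.0
```

The overwritten byte only invalidates its own chunk, whereas the inserted byte shifts every chunk boundary, so no hash matches.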
7. Example of Piecewise hashing (1)
19e33h213a7865b2b664348b
ea3fe191227a4eg933bc41ge
2d839db2996b412e84h77a33
872e73ab867c883e7391ae65
8. Example of Piecewise hashing (2)
19e33h213a7865b2b664348b
SAME!
ea3fe191227a4eg933bc41ge
SAME!
a73921e173c94e8232fa91bb
DIFFERENT TEXT
7894af8211c12bb123ah9912
INCOMPLETE
10. Data Interpretation in practice
OAIS Reference Model
NAA Performance Model
[Figure: data + computer + OS + application = information content]
11. Information Object
Information Properties
Some definitions:
• Information Property/Description (IP):
• A description of part of the information
content (OAIS RM v2, 2009)
• Property:
• An abstract attribute, trait or peculiarity
suitable for describing preservation
objects, actions or environments
(Dappert, 2009)
Observations:
• No interpretation of significance –
merely exists
• May be held in different locations and
different levels of detail
12. Information Property categories (1)
Rothenberg & Bikson (1999) identify five types of
Information Property:
• Content: the author’s intellectual work, e.g. text, still image,
audio waveform, etc.
• Context: Information that affects the content’s intended
meaning and establishes its provenance
• Appearance: Information that contributes to the recreation of
the performance, e.g. font type/colour/size, bit depth
• Structure: Relationship between 2+ types of content, e.g.
e-mail attachments, internal hyperlinks
• Behaviour: information that establishes how content interacts
with the user, or other objects or components, e.g. hyperlink
handling
http://www.panix.com/~jeffr/Prof/digilong.html
13. Context
[Figure: annotated example with labels: Content (Image & Text), link, Content and Context?, Structure, Appearance, Behaviour]
14. Information Property categories (2)
PLANETS Digital Object Properties WP uses a different
classification, based upon how a property can be identified:
• Extractable properties:
• Properties that can be extracted from the object or calculated
on the fly, e.g. file size, image dimensions, MD
• Observational properties:
• Can only be determined by human observation, e.g.
licence restriction(?)
• Performance Properties:
• Properties that emerge through combination of HW,
SW & Data Object
Source: PLANETS Digital Object Properties WG
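Extractable properties are the ones a script can gather without human judgement. The sketch below uses only the Python standard library; the function name and the dictionary keys are hypothetical, and the MIME type is a guess from the file extension rather than true format identification.

```python
import hashlib
import mimetypes
import os

def extract_properties(path):
    """Collect a few properties that can be calculated directly from a file."""
    with open(path, "rb") as handle:
        payload = handle.read()
    guessed_type, _encoding = mimetypes.guess_type(path)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": len(payload),
        "guessed_mime_type": guessed_type,   # extension-based guess, may be None
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
```

Observational and performance properties have no equivalent function: they need a person, or a rendering environment, in the loop.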
15. [Figure labels: Performance Property, Observational Property, Extractable information]
17. PREMIS
• “things that most working repositories are
likely to need to know in order to support
digital preservation”
• Core metadata that defines “viability,
renderability, understandability,
authenticity, and identity in a preservation
context”
What metadata assists with rendering?
• Format
• Size
• Fixity
• Creating Application: Name, version, date
data was created
PREMIS DD 1.0 (May 2005)
PREMIS DD 2.0 (March 2008)
• Inhibitors: Features intended to inhibit
access, use, or migration.
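The rendering-related metadata listed on this slide could be held in a simple record, sketched below. The class and field names are hypothetical and only mirror the bullet points; they are not the semantic unit names defined in the PREMIS Data Dictionary.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CreatingApplication:
    name: str
    version: Optional[str] = None
    date_created: Optional[str] = None   # date the data was created

@dataclass
class RenderingMetadata:
    format_name: Optional[str] = None
    size_bytes: Optional[int] = None
    fixity: Optional[str] = None         # e.g. a checksum value
    creating_application: Optional[CreatingApplication] = None
    inhibitors: List[str] = field(default_factory=list)  # e.g. encryption

def missing_fields(record):
    """List the rendering-related fields that have not been recorded yet."""
    required = ["format_name", "size_bytes", "fixity", "creating_application"]
    return [name for name in required if getattr(record, name) is None]
```

A check such as `missing_fields` gives a quick indication of which objects in a collection lack the information needed to render them later.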
18. Technical Metadata for still images
http://www.flickr.com/photos/k4chii/200303113/
Standards: Z39.87, MIX
and others
Information on
•Image characteristics
•Encoding scheme
•Metadata
19. Document MD
Applicable to formats that are primarily text, allow choice of font,
support embedded multimedia & page layouts
Example elements
Page Count
Word Count
Character Count
Paragraph Count
Line count
Table Count
Graphics Count
Language
Fonts (list of each font in document)
Features (additional document features, e.g. hasTransparency,
hasOutline, hasAnnotation)
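For plain text, several of the example elements can be calculated directly. The sketch below is an assumption-laden illustration: it treats a blank line as a paragraph break and whitespace as the word separator, and the function name and keys are hypothetical. Real word-processing formats would need a format-aware extraction tool.

```python
import re

def document_counts(text):
    """Count paragraphs, lines, words and characters in a plain-text document."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "paragraph_count": len(paragraphs),
        "line_count": len(text.splitlines()),
        "word_count": len(text.split()),
        "character_count": len(text),
    }

sample = "First paragraph,\nsecond line.\n\nSecond paragraph."
print(document_counts(sample))
```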
20. Third party services: Representation
Information Registries
•Require trusted third party
services capable of identifying
formats
• PRONOM, UDFR
•Providing information on
rendering data
• OpenWith, various RI services
22. Change in process over time
[Figure: columns SOURCE, PROCESS, PERFORMANCE, shown for three environments (Intel PC, 2000; Mac laptop, 2006; x64 Ubuntu laptop, 2010), each as hardware + operating system + software application = information content]
Potential for change to ‘Performance’ over time
23. Change is a necessity… and a risk
“traditionally, preserving things meant keeping them unchanged; however
… if we hold on to digital information without modifications, accessing the
information will become increasingly more difficult, if not impossible.”
(Su-Shing Chen, 2001)
“The fundamental challenge of digital preservation is to preserve the
accessibility and authenticity of digital objects over time and domains, and
across changing technical environments” (Wilson, 2008)
26. What do we need to keep for an Information
Object to be authentic?
“Understanding, defining and assessing the individual
properties… important.. for informing decisions about which
characteristics of that object should be preserved over time,
in circumstances where it is not possible, for reasons such as
cost, practicality or technical constraints, to preserve all the
elements of that object”
(Montague et al. The Concept of Significant Properties. 2010)
“Unless such properties can be defined in a rigorous and
measurable manner, cultural memory institutions have no
objective framework for identifying, implementing, and
validating appropriate preservation strategies, nor for
asserting the continued authenticity of their digital collections”
(Dappert, 2009)
27. Acceptable Vs Unacceptable change
• Easy to identify when preservation has gone wrong, but how do you
decide when it goes right?
• Interpretation is a value judgement – often influenced by different
criteria
• Uncertainty on level that evaluation should be performed – technical
encoding, object type (e.g. still image), object sub-type (e.g. business
document, research paper)
• How do you measure attributes that are considered significant?
• Technical properties may vary between formats
• Observational properties require manual identification
28. Planning your strategy; strategising your plan
• Preservation Plan:
“defines a series of preservation actions to be taken
by a responsible institution due to an identified risk
for a given set of digital objects or records”
http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
• Preservation strategy:
indicates commitment to preservation and high-level
approach adopted – organisational mission, applied
principles (e.g. use lifecycle approach), sequence of
actions (immediate, medium term, long-term), risk
management
29. Why develop a preservation plan?
Assists decision-making process
• Evaluate different strategies
• Evaluate different tools
Determine which is the most effective approach for your needs
• Transparency of operation – enable others to view and
understand approach adopted – inspire confidence and trust
• Provide evidence of decision-making – decisions may be
questioned. How do you prove that approach taken was
appropriate for circumstances?
30. Evaluation frameworks
Various approaches may be adopted to develop preservation plan:
•Produce internal decision tree
• Fit intrinsic needs of organisation, but requires staff time to develop &
may be limiting when considering new approaches
•Perform informal “bottom-up” object analysis & develop bespoke
plan
• Fit requirements of object type, but may be time intensive to produce
& may be incompatible with broader policies
•Adopt 3rd party standardised plan (aka copy and paste)
• Adopting existing plan saves time, but may be inappropriate for
context
•Use analysis frameworks and toolkits
• Structured process by which organisation can identify objectives &
develop plan to address them
• DRAMBORA/DIRKS – analyse environment & practices, identify risks and
brainstorm methods of mitigating or avoiding them
• Data Asset Framework – identify data held, assess management practices & make
recommendations for improvement
• PLANETS Preservation Planning – define requirements, evaluate alternative
approaches, analyse and compare results, recommend preferred approach, and
develop plan
31. Preservation Planning workflow
•Developed as part of DELOS
project & adopted by PLANETS
Consortium
•Conforms to the ‘General COTS
(Commercial-Off-The-Shelf)
selection’ process (GCS)
•Abstract steps: Define criteria,
Search for products, Create
shortlist, Evaluate candidates,
Analyze data & Select product
•Uses utility analysis approach
33. Define Requirements:
Factors to consider
•Identify & analyse environment in which
decisions are made (e.g. assumptions &
constraints) to determine context:
• Organisational/dept objectives (e.g. mission
statement, mandate)
• National/local policy framework (e.g. acquisition,
legal framework)
• Codes of practice
• Financial limitations – what can you afford?
• Object types to be maintained
• Expertise & needs of key stakeholders, e.g.
Designated Community
34. Whose views do you need to take into
account?
Digital archive perspective
• General trend to simplify object to make it (speculatively) easier to
manage in future:
• Reduce cost of preservation process
• Limit risk that accessibility/preservation issues will emerge
• Increase number of preservation options available
Creator perspective
• Author intent difficult to establish
• Differs for each object – do you seek to treat each object individually
or identify broad classes?
• When do you ask them? On creation, after 5 years? May have
different views on value.
User perspective
• How do you analyse interpretation of current user community?
• How do you predict needs of future users?
35. InSPECT Requirements Analysis
Framework (2008)
• Adopted a design method used to assist engineers &
designers to create & re-design artefacts
• Based upon theory that artefact construction is a product
of designated function(s)
• Assessment upon two philosophical approaches:
1. Teleology: study of design and purpose of object – why was
it created?
2. Epistemology: Understand meaning and process by which
knowledge is acquired
• In combination, these encourage evaluation of context of
creation and information needed to communicate intrinsic
knowledge to a new audience (designated community)
36. Requirements Analysis activities
Step 1: Object Analysis
Interpret context of creation:
1. Analyse object to find out what it contains
2. Identify original audience and functions that object was created to
perform
3. Determine info. properties necessary to achieve each function
Step 2: Stakeholder Analysis
Determine future requirements of digital object
1. Identify Stakeholders that will use object
2. Determine function set they may perform when using object
3. Identify quality thresholds for each information property that must be
met to allow each function to be achieved – what is acceptable loss?
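The outcome of the two steps can be recorded as a set of functions, each with minimum quality thresholds for the information properties it depends on. The sketch below is illustrative only: the function names, property names and threshold values are hypothetical and are not taken from the InSPECT framework.

```python
def unmet_requirements(thresholds, measured):
    """Return {function: [properties that fall below the required threshold]}."""
    failures = {}
    for function, required in thresholds.items():
        failed = [
            prop for prop, minimum in required.items()
            if measured.get(prop, 0) < minimum
        ]
        if failed:
            failures[function] = failed
    return failures

# Hypothetical thresholds for a scanned image and two stakeholder functions
thresholds = {
    "read text": {"resolution_dpi": 150},
    "print facsimile": {"resolution_dpi": 300, "bit_depth": 24},
}
measured = {"resolution_dpi": 200, "bit_depth": 24}

print(unmet_requirements(thresholds, measured))  # {'print facsimile': ['resolution_dpi']}
```

Here the loss is acceptable for reading the text but not for printing a facsimile, which is the kind of judgement the quality thresholds are meant to make explicit.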
37. Define Requirements:
PLANETS Requirement Categories
• Produce list of criteria that will be used to evaluate diff. preservation
strategies in specific domain
• May take top-down (organisational) or bottom-up (object) approach
• PLANETS identifies four groups of characteristics to be evaluated:
1. Object: Attributes of information content itself, e.g. behaviour, context
2. Record: Attributes of record including context, relationships & MD -
potential overlap with Obj in some cases
3. Process: Attributes of preservation process, e.g. processing speed,
usability of tool, ability to batch process, etc.
4. Cost: Set-up of process, cost per object, H/W & S/W, personnel
• Non-prescriptive - evaluator may identify further top-level & sub-
categories or ignore existing criteria (e.g. technical characteristics for
format evaluation)
• May be expressed as spreadsheet, list, mind-map, post-it notes & other
forms
38. Record requirements as Evaluation Tree
•Set of requirements may be
expressed as mind map,
spreadsheet, or other form
•Define structure of
evaluation process, grouping
similar items together
•Assign a measurement
value to each ‘leaf’
• Objective measure: E.g.
colour depth, duration
• Subjective measure:
Acceptable variance,
39. Define Requirements:
Measure each criterion
•Assign a measurement value to each ‘leaf’
•Objective measures:
• Unambiguous, automated (possibly), E.g. seconds to process
object, colour depth, cost value
•Subjective measures:
• Acceptable, but often require manual evaluation, e.g. degree
of format support
•Type of scale
• Numeric measure (e.g. 15 bit)
• Boolean (Yes/No)
• Controlled vocab
(e.g. Yes/Acceptable/No)
• Ordinal numbers (controlled list)
• Subjective criteria (0-5)
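One way to make the scale types concrete is to attach a scale to each leaf and validate recorded measurements against it. The sketch below is a simplified assumption, not the Plato data model: the `kind` labels, leaf names and the `is_valid_measurement` helper are hypothetical.

```python
def is_valid_measurement(scale, value):
    """Check that `value` fits the scale declared for an evaluation-tree leaf."""
    kind = scale["kind"]
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "numeric":
        return is_number
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "controlled":
        return value in scale["values"]
    if kind == "subjective":
        # Subjective criteria are scored as whole numbers from 0 to 5
        return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 5
    raise ValueError(f"unknown scale kind: {kind}")

# Hypothetical leaves of an evaluation tree, one per scale type
leaves = {
    "colour depth (bits)": {"kind": "numeric"},
    "lossless conversion": {"kind": "boolean"},
    "format support": {"kind": "controlled", "values": ["No", "Acceptable", "Yes"]},
    "visual fidelity": {"kind": "subjective"},
}
```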
41. Define Alternatives
• On basis of object type and expressed
requirements, what strategies are feasible?
• Many different approaches available, e.g. TIFF
images could undergo following actions:
• Format conversion to JPG2k
• Format conversion to PNG (to save space)
• Format conversion to PDF (though would not recommend)
• Emulation/virtual machine
• Do nothing!
• For each alternative strategy, may wish to define:
• Tool to be tested (e.g. name, version, OS)
• Configuration parameters
• Function to be tested
42. Trial the preservation approaches
Develop a set of experiments to trial the
preservation approach
Define workflow
Select representative test files
Perform evaluation
Evaluate the outcome according to
your objective tree
Were there undesired/unexpected
results?
43. PLATO conversion tool/format comparison
Definition of alternative approaches to preserve GIF image (conversion to alt.
formats) and identification of tool services available to perform action
44. Compare results
Require common basis for comparing different strategies
Normalise disparate results
Each evaluation factor is measured differently (Y/N, cost, speed
of conversion)
Can make them comparable by converting them to a uniform
scale
Set Important Factors
Not all assessment criteria are equal – do you wish to prioritise
specific reqs. (e.g. scalability, cost)?
Compare outcomes & select most appropriate
preservation strategy
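The three steps can be sketched as a small utility-analysis calculation. Everything specific here is an assumption made for illustration: the 0-5 target scale, the linear transformation for numeric values, the weighted-sum aggregation, and all criterion names, weights and measurements.

```python
def normalise_boolean(value):
    """Map Yes/No to the ends of the 0-5 scale."""
    return 5.0 if value else 0.0

def normalise_controlled(value, ordered_values):
    """Map a controlled-vocabulary value to 0-5; `ordered_values` runs worst to best."""
    return 5.0 * ordered_values.index(value) / (len(ordered_values) - 1)

def normalise_numeric(value, worst, best):
    """Map a number linearly to 0-5; `worst` may exceed `best` (e.g. for cost)."""
    fraction = (value - worst) / (best - worst)
    return 5.0 * min(1.0, max(0.0, fraction))

def weighted_score(scores, weights):
    """Weighted average of normalised scores, reflecting the important factors."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

support_scale = ["No", "Acceptable", "Yes"]
weights = {"lossless": 0.5, "cost": 0.25, "support": 0.25}

# Hypothetical measurements for two alternative strategies
alternatives = {
    "alternative_a": {
        "lossless": normalise_boolean(True),
        "cost": normalise_numeric(0.5, worst=1.0, best=0.0),
        "support": normalise_controlled("Acceptable", support_scale),
    },
    "alternative_b": {
        "lossless": normalise_boolean(False),
        "cost": normalise_numeric(0.25, worst=1.0, best=0.0),
        "support": normalise_controlled("Yes", support_scale),
    },
}

results = {name: weighted_score(s, weights) for name, s in alternatives.items()}
preferred = max(results, key=results.get)
print(results, preferred)
```

Changing the weights changes the ranking, which is why the choice of important factors should be documented alongside the result.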
45. Conclusions
Preservation is an iterative process – must climb many
steps to reach the top of the pyramid
Preservation Planning enables organisation to
understand and document their requirements
Demonstrate decision making – inspires confidence &
trust
Not a ‘perform once and forget’ process – it must be repeated
46. Discussion points
• Are traditional checksum techniques acceptable
for measuring integrity, or do we need a more
granular approach?
• How should we utilise & build upon third party
services, such as RI Registries & preservation
plan tools, to achieve our preservation
objectives?
• What would a preservation plan for our scanned
images, documents, metadata look like?
47. Thank You for your attention
QUESTIONS?
Gareth Knight
gareth.knight@kcl.ac.uk