2. The Practitioners Have Spoken…
Quality Assurance (of broken or potentially broken data):
Quality assurance, Bit rot, and Integrity
Appraisal and Assessment:
Appraisal and assessment, Conformance, Unknown
characteristics, and Unknown file formats.
Identify/Locate Preservation Worthy Data
Identify Preservation Risks:
Obsolescence, preservation risk and business constraint
Long tail of many other issues:
Contextual and Data capture issues through to Embedded
objects, and broader issues around Value and cost.
Plus: Sustainable Tools
3. Appraisal and Assessment
Conformance, Unknown characteristics, and Unknown file
formats. Identify/Locate Preservation Worthy Data
Identification
Always used to ‘route’ data to software that can understand it.
Use minimum information to identify:
e.g. header only if possible. “Truncated PDF”, not
“UNKNOWN”. GIS shapefiles: .shp, .shx, but with a missing
.dbf should be reported as such.
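A minimal sketch of identification from minimum information, assuming a PDF can be recognised from its `%PDF-` header and an `%%EOF` marker near the end of the file. The function names and report strings are hypothetical and do not come from any existing tool.

```python
from pathlib import Path

# Hypothetical illustration: names and report strings are assumptions.
SHAPEFILE_PARTS = (".shp", ".shx", ".dbf")


def identify_pdf(data: bytes) -> str:
    """Identify a PDF from its header, checking only the tail for completeness."""
    if not data.startswith(b"%PDF-"):
        return "UNKNOWN"
    # A complete PDF carries an %%EOF marker close to the end of the file.
    if b"%%EOF" not in data[-1024:]:
        return "Truncated PDF"
    return "PDF"


def identify_shapefile(filenames) -> str:
    """Identify a shapefile set and name any missing component."""
    suffixes = {Path(name).suffix.lower() for name in filenames}
    if ".shp" not in suffixes:
        return "UNKNOWN"
    missing = [part for part in SHAPEFILE_PARTS if part not in suffixes]
    if missing:
        return "Shapefile (missing " + ", ".join(missing) + ")"
    return "Shapefile"
```

The point of the design is that a damaged or incomplete object still gets routed by what it most likely is, rather than falling through to a generic unknown.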
Validation
Two modes needed: “Fast fail”, “Log and continue” /Quirks
Stop baseless distinction between “Well formed” and “Valid”
Validation is irrelevant to digital preservation assessment:
e.g. Effective “PDF/A”, without the 1.4 and XMP chunk.
We’re on the wrong side of Postel’s Law.
Unknown completeness and failure to future-proof:
e.g. JHOVE tries to validate versions of PDF it cannot know.
e.g. Tools sometimes interpret/migrate data opaquely.
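The two validation modes could be sketched as a single switch on one validator. This is an illustration only: the checks, the `validate` function and the `ValidationFailure` exception are hypothetical, and the two checks shown are far from a real PDF validator.

```python
class ValidationFailure(Exception):
    """Raised in fast-fail mode as soon as one check fails."""


# Hypothetical checks, listed in the order they are applied.
CHECKS = [
    ("has PDF header", lambda data: data.startswith(b"%PDF-")),
    ("has EOF marker", lambda data: b"%%EOF" in data[-1024:]),
]


def validate(data: bytes, checks=CHECKS, fast_fail: bool = True):
    """Run each check; stop at the first failure or collect them all."""
    problems = []
    for name, check in checks:
        if check(data):
            continue
        if fast_fail:
            # "Fast fail": report the first problem and stop.
            raise ValidationFailure(name)
        # "Log and continue": record the problem and keep going.
        problems.append(name)
    return problems
```

Fast fail suits ingest gates where any problem blocks the file; log and continue suits assessment, where the full list of quirks is the useful output.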
6. Identify Preservation Risks
Obsolescence, preservation risk and business constraint
Significant Properties are irrelevant here.
It’s not really about the content, but about the context.
Dependency Analysis:
What software does this need?
Does this file use format features that are not well supported
across implementations?
What other resources are transcluded?
Fonts? c.f. OfficeDDT.
Remote embeds?
Embedded scripts that might mask dependencies?
Do some operations require a password?
e.g. JHOVE cannot spot ‘harmless’ PDF encryption.
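As a rough illustration of the dependency questions above, the sketch below scans raw PDF bytes for a few name tokens that hint at encryption, scripts, remote references and embedded files. It is a naive heuristic under stated assumptions: the marker table and `scan_dependencies` are hypothetical, a byte scan will miss anything inside compressed object streams, and a hit says nothing about whether the feature is actually harmful.

```python
# Hypothetical marker table: PDF name tokens mapped to the risk they hint at.
RISK_MARKERS = {
    b"/Encrypt": "encryption (may require a password)",
    b"/JavaScript": "embedded script",
    b"/URI": "remote reference",
    b"/EmbeddedFile": "embedded file",
}


def scan_dependencies(data: bytes) -> list:
    """Return the sorted risk labels whose marker occurs in the raw bytes."""
    return sorted(
        label for marker, label in RISK_MARKERS.items() if marker in data
    )
```

A report like this is about context rather than content: it lists what the file needs or touches, not whether it is valid.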
7. Sustainable Tools
Our Tools
Pure-Java Characterisation:
JHOVE (‘clean room’ implementation)
New Zealand Metadata Extractor (NZME)
Apache Tika
Java-based aggregation of various CLI tools:
JHOVE2
FITS
Other Characterisation:
XCL – C++/XML ‘clean room’ extended with ImageMagick
Many more, inc. forensics, BitCurator, OfficeDDT, jpylyzer...
Identification:
DROID, FIDO, Apache Tika, File
Visualisation:
C3PO, and many non-specialised tools.
8. Sustainable Tools
Up to date? Working together?
Software Dependency Management:
FITS/JHOVE2 embed old DROID versions, hard to upgrade.
Dead dependencies: FITS and FFIdent, NZME and Jflac.
Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
Embed shared modules instead?
Software Project Management and Communication:
JHOVE, JHOVE2? FITS?
JHOVE2 only compiles on Sheila’s branch?
Roadmaps, issue management, testing, C.I., etc.
Cross-project coordination and bug-fixing?
Complexity: JHOVE2, XCL, extremely complex
JHOVE2 Berkeley DB causes checksum failures in tests
Tika solves same problem using SAX
9. Sustainable Tools
Shared tests?
Separate projects arise from separate workflows
Start by understanding commonality and finding gaps?
Share test cases and compare results?
The OPF Format Corpus contains various valid and invalid files.
Built by practitioners to test real use cases.
e.g. JP2 features, PDF Cabinet of Horrors.
Do the tools give consistent and complementary results?
Let’s find out!
c.f. Dave Tarrant’s REF for Identification:
http://data.openplanetsfoundation.org/ref/
http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
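Sharing test cases and comparing results could start as simply as running every tool over the same corpus and listing the files on which they disagree. The sketch below assumes each tool is wrapped as a Python callable from bytes to a result string; `compare_tools` and the wrappers are hypothetical and not part of any of the tools named above.

```python
from typing import Callable, Dict


def compare_tools(
    corpus: Dict[str, bytes],
    tools: Dict[str, Callable[[bytes], str]],
) -> Dict[str, Dict[str, str]]:
    """Run every tool on every file; keep only files where results differ."""
    disagreements = {}
    for filename, data in corpus.items():
        results = {name: tool(data) for name, tool in tools.items()}
        if len(set(results.values())) > 1:
            disagreements[filename] = results
    return disagreements


# Toy stand-ins for real tool wrappers (assumptions for illustration only).
def strict_tool(data: bytes) -> str:
    return "PDF" if data.startswith(b"%PDF-") and b"%%EOF" in data else "UNKNOWN"


def lenient_tool(data: bytes) -> str:
    return "PDF" if data.startswith(b"%PDF-") else "UNKNOWN"
```

The disagreement list is the interesting output: each entry is either a bug in one tool or a gap that another tool fills.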
10. Bit-mashing as Tool QA
Bitwise exploration of data sensitivity.
One way to compare tools.
Helps understand formats.
c.f. Jay Gattuso’s recent OPF blog.
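A minimal sketch of bit-mashing, assuming the tool under test can be wrapped as a function from bytes to a comparable result. It flips one bit at a time, reruns the tool, and records every position where the output differs from the baseline. `bit_mash` and `toy_identifier` are hypothetical names, and this is not necessarily the method used in the blog post mentioned above.

```python
def bit_mash(data: bytes, tool, stride: int = 1):
    """Flip each bit in turn and record where the tool's output changes.

    Returns the baseline result and a dict mapping (byte offset, bit index)
    to the changed result. `stride` samples every n-th byte to limit cost.
    """
    baseline = tool(data)
    changed = {}
    for offset in range(0, len(data), stride):
        for bit in range(8):
            mutated = bytearray(data)
            mutated[offset] ^= 1 << bit  # flip exactly one bit
            result = tool(bytes(mutated))
            if result != baseline:
                changed[(offset, bit)] = result
    return baseline, changed


def toy_identifier(data: bytes) -> str:
    """Stand-in for a real tool: looks at the PDF header only."""
    return "PDF" if data.startswith(b"%PDF-") else "UNKNOWN"
```

Running the same mutations through two tools shows which bytes each one actually depends on, which is both a comparison of the tools and a map of the format's sensitive regions. The cost is one tool run per bit, so real files would likely need sampling via `stride`.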
11. Quality Assurance (of broken or potentially broken data)
Quality assurance, Bit rot, and Integrity
JHOVE let failed TIFF-JP2 through…
Jpylyzer does better.
Both fall far short of actual rendering.
12. Where's the unification?
Where should we work together?
Shared test corpora and test framework:
Start with the OPF Format Corpus?
Pull other corpora in by reference:
http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
Sustainable version of Dave Tarrant’s REF?
Extend with bit-mashing to compare tools?
Aim to coordinate more:
Make it clear where to go? (More about OfficeDDT).
Consider merging projects?
Consider sharing underlying libraries?
Consider building Tika modules?
Please consider Apache Preflight as base for PDF validation.