• Andrew Jackson
• Web Archiving Technical Lead
• British Library
Unified Characterisation, Please
The Practitioners Have Spoken…
• Quality Assurance (of broken or potentially broken data):
  • Quality assurance, Bit rot, and Integrity
• Appraisal and Assessment:
  • Appraisal and assessment, Conformance, Unknown characteristics, and Unknown file formats.
  • Identify/Locate Preservation Worthy Data
• Identify Preservation Risks:
  • Obsolescence, preservation risk and business constraint
• Long tail of many other issues:
  • Contextual and Data capture issues through to Embedded objects, and broader issues around Value and cost.
• Plus: Sustainable Tools
Appraisal and Assessment
Conformance, Unknown characteristics, and Unknown file formats. Identify/Locate Preservation Worthy Data
• Identification
  • Always used to 'route' data to software that can understand it.
  • Use minimum information to identify:
    • e.g. header only if possible. Report “Truncated PDF”, not “UNKNOWN”.
    • e.g. a GIS shapefile with .shp and .shx but a missing .dbf should be reported as such.
• Validation
  • Two modes needed: “Fast fail” and “Log and continue” / Quirks
  • Stop the baseless distinction between “Well formed” and “Valid”
  • Validation is irrelevant to digital preservation assessment:
    • e.g. Effective “PDF/A”, without the 1.4 and XMP chunk.
    • We're on the wrong side of Postel's Law.
  • Unknown completeness and failure to future-proof:
    • e.g. JHOVE tries to validate versions of PDF it cannot know.
    • e.g. Tools sometimes interpret/migrate data opaquely.
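The Identification bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how DROID, Tika or any other tool works: the function names `identify` and `shapefile_report` are hypothetical, the `%%EOF` check is a rough heuristic for truncation, and the file names in the usage note are made up.

```python
def identify(data):
    """Label a byte string using only its first and last few bytes."""
    if data.startswith(b"%PDF-"):
        # A complete PDF normally has an %%EOF marker near its end.
        # If the header is there but the marker is not, say so
        # instead of giving up with "UNKNOWN".
        if b"%%EOF" in data[-1024:]:
            return "PDF"
        return "Truncated PDF"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    return "UNKNOWN"


def shapefile_report(present, stem):
    """Report which parts of a shapefile are missing.

    `present` is a set of file names found alongside each other,
    `stem` is the shared base name (both are illustrative inputs).
    """
    required = (".shp", ".shx", ".dbf")
    missing = [ext for ext in required if stem + ext not in present]
    if not missing:
        return "ESRI Shapefile"
    if len(missing) == len(required):
        return "UNKNOWN"
    return "ESRI Shapefile (missing " + ", ".join(missing) + ")"
```

For example, `identify(b"%PDF-1.4\n1 0 obj")` gives `"Truncated PDF"`, and `shapefile_report({"roads.shp", "roads.shx"}, "roads")` gives `"ESRI Shapefile (missing .dbf)"`.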
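The two validation modes asked for above, “Fast fail” and “Log and continue”, could share one set of checks. The sketch below assumes a hypothetical `check_pdf` generator with three toy checks. It is not a real PDF validator and does not reflect how JHOVE is built.

```python
class ValidationError(Exception):
    """Raised in fast-fail mode at the first problem found."""


def check_pdf(data):
    """Yield one message per problem found (toy checks only)."""
    if not data.startswith(b"%PDF-"):
        yield "missing %PDF- header"
    if b"%%EOF" not in data[-1024:]:
        yield "missing %%EOF marker"
    if b"startxref" not in data:
        yield "missing startxref keyword"


def validate(data, fast_fail=False):
    """Run the checks in one of two modes.

    fast_fail=True  stops at the first problem and raises.
    fast_fail=False logs every problem and carries on.
    """
    problems = []
    for message in check_pdf(data):
        if fast_fail:
            raise ValidationError(message)
        problems.append(message)
    return problems
```

Because `check_pdf` is a generator, fast-fail mode never runs the later checks, while log-and-continue mode returns the full list for the caller to assess.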
Identify Preservation Risks
Obsolescence, preservation risk and business constraint
• Significant Properties are irrelevant here.
  • It's not really about the content, but about the context.
• Dependency Analysis:
  • What software does this need?
  • Does this file use format features that are not well supported across implementations?
  • What other resources are transcluded?
    • Fonts? cf. OfficeDDT.
    • Remote embeds?
    • Embedded scripts that might mask dependencies?
  • Do some operations require a password?
    • e.g. JHOVE cannot spot 'harmless' PDF encryption.
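A very crude form of the dependency analysis above is to scan a PDF's raw bytes for name tokens that hint at dependencies. This sketch is an assumption-laden heuristic: the function `dependency_hints` and the choice of tokens are illustrative, and anything stored inside compressed object streams will be missed, so a real tool would need a proper parser.

```python
import re

# PDF name tokens that hint at external or hidden dependencies.
# The selection is illustrative, not exhaustive.
MARKERS = {
    b"/Encrypt": "encryption dictionary (some operations may need a password)",
    b"/JavaScript": "embedded script",
    b"/URI": "URI action (link to a remote resource)",
    b"/Launch": "action that launches an external application",
}


def has_name(data, token):
    """True if `token` appears as a whole PDF name, not as a prefix of a longer one."""
    return re.search(re.escape(token) + rb"(?![A-Za-z0-9])", data) is not None


def dependency_hints(data):
    """Return human-readable hints about what this PDF may depend on."""
    hints = [meaning for token, meaning in MARKERS.items() if has_name(data, token)]
    # Fonts that are referenced but never embedded must come from the
    # rendering environment. /FontFile, /FontFile2 and /FontFile3 mark
    # embedded font programs.
    if has_name(data, b"/Font") and b"/FontFile" not in data:
        hints.append("fonts referenced but apparently not embedded")
    return hints
```

The point is the shape of the report: a list of contextual risks per file, independent of whether the file is “valid”.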
Sustainable Tools
Our Tools
• Pure-Java Characterisation:
  • JHOVE ('clean room' implementation)
  • New Zealand Metadata Extractor (NZME)
  • Apache Tika
• Java-based aggregation of various CLI tools:
  • JHOVE2
  • FITS
• Other Characterisation:
  • XCL – C++/XML 'clean room' extended with ImageMagick
  • Many more, inc. forensics, BitCurator, OfficeDDT, jpylyzer...
• Identification:
  • DROID, FIDO, Apache Tika, File
• Visualisation:
  • C3PO, and many non-specialised tools.
Sustainable Tools
Up to date? Working together?
• Software Dependency Management:
  • FITS/JHOVE2 embed old DROID versions, hard to upgrade.
  • Dead dependencies: FITS and FFIdent, NZME and Jflac.
  • Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
  • Embed shared modules instead?
• Software Project Management and Communication:
  • JHOVE, JHOVE2? FITS?
  • JHOVE2 only compiles on Sheila's branch?
  • Roadmaps, issue management, testing, C.I., etc.
  • Cross-project coordination and bug-fixing?
• Complexity: JHOVE2 and XCL are extremely complex
  • JHOVE2's Berkeley DB causes checksum failures in tests
  • Tika solves the same problem using SAX
Sustainable Tools
Shared tests?
• Separate projects arise from separate workflows
  • Start by understanding commonality and finding gaps?
• Share test cases and compare results?
  • The OPF Format Corpus contains various valid and invalid files.
  • Built by practitioners to test real use cases.
  • e.g. JP2 features, PDF Cabinet of Horrors.
• Do the tools give consistent and complementary results?
  • Let's find out!
  • cf. Dave Tarrant's REF for Identification:
    • http://data.openplanetsfoundation.org/ref/
    • http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
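Comparing results over a shared corpus, as proposed above, needs little more than a table of answers per file. In this minimal sketch the function `compare_results`, the tool names and the file names are all hypothetical, and the results are assumed to have already been collected into plain dictionaries.

```python
def compare_results(results):
    """Find files on which the tools disagree.

    `results` maps tool name -> {file name -> reported format}.
    Returns {file name -> {tool name -> answer}} for disagreements only.
    A tool that gave no answer for a file is recorded as "NO RESULT".
    """
    files = sorted(set().union(*(r.keys() for r in results.values())))
    disagreements = {}
    for name in files:
        answers = {tool: r.get(name, "NO RESULT") for tool, r in results.items()}
        if len(set(answers.values())) > 1:
            disagreements[name] = answers
    return disagreements


# Made-up example data, not real tool output.
example = {
    "tool_a": {"a.pdf": "PDF 1.4", "b.jp2": "JPEG 2000"},
    "tool_b": {"a.pdf": "PDF 1.4", "b.jp2": "UNKNOWN"},
}
```

Here `compare_results(example)` reports only `b.jp2`, the one file where the two hypothetical tools differ.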
Bit-mashing as Tool QA
• Bitwise exploration of data sensitivity.
  • One way to compare tools.
  • Helps understand formats.
• cf. Jay Gattuso's recent OPF blog.
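The bit-mashing idea can be sketched as: flip each bit of a file in turn, re-run the tool, and note where its answer changes. This is an assumed minimal version, not the code from the blog post mentioned above. `bit_mash` and `toy_identifier` are hypothetical names, and a real run would wrap a command-line tool rather than a Python function.

```python
def bit_mash(data, tool):
    """Flip every bit of `data` once and record where `tool` changes its answer.

    Returns the baseline answer and a list of (bit index, new answer) pairs.
    """
    baseline = tool(data)
    sensitive = []
    for bit in range(len(data) * 8):
        mutated = bytearray(data)
        mutated[bit // 8] ^= 1 << (bit % 8)  # flip exactly one bit
        try:
            outcome = tool(bytes(mutated))
        except Exception as exc:
            # A crash is also a result worth recording.
            outcome = "ERROR: " + type(exc).__name__
        if outcome != baseline:
            sensitive.append((bit, outcome))
    return baseline, sensitive


def toy_identifier(data):
    """Stand-in for a real tool: looks at the PDF header only."""
    return "PDF" if data.startswith(b"%PDF-") else "UNKNOWN"
```

Running two tools through the same loop and comparing their `sensitive` lists shows which bytes each one actually inspects.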
Quality Assurance (of broken or potentially broken data)
Quality assurance, Bit rot, and Integrity
• JHOVE let failed TIFF-JP2 through…
  • Jpylyzer does better.
  • Both fall far short of actual rendering.
Where's the unification?
Where should we work together?
• Shared test corpora and test framework:
  • Start with the OPF Format Corpus?
  • Pull other corpora in by reference:
    • http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
  • Sustainable version of Dave Tarrant's REF?
  • Extend with bit-mashing to compare tools?
• Aim to coordinate more:
  • Make it clear where to go? (More about OfficeDDT).
  • Consider merging projects?
  • Consider sharing underlying libraries?
  • Consider building Tika modules?
  • Please consider Apache Preflight as base for PDF validation.
