3. Rothenberg & Rosenthal On Format Obsolescence
Jeff Rothenberg:
“Digital Information Lasts Forever –
Or Five Years, Whichever Comes First.” (1997)
“…still apt…” (2012)
David Rosenthal:
“when challenged, proponents of [format migration strategies]
have failed to identify even one format in wide use when
Rothenberg [made that assertion] that has gone obsolete in the
intervening decade and a half.” (2010)
Argues that network effects inhibit obsolescence
Where is the evidence?
5. UK Web Domain Dataset (1994-2010)
From the Internet Archive
Millions of websites
> 2.5 billion resources
> 400,000 ARC/WARC files
> 35TB
Execution at Scale
Stored on HDFS
Map-Reduce
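As a rough illustration of the Map-Reduce pattern named above, the sketch below counts resources per (server type, Tika type, DROID-B type, year) key. It runs in memory rather than on Hadoop, and the record layout and function names (`map_record`, `reduce_counts`, `run_job`) are assumptions made for this example, not the code used in the study.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit one (key, 1) pair per archived resource."""
    key = (record["server"], record["tika"], record["droid"], record["year"])
    yield key, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_job(records):
    """Chain map and reduce in memory; a real cluster would shuffle between them."""
    pairs = (pair for record in records for pair in map_record(record))
    return reduce_counts(pairs)


# Hypothetical records standing in for entries read from ARC/WARC files.
sample = [
    {"server": "image/png", "tika": "image/png",
     "droid": "image/png; version=1.0", "year": 2004},
    {"server": "image/png", "tika": "image/png",
     "droid": "image/png; version=1.0", "year": 2004},
    {"server": "application/pdf", "tika": "application/pdf; version=1.2",
     "droid": "application/pdf; version=1.2", "year": 2004},
]

profile = run_job(sample)
for key, count in sorted(profile.items()):
    print(key, count)
```

Because each key is counted independently, the same logic can in principle be split across many ARC/WARC files and merged afterwards.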
6. Identification Tools
DROID
Well-known in digital preservation community
Format version level identification
Minor problem concerning file handles
Only binary signature part (DROID-B) could be embedded
Apache Tika
Widely used identification and data extraction tool
Identifies many formats at the MIME type level
Easy to embed and extend
Added ability to extract e.g. software identifiers
Minor bug concerning identification buffer size
7. A Common Language For Format Identifiers
Comparison and combination requires a common model
Map PRONOM IDs to extended MIME Types
fmt/18
becomes
application/pdf; version=1.4
Allows easy comparison at sub-type level
Can easily extend to cover other properties:
text/plain; charset=UTF-8
application/pdf; software="Adobe Acrobat 6.0"
Also extended Tika to output details from PDFs
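A minimal sketch of the common model described on this slide: PRONOM IDs are mapped to extended MIME types, which are then split into a base type and parameters so that two results can be compared at the sub-type level. Only the fmt/18 mapping comes from the slide; the function names and the deliberately naive parser are assumptions for illustration.

```python
# Only this entry is taken from the slide; a real table would cover many more IDs.
PRONOM_TO_MIME = {
    "fmt/18": "application/pdf; version=1.4",
}


def parse_extended_mime(value):
    """Split 'type/subtype; key=value; ...' into (base_type, params).

    Naive parser: assumes no ';' appears inside a quoted parameter value.
    """
    parts = [part.strip() for part in value.split(";")]
    base = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing ';'
        key, _, val = part.partition("=")
        params[key.strip().lower()] = val.strip().strip('"')
    return base, params


def same_subtype(a, b):
    """Compare two extended MIME types while ignoring their parameters."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]


droid_result = PRONOM_TO_MIME["fmt/18"]
tika_result = 'application/pdf; version=1.4; software="Adobe Acrobat 6.0"'

print(parse_extended_mime(tika_result))
print(same_subtype(droid_result, tika_result))
```

Keeping version, charset and software as parameters means the two tools can agree on the sub-type even when only one of them reports the extra detail.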
8. Format Profile Dataset
Server, Tika & DROID-B format profiles, over time:
image/png	image/png	image/png; version=1.0	2004	102
application/pdf	application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0"	application/pdf; version=1.2	2004	1
CC0 – free to download and reuse
http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
Please cite us and/or let us know if you use it
Source code of all tools and modifications also available
https://github.com/openplanets/nanite
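For anyone reusing the data, the sketch below shows one way a format's profile over time might be derived. It assumes each record is a tab-separated line with five fields (server type, Tika type, DROID-B type, year, count); that layout is inferred from the example records on this slide and should be checked against the published files. The third sample line is invented for illustration.

```python
from collections import Counter


def parse_profile_line(line):
    """Split one record into (server, tika, droid, year, count)."""
    server, tika, droid, year, count = line.rstrip("\n").split("\t")
    return server, tika, droid, int(year), int(count)


def yearly_counts(lines, base_type):
    """Total resources per year whose Tika result has the given base type."""
    totals = Counter()
    for line in lines:
        _, tika, _, year, count = parse_profile_line(line)
        if tika.split(";")[0].strip() == base_type:
            totals[year] += count
    return dict(sorted(totals.items()))


sample_lines = [
    "image/png\timage/png\timage/png; version=1.0\t2004\t102",
    'application/pdf\tapplication/pdf; version=1.2; '
    'software="Acrobat Distiller 4.0 for Windows"; '
    'source="Adobe PageMaker 6.0"\tapplication/pdf; version=1.2\t2004\t1',
    # Hypothetical record, not from the dataset.
    "application/pdf\tapplication/pdf; version=1.4\t"
    "application/pdf; version=1.4\t2005\t7",
]

print(yearly_counts(sample_lines, "application/pdf"))
print(yearly_counts(sample_lines, "image/png"))
```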
11. Inconsistencies
Gaps
37 formats spotted by DROID-B but not Tika
Notably includes earlier Office formats
129 formats spotted by Tika but not DROID-B
But at least 20 are due to not using the full DROID
Conflicts
Failed MIME type mapping, e.g. PDF 1.7 (since fixed)
‘Soft’ signatures – e.g. PICT matching 3M JPG (gone)
DROID strictness – 9M GIF, 4M JPG, 1.3M PDF…
Both tools bad at non-HTML/XML text formats
CSS, scripting languages like JS, CSV, TSV, etc.
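The gaps and conflicts listed above can be thought of as set differences and per-resource disagreements. The sketch below illustrates that idea with invented values; the MIME strings, including `image/x-pict` for PICT, and the helper names are assumptions, not output from either tool.

```python
def base_type(value):
    """Drop parameters such as version, keeping only 'type/subtype'."""
    return value.split(";")[0].strip().lower()


def find_gaps(tika_types, droid_types):
    """Return (formats seen only by DROID-B, formats seen only by Tika)."""
    tika = {base_type(t) for t in tika_types}
    droid = {base_type(d) for d in droid_types}
    return sorted(droid - tika), sorted(tika - droid)


def find_conflicts(results):
    """Return the (tika, droid) pairs that disagree at the sub-type level."""
    return [(t, d) for t, d in results if base_type(t) != base_type(d)]


# Hypothetical identification results, one (tika, droid) pair per resource.
results = [
    ("image/jpeg", "image/x-pict"),                # a 'soft' signature clash
    ("application/pdf; version=1.4", "application/pdf; version=1.4"),
    ("text/css", "text/plain"),
]

only_droid, only_tika = find_gaps(
    [t for t, _ in results], [d for _, d in results]
)
print(only_droid)
print(only_tika)
print(find_conflicts(results))
```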
21. Summary
Format obsolescence is complex
Network effects do appear to stabilize formats
But once-popular formats are fading nevertheless
More sophisticated approach required
Please re-use our data, or ask for more
Firmer conclusions need:
Richer, more detailed results
From a wider range of corpora
This approach only gives creator information
A different approach will be needed to understand
resource consumption (e.g. PPT 4, RealAudio 1)