SlideShare a Scribd company logo
1 of 19
Download to read offline
Future-Proofing the Web:
What We Can Do Today
        16 September 2005

John Kunze, California Digital Library
Conclusion
•  Desiccation now; migration/emulation maybe
•  Modest preservation budgets, competing
   organizational priorities, and diminishing
   expert format knowledge may make it worth
   while to save original formats along with
   simple, low-tech, desiccated formats having
   an excellent preservation outlook just in case
   –  the original format should fail and
   –  we never get funding to touch the objects again
California Digital Library (CDL)
•  A university library with no books, students, or
   faculty
•  Central services for the 10 campus libraries of
   the University of California
•  Content hosting: electronic texts, datasets,
   finding aids, etc.
•  New preservation challenge for CDL: capture
   and long-term retention of material found on
   the web
What’s digital preservation?
•  Stored objects that remain usable and
   faithful to the creators’ original intention
•  How? By safeguarding information’s …
  –  Viability (intact bit streams)
  –  Renderability (by machines)
  –  Understandability (by humans)
•  Viability not in scope here
Migration and Emulation
•  Migration problems
  –  Unknown costs, human review, format
     errors
•  Emulation problems
  –  Unknown costs, human review, software IP
•  Both try to keep up with or preserve an
   object’s technical context
•  An approach to reduce that context…
The Lesson from Paper
•  As a recording and display device
  –  Can last for 1000 years
•  Why this astonishing performance?
  –  No technical intermediation required
•  What trick can we borrow
  –  The simplest technologies to maintain and
     understand today are the simplest to carry
     forward and to recreate in the future
Low-Tech Dependencies
•  Semantic technology
  –  Loss inevitable due to linguistic shifts
•  Substitute light source
  –  Fire, the lowest tech invention
•  Microfilm
  –  Light source plus lens -- 500-year old
     technology
Desiccated Data
•  Remarkable lesson from the longest-
   lived online digital format
  –  Plain text archives of IETF internet RFCs
  –  High in value, low in features
•  Preservation through “desiccation”
  –  No fonts, graphics, colors, diacritics, etc.
  –  But essential cultural value retained
Hedging our Bets
•  Always save the original format
•  In addition, derive desiccated formats in case
   the original format ever fails
•  Extra storage cost may be incurred anyway if
   your access system requires a plain text
   derivative for search indexing
•  Question: what about Latin-1 support
•  Question: surfacing hidden features
Next Lowest-Tech Technology
•  Raster image as alternate desiccated format
   –  Rectangular grid of picture elements
   –  Technical impact of pressure to compress
      •  Open run-length encoding or wavelet?
•  Rendering tools will never be better than at
   peak of format’s popularity
   –  Very common malformed format instances
•  Additional fall back format in case the original
   and plain text versions fail
•  Question: surfacing hidden data
Beyond Text and Images
•  No attempt yet to formulate desiccated
   data versions of audio, video, or multi-
   media
•  General lesson: technology will clearly
   be part of digital preservation, but the
   greater the technological dependence,
   the greater the risk
Summary
•  Save the original and desiccated versions as
   fall backs in case of failure
  –  Few features, much value
  –  Low cost, done at peak of tool sophistication
•  Web archiving has no preservation metadata
•  We may never have the money to touch most
   of our objects again
•  Desiccation is something we can do today
Impersistence - lesser factors
Deliberately or accidentally, objects are
•  Removed
•  Replaced
•  Moved without setting up a redirect
  –  Everyone has an indirection mechanism,
     though most don’t use it


•  Scheme impact: zero
Impersistence - small factors
Your org likes persistent ids in principle, but
•  It lacks knowledge that vanilla web servers
   trivially support 500,000 redirect directives
•  It lacks the expertise or staff to maintain a
   web server, a two-column database table,
   and a nightly server config file report writer

•  Scheme impact: zero
Scheme costs and risks
•  Every modern service needs to support indefinitely
   and find or be given replacements for at least
   –  Web server, web browser, and DNS
•  In addition, URN, Handle, and DOI resolution need a
   global proxy or a plugin for every access
•  ARK could use a plugin, but doesn’t need it
•  Handle and DOI also require
   –  You to maintain an extra local server
   –  The community to maintain a set of global servers
•  For the CDL
   –  Handle and DOI come with highest risk
   –  ARK comes with lowest risk
Persistence - indirect factors
CDL’s persistence requirements call for an id
   scheme (not service) connecting users to
•  metadata
•  whether and what kind of persistence
•  sub-object and variant inferences
•  core ids on proxy failure (gracefully)
•  Scheme impact: ARK provides these
  –  A scheme is not a service (DOI is not CrossRef)
  –  When choosing a scheme, we wanted to remain
     independent of extra external service providers
Our Stuff vs Their Stuff
•  Persistence can be split into
   –  the Our Stuff Problem
   –  the Their Stuff Problem
•  It makes no sense for CDL to assign
   persistent ids to Their Stuff
   –  Their Stuff can be hugely important to our users,
      but we don’t control it and cannot vouch for it
   –  Where we can afford it, we track them with PURLs
•  CDL does assign persistent ids to Our Stuff
Distribution of Id Assignment
•  Objects ingested in flows from other libraries
   per submission agreements
•  Each object has an ARK after ingest
   –  Either it has it already
   –  Or we give it one upon entry
•  Campuses can mint their own ARKs or rely
   on our minting service
•  Their own campus ARK namespace is theirs
   to divide up as they wish
Opaque ids with semantic
          extensions
•  CDL dilemma:
   –  opaque ids are needed for names that age and
      travel well
   –  Semantically laden ids are helpful in providing
      many id services
•  Hybrid:
   –  opaque ids are used to name abstract
      preservation objects
   –  Semantic and sometimes transient extensions
      address components inside of objects (the set of
      components evolves over time anyway)

More Related Content

Similar to Future-Proofing the Web: Low-Tech Dependencies for Long-Term Preservation

Introduction to Digital Preservation
Introduction to Digital PreservationIntroduction to Digital Preservation
Introduction to Digital PreservationBill LeFurgy
 
2010 AIRI Petabyte Challenge - View From The Trenches
2010 AIRI Petabyte Challenge - View From The Trenches2010 AIRI Petabyte Challenge - View From The Trenches
2010 AIRI Petabyte Challenge - View From The TrenchesGeorge Ang
 
"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with ArchivematicaJenny Mitcham
 
“Filling the digital preservation gap” an update from the Jisc Research Data ...
“Filling the digital preservation gap”an update from the Jisc Research Data ...“Filling the digital preservation gap”an update from the Jisc Research Data ...
“Filling the digital preservation gap” an update from the Jisc Research Data ...Jenny Mitcham
 
4D Pubs - Distributed Dynamic Document Dsplay
4D Pubs - Distributed Dynamic Document Dsplay4D Pubs - Distributed Dynamic Document Dsplay
4D Pubs - Distributed Dynamic Document DsplayChris Despopoulos
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning ModelsJosh Patterson
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Josh Patterson
 
Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...David Wallom
 
IWMW 2002: Portals and CMS:" Why You Need Them Both
IWMW 2002: Portals and CMS:" Why You Need Them BothIWMW 2002: Portals and CMS:" Why You Need Them Both
IWMW 2002: Portals and CMS:" Why You Need Them BothIWMW
 
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems
 
SAICSIT 2011 Postgraduate Symposium Presentation
SAICSIT 2011 Postgraduate Symposium PresentationSAICSIT 2011 Postgraduate Symposium Presentation
SAICSIT 2011 Postgraduate Symposium PresentationLighton Phiri
 
Just digitise it - Daniel Wilksch of the Public Records Office Victoria
Just digitise it - Daniel Wilksch of the Public Records Office VictoriaJust digitise it - Daniel Wilksch of the Public Records Office Victoria
Just digitise it - Daniel Wilksch of the Public Records Office VictoriaNational Library of Australia
 
Digital Kniznica 09 Ben Osteen Oxford
Digital Kniznica 09 Ben Osteen OxfordDigital Kniznica 09 Ben Osteen Oxford
Digital Kniznica 09 Ben Osteen Oxfordguest045ab43
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jDataWorks Summit
 
071310 sun d_0930_feldman_stephen
071310 sun d_0930_feldman_stephen071310 sun d_0930_feldman_stephen
071310 sun d_0930_feldman_stephenSteve Feldman
 
TECHunplugged Austin 2016
TECHunplugged Austin 2016TECHunplugged Austin 2016
TECHunplugged Austin 2016Chris Evans
 
A future with no history meets a history with no future: how much do we need ...
A future with no history meets a history with no future: how much do we need ...A future with no history meets a history with no future: how much do we need ...
A future with no history meets a history with no future: how much do we need ...DigCurV
 
Solving the Game Content Problem
Solving the Game Content ProblemSolving the Game Content Problem
Solving the Game Content ProblemKoray Hagen
 

Similar to Future-Proofing the Web: Low-Tech Dependencies for Long-Term Preservation (20)

Introduction to Digital Preservation
Introduction to Digital PreservationIntroduction to Digital Preservation
Introduction to Digital Preservation
 
2010 AIRI Petabyte Challenge - View From The Trenches
2010 AIRI Petabyte Challenge - View From The Trenches2010 AIRI Petabyte Challenge - View From The Trenches
2010 AIRI Petabyte Challenge - View From The Trenches
 
Quick and dirty islandora
Quick and dirty islandoraQuick and dirty islandora
Quick and dirty islandora
 
"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica
 
“Filling the digital preservation gap” an update from the Jisc Research Data ...
“Filling the digital preservation gap”an update from the Jisc Research Data ...“Filling the digital preservation gap”an update from the Jisc Research Data ...
“Filling the digital preservation gap” an update from the Jisc Research Data ...
 
4D Pubs - Distributed Dynamic Document Dsplay
4D Pubs - Distributed Dynamic Document Dsplay4D Pubs - Distributed Dynamic Document Dsplay
4D Pubs - Distributed Dynamic Document Dsplay
 
How to Build Deep Learning Models
How to Build Deep Learning ModelsHow to Build Deep Learning Models
How to Build Deep Learning Models
 
Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015Deep learning with DL4J - Hadoop Summit 2015
Deep learning with DL4J - Hadoop Summit 2015
 
Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...
 
IWMW 2002: Portals and CMS:" Why You Need Them Both
IWMW 2002: Portals and CMS:" Why You Need Them BothIWMW 2002: Portals and CMS:" Why You Need Them Both
IWMW 2002: Portals and CMS:" Why You Need Them Both
 
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
 
SAICSIT 2011 Postgraduate Symposium Presentation
SAICSIT 2011 Postgraduate Symposium PresentationSAICSIT 2011 Postgraduate Symposium Presentation
SAICSIT 2011 Postgraduate Symposium Presentation
 
Just digitise it - Daniel Wilksch of the Public Records Office Victoria
Just digitise it - Daniel Wilksch of the Public Records Office VictoriaJust digitise it - Daniel Wilksch of the Public Records Office Victoria
Just digitise it - Daniel Wilksch of the Public Records Office Victoria
 
Bertenthal
BertenthalBertenthal
Bertenthal
 
Digital Kniznica 09 Ben Osteen Oxford
Digital Kniznica 09 Ben Osteen OxfordDigital Kniznica 09 Ben Osteen Oxford
Digital Kniznica 09 Ben Osteen Oxford
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4j
 
071310 sun d_0930_feldman_stephen
071310 sun d_0930_feldman_stephen071310 sun d_0930_feldman_stephen
071310 sun d_0930_feldman_stephen
 
TECHunplugged Austin 2016
TECHunplugged Austin 2016TECHunplugged Austin 2016
TECHunplugged Austin 2016
 
A future with no history meets a history with no future: how much do we need ...
A future with no history meets a history with no future: how much do we need ...A future with no history meets a history with no future: how much do we need ...
A future with no history meets a history with no future: how much do we need ...
 
Solving the Game Content Problem
Solving the Game Content ProblemSolving the Game Content Problem
Solving the Game Content Problem
 

More from John Kunze

The YAMZ Metadictionary
The YAMZ MetadictionaryThe YAMZ Metadictionary
The YAMZ MetadictionaryJohn Kunze
 
YAMZ Metadata Vocabulary Builder
YAMZ Metadata Vocabulary BuilderYAMZ Metadata Vocabulary Builder
YAMZ Metadata Vocabulary BuilderJohn Kunze
 
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...John Kunze
 
EZID and N2T at CDL
EZID and N2T at CDLEZID and N2T at CDL
EZID and N2T at CDLJohn Kunze
 
YAMZ.net: better, faster, cheaper taxonomy building
YAMZ.net:  better, faster, cheaper taxonomy buildingYAMZ.net:  better, faster, cheaper taxonomy building
YAMZ.net: better, faster, cheaper taxonomy buildingJohn Kunze
 
A Vocabulary for Persistence
A Vocabulary for PersistenceA Vocabulary for Persistence
A Vocabulary for PersistenceJohn Kunze
 
Identifiers obey Resolvers not Schemes
Identifiers obey Resolvers not SchemesIdentifiers obey Resolvers not Schemes
Identifiers obey Resolvers not SchemesJohn Kunze
 
Names, Things, and Open Identifier Infrastructure: N2T and ARKs
Names, Things, and Open Identifier Infrastructure: N2T and ARKsNames, Things, and Open Identifier Infrastructure: N2T and ARKs
Names, Things, and Open Identifier Infrastructure: N2T and ARKsJohn Kunze
 
ARK identifiers: lessons learnt at BnF: paths forward
ARK identifiers: lessons learnt at BnF: paths forwardARK identifiers: lessons learnt at BnF: paths forward
ARK identifiers: lessons learnt at BnF: paths forwardJohn Kunze
 
YAMZ: a cross-domain crowd-sourced metadata vocabulary
YAMZ: a cross-domain crowd-sourced metadata vocabularyYAMZ: a cross-domain crowd-sourced metadata vocabulary
YAMZ: a cross-domain crowd-sourced metadata vocabularyJohn Kunze
 
DataONE Preservation and Metadata Working Group Report 2014
DataONE Preservation and Metadata Working Group Report 2014DataONE Preservation and Metadata Working Group Report 2014
DataONE Preservation and Metadata Working Group Report 2014John Kunze
 
Selected Bash shell tricks from Camp CDL breakout group
Selected Bash shell tricks from Camp CDL breakout groupSelected Bash shell tricks from Camp CDL breakout group
Selected Bash shell tricks from Camp CDL breakout groupJohn Kunze
 
Annotating Research Datasets
Annotating Research DatasetsAnnotating Research Datasets
Annotating Research DatasetsJohn Kunze
 
The Data Management Ecosystem
The Data Management EcosystemThe Data Management Ecosystem
The Data Management EcosystemJohn Kunze
 
Library Tools Supporting Data-Rich Research
Library Tools Supporting Data-Rich ResearchLibrary Tools Supporting Data-Rich Research
Library Tools Supporting Data-Rich ResearchJohn Kunze
 
Big Data's Long Tail
Big Data's Long TailBig Data's Long Tail
Big Data's Long TailJohn Kunze
 
Scalable Identifiers for Natural History Collections
Scalable Identifiers for Natural History CollectionsScalable Identifiers for Natural History Collections
Scalable Identifiers for Natural History CollectionsJohn Kunze
 
Supporting Data-Rich Research on Many Fronts
Supporting Data-Rich Research on Many FrontsSupporting Data-Rich Research on Many Fronts
Supporting Data-Rich Research on Many FrontsJohn Kunze
 
The ARK Identifier Scheme at Ten Years Old
The ARK Identifier Scheme at Ten Years OldThe ARK Identifier Scheme at Ten Years Old
The ARK Identifier Scheme at Ten Years OldJohn Kunze
 

More from John Kunze (20)

The YAMZ Metadictionary
The YAMZ MetadictionaryThe YAMZ Metadictionary
The YAMZ Metadictionary
 
YAMZ Metadata Vocabulary Builder
YAMZ Metadata Vocabulary BuilderYAMZ Metadata Vocabulary Builder
YAMZ Metadata Vocabulary Builder
 
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...
The ARK Alliance: 20 years, 850 institutions, 8.2 billion persistent identifi...
 
EZID and N2T at CDL
EZID and N2T at CDLEZID and N2T at CDL
EZID and N2T at CDL
 
YAMZ.net: better, faster, cheaper taxonomy building
YAMZ.net:  better, faster, cheaper taxonomy buildingYAMZ.net:  better, faster, cheaper taxonomy building
YAMZ.net: better, faster, cheaper taxonomy building
 
A Vocabulary for Persistence
A Vocabulary for PersistenceA Vocabulary for Persistence
A Vocabulary for Persistence
 
Identifiers obey Resolvers not Schemes
Identifiers obey Resolvers not SchemesIdentifiers obey Resolvers not Schemes
Identifiers obey Resolvers not Schemes
 
Names, Things, and Open Identifier Infrastructure: N2T and ARKs
Names, Things, and Open Identifier Infrastructure: N2T and ARKsNames, Things, and Open Identifier Infrastructure: N2T and ARKs
Names, Things, and Open Identifier Infrastructure: N2T and ARKs
 
ARK identifiers: lessons learnt at BnF: paths forward
ARK identifiers: lessons learnt at BnF: paths forwardARK identifiers: lessons learnt at BnF: paths forward
ARK identifiers: lessons learnt at BnF: paths forward
 
YAMZ: a cross-domain crowd-sourced metadata vocabulary
YAMZ: a cross-domain crowd-sourced metadata vocabularyYAMZ: a cross-domain crowd-sourced metadata vocabulary
YAMZ: a cross-domain crowd-sourced metadata vocabulary
 
DataONE Preservation and Metadata Working Group Report 2014
DataONE Preservation and Metadata Working Group Report 2014DataONE Preservation and Metadata Working Group Report 2014
DataONE Preservation and Metadata Working Group Report 2014
 
Selected Bash shell tricks from Camp CDL breakout group
Selected Bash shell tricks from Camp CDL breakout groupSelected Bash shell tricks from Camp CDL breakout group
Selected Bash shell tricks from Camp CDL breakout group
 
Annotating Research Datasets
Annotating Research DatasetsAnnotating Research Datasets
Annotating Research Datasets
 
The Data Management Ecosystem
The Data Management EcosystemThe Data Management Ecosystem
The Data Management Ecosystem
 
Library Tools Supporting Data-Rich Research
Library Tools Supporting Data-Rich ResearchLibrary Tools Supporting Data-Rich Research
Library Tools Supporting Data-Rich Research
 
Big Data's Long Tail
Big Data's Long TailBig Data's Long Tail
Big Data's Long Tail
 
Pamwg 2012ahm
Pamwg 2012ahmPamwg 2012ahm
Pamwg 2012ahm
 
Scalable Identifiers for Natural History Collections
Scalable Identifiers for Natural History CollectionsScalable Identifiers for Natural History Collections
Scalable Identifiers for Natural History Collections
 
Supporting Data-Rich Research on Many Fronts
Supporting Data-Rich Research on Many FrontsSupporting Data-Rich Research on Many Fronts
Supporting Data-Rich Research on Many Fronts
 
The ARK Identifier Scheme at Ten Years Old
The ARK Identifier Scheme at Ten Years OldThe ARK Identifier Scheme at Ten Years Old
The ARK Identifier Scheme at Ten Years Old
 

Recently uploaded

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 

Recently uploaded (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Future-Proofing the Web: Low-Tech Dependencies for Long-Term Preservation

  • 1. Future-Proofing the Web: What We Can Do Today 16 September 2005 John Kunze, California Digital Library
  • 2. Conclusion •  Desiccation now; migration/emulation maybe •  Modest preservation budgets, competing organizational priorities, and diminishing expert format knowledge may make it worth while to save original formats along with simple, low-tech, desiccated formats having an excellent preservation outlook just in case –  the original format should fail and –  we never get funding to touch the objects again
  • 3. California Digital Library (CDL) •  A university library with no books, students, or faculty •  Central services for the 10 campus libraries of the University of California •  Content hosting: electronic texts, datasets, finding aids, etc. •  New preservation challenge for CDL: capture and long-term retention of material found on the web
  • 4. What’s digital preservation? •  Stored objects that remain usable and faithful to the creators’ original intention •  How? By safeguarding information’s … –  Viability (intact bit streams) –  Renderability (by machines) –  Understandability (by humans) •  Viability not in scope here
  • 5. Migration and Emulation •  Migration problems –  Unknown costs, human review, format errors •  Emulation problems –  Unknown costs, human review, software IP •  Both try to keep up with or preserve an object’s technical context •  An approach to reduce that context…
  • 6. The Lesson from Paper •  As a recording and display device –  Can last for 1000 years •  Why this astonishing performance? –  No technical intermediation required •  What trick can we borrow –  The simplest technologies to maintain and understand today are the simplest to carry forward and to recreate in the future
  • 7. Low-Tech Dependencies •  Semantic technology –  Loss inevitable due to linguistic shifts •  Substitute light source –  Fire, the lowest tech invention •  Microfilm –  Light source plus lens -- 500-year old technology
  • 8. Desiccated Data •  Remarkable lesson from the longest- lived online digital format –  Plain text archives of IETF internet RFCs –  High in value, low in features •  Preservation through “desiccation” –  No fonts, graphics, colors, diacritics, etc. –  But essential cultural value retained
  • 9. Hedging our Bets •  Always save the original format •  In addition, derive desiccated formats in case the original format ever fails •  Extra storage cost may be incurred anyway if your access system requires a plain text derivative for search indexing •  Question: what about Latin-1 support •  Question: surfacing hidden features
  • 10. Next Lowest-Tech Technology •  Raster image as alternate desiccated format –  Rectangular grid of picture elements –  Technical impact of pressure to compress •  Open run-length encoding or wavelet? •  Rendering tools will never be better than at peak of format’s popularity –  Very common malformed format instances •  Additional fall back format in case the original and plain text versions fail •  Question: surfacing hidden data
  • 11. Beyond Text and Images •  No attempt yet to formulate desiccated data versions of audio, video, or multi- media •  General lesson: technology will clearly be part of digital preservation, but the greater the technological dependence, the greater the risk
  • 12. Summary •  Save the original and desiccated versions as fall backs in case of failure –  Few features, much value –  Low cost, done at peak of tool sophistication •  Web archiving has no preservation metadata •  We may never have the money to touch most of our objects again •  Desiccation is something we can do today
  • 13. Impersistence - lesser factors Deliberately or accidentally, objects are •  Removed •  Replaced •  Moved without setting up a redirect –  Everyone has an indirection mechanism, though most don’t use it •  Scheme impact: zero
  • 14. Impersistence - small factors Your org likes persistent ids in principle, but •  It lacks knowledge that vanilla web servers trivially support 500,000 redirect directives •  It lacks the expertise or staff to maintain a web server, a two-column database table, and a nightly server config file report writer •  Scheme impact: zero
  • 15. Scheme costs and risks •  Every modern service needs to support indefinitely and find or be given replacements for at least –  Web server, web browser, and DNS •  In addition, URN, Handle, and DOI resolution need a global proxy or a plugin for every access •  ARK could use a plugin, but doesn’t need it •  Handle and DOI also require –  You to maintain an extra local server –  The community to maintain a set of global servers •  For the CDL –  Handle and DOI come with highest risk –  ARK comes with lowest risk
  • 16. Persistence - indirect factors CDL’s persistence requirements call for an id scheme (not service) connecting users to •  metadata •  whether and what kind of persistence •  sub-object and variant inferences •  core ids on proxy failure (gracefully) •  Scheme impact: ARK provides these –  A scheme is not a service (DOI is not CrossRef) –  When choosing a scheme, we wanted to remain independent of extra external service providers
  • 17. Our Stuff vs Their Stuff •  Persistence can be split into –  the Our Stuff Problem –  the Their Stuff Problem •  It makes no sense for CDL to assign persistent ids to Their Stuff –  Their Stuff can be hugely important to our users, but we don’t control it and cannot vouch for it –  Where we can afford it, we track them with PURLs •  CDL does assign persistent ids to Our Stuff
  • 18. Distribution of Id Assignment •  Objects ingested in flows from other libraries per submission agreements •  Each object has an ARK after ingest –  Either it has it already –  Or we give it one upon entry •  Campuses can mint their own ARKs or rely on our minting service •  Their own campus ARK namespace is theirs to divide up as they wish
  • 19. Opaque ids with semantic extensions •  CDL dilemma: –  opaque ids are needed for names that age and travel well –  Semantically laden ids are helpful in providing many id services •  Hybrid: –  opaque ids are used to name abstract preservation objects –  Semantic and sometimes transient extensions address components inside of objects (the set of components evolves over time anyway)