Extended version of slides presented at the "404/File Not Found" symposium held at Georgetown University on October 24 2014, see http://www.law.georgetown.edu/library/404/ . The presentation provides a brief overview of the link/reference rot problem and then discusses three complementary strategies to combat it: pro-actively capturing web resources that are linked from a seed collection; referencing the captures by means of annotated links; and accessing the captures using Memento infrastructure.
1. Creating Pockets of Persistence
Herbert Van de Sompel
@hvdsomp
http://public.lanl.gov/herbertv/
Los Alamos National Laboratory
Acknowledgements:
Michael L. Nelson
@phonedude_mln
Old Dominion University
Herbert Van de Sompel
404/File Not Found, Washington, DC, October 24 2014
2. Addressing the Link/Reference Rot Challenge
• Pockets of Persistence
• Capture – Archive Pro-Actively, Selectively
• Reference – Annotate Links
• Access – Travel in Time
3. Pockets of Persistence
How to achieve the ability to:
• Persistently
• Precisely
• Seamlessly
revisit the Web of the Past
and the Web of the Now at
some point in the Future
4. Pockets of Persistence
How to achieve the ability to:
• Persistently
• Precisely
• Seamlessly
revisit the Web of the Past
and the Web of the Now at
some point in the Future
Two components to the link/reference rot challenge:
• Link rot: Links stop working, aka 404 Not Found
• Content drift: Referenced content changes over time
5. Illustration
Current version of http://en.wikipedia.org/wiki/Coil_(band) on October 22 2014
6. Illustration – Link Rot
Current version of http://en.wikipedia.org/wiki/Coil_(band) on October 22 2014
7. Illustration – Link Rot
Current version of http://liarsociety.tripod.com/blog/index.blog?from=20041130 on October 22 2014
8. Illustration – Content Drift
Version of http://en.wikipedia.org/wiki/Coil_(band) dated October 2 2010
http://en.wikipedia.org/w/index.php?title=Coil_(band)&oldid=388321480
9. Illustration – Content Drift
Current version of http://en.wikipedia.org/wiki/Peter_Christopherson on October 22 2014
10. Illustration – Content Drift
Version of http://en.wikipedia.org/wiki/Peter_Christopherson that was current on October 2 2010
http://en.wikipedia.org/w/index.php?title=Peter_Christopherson&oldid=387987414
11. Pockets of Persistence
How to achieve the ability to:
• Persistently
• Precisely
• Seamlessly
revisit the Web of the Past
and the Web of the Now at
some point in the Future
This challenge exists for the entire web, but some communities actually care about addressing it:
• scholarly communication,
• legal publications,
• journalism,
• Wikipedia,
• …
Mobilize the communities that care about this problem to work towards joint, interoperable solutions and approaches
12. Addressing the Link/Reference Rot Challenge
• Pockets of Persistence
• Capture – Archive Pro-Actively, Selectively
• Reference – Annotate Links
• Access – Travel in Time
13. Pro-Active Capture for a Seed Collection
• Seed Collection - Starting point for capture is a seed collection of interest to communities that care, e.g.
o On-Line journalism
o Scholarly literature
o Legal documents
o Wikipedia articles
• Lifecycle Events – Intervene at critical moments in the lifecycle of items in these collections to pro-actively capture
o Collection items – some solutions in place
o Web resources referenced in collection items
14. Pro-Active Capture for Seed Collection
• What those crucial lifecycle events are may depend on the collection type (e.g. Scholarly Literature, Wikipedia):
o Creation of new article
o Creation of new version of article
o Creation of substantially new version of article
o Addition of external reference to article
o References to article exceed a certain threshold
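One of the lifecycle events listed on this slide, the addition of an external reference to an article, could be detected by comparing the links found in two versions of that article. The following is a minimal sketch, assuming both versions are available as HTML strings and treating any absolute http(s) link as an external reference. The class and function names are hypothetical and are not part of any tool mentioned in the slides.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects absolute http(s) link targets from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: only absolute http(s) links count as external references.
        if href and href.startswith(("http://", "https://")):
            self.links.add(href)


def external_links(html):
    """Return the set of absolute link targets found in an HTML string."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def newly_added_references(old_html, new_html):
    """Return links present in the new version but absent from the old one."""
    return external_links(new_html) - external_links(old_html)
```

The returned set is the list of candidate URIs to hand to an on-demand capture service at the moment the new version is created.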
15. Authoring Legal Documents – perma.cc
http://perma.cc
16. Authoring Scholarly Literature: Experimental Zotero Extension
Richard Wincewicz (2014) Prototype Hiberlink plugin for Zotero for pro-active archiving and temporal references
https://www.youtube.com/v/ZYmi_Ydr65M%26vq
17. Submitting Scholarly Literature: Experimental HiberActive Service
Martin Klein et al. (2014) HiberActive: Pro-Active Archiving of web references from scholarly articles
Open Repositories 2014 http://www.slideshare.net/martinklein0815/hiberactive
18. Pro-Active Capture for Seed Collection
• Interoperability for on-demand capture:
o Need basic interoperability for machine-driven on-demand capture:
- Discovery of capture interface
- Interface IN - [ Original URI ]
- Interface OUT - [ URI of Capture ; Capture Datetime ]
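The IN/OUT contract above can be written down as a small type signature. The sketch below is only an illustration of that contract: `CaptureResult`, `CaptureService` and `InMemoryArchive` are hypothetical names, the toy archive does not fetch anything, and discovery of the capture interface is not covered.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Protocol


@dataclass(frozen=True)
class CaptureResult:
    """Interface OUT: [ URI of Capture ; Capture Datetime ]."""
    capture_uri: str
    capture_datetime: datetime


class CaptureService(Protocol):
    """Interface IN: [ Original URI ]."""

    def capture(self, original_uri: str) -> CaptureResult: ...


class InMemoryArchive:
    """Toy stand-in for an archive (hypothetical); it records, but does not fetch."""

    def __init__(self, base_uri, clock=lambda: datetime.now(timezone.utc)):
        self.base_uri = base_uri.rstrip("/")
        self._clock = clock
        self.captures = {}

    def capture(self, original_uri):
        now = self._clock()
        # Assumption: the capture URI embeds a 14-digit datetime and the original URI.
        stamp = now.strftime("%Y%m%d%H%M%S")
        result = CaptureResult(f"{self.base_uri}/{stamp}/{original_uri}", now)
        self.captures[result.capture_uri] = result
        return result
```

Any client written against `CaptureService` would work unchanged with every archive that honors the same input and output, which is the point of the interoperability requirement.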
19. Addressing the Link/Reference Rot Challenge
• Pockets of Persistence
• Capture – Archive Pro-Actively, Selectively
• Reference – Annotate Links
• Access – Travel in Time
20. Reference Captures and Annotate Links
• Existing practice for linking to captures:
o Link to URI of Capture
o Lose Original URI
o Lose Capture Datetime
• Problems with existing practice:
o Impossible to visit the original URI, if desired
o Requires the permanent existence/uptime of the archive that holds the capture
- One link rot problem replaced by another
Van de Sompel, H. et al. (2013) Thoughts on referencing, linking, reference rot
http://mementoweb.org/missing-link/
21. Permanent Existence/Uptime of Archives?
Capture of http://webcitation.org dated July 17 2013
https://archive.today/eAETp
22. Permanent Existence/Uptime of Archives?
http://webcitation.org/ on August 6 2014
23. Permanent Existence/Uptime of Archives?
Remnant of discontinued web archive http://mummify.it captured on February 14 2014
https://web.archive.org/web/20140214233752/https://www.mummify.it/
24. Permanent Existence/Uptime of Archives?
http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html
25. Hacking Original URI, Capture Datetime from Capture URI?
URI of Capture | Original URI | Datetime T
https://web.archive.org/web/20140214233752/https://www.mummify.it | yes | yes
https://archive.today/eAETp | no | no
http://perma.cc/4RH7-999Q?type=source | no | no
http://en.wikipedia.org/w/index.php?title=Coil_(band)&oldid=388321480 | no | no
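The web.archive.org row is the only one where both values can be read off the capture URI. A minimal sketch of that "hack" follows, assuming the plain URI pattern seen in that row (a 14-digit datetime followed by the original URI) and assuming the datetime is expressed in UTC. The function name is hypothetical, and other URI variants an archive may use are not handled.

```python
import re
from datetime import datetime, timezone

# Pattern of the web.archive.org capture URI shown on the slide:
#   https://web.archive.org/web/<14-digit datetime>/<original URI>
_WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def hack_capture_uri(capture_uri):
    """Return (original_uri, capture_datetime), or None if nothing can be derived."""
    match = _WAYBACK.match(capture_uri)
    if match is None:
        # Opaque capture URIs (e.g. archive.today, perma.cc) reveal neither value.
        return None
    stamp, original_uri = match.groups()
    capture_datetime = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc
    )
    return original_uri, capture_datetime
```

That the function returns `None` for most rows is the argument of the slide: link annotation should carry the Original URI and Datetime explicitly rather than rely on archive-specific URI patterns.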
26. Using Capture URI to find Captures in Other Web Archives?
27. Using Capture URI to find Captures in Other Web Archives?
28. Reference Captures and Annotate Links
• Desired practice for linking to captures is to annotate the link so it conveys:
- URI of Capture
- Original URI
- Capture Datetime
• Link annotation supports fallback to other archives:
o Original URI allows finding captures in all web archives
o Capture Datetime allows finding an appropriate capture in all web archives
o Original URI and Capture Datetime together allow automatic access to an appropriate capture in all web archives (see Access)
Van de Sompel, H. et al. (2013) Thoughts on referencing, linking, reference rot
http://mementoweb.org/missing-link/
29. Reference Captures and Annotate Links
• Desired practice for linking to captures is to annotate the link so it conveys:
- URI of Capture
- Original URI
- Capture Datetime
30. Reference Captures and Annotate Links
• Interoperability for link annotation:
o Need an approach to convey, in a uniform, machine-actionable way:
- URI of Capture
- Original URI
- Capture Datetime
• Ongoing efforts:
o Missing Link Proposal
- http://mementoweb.org/missing-link/
o W3C Robustness and Archiving Community Group
- http://www.w3.org/community/irobar/
31. Missing Link Proposal
<a href="http://liarsociety.tripod.com/blog/index.blog?from=20041130"
data-versionurl="https://archive.today/ElCHn"
data-versiondate="2008-02-06T00:00:00Z">
• href: Original URI
• data-versionurl: URI of Capture
• data-versiondate: Capture Datetime
Van de Sompel, H. et al. (2013) Thoughts on referencing, linking, reference rot
http://mementoweb.org/missing-link/
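Because the annotation uses ordinary HTML attributes, a machine can read it with a standard HTML parser. The sketch below uses Python's `html.parser` to extract the three pieces of information from links that carry the `data-versionurl` and/or `data-versiondate` attributes shown on this slide. The class name, function name and dictionary keys are hypothetical, and the datetime is kept as the string found in the attribute rather than parsed.

```python
from html.parser import HTMLParser


class AnnotatedLinkParser(HTMLParser):
    """Collects links annotated with data-versionurl / data-versiondate."""

    def __init__(self):
        super().__init__()
        self.annotated_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        version_url = attributes.get("data-versionurl")
        version_date = attributes.get("data-versiondate")
        # Keep only links that carry at least one of the two annotations.
        if href and (version_url or version_date):
            self.annotated_links.append(
                {
                    "original_uri": href,
                    "capture_uri": version_url,
                    "capture_datetime": version_date,
                }
            )


def extract_annotated_links(html):
    """Return one dict per annotated link found in the HTML string."""
    parser = AnnotatedLinkParser()
    parser.feed(html)
    parser.close()
    return parser.annotated_links
```

Links without either annotation are skipped, so the same function can be run over any page to find the references that are ready for time travel.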
32. Addressing the Link/Reference Rot Challenge
• Pockets of Persistence
• Capture – Archive Pro-Actively, Selectively
• Reference – Annotate Links
• Access – Travel in Time
33. Memento Web Time Travel
Use the Original URI
Current version of http://law.georgetown.edu/library/404/ on October 22 2014
34. Memento Web Time Travel
And a Datetime
35. Memento Web Time Travel
To automatically retrieve the temporally nearest available capture
Capture of http://law.georgetown.edu/library/404/ dated May 3 2014
http://wayback.archive-it.org/all/20140503094327/http://www.law.georgetown.edu/library/404/
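In the Memento protocol, the desired datetime is conveyed to a TimeGate in an `Accept-Datetime` request header, and the TimeGate selects a capture that is temporally close to it. The sketch below illustrates both ideas without any network access: building the header value, and picking the temporally nearest entry from a list of capture datetimes. The function names are hypothetical, timezone-aware datetimes are assumed, and the selection rule is a simplification of what a real TimeGate may do.

```python
from datetime import timezone
from email.utils import format_datetime


def accept_datetime_header(moment):
    """Build the Accept-Datetime request header for a timezone-aware datetime."""
    as_utc = moment.astimezone(timezone.utc)
    # usegmt=True yields the HTTP date form, e.g. "Fri, 24 Oct 2014 12:00:00 GMT".
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


def nearest_capture(capture_datetimes, desired):
    """Return the capture datetime with the smallest distance to the desired one."""
    return min(capture_datetimes, key=lambda captured: abs(captured - desired))
```

A client would send the header together with the Original URI to a TimeGate and follow the response to the selected capture.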
36. Memento Web Time Travel
http://bit.ly/memento-for-chrome
http://mementoweb.org
37. Travel in Time - Persistently, Precisely, Seamlessly
On-Demand Capture | URI of Capture | Original URI | Datetime T
Available, Accessible | + | - | -
• Time Travel is:
• Persistent – See next slide
• Precise – Following link to URI of Capture retrieves exact capture
• Seamless – Requires clicking a link as usual
38. Travel in Time - Persistently, Precisely, Seamlessly
On-Demand Capture | URI of Capture | Original URI | Datetime T
Available, Not Accessible | + | - | -
• Time Travel is:
• Persistent – Following link to URI of Capture leads nowhere
• Precise – Following link to URI of Capture leads nowhere
• Seamless – Following link to URI of Capture leads nowhere
39. Travel in Time - Persistently, Precisely, Seamlessly
On-Demand Capture | URI of Capture | Original URI | Datetime T
Available, Not Accessible | + | + | +
• Time Travel is:
• Persistent – Using Memento with [ Original URI ; Datetime ] works across web archives, versioning systems
• Precise – Using Memento with [ Original URI ; Datetime ] retrieves nearest capture from other archive
• Seamless – Requires browser plugin
40. Travel in Time - Persistently, Precisely, Seamlessly
On-Demand Capture | URI of Capture | Original URI | Datetime T
Available, Accessible | - | + | +
• Time Travel is:
• Persistent – Using Memento with [ Original URI ; Datetime ] works across web archives, versioning systems
• Precise – Using Memento with [ Original URI ; Datetime ] retrieves exact capture from other archive
• Seamless – Requires browser plugin
41. Travel in Time - Persistently, Precisely, Seamlessly
On-Demand Capture | URI of Capture | Original URI | Datetime T
Not Available | - | + | +
• Time Travel is:
• Persistent – Using Memento with [ Original URI ; Datetime ] works across web archives, versioning systems
• Precise – Using Memento with [ Original URI ; Datetime ] retrieves nearest capture from other archive
• Seamless – Requires browser plugin
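The five scenarios on the preceding slides can be condensed into one resolution rule: follow the URI of Capture if the link has one and it is accessible, otherwise fall back to Memento with [ Original URI ; Datetime ] if the link conveys both. The sketch below is one possible reading of those slides, not a description of an existing tool. `AnnotatedLink`, `resolve`, and the two injected callables (`is_accessible`, `time_travel`) are hypothetical names, and the callables stand in for real HTTP and Memento lookups.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AnnotatedLink:
    """What a link conveys; None corresponds to '-' in the slide tables."""
    original_uri: Optional[str] = None
    capture_uri: Optional[str] = None
    capture_datetime: Optional[str] = None


def resolve(
    link: AnnotatedLink,
    is_accessible: Callable[[str], bool],
    time_travel: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Return the URI of a capture to visit, or None if the link leads nowhere."""
    # Precise and seamless case: the referenced capture itself is reachable.
    if link.capture_uri and is_accessible(link.capture_uri):
        return link.capture_uri
    # Fallback: any archive may hold a capture near the requested datetime.
    if link.original_uri and link.capture_datetime:
        return time_travel(link.original_uri, link.capture_datetime)
    # Only a dead capture URI was conveyed: nothing left to try.
    return None
```

Keeping the accessibility check and the Memento lookup as parameters makes the rule independent of any particular archive or client library.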
42. Reference Captures and Annotate Links
• Interoperability for time travel:
o Memento protocol specifies interoperability across web archives, version management systems
o Memento protocol is supported by major web archives
o Need to work towards Memento support by version management systems
o Need to work towards making Memento experience seamless through native browser support
o Need to work towards robustness and sustainability of Memento infrastructure
43. Conclusion
• Significant technical solutions, infrastructure, ideas exist to address the link rot/reference rot challenge
• Mobilize the communities that care about this challenge to work towards joint, interoperable approaches
44. Creating Pockets of Persistence
http://mementoweb.org
http://hiberlink.org