InterPlanetary Wayback (IPWB) facilitates permanence and collaboration in web archives by disseminating the contents of WARC files into the InterPlanetary File System (IPFS) network. IPFS is a peer-to-peer content-addressable file system that inherently allows deduplication and facilitates opt-in replication. IPWB splits the header and payload of WARC response records before disseminating them into IPFS to leverage deduplication, builds a CDXJ index with references to the returned IPFS hashes, and recombines the header and payload from IPFS at replay time. We also explore the possibility of an index-free, fully decentralized collaborative web archiving system as the next step.
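The split, index, and replay cycle described above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than IPWB's actual code: an in-memory dictionary keyed by SHA-256 digests stands in for IPFS, and the names `ContentStore`, `index_record`, and `replay`, as well as the exact layout of the index line, are assumptions made for the example.

```python
import hashlib
import json


class ContentStore:
    """Toy content-addressable store standing in for IPFS (assumption)."""

    def __init__(self):
        self.blocks = {}

    def add(self, data: bytes) -> str:
        # Identical content always maps to the same key, so exact
        # duplicates are stored only once.
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        return key

    def cat(self, key: str) -> bytes:
        return self.blocks[key]


def index_record(store, uri, datetime14, http_response: bytes) -> str:
    """Split an HTTP response into header and payload, store both, and
    return one CDXJ-style index line (field layout is illustrative; a
    raw URI is used as the key here for simplicity)."""
    header, _, payload = http_response.partition(b"\r\n\r\n")
    locator = f"{store.add(header)}/{store.add(payload)}"
    return f"{uri} {datetime14} {json.dumps({'locator': locator})}"


def replay(store, cdxj_line: str) -> bytes:
    """Rebuild the original HTTP response from an index line."""
    _, _, attrs = cdxj_line.split(" ", 2)
    header_key, payload_key = json.loads(attrs)["locator"].split("/")
    return store.cat(header_key) + b"\r\n\r\n" + store.cat(payload_key)


# Two captures of the same page: headers differ, payload is identical.
store = ContentStore()
body = b"<h1>Hello</h1>"
first = b"HTTP/1.1 200 OK\r\nDate: Mon, 01 Jan 2018 00:00:00 GMT\r\n\r\n" + body
second = b"HTTP/1.1 200 OK\r\nDate: Tue, 02 Jan 2018 00:00:00 GMT\r\n\r\n" + body
line1 = index_record(store, "http://example.com/", "20180101000000", first)
line2 = index_record(store, "http://example.com/", "20180102000000", second)
```

Because headers usually change between captures (dates, cookies) while payloads often do not, storing them separately lets the two captures above share a single payload block.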
InterPlanetary Wayback: The Next Step Towards Decentralized Web Archiving
1. Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson
Web Science and Digital Libraries Research Group
Old Dominion University
Norfolk, Virginia, USA
@WebSciDL
InterPlanetary Wayback
The Next Step Towards Decentralized Web Archiving
IPFS Lab Day, Decentralized Web Summit, 2018
San Francisco, CA (USA)
August 3, 2018
http://github.com/oduwsdl/ipwb
7. Why IPWB?
● Persistence of the archived web depends on the resilience of organizations
● Availability of data is subject to censorship
● Web archive files redundantly store exact duplicate content
● Lack of public participation in web archiving
● Small web archives are hard to discover
@ibnesayeed
21. Current Issues
● IPFS is permanent, but not persistent
● DHT-based IPNS is history-unaware
● CDXJ index, a critical piece of replay, is centralized
22. Persistence
● Data persistence is critical for web archiving
● A decentralized storage with sufficient replication is needed
● Memory organizations should contribute storage infrastructure
● Qri, Filecoin, IPFS-Cluster, IPFS-Sync etc. can be helpful
23. IPNS: InterPlanetary Naming System
URI IPFS Hash
http://example.org/yuri.jpg
http://example.com/style.css
http://example.com/logo.png
http://example.com/style.css
How about changes and history?
24. IPNS Blockchain
● URI → Latest hash
● URI + DateTime → A historical hash
● URI → List historical hashes with times
https://github.com/oduwsdl/IPNS-Blockchain
Owner   URI   Time  Hash  PrevBlock
Pub K1  URI1  T1    H1    1234567...
Pub K2  URI2  T2    H2    0000000...
Pub K3  URI3  T3    H3    9876543...

Owner   URI   Time  Hash  PrevBlock
Pub K1  URI1  T4    H5    5463728...
Pub K3  URI4  T5    H6    0000000...
IPNS + Blockchain + Memento
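The three lookups listed on this slide can be sketched with a toy append-only ledger. This is only an illustration under stated assumptions, not the code in the IPNS-Blockchain repository: the `Ledger` class, its method names, and the use of a SHA-256 digest of the previous record for the same URI as the `PrevBlock` pointer (all zeros when the URI is new) are hypothetical readings of the table above.

```python
import hashlib
import json
from bisect import bisect_right

GENESIS = "0" * 64  # assumed marker for "no earlier record for this URI"


class Ledger:
    """Hypothetical history-aware naming ledger (illustration only)."""

    def __init__(self):
        self.records = []  # append-only list of records
        self.by_uri = {}   # uri -> indexes into self.records, in time order

    @staticmethod
    def digest(record: dict) -> str:
        data = json.dumps(record, sort_keys=True).encode()
        return hashlib.sha256(data).hexdigest()

    def publish(self, owner, uri, time, ipfs_hash):
        history = self.by_uri.setdefault(uri, [])
        if history and self.records[history[-1]]["time"] >= time:
            raise ValueError("records for a URI must be appended in time order")
        prev = self.digest(self.records[history[-1]]) if history else GENESIS
        record = {"owner": owner, "uri": uri, "time": time,
                  "hash": ipfs_hash, "prev": prev}
        self.records.append(record)
        history.append(len(self.records) - 1)
        return record

    def latest(self, uri):
        """URI -> latest hash."""
        return self.records[self.by_uri[uri][-1]]["hash"]

    def at(self, uri, time):
        """URI + DateTime -> hash that was current at or before `time`."""
        times = [self.records[i]["time"] for i in self.by_uri[uri]]
        pos = bisect_right(times, time)
        if pos == 0:
            raise KeyError(f"no record for {uri} at or before {time}")
        return self.records[self.by_uri[uri][pos - 1]]["hash"]

    def history(self, uri):
        """URI -> list of (time, hash) pairs, oldest first."""
        return [(self.records[i]["time"], self.records[i]["hash"])
                for i in self.by_uri[uri]]


# Replaying the rows from the table, with integers standing in for T1..T5.
ledger = Ledger()
first = ledger.publish("K1", "URI1", 1, "H1")
ledger.publish("K2", "URI2", 2, "H2")
ledger.publish("K3", "URI3", 3, "H3")
second = ledger.publish("K1", "URI1", 4, "H5")
ledger.publish("K3", "URI4", 5, "H6")
```

The `at` lookup picks the closest record at or before the requested time; a Memento-style service could equally choose the nearest record in either direction.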
27. Storage Space and Time Overhead
● Reported IPFS slowness https://github.com/ipfs/go-ipfs/issues/1216
○ Has since been fixed, but we did not evaluate again
● 570 files per minute
● ~10% overhead
28. Replay Time
● 600 requests in 222 seconds
● Slower than PyWB (which took 5.26 seconds)
● File vs. rich object based retrieval
● Never expiring cache
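To put these numbers in perspective, the short calculation below derives per-request latency and the relative slowdown. It assumes the PyWB figure of 5.26 seconds was measured over the same 600 requests, which the slide implies but does not state.

```python
REQUESTS = 600
IPWB_SECONDS = 222.0
PYWB_SECONDS = 5.26  # assumed to cover the same 600 requests

ipwb_per_request = IPWB_SECONDS / REQUESTS  # 0.37 s per request
pywb_per_request = PYWB_SECONDS / REQUESTS  # just under 0.009 s per request
slowdown = IPWB_SECONDS / PYWB_SECONDS      # a little over 42x

print(f"IPWB: {ipwb_per_request * 1000:.0f} ms/request")
print(f"PyWB: {pywb_per_request * 1000:.1f} ms/request")
print(f"Slowdown: {slowdown:.0f}x")
```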
https://github.com/ibnesayeed/ipfsapi-concurrency-test
29. Future Work
● Evaluate the improved IPFS on a large dataset
● Evaluate deduplication
● Implement an index-free collaborative archiving system
● Utilize IPNS to reference URI-Rs with datetime
30. Conclusions
● A proof-of-concept system that leverages a novel approach to archiving and retrieval
● Evaluation of storage and time costs, plus qualitative analysis
● It can only work for small archives in its current state
● A path to answer “who will archive the archives?”
● More work is needed to make it a truly decentralized archiving system
31. InterPlanetary Wayback
The Next Step Towards Decentralized Web Archiving
@WebSciDL
http://github.com/oduwsdl/ipwb
Supported in part by Protocol Labs, AMF 11600663, and NSF IIS-1526700
Sawood Alam, Mat Kelly, Michele C. Weigle, Michael L. Nelson