Supporting Web Archiving via Web Packaging
Sawood Alam, Michele C. Weigle, Michael L. Nelson, Martin Klein, and Herbert Van de Sompel
Old Dominion University, Norfolk, Virginia, USA
Los Alamos National Laboratory, New Mexico, USA
Data Archiving and Networked Services, Netherlands
@ibnesayeed
Supported in part by the Andrew W. Mellon Foundation (AMF) grant 11600663
ESCAPE Workshop '19, July 19, 2019, Herndon, Virginia (USA)
An HTML Page With External Style Sheet and Image
https://www.cs.odu.edu/~salam/dweb/
An HTTP Response Stored as a WARC Record
[Figure: a WARC record, labeled with its WARC headers, HTTP headers, and payload]
https://iipc.github.io/warc-specifications/
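The layering named on this slide (WARC headers, then HTTP headers, then payload) can be sketched in a few lines of Python. This is a minimal illustration that assumes the WARC/1.0 response-record layout; the helper name `build_warc_response_record` and the sample response are invented for the example, and a real crawler would rely on an established WARC library instead.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri, http_response, when):
    """Wrap raw HTTP response bytes (headers + payload) in a WARC response record."""
    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    # WARC headers end with a blank line, like HTTP headers do.
    head = "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n"
    # The record block is followed by two CRLFs.
    return head + http_response + b"\r\n\r\n"


# Hypothetical captured response: HTTP headers, a blank line, then the payload.
http_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello</body></html>"
)
record = build_warc_response_record(
    "https://www.cs.odu.edu/~salam/dweb/",
    http_response,
    datetime(2019, 7, 16, 3, 57, 57, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```

The HTTP response is stored byte for byte inside the record, which is why replay systems can later serve the original headers alongside the payload.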
Crawling JavaScript-driven Deferred Representations
PhantomJS discovers 75% more resources than Heritrix, but crawls 12 times slower
Web Packaging has the potential to enable efficient and coherent crawling via Bundles
https://arxiv.org/abs/1508.02315
Archival Replay: Server-side Rewriting
https://web.archive.org/web/20190716035757/https://www.cs.odu.edu/~salam/dweb/
URI references (href, src, srcset, etc.) are rewritten to point to their archived versions at a nearby time
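A rough sketch of this kind of rewriting is shown here in Python. It is an illustration under simplifying assumptions, not how any particular archive implements replay: it only handles double-quoted `href` and `src` attributes with a regular expression (a real rewriter would use an HTML parser and also cover `srcset`, CSS, and JavaScript), and the function name `rewrite_html` is hypothetical. The URI prefix follows the pattern of the archived URL on this slide.

```python
import re
from urllib.parse import urljoin

ARCHIVE_PREFIX = "https://web.archive.org/web"  # pattern taken from the slide's URL

# Matches href="..." or src="..." (double-quoted values only, for brevity).
ATTR_RE = re.compile(r'\b(href|src)="([^"]*)"')


def rewrite_html(html, base_uri, timestamp):
    """Point href/src references at archived copies near the given timestamp."""
    def replace(match):
        attr, ref = match.group(1), match.group(2)
        absolute = urljoin(base_uri, ref)  # resolve relative references first
        return f'{attr}="{ARCHIVE_PREFIX}/{timestamp}/{absolute}"'
    return ATTR_RE.sub(replace, html)


page = '<link href="style.css"><img src="/~salam/dweb/logo.png">'
rewritten = rewrite_html(page, "https://www.cs.odu.edu/~salam/dweb/", "20190716035757")
print(rewritten)
```

The archive then resolves each rewritten URI to the capture closest to the requested timestamp, which is where the "nearby time" in the slide comes from.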
Archival Replay: Client-side Rerouting
https://oduwsdl.github.io/Reconstructive/
http://ws-dl.blogspot.com/2018/01/2018-01-08-introducing-reconstructive.html
● Avoids zombies (live-leakage)
● Adds an unobtrusive archival banner (using Custom HTML Element)
Archival Replay: Proxy-based Rerouting
http://oldweb.today/chrome/20190715110000/https://www.cs.odu.edu/~salam/dweb/
● A web browser runs on a remote host
● Configured to use a web archive proxy
● Accessed via VNC
● Does not scale well
Web Packaging can enable archival replay from original URIs locally without a replay proxy
Mementos That Never Existed on the Live Web
● Temporal Violations
● Live leakage (Zombies)
● Origin Violations
● Cookie Violations
https://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
https://ws-dl.blogspot.com/2015/12/2015-12-08-evaluating-temporal.html
https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
https://ws-dl.blogspot.com/2019/03/2019-03-18-cookie-violations-cause.html
Web Packaging has the potential to eliminate these issues
Memento TimeGate
$ curl -I https://www.w3.org/wiki/Main_Page
HTTP/2 200
date: Tue, 16 Jul 2019 03:16:01 GMT
link: <https://www.w3.org/wiki/Main_Page>; rel="original latest-version",
<https://www.w3.org/wiki/Special:TimeGate/Main_Page>; rel="timegate",
<https://www.w3.org/wiki/Special:TimeMap/Main_Page>; rel="timemap"; type="application/link-format"; from="Thu, 01 Jan 1970 00:00:00 GMT"; until="Fri, 16 Nov 2018 19:10:23 GMT",
<https://www.w3.org/wiki/index.php?title=Main_Page&oldid=30366>; rel="first memento"; datetime="Thu, 01 Jan 1970 00:00:00 GMT",
<https://www.w3.org/wiki/index.php?title=Main_Page&oldid=108148>; rel="last memento"; datetime="Fri, 16 Nov 2018 19:10:23 GMT"
content-language: en
vary: Accept-Encoding,Cookie
cache-control: s-maxage=18000, must-revalidate, max-age=0
last-modified: Mon, 15 Jul 2019 22:16:01 GMT
content-type: text/html; charset=UTF-8
$ curl -I -H "Accept-Datetime: Sat, 20 Dec 2014 12:30:00 GMT" https://www.w3.org/wiki/Special:TimeGate/Main_Page
HTTP/2 302
date: Tue, 16 Jul 2019 03:16:21 GMT
vary: Accept-Encoding,Cookie,Accept-Datetime
location: https://www.w3.org/wiki/index.php?title=Main_Page&oldid=80125
link: <https://www.w3.org/wiki/Special:TimeMap/Main_Page>; rel="timemap"; type="application/link-format"; from="Thu, 01 Jan 1970 00:00:00 GMT"; until="Fri, 16 Nov 2018 19:10:23 GMT",
<https://www.w3.org/wiki/index.php?title=Main_Page&oldid=30366>; rel="first memento"; datetime="Thu, 01 Jan 1970 00:00:00 GMT",
<https://www.w3.org/wiki/index.php?title=Main_Page&oldid=108148>; rel="last memento"; datetime="Fri, 16 Nov 2018 19:10:23 GMT",
<https://www.w3.org/wiki/Main_Page>; rel="original latest-version"
content-type: text/html; charset=UTF-8
$ curl -I "https://www.w3.org/wiki/index.php?title=Main_Page&oldid=80125"
HTTP/2 200
date: Tue, 16 Jul 2019 03:38:35 GMT
x-content-type-options: nosniff
memento-datetime: Sat, 20 Dec 2014 11:34:08 GMT
link: <https://www.w3.org/wiki/Main_Page>; rel="original latest-version",
<https://www.w3.org/wiki/Special:TimeGate/Main_Page>; rel="timegate",
<https://www.w3.org/wiki/Special:TimeMap/Main_Page>; rel="timemap"; type="application/link-format"; from="Thu, 01 Jan 1970 00:00:00 GMT"; until="Fri, 16 Nov 2018 19:10:23 GMT",
<https://www.w3.org/wiki/index.php?title=Main_Page&oldid=30366>; rel="first memento"; datetime="Thu, 01 Jan 1970 00:00:00 GMT",
<https://www.w3.org/wiki/index.php?title=Main_Page&oldid=108148>; rel="last memento"; datetime="Fri, 16 Nov 2018 19:10:23 GMT"
content-language: en
vary: Accept-Encoding,Cookie
expires: Thu, 01 Jan 1970 00:00:00 GMT
cache-control: private, must-revalidate, max-age=0
content-type: text/html; charset=UTF-8
https://tools.ietf.org/html/rfc7089
Native TimeGate support in Exchange Loading would be great
Web archives provide third-party generic TimeGate resources
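The client side of the exchange shown on this slide can be sketched in Python using only the standard library. The sketch parses a `Link` header to find the TimeGate and builds the `Accept-Datetime` value for the second request; it makes no network calls. The helper names `parse_link_header` and `find_rel` are hypothetical, and the regular expressions assume every parameter value is double-quoted, as in the responses above.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# One link: <URI> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Return one dict per link, holding its target URI and quoted parameters."""
    links = []
    for uri, raw_params in LINK_RE.findall(value):
        params = dict(PARAM_RE.findall(raw_params))
        params["uri"] = uri
        links.append(params)
    return links


def find_rel(links, rel):
    """Return the URI of the first link whose space-separated rel contains `rel`."""
    for link in links:
        if rel in link.get("rel", "").split():
            return link["uri"]
    return None


# Abbreviated copy of the Link header from the first response on this slide.
link_header = (
    '<https://www.w3.org/wiki/Main_Page>; rel="original latest-version", '
    '<https://www.w3.org/wiki/Special:TimeGate/Main_Page>; rel="timegate", '
    '<https://www.w3.org/wiki/index.php?title=Main_Page&oldid=108148>; '
    'rel="last memento"; datetime="Fri, 16 Nov 2018 19:10:23 GMT"'
)
links = parse_link_header(link_header)
timegate = find_rel(links, "timegate")

# Build the Accept-Datetime value sent to the TimeGate in the second request.
wanted = datetime(2014, 12, 20, 12, 30, tzinfo=timezone.utc)
accept_datetime = format_datetime(wanted, usegmt=True)

print(timegate)
print("Accept-Datetime:", accept_datetime)
```

Splitting the header on commas would not work here, because the quoted datetime values contain commas themselves; matching each `<URI>` with its parameters avoids that problem.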
Namespaced Cache for Temporal Coherence
https://developer.mozilla.org/en-US/docs/Web/API/Cache
● A Distributor provides a custom cache namespace with Bundles to stash them in
● Multiple related Bundles can share the same namespace
● A configurable policy determines how the cache is used for Loading
● Archive origins configure the Loading policy to load resources only from the bundles they delivered, which prevents loading zombie resources
Enable sandboxing via namespaced caches for security and coherence
Namespaced Bundle Cache: 20190718154532
Content-Type: application/webbundle
Bundle-Cache-Name: 20190718154532
Exchange-Loading-Policy: bundle-only
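The proposed behavior can be modeled with a small in-memory sketch in Python. Everything here is hypothetical: `Bundle-Cache-Name` and `Exchange-Loading-Policy` are headers proposed on this slide rather than an implemented browser feature, and the class, the exception, and the `allow-live` policy value are names invented for the illustration.

```python
class ZombieResourceError(Exception):
    """Raised when a bundle-only load would have to fall back to the live web."""


class NamespacedBundleCache:
    """Toy model of a cache partitioned by Bundle-Cache-Name."""

    def __init__(self):
        self._namespaces = {}  # cache name -> {url: response body}

    def stash(self, cache_name, exchanges):
        """Store a bundle's exchanges; related bundles may share a cache name."""
        self._namespaces.setdefault(cache_name, {}).update(exchanges)

    def load(self, url, cache_name, policy, fetch_live=None):
        """Resolve a URL according to the loading policy."""
        cached = self._namespaces.get(cache_name, {})
        if url in cached:
            return cached[url]
        if policy == "bundle-only" or fetch_live is None:
            # Refuse to leak to the live web: this would be a zombie resource.
            raise ZombieResourceError(f"{url} is not in cache {cache_name!r}")
        return fetch_live(url)


cache = NamespacedBundleCache()
# Two related bundles delivered by the archive share one namespace.
cache.stash("20190718154532", {"https://example.org/": b"<html>archived</html>"})
cache.stash("20190718154532", {"https://example.org/style.css": b"body {}"})

print(cache.load("https://example.org/style.css", "20190718154532", "bundle-only"))
try:
    cache.load("https://example.org/ad.js", "20190718154532", "bundle-only")
except ZombieResourceError as err:
    print("blocked:", err)
```

Keeping each capture in its own namespace also means a resource from a different capture time cannot satisfy the request by accident, which is the temporal-coherence half of the argument.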
Web Archives Currently Lack Technical Means to Prove Fixity and Non-repudiation
https://twitter.com/Jamie_Maz/status/936349041264414721
http://blog.archive.org/2018/04/24/addressing-recent-claims-of-manipulated-blog-posts-in-the-wayback-machine/
https://ws-dl.blogspot.com/2018/04/2018-04-24-why-we-need-multiple-web.html
● Joy Ann Reid claimed that copies of her blog in the Internet Archive had been hacked
● The Internet Archive publicly denied it
● We investigated the matter using multiple web archives and concluded that Reid’s claim was very unlikely
There is a strong need for verifiable web archiving
Temporal Validation of Signed Exchanges
http://www.certificate-transparency.org/
● Signed exchanges contain a “Date” response header
● An archive returns a “Memento-Datetime” header when delivering an HTTP Bundle
● Validate the signature in the context of that historical time
○ The difference between the two times should be within an acceptable margin
● Use means such as Certificate Transparency logs to establish temporal validation
A means to establish that the signature would have been “temporally valid” at a given time in the past
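The checks listed on this slide can be sketched as a single Python function. This is an illustration of the idea only, under stated assumptions: the function name, the one-hour default margin, and the idea of passing in a certificate validity window (as might be recovered from Certificate Transparency logs) are all hypothetical, and no cryptographic signature verification is performed here.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def temporally_valid(date_header, memento_datetime_header,
                     cert_not_before, cert_not_after,
                     margin=timedelta(hours=1)):
    """Check that a signed exchange is plausible at its claimed archival time."""
    signed_at = parsedate_to_datetime(date_header)
    archived_at = parsedate_to_datetime(memento_datetime_header)
    # 1. The signing time and the archival time should roughly agree.
    close_enough = abs(archived_at - signed_at) <= margin
    # 2. The certificate must have been valid when the exchange was signed.
    cert_was_valid = cert_not_before <= signed_at <= cert_not_after
    return close_enough and cert_was_valid


# Hypothetical values: a capture made about six minutes after signing.
ok = temporally_valid(
    "Sat, 20 Dec 2014 11:34:08 GMT",
    "Sat, 20 Dec 2014 11:40:00 GMT",
    cert_not_before=datetime(2014, 6, 1, tzinfo=timezone.utc),
    cert_not_after=datetime(2015, 6, 1, tzinfo=timezone.utc),
)
print(ok)
```

The point of evaluating the certificate window against the historical time rather than the current time is that certificates expire long before archived content stops being of interest.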
Native Memento Support and Acknowledgement
https://www.slideshare.net/mweigle/enabling-personal-use-of-web-archives
● Visually acknowledge presence of “Memento-Datetime” header
● Surface archival metadata from “Link” header
● Enable the security features needed for the archived web, which is read-only
● Prevent live-leakage
Conclusions
Web Packaging has a unique opportunity to devise a technology that supports web archiving and provides a much-needed capability to verify the integrity of archived web resources
● Effective Bundled HTTP Exchanges: efficient and coherent crawling
● TimeGate/Named Cache for Loading: temporally coherent replay
● Temporal validation of Signed Exchanges: long-term fixity and non-repudiation
● Native Memento support in web browsers: visual identification of archived resources
https://arxiv.org/abs/1906.07104
