We introduce MementoMap, a framework for expressing and disseminating the holdings of web archives (archive profiles), published either by the archives themselves or by third parties. The framework allows arbitrary, flexible, and dynamic levels of detail in its entries to fit the needs of archives of different scales. This enables Memento aggregators to significantly reduce wasted traffic to web archives.
MementoMap: An Archive Profile Dissemination Framework
1. MementoMap
An Archive Profile Dissemination Framework
Sawood Alam, Michele C. Weigle, and Michael L. Nelson
Old Dominion University, Norfolk, VA, USA
@ibnesayeed @WebSciDL
Supported by NSF Grant IIS-1526700
WADL '19, June 6, 2019, Urbana-Champaign, Illinois
9. Broadcasting is Evil
From: Michael Nelson [mailto:mln@cs.odu.edu]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one of my students wrote, but I suspect the
traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP
addr from where you're seeing the traffic? I presume the requests are for Memento
TimeMaps? It should not actually be scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
> Do you know who is running this scraper? It's not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam
<ibnesayeed@gmail.com>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has
gotten really high, and also I was asked to remove an archive due to the traffic it was
causing temporarily..
I am thinking that ability to remove source archives quickly is an important aspect of an
aggregator.
Sawood: Hopefully yours will support something like this so I don't need to restart the
container to change the archivelist ;)
Ilya
Broadcasting is wasteful, both clients & archives suffer!
12. What is Archived in Arquivo.pt? What is Accessed from MemGator?
Blind spot of a content-based profile
Blind spot of a usage-based profile
13. If Only Archives Could Tell When to Ask Them
● Websites advertise their holdings using sitemap.xml; why can’t archives?
○ Archives have billions or even hundreds of billions of URI-Ms
○ Such exhaustive lists would go stale very quickly
● How about robots.txt?
○ It is compact, but it is an exclusion format, so it does not say what the site has
○ It assumes a single domain; patterns are for paths (not the domain name)
● How about combining the two ideas?
○ Introducing MementoMap!
14. A MementoMap Example
!context ["http://oduwsdl.github.io/contexts/ukvs"]
!id {uri: "http://archive.example.org/"}
!fields {keys: ["surt"], values: ["frequency"]}
!meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
!meta {updated_at: "2018-09-03T13:27:52Z"}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
+ for a lower boundary
- for an upper boundary
~ for an approximate value
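The example above can be read mechanically. The following is a minimal sketch in Python, not the authors' implementation: it treats lines starting with "!" as headers, splits each remaining line into a SURT key and a value, and breaks the value into slash-separated counts with an optional "+", "-", or "~" qualifier. The names Count, parse_value, parse_mementomap, and lookup are hypothetical, and the lookup rule (an exact key wins, otherwise the longest wildcard key whose prefix matches) is an assumption made for illustration. The meaning of the two slash-separated numbers is not stated on the slide, so they are kept as opaque counts.

```python
import re
from typing import NamedTuple

# The sample MementoMap from the slide, copied verbatim.
SAMPLE = """\
!context ["http://oduwsdl.github.io/contexts/ukvs"]
!id {uri: "http://archive.example.org/"}
!fields {keys: ["surt"], values: ["frequency"]}
!meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
!meta {updated_at: "2018-09-03T13:27:52Z"}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
"""


class Count(NamedTuple):
    number: int
    qualifier: str  # "+" lower bound, "-" upper bound, "~" approximate, "" exact


_COUNT = re.compile(r"^(\d+)([+\-~]?)$")


def parse_value(value):
    """Split a value such as '2500~/900' into a list of Count tuples."""
    counts = []
    for part in value.split("/"):
        match = _COUNT.match(part)
        if match is None:
            raise ValueError(f"unrecognized count: {part!r}")
        counts.append(Count(int(match.group(1)), match.group(2)))
    return counts


def parse_mementomap(text):
    """Return (header lines, {surt key: [Count, ...]}) for a MementoMap text."""
    headers, entries = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("!"):
            headers.append(line)  # headers are kept as raw text in this sketch
            continue
        key, value = line.split(None, 1)
        entries[key] = parse_value(value)
    return headers, entries


def lookup(entries, surt):
    """Assumed rule: exact key first, else the longest matching wildcard key."""
    if surt in entries:
        return surt, entries[surt]
    best = None
    for key in entries:
        if key.endswith("*") and surt.startswith(key[:-1]):
            if best is None or len(key) > len(best):
                best = key
    return (best, entries[best]) if best is not None else None


if __name__ == "__main__":
    _, entries = parse_mementomap(SAMPLE)
    for surt in ("org,arxiv)/pdf/1234", "org,arxiv)/abs/1", "net,example)/"):
        print(surt, "->", lookup(entries, surt))
```

Under these assumptions, an aggregator could skip polling an archive whenever the most specific matching entry has a count of 0, as with the org,arxiv)/pdf/* entry in the example.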
21. Who Would Have Thought Arquivo.pt Has 10K+ .онлайн Sites?
“.онлайн” (encoded as “xn--80asehdb”) is an IDN gTLD that means “.online”.
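The encoded form quoted above can be reproduced with Python's built-in "idna" codec. This is only a small illustrative sketch; the variable names are hypothetical, and how the Arquivo.pt index itself stores such labels is not stated on the slide.

```python
# Convert the Cyrillic TLD label to its ASCII-compatible (Punycode) form and back.
label = "онлайн"

ace = label.encode("idna")          # bytes holding the "xn--" form of the label
round_trip = ace.decode("idna")     # back to the Unicode label

print(ace.decode("ascii"), "<->", round_trip)
```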
25. Dissemination and Discovery Methods
Well-known URI:
GET /.well-known/mementomap HTTP/1.1
Host: arquivo.pt
Link Header:
Link: <https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"
Link HTML Element:
<link href="https://arquivo.pt/path/to/mementomap.ukvs" rel="mementomap">
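A client could try the three discovery methods above roughly as follows. This is a minimal sketch using only the Python standard library and no network access: the function names well_known_url, from_link_header, and from_html are hypothetical, the Link header parser is a simplified regular expression rather than a complete header parser, and the order in which a client should try the methods is not specified on the slide.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

REL = "mementomap"

# Simplified patterns: one <target> followed by ;name=value parameters.
_LINK = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s";,]+))*)')
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s";,]+))')


def well_known_url(archive_base):
    """Build the well-known URI for an archive's base URL."""
    return urljoin(archive_base, "/.well-known/" + REL)


def from_link_header(value, rel=REL):
    """Return the first link target in a Link header whose rel includes `rel`."""
    for link in _LINK.finditer(value):
        target, params = link.group(1), link.group(2)
        for param in _PARAM.finditer(params):
            name = param.group(1).lower()
            val = param.group(2) if param.group(2) is not None else param.group(3)
            if name == "rel" and rel in val.lower().split():
                return target
    return None


class _LinkFinder(HTMLParser):
    """Collects the href of the first <link> element with the wanted rel."""

    def __init__(self, rel):
        super().__init__()
        self.rel = rel
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if self.rel in rels and attr.get("href"):
            self.href = attr["href"]


def from_html(markup, rel=REL):
    """Return the href of the first matching <link> element, or None."""
    finder = _LinkFinder(rel)
    finder.feed(markup)
    return finder.href


if __name__ == "__main__":
    print(well_known_url("https://arquivo.pt/"))
    print(from_link_header(
        '<https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap"'))
    print(from_html(
        '<link href="https://arquivo.pt/path/to/mementomap.ukvs" rel="mementomap">'))
```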
26. Future Work
● Generate blacklists by processing access logs
● Incorporate MementoMap in replay systems
● Encourage archives and aggregators to adopt it
● Encourage use of UKVS in other archival and non-archival contexts
27. Conclusions
● Described MementoMap, a flexible and efficient archive profiling framework
● Analyzed the complete index of Arquivo.pt to understand the nature of web archives
● Evaluated MementoMap against Arquivo.pt’s index
● Showed it can save 60% of the wasted MemGator traffic at a 1.5% cost (a 119 MB file)
● Proposed “mementomap” as a well-known URI suffix as well as a link relation
for dissemination of MementoMap
● Implemented a single-pass, memory-efficient, and parallelization-friendly
MementoMap generation/compaction algorithm
● Open-sourced the implementation
○ https://github.com/oduwsdl/MementoMap