SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
MementoMap
An Archive Profile Dissemination Framework
Sawood Alam, Michele C. Weigle, and Michael L. Nelson
Old Dominion University, Norfolk, VA, USA
@ibnesayeed @WebSciDL
Supported by NSF Grant IIS-1526700
WADL '19, June 6, 2019, Urbana-Champaign, Illinois
@ibnesayeed 2
$ memgator -a archives.json -f cdxj example.com 
> | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr
198014 web.archive.org
13548 wayback.archive-it.org
1191 webarchive.loc.gov
1044 swap.stanford.edu
953 arquivo.pt
525 wayback.vefsafn.is
225 perma-archives.org
221 archive.md
23 www.webarchive.org.uk
$ memgator -a archives.json -f cdxj jcdl.org 
> | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr
410 web.archive.org
2 www.webarchive.org.uk
2 arquivo.pt
1 archive.md
Cross-archive Memento Lookup With MemGator
https://github.com/oduwsdl/MemGator
@ibnesayeed
Memento Aggregator
3
@ibnesayeed
Memento Aggregator
4
@ibnesayeed
Memento Aggregator
5
@ibnesayeed
Memento Aggregator
6
@ibnesayeed
Memento Aggregator
7
@ibnesayeed
Memento Aggregator
8
@ibnesayeed
Broadcasting is Evil
9
From: Michael Nelson [mailto:mln@cs.odu.edu]
Sent: Wednesday, December 02, 2015 12:33 PM
To: Jones, Gina
Cc: Rourke, Patrick; Grotke, Abigail
Subject: Re: WebSciDL
Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the
traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP
addr from where you're seeing the traffic? I presume the requests are for Memento
TimeMaps? It should not being actually scraping HTML pages.
regards,
Michael
On Wed, 2 Dec 2015, Jones, Gina wrote:
> Hi Michael, we have a slight configuration issue with the current OW
> set up for our webarchives. I think, from looking at the logs, that
> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.
> Do you know who is running this scraper? Itʼs not part of memento is it?
>
> Gina Jones
> Web Archiving Team
> Library of Congress
From: Ilya Kreymer <ikreymer@gmail.com>
Date: Wed, 2 Dec 2015 10:33:56 -0800
Subject: high traffic on oldweb!
To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam
<ibnesayeed@gmail.com>
Hi Herbert, Sawood,
Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has
gotten really high, and also I was asked to remove an archive due to the traffic it was
causing temporarily..
I am thinking that ability to remove source archives quickly is an important aspect of an
aggregator.
Sawood: Hopefully yours will support something like this so I don't need to restart the
container to change the archivelist ;)
Ilya
Broadcasting is wasteful, both clients & archives suffer!
@ibnesayeed
Memento Lookup Routing
10
Let’s fix the broadcasting issue
with a more informed routing.
@ibnesayeed
MemGator Log Responses from Various Archives
11
93% of the requests
made from MemGator
to upstream archives
were wasteful.
@ibnesayeed
What is Archived in Arquivo.pt?
What is Accessed from MemGator?
12
Blind spot of a
content-based profile
Blind spot of a
usage-based profile
@ibnesayeed
If Only Archives Could Tell When to Ask Them
● Websites advertise their holdings using sitemap.xml, why can’t archives?
○ Archives have billions or even hundreds of billions URI-Ms
○ Such exhaustive lists would go stale very quickly
● How about robots.txt?
○ It is compact, but is exclusion format, it does not tell what the site has
○ It assumes a single domain, patterns are for paths (not the domain name)
● How about combining the two ideas?
○ Introducing MementoMap!
13
@ibnesayeed
A MementoMap Example
14
!context ["http://oduwsdl.github.io/contexts/ukvs"]
!id {uri: "http://archive.example.org/"}
!fields {keys: ["surt"], values: ["frequency"]}
!meta {type: "MementoMap", name: "A Test Web Archive", year: 1996}
!meta {updated_at: "2018-09-03T13:27:52Z"}
* 54321/20000
com,* 10000+
org,arxiv)/ 100
org,arxiv)/* 2500~/900
org,arxiv)/pdf/* 0
uk,co,bbc)/images/* 300+/20-
+ for a lower boundary
- for an upper boundary
~ for an approximate value
@ibnesayeed
Unified Key Value Store (UKVS)
15
https://github.com/oduwsdl/ORS/blob/master/ukvs.md
@ibnesayeed
UKVS Optional Fields vs. Key Columns
16
@ibnesayeed
UKVS Use Cases
● MementoMap
● CDX File/Server
● Archive ACL
● Archive Fixity
● Extended TimeMap
● … and many more
17
@ibnesayeed
SURTs Representation with Wildcard
18
Original SURTs did not have wildcards.
In practice the common “http://(” prefix
is removed.
@ibnesayeed
Arquivo.pt Index Statistics
19
The Internet archive is
about 150 times bigger
than Arquivo.pt.
@ibnesayeed
Shape of HxPx Key Tree of Arquivo.pt
20
@ibnesayeed
Who Would have Thought
Arquivo.pt has 10K+ .онлайн Sites?
21
“.онлайн”
(encoded as “xn--80asehdb”)
is an IDN gTLD which means
“.online”
@ibnesayeed
Most Archived URI-Rs in Arquivo.pt
22
Arquivo is obsessed with transparent single pixel images and corner graphics.
@ibnesayeed
Processed Lines vs. Compacted MementoMap Growth
23
com,example)/a/1/x
com,example)/a/2
com,example)/a/3
com,example)/b/1
com,example)/b/2
com,example)/c/1
com,example)/a/*
com,example)/b/1
com,example)/b/2
com,example)/c/1
com,example)/*
@ibnesayeed
MementoMap Generation, Compaction, and Lookup
24
1.5% Relative Cost yields 60% Accuracy.
Arquivo.pt can save 60% waisted traffic
by publishing 119MB summary file!
@ibnesayeed
Dissemination and Discovery Methods
25
GET /.well-known/mementomap HTTP/1.1
Host: arquivo.pt
Link: <https://arquivo.pt/path/to/mementomap.ukvs>;
rel="mementomap"
<link href="https://arquivo.pt/path/to/mementomap.ukvs"
rel="mementomap">
Well-known URI
Link Header
Link HTML Element
@ibnesayeed
Future Work
● Generate blacklists by processing access logs
● Incorporate MementoMap in replay systems
● Encourage archives and aggregators to adopt it
● Encourage use of UKVS in other archival and non-archival contexts
26
@ibnesayeed
Conclusions
● Described MementoMap - a flexible and efficient archive profiling framework
● Analyzed complete index of Arquivo.pt to understand nature of web archives
● Evaluated MementoMap against Arquivo.pt’s index
● Save 60% of the wasted MemGator traffic with 1.5% cost (a 119 MB file)
● Proposed “mementomap” as a well-known URI suffix as well as a link relation
for dissemination of MementoMap
● Implemented a single-pass, memory-efficient, and parallelization-friendly
MementoMap generation/compaction algorithm
● Open-sourced the implementation
○ https://github.com/oduwsdl/MementoMap
27

Contenu connexe

Similaire à MementoMap: An Archive Profile Dissemination Framework

Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSawood Alam
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingMichael Nelson
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital PreservationMat Kelly
 
Introducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupIntroducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupSawood Alam
 
Metadata - Linked Data
Metadata - Linked DataMetadata - Linked Data
Metadata - Linked DataRichard Wallis
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesSawood Alam
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesMichael Nelson
 
Evolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic WebEvolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic WebAnkit Solanki
 
Open (linked) bibliographic data edmund chamberlain (university of cambridge)
Open (linked) bibliographic data   edmund chamberlain (university of cambridge)Open (linked) bibliographic data   edmund chamberlain (university of cambridge)
Open (linked) bibliographic data edmund chamberlain (university of cambridge)RDTF-Discovery
 
Open (linked) bibliographic data
Open (linked) bibliographic dataOpen (linked) bibliographic data
Open (linked) bibliographic dataEdmund Chamberlain
 
web 2.0, library systems and the library system
web 2.0, library systems and the library systemweb 2.0, library systems and the library system
web 2.0, library systems and the library systemlisld
 
Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked DataRichard Wallis
 
JCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingJCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingSawood Alam
 
Detecting Off-Topic Web Pages at #CUWARC
Detecting Off-Topic Web Pages at #CUWARCDetecting Off-Topic Web Pages at #CUWARC
Detecting Off-Topic Web Pages at #CUWARCMichele Weigle
 
RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)Daniele Dell'Aglio
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingSawood Alam
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better Michael Nelson
 
Forging New Links: Libraries in the Semantic Web
Forging New Links: Libraries in the Semantic WebForging New Links: Libraries in the Semantic Web
Forging New Links: Libraries in the Semantic WebGillian Byrne
 

Similaire à MementoMap: An Archive Profile Dissemination Framework (20)

Summarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMapSummarize Your Archival Holdings With MementoMap
Summarize Your Archival Holdings With MementoMap
 
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web ArchivingWho Will Archive the Archives? Thoughts About the Future of Web Archiving
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
Introducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research GroupIntroducing Web Archiving and WSDL Research Group
Introducing Web Archiving and WSDL Research Group
 
Metadata - Linked Data
Metadata - Linked DataMetadata - Linked Data
Metadata - Linked Data
 
Readying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web BundlesReadying Web Archives to Consume and Leverage Web Bundles
Readying Web Archives to Consume and Leverage Web Bundles
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web PagesBlockchain Can Not Be Used To Verify Replayed Archived Web Pages
Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
 
Evolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic WebEvolutionary & Swarm Computing for the Semantic Web
Evolutionary & Swarm Computing for the Semantic Web
 
Open (linked) bibliographic data edmund chamberlain (university of cambridge)
Open (linked) bibliographic data   edmund chamberlain (university of cambridge)Open (linked) bibliographic data   edmund chamberlain (university of cambridge)
Open (linked) bibliographic data edmund chamberlain (university of cambridge)
 
Open (linked) bibliographic data
Open (linked) bibliographic dataOpen (linked) bibliographic data
Open (linked) bibliographic data
 
web 2.0, library systems and the library system
web 2.0, library systems and the library systemweb 2.0, library systems and the library system
web 2.0, library systems and the library system
 
Metadata / Linked Data
Metadata / Linked DataMetadata / Linked Data
Metadata / Linked Data
 
JCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive ProfilingJCDL 2016 Doctoral Consortium - Web Archive Profiling
JCDL 2016 Doctoral Consortium - Web Archive Profiling
 
Detecting Off-Topic Web Pages at #CUWARC
Detecting Off-Topic Web Pages at #CUWARCDetecting Off-Topic Web Pages at #CUWARC
Detecting Off-Topic Web Pages at #CUWARC
 
RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)RDF Stream Processing Models (SR4LD2013)
RDF Stream Processing Models (SR4LD2013)
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive Profiling
 
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
November 19, 2014 NISO Virtual Conference: Can't We All Work Together?: Inter...
 
More Archives, More Better
More Archives, More Better More Archives, More Better
More Archives, More Better
 
Forging New Links: Libraries in the Semantic Web
Forging New Links: Libraries in the Semantic WebForging New Links: Libraries in the Semantic Web
Forging New Links: Libraries in the Semantic Web
 

Plus de Sawood Alam

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesSawood Alam
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsSawood Alam
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineSawood Alam
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingSawood Alam
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSawood Alam
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesSawood Alam
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkSawood Alam
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File FormatSawood Alam
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoSawood Alam
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationSawood Alam
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerSawood Alam
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerSawood Alam
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesSawood Alam
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchSawood Alam
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionSawood Alam
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesSawood Alam
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web ArchivesSawood Alam
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Sawood Alam
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Sawood Alam
 
Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Sawood Alam
 

Plus de Sawood Alam (20)

TrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web PagesTrendMachine: Temporal Resilience of Web Pages
TrendMachine: Temporal Resilience of Web Pages
 
CDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection InsightsCDX Summary: Web Archival Collection Insights
CDX Summary: Web Archival Collection Insights
 
Video Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback MachineVideo Archiving and Playback in the Wayback Machine
Video Archiving and Playback in the Wayback Machine
 
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento RoutingMementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing
 
Supporting Web Archiving via Web Packaging
Supporting Web Archiving via Web PackagingSupporting Web Archiving via Web Packaging
Supporting Web Archiving via Web Packaging
 
Impact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web ArchivesImpact of HTTP Cookie Violations in Web Archives
Impact of HTTP Cookie Violations in Web Archives
 
Archive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification FrameworkArchive Assisted Archival Fixity Verification Framework
Archive Assisted Archival Fixity Verification Framework
 
Web ARChive (WARC) File Format
Web ARChive (WARC) File FormatWeb ARChive (WARC) File Format
Web ARChive (WARC) File Format
 
MemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in GoMemGator - A Memento Aggregator CLI and Server in Go
MemGator - A Memento Aggregator CLI and Server in Go
 
Dockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to ContainerizationDockerize Your Projects - A Brief Introduction to Containerization
Dockerize Your Projects - A Brief Introduction to Containerization
 
Avoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorkerAvoiding Zombies in Archival Replay Using ServiceWorker
Avoiding Zombies in Archival Replay Using ServiceWorker
 
Client-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorkerClient-side Reconstruction of Composite Mementos Using ServiceWorker
Client-side Reconstruction of Composite Mementos Using ServiceWorker
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext Search
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
TPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web ArchivesTPDL 2015 - Profiling Web Archives
TPDL 2015 - Profiling Web Archives
 
Profiling Web Archives
Profiling Web ArchivesProfiling Web Archives
Profiling Web Archives
 
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
Improving Accessibility of Archived Raster Dictionaries of Complex Script Lan...
 
Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015Profile Serialization IIPC GA 2015
Profile Serialization IIPC GA 2015
 
Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015Profiling Web Archives IIPC GA 2015
Profiling Web Archives IIPC GA 2015
 

Dernier

Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Servicegwenoracqe6
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebJames Anderson
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...SofiyaSharma5
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...tanu pandey
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445ruhi
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024APNIC
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Sheetaleventcompany
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Servicesexy call girls service in goa
 
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call GirlVIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girladitipandeya
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxellan12
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)Damian Radcliffe
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLimonikaupta
 
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Onlineanilsa9823
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceDelhi Call girls
 

Dernier (20)

Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
 
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
Low Rate Young Call Girls in Sector 63 Mamura Noida ✔️☆9289244007✔️☆ Female E...
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
 
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine ServiceHot Service (+9316020077 ) Goa  Call Girls Real Photos and Genuine Service
Hot Service (+9316020077 ) Goa Call Girls Real Photos and Genuine Service
 
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call GirlVIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
VIP 7001035870 Find & Meet Hyderabad Call Girls LB Nagar high-profile Call Girl
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)How is AI changing journalism? (v. April 2024)
How is AI changing journalism? (v. April 2024)
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Saket Delhi 💯Call Us 🔝8264348440🔝
 
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Lucknow Lucknow best sexual service Online
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >༒8448380779 Escort Service
 
@9999965857 🫦 Sexy Desi Call Girls Laxmi Nagar 💓 High Profile Escorts Delhi 🫶
@9999965857 🫦 Sexy Desi Call Girls Laxmi Nagar 💓 High Profile Escorts Delhi 🫶@9999965857 🫦 Sexy Desi Call Girls Laxmi Nagar 💓 High Profile Escorts Delhi 🫶
@9999965857 🫦 Sexy Desi Call Girls Laxmi Nagar 💓 High Profile Escorts Delhi 🫶
 

MementoMap: An Archive Profile Dissemination Framework

  • 1. MementoMap An Archive Profile Dissemination Framework Sawood Alam, Michele C. Weigle, and Michael L. Nelson Old Dominion University, Norfolk, VA, USA @ibnesayeed @WebSciDL Supported by NSF Grant IIS-1526700 WADL '19, June 6, 2019, Urbana-Champaign, Illinois
  • 2. @ibnesayeed 2 $ memgator -a archives.json -f cdxj example.com > | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr 198014 web.archive.org 13548 wayback.archive-it.org 1191 webarchive.loc.gov 1044 swap.stanford.edu 953 arquivo.pt 525 wayback.vefsafn.is 225 perma-archives.org 221 archive.md 23 www.webarchive.org.uk $ memgator -a archives.json -f cdxj jcdl.org > | grep -v "^!" | cut -d '/' -f 3 | sort | uniq -c | sort -nr 410 web.archive.org 2 www.webarchive.org.uk 2 arquivo.pt 1 archive.md Cross-archive Memento Lookup With MemGator https://github.com/oduwsdl/MemGator
  • 9. @ibnesayeed Broadcasting is Evil 9 From: Michael Nelson [mailto:mln@cs.odu.edu] Sent: Wednesday, December 02, 2015 12:33 PM To: Jones, Gina Cc: Rourke, Patrick; Grotke, Abigail Subject: Re: WebSciDL Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages. regards, Michael On Wed, 2 Dec 2015, Jones, Gina wrote: > Hi Michael, we have a slight configuration issue with the current OW > set up for our webarchives. I think, from looking at the logs, that > "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback. > Do you know who is running this scraper? Itʼs not part of memento is it? > > Gina Jones > Web Archiving Team > Library of Congress From: Ilya Kreymer <ikreymer@gmail.com> Date: Wed, 2 Dec 2015 10:33:56 -0800 Subject: high traffic on oldweb! To: Herbert Van de Sompel <hvdsomp@gmail.com>, Sawood Alam <ibnesayeed@gmail.com> Hi Herbert, Sawood, Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily.. I am thinking that ability to remove source archives quickly is an important aspect of an aggregator. Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;) Ilya Broadcasting is wasteful, both clients & archives suffer!
  • 10. @ibnesayeed Memento Lookup Routing 10 Let’s fix the broadcasting issue with a more informed routing.
  • 11. @ibnesayeed MemGator Log Responses from Various Archives 11 93% of the requests made from MemGator to upstream archives were wasteful.
  • 12. @ibnesayeed What is Archived in Arquivo.pt? What is Accessed from MemGator? 12 Blind spot of a content-based profile Blind spot of a usage-based profile
  • 13. @ibnesayeed If Only Archives Could Tell When to Ask Them ● Websites advertise their holdings using sitemap.xml, why can’t archives? ○ Archives have billions or even hundreds of billions URI-Ms ○ Such exhaustive lists would go stale very quickly ● How about robots.txt? ○ It is compact, but is exclusion format, it does not tell what the site has ○ It assumes a single domain, patterns are for paths (not the domain name) ● How about combining the two ideas? ○ Introducing MementoMap! 13
  • 14. @ibnesayeed A MementoMap Example 14 !context ["http://oduwsdl.github.io/contexts/ukvs"] !id {uri: "http://archive.example.org/"} !fields {keys: ["surt"], values: ["frequency"]} !meta {type: "MementoMap", name: "A Test Web Archive", year: 1996} !meta {updated_at: "2018-09-03T13:27:52Z"} * 54321/20000 com,* 10000+ org,arxiv)/ 100 org,arxiv)/* 2500~/900 org,arxiv)/pdf/* 0 uk,co,bbc)/images/* 300+/20- + for a lower boundary - for an upper boundary ~ for an approximate value
  • 15. @ibnesayeed Unified Key Value Store (UKVS) 15 https://github.com/oduwsdl/ORS/blob/master/ukvs.md
  • 16. @ibnesayeed UKVS Optional Fields vs. Key Columns 16
  • 17. @ibnesayeed UKVS Use Cases ● MementoMap ● CDX File/Server ● Archive ACL ● Archive Fixity ● Extended TimeMap ● … and many more 17
  • 18. @ibnesayeed SURTs Representation with Wildcard 18 Original SURTs did not have wildcards. In practice the common “http://(” prefix is removed.
  • 19. @ibnesayeed Arquivo.pt Index Statistics 19 The Internet archive is about 150 times bigger than Arquivo.pt.
  • 20. @ibnesayeed Shape of HxPx Key Tree of Arquivo.pt 20
  • 21. @ibnesayeed Who Would have Thought Arquivo.pt has 10K+ .онлайн Sites? 21 “.онлайн” (encoded as “xn--80asehdb”) is an IDN gTLD which means “.online”
  • 22. @ibnesayeed Most Archived URI-Rs in Arquivo.pt 22 Arquivo is obsessed with transparent single pixel images and corner graphics.
  • 23. @ibnesayeed Processed Lines vs. Compacted MementoMap Growth 23 com,example)/a/1/x com,example)/a/2 com,example)/a/3 com,example)/b/1 com,example)/b/2 com,example)/c/1 com,example)/a/* com,example)/b/1 com,example)/b/2 com,example)/c/1 com,example)/*
  • 24. @ibnesayeed MementoMap Generation, Compaction, and Lookup 24 1.5% Relative Cost yields 60% Accuracy. Arquivo.pt can save 60% waisted traffic by publishing 119MB summary file!
  • 25. @ibnesayeed Dissemination and Discovery Methods 25 GET /.well-known/mementomap HTTP/1.1 Host: arquivo.pt Link: <https://arquivo.pt/path/to/mementomap.ukvs>; rel="mementomap" <link href="https://arquivo.pt/path/to/mementomap.ukvs" rel="mementomap"> Well-known URI Link Header Link HTML Element
  • 26. @ibnesayeed Future Work ● Generate blacklists by processing access logs ● Incorporate MementoMap in replay systems ● Encourage archives and aggregators to adopt it ● Encourage use of UKVS in other archival and non-archival contexts 26
  • 27. @ibnesayeed Conclusions ● Described MementoMap - a flexible and efficient archive profiling framework ● Analyzed complete index of Arquivo.pt to understand nature of web archives ● Evaluated MementoMap against Arquivo.pt’s index ● Save 60% of the wasted MemGator traffic with 1.5% cost (a 119 MB file) ● Proposed “mementomap” as a well-known URI suffix as well as a link relation for dissemination of MementoMap ● Implemented a single-pass, memory-efficient, and parallelization-friendly MementoMap generation/compaction algorithm ● Open-sourced the implementation ○ https://github.com/oduwsdl/MementoMap 27