1. Web Archiving Profile Overview
Ahmed AlSum
PhD Candidate
Old Dominion University
Web Archiving and Digital Libraries (WADL 2013)
A Workshop at JCDL 2013
July 25-26, 2013
Indianapolis, Indiana, USA
2. What is the problem?
• Web archives are black boxes: they are accessible only
through a search box (full-text or URI-lookup)
• We need to profile/characterize the web archives
around the world by attributes such as:
o Age
o Top-level domains
o Languages
o Growth rate
3. Why
• To optimize query routing for the Memento
Aggregator.
• To determine the missing parts of the web.
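A minimal sketch of how a profile could drive query routing, assuming a per-archive table of TLD coverage. The archive names, the coverage numbers, the `threshold` parameter, and the functions `tld_of` and `route` are all hypothetical and serve only as illustration; they are not the Memento Aggregator's actual logic.

```python
from urllib.parse import urlsplit

# Hypothetical profile: for each archive, the fraction of sampled URIs
# per TLD that the archive held. The numbers are invented.
PROFILE = {
    "archive-a": {"com": 0.80, "uk": 0.30},
    "archive-b": {"uk": 0.75},
    "archive-c": {"pt": 0.60, "com": 0.05},
}

def tld_of(uri):
    """Return the last label of the URI's hostname, lowercased."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()

def route(uri, profile=PROFILE, threshold=0.10):
    """List archives worth querying for this URI, best coverage first."""
    tld = tld_of(uri)
    scored = [
        (coverage[tld], name)
        for name, coverage in profile.items()
        if coverage.get(tld, 0.0) >= threshold
    ]
    return [name for _, name in sorted(scored, reverse=True)]
```

With this table, a `.uk` URI would be sent to `archive-b` first and `archive-a` second, while `archive-c` would be skipped entirely.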
4. Who
Full text URI-lookup
Internet Archive x
Library of Congress x
Icelandic Web Archive x
Library and Archives Canada x x
British Library x x
UK National Library x x
Portuguese Web Archive x x
Web Archive of Catalonia x x
Croatian Web Archive x x
Archive of the Czech Web x x
National Taiwan University x x
Archive IT x x
5. How
• Sampling from different sources
• Retrieve the TimeMap from each archive
• Analyze the TimeMaps
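The TimeMap analysis step can be sketched as follows, assuming the TimeMap has already been retrieved as text in the link format defined by the Memento specification (RFC 7089). The function names, the `SAMPLE` TimeMap, and the `archive.example` host are hypothetical. The sketch only counts mementos and finds the earliest and latest capture, which would feed an attribute such as age.

```python
import re
from email.utils import parsedate_to_datetime

# One link-format entry: <uri> followed by ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')

def parse_mementos(timemap_text):
    """Return sorted (datetime, uri) pairs for entries whose rel includes 'memento'."""
    mementos = []
    for uri, params in LINK_RE.findall(timemap_text):
        attrs = dict(PARAM_RE.findall(params))
        rels = attrs.get("rel", "").split()
        if "memento" in rels and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)

def summarize(timemap_text):
    """Count the mementos and report the first and last capture times."""
    mementos = parse_mementos(timemap_text)
    if not mementos:
        return {"count": 0, "first": None, "last": None}
    return {
        "count": len(mementos),
        "first": mementos[0][0],
        "last": mementos[-1][0],
    }

# Invented TimeMap for illustration only.
SAMPLE = '''<http://example.com/>; rel="original",
<http://archive.example/timemap/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/20050601000000/http://example.com/>; rel="memento"; datetime="Wed, 01 Jun 2005 00:00:00 GMT",
<http://archive.example/20010101000000/http://example.com/>; rel="first memento"; datetime="Mon, 01 Jan 2001 00:00:00 GMT"'''
```

Calling `summarize(SAMPLE)` reports two mementos, the earliest from 2001 and the latest from 2005. Aggregating such summaries over all sampled URIs would give a per-archive profile.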
6. URIs Samples Sources
Web
1. DMOZ – Random sample
2. DMOZ – TLD: 2% of each
TLD from DMOZ (.com,
.org, .jp, etc.; 52 TLDs)
3. DMOZ – Languages: 100
URIs for each language (24
languages)
Web Archives
4. Top 1-grams from Bing
5. Top 1000 query terms
from Yahoo in 9 languages
User requests
6. IA Wayback Machine Log files
7. Memento aggregator log files
* We used hostnames only
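A minimal sketch of per-TLD sampling combined with hostname reduction. It assumes the input is a plain list of URIs; the function names, the fixed `seed`, and the rule of keeping at least one host per TLD are assumptions for illustration. Applying hostname reduction to the per-TLD sample is also an assumption, since the slide does not say which sources the hostname note covers.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit

def hostname(uri):
    """Reduce a URI to its lowercased hostname ('' if it has none)."""
    return (urlsplit(uri).hostname or "").lower()

def sample_by_tld(uris, fraction=0.02, seed=0):
    """Draw a random sample of hostnames from each TLD group."""
    rng = random.Random(seed)
    by_tld = defaultdict(set)
    for uri in uris:
        host = hostname(uri)
        if "." in host:
            by_tld[host.rsplit(".", 1)[-1]].add(host)
    sample = {}
    for tld, hosts in by_tld.items():
        # Assumption: keep at least one host so small TLDs are not dropped.
        k = max(1, round(len(hosts) * fraction))
        sample[tld] = rng.sample(sorted(hosts), k)
    return sample
```

For 100 distinct `.com` hosts and a single `.jp` host, this would pick two `.com` hostnames and the one `.jp` hostname.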