WWWoH

Slides from Humanities on the Web: Is it working?
Date: Thursday, 19 March 2009, 10-4
Location: Oxford University, Oxford, UK
Webcast URL: http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20090319_275
Slide URL: http://www.slideshare.net/etmeyer/WWWoH

Afternoon Event:
1:30 – 2:45: JISC/NEH Transatlantic Digitisation Collaboration Programme in
conjunction with the Internet Archive: The World Wide Web of Humanities
OII: Selecting and analysing the sample WWI and WWII collections
(Christine Madsen & Dr. Eric Meyer)
The Internet Archive: Extracting the data (Molly Bragg)
Hanzo Archives Ltd.: Working with the data (Mark Middleton)
Discussion and questions

Full details: http://www.oii.ox.ac.uk/events/details.cfm?id=238

Selecting and Analysing
the WWI and WWII
collections

Christine Madsen
Eric Meyer
19 March 2009

Why WWI and WWII?

Many branches of the humanities

History; Journalism; Art; Art history; Advertising; Literature; Political science; Military history; Poetry

Why WWI and WWII?

Well-rounded set of materials

Why WWI and WWII?

• Changes over time
• Differences between WWI and WWII

Language; Doc types; Secondary domains; Top-level domains

Building the Collection

Selected from the live web

Harvested from the Internet Archive

Supplemented with keyword searches in the Archive


Seeds are: the website or portion of the website that you plan to include in your collection

[Diagram: Initial Collection containing Seed 1, Seed 2 and Seed 3]


A seed is also a web site from which additional sites can be discovered via the hyperlinks of the site

[Diagram: Expanded Collection, with Seeds 1 to 6 each linked to surrounding www sites]


Started with WWI

Too small (under 1,000,000 pages/objects)
Target was 250 million


Expanded to WWII

Final collection: 5,362,425 unique URLs


‘World War One’
‘World War I’
‘First world war’
‘the great war’
‘Première Guerre Mondiale’
‘World War II’
‘World War Two’
‘zweiter Weltkrieg’


• Record links from first 20 pages of search
• Following links [include dead links]
• Returning to ‘hub’ sites for further analysis


Expanding scope

http://www.greatwar.co.uk/westfront/Somme/index.htm

http://www.greatwar.co.uk


Expanding scope

memory.loc.gov/ammem/collections/maps/wwii/index.html

www.memory.loc.gov/ammem/collections/maps/wwii/
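
The two examples above widen a seed from a single page to its directory or to the whole host. A minimal sketch of how such candidate scopes could be listed for review, assuming a hypothetical helper named candidate_scopes (it is not part of any tool mentioned in these slides):

```python
from urllib.parse import urlsplit


def candidate_scopes(url):
    """Return progressively broader scopes, from the page's directory up to the host."""
    if "://" not in url:
        url = "http://" + url  # the slides sometimes omit the scheme
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Drop a trailing file name such as index.htm
    if segments and "." in segments[-1]:
        segments = segments[:-1]
    scopes = []
    while segments:
        scopes.append(parts.netloc + "/" + "/".join(segments) + "/")
        segments = segments[:-1]
    scopes.append(parts.netloc + "/")
    return scopes


scopes = candidate_scopes("http://www.greatwar.co.uk/westfront/Somme/index.htm")
for scope in scopes:
    print(scope)
```

A curator would still choose which of the listed scopes to use; the sketch only enumerates the options.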


Dealing with illogical or flat directory structures

www.eyewitnesstohistory.com/ <= don’t want whole site

www.eyewitnesstohistory.com/blitzkrieg.htm
www.eyewitnesstohistory.com/dday.html
www.eyewitnesstohistory.com/midway.htm
www.eyewitnesstohistory.com/airbattle.htm
www.eyewitnesstohistory.com/dunkirk.htm
www.eyewitnesstohistory.com/francesurrenders.htm
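
When a site keeps all its pages in one flat directory, a prefix cannot separate the wanted pages from the rest, so each page has to be listed as its own seed. A minimal sketch of a scope check that mixes prefix seeds and single-page seeds; the names PREFIX_SEEDS, PAGE_SEEDS and in_scope are illustrative assumptions, and the seed values are taken from the slides:

```python
def normalise(url):
    """Strip the scheme so seeds and URLs compare on host and path only."""
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


# Whole sites or directories: anything under the prefix is in scope.
PREFIX_SEEDS = ["www.greatwar.co.uk/"]

# Flat sites: only the listed pages are in scope, not the whole host.
PAGE_SEEDS = {
    "www.eyewitnesstohistory.com/blitzkrieg.htm",
    "www.eyewitnesstohistory.com/dday.html",
    "www.eyewitnesstohistory.com/midway.htm",
}


def in_scope(url):
    """True if the URL is an exact page seed or falls under a prefix seed."""
    u = normalise(url)
    return u in PAGE_SEEDS or any(u.startswith(p) for p in PREFIX_SEEDS)


print(in_scope("http://www.greatwar.co.uk/westfront/Somme/index.htm"))
print(in_scope("www.eyewitnesstohistory.com/dday.html"))
print(in_scope("www.eyewitnesstohistory.com/"))
```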


• Stop when most results are redundant
• Narrow in on more specific topics

Churchill; Hitler; ‘zweiter Weltkrieg’; ‘Battle of the Bulge’; ‘Great war’; Guadalcanal; WWI; Allies; WWII; Home front
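
The rule "stop when most results are redundant" can be made concrete by measuring how many results of a search are already in the collection. A minimal sketch, assuming a threshold of 0.8 (the slides give no figure) and hypothetical function names:

```python
def redundancy(result_urls, seen):
    """Fraction of search results that are already in the collection."""
    if not result_urls:
        return 1.0
    already = sum(1 for u in result_urls if u in seen)
    return already / len(result_urls)


def should_stop(result_urls, seen, threshold=0.8):
    """Stop searching a keyword once most results are redundant (threshold is an assumption)."""
    return redundancy(result_urls, seen) >= threshold


seen = {"a.example/1", "a.example/2", "b.example/1", "b.example/2"}
mostly_old = ["a.example/1", "a.example/2", "b.example/1", "b.example/2", "c.example/1"]
mostly_new = ["a.example/1", "d.example/1", "e.example/1"]
print(should_stop(mostly_old, seen))
print(should_stop(mostly_new, seen))
```

Once a broad keyword reaches the threshold, the next step on the slide applies: move to a narrower term such as those listed above.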


• Materials in foreign languages
– Focused on German sites
– Consider local conventions, not just translations

WWII (zweiter Weltkrieg)
the period of National Socialism (Zeit des Nationalsozialismus)
the period in which the Nazis ruled (Nazizeit)

• Other foreign languages were included, but
not sought after

Belarusian; Catalan/Valencian; Chamorro;
Czech; Danish; German; Dzongkha; English;
Spanish/Castilian; Finnish; French; Hebrew;
Hungarian; Italian; Japanese; Luba-Katanga;
Dutch/Flemish; Polish; Portuguese; Russian;
Slovenian; Turkish; Ukrainian; Chinese


Difficult to find and include:

Museums, libraries, archives

Some improvement through targeted searches

NYPL (2,100 photographs) Harvard Libraries (1,000 WWI Pamphlets)

Directory Structures still limiting
http://pds.lib.harvard.edu/pds/view/7845178
(first page of a multipage object)

The World Wide Web of Humanities
“Extracting The Data”

St Anne's College, Oxford
March 19, 2009

Molly Bragg, Partner Specialist
Web Group
The Internet Archive

Agenda

• Brief Introduction to IA’s Web Archives

• Discipline Specific Data Extraction from Longitudinal Web Archives: The WWWoH Case Study

• Recommendations for Future Research and Tools Development Efforts

Brief Introduction to
IA’s Web Archives

The Internet Archive is…
A digital library of ~4 petabytes of information

• Web Pages
• Educational Courseware
• Films & Videos
• Music & Spoken Word
• Books & Texts
• Software
• Images
The Archive’s combined collections receive over 6 million downloads a day!
www.archive.org

IA Web Archives

1.6+ petabytes of primary data (compressed)

• 150+ billion URIs, culled from 85+ million sites, harvested from 1996 to the present
• Includes captures from every domain
• Encompasses content in over 40 languages
• As of 2009, IA will add ½ petabyte to 1 petabyte of data to these collections each year.

Discipline Specific Data Extraction
from Longitudinal Web Archives:

The WWWoH Case Study

WWWoH Case Study

http://neh-access.archive.org/neh/

WWWoH Case Study

• Unique URLs in the collection: 5,362,425

• Total number of captures: 23,006,857

• Captures span: May 1996 to August 2008

• Total size of compressed data: ~250 GB
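
The figures above imply roughly 4.3 captures per unique URL. A small worked calculation, assuming "~250 GB" means decimal gigabytes (the slide does not say):

```python
unique_urls = 5_362_425
captures = 23_006_857
compressed_bytes = 250 * 10**9  # assumption: ~250 GB read as decimal gigabytes

# Average number of times each URL was captured between 1996 and 2008
captures_per_url = captures / unique_urls

# Average compressed size of one capture
bytes_per_capture = compressed_bytes / captures

print(f"{captures_per_url:.2f} captures per unique URL")
print(f"{bytes_per_capture / 1000:.1f} kB per capture (compressed)")
```

These are averages only; the slides do not describe how captures are distributed across URLs or years.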

The Data Extraction Process
• Oxford Internet Institute selected relevant sites/URLs
• Identified all captures related to the seeds
• Identified all files embedded in each capture (on & off seed domains) for extraction
• Attempted to locate additional candidate seed URLs/domains for inclusion in the collection using outbound link data
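
The second and third steps above can be pictured with toy data. This is only an illustrative sketch: the Capture class, its fields and the extract function are hypothetical and do not describe the Internet Archive's actual tooling or data formats.

```python
from dataclasses import dataclass, field


@dataclass
class Capture:
    url: str                     # captured URL, without scheme
    timestamp: str               # capture time, e.g. "19960501"
    embedded: list = field(default_factory=list)  # images, CSS, etc. used by the page


def extract(captures, seed_prefixes):
    """Select captures under any seed, plus every file those captures embed."""
    selected = [c for c in captures
                if any(c.url.startswith(p) for p in seed_prefixes)]
    # Embedded files are kept whether or not they live on a seed domain.
    embedded = {e for c in selected for e in c.embedded}
    return selected, embedded


captures = [
    Capture("www.greatwar.co.uk/index.htm", "20010101", ["img.example.net/banner.gif"]),
    Capture("www.greatwar.co.uk/index.htm", "20050101", ["www.greatwar.co.uk/style.css"]),
    Capture("unrelated.example.org/page.html", "20030101", ["unrelated.example.org/x.gif"]),
]
selected, embedded = extract(captures, ["www.greatwar.co.uk/"])
print(len(selected), sorted(embedded))
```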

The Data Extraction Process

• Relevant URLs not identified as seeds were not extracted.
• Automatically harvesting ALL outbound links can capture relevant non-seed URLs; however, it can also introduce a large amount of extraneous content into the collection
• Manually curating outbound links excludes non-relevant content; however, it can be an overwhelming task due to the volume of links
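
One middle path between harvesting every outbound link and curating each one by hand is to rank non-seed hosts by how often the collection links to them, so that a curator reviews only the top of the list. A minimal sketch, assuming a hypothetical rank_candidate_hosts helper and a min_links cut-off that is not taken from the project:

```python
from collections import Counter
from urllib.parse import urlsplit


def rank_candidate_hosts(outbound_links, seed_hosts, min_links=2):
    """Count outbound links per non-seed host, most linked first."""
    counts = Counter()
    for link in outbound_links:
        host = urlsplit(link).netloc.lower()
        if host and host not in seed_hosts:
            counts[host] += 1
    # Hosts linked fewer than min_links times are dropped (cut-off is an assumption).
    return [(host, n) for host, n in counts.most_common() if n >= min_links]


links = [
    "http://www.firstworldwar.example/a",
    "http://www.firstworldwar.example/b",
    "http://www.firstworldwar.example/c",
    "http://maps.example.org/somme",
    "http://maps.example.org/ypres",
    "http://ads.example.com/banner",
    "http://www.greatwar.co.uk/westfront/",
]
ranked = rank_candidate_hosts(links, seed_hosts={"www.greatwar.co.uk"})
print(ranked)
```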

WWWoH Case Study: WWI

• Number of Seeds: 2,263

• Unique Hosts: 906

• Number of Links: 143+ million

WWWoH Case Study: WWII

• Number of Seeds: 2,592

• Unique Hosts: 1,475

• Number of Links: 252+ million

Challenges

Identifying subject matter-specific resources of interest for an extraction and then automating those procedures.

• Tools are missing from the workflow that might make the initial scoping of an extraction easier to define and revise
• Available tools for collection building and access are too technically focused for the average humanities scholar

Recommendations for
Future Research and Tools
Development Efforts

Implications for Future Research

• Need link and web graphing tools that use inbound and outbound link data to identify further resources of interest
• Need to experiment with a more diverse range of UI navigational paradigms that address the dimension of time and curatorial input

Ideas/Concepts to Explore:
Nomination Tools

Opportunities

• Extractions make it easier for humanities scholars to locate and assemble source materials of interest.
• These collections can accelerate and/or augment discipline specific research efforts
• Extractions can encourage distributed collaboration and cooperation between entities who might not otherwise be aware of one another

Thank You!

http://neh-access.archive.org/neh/

Molly Bragg, Partner Specialist
The Internet Archive, Web Group
mbragg@archive.org

Search and Analysis of
Data in WWWoH

Mark Middleton
www.hanzoarchives.com
Copyright © 2009 Hanzo Archives Limited.

Agenda
Brief introduction to Hanzo
Open Source Search-Tools: a toolkit for implementing analytical
applications using web archives
WWWoH — working with the data
Recommendations for future research
Recommendations for future tools development
WWWoH Tools Deliverables


Introduction to Hanzo


Hanzo Archives Limited
Web Archiving Services
Company websites and intranets
Litigation support
E-Discovery
IP protection

Focus on legally defensible web archives of exceptional quality
Very advanced crawlers and access tools: dynamic HTML, video, Flash, Web 2.0
Some public archives
Mainly closed archives


Hanzo Archiving Technology
Need advanced capabilities very quickly — continuous product innovation
Rapid development of tools
Create research and open source projects to promote mainstream awareness
of web archives and web archiving technology

Open source projects include
WARC Tools
Search Tools


WWWoH and Development of
Open Source Search-Tools


Objectives
Deliver an open source search engine for web archives that is simple to
extend, easy to install and deploy
Integrate with WARC Tools, the open source web archive file manipulation
tools (Hanzo and IIPC)
Extend the search engine with interesting directives and options
Extend the search engine to provide data to analytical tools, develop an
API, tools, and exemplar analytical tools
Encourage third party analytical tools to use web archives as their data
repository
Migrate WWWoH extraction from ARC to WARC and ingest into Search
Tools


Full Text Search
Implemented full-text search on top of WARC Tools, the toolkit for
manipulating ISO 28500 WARC files
Reviewed several options: Java Lucene (and clones), Xapian, DB indexing
(Sphinx, OpenFTS), etc.
Criteria: vibrant development community, extensible (searching web
archives is different: temporal dimension, duplicate handling, etc.), fast and
full-featured (boolean, time queries, ability to index multiple fields, query
language)


Component Architecture
Full text search engine based on Open Source Ferret
Knowledge Base stores search results
Python application with Django model and Django WUI
Memcache
Plug-in architecture to support multiple analytical applications


Ferret
Ferret is FAST, both indexing and searching
Highly scalable, up to 100m documents on a single CPU
Supports distributed search
Phrase search, proximity ranking, stemming in several languages, stopwords, multiple document fields
Ferret Query Language
[Chart; axis label: Documents/s]
http://ferret.davebalmain.com/trac/wiki/FerretVsLucene


Advanced Search
url: (+bbc +wwii) -- search for URLs containing both ‘bbc’ and ‘wwii’
date: [2001 2002] -- search within date range
tag: wwwoh -- search content with the tag ‘wwwoh’
title: (+wilfred +owen) -- search for Wilfred and Owen within the title
domain: fr -- restrict search to within .fr domain

Working with the Data


Migrating ARC to WARC
Data extracted from IA in ARC files
Hanzo WARC Tools and Search Tools projects combined enabled us to
migrate ARC to WARC files (WARC is the new ISO standard):
Some challenges: broken ARCs, scale, etc.
3,264 WARC files


Programmable Access to Data
WARC Tools and Search Tools provide a rich collection of programmable
tools to enable analytics tools developers to use web archives:
Object-oriented C, REST API, fast iterators
Command lines for manipulating WARCs, indexing, searching
Web applications for browsing, searching, demonstrator analytics
C/C++, Python, Ruby, Perl, … and if you need to, Java, C#

Demonstration: the web applications


http://wwwoh.hanzoarchives.com/


Analytical Tools
Frequency Tables for:
Domains, MIME Types, Countries
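
A frequency table of this kind is a simple count over the captures. A minimal sketch using toy (URL, MIME type) pairs; the frequency_tables name and the input shape are assumptions, and country tables are left out because they would need an IP or registry lookup that the slides do not describe.

```python
from collections import Counter
from urllib.parse import urlsplit


def frequency_tables(captures):
    """Count captures per host, per top-level domain and per MIME type."""
    hosts, tlds, mime_types = Counter(), Counter(), Counter()
    for url, mime_type in captures:
        host = urlsplit(url).hostname or ""
        hosts[host] += 1
        tlds[host.rsplit(".", 1)[-1]] += 1
        mime_types[mime_type] += 1
    return hosts, tlds, mime_types


captures = [
    ("http://www.greatwar.co.uk/index.htm", "text/html"),
    ("http://www.greatwar.co.uk/somme.jpg", "image/jpeg"),
    ("http://memory.loc.gov/ammem/index.html", "text/html"),
]
hosts, tlds, mime_types = frequency_tables(captures)
for mime_type, count in mime_types.most_common():
    print(f"{mime_type:12} {count}")
```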

Graphing Tools:
GUESS -- an exploratory data analysis and visualization tool for graphs and
networks
Graphviz -- makes diagrams in several formats: images and SVG for web
pages, Postscript; or display in an interactive graph browser
Hypergraph -- provides visualisation of hyperbolic geometry, to handle graphs
and to layout hyperbolic trees
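
Graphviz reads graphs written in its plain-text DOT language, so link data only needs to be written out as edges. A minimal sketch that builds DOT text from (source host, target host) pairs; the to_dot helper and the sample hosts are illustrative, not part of Search Tools.

```python
def to_dot(edges, name="links"):
    """Render (source, target) host pairs as a Graphviz DOT directed graph."""
    lines = [f"digraph {name} {{"]
    for src, dst in sorted(set(edges)):  # de-duplicate and keep output stable
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


edges = [
    ("www.greatwar.co.uk", "memory.loc.gov"),
    ("www.greatwar.co.uk", "www.eyewitnesstohistory.com"),
    ("www.greatwar.co.uk", "memory.loc.gov"),
]
dot_text = to_dot(edges)
print(dot_text)
```

Saved to a file such as links.dot, the output can be rendered with the Graphviz command line, for example `dot -Tsvg links.dot -o links.svg`.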


Graphing Tools


Recommendations for Future
Research and Tools Development


Future Research
Faster, richer analytics
Rich API for analytics, to be developed in collaboration with IA, other
archives, and IIPC
Temporal analytics and techniques
Link and network graphing and analytics
Enhance outreach/dissemination to the mainstream development
community and research community


Future Tools Development
Multi-machine indexing and application engine
Tighter integration of graphing tools, with more user parameters and
configurations
Temporal analysis (animation of link graphs over time)
Enhance WARC Tools integration and investigate interoperability with other
IIPC toolsets
Developer documentation
Analyst/researcher documentation
Installation tools for Linux, Mac OS X and Windows XP/Vista


Deliverables at End March 2009


Deliverables
The Search Tools project home is http://code.google.com/p/search-tools/
Source code
Documentation
Issue management
Mailing list

The WARC Tools project home is http://code.google.com/p/warc-tools/
The prototype application is http://wwwoh.hanzoarchives.com/


Thank You
Hanzo Archives Limited
+44 20 8816 8226
www.hanzoarchives.com

