37. Oracle StorageTek SL8500
• up to 64 tape drives
• up to 8500 tapes
• up to 8 hand pickers
• up to 32 linked libraries
Primary storage: 2 libraries, 16 PB maximum
Backup storage: 2 libraries, 16 PB maximum
76. Set
[Diagram: issues of Le Matin (01/07/1882 to 01/03/1883) stored as individual AIPs and grouped into one set]
Set:
• Contains nothing but metadata
• Curator information; allows grouping of AIPs sharing the same intellectual content
AIP:
• Must contain the files to be preserved
• Each AIP is an autonomous unit
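A minimal Python sketch of the set/AIP model above; the class and field names are hypothetical illustrations, not part of any archiving system:

```python
# Hypothetical sketch of the set/AIP model above: an AIP is an autonomous
# unit that must contain the files to preserve; a set holds only curator
# metadata and groups AIPs sharing the same intellectual content.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AIP:
    identifier: str       # e.g. one digitised issue of a newspaper
    files: List[str]      # must contain files to be preserved

@dataclass
class IntellectualSet:
    title: str                                      # metadata only, no files
    members: List[AIP] = field(default_factory=list)

le_matin = IntellectualSet(title="Le Matin")
le_matin.members.append(AIP("lematin-1882-07-01", ["page-001.tiff", "mets.xml"]))
```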
94. Web archiving at the British Library
Helen Hockx-Yu
Head of Web Archiving
95. Overview
> Part 1: Background, history and organisation
> Part 2: Web Archiving Tools (including demos)
> Part 3: Access
> Part 4: Non-print Legal Deposit and future strategy
96. BL Structure
> BL Board and Executive Team
> e-Strategy and Information Systems (eIS)
> IT-based products and services
> Finance and Corporate Services (F&CS)
> Money
> Human Resources
> People
> Operations & Services (O&S)
> Front line services
> Scholarship and Collections (S&C)
> Content (Arts and humanities, Social Sciences, Science, Technology & Medicine)
> Strategic Marketing and Communications (SMC)
> Brand and reputation
98. Current web archiving strategy
> Selective archiving of websites (4 areas) that
> reflect the diversity of lives, interests and activities throughout the UK
> contain research value or are of research interest
> feature political, cultural, social and economic events of national interest
> demonstrate innovative use of the web
> Also prioritise websites at risk and web-only content
> Permission based
> Permission to archive, to provide online access and to preserve. Also ask for 3rd party rights clearance
> 30% success rate, 5% explicit refusal (mostly due to 3rd party rights)
> Online access through UK Web Archive
> Expect to crawl at domain level (from April 2013) for Non-print Legal Deposit
99. The current Web Archiving team
Skills Profile
> IT
> Collection management, digital curation
> Management
> Communications
> Web Archiving
100. (Internal Collaboration)
> The Web Archiving Team is involved in the end-to-end process but works with other departments / teams in the library
Department / Team: Activity / Support
S&C (subject specialist group; Curator's Choice project): selection, curation
eIS: network, hardware and IT support
O&S (Resource Discovery & Research): corporate-level resource discovery, http://explore.bl.uk/
CA&D: digital processing; cataloguing (special collection level)
SMC: publicity, press releases, events
The Legal Deposit Programme: domain crawl capability / process and policy
101. Curator’s Choice
> Pilot project with a small group of dedicated curators / subject specialists
> Special Collections of the curators' choice. Curators take responsibility for owning, maintaining and growing the collections over time
> Evolving Role of Libraries in the UK
> Political Action and Communication
> Slavery and Abolition in the Caribbean
> UK relations with the Low Countries
> 19th Century English Literature
> Oral History in the UK
> Film in the UK
> Energy
102. Web Archiving Advisory Group
> Provide advice and support to the Web Archiving Team
> Act as a ‘critical friend’ to assist in the development of policy and practice
> Specific advice and support on:
> Purpose, vision and benefits.
> Strategic direction and planning.
> Synergy with internal teams and collaboration with external stakeholders/partners
> Policy changes and risk management
103. (External) Collaboration
> UK Web Archiving Consortium (2004-2007): centralised infrastructure and development, distributed collections
> UK Web Archive partners, National Archives, Legal Deposit Libraries (LDLs)
> External Collaborators
> Wellcome Library
> Live Art Development Agency
> The Cambridge Innovation Network
> The Women’s Library
> Institute of Historical Research, University of London
> Individual researchers, specialists
> General public – ca. 20 nominations / week
> National organisations: DPC, JISC
> International: IIPC
104. JISC UK Web Domain Dataset (1996-2010)
> Collaboration with JISC and the Internet Archive
> UK Web Domain Dataset (1996-2010): UK websites extracted from the Internet Archive's collection, supported by funding from the JISC
> 35TB research dataset
> No local access to individual websites, but access to a secondary dataset is allowed
> BL has developed visualisations of the dataset
> JISC funded 2 further projects using this dataset
> Analytical Access to the Domain Dark Archive
> Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research
105. Web Archiving Tools
> Support key processes: selection, harvesting, storage, access, preservation
> Mostly open source tools, some developed in-house
> New tools / changes to current tools are expected when business processes change due to non-print Legal Deposit
106. Selection Tools
> Selection: decide which websites to archive and to include as part of a web archive collection
> Selection and Permission Tool: https://wct.bl.uk/selection/
> Submit selection – real-time checking of duplicates, fetching meta tags from live sites
> Collect metadata
> Add contact details
> Suggest crawl frequency
> Permissions management – send emails, direct users to the online licence form, store the completed forms, pass details to WCT (create an authorisation record and a pending target)
> Reports
> Twittervane
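As an illustration of the "fetching meta tags from live sites" step, here is a hedged Python sketch using the requests library and the standard-library HTML parser; it is not the Selection and Permission Tool's actual code:

```python
# Illustrative sketch of fetching <meta> tags from a live site at selection
# time (not the Selection and Permission Tool's own implementation).
from html.parser import HTMLParser
import requests

class MetaTagParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            attrs = dict(attrs)
            name = attrs.get('name') or attrs.get('property')
            if name and 'content' in attrs:
                self.meta[name] = attrs['content']

def fetch_meta_tags(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta

print(fetch_meta_tags('http://www.webarchive.org.uk'))
```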
107. Harvesting Tools
> Harvesting: automated downloading of selected websites using crawler software; quality assurance is regarded as an element of this process
> The Web Curator Tool (WCT): https://wct.bl.uk/wct/
> Job scheduling
> Metadata
> Access control
> Harvesting (uses Heritrix)
> QA
108. Quality Assurance
> Places more emphasis on intellectual content than on the appearance or behaviour of a website
> Use four aspects to define quality:
> Completeness of capture: whether the intended content has been captured as part of the harvest
> Intellectual content: whether the intellectual content (as opposed to styling and layout) can be replayed in the Access Tool
> Behaviour: whether the harvested copy can be replayed including the behaviour present on the live site, such as the ability to browse between links interactively
> Appearance: the look and feel of the website
> Rely on visual comparison, previous harvests & crawl logs
> Recent development of a QA module to allow bulk operation, reduce the number of clicks and make QA recommendations
109. Supporting Long-term Preservation
> Storing data in WARCs and metadata in METS
> Migrate all legacy data into WARCs
> WCT outputs WARC files
> Submission Information Package (SIP) profiles for selective and domain crawls
> Storing descriptive metadata (e.g. permission information) and technical metadata (e.g. crawl logs, crawl configurations, virus scan events)
> Ingest archived websites into the Digital Library System (DLS)
> Command line tool generates SIPs
> Providing access from the DLS (in future)
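To make the WARC storage concrete, here is a minimal sketch that writes a single response record with the open-source warcio library; warcio is a later tool used here purely for illustration, not the WCT's own WARC writer:

```python
# Minimal illustration of writing one HTTP response record to a WARC file,
# using the open-source warcio library (illustration only, not BL/WCT tooling).
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open('example.warc.gz', 'wb') as fh:
    writer = WARCWriter(fh, gzip=True)
    http_headers = StatusAndHeaders('200 OK',
                                    [('Content-Type', 'text/html')],
                                    protocol='HTTP/1.1')
    record = writer.create_warc_record('http://www.example.org/',
                                       'response',
                                       payload=BytesIO(b'<html>archived page</html>'),
                                       http_headers=http_headers)
    writer.write_record(record)
```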
110. Demo (45 minutes)
> Selection and Permission Tool (https://wct.bl.uk/selection/)
> Web Curator Tool (https://wct.bl.uk/wct/)
111. Access
> Currently 3 ways to access the web archive
> Online through the UK Web Archive
> Catalogue records (of special collections)
> Keyword search through Primo (corporate resource discovery system)
> Conduct researcher survey to understand requirements
> Analytical access
113. Keyword search through Primo
114. UK Web Archive
> Websites archived by BL and partners since 2004 (65% by BL)
> 122,99 websites, 50,866 instances, 13.6 TB of WARCs
> Over 100,000 unique visits since 1st April 2012
> Key websites include videos
> Full-text, N-gram, title and URL search
> Browse by subject / special collection; visual browsing
http://www.webarchive.org.uk
115. Analytical Access
> Shift of focus from the level of single web pages or websites to the entire web archive collection
> Use web archives as datasets
> Support survey, annotation, contextualisation and visualisation
> Allows discovery of patterns, trends and relationships in inter-linked web pages
> Extracting value from the “haystacks”
> Helps address a number of challenging issues:
> Scalability
> Accessibility of individual websites
> Components missed by crawlers
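A toy illustration of the "web archives as datasets" idea: counting word bigrams over a small in-memory corpus, a stand-in for the archive-scale N-gram indexes mentioned in these slides. Pure standard library; the corpus is invented:

```python
# Sketch of dataset-style analysis: counting word bigrams across archived
# text (a toy stand-in for archive-scale N-gram indexes).
from collections import Counter
import re

def ngrams(text: str, n: int = 2):
    words = re.findall(r"[a-z']+", text.lower())
    return zip(*(words[i:] for i in range(n)))

corpus = ["The British Library archives the UK web",
          "The UK web changes quickly"]
counts = Counter(gram for doc in corpus for gram in ngrams(doc))
print(counts.most_common(3))
```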
116. Visualising the UK Web
> http://www.webarchive.org.uk/ukwa/visualisation
> N-gram search
> Links analysis
> Format Analysis
> Geo-index
> http://www.webarchive.org.uk/bluebox/
> uses the Memento aggregate TimeGate hosted by lanl.gov
> “resource not in archive” – who else has it?
> Open data
> Dataset and APIs for general use
> Enable the broader community to re-use, explore and visualise the content of the web archive
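The bluebox service above relies on Memento datetime negotiation (RFC 7089). The sketch below asks a TimeGate for the memento closest to a given date; the aggregator URL is an assumption about the current public endpoint:

```python
# Sketch of Memento datetime negotiation (RFC 7089): ask a TimeGate for the
# memento closest to a given date. The aggregator URL below is an assumption.
from datetime import datetime, timezone
from email.utils import format_datetime
import requests

TIMEGATE = 'http://timetravel.mementoweb.org/timegate/'   # assumed endpoint
target = 'http://www.bl.uk/'
when = datetime(2012, 11, 29, tzinfo=timezone.utc)

resp = requests.get(TIMEGATE + target,
                    headers={'Accept-Datetime': format_datetime(when, usegmt=True)},
                    allow_redirects=False)
print(resp.status_code, resp.headers.get('Location'))  # e.g. 302 + memento URL
```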
118. Non-print Legal Deposit: Time of change
> Expected to be in place in April 2013
> Access restricted to premises of Legal Deposit Libraries
> Library-wide Legal Deposit Programme to develop capability and end-to-end process
> Web Archiving Team acts as “technical supplier” for a number of projects
> Still need to work out how current (permission-based) selective archiving relates to domain crawling under Legal Deposit
> Will we request permissions for online access?
> Will we stop crawling some of the sites we are crawling now and include them in the annual / bi-annual broad domain crawl?
> Who does what?
119. Web Archiving Strategy
[Diagram: domain crawl as the base layer, with event / key-site crawls and special collections layered on top]
Domain harvesting: broad sweep of the .uk domain, once or twice a year
Events & key sites: events of national interest; sites that need to be captured frequently
Special Collections: focused, thematic collections that support priority subjects
120. Web Archiving Workshop
Leïla Medjkoune, Internet Memory
IIPC workshop, BNF, Paris, November 2012
121. Internet Memory
Internet Memory Foundation (European Archive)
• Established in 2004 in Amsterdam and then Paris
• Mission: preserve Web content by building a shared WA platform
• Actions: dissemination, R&D and partnerships with research groups and cultural institutions
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland
Internet Memory Research
• Spin-off of IM established in June 2011 in Paris
• Missions: operate large-scale or selective crawls & develop new technologies (crawl, access, processing and extraction)
122. Internet Memory Infrastructure
Green datacenters
Repository and data access for large-scale data management:
• HDFS (Hadoop File System): distributed, fault-tolerant file system
• HBase: a distributed key-value index
• A convenient model for temporal archives
• MapReduce: a distributed execution framework (see the sketch after this list)
• A reliable mechanism to run an analysis job on very large datasets
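A minimal Hadoop Streaming-style mapper and reducer in Python, counting archived records per MIME type; the tab-separated input layout is hypothetical, not IM's actual record format:

```python
#!/usr/bin/env python3
# Hadoop Streaming-style analysis sketch: count archived records per MIME
# type. Assumes tab-separated input lines of "url<TAB>mime_type" -- a
# hypothetical layout. Run with: script.py map | sort | script.py
import sys

def mapper():
    for line in sys.stdin:
        parts = line.rstrip('\n').split('\t')
        if len(parts) == 2:
            print(f'{parts[1]}\t1')          # emit (mime_type, 1)

def reducer():
    current, total = None, 0
    for line in sys.stdin:                   # input arrives sorted by key
        key, value = line.rstrip('\n').split('\t')
        if key != current:
            if current is not None:
                print(f'{current}\t{total}')
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f'{current}\t{total}')

if __name__ == '__main__':
    mapper() if sys.argv[1:] == ['map'] else reducer()
```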
123. Internet Memory
Focused crawling:
• Automated crawls
• Quality-focused crawls:
– Video capture, Twitter crawls
– Execution tools to overcome crawling issues on specific content
Large-scale crawling:
• In-house developed distributed software
• Scalable crawler (10-50 Bn pages)
• Also designed for focused crawls and complex scoping
124. Research projects and focus
Web Archiving and Preservation
✓ Living Web Archives (2007-2010)
✓ Archives to Community MEMories (2010-2013)
✓ SCAlable Preservation Environment (2010-2013)
Web-scale data Archiving and Extraction
✓ Living Knowledge (2009-2012)
✓ Longitudinal Analytics of Web Archive data (2010-2013)
✓ TrendMiner (2011-2014)
✓ DOPA (2012-2014)
✓ AnnoMarket (2012-2014)
125. Web Archiving project?
Organisational challenges:
• Selection/QA: librarian / archivist, quality assurance team, project manager
• Content capture / services development: engineers, developers, technicians
• Infrastructure deployment and maintenance: engineers, system administrators
➥ Web Archiving projects require strong competences and experienced human resources combined with a scalable infrastructure
126. IM Shared platform
Since its creation in 2004, the Internet Memory Foundation has worked in close collaboration with partner institutions and research groups through European projects:
• To develop methods and tools improving web archiving quality
• To grow its expertise and technological taskforce
127. Archivethe.Net (1)
• To mutualize knowledge and skills between institutions
• To share internal developments with partner institutions
• To cut services and R&D costs
128. Archivethe.Net (2)
• Archivethe.net is a shared web archiving platform associated with a service.
• The platform combines new technology and user needs to ensure good service quality in terms of reliability and efficiency.
• For whom? Our current partners, our new partners and … ourselves.
129. Benefits?
• Integrated web archiving process: from selection to access
• Ongoing technological developments through specific or common R&D projects
• Dedicated and highly skilled team to follow partners’ projects
• Dedicated infrastructure
130. How does it work? (1)
• ATN is designed as SaaS (Software as a Service)
• The platform offers a friendly user interface to record partners’ web archiving orders
• A pipeline organizes and manages the production
• A QA team ensures the quality of the archive to meet partners’ requirements
132. ARCOMEM Archivist tool?
Set up and follow web archive campaigns
• V1: a crawler cockpit and a search and retrieval application
Intelligent content acquisition:
• Seed URLs
• Keywords
• Social web sites’ APIs
• Social Media Categories (SMC)
133. SARA
Search and retrieval interface:
• Advanced search functionalities
• Filtering via faceting
• Sorting by content type, social media platform, text/image contextual information (event, entity, ...), etc.
134. Crawler Cockpit Interface
• Create/select a campaign
• Describe the campaign (title, description, comments, etc.)
• Define scope: select criteria such as language, keyword, URL, organisation, etc.
• Select social media categories and APIs to explore
• Set precedence rules for certain content types or sources (images, videos, tweets, news, etc.)
136. ARCOMEM Archivist Tool V2
• Refinement mode: refine crawl parameters to improve crawls
• Improved access application (SARA): a preview function so that users can review the results of the campaign set-up
137. QA for Web Archives?
IM QA is based on:
• Tools developed internally
• Tools developed in the context of European projects
• Automated processes
• The knowledge and skills of our crawl engineer and QA teams
138. QA Methodology and tools?
Methodology
• Based upon crawler behaviour
• Based on institutions’ needs and policy
• Can be manual (visual) or “automated”
• Can be carried out at pre- or post-crawl time
Tools
• Open source tools such as plugins, proxies, etc.
• Internally developed tools (fetchers, automated checks, etc.)
• Bug trackers to record information and communicate with partner institutions
139. QA Methodology and tools?
SCAPE: Scalable Preservation Environments
• Automate visual QA to detect rendering issues:
– Improve archive quality and cut QA costs
– Feed “preservation watch and planning” tools
• First test made on over 400 pairs of URLs
• In-house “execution platform” under deployment
• Results and processes to be disseminated to IIPC members for feedback!
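A toy stand-in for the automated visual QA idea: measure how much two page screenshots differ pixel-wise with Pillow. This sketches the concept only; it is not the SCAPE project's comparison pipeline:

```python
# Toy stand-in for automated visual QA: measure how much two page
# screenshots differ pixel-wise (not the SCAPE comparison pipeline).
from PIL import Image, ImageChops

def difference_ratio(live_png: str, archived_png: str) -> float:
    a = Image.open(live_png).convert('RGB')
    b = Image.open(archived_png).convert('RGB').resize(a.size)
    diff = ImageChops.difference(a, b).convert('L')       # per-pixel difference
    histogram = diff.histogram()                          # 256 brightness bins
    changed = sum(count for level, count in enumerate(histogram) if level > 16)
    return changed / (a.size[0] * a.size[1])

# e.g. flag a rendering issue if more than 5% of pixels changed noticeably:
# print(difference_ratio('live.png', 'archived.png') > 0.05)
```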
140. Technical challenges
Capture
• Dynamically generated content, deep web, etc.
• Non-HTTP protocols (e.g. RTMP)
• Social media platforms, ...
Access
• Replicate live functionalities and look & feel
• Provide access to very large files
➥ Fast-evolving technologies
➥ Ephemeral content
➥ Multiplication of production means: increase of user-generated content
141. Technical Solutions
• Execution-based crawling (vs parsing): see the sketch below
• API crawling
• Application-aware crawling
• Bespoke fetchers
➥ Orchestration of tools
[Figure: ARCOMEM content acquisition]
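A hedged sketch of execution-based capture: render the page in a headless browser so that JavaScript-generated content reaches the DOM before extraction. Selenium with headless Firefox is my tool choice for illustration, not necessarily IM's stack:

```python
# Execution-based (vs parsing-based) capture: let a real browser engine run
# the page's scripts, then read the resulting DOM. Illustrative tool choice.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
try:
    driver.get('http://example.org/')
    rendered_html = driver.page_source            # DOM after script execution
    links = [a.get_attribute('href')
             for a in driver.find_elements(By.TAG_NAME, 'a')]
    print(len(rendered_html), len(links))
finally:
    driver.quit()
```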
142. Technical Solutions
Access tool:
• Player replacement: reproduce players’ functionalities
• Adapt the access solution to the type of content/platform (generic solutions)
Storage infrastructure / format:
• Enable access to large files
• Fast access to large amounts of content to facilitate search & retrieval (see the range-request sketch below)
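Serving slices of very large archived files usually rests on standard HTTP range requests; a small sketch, assuming a server that supports ranges and using a hypothetical URL:

```python
# Accessing a slice of a very large archived file with an HTTP Range request
# (standard HTTP/1.1; assumes the server supports ranges; URL is hypothetical).
import requests

url = 'http://archive.example.org/large-collection.warc.gz'
resp = requests.get(url, headers={'Range': 'bytes=0-1023'}, timeout=30)
if resp.status_code == 206:          # 206 Partial Content
    first_kilobyte = resp.content
    print(len(first_kilobyte))
```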
143. Use cases
• Social media capture and access:
– YouTube
– Twitter
– Flickr, etc.
• Web Archiving related services:
– Redirection service
– Memento
• Legal issues with captured content
• Full-text search
• etc.