Presented at the International Conference on Web Archives and Electronic Legal Deposit (Jornada Internacional sobre Archivos Web y Depósito Legal Electrónico), at the Biblioteca Nacional de España (BNE), on 9 July 2013.
2. www.bl.uk 2
Web Archiving Timeline
• 2001-2002: Explore
– Launch Domain.UK project
– No public access
• 2003-2008: Collaborate
– Establish Web Archiving Programme
– Lead UK Web Archiving Consortium
– Launch UK Web Archive
• 2008-2011: Build capacity
– People, systems and processes
– Curatorial expertise
– Technical know-how
• 2011 onwards: BAU
– Web Archiving as operational unit
– Implement non-print Legal Deposit since April 2013
3.
Before (6 April 2013)
• Selective archiving of websites that
– reflect the diversity of lives, interests and activities throughout the UK
– contain research value or are of research interest
– feature political, cultural, social and economic events of national interest
– demonstrate innovative use of the web
– Also prioritise websites at risk and web-only content
• Permission based
– Permission to archive, to provide online access and to preserve. Also ask for 3rd party rights clearance
– 30% success rate, 5% explicit refusal (mostly due to 3rd party rights)
• Online access through UK Web Archive
4.
Toolset
• Selection and Permission Tool
– selection and permission management
– Integrated with the Web Curator Tool
• Web Curator Tool
– Job scheduling
– Metadata
– Access control
– Harvesting (uses Heritrix)
– QA
• Indexing and SIP generation – scripts and SOLR (for full-text index)
• Wayback – rendering tool for WARCs
• UK Web Archive – web-based end user interface
5.
Access
• Currently 3 ways to access the web archive
– Online through the UK Web Archive
– Catalogue records (of special collections)
– Keyword search through Primo (corporate resource discovery system)
• Conduct researcher survey / research projects to understand requirements
8.
UK Web Archive
• 14,118 websites, 60,482 instances, 17.6TB WARCs
• Over 182,761 unique visits, 1 April 2012 to 31 March 2013
• Key websites include videos
• Full-text, N-gram, title and URL search
• Browse by subject / special collection, visual browsing
• Analytical access
http://www.webarchive.org.uk
9.
Analytical access
• Shift of focus from the level of single webpages or websites to the entire
web archive collection.
• Use web archives as datasets, access to metadata and knowledge
about websites
• Support survey, annotation, contextualisation and visualisation
• Allows discovery of patterns, trends and relationships in inter-linked
web pages
• Helps address a number of challenging issues
– Scalability
– Accessibility of individual websites
– Components missed by crawlers
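To illustrate what treating the archive as a dataset can look like, here is a minimal sketch that counts how often a phrase appears per year across a set of archived texts. It is not the UK Web Archive's N-gram implementation: the `(year, text)` input shape, the function names and the sample data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Split text on whitespace and return its word n-grams, lower-cased."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_trend(docs, phrase):
    """Count occurrences of `phrase` per year.

    `docs` is an iterable of (year, text) pairs; this shape is a
    hypothetical stand-in for text extracted from archived pages.
    """
    n = len(phrase.split())
    counts = defaultdict(int)
    for year, text in docs:
        counts[year] += Counter(ngrams(text, n))[phrase.lower()]
    return dict(sorted(counts.items()))


# Hypothetical sample data, for illustration only.
sample = [
    (2011, "legal deposit is coming"),
    (2013, "legal deposit has started and legal deposit applies"),
]
trend = ngram_trend(sample, "legal deposit")
```

The point of the sketch is the unit of analysis: the result describes the collection over time rather than any single page.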
10.
After (6 April 2013)
• Government introduced Non-print Legal Deposit Regulations 2013
• Apply to material published digitally and online, including articles, books and websites.
• 6 UK Legal Deposit Libraries
• Deposited content accessible “on library premises controlled by the
deposit library”
– after 7 days of collection or deposit
– Single concurrent access
– Catalogue records allowed to be searchable online
– Digital copying not permitted
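The access conditions above can be read as a simple check. The sketch below is one possible reading, not the regulations' wording or any library's actual system: it assumes "after 7 days" means at least 7 full days have elapsed, and that "single concurrent access" applies per item. The function and parameter names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: the waiting period is counted in whole days from collection or deposit.
EMBARGO = timedelta(days=7)


def may_open(collected_on, today, on_premises, copies_in_use):
    """Return True if a reader may open a deposited item under the assumed rules."""
    if not on_premises:
        # Access only on premises controlled by a deposit library.
        return False
    if today - collected_on < EMBARGO:
        # Still within the 7-day period after collection or deposit.
        return False
    # Single concurrent access: nobody else may have the item open.
    return copies_in_use == 0


ok = may_open(date(2013, 4, 8), date(2013, 4, 15), on_premises=True, copies_in_use=0)
```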
11.
Legal Deposit of UK websites
• In scope
– Sites that use a .uk or other UK geographic top-level domain
– Sites where part of the publishing process takes place in the UK
• Will not archive
– sites concerning film and recorded sound where the audio-visual
content predominates
– private intranets and emails
• Over 10 million .uk registered domains
– 4th largest TLD, after .com, .de and .net
– UK organisations also use non .uk domain names (eg .com or .org)
– scale unknown
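The scope rules above suggest a two-step test: the domain suffix can be checked automatically, while the place of publication cannot be read off a URL. The sketch below shows that split under stated assumptions. Only the `.uk` suffix is checked (other UK geographic TLDs would have to be added to the hypothetical `UK_SUFFIXES` tuple), the exclusions for audio-visual sites, intranets and emails are not modelled, and all names are illustrative rather than taken from the British Library's tooling.

```python
from urllib.parse import urlsplit

# Assumption: only ".uk" is listed; extend with other UK geographic TLDs as needed.
UK_SUFFIXES = (".uk",)


def in_scope_by_domain(url, suffixes=UK_SUFFIXES):
    """Return True if the URL's host ends with one of the given suffixes.

    The URL must include a scheme (e.g. "http://"), otherwise no host is parsed.
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(host.endswith(s) for s in suffixes)


def classify(url, published_in_uk=None):
    """Classify a URL as 'in scope', 'out of scope' or 'unknown'.

    `published_in_uk` is a hypothetical flag recording whether part of the
    publishing process is known to take place in the UK (None = not known).
    """
    if in_scope_by_domain(url):
        return "in scope"
    if published_in_uk is True:
        return "in scope"
    if published_in_uk is False:
        return "out of scope"
    return "unknown"
```

For example, `classify("http://www.bl.uk/")` returns `"in scope"`, while a `.com` or `.org` site with no publication information comes back as `"unknown"`, which mirrors the "scale unknown" point above.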
12.
Collecting strategy
• Domain crawl
– Broad sweep of UK domain
– Once or twice a year
• Events & key sites and news
– Events of UK interest
– High value, high impact sites
– National & regional news
• Special Collection
– Focused, thematic collections
– Support priority subjects
13.
Access strategy
• Deposited content cannot be accessed outside the reading
rooms.
• Online access can be provided to metadata and selected content
to showcase the Legal Deposit web archive of the UK
– Bibliographic metadata
– Analysis and visualisation of aggregated content
– Statistical and contextual data
– Copy of deposited content with direct permission
• For sites from outside the UK, permission both to harvest and for
public access will be required
14.
Before and after: what has changed
• Everything!
Aspect | BEFORE | AFTER
Scale | 14,000 | 4-5 million
Purpose | Advocacy, demonstrating benefits | Legal Deposit
Workflow (and tools) | Selection prior to harvesting | Selection / curation can happen post harvesting
Permission to archive | Required | Can collect in-scope material without permission
Access | Online | Reading rooms only (unless with direct permission for online access)
Nature of QA | Quality control leading to deselection | Flagging up quality issues
Ownership | British Library | Legal Deposit Libraries
15.
Progress
• Experimental domain crawl in August-December 2012, no access
– Started with 4.8 million seeds
– Collected 27TB of data plus 1TB of crawl logs
• 1st Legal Deposit domain crawl started in April
– Started with 3.8 million seeds
– Ran from 8th April to 21st June and collected over 31TB of data
• Focused collection on National Health Service Reform
– Showcase end-to-end processes including ingest and access in
reading room in early July
• Selecting key sites, news sites and events
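For a rough sense of scale, the figures for the first Legal Deposit domain crawl can be turned into averages. This is back-of-the-envelope arithmetic only: it assumes the crawl dates fall in 2013, uses decimal units (1TB = 1,000,000MB), and treats "over 31TB" as exactly 31TB, so the results are lower bounds. The function name is hypothetical.

```python
from datetime import date


def crawl_summary(start, end, tb_collected, seeds):
    """Return average volume per day and per seed for a crawl."""
    days = (end - start).days
    return {
        "days": days,
        "tb_per_day": tb_collected / days,
        # Decimal conversion: 1 TB = 1,000,000 MB.
        "mb_per_seed": tb_collected * 1_000_000 / seeds,
    }


# Figures from the slide; the year 2013 is an assumption from context.
summary = crawl_summary(date(2013, 4, 8), date(2013, 6, 21), 31, 3_800_000)
```

Under these assumptions the crawl ran for 74 days, averaging roughly 0.42TB per day and about 8.2MB per seed.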