37. Oracle StorageTek SL8500
• up to 64 tape drives
• up to 8500 tapes
• up to 8 hand pickers
• up to 32 linked libraries
Primary storage: 2 libraries, 16 PB maximum
Backup storage: 2 libraries, 16 PB maximum
76. Set
[Diagram: issues of Le Matin (01/07/1882 to 01/03/1883) stored as individual AIPs and grouped into one set]
Set:
• Contains nothing but metadata
• Curator information; allows grouping of AIPs sharing the same intellectual content
AIP:
• Must contain the files to be preserved
• Each AIP is an autonomous unit
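A minimal Python sketch of the set/AIP model above; the class and field names are hypothetical illustrations, not part of any archiving system:

```python
# Hypothetical sketch of the set/AIP model above: an AIP is an autonomous
# unit that must contain the files to preserve; a set holds only curator
# metadata and groups AIPs sharing the same intellectual content.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AIP:
    identifier: str       # e.g. one digitised issue of a newspaper
    files: List[str]      # must contain files to be preserved

@dataclass
class IntellectualSet:
    title: str                                      # metadata only, no files
    members: List[AIP] = field(default_factory=list)

le_matin = IntellectualSet(title="Le Matin")
le_matin.members.append(AIP("lematin-1882-07-01", ["page-001.tiff", "mets.xml"]))
```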
94. Web archiving at the British Library
Helen Hockx-Yu
Head of Web Archiving
95. Overview
> Part 1: Background, history and organisation
> Part 2: Web Archiving Tools (including demos)
> Part 3: Access
> Part 4: Non-print Legal Deposit and future strategy
96. BL Structure
> BL Board and Executive Team
> e-Strategy and Information Systems (eIS)
> IT-based products and services
> Finance and Corporate Services (F&CS)
> Money
> Human Resources
> People
> Operations & Services (O&S)
> Front line services
> Scholarship and Collections (S&C)
> Content (Arts and humanities, Social Sciences, Science, Technology & Medicine)
> Strategic Marketing and Communications (SMC)
> Brand and reputation
98. Current web archiving strategy
> Selective archiving of websites (4 areas) that
> reflect the diversity of lives, interests and activities throughout the UK
> contain research value or are of research interest
> feature political, cultural, social and economic events of national interest
> demonstrate innovative use of the web
> Also prioritise websites at risk and web-only content
> Permission based
> Permission to archive, to provide online access and to preserve. Also ask for 3rd party rights clearance
> 30% success rate, 5% explicit refusal (mostly due to 3rd party rights)
> Online access through UK Web Archive
> Expect to crawl at domain level (from April 2013) for Non-print Legal Deposit
99. The current Web Archiving team
Skills Profile
> IT
> Collection management, digital curation
> Management
> Communications
> Web Archiving
100. (Internal Collaboration)
> The Web Archiving Team is involved in the end-to-end process but works with other departments / teams in the library
Department / Team: Activity / Support
S&C (subject specialist group; Curator's Choice project): selection, curation
eIS: network, hardware and IT support
O&S (Resource Discovery & Research): corporate-level resource discovery, http://explore.bl.uk/
CA&D: digital processing; cataloguing (special collection level)
SMC: publicity, press releases, events
The Legal Deposit Programme: domain crawl capability / process and policy
101. Curator’s Choice
> Pilot project with a small group of dedicated curators / subject specialists
> Special Collections of the curators' choice. Curators take responsibility for owning, maintaining and growing the collections over time
> Evolving Role of Libraries in the UK
> Political Action and Communication
> Slavery and Abolition in the Caribbean
> UK relations with the Low Countries
> 19th Century English Literature
> Oral History in the UK
> Film in the UK
> Energy
102. Web Archiving Advisory Group
> Provide advice and support to the Web Archiving Team
> Act as a ‘critical friend’ to assist in the development of policy and practice
> Specific advice and support on:
> Purpose, vision and benefits.
> Strategic direction and planning.
> Synergy with internal teams and collaboration with external stakeholders/partners
> Policy changes and risk management
103. (External) Collaboration
> UK Web Archiving Consortium (2004-2007): centralised infrastructure and development, distributed collections
> UK Web Archive partners, National Archives, Legal Deposit Libraries (LDLs)
> External Collaborators
> Wellcome Library
> Live Art Development Agency
> The Cambridge Innovation Network
> The Women’s Library
> Institute of Historical Research, University of London
> Individual researchers, specialists
> General public – ca. 20 nominations / week
> National organisations: DPC, JISC
> International: IIPC
104. JISC UK Web Domain Dataset (1996-2010)
> Collaboration with JISC and the Internet Archive
> UK Web Domain Dataset (1996-2010): UK websites extracted from the Internet Archive's collection, supported by funding from the JISC
> 35TB research dataset
> No local access to individual websites, but access to a secondary dataset is allowed
> BL has developed visualisations of the dataset
> JISC funded 2 further projects using this dataset
> Analytical Access to the Domain Dark Archive
> Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research
105. Web Archiving Tools
> Support key processes: selection, harvesting, storage, access, preservation
> Mostly open source tools, some developed in-house
> New tools / changes to current tools are expected when business processes change due to non-print Legal Deposit
106. Selection Tools
> Selection: decide which websites to archive and to include as part of a web archive collection
> Selection and Permission Tool: https://wct.bl.uk/selection/
> Submit selection – real-time checking of duplicates, fetching meta tags from live sites
> Collect metadata
> Add contact details
> Suggest crawl frequency
> Permissions management – send emails, direct users to the online licence form, store the completed forms, pass details to WCT (create an authorisation record and a pending target)
> Reports
> Twittervane
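As an illustration of the "fetching meta tags from live sites" step, here is a hedged Python sketch using the requests library and the standard-library HTML parser; it is not the Selection and Permission Tool's actual code:

```python
# Illustrative sketch of fetching <meta> tags from a live site at selection
# time (not the Selection and Permission Tool's own implementation).
from html.parser import HTMLParser
import requests

class MetaTagParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            attrs = dict(attrs)
            name = attrs.get('name') or attrs.get('property')
            if name and 'content' in attrs:
                self.meta[name] = attrs['content']

def fetch_meta_tags(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta

print(fetch_meta_tags('http://www.webarchive.org.uk'))
```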
107. Harvesting Tools
> Harvesting: automated downloading of selected websites using crawler software; quality assurance is regarded as an element of this process
> The Web Curator Tool (WCT): https://wct.bl.uk/wct/
> Job scheduling
> Metadata
> Access control
> Harvesting (uses Heritrix)
> QA
108. Quality Assurance
> Places more emphasis on intellectual content than on the appearance or behaviour of a website
> Use four aspects to define quality:
> Completeness of capture: whether the intended content has been captured as part of the harvest
> Intellectual content: whether the intellectual content (as opposed to styling and layout) can be replayed in the Access Tool
> Behaviour: whether the harvested copy can be replayed including the behaviour present on the live site, such as the ability to browse between links interactively
> Appearance: the look and feel of the website
> Rely on visual comparison, previous harvests & crawl logs
> Recent development of a QA module to allow bulk operation, reduce the number of clicks and make QA recommendations
109. Supporting Long-term Preservation
> Storing data in WARCs and metadata in METS
> Migrate all legacy data into WARCs
> WCT outputs WARC files
> Submission Information Package (SIP) profiles for selective and domain crawls
> Storing descriptive metadata (e.g. permission information) and technical metadata (e.g. crawl logs, crawl configurations, virus scan events)
> Ingest archived websites into the Digital Library System (DLS)
> Command line tool generates SIPs
> Providing access from the DLS (in future)
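To make the WARC storage concrete, here is a minimal sketch that writes a single response record with the open-source warcio library; warcio is a later tool used here purely for illustration, not the WCT's own WARC writer:

```python
# Minimal illustration of writing one HTTP response record to a WARC file,
# using the open-source warcio library (illustration only, not BL/WCT tooling).
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open('example.warc.gz', 'wb') as fh:
    writer = WARCWriter(fh, gzip=True)
    http_headers = StatusAndHeaders('200 OK',
                                    [('Content-Type', 'text/html')],
                                    protocol='HTTP/1.1')
    record = writer.create_warc_record('http://www.example.org/',
                                       'response',
                                       payload=BytesIO(b'<html>archived page</html>'),
                                       http_headers=http_headers)
    writer.write_record(record)
```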
110. Demo (45 minutes)
> Selection and Permission Tool (https://wct.bl.uk/selection/)
> Web Curator Tool (https://wct.bl.uk/wct/)
111. Access
> Currently 3 ways to access the web archive
> Online through the UK Web Archive
> Catalogue records (of special collections)
> Keyword search through Primo (corporate resource discovery system)
> Conduct researcher survey to understand requirements
> Analytical access
113. Keyword search through Primo
114. UK Web Archive
> Websites archived by BL and partners since 2004 (65% by BL)
> 122,99 websites, 50,866 instances, 13.6 TB of WARCs
> Over 100,000 unique visits since 1st April 2012
> Key websites include videos
> Full-text, N-gram, title and URL search
> Browse by subject / special collection; visual browsing
http://www.webarchive.org.uk
115. Analytical Access
> Shift of focus from the level of single web pages or websites to the entire web archive collection
> Use web archives as datasets
> Support survey, annotation, contextualisation and visualisation
> Allows discovery of patterns, trends and relationships in inter-linked web pages
> Extracting value from the “haystacks”
> Helps address a number of challenging issues:
> Scalability
> Accessibility of individual websites
> Components missed by crawlers
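A toy illustration of the "web archives as datasets" idea: counting word bigrams over a small in-memory corpus, a stand-in for the archive-scale N-gram indexes mentioned in these slides. Pure standard library; the corpus is invented:

```python
# Sketch of dataset-style analysis: counting word bigrams across archived
# text (a toy stand-in for archive-scale N-gram indexes).
from collections import Counter
import re

def ngrams(text: str, n: int = 2):
    words = re.findall(r"[a-z']+", text.lower())
    return zip(*(words[i:] for i in range(n)))

corpus = ["The British Library archives the UK web",
          "The UK web changes quickly"]
counts = Counter(gram for doc in corpus for gram in ngrams(doc))
print(counts.most_common(3))
```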
116. Visualising the UK Web
> http://www.webarchive.org.uk/ukwa/visualisation
> N-gram search
> Links analysis
> Format Analysis
> Geo-index
> http://www.webarchive.org.uk/bluebox/
> uses the Memento aggregate TimeGate hosted by lanl.gov
> “resource not in archive” – who else has it?
> Open data
> Dataset and APIs for general use
> Enable the broader community to re-use, explore and visualise the content of the web archive
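The bluebox service above relies on Memento datetime negotiation (RFC 7089). The sketch below asks a TimeGate for the memento closest to a given date; the aggregator URL is an assumption about the current public endpoint:

```python
# Sketch of Memento datetime negotiation (RFC 7089): ask a TimeGate for the
# memento closest to a given date. The aggregator URL below is an assumption.
from datetime import datetime, timezone
from email.utils import format_datetime
import requests

TIMEGATE = 'http://timetravel.mementoweb.org/timegate/'   # assumed endpoint
target = 'http://www.bl.uk/'
when = datetime(2012, 11, 29, tzinfo=timezone.utc)

resp = requests.get(TIMEGATE + target,
                    headers={'Accept-Datetime': format_datetime(when, usegmt=True)},
                    allow_redirects=False)
print(resp.status_code, resp.headers.get('Location'))  # e.g. 302 + memento URL
```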
118. Non-print Legal Deposit: Time of change
> Expected to be in place in April 2013
> Access restricted to premises of Legal Deposit Libraries
> Library-wide Legal Deposit Programme to develop capability and end-to-end process
> Web Archiving Team acts as “technical supplier” for a number of projects
> Still need to work out how current (permission-based) selective archiving relates to domain crawling under Legal Deposit
> Will we request permissions for online access?
> Will we stop crawling some of the sites we are crawling now and include them in the annual / bi-annual broad domain crawl?
> Who does what?
119. Web Archiving Strategy
[Diagram: domain crawl as the base layer, with event / key-site crawls and special collections layered on top]
Domain harvesting: broad sweep of the .uk domain, once or twice a year
Events & key sites: events of national interest; sites that need to be captured frequently
Special Collections: focused, thematic collections that support priority subjects
120. Web Archiving Workshop
Leïla Medjkoune, Internet Memory
IIPC workshop, BNF, Paris, November 2012
121. Internet Memory
Internet Memory Foundation (European Archive)
• Established in 2004 in Amsterdam and then Paris
• Mission: preserve Web content by building a shared WA platform
• Actions: dissemination, R&D and partnerships with research groups and cultural institutions
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland
Internet Memory Research
• Spin-off of IM established in June 2011 in Paris
• Missions: operate large-scale or selective crawls & develop new technologies (crawl, access, processing and extraction)
122. Internet Memory Infrastructure
Green datacenters
Repository and data access for large-scale data management:
• HDFS (Hadoop File System): distributed, fault-tolerant file system
• HBase: a distributed key-value index
• A convenient model for temporal archives
• MapReduce: a distributed execution framework (see the sketch after this list)
• A reliable mechanism to run an analysis job on very large datasets
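A minimal Hadoop Streaming-style mapper and reducer in Python, counting archived records per MIME type; the tab-separated input layout is hypothetical, not IM's actual record format:

```python
#!/usr/bin/env python3
# Hadoop Streaming-style analysis sketch: count archived records per MIME
# type. Assumes tab-separated input lines of "url<TAB>mime_type" -- a
# hypothetical layout. Run with: script.py map | sort | script.py
import sys

def mapper():
    for line in sys.stdin:
        parts = line.rstrip('\n').split('\t')
        if len(parts) == 2:
            print(f'{parts[1]}\t1')          # emit (mime_type, 1)

def reducer():
    current, total = None, 0
    for line in sys.stdin:                   # input arrives sorted by key
        key, value = line.rstrip('\n').split('\t')
        if key != current:
            if current is not None:
                print(f'{current}\t{total}')
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f'{current}\t{total}')

if __name__ == '__main__':
    mapper() if sys.argv[1:] == ['map'] else reducer()
```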
123. Internet Memory
Focused crawling:
• Automated crawls
• Quality-focused crawls:
– Video capture, Twitter crawls
– Execution tools to overcome crawling issues on specific content
Large-scale crawling:
• In-house developed distributed software
• Scalable crawler (10-50 Bn pages)
• Also designed for focused crawls and complex scoping
124. Research projects and focus
Web Archiving and Preservation
✓ Living Web Archives (2007-2010)
✓ Archives to Community MEMories (2010-2013)
✓ SCAlable Preservation Environment (2010-2013)
Web-scale data Archiving and Extraction
✓ Living Knowledge (2009-2012)
✓ Longitudinal Analytics of Web Archive data (2010-2013)
✓ TrendMiner (2011-2014)
✓ DOPA (2012-2014)
✓ AnnoMarket (2012-2014)
125. Web Archiving project?
Organisational challenges:
• Selection/QA: librarian / archivist, quality assurance team, project manager
• Content capture / services development: engineers, developers, technicians
• Infrastructure deployment and maintenance: engineers, system administrators
➥ Web Archiving projects require strong competences and experienced human resources combined with a scalable infrastructure
126. IM Shared platform
Since its creation in 2004, the Internet Memory Foundation has worked in close collaboration with partner institutions and research groups through European projects:
• To develop methods and tools improving web archiving quality
• To grow its expertise and technological taskforce
127. Archivethe.Net (1)
• To mutualize knowledge and skills between institutions
• To share internal developments with partner institutions
• To cut services and R&D costs
128. Archivethe.Net (2)
• Archivethe.net is a shared web archiving platform associated with a service.
• The platform combines new technology and user needs to ensure good service quality in terms of reliability and efficiency.
• For whom? Our current partners, our new partners and … ourselves.
129. Benefits?
• Integrated web archiving process: from selection to access
• Ongoing technological developments through specific or common R&D projects
• Dedicated and highly skilled team to follow partners’ projects
• Dedicated infrastructure
130. How does it work? (1)
• ATN is designed as SaaS (Software as a Service)
• The platform offers a friendly user interface to record partners’ web archiving orders
• A pipeline organizes and manages the production
• A QA team ensures the quality of the archive to meet partners’ requirements
132. ARCOMEM Archivist tool?
Set up and follow web archive campaigns
• V1: a crawler cockpit and a search and retrieval application
Intelligent content acquisition:
• Seed URLs
• Keywords
• Social web sites’ APIs
• Social Media Categories (SMC)
133. SARA
Search and retrieval interface:
• Advanced search functionalities
• Filtering via faceting
• Sorting by content type, social media platform, text/image contextual information (event, entity, ...), etc.
134. Crawler Cockpit Interface
• Create/select a campaign
• Describe the campaign (title, description, comments, etc.)
• Define scope: select criteria such as language, keyword, URL, organisation, etc.
• Select social media categories and APIs to explore
• Set precedence rules for certain content types or sources (images, videos, tweets, news, etc.)
136. ARCOMEM Archivist Tool V2
• Refinement mode: refine crawl parameters to improve crawls
• Improved access application (SARA): a preview function so that users can review the results of the campaign set-up
137. QA for Web Archives?
IM QA is based on:
• Tools developed internally
• Tools developed in the context of European projects
• Automated processes
• The knowledge and skills of our crawl engineer and QA teams
138. QA Methodology and tools?
Methodology
• Based upon crawler behaviour
• Based on institutions’ needs and policy
• Can be manual (visual) or “automated”
• Can be carried out at pre- or post-crawl time
Tools
• Open source tools such as plugins, proxies, etc.
• Internally developed tools (fetchers, automated checks, etc.)
• Bug trackers to record information and communicate with partner institutions
139. QA Methodology and tools?
SCAPE: Scalable Preservation Environments
• Automate visual QA to detect rendering issues:
– Improve archive quality and cut QA costs
– Feed “preservation watch and planning” tools
• First test made on over 400 pairs of URLs
• In-house “execution platform” under deployment
• Results and processes to be disseminated to IIPC members for feedback!
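A toy stand-in for the automated visual QA idea: measure how much two page screenshots differ pixel-wise with Pillow. This sketches the concept only; it is not the SCAPE project's comparison pipeline:

```python
# Toy stand-in for automated visual QA: measure how much two page
# screenshots differ pixel-wise (not the SCAPE comparison pipeline).
from PIL import Image, ImageChops

def difference_ratio(live_png: str, archived_png: str) -> float:
    a = Image.open(live_png).convert('RGB')
    b = Image.open(archived_png).convert('RGB').resize(a.size)
    diff = ImageChops.difference(a, b).convert('L')       # per-pixel difference
    histogram = diff.histogram()                          # 256 brightness bins
    changed = sum(count for level, count in enumerate(histogram) if level > 16)
    return changed / (a.size[0] * a.size[1])

# e.g. flag a rendering issue if more than 5% of pixels changed noticeably:
# print(difference_ratio('live.png', 'archived.png') > 0.05)
```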
140. Technical challenges
Capture
• Dynamically generated content, deep web, etc.
• Non-HTTP protocols (e.g. RTMP)
• Social media platforms, ...
Access
• Replicate live functionalities and look & feel
• Provide access to very large files
➥ Fast-evolving technologies
➥ Ephemeral content
➥ Multiplication of production means: increase of user-generated content
141. Technical Solutions
• Execution-based crawling (vs parsing): see the sketch below
• API crawling
• Application-aware crawling
• Bespoke fetchers
➥ Orchestration of tools
[Figure: ARCOMEM content acquisition]
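A hedged sketch of execution-based capture: render the page in a headless browser so that JavaScript-generated content reaches the DOM before extraction. Selenium with headless Firefox is my tool choice for illustration, not necessarily IM's stack:

```python
# Execution-based (vs parsing-based) capture: let a real browser engine run
# the page's scripts, then read the resulting DOM. Illustrative tool choice.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
try:
    driver.get('http://example.org/')
    rendered_html = driver.page_source            # DOM after script execution
    links = [a.get_attribute('href')
             for a in driver.find_elements(By.TAG_NAME, 'a')]
    print(len(rendered_html), len(links))
finally:
    driver.quit()
```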
142. Technical Solutions
Access tool:
• Player replacement: reproduce players’ functionalities
• Adapt the access solution to the type of content/platform (generic solutions)
Storage infrastructure / format:
• Enable access to large files
• Fast access to large amounts of content to facilitate search & retrieval (see the range-request sketch below)
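Serving slices of very large archived files usually rests on standard HTTP range requests; a small sketch, assuming a server that supports ranges and using a hypothetical URL:

```python
# Accessing a slice of a very large archived file with an HTTP Range request
# (standard HTTP/1.1; assumes the server supports ranges; URL is hypothetical).
import requests

url = 'http://archive.example.org/large-collection.warc.gz'
resp = requests.get(url, headers={'Range': 'bytes=0-1023'}, timeout=30)
if resp.status_code == 206:          # 206 Partial Content
    first_kilobyte = resp.content
    print(len(first_kilobyte))
```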
143. Use cases
• Social media capture and access:
– YouTube
– Twitter
– Flickr, etc.
• Web Archiving related services:
– Redirection service
– Memento
• Legal issues with captured content
• Full-text search
• etc.