The web is a mess: how I learnt to stop worrying and love web archiving. Kristine Hanna

The Web is a Mess
How I learned to stop
worrying and love web
archiving

About Internet Archive
We are a Digital Library
Mission Statement: Universal access to all knowledge
o Founded by Brewster Kahle in San Francisco, California in 1996
o Officially designated a Library by the State of California in 2007

o 500,000 Books
o 500,000 Moving Images
o 1,000,000 Audio Recordings
o 2,000,000 Hours of TV
o 3,600,000 eBooks
Image credit: http://flickr.com/photos/marfis75/

Access to General Web Archive
The Archive is accessible to the public via the website: www.archive.org
o Started collecting content in 1996
o First web pages publicly available in 2001
o 347+ billion web pages
o 200+ million websites
o Almost every domain
o Content in 140+ languages
o Collects a broad snapshot of the web every 30-60 days (approximately 10 billion pages per snapshot)
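
As a small illustration of public access, archived pages are typically requested through a replay URL that combines a timestamp with the original address. The sketch below assumes the commonly used Wayback Machine pattern (web.archive.org/web/<14-digit timestamp>/<original URL>); the function name and the example address are hypothetical and not part of the talk.

```python
from datetime import datetime

# Assumption: the public replay prefix used by the Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a replay URL that asks for the capture closest to `when`."""
    # Replay timestamps are written as YYYYMMDDhhmmss (14 digits).
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Hypothetical example address, used only for illustration.
print(wayback_url("http://example.com/", datetime(2013, 6, 1)))
```

If no capture exists at exactly that moment, the replay service is expected to redirect to the nearest one it holds, so the timestamp works as a target date rather than an exact key.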

What is Web Archiving?
Web archiving is the process of collecting portions of web content, preserving the collections, and then providing access to the archives for use and reuse.

What is a Web Archive?
A web archive is a collection of archived URLs grouped by theme, event, subject area, or web address.
A web archive contains as much as possible from the original resources and documents how they change over time. A priority is to recreate the experience a user would have had if they had visited the live site on the day it was archived.
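
Because an archive holds many captures of the same URL over time, recreating "the day it was archived" usually means picking the capture nearest to a requested date. The following is a minimal sketch of that lookup, assuming the capture dates for one URL are kept in a sorted list; the function name and sample dates are illustrative only.

```python
from bisect import bisect_left
from datetime import date
from typing import Sequence


def closest_capture(capture_dates: Sequence[date], wanted: date) -> date:
    """Return the capture date nearest to `wanted`.

    Assumes `capture_dates` is sorted in ascending order and non-empty.
    On a tie, the earlier capture is returned.
    """
    if not capture_dates:
        raise ValueError("no captures available for this URL")
    i = bisect_left(capture_dates, wanted)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = capture_dates[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda d: abs(d - wanted))


# Hypothetical capture history for a single URL.
captures = [date(2011, 1, 25), date(2011, 2, 11), date(2011, 6, 1)]
print(closest_capture(captures, date(2011, 2, 1)))
```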

Who is web archiving?

Why are We Doing This?
• Web archives preserve the web. They act as the web equivalent of the archive or library. In this role, their mission is to acquire and preserve the web, ensuring its continued survival for future generations.
• Billions of people around the world have grown
accustomed to using the web as their primary
resource to acquire information.
• The availability of this electronic information is taken for granted, but it is a fallacy that something on the web will be there forever.
• There’s an essential need for people to understand
that the web represents who we are. It’s our culture
and our social fabric, and we don’t want to lose it.

Why should we archive
the web?

How long does a website live?
• A 1997 report in Scientific
American claims 44 days.
• A subsequent 2001 academic study in IEEE suggests 75 days.
• A 2003 Washington Post
article indicates the number is
100 days.
• A 2013 study by Old Dominion University says that nearly 11% of social media content is lost within the first year after publication, and that we continue to lose 0.02% per day after that.
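
To make the last figure concrete, here is a rough worked estimate using only the two numbers quoted above (about 11% lost in the first year, then 0.02% per day). It is a sketch, not the study's own model: the even spread of loss across the first year is an assumption made here so the function covers every age, and the function name is hypothetical.

```python
FIRST_YEAR_LOSS = 0.11     # roughly 11% lost after the first year (quoted figure)
DAILY_LOSS_AFTER = 0.0002  # 0.02% lost per day after the first year (quoted figure)
DAYS_PER_YEAR = 365


def estimated_fraction_lost(days_since_publication: int) -> float:
    """Estimate the fraction of shared content no longer available."""
    if days_since_publication < 0:
        raise ValueError("days_since_publication must be non-negative")
    if days_since_publication <= DAYS_PER_YEAR:
        # Assumption: loss accrues evenly during the first year.
        return FIRST_YEAR_LOSS * days_since_publication / DAYS_PER_YEAR
    extra_days = days_since_publication - DAYS_PER_YEAR
    # Cap at 1.0 because no more than everything can be lost.
    return min(FIRST_YEAR_LOSS + DAILY_LOSS_AFTER * extra_days, 1.0)


for years in (1, 2, 5):
    lost = estimated_fraction_lost(years * DAYS_PER_YEAR)
    print(f"after {years} year(s): about {lost:.1%} lost")
```

Under these assumptions, two years after publication the estimate is 0.11 + 0.0002 × 365 = 0.183, or roughly 18%.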

Web Archiving Use Cases
• Create a thematic/topical web archive on a specific subject
• Capture ‘at risk’ content during a spontaneous event
• Fulfill organizational mandate to preserve institutional memory & history
• Archive state/local agency publications no longer deposited in print form
• Archive records to meet university and/or government retention policies
• Collect content to act as a research service for scholars to turn to
• Capture social media sites as part of organizational records
• Collect web-based information to augment physical holdings
• Archive online art ephemera
• End of Life/Closure

What is a crawler?
A crawler is the
software that captures
and archives web
pages. A crawler visits a
page and indexes the
content included
therein
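
A toy version of that loop can be sketched as follows. It is an illustration under stated assumptions, not the crawler used in production: the `fetch` callable is a stand-in for real HTTP retrieval (passed in so the sketch runs offline), captured pages are simply held in a dictionary, and it follows every link it finds up to a page limit.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed: str, fetch: Callable[[str], Optional[str]],
          max_pages: int = 100) -> Dict[str, str]:
    """Breadth-first crawl from `seed`; returns {url: captured html}."""
    archive: Dict[str, str] = {}
    seen = {seed}
    queue = deque([seed])
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # hypothetical fetcher; returns None on failure
        if html is None:
            continue
        archive[url] = html  # "capture" the page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive


# A three-page fake site stands in for the live web.
site = {
    "http://example.org/": '<a href="/a">A</a> <a href="b.html#top">B</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/b.html": "no links here",
}
print(sorted(crawl("http://example.org/", site.get)))
```

A real crawler would also need politeness delays, robots.txt handling, and durable storage, none of which are shown here.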

Some technical challenges in capturing content
• Dynamic content relies on scripting technologies (Flash and JavaScript). The web is a hodgepodge of technologies, some old and outdated, others at the cutting edge.
• Capturing social media sites has become necessary as the web moves away from HTML and towards applications.
• Explore capture mechanisms beyond the traditional crawler: hybrid architectures, APIs, headless browsers.

Challenge: a lot of data
[Figure: amount of content that is being archived, compared with amount of data being created by content providers]
Image credits:
http://www.chaitalag.com/new/s/tubig
http://www.helenbrowngroup.com/2011/02/rescue-from-the-digital-firehose/gushing-firehose-by-joseph-robertson/

Challenge: How much to archive?
There are limits…

Challenge: What to archive?
…What is important to you? What do you
want people to know about? What are your
organization’s collecting activities? Vision?

Participant Poll
• Does any of this make
any sense?

Starting a Collection
Collection: A group of
URLs crawled and
organized around a
common theme, topic or
domain
Ask Yourself:
• What is the topic of this collection?
• What websites would you like to archive as
part of this collection?

Collections Start with Seeds
• Seed: starting point URL
for the crawler. The crawler
will follow linked pages
from your seed URL and
archive them if they are ‘in
scope’.
• Document: any file with a
distinct URL (html, image,
PDF, video, etc).
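
What counts as ‘in scope’ is a policy choice made per collection. As a sketch of one possible rule (an assumption for illustration, not a description of any particular service's default), the function below treats a linked URL as in scope when it is on the same host as the seed and sits under the seed's directory.

```python
from urllib.parse import urlsplit


def in_scope(candidate_url: str, seed_url: str) -> bool:
    """Assumed rule: same host as the seed, and path under the seed's directory."""
    seed = urlsplit(seed_url)
    candidate = urlsplit(candidate_url)
    if candidate.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, and similar links
    if (candidate.hostname or "") != (seed.hostname or ""):
        return False  # a different host is out of scope under this rule
    # "/blog/index.html" -> "/blog/", "/blog/" -> "/blog/", "" -> "/"
    seed_directory = seed.path.rsplit("/", 1)[0] + "/"
    return (candidate.path or "/").startswith(seed_directory)


seed = "http://example.org/blog/"  # hypothetical seed URL
for link in ("http://example.org/blog/post-1",
             "http://example.org/about",
             "http://other.example.com/blog/x"):
    print(link, in_scope(link, seed))
```

Broader rules (a whole domain) or narrower ones (a single page plus its embedded images) follow the same shape; only the comparison changes.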

Some of our Partners’
Digital Collections
• Stanford University (Palo Alto, California)
• American University in Cairo
• Biblioteca Nacional de España

Stanford University,
Islamic & Middle Eastern Collection
Use Case: harvest and preserve Iranian blogs
• Archiving over 300 blogs written by and for
Iran and the Iranian people
• Includes coverage of 2009 Iranian elections
and the current Middle East unrest

Stanford and New York Universities
Islamic and Middle Eastern Collection

American University in Cairo
Use Case: The American University in Cairo
Web Archive collects, preserves, and
provides access to the web content
published by students, faculty, departments,
and offices at AUC. The archive also collects
Web documents that have long-term
research or historical value.

January 25th Revolution and
University on the Square
Demonstrators in Tahrir Square.
Image courtesy of Ahmad and the American University in Cairo Rare Books and Special Collections Library.

Archivist Driven Captures
Thank you to Egypt's youth and Facebook.
Image courtesy of Martin and Amy Rowe and the American University in Cairo Rare Books and Special Collections Library.

Patron Driven Captures
Screenshot of the University on the Square Contribution form.
In addition to soliciting photos and videos, we asked content providers to suggest websites, blogs, Twitter feeds, etc.

Archivist as Advocate
Protester documenting the demonstrations in Tahrir Square.
Image courtesy of Robeir Rasmy and the American University in Cairo Rare Books and Special Collections Library.

Breaking down the life cycle
Biblioteca Nacional de España
• One of its top priorities as a memory institution is to consolidate the strategies that lead to the integral preservation of content published on the Spanish Internet, in accordance with the library's mission as keeper and disseminator of Spanish culture.
• Commitment to its patrons, who expect the web archive to become a publicly and freely accessible key information source for the study of the 21st century.

Biblioteca Nacional de España
Use cases:
• 2011 Election crawl
• 2012 Humanities crawl
• 2009-present .es domain crawls
• 2013 .es Broad Survey Crawl, which visited the top-level page of every website registered under .es (in partnership with Red.es)
• 2011-2013 Thematic curation (World Cups, Olympics, Global Hunger)

http://www.udatleticoisleño.es

http://www.facebook.com/eajpnv

http://twitter.com/xalmar
• Archived web page from Facebook and/or Flickr

http://es.wikipedia.org/wiki/Partido_Pirata_(España)

Making sense of it all
• Web Archiving Life Cycle Model
• Internet Archive future objectives
– Social Media
– Distributed Content
– Visualization and analytical tools for more
useful interaction
– Search
– Mobile platforms
– Enhanced Researcher Access

Web Archiving Life Cycle Model
Web Archiving Life Cycle Model white paper available: http://www.archive-it.org/publications

Outer layer:
• Vision and Objectives
• Resources and Workflow
• Access / Use / Reuse
• Preservation
• Risk Management
Inner circle:
• Appraisal and Selection
• Scoping
• Data Capture
• Storage and Organization
• Quality Assurance and Analysis

Participant Poll
• Are you confused yet?
I hope not. Happy to
answer questions!

The importance of web archiving
“As our digital world continues to grow at a breathtaking pace and more and more of our daily lives occur within its digital boundaries, we must ensure that web archives are there to preserve our collective global consciousness for future generations.”
Kalev H. Leetaru, University of Illinois

Kristine Hanna,
Director, Archiving Services
Internet Archive
kristine@archive.org
Thank you!
