Capture All the URLs: First Steps in Web Archiving

Kristen Yarmey, Digital Services Librarian, University of Scranton

Judy Silva, Fine & Performing Arts Librarian and Archivist, Slippery Rock University of Pennsylvania

Alexis Antracoli, Records Management Archivist, Drexel University

Where We’re Going
Kristen:
• Intro to web archiving
• Web archives in higher ed
• Archive-It and other tools

Judy:
• First steps
• Getting buy-in
• Selecting and scoping

Alexis:
• Metadata
• Policies
• Workflow

All:
• Challenges
• Lessons learned
• What’s next?
• Q&A

What do we put on the web?
• University publications
– Course catalogs
– Student handbooks
– Newsletters
– Press releases
– Alumni Journal
– Admissions viewbook
– University calendar

• Governance/Planning documents and records
– Policies
– Assessment reports (Fact Book)
– Faculty Senate agendas, minutes, and reports
– Presidential announcements
– Email

• Campus life
– Student clubs
– Housing contract
– Wellness programming
– Community outreach
– Athletics scores
– Alumni class pages

• Events
– Presidential inauguration
– New building construction/dedication

• Social Media presence
– Facebook
– Twitter
– Blogs
– YouTube
– …

Web Archiving in Higher Ed
“We have the responsibility to preserve things like course information, course roster information and policies – all sorts of things that we used to get in paper but are now just showing up as websites.”
Dean B. Krafft, Chief Technology Strategist, Cornell University

“Almost every office and unit on campus has a web site with business information. ... Many of our campus publications are only on the web now as pdfs or html. [This content] isn’t preserved anywhere else.”
Ed Busch, Electronic Records Archivist, Michigan State University

Goals:
• Preserve dynamic content
– Text
– Images
– Animation
– Video
– …

• Preserve context
– Hyperlinks
– Embedded media
– Document method and date of capture
– Relate to prior and later versions

• Provide access
– Full text search
– Browsability
– User-friendly interface

Once something is posted on the web, it’s there forever… right?

New York Times, September 23, 2013

Web Archiving in Higher Ed
“One finding revealed by the survey was the preponderance of universities that have initiated web archiving programs in the last 5 years.”
National Digital Stewardship Alliance, Web Archiving Survey Report, June 2012

Tools


Tools: In-House Options
Proprietary tools:
• Adobe Acrobat – convert websites into PDFs (internal links remain active but other dynamic functionality is lost)
• Grab-a-Site and WebWhacker – download files from a website
• Teleport Pro – “webspidering”
Open source tools:
• Heritrix – crawler
• HTTrack – downloads web content to a local directory
• Wayback – discovery
• Memento – access framework
• NutchWAX – search
• Solr – search
• WARCreate – Google Chrome extension for creating WARC files (view with Wayback, store your own data)
• Wget – retrieve files from a website
• Web Curator Tool – workflow management
• NetarchiveSuite – software package
• Xenu’s Link Sleuth – finds broken links

Tools: Outsourcing Options
Vendor services
• Archive-It
• California Digital Library Web Archiving Service (WAS)
• OCLC Web Harvester

Archive-It
• Subscription service
• Branch of nonprofit Internet Archive
• Crawls, harvests, and hosts web content, using open source tools and standard formats
• Yearly fees, based on “data budget”

Archive-It: Partners
• 279 collecting organizations total
• 118 colleges & universities

Pennsylvania Partners:
• Bryn Mawr, Haverford, and Swarthmore (joint, 2005)
• Bucknell (2012)
• Chemical Heritage Foundation (2010)
• Curtis Institute of Music (2010)
• Drexel (2009)
• Free Library of Philadelphia (2010)
• Gettysburg College (2013)
• La Salle University (2012)
• Pennsylvania State University (2012)
• Slippery Rock University of Pennsylvania (2011)
• Temple University (2013)
• University of Pennsylvania Law School (2011)
• University of Scranton (2012)

Archive-It: Crawl
• Collection
• Seeds (regular, one-time, or RSS)
• Documents = any file with a distinct URL, including…
– HTML
– Images
– Video
– Audio
– PDF
–…

• Scope = which URLs are captured and which are not
• Frequency = how often seed is crawled

Archive-It: Access
Users can:
• Search
• Browse

From:
• Archive-It website
• Portal page
• Embedded search boxes
• Library catalog
• Finding aids
• 404 error pages
• Wayback Machine

Content can be public or private.
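
One of the access points above, the 404 error page, can point visitors to an archived copy of a page that no longer exists. The sketch below builds such a link using the public Wayback Machine address pattern (base address, 14-digit timestamp, original URL). The function name `wayback_url` is made up for illustration, and the base address for an Archive-It collection differs and is not shown here.

```python
from datetime import datetime

# Base address of the public Wayback Machine. An Archive-It collection
# uses its own base address, which this sketch does not cover.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback-style link for original_url near the given date."""
    # Wayback timestamps are written as YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(wayback_url("http://www.whitehouse.gov/", datetime(2013, 9, 23)))
```

A custom 404 page could call a helper like this with the requested URL so that the visitor lands on the capture closest to the chosen date.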

Archive-It: Manage
Metadata
• Dublin Core
• Collection, seed, document level

Storage
• Archive-It hosts content and backup on multiple servers
• Partner can request copy of data

Support
• Training sessions
• Partner support
• User community

Campus Stakeholders
• Library
• Information Technology
• Public Relations
• Administration
• New President

Funding
• Information Technology
• Public Relations
• Library
• Provost’s Office
• Grants
• Donors

Selecting More Content
• Athletics
• Student organizations
• Alumni
• President’s page
• Provost’s page
• 125th Anniversary
• University Curriculum Committee minutes (password protected)

What is a seed?
• A seed is any URL that you want to capture:
– An entire website: http://www.whitehouse.gov/
– A specific part of a website: http://www.whitehouse.gov/issues/foreign-policy/
– A specific URL: http://www.whitehouse.gov/sites/default/files/rss_viewer/national_security_strategy.pdf
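
The three kinds of seed above can be read as a simple prefix rule: a seed ending in "/" covers everything beneath it, and any other seed covers only itself. The sketch below illustrates that reading. It is a simplification for discussion, not Archive-It's actual scoping logic, and the function name `in_scope` is hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(seed: str, candidate: str) -> bool:
    """Return True if candidate falls under the seed (simplified rule)."""
    seed_parts = urlsplit(seed)
    cand_parts = urlsplit(candidate)

    # A URL on a different host is never in scope in this sketch.
    if seed_parts.hostname != cand_parts.hostname:
        return False

    # Treat an empty path as the site root.
    seed_path = seed_parts.path or "/"
    cand_path = cand_parts.path or "/"

    # A seed ending in "/" is a site or a section: match by path prefix.
    if seed_path.endswith("/"):
        return cand_path.startswith(seed_path)

    # Otherwise the seed names one specific document.
    return cand_path == seed_path


if __name__ == "__main__":
    section = "http://www.whitehouse.gov/issues/foreign-policy/"
    print(in_scope("http://www.whitehouse.gov/", section))
    print(in_scope(section, "http://www.whitehouse.gov/blog/"))
```

Under this rule the whole-site seed covers the foreign-policy section, while the foreign-policy seed does not cover other parts of the site.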

Building the Program
• Policy
• Records Management Benefits
• Standardizing Metadata
• Developing Quality Control Procedures
• Working within organizational constraints

Collection Development
• Developed policy
– Mission
– Scope
– Designated Community
– Intellectual Property
– Access

• Determined/reviewed seeds to crawl and frequency
• Maintain an up-to-date list of seeds that are regularly crawled

“Brasseri F – Archives oubliees,” by GuillaBar.
http://www.flickr.com/photos/guillabar/8666232614/

Updating Metadata
• Selected fields to use consistently:
– Title
– Creator
– Description
– Collector

• Standardized names
• Eliminated groups
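
The two steps above (a fixed set of fields and standardized names) could be enforced with a small helper before metadata is entered. The sketch below is an illustration only: the `STANDARD_NAMES` lookup, its entries, and the function names are hypothetical and do not come from any partner's actual workflow.

```python
from dataclasses import dataclass, asdict

# Hypothetical lookup that maps name variants to one standardized form.
STANDARD_NAMES = {
    "drexel": "Drexel University",
    "drexel univ.": "Drexel University",
}


@dataclass
class SeedMetadata:
    """The four fields selected for consistent use."""
    title: str
    creator: str
    description: str
    collector: str


def standardize(name: str) -> str:
    """Return the standardized form of a name, or the trimmed name itself."""
    cleaned = name.strip()
    return STANDARD_NAMES.get(cleaned.lower(), cleaned)


def build_record(title: str, creator: str, description: str, collector: str) -> SeedMetadata:
    """Create a record, standardizing names and rejecting empty fields."""
    record = SeedMetadata(
        title=title.strip(),
        creator=standardize(creator),
        description=description.strip(),
        collector=standardize(collector),
    )
    missing = [field for field, value in asdict(record).items() if not value]
    if missing:
        raise ValueError("Missing metadata fields: " + ", ".join(missing))
    return record


if __name__ == "__main__":
    print(build_record("Athletics", "drexel univ.", "Athletics department site", "Drexel"))
```

Keeping the name variants in one lookup means a correction is made once and applies to every seed.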

Quality Control Procedures
• New program
• Excel spreadsheet
• Track by seed
• Check basic yes/no problems:
– Crawl too large
– Date Queued
– Robots.txt

• Track errors:
– Various seed errors
– Embedded file problems

• Track updates:
– New URLs
– Recrawls
– Patch crawls
– Web administrator contacts

“Our Quality Control,” by Paphio.
http://www.flickr.com/photos/paphio/3313728492/
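
The Robots.txt check in the list above is a yes/no question that can be answered with Python's standard library. The sketch below tests whether a site's robots.txt rules would block a crawler from a seed URL. The user agent string `archive.org_bot` and the example.edu addresses are assumptions used for illustration; confirm the crawler's actual user agent before relying on the result.

```python
from urllib.robotparser import RobotFileParser

# Assumed user agent for the crawler; verify the real value with the vendor.
CRAWLER_USER_AGENT = "archive.org_bot"


def blocked_by_robots(robots_txt: str, url: str, user_agent: str = CRAWLER_USER_AGENT) -> bool:
    """Return True if the given robots.txt text disallows url for user_agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    for seed in ("http://www.example.edu/news/", "http://www.example.edu/private/minutes.html"):
        answer = "yes" if blocked_by_robots(rules, seed) else "no"
        print(seed, "blocked:", answer)
```

The yes/no result can be copied into the tracking spreadsheet row for each seed.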

Challenges
• Staffing
• Time-intensive
• Correcting technical problems
• Not yet knowing how people will use the crawls as a resource
• Capturing online publications and email newsletters

Lessons Learned

“Lessons” by Pavel Ivashkov.
http://www.flickr.com/photos/ipasha/5588688937/

• Web archiving takes time
• There are ways to make it work with a small staff
• Metadata can be basic and still useful
• Quality control is important
• You can’t correct every error with limited staff
• Need to keep up with new sites and URL changes

Up Next
• Additional outreach to Web administrators
• Official launch of Web archiving program to University
• Exploring cross-training to improve quality control program
• Instituting regular scanning of environment for new content and updates
• Social Media

Resources
• Archive-It Knowledge Center (October 2013)
• Brenda Reyes Ayala’s Web Archiving Bibliography (June 2013)
• Kalpesh Padia et al., Visualizing Digital Collections at Archive-It (August 2012)
• National Digital Stewardship Alliance, Web Archiving Survey Report (June 2012)
• International Internet Preservation Consortium, Future of the Web Workshop (May 2012)
• Jinfang Niu, “An Overview of Web Archiving” (D-Lib Magazine, March 2012)
• Inside Higher Ed, Archiving the Web for Scholars (May 2011)
• WebArchivists, Web-Archives Timeline
