Presentation with Judy Silva (Fine & Performing Arts Librarian and Archivist at Slippery Rock University) and Alexis Antracoli (Records Management archivist at Drexel University) at the Pennsylvania Library Association's 2013 annual conference in Seven Springs, Pennsylvania.
Abstract: As higher education embraces new technologies, teaching, learning, research, and record-keeping are increasingly taking place on university websites, on university-related social media pages, and elsewhere on the open web. This dynamic digital content, however, is highly vulnerable to degradation and loss. This session will introduce the concept of web archiving and articulate why it’s important for colleges and universities. Speakers will demonstrate web archiving service Archive-It and then share lessons learned from their institutions’ web archiving initiatives, from unexpected stumbling blocks to strategies for raising funds and support from campus stakeholders.
Capture All the URLS: First Steps in Web Archiving
1. Capture all the URLs: First Steps in Web Archiving
Kristen Yarmey, Digital Services Librarian, University of Scranton
Judy Silva, Fine & Performing Arts Librarian and Archivist, Slippery Rock University of Pennsylvania
Alexis Antracoli, Records Management Archivist, Drexel University
2. Where We’re Going
Kristen:
• Intro to web archiving
• Web archives in higher ed
• Archive-It and other tools
Judy:
• First steps
• Getting buy-in
• Selecting and scoping
Alexis:
• Metadata
• Policies
• Workflow
All:
• Challenges
• Lessons learned
• What’s next?
• Q&A
4. What do we put on the web?
• University publications
– Course catalogs
– Student handbooks
– Newsletters
– Press releases
– Alumni Journal
– Admissions viewbook
– University calendar
• Governance/Planning documents and records
– Policies
– Assessment reports (Fact Book)
– Faculty Senate agendas, minutes, and reports
– Presidential announcements
– Email
• Campus life
– Student clubs
– Housing contract
– Wellness programming
– Community outreach
– Athletics scores
– Alumni class pages
• Events
– Presidential inauguration
– New building construction/dedication
• Social Media presence
– Facebook
– Twitter
– Blogs
– YouTube
– …
5. Web Archiving in Higher Ed
“We have the responsibility to preserve things like course information, course roster information and policies — all sorts of things that we used to get in paper but are now just showing up as websites.”
Dean B. Krafft, Chief Technology Strategist, Cornell University
“Almost every office and unit on campus has a web site with business information. ... Many of our campus publications are only on the web now as pdfs or html. [This content] isn’t preserved anywhere else.”
Ed Busch, Electronic Records Archivist, Michigan State University
6. Goals:
• Preserve dynamic content
– Text
– Images
– Animation
– Video
– …
• Preserve context
– Hyperlinks
– Embedded media
– Document method and date of capture
– Relate to prior and later versions
• Provide access
– Full text search
– Browsability
– User-friendly interface
12. Web Archiving in Higher Ed
“One finding revealed by the survey was the
preponderance of universities that have
initiated web archiving programs in the last 5
years.”
Web Archiving Survey Report by National Digital Stewardship Alliance
June 2012
15. Tools: In-House Options
Proprietary tools:
Adobe Acrobat – convert websites into PDFs (internal links remain active but other dynamic functionality is lost)
Grab-a-Site and WebWhacker – download files from a website
Teleport Pro – “webspidering”
Open source tools:
Heritrix – crawler
HTTrack – downloads web content to a local directory
Wayback – discovery
Memento – access framework
NutchWAX – search
Solr – search
WARCreate – Google Chrome extension for creating WARC files (view with Wayback, store your own data)
Wget – retrieve files from a website
Web Curator Tool – workflow management
NetarchiveSuite – software package
Xenu’s Link Sleuth – finds broken links
National Digital Stewardship Alliance Web Archiving Survey Report, June 2012
16. Tools: Outsourcing Options
Vendor services
Archive-It
California Digital Library Web Archiving Service (WAS)
OCLC Web Harvester
National Digital Stewardship Alliance Web Archiving Survey Report, June 2012
17. Archive-It
• Subscription service
• Branch of nonprofit Internet Archive
• Crawls, harvests, and hosts web content, using open source tools and standard formats
• Yearly fees, based on “data budget”
18. Archive-It: Partners
Archive-It Partners
279 collecting organizations total
118 colleges & universities
Pennsylvania Partners:
Bryn Mawr, Haverford, and Swarthmore (joint, 2005)
Bucknell (2012)
Chemical Heritage Foundation (2010)
Curtis Institute of Music (2010)
Drexel (2009)
Free Library of Philadelphia (2010)
Gettysburg College (2013)
La Salle University (2012)
Pennsylvania State University (2012)
Slippery Rock University of Pennsylvania (2011)
Temple University (2013)
University of Pennsylvania Law School (2011)
University of Scranton (2012)
19. Archive-It: Crawl
• Collection
• Seeds (regular, one-time, or RSS)
• Documents = any file with a distinct URL, including…
– HTML
– Images
– Video
– Audio
– PDF
–…
• Scope = which URLs are captured and which are not
• Frequency = how often seed is crawled
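The relationship between seeds and crawl frequency can be sketched in a few lines of Python. This is only an illustration of the concepts on this slide, not Archive-It's data model: the `Seed` class, the `is_due` helper, the frequency labels, the day counts, and the `example.edu` URLs are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical frequency labels and intervals, chosen for illustration only.
FREQUENCY_DAYS = {"weekly": 7, "monthly": 30, "annually": 365}


@dataclass
class Seed:
    url: str
    frequency: str                      # a key of FREQUENCY_DAYS, or "one-time"
    last_crawled: Optional[date] = None


def is_due(seed: Seed, today: date) -> bool:
    """Return True if the seed should be crawled on `today`."""
    if seed.last_crawled is None:
        return True                     # never crawled, so always due
    if seed.frequency == "one-time":
        return False                    # one-time seeds are crawled only once
    interval = timedelta(days=FREQUENCY_DAYS[seed.frequency])
    return today - seed.last_crawled >= interval


# Hypothetical collection: a homepage, a weekly newsletter, a one-time event.
seeds = [
    Seed("http://example.edu/", "monthly", date(2013, 9, 1)),
    Seed("http://example.edu/newsletter/", "weekly", date(2013, 9, 27)),
    Seed("http://example.edu/inauguration/", "one-time", date(2013, 4, 1)),
]
due_today = [s.url for s in seeds if is_due(s, date(2013, 10, 1))]
```

Keeping the frequency on each seed, rather than on the whole collection, lets a fast-changing newsletter and a rarely changing catalog share one collection.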
20. Archive-It: Access
Users can:
• Search
• Browse
From:
• Archive-It website
• Portal page
• Embedded search boxes
• Library catalog
• Finding aids
• 404 error pages
• Wayback Machine
Content can be public or private.
23. Archive-It: Manage
Metadata
• Dublin Core
• Collection, seed, document level
Storage
• Archive-It hosts content and backup on multiple servers
• Partner can request copy of data
Support
• Training sessions
• Partner support
• User community
30. What is a seed?
• A seed is any URL that you want to capture:
– An entire website: http://www.whitehouse.gov/
– A specific part of a website: http://www.whitehouse.gov/issues/foreign-policy/
– A specific URL: http://www.whitehouse.gov/sites/default/files/rss_viewer/national_security_strategy.pdf
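As a rough illustration of how the three kinds of seed differ, the sketch below decides whether a discovered URL falls under a given seed. The rule it uses (same host, then path-prefix match for seeds ending in "/", exact match otherwise) is a simplifying assumption for this example; the slides do not describe Archive-It's actual scoping rules, and `in_scope` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def in_scope(url: str, seed: str) -> bool:
    """Simplified scope test (an assumption, not Archive-It's real rules)."""
    u, s = urlsplit(url), urlsplit(seed)
    if u.hostname != s.hostname:
        return False                    # different host: out of scope
    seed_path = s.path or "/"
    url_path = u.path or "/"
    if seed_path.endswith("/"):
        # Seed is a site or a section of one: anything beneath it is in scope.
        return url_path.startswith(seed_path)
    # Seed is a single document: only that exact path is in scope.
    return url_path == seed_path


site = "http://www.whitehouse.gov/"
section = "http://www.whitehouse.gov/issues/foreign-policy/"

whole_site_hit = in_scope(section, site)
section_miss = in_scope("http://www.whitehouse.gov/blog", section)
other_host = in_scope("http://example.com/", site)
```

Under this rule the section URL is in scope for the whole-site seed, while a page outside `/issues/foreign-policy/` is out of scope for the section seed.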
34. Collection Development
• Developed policy
– Mission
– Scope
– Designated Community
– Intellectual Property
– Access
• Determined/reviewed seeds to crawl and frequency
• Maintain an up-to-date list of seeds that are regularly crawled
“Brasseri F – Archives oubliees,” by GuillaBar.
http://www.flickr.com/photos/guillabar/8666232614/
35. Updating Metadata
• Selected fields to use consistently:
– Title
– Creator
– Description
– Collector
• Standardized names
• Eliminated groups
36. Quality Control Procedures
• New program
• Excel spreadsheet
• Track by seed
• Check basic yes/no problems:
– Crawl too large
– Date Queued
– Robots.txt
• Track errors:
– Various seed errors
– Embedded file problems
• Track updates:
– New URLs
– Recrawls
– Patch crawls
– Web administrator contacts
“Our Quality Control,” by Paphio.
http://www.flickr.com/photos/paphio/3313728492/
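A per-seed tracking sheet like the one described on this slide could also be kept as a CSV file that opens in Excel. The sketch below is one possible layout, not the spreadsheet the presenters used: the `SeedCheck` fields, the helper functions, and the `example.edu` URLs are hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SeedCheck:
    """One row per seed; column names are assumptions for this sketch."""
    seed: str
    crawl_too_large: bool
    blocked_by_robots: bool
    errors: str = ""                    # e.g. seed errors, embedded file problems
    updates: str = ""                   # e.g. new URLs, recrawls, patch crawls


def write_log(checks, out) -> None:
    """Write the checks as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(SeedCheck)])
    writer.writeheader()
    for check in checks:
        writer.writerow(asdict(check))


def needs_follow_up(check: SeedCheck) -> bool:
    """Flag a seed if any yes/no problem is set or an error is recorded."""
    return check.crawl_too_large or check.blocked_by_robots or bool(check.errors)


checks = [
    SeedCheck("http://example.edu/", False, False),
    SeedCheck("http://example.edu/news/", False, True,
              errors="embedded video missing", updates="patch crawl requested"),
]
buffer = io.StringIO()                  # stands in for an open file
write_log(checks, buffer)
flagged = [c.seed for c in checks if needs_follow_up(c)]
```

Writing to a real file instead of `io.StringIO` only requires passing `open("qc_log.csv", "w", newline="")` as `out`.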
38. Lessons Learned
“Lessons” by Pavel Ivashkov.
http://www.flickr.com/photos/ipasha/5588688937/
• Web-archiving takes time
• There are ways to make it work with a small staff
• Metadata can be basic and still useful
• Quality Control is important
• You can’t correct every error with limited staff
• Need to keep up with new sites and URL changes
39. Up Next
• Additional outreach to Web administrators
• Official launch of Web archiving program to University
• Exploring cross-training to improve quality control program
• Institute regular scanning of environment for new content and updates
• Social Media
40. Resources
Archive-It Knowledge Center (October 2013)
Brenda Reyes Ayala’s Web Archiving Bibliography (June 2013)
Kalpesh Padia et al., Visualizing Digital Collections at Archive-It (August 2012)
National Digital Stewardship Alliance, Web Archiving Survey Report (June 2012)
International Internet Preservation Consortium, Future of the Web Workshop (May 2012)
Jinfang Niu, “An Overview of Web Archiving” (D-Lib Magazine, March 2012)
Inside Higher Ed, Archiving the Web for Scholars (May 2011)
WebArchivists, Web-Archives Timeline
Editor's notes
Recognition of changing platform. All of our stuff is going here, and it’s dynamic.
This is harder than you think. Digital files are highly vulnerable.
Already seeing this happen.
Often a combination of tools
As we’ve seen, Archive-It is the popular favorite and has an impressive list of users. Started looking into web archiving in 2009; followed the topic on professional listservs and saw Archive-It mentioned repeatedly. Had read about Brewster Kahle’s work. Tried the Wayback Machine and was impressed with what was being collected already. In 2010 saw a presentation at MARAC (Mid-Atlantic Regional Archives Conference) by a colleague, Rebecca Goldman at Drexel, who suggested I try a webinar, and that was it.
• Few options in 2009
• Archive-It endorsed by colleagues
• Internet Archive’s Wayback Machine
• Presentation at professional conference
• Attended a webinar
• Ran a trial
• Archive-It support
First contacted Information Technology (partly to determine they were not already archiving the website somehow). The VP of Information Technology referred us to PR. At Slippery Rock University the Public Relations Office is responsible for the website, so they were a natural choice to help select content for capture and preservation. PR publishes more and more content in electronic format (in some cases only electronic): course catalogs, alumni magazine, press releases...
• Administration seeking storage solutions as network drives fill
• Library willing to test the web archiving concept with our content
IT deferred concept (and funding) to PR. PR said they could not afford it. Library ended up paying for it. Will ask Provost’s Office next year.
• Library Homepage Archives: Digital Collections
• University Homepage
• RockPride (campus e-newsletter)
• Catalogs: undergraduate and graduate
Decided to use library content as prototype; it allowed us to practice and then showcase our own content: library homepage and Digital Collections. The library is responsible for the annual Student Research Symposium, so that provides a stepping stone beyond library content and some additional stakeholders (faculty whose students are participants in the symposium). Also e-newsletter of faculty publications. PR’s suggestions: University homepage, campus e-newsletter, and catalogs (undergraduate and graduate).
• Popular campus activities like Athletics, student organizations, alumni
• Influential friends: president and provost
• One-time events: anniversaries, etc.
It is now possible to archive intranet content behind a username and password (provided that the partner supplies those credentials in the web application).
Archive-It help pages are user-friendly.
• Options: limit or expand
• Set a time
• Set a data limit
• Crawl frequency
University Homepage: The homepage URL does not necessarily need to be crawled further than the root at the moment; once a month.
RockPride (campus e-newsletter): A new RockPride online magazine runs every Friday during the school semester, and roughly once a month throughout the summer.
Catalog: undergraduate and graduate catalogs, back to 2004; annually.
Look at other college and university sites (available from the Archive-It site) to see what they were harvesting and how they were naming collections.
Stumbling blocks: Kristen and Alexis?