Presentation delivered by Lynda Schmitz Fuhrig, Electronic Archivist, and Jennifer Wright, Archivist, for the Smithsonian Institution Archives, at the Smithsonian Archives Fair on October 14, 2011 in Washington, DC.
Although it began capturing institutional websites in the late 1990s, the Smithsonian Institution Archives initiated a project in 2009 to use the Heritrix crawler to capture the explosion of public websites and social media accounts maintained by the Smithsonian’s many museums, research centers, and programs. This presentation reviews appraisal, accessioning, and capture issues in documenting the Smithsonian’s web presence in the early 21st century.
Preserving the Smithsonian Institution’s Web Presence
1. Preserving the Smithsonian Institution’s Web Presence
Lynda Schmitz Fuhrig and Jennifer Wright
Smithsonian Institution Archives
Smithsonian Archives Fair, Oct. 14, 2011
2. The Mission of SI Archives
Appraise, acquire, and preserve the records of the Institution
Offer a range of research and reference services
Establish policy and provide expert guidance on record keeping practices
Create and promote products and services that broaden understanding of the Smithsonian
Provide professional archival and conservation expertise
5. Website and Social Media Registry
A “record” is any official recorded information, regardless of medium or characteristics, created, received, and maintained by a Smithsonian museum, office, or employee
Websites and social media accounts must be managed as records
Registry allows staff from across the Smithsonian to add and update information about all of their websites and social media accounts
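The registry described on this slide can be pictured as a small keyed data structure. The following is a minimal sketch only: the field names, the `Registry` class, and the one-year staleness threshold are all assumptions made for illustration, not the actual schema of the Smithsonian registry.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RegistryEntry:
    """One website or social media account (hypothetical schema)."""
    unit: str           # museum, office, or program responsible
    url: str            # address of the site or account
    kind: str           # "website" or "social_media"
    contact: str        # staff member who maintains the entry
    last_updated: date  # when staff last confirmed the information


class Registry:
    """In-memory registry keyed by URL, so re-submitting an entry updates it."""

    def __init__(self):
        self._entries = {}

    def add_or_update(self, entry):
        # Adding an entry with an existing URL replaces the older record.
        self._entries[entry.url] = entry

    def by_unit(self, unit):
        # All entries for one unit, sorted by URL for stable output.
        return sorted(
            (e for e in self._entries.values() if e.unit == unit),
            key=lambda e: e.url,
        )

    def stale(self, today, max_age_days=365):
        # Entries not confirmed within max_age_days (threshold is an assumption).
        return [e for e in self._entries.values()
                if (today - e.last_updated).days > max_age_days]
```

Keying on the URL means staff can submit the same form to add or to update, and a staleness check gives the archivist a list of entries to chase.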
6. Appraising Records
All records must be appraised to determine their ultimate disposition
Records are appraised based on administrative, legal, historical, and research value
Records with long-term value are transferred to the Archives
7. Appraising Traditional Websites
Websites are the public face of the Smithsonian
Significant historical and research value
Constantly changing
Crawl annually and before and after major redesigns
Work with webmasters to determine if crawls should be more or less frequent
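The crawl schedule on this slide (annually, plus before and after major redesigns) can be sketched as a simple due-date check. This is an illustration under stated assumptions: the function name `crawl_due`, the 365-day interval, and the 14-day pre-redesign window are hypothetical values, and in practice the interval would be adjusted per site after talking with its webmaster.

```python
from datetime import date, timedelta


def crawl_due(today, last_crawl, redesign_dates=(), interval_days=365, window_days=14):
    """Return a reason string if a crawl is due, otherwise None.

    interval_days and window_days are assumed defaults, not SIA policy values.
    """
    if last_crawl is None:
        return "never crawled"
    for redesign in redesign_dates:
        days_until = (redesign - today).days
        # Before a redesign: capture the old site once inside the window.
        if 0 <= days_until <= window_days and last_crawl < redesign - timedelta(days=window_days):
            return "pre-redesign"
        # After a redesign: capture the new site if no crawl has happened since launch.
        if today >= redesign and last_crawl < redesign:
            return "post-redesign"
    # Otherwise fall back to the regular (annual by default) cycle.
    if (today - last_crawl).days >= interval_days:
        return "annual"
    return None
```

Passing a smaller `interval_days` for a fast-changing site, or a larger one for a static site, covers the "more or less frequent" case without changing the logic.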
8. Appraising Social Media Accounts
All social media accounts are used differently
Each account appraised individually based on content
Accounts containing significant original content will be fully captured each year
Accounts consisting mostly of links to other resources will be captured occasionally to document existence
Method and frequency of capture may depend on terms of service and ability to avoid capturing non-Smithsonian content
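The appraisal rules above amount to a small decision table, sketched here for illustration. Everything specific in it is an assumption: the function name `capture_plan`, the 0.5 cutoff for "significant original content", and the method labels are placeholders, since the slide leaves the actual judgment to the appraising archivist.

```python
def capture_plan(original_share, tos_permits_crawl=True, can_exclude_third_party=True):
    """Map an appraisal of one account to a capture plan (hypothetical rule set).

    original_share: estimated fraction (0.0 to 1.0) of posts that are original
    content rather than links to other resources.
    """
    if not 0.0 <= original_share <= 1.0:
        raise ValueError("original_share must be between 0 and 1")

    # Significant original content: full capture each year.
    # Mostly links: occasional capture to document that the account existed.
    if original_share >= 0.5:  # cutoff is an assumption for illustration
        scope, frequency = "full", "annual"
    else:
        scope, frequency = "sample", "occasional"

    # Terms of service and third-party content constrain the method.
    if not tos_permits_crawl:
        method = "manual review"      # placeholder label
    elif not can_exclude_third_party:
        method = "restricted crawl"   # placeholder label
    else:
        method = "crawl"

    return {"scope": scope, "frequency": frequency, "method": method}
```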
9. Past Web Archiving Procedures
• Files transferred from the Smithsonian’s IT office
• HTTrack web crawler
• Scripts used to create XHTML preservation files, but the process was very manual and time-consuming
10. Heritrix
• Archival web crawler
• Open source
• Java
• Developed by Internet Archive, National Library of Norway, and National and University Library of Iceland
11. WARC
WARC – Web ARChive file format
International standard – ISO 28500:2009
Extension of the ARC format in use since 1996
Container format
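To make "container format" concrete: a WARC file is a sequence of records, each with a `WARC/1.0` version line, named header fields, a blank line, a content block of `Content-Length` bytes, and two trailing CRLFs. The sketch below writes and reads such records using only the Python standard library. It is a simplified illustration, not what Heritrix itself produces: a crawler normally writes `response` records containing the full HTTP exchange and usually compresses each record, whereas this example uses uncompressed `resource` records, and the function names are hypothetical.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(target_uri, payload, record_type="resource", content_type="text/html"):
    """Serialize one uncompressed WARC 1.0 record holding payload (bytes)."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = b"WARC/1.0" + CRLF + b"".join(
        f"{name}: {value}".encode("utf-8") + CRLF for name, value in headers
    )
    # Blank line ends the header; two CRLFs end the record.
    return head + CRLF + payload + CRLF + CRLF


def read_warc_records(data):
    """Yield (headers, payload) for each record in concatenated WARC bytes."""
    pos = 0
    while pos < len(data):
        head_end = data.index(CRLF + CRLF, pos)
        lines = data[pos:head_end].split(CRLF)
        if not lines[0].startswith(b"WARC/"):
            raise ValueError("not a WARC record at offset %d" % pos)
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(b":")
            headers[name.decode("utf-8").strip()] = value.decode("utf-8").strip()
        start = head_end + len(CRLF) * 2
        length = int(headers["Content-Length"])
        yield headers, data[start:start + length]
        # Skip the content block and the two CRLFs that close the record.
        pos = start + length + len(CRLF) * 2
```

Because each record carries its own length, many captures (pages, images, stylesheets) can be appended to one file and read back without any external index.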
18. Social Media
Third-party issues
Privacy concerns
Different tools
19. Lessons Learned
In-house archiving takes time
No one-size-fits-all solution
Master site registry requires regular updating
21. Contacts and Resources
Lynda Schmitz Fuhrig
Digital Services Division
schmitzfuhrigl@si.edu
Jennifer Wright
Archives and Information Management Team
wrightjm@si.edu
Smithsonian Institution Archives website:
http://siarchives.si.edu