From Seed to Harvest: Web Archiving Program Considerations for SUL
Nicholas Taylor
@nullhandle
Stanford University Libraries
April 17, 2013
“Digital” by Flickr user clickclaker under CC BY-NC-ND 2.0
hello, my name is Nicholas…
Library of Congress Web Archiving
Library of Congress: “MINERVA”
Web Archiving Life Cycle Model
“Web Archiving Life Cycle Model” by M. Bragg, K. Hanna, et al. (2013). Reproduced with permission.
Web Archiving Life Cycle Model
Program Elements
• Vision and Objectives
• Resources and Workflow
• Access / Use / Reuse
• Preservation
• Risk Management
Workflow Elements
• Appraisal and Selection
• Scoping
• Data Capture
• Storage and Organization
• Quality Assurance and Analysis
Web Archiving
PROGRAM ELEMENTS
“Element Blocks” by Flickr user Asian Art Museum under CC BY-NC-ND 2.0
Vision and Objectives
web archiving program vision
ePADD Discovery Module
PASIG
SUL mission
“The Stanford University Libraries (SUL) is more than a cluster of libraries; it connects people with information by providing diverse resources and services to the academic community.”
“Stanford University Libraries…develops and implements resources and services…that support research and instruction.”
SUL: “Stanford University Libraries on Vimeo”
SUL: “About The Stanford University Libraries”
SUL: “SULAIR Brief Guide”
DLSS mission
“DLSS is the information technology production arm of the Stanford Libraries; it serves as the digitization, digital preservation and access systems provider for SUL; and it is the research and development unit for new technologies, standards and methodologies related to library systems.”
SUL: “New Images of Rare Books and Digitization Devices”
SUL: “SULAIR Digital Library Systems and Services (DLSS)”
proposed program mission
“The web archiving program will provide capabilities for the acquisition, preservation, and dissemination of resources that are increasingly and, often, exclusively accessible via the web that are necessary to support University research, instruction, and other purposes.”
objectives
• build infrastructure
• develop expertise
• create research collections
• archive records and deprecated content
• mirror government documents
“Objective” by Flickr user Pedro J. Ferreira under CC BY-NC-ND 2.0
Resources and Workflow
cost modeling
“dollar butterfly (2)” by Flickr user eikosi under CC BY-SA 2.0
staffing
• service manager
• crawl engineer
• curators
• system administrators
• software engineers
• technical services
• legal counsel
“Digitizing Mark Adams cartoons” by Flickr user suldpg under CC BY-NC-SA 2.0
infrastructure
“Google Storage Server” by Flickr user Kazuya (Kaz) Yokohama under CC BY-NC-ND 2.0
readily workflow-able
• collection management
• site nomination
• permissions tracking
• crawl scheduling
• data capture
• quality assurance
“Web Curator Tool User Manual Version 1.5.2”
workflow challenges
• test crawling
• automated QA
• AIP/DIP generation
• SDR ingest
• indexing
• enabling access
• tools testing
“Salmon Ladder at Bonneville Dam” by Flickr user Serolynne under CC BY-NC-ND 2.0
Access / Use / Reuse
access policy
• dark archive
• data redistribution
• embargo
• onsite/offsite replay
• takedown requests
“DO NOT DUPLICATE” by Flickr user Sam UL under CC BY-NC-SA 2.0
browse and API: Wayback
Internet Archive: “Wayback Machine”
UK Web Archive: “Wayback Machine”
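As a rough illustration of the "API" side of Wayback: replay URLs in Wayback-style archives are commonly addressed by a 14-digit timestamp followed by the original URL. The sketch below assumes the Internet Archive's public `https://web.archive.org/web/` prefix. The function name `wayback_url` is hypothetical, and other Wayback deployments would use their own prefix.

```python
from datetime import datetime

# Assumed replay prefix (the Internet Archive's public Wayback Machine).
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(original_url: str, when: datetime, prefix: str = WAYBACK_PREFIX) -> str:
    """Build a replay URL requesting the capture closest to `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14 digits: YYYYMMDDhhmmss
    return f"{prefix}{timestamp}/{original_url}"


print(wayback_url("http://library.stanford.edu/", datetime(2013, 4, 17)))
```

A Wayback instance typically redirects such a request to the nearest capture it actually holds, so the timestamp need not match a capture exactly.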
many Wayback Machines
Wikipedia: “List of Web archiving initiatives”
discovery: Memento
“Memento”
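A minimal sketch of how a Memento client might express "give me this page as of a given date": the Memento protocol uses an `Accept-Datetime` request header carrying an HTTP-style date. The helper name `accept_datetime_header` is hypothetical, and which TimeGate endpoint to send the header to is left out as deployment-specific.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Return the request header asking a Memento TimeGate for a capture near `when`."""
    # Naive datetimes are interpreted as local time by astimezone().
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


print(accept_datetime_header(datetime(2013, 4, 17, tzinfo=timezone.utc)))
```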
discovery: SearchWorks
SUL: “SearchWorks”
full-text search: Solr
Archive-It: “Explore All Archives”
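To make the full-text search idea concrete, here is a hedged sketch of building a query URL for Solr's standard `/select` handler. The base URL and the core name `webarchive` are assumptions for illustration only, and the query is sent to whatever default search field the index is configured with.

```python
from urllib.parse import urlencode


def solr_select_url(base_url: str, text: str, rows: int = 10) -> str:
    """Build a URL for Solr's /select handler that asks for JSON results."""
    params = {"q": text, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical local Solr core holding text extracted from archived pages.
print(solr_select_url("http://localhost:8983/solr/webarchive", "web archiving"))
```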
Preservation
bit preservation
“Binary” by Flickr user mikecogh under CC BY-SA 2.0
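One common building block of bit preservation is a fixity check: recompute a file's checksum and compare it with the value recorded at ingest. The sketch below uses SHA-256 as an assumed algorithm. The function names are hypothetical and do not describe how SDR does it.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, expected_hex: str) -> bool:
    """Return True if the file's current digest matches the recorded one."""
    return sha256_of(path) == expected_hex.lower()
```

Reading in chunks keeps memory use flat, which matters because web archive container files are often large.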
preservation engineering
“Máquina de Rube Goldberg en la base del Alinghi” by Flickr user freshwater2006 under CC BY-NC 2.0
Risk Management
Risk Management
• “appified” web
• copyright
• ephemeral web
• financial sustainability
• fostering use
“Zombie Awareness - Extinguisher” by Flickr user Spiffy0777 under CC BY-NC-SA 2.0
Policy
copyright
• § 108 (library exceptions)
• fair use
• notification vs. permission
• opt-out / takedown
• robots.txt
• third-party sites
• exceptions?
“Noria con Copyrights” by Flickr user Alex Novoa under CC BY-NC-ND 2.0
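Whether to honor robots.txt is a policy question, but checking it is mechanical. The sketch below uses Python's standard `urllib.robotparser` to test a URL against robots.txt rules supplied as text. The user-agent string and example rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for a site under consideration.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(EXAMPLE_ROBOTS, "example-crawler", "http://example.org/private/a.html"))
print(allowed_by_robots(EXAMPLE_ROBOTS, "example-crawler", "http://example.org/index.html"))
```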
collection development
“leaf-cutter ants” by Flickr user Vilseskogen under CC BY-NC-SA 2.0
Web Archiving
WORKFLOW ELEMENTS
“Workflow” by Flickr user luismi_cavalle under CC BY 2.0
Appraisal and Selection
informing selection
• value
• risk
• size
• extent to which archived
“Fruit market-Barcelona” by Flickr user Marcel Theisen under CC BY-NC-SA 2.0
TwitterVane
UK Web Archive: “TwitterVane”
Wikipedia Live Monitor
Thomas Steiner: “Wikipedia Live Monitor”
Wikipedia articles
Wikipedia: “List of think tanks in the United States”
UNT Nomination Tool
University of North Texas Libraries: “Nomination Tool”
Scoping
the purpose of scoping
“More god?” by Flickr user one two one three under CC BY-NC-SA 2.0
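Scoping is usually understood as deciding which discovered URLs a crawl should follow and which it should leave alone. As a simplified sketch, the rule below keeps a URL only if it is on the seed's host and under the seed's directory. The function name `in_scope` and the rule itself are assumptions for illustration, not Heritrix's actual scope logic.

```python
from urllib.parse import urlsplit


def in_scope(candidate_url: str, seed_url: str) -> bool:
    """Keep only http(s) URLs on the seed's host and under the seed's directory."""
    seed = urlsplit(seed_url)
    cand = urlsplit(candidate_url)
    if cand.scheme not in ("http", "https"):
        return False
    if (cand.hostname or "").lower() != (seed.hostname or "").lower():
        return False
    seed_path = seed.path or "/"
    # Treat the seed's containing directory as the scope prefix.
    prefix = seed_path if seed_path.endswith("/") else seed_path.rsplit("/", 1)[0] + "/"
    return (cand.path or "/").startswith(prefix)


print(in_scope("http://example.org/news/2013/a.html", "http://example.org/news/"))
```

Real crawls usually need looser rules than this, for example to pull in images or stylesheets hosted on other domains.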
Data Capture
Heritrix
Internet Archive: “A Quick Guide to Running Your First Crawl Job”
other data capture tools
Dan Chudnov and Laura Wrubel: “social feed manager”
Mat Kelly: “WAIL”
Archive Team: “Wget with WARC output”
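For the Wget route, a small capture can be sketched by assembling a command line that uses Wget's `--warc-file` option (available in Wget builds with WARC support). The helper below only constructs the argument list and does not run it. The function name, seed URL, and WARC name are placeholders.

```python
import shlex


def wget_warc_command(seed_url: str, warc_name: str) -> list:
    """Assemble a wget invocation that mirrors a site and writes a WARC file."""
    return [
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch images, CSS, and scripts
        f"--warc-file={warc_name}",  # base name of the WARC output file
        seed_url,
    ]


print(shlex.join(wget_warc_command("http://example.org/", "example-2013")))
```

The resulting list can be passed to `subprocess.run` once the scope and politeness settings have been reviewed.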
the elusive web
“Light Writing - Spider Web” by Flickr user forcefeed:swede under CC BY-ND 2.0
scale
“chutes and ladders” by Flickr user reallyboring under CC BY-NC-SA 2.0
Storage and Organization
packages and their contents
“lots and lots and lots of boxes” by Flickr user Toastwife under CC BY-NC-SA 2.0
Quality Assurance and Analysis
QA before, after, during
“Check” by Flickr user ex.libris under CC BY-NC-ND 2.0
Metadata / Description
Metadata / Description
“Hello! My URL Is...” by Flickr user vasta under CC BY-NC-ND 2.0
Considerations
BEYOND THE MODEL
“My donut” by Flickr user Molemaster under CC BY-NC-SA 2.0
other program requirements
• marketing/outreach
• performance metrics
• service level definitions
• service roadmap
• training
• user documentation
“Sticky notes” by Flickr user Kris Krug under CC BY-SA 2.0
incorporating existing projects
• plan capacity
• normalize data
• ingest into SDR
• seek permissions
• process
• catalog
• enable access
“Geckos” by Flickr user smashz under CC BY-NC-ND 2.0
community engagement
the web changes
Internet Archive: “Wayback Machine”
Nicholas Taylor
@nullhandle
“Thank You” by Flickr user muffintinmom under CC BY 2.0

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 

From Seed to Harvest: Web Archiving Program Considerations for SUL

Editor's Notes

  1. A little bit of my background: I've worked for the last two-and-a-half years for the Library of Congress Web Archiving program. That program has been running for 13 years now, accumulating over 60 collections, many of them focused on public policy and the legislative branch; over 13,000 nominated websites; and over 400 terabytes of content. Our large-scale crawling is provided by Internet Archive, creating additional workflow requirements and complexities. I am 1 of 3 project managers in the 5-person web archiving team and have personally transferred and ingested over 200 terabytes of content into the repository, reviewed hundreds of websites, and have been involved in the planning or have directly managed all upgrades to our workflow tools over the last couple of years, including especially our data management and QR tools.
  2. Some of you may have seen or even been involved in the creation of Archive-It’s Web Archiving Life Cycle Model, released in March. I thought this would be a useful way of reviewing program considerations.
  3. The outer circle consists of broader program elements. The inner circle is more particular in focus, concentrating on the mechanics and requirements of workflows.
  4. I'll start on the outside with program elements and work my way in.
  5. The foremost questions for a web archiving program are, “what is it meant to do?” and “what is its relationship to the mission of larger organizational structures?”
  6. Stanford University Libraries is and has been involved in so many interesting and innovative digital library and digital preservation projects. Throughout the rest of this presentation, I refer to best practices and approaches for web archiving from other institutions. My vision for Stanford web archiving is not just to stand up a production service, but also to develop tools and approaches that are as innovative as these other projects.
  7. Looking at the mission of the Stanford University Libraries, it seemed to me that the key points were “providing diverse resources and services” and, in doing so, “supporting research and instruction.”
  8. Looking then at the mission of the Digital Library Systems and Services group, where the web archiving program would be situated, its role seemed to be to provide IT infrastructure, research, and development in furtherance of the Libraries' mission.
  9. Considering the missions of its parents, a web archiving program mission might look something like this.
  10. The mission may be operationalized by articulating more concrete objectives, such as building infrastructure, developing distributed staff expertise, and identifying classes of content for collecting.
  11. Program goals can only be achieved through the allocation of resources and development of workflows. What resources will a web archiving program require?
  12. Cost modeling is a foremost concern for resource planning. Stanford has some familiarity with cost models from service providers such as Archive-It and CDL’s Web Archiving service, which are based on quotas for data volume, seed count, crawl duration, and/or number of active collections. One advantage of bringing web archiving in-house is to be able to provide better quality captures than these limits permit. The challenge will be ensuring that the cost model is easy to understand, provides for quality archiving, and is financially sustainable. Cost modeling is difficult. At the Library of Congress, our bulk crawling contract with Internet Archive was based on a ceiling on the number of seeds. When we tried to model the costs for website nominators within the Library on a per-seed basis, we found that the more seeds were submitted to the crawl, the less each seed “cost”, communicating a muddled price signal to nominators.
  13. A production web archiving program will involve an ongoing or episodic commitment from other staff beyond the two intended FTEs. Curators propose collections and select websites. System administrators maintain the IT infrastructure for web archiving systems. Software engineers enhance web archiving workflows. Technical services staff enhance descriptive metadata and facilitate discovery. Legal counsel helps establish and refine sound legal terms for the operation of the service.
  14. Depending on the scale of the program, web archiving may have significant demands on compute, memory, and storage. Indexing and analysis will require robust IT infrastructure.
  15. I understood from Stanford’s web archiving report from several years ago that there was interest in the Web Curator Tool. Web Curator Tool would be a good drop-in solution for handling a number of customer-facing elements of the production web archiving workflow.
  16. Don’t worry, though, there will be plenty left to automate. At the Library of Congress, for instance, I’ve spent a significant amount of time improving the movement of and management of data from Internet Archive to our repository.
  17. The eventual aim of workflows is to support access and use.
  18. The access policy should specify under what terms content can be made available to which designated users and will need to be determined by assessing relevant legal and other risks.
  19. The most typical access method is the Wayback Machine, an open source program developed by Internet Archive for ARC and WARC web archive file format replay. It allows you to browse date snapshots for individual URLs and also provides an XML API.
  20. Wayback Machine is in fact the most common access interface used by the international cultural heritage web archiving community.
  21. One of the advantages of using Wayback Machine is that it natively supports Memento, a prototype extension of HTTP that will facilitate discovery of resources in distributed web archives.
  22. Of greater importance will be supporting local discovery. I saw that some web archiving collections already had records in SearchWorks. In addition to cataloging web archive collections, it’d be worth assessing the feasibility of website-level records.
  23. Full-text search is becoming more common, especially among smaller institutions with smaller collections and larger institutions with more robust infrastructure. This is something that Stanford should consider, to augment other forms of access.
  24. Access is interdependent with preservation. How are web archives preserved?
  25. The most basic requirement for preservation of web archives, as with other forms of digital content, is bit preservation. Checksums should be generated for all content encapsulated in the SIP and checked every time the data is copied to a new filesystem. An AIP should be stored in SDR.
  26. Beyond bit preservation, there are not yet widely adopted approaches to web archive preservation engineering. I’ve participated in the IIPC Preservation Working Group for the last couple of years. We recently created and distributed a preservation survey among IIPC members. Of the dozen or so institutions that have filled it out so far, responses are all over the map in terms of approaches to data normalization, perception of file format obsolescence risk, and technical metadata requirements. The most ambitious of current efforts is being undertaken by the Austrian National Library, which is collecting technical metadata for individual items within its ARCs and WARCs. More moderate approaches that Stanford might consider are those of BnF and Harvard, which collect and store technical metadata principally about the container files themselves.
  27. Preservation is a means of mitigating one kind of risk. There are other kinds of risks to be managed.
  28. These are some of the other risks a web archiving program will have to confront. Over the last year, I’ve become increasingly sensitized to the risk of inadequate use or under-developed stakeholders at the Library of Congress as access server space becomes scarce and IT Services wants to get the best return on investment of limited resources.
  29. As visually indicated by the fact that it surrounds the entire life cycle, policy provides the foundation for all of the other program elements.
  30. The legal issues in web archiving center primarily, though not exclusively, on copyright. Section 108 of the Copyright Act provides exceptions for library preservation of at-risk materials, but does not cover web archiving. Web archiving programs address the copyright issue through a combination of fair use best practices and permissions, with provisions for opting out of crawling or de-accessifying crawled content. One of the challenges related to copyright permissions in the Library of Congress context is that permissions requirements mandated by legal counsel are collection-specific, leading to problems when we collect websites with the same content owner in collections where different permissions are specified.
  31. The other major policy area is collection development. This will be informed by existing collection development and records policies. The web is much bigger than any one institution’s capability to collect. Collection development policy should help determine what should be collected, how much of it, how comprehensive or representative it should be, what collecting should take place outside of a collection framework (e.g., Technical Services’ EEMs), and so on.
  32. So far, I’ve been talking about the broader elements of a web archiving program. Now I’m going to talk about more workflow-level considerations. https://secure.flickr.com/photos/vfsdigitaldesign/5396094193/
  33. Collection development policy defines what the combination of individual collecting projects should look like in the aggregate. Appraisal and selection are how, within an individual collecting project, a curator decides to collect one resource versus another.
  34. Selection is a challengingly subjective task, especially given the size of the web. Criteria to consider include the value of the website, the risk of its disappearance, the resources it would take to archive it, and the extent to which it has already been archived by other institutions.
  35. There have been some nascent efforts to crowd-source the problem. The UK Web Archive is currently working on a tool to identify frequently-cited links in curated Twitter streams.
  36. There’s been some discussion of using a live monitor of Wikipedia edits for the same purpose.
  37. Wikipedia itself is a crowd-sourced production and may also be used to seed certain topical collections.
  38. Lastly, the University of North Texas Nomination Tool is used collaboratively by many web archiving institutions and archivists to pool seeds, often in response to breaking events. It was used most recently to curate seed lists for a papal transition crawl and an end of presidential term crawl.
  39. After appraising and selecting a resource to be collected, the next essential step is to define the scope of the crawl.
  40. Scoping is creating instructions for where the crawler should or should not go, after setting out from the seed URLs. It is the primary mechanism for ensuring crawling resources are used most efficiently and the primary focus of QA. Seeds and scopes are sometimes fungible from a crawling perspective but not from a permissions workflow or cataloging perspective.
  41. Once seeds are selected and scoping is configured, you deploy software to capture the data.
  42. Just as Wayback Machine is the most typical software used for web archive access, its counterpart Heritrix is the most common software used for data capture. Heritrix is an open source, scalable, archival web crawler and stores captured content in ISO-standard WARC files.
  43. Heritrix is not the only data capture tool available, nor the only one that produces WARC files. Wget and the Web Archiving Integration Layer may be useful to consider for test and/or small-scale crawling. George Washington University’s social feed manager is a tool for archiving Twitter streams and is an example of how the web archiving community is exploring other methodologies for capturing web content.
  44. The social feed manager hints at a future in which API-enabled archiving becomes more common. As it is, Heritrix and the web crawling paradigm generally are far more suitable to the comparatively static web of 10 years ago than the contemporary web. Continuing efforts and collaborations will be required on the data capture front to maintain the efficacy of web archiving tools.
  45. And it’s not that we don’t have the capabilities now to tackle some of the data capture challenges; we just don’t have effective ways to do so at scale, a requirement for a robust production workflow.
  46. Once you have the data, you need to organize and store it.
  47. Data to be included in the SIP could be the WARCs themselves but also the crawler configuration and logs. It will be important to track the relationship between packages and the collection, website(s), and/or capture date ranges they represent (this may or may not be transparent in the filenames).
  48. In the life cycle model, QA takes place after storage and organization.
  49. I think that it usually takes place before, during, and after data capture. “Before” includes scoping, assessing obstacles to archiving, or surfacing JavaScript links with a web automation framework like PhantomJS. “During” includes checking up on the running crawl to make sure it doesn’t get stuck. “After” includes reviewing crawl logs, inspecting harvested sites, and making scoping adjustments. QA is most important after the first crawl of a resource.
  50. Descriptive metadata may be created or enhanced during many of the workflow stages of the life cycle.
  51. Descriptive metadata would optimally come from multiple sources: selectors, catalogers, and automated methods. cURL is a basic automated method for extracting metadata from the head of archived pages. I’ve experimented some with text analysis tools that could suggest appropriate keywords from a controlled vocabulary, but I’m not aware of any tools that are production-ready.
  52. I acknowledge that the life cycle model doesn’t cover every aspect of what Stanford will need to consider in the creation of its web archiving program.
  53. There are many other considerations, such as how the success of the program will be benchmarked and how the requirements of different stakeholders will be balanced.
  54. There will also be the challenge of incorporating existing projects. To what extent can the disparate efforts be standardized, and is that even desirable?
  55. The web archiving program will need to engage not just with internal stakeholders but external groups and institutions as well. Web archiving is definitely a community effort, and the community needs all the help it can get.
  56. Lastly, it will be necessary to revisit and re-evaluate many of the aforementioned program and workflow elements on an ongoing basis to keep pace with the changing information environment and evolving best practice. Stanford University has come a long way since 1996 and finds itself now at a great moment to become more involved with web archiving. I’d welcome the opportunity to help lead that effort.
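
Note 12 observes that, under a contract with a fixed ceiling, the per-seed "cost" falls as more seeds are submitted. A minimal sketch of that arithmetic follows. The function name `cost_per_seed` and the dollar figure are hypothetical, chosen only for illustration; they are not actual Library of Congress or Internet Archive contract terms.

```python
def cost_per_seed(contract_cost, seeds_submitted):
    """Average cost per seed when a fixed contract cost is split evenly."""
    if seeds_submitted <= 0:
        raise ValueError("seeds_submitted must be positive")
    return contract_cost / seeds_submitted


# Hypothetical figure for illustration only; not an actual contract amount.
CONTRACT_COST = 100_000.0

for seeds in (1_000, 2_000, 4_000):
    per_seed = cost_per_seed(CONTRACT_COST, seeds)
    print(f"{seeds:>5} seeds -> {per_seed:.2f} per seed")
```

Because the total is fixed, doubling the number of seeds halves the apparent price of each one, which is the muddled price signal the note describes.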
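
Note 25 recommends generating checksums for everything in the SIP and re-checking them whenever the data is copied to a new filesystem. The sketch below shows one way that could look. The function names (`sha256_of`, `build_manifest`, `verify_manifest`), the choice of SHA-256, and the in-memory manifest are assumptions made for illustration, not a description of SDR or of any existing tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir):
    """Map each file's path (relative to the package) to its checksum."""
    root = Path(package_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(package_dir, manifest):
    """Return the relative paths that are missing or whose checksum changed."""
    root = Path(package_dir)
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

In use, the manifest would be built once when the package is assembled, stored alongside it, and passed to `verify_manifest` after each transfer; an empty list means every file arrived intact.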
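
Note 40 describes scoping as instructions for where the crawler should or should not go after setting out from the seed URLs. The following is a deliberately simplified sketch of such a rule: a URL is in scope if it shares a host with a seed and sits at or below that seed's path, unless it matches an exclusion pattern. The function `in_scope` and its parameters are hypothetical; this is not how Heritrix expresses scope rules, only an illustration of the idea.

```python
import re
from urllib.parse import urlsplit


def _with_slash(path):
    """Normalize a URL path so prefix comparison respects path segments."""
    return path if path.endswith("/") else path + "/"


def in_scope(url, seeds, exclude_patterns=()):
    """Decide whether a discovered URL falls inside the crawl scope.

    A URL is in scope when it shares a host with some seed, sits at or below
    that seed's path, and matches none of the exclusion patterns.
    """
    if any(re.search(pattern, url) for pattern in exclude_patterns):
        return False
    candidate = urlsplit(url)
    for seed in seeds:
        parsed = urlsplit(seed)
        same_host = candidate.hostname == parsed.hostname
        below_seed = _with_slash(candidate.path).startswith(_with_slash(parsed.path))
        if same_host and below_seed:
            return True
    return False
```

A rule like this keeps the crawler from wandering off the nominated site, while the exclusion patterns give QA a place to record crawler traps (for example, endless calendar pages) found during review.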