Progress Made and Lessons Learned through Collaborative Web Archiving Projects
1. Progress Made and Lessons Learned through Collaborative Web Archiving Projects
Anna Perricci
Columbia University Libraries
Archive-It Partner Meeting 2014
November 18, 2014
2. Web Resources Archiving Collaboration
• Many thanks to the Mellon Foundation
• Building collaborations among
– The web archiving community
– Other research libraries
– Users and potential users of web archives
– Site creators
3. Incentive awards projects
to advance web archiving tools
Warcbase: Building a Scalable Web Archiving Platform on HBase
and Hadoop. (Jimmy Lin, University of Maryland)
Archiving Transactions Towards Uninterruptible Web Service
(Zhiwu Xie and Edward A. Fox, Virginia Tech)
4. Incentive awards projects
to advance web archiving tools
Visualizing Digital Collections of Web Archives (Michele
Weigle, Old Dominion University)
Tools for Managing Seed URLs (Michael Nelson, Old
Dominion University)
5. Incentive awards projects
to advance web archiving tools
Perma.cc: Mitigating the Pervasive Problem of Link Rot in Scholarly Works and
Preserving Online Content (Kim Dulin, The Harvard Library Innovation Lab)
Free Law Project (providing free access to primary legal materials, developing
legal research tools, and supporting academic research on legal corpora)
6. Building an efficient, coherent, and scalable
national framework for collecting web content
8. Program Components
• Communication and coordination
• Seed management and harvest
• Supplemental quality review (QA testing)
• MARC Metadata
• Local preservation storage (seeking solutions)
9. The first 18 months of collaborative collecting
• Planning, needs assessment (interviews with stakeholders including
Associate University Librarians for collection development at each Borrow
Direct institution in 2013), timelines created
• Group communication (spreadsheets, Basecamp), cultivating dialogs
• Coordinating seed URL nominations for pilot collections (CCWA,
CAUSEWAY), QA testing, and creation of MARC records
• Trying out workflows for optimal balance of involvement and efficient
forward motion on projects
• In planning stages for cost sharing and a 5-year plan for Borrow Direct/Ivy
Plus collaborations
11. Contemporary Composers Web Archive
Selectors
• Borrow Direct Music Librarians Group: music librarians at Brown,
Columbia, Cornell, Dartmouth, Harvard, Johns Hopkins, Princeton,
and Yale universities, MIT, and the universities of Chicago and
Pennsylvania
Cataloging expertise
• Russell Merritt (cataloger specializing in music resources)
• Kate Harcourt (Director of Original and Special Materials Cataloging)
• Alex Thurman (Web Resources Collection Coordinator)
14. Progress on CCWA & lessons learned so far
By the numbers:
• 11 curators participating
• 56 sites currently available in Archive-It
– 23 additional sites for follow up
• 27 GB of content archived (268,519 URLs)
• 50 MARC records in WorldCat as of 11/18/14
– Russell Merritt (music cataloger) collaboratively developed MARC records
for composers' websites; further cataloging of available sites through 2CUL
Outreach
• SAA presentation on MARC records for CCWA
http://www.slideshare.net/annaperricci/lightning-talk-for-session-703-of-society-of-american-archivists
• Over 30 sites tested for quality by five music librarians;
bibliographic assistant on the grant tested all sites in collection
15. CCWA Permissions
77 Composers
Yes (37)
No (0)
Did not respond (35)
No contact info (2)
Recently died/did not contact (3)
17. Creating MARC records for web archives
• Creating MARC records for archived websites is standard
practice at CUL
– MARC records make web archives discoverable in CLIO
(Columbia Libraries Information Online)
• Collection level and seed level records
• Will use Archive-It interface to add Dublin Core metadata
18. Anticipating wider use of MARC records
• Records have been regularly
released to WorldCat
• Collaborators on cataloging
were attentive to which
fields will ordinarily be
stripped out when a MARC
record is imported to
another institution’s OPAC
22. Progress on CAUSEWAY & lessons learned
• Curators from 9 Borrow Direct institutions (Ivies Plus Art &
Architecture Group)
– Lead advisors: Carole Ann Fabian and Chris Sala
• 137 seed URLs (over 100 harvested and being released as sites
are tested, cataloged and assigned metadata in Archive-It)
• 51 GB of content archived (1,006,114 URLs)
• Over 60 sites available in Archive-It with DC metadata (also all
60+ have MARC records in CLIO)
Outreach
• Update sent to IVAAG soliciting feedback
• Gave update and got feedback at semiannual IVAAG meeting
• Presentation scheduled for ARLIS/NA 2015
23. CAUSEWAY Permissions
137 Site owners
Yes (74)
No (3)
Later (2)
No contact info (2)
Did not respond (56)
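The permissions tallies for CCWA (slide 15) and CAUSEWAY (slide 23) can be compared as shares of each total. The following is a minimal sketch of that arithmetic, using only the counts shown on the slides; the function name `response_shares` is hypothetical and not part of the project's tooling.

```python
def response_shares(tallies):
    """Return each category's share of the total, as a percentage rounded to 0.1."""
    total = sum(tallies.values())
    return {label: round(100 * count / total, 1) for label, count in tallies.items()}

# Counts copied from the CCWA permissions slide (77 composers).
ccwa = {
    "Yes": 37,
    "No": 0,
    "Did not respond": 35,
    "No contact info": 2,
    "Recently died/did not contact": 3,
}

# Counts copied from the CAUSEWAY permissions slide (137 site owners).
causeway = {
    "Yes": 74,
    "No": 3,
    "Later": 2,
    "No contact info": 2,
    "Did not respond": 56,
}

for name, tallies in (("CCWA", ccwa), ("CAUSEWAY", causeway)):
    print(name, sum(tallies.values()), response_shares(tallies))
```

Both sets of counts add up to the stated totals (77 and 137). Roughly 48% of CCWA composers and 54% of CAUSEWAY site owners answered yes; in both collections the largest remaining group is those who did not respond.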
28. Cataloging expertise brought to CAUSEWAY
• Alex’s expertise in cataloging architecture and urban planning
sites (built through collaboration with Chris Sala on the Avery
collecting of web archives) equips him to make more specific
MARC records for sites in CAUSEWAY
• Columbia University art and architecture librarians encourage
users to find resources via records in the OPAC, so access to
CAUSEWAY sites will likely be via the MARC records, which point
to the calendar page for archived sites
• Alex is working with our Bibliographic Assistant, Naeema Akter
(position funded by the grant as well) to add appropriate
metadata for better browsing in the Archive-It interface
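To illustrate how a MARC record can point to the calendar page for an archived site, here is a rough sketch that builds an 856 (Electronic Location and Access) field as plain text. Everything specific in it is an assumption for illustration: the calendar URL pattern, the collection ID 1234, the example seed URL, the indicator values, and the helper names. It does not show CUL's actual records or workflow.

```python
def calendar_url(collection_id, seed_url):
    # Assumed pattern for an Archive-It calendar page; verify against the live service.
    return f"https://wayback.archive-it.org/{collection_id}/*/{seed_url}"

def marc_856(collection_id, seed_url, note="Archived site"):
    # First indicator 4 = HTTP access method; second indicator 0 = the resource itself
    # (an assumption; local practice may differ).
    # $u holds the URI and $z a public note shown to catalog users.
    return f"856 40 $u {calendar_url(collection_id, seed_url)} $z {note}"

# Hypothetical collection ID and seed URL, for illustration only.
print(marc_856(1234, "http://example.org/"))
```

A link of this kind takes a catalog user from the OPAC record to the list of capture dates rather than to a single capture.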
30. CAUSEWAY goals for the remainder of the grant
• Collect all nominated sites in scope, test for quality, create a MARC
record for each archived website (by early 2015)
• Evaluate quality and solicit feedback (ongoing)
• Meet at ARLIS/NA and discuss progress (March 2015)
– Anna will also give a presentation on collaborative web
archiving projects at ARLIS/NA
• Establish ongoing workflows and goals (2015 and onward)
• End of pilot phase: December 2015
32. Pilot climate change collecting
& lessons learned so far
• 25 selectors from 5 institutions
Great range of fields:
– Wide variety of area studies (9)
– Social science (5)
– Science and environmental science (4)
– Medical (1), Law (1), Special Collections (1)
– Collection Development AUL (3), Preservation (1)
• 127 seed websites nominated (some duplication)
• A lot of enthusiasm for topic
33. What we’ve learned about
workflows and scale
• Distributing work does not reduce costs
• Collaborative effort builds the project and new tasks promote
professional growth
• Quality assurance and cataloging are integral to the process of
creating high-quality collections of web archives
42. Wider reach with guidelines rather than
suggesting changes on a case-by-case basis
43. Web archiving initiatives
focusing on art resources
An initiative designed to address the “urgent need to document the
dynamic web-based versions of auction catalogues, catalogues
raisonnés, and scholarly research projects, as well as artist, gallery,
and museum websites” (http://www.nyarc.org/content/web-archiving)
Artist Files Special Interest Group
44. What do you want to learn
about web archiving?
Do you have any suggestions on how the SAA Web
Archiving Roundtable can help you develop your
knowledge of web archiving?
Categories we identified based on the 33 responses:
– Description
– Preservation
– Access/ Use
– Project Management/ Collaboration
– Appraisal/ Collection Dev/ Policy
– Technology/ Capture/ Tools
– Business Case/ Costs/ Best Practices
45. Some presentations, papers, panels & posters during grant
• Moderated: “Web Archiving: Experiences, Perspectives and Possibilities” held at METRO on 10/20/14
• Presentation (lightning talk): “MARC Records for the Contemporary Composers Web Archive” for the Society of
American Archivists annual conference on 8/16/14
URL (via Academic Commons): http://dx.doi.org/10.7916/D8028Q3S
• Presentation: “SAA Web Archiving Roundtable Education Needs Assessment Survey Results” for the SAA Web
Archiving Roundtable meeting at Society of American Archivists annual conference (co-presented with John Bence)
on 8/14/14
• Presentation: “How Collaboration Can Save [More of] the Web: Recent Progress in Collaborative Web Archiving
Initiatives” for the METRO Conference 2014 on 1/15/14
• Poster session: “Assessment of the Effectiveness of the Human Rights Web Archive @Columbia University” (co-presented
with Pamela Graham) at the ACRL/NY Symposium on 12/6/13
URL (via Academic Commons): http://dx.doi.org/10.7916/D8BG2KZ9
• Presentation: “How Collaboration Can Save [More of] the Web: Recent Progress in Collaborative Web Archiving
Initiatives” for the Best Practices Exchange on 11/14/13 (with Scott Reed)
URL (via Academic Commons): http://dx.doi.org/10.7916/D8G73BNK
• Presentation: “Web Archiving Resource Collaboration” at CrawlCamp held at
METRO on 7/17/13
46. Are project elements
on schedule & within budget?
• So far, yes, though we have plenty of challenges and work
ahead of us
• Steady progress on citation analysis but it’s been much harder
than we thought it’d be
• Lots of room for engagement and teamwork, including
maintenance and coordination of cooperative efforts
49. The next 12.5 months
• Complete remainder of work called for in grant
• Establish shared cost model for collaborative collection building
(e.g. CCWA and CAUSEWAY)
• Plan for scaling (maintenance and growth)
• Codify roles for meaningful involvement in web archiving efforts
• Contribute to professional organizations to strengthen web
archiving efforts nationally and internationally
50. Credits to some of many collaborators
• Bob Wolven, Alex Thurman, Naeema Akter
• Pamela Graham, Kate Harcourt, Christina Harlow
• Talia Jimenez, Stephen Davis, incentive awards oversight panel:
Kris Carpenter, Mark Phillips, Rob Sanderson & Perry Willett
• Elizabeth Davis, Russell Merritt & Borrow Direct music librarians
• Carole Ann Fabian, Chris Sala, Ivies Plus Art & Architecture Group
• Borrow Direct Associate University Librarians for Collection
Development group
• Climate change selectors at Borrow Direct institutions
• Archive-It staff
• Community for discussion and participation
Including: NYARC, METRO, International Internet Preservation Consortium
(IIPC), SAA Web Archiving Roundtable, ARLIS/NA Artist Files SIG