Using Wayback Machine for Research

This presentation discusses the Wayback Machine, open source software that many institutions use to archive and provide access to historical web pages. It describes common limitations of web archives, such as displaced or missing page elements and errors on JavaScript-heavy pages, and gives workarounds such as disabling JavaScript. It also offers strategies for finding pages that appear to be missing from an archive, for example using a search engine to discover a site's historical URL when the URL has changed. It closes by encouraging readers to help identify important websites to archive for future access.

Nicholas Taylor
Repository Development Group
Using Wayback Machine for Research

Using Wayback Machine for Research

1. Nicholas Taylor Repository Development Group Using Wayback Machine for Research

2. What Is the WAYBACK MACHINE?

3. WABAC Machine?

4. Internet Archive’s Wayback Machine

5. not one, but many Wayback Machines
   • open source software to “replay” web archives
   • rewrites links to point to archived resources
   • allows for temporal navigation within archive
   • used by many web archiving institutions
   • 33 out of 62 initiatives listed on Wikipedia

6. Government of Canada Web Archive

7. Government of Canada Web Archive

8. Portuguese Web Archive

9. Web Archive Singapore

10. Web Archive Singapore

11. Catalonian Web Archive

12. Catalonian Web Archive

13. California Digital Library Web Archiving Service

14. Harvard University Web Archive Collection Service

15. Common LIMITATIONS AND WORKAROUNDS

16. limitation: banner displaces page elements

17. workaround: hide the banner

18. limitation: AJAX-enabled sites

19. limitation: AJAX-enabled sites

20. workaround: disable JavaScript

21. limitation: nav menu link errors

22. workaround: insert live site URL in archive

23. workaround: insert live site URL in archive

24. workaround: insert live site URL in archive

25. limitation: no full-text search

26. workaround: none yet, but R&D ongoing

27. Basic MECHANICS

28. structure of a Wayback Machine URL
   http://webarchiveqr.loc.gov/loc_sites/20120131201510/http://www.loc.gov/index.html
   • Wayback Machine URL: http://webarchiveqr.loc.gov/
   • collection: loc_sites
   • date/timestamp (YYYYMMDDHHMMSS): 20120131201510
   • URL of archived resource: http://www.loc.gov/index.html
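The URL structure on this slide can be taken apart mechanically. The following is a minimal sketch, assuming the replay URL contains a full 14-digit timestamp between the replay prefix (Wayback Machine URL plus collection) and the archived resource's URL. The names `WaybackURL` and `parse_wayback_url` are hypothetical and not part of any Wayback software.

```python
import re
from datetime import datetime
from typing import NamedTuple


class WaybackURL(NamedTuple):
    prefix: str       # Wayback Machine URL plus collection, with trailing slash
    timestamp: str    # 14-digit capture timestamp, YYYYMMDDHHMMSS
    original: str     # URL of the archived resource


# Assumption: the first 14-digit path segment is the capture timestamp.
_PATTERN = re.compile(r"^(?P<prefix>.+?/)(?P<timestamp>\d{14})/(?P<original>.+)$")


def parse_wayback_url(url: str) -> WaybackURL:
    """Split a replay URL into prefix, timestamp, and archived URL."""
    match = _PATTERN.match(url)
    if match is None:
        raise ValueError(f"not a recognizable Wayback Machine URL: {url}")
    return WaybackURL(match["prefix"], match["timestamp"], match["original"])


parts = parse_wayback_url(
    "http://webarchiveqr.loc.gov/loc_sites/20120131201510/http://www.loc.gov/index.html"
)
# Interpret the timestamp as a capture date and time.
captured_at = datetime.strptime(parts.timestamp, "%Y%m%d%H%M%S")
print(parts.prefix, captured_at.isoformat(), parts.original)
```

Archives that encode the timestamp differently (for example in a query string, as in the Harvard example in the notes) would need a different pattern.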

33. document wildcarding

34. document wildcarding

35. document wildcarding
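The notes for these slides describe three wildcard patterns: any date, a single year, and every resource captured for a domain on a given day. A rough sketch of how such URLs might be assembled follows. Only the all-dates form (`*` in place of the timestamp) appears verbatim in the notes; the partial-timestamp and trailing-`*` forms are assumptions about the syntax, and the function names are hypothetical.

```python
def all_captures(prefix: str, url: str) -> str:
    """Wildcard the whole timestamp: list every capture of `url`."""
    return f"{prefix}*/{url}"


def captures_in_year(prefix: str, year: int, url: str) -> str:
    """Assumed syntax: a partial timestamp followed by * limits the date range."""
    return f"{prefix}{year:04d}*/{url}"


def resources_on_day(prefix: str, day: str, site_url: str) -> str:
    """Assumed syntax: wildcard both the time of day and the resource path.

    `day` is a YYYYMMDD string; `site_url` is the site root.
    """
    return f"{prefix}{day}*/{site_url.rstrip('/')}/*"


prefix = "http://was.nl.sg/wayback/"          # replay prefix taken from the notes
site = "http://www.biosingapore.org.sg/"
print(all_captures(prefix, site))
print(captures_in_year(prefix, 2008, site))
print(resources_on_day(prefix, "20080404", site))
```

Whether a given archive honors the second and third forms depends on its Wayback deployment, so they are worth trying rather than relying on.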

36. Strategies for FINDING MISSING RESOURCES

37. removed or moved?
   • don’t start with the archive
   • missing resources have often just moved (Klein & Nelson, 2010)
   • Synchronicity for Firefox helps find new location
   • scrapes archived version for “fingerprint” keywords; uses them to query search engines
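The "fingerprint" idea can be illustrated with a toy keyword extractor. This is a simplified sketch, not Synchronicity's actual algorithm: it ranks words by raw frequency after dropping a small stopword list, whereas a real tool would likely weight terms more carefully. The stopword list, the sample text, and the search endpoint `https://search.example/?q=` are all made up for illustration.

```python
import re
from collections import Counter
from typing import List
from urllib.parse import quote_plus

# Assumption: a tiny illustrative stopword list; real lists are much longer.
STOPWORDS = frozenset({"the", "and", "for", "that", "with", "this", "from", "are", "was"})


def fingerprint(text: str, k: int = 5) -> List[str]:
    """Return the k most frequent non-stopword terms of the archived text."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


def search_query_url(keywords: List[str]) -> str:
    """Build a query URL for a hypothetical search engine endpoint."""
    return "https://search.example/?q=" + quote_plus(" ".join(keywords))


# Made-up text standing in for the content of an archived page.
archived_text = (
    "Federal IT Dashboard: the dashboard tracks federal IT investments. "
    "Dashboard data covers federal agencies."
)
keywords = fingerprint(archived_text, k=2)
print(search_query_url(keywords))
```

The resulting query is then run against the live web in the hope that the moved page still contains the same distinctive terms.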

38. MementoFox

39. MementoFox
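Behind the MementoFox slider, the Memento protocol (RFC 7089) lets a client ask for the version of a resource closest to a chosen time by sending an `Accept-Datetime` request header. The sketch below only formats that header and sends nothing over the network; the function name is hypothetical, and which TimeGate to send the request to is left open.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict


def accept_datetime_header(moment: datetime) -> Dict[str, str]:
    """Format a timezone-aware datetime as an Accept-Datetime header (HTTP date in GMT)."""
    utc_moment = moment.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_moment, usegmt=True)}


# Request the state of a page as of the capture time used elsewhere in the slides.
headers = accept_datetime_header(datetime(2012, 1, 31, 20, 15, 10, tzinfo=timezone.utc))
print(headers)
```

An HTTP client would attach these headers to a request for the original URL's TimeGate, which then redirects to the closest archived copy.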

40. find archives for a site whose URL has changed
   • website URL changed recently
   • historical URL is unknown
   • solution: use search engine to find historical URL then apply it in the archive

41. Federal IT Dashboard

42. check Internet Archive’s Wayback Machine

43. IA Wayback coverage goes back to July 2010

44. LCWA only goes back to June 2011

45. use search engine to find historical URL

46. use search engine to find historical URL

47. White House IT Dashboard announcement

48. note the redirect from http://it.usaspending.gov/

49. append URL to IA Wayback URL

50. append URL to LC Wayback URL
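Appending the historical URL to a Wayback prefix is plain string composition. A minimal sketch follows, assuming `https://web.archive.org/web/` as the Internet Archive replay prefix (the slides do not show it) and using `*` to list all captures; the helper name is hypothetical.

```python
# Assumption: the public Internet Archive replay prefix; not shown in the slides.
IA_WAYBACK_PREFIX = "https://web.archive.org/web/"


def archive_lookup_url(wayback_prefix: str, historical_url: str, timestamp: str = "*") -> str:
    """Append a (historical) live-site URL to a Wayback prefix.

    `timestamp` may be a 14-digit YYYYMMDDHHMMSS value or "*" for all captures.
    """
    if not wayback_prefix.endswith("/"):
        wayback_prefix += "/"
    return f"{wayback_prefix}{timestamp}/{historical_url}"


# The historical IT Dashboard URL noted on the redirect slide.
print(archive_lookup_url(IA_WAYBACK_PREFIX, "http://it.usaspending.gov/"))
```

The same helper works for any other archive once its replay prefix is known, for example the Library of Congress prefix shown on the URL structure slide.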

51. find archives for a site whose URL has changed
   • congressional committee hearings archive
   • live site URL doesn’t work in archive
   • solution: find a site in the archive that would link to the desired site, then navigate to contemporaneous snapshot

52. hearings archive only spans 2001-2006

53. hearings archive URL changed in 2011

54. truncate archival access URL
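One reading of "truncate" here is to cut the archived resource's URL back to the site root while keeping the replay prefix and timestamp, so that navigation can start from the archived home page. The sketch below implements that reading only; it is an assumption about what the slide demonstrates, and the function name is hypothetical.

```python
import re
from urllib.parse import urlsplit


def truncate_to_site_root(archival_url: str) -> str:
    """Keep the replay prefix and timestamp; reduce the archived URL to scheme and host."""
    match = re.match(r"^(.+?/\d{14}/)(.+)$", archival_url)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in: {archival_url}")
    replay_prefix, original = match.groups()
    parts = urlsplit(original)
    return f"{replay_prefix}{parts.scheme}://{parts.netloc}/"


root = truncate_to_site_root(
    "http://webarchiveqr.loc.gov/loc_sites/20120131201510/http://www.loc.gov/index.html"
)
print(root)
```

From the truncated URL, the archived site's own navigation can be followed to the section of interest.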

55. snapshot from prior to site change

56. navigate to appropriate section

57. navigate to appropriate section

58. find archives for a previously accessible webpage
   • records currently stored in password-protected part of site may have previously been publicly accessible
   • conceptual site organization lasts longer than exact link construction
   • solution: figure out where desired resource would be on the live site, then navigate to analogous section on archived site

59. location of resources on live site

60. location of resources on live site

61. authentication required

62. check the site in the archive

63. navigate to an individual capture

64. navigate to appropriate section

65. navigate to appropriate section

66. How You Can GET INVOLVED

67.
   • what websites from today would you want to be able to consult in five, ten, twenty years’ time?
   • have you told us what is important to capture?
   help us to help you

68. End of Term 2012 Web Archive

69. Other USEFUL RESOURCES

70. End of Term 2008 Web Archive

71. CyberCemetery

72. LCWA

73. Project One Web Archives

74. links
   • Library of Congress Web Archiving Program: http://www.loc.gov/webarchiving/
   • Library of Congress Web Archives: http://loc.gov/lcwa/
   • International Internet Preservation Consortium: http://netpreserve.org/
   • National Digital Information Infrastructure and Preservation Program: http://www.digitalpreservation.gov/

75. questions? webcapture@loc.gov

Editor's notes

Mr. Peabody and Sherman’s time machine plot device from the television show “Rocky & Bullwinkle.”
The Wayback Machine most people are familiar with.
http://www.collectionscanada.gc.ca/webarchives/20071114183551/http://www.accord-treaty.gc.ca/main.asp?language=0
http://www.collectionscanada.gc.ca/webarchives/*/http://www.accord-treaty.gc.ca/main.asp?language=0
http://www.arquivo.pt/wayback/wayback/id4390263index3?l=en
http://was.nl.sg/wayback/20080404151626/http://www.biosingapore.org.sg/
http://was.nl.sg/wayback/*/http://www.biosingapore.org.sg/
http://www.padi.cat:8080/wayback/20120327044230/http://www.udg.edu/
http://www.padi.cat:8080/wayback/*/http://www.udg.edu/
http://webarchives.cdlib.org/sw16689n33/http://bawsca.org/
http://wax.lib.harvard.edu/collections/wayback.do?stamp=20080714184732&lang=eng&primColl=61&seed=175&liveWebUrl=tiffanni.blogspot.com%2F
When the Twitter link in the footer is clicked…
…the AJAX code truncates the URL, resulting in a blank page.
If you disable JavaScript in the browser and then click on the Twitter link, the page loads fine.
The navigation menu layout is awry and the links aren’t clickable.
Just because Wayback can’t properly rewrite the link doesn’t mean the crawler didn’t capture it. Navigate to the live site.
Find the desired URL.
Append the desired URL to the Wayback URL.
In the Library of Congress Web Archives, it’s only possible to search the bibliographic records.
The British Library and Internet Archive are exploring Lucene/Solr for full-text searching of web archives.
Note the live site URL.
Appending the live site URL to the Wayback URL takes you to a “snapshot” of that page in the archive.
Full date range is wildcarded (any date), so all snapshots for that URL are presented.
Date range is wildcarded to include only those captures from the specified year.
An individual page in the archive.
The time and specific resource are wildcarded, so it shows all resources captured for the specified domain on the specified day.
An example of one of the captured resources in the list.
Example of a live site.
Adjust the slider to request a Memento (i.e. archived resource) for the current URL.
We know that the website existed before then; how do we find it?
Copy the link to the IT Dashboard.
Additional captures from 2009 and 2010 are presented in the archive.
Additional captures from 2009 are presented in the archive.
The teleconference archives are in the events section.
If you click on any of the individual calls…
…you’re taken to an authentication page.
Even though the site URLs changed, there’s a decent chance that the teleconference archives were previously located in the events section.
Sure enough, they’re there, and not password-protected.
http://eotarchive.cdlib.org/2012.html
http://eotarchive.cdlib.org/search?browse-all=yes
http://govinfo.library.unt.edu/
http://lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html
