2. Definition of Web Archiving
“Web archiving is the process of collecting portions of the World Wide Web
and ensuring the collection is preserved in an archive”
such as an archive site, for future researchers, historians, and the public. Due to
the massive size of the Web, web archivists typically employ web crawlers for
automated collection. The largest web archiving organization based on a
crawling approach is the Internet Archive, which strives to maintain an archive of
the entire Web. National libraries, national archives and various consortia of
organizations are also involved in archiving culturally important Web content.
Commercial web archiving software and services are also available to
organizations that need to archive their own web content for corporate heritage,
regulatory, or legal purposes.
3. Web Crawlers
A Web crawler is a computer program that browses the World Wide Web in a
methodical, automated manner. Other terms for Web crawlers are ants,
automatic indexers, bots, Web spiders, Web robots, and Web scutters.
• This process is called Web crawling or spidering. Many sites, in particular
search engines, use spidering as a means of providing up-to-date data. Web
crawlers are mainly used to create a copy of all the visited pages for later
processing by a search engine that will index the downloaded pages to
provide fast searches.
• Also, crawlers can be used to gather specific types of information from
Web pages, such as harvesting e-mail addresses.
• A Web crawler is one type of bot, or software agent. In general, it starts
with a list of URLs to visit, called the seeds. As the crawler visits these
URLs, it identifies all the hyperlinks in the page and adds them to the list
of URLs to visit, called the crawl frontier. URLs from the frontier are
recursively visited according to a set of policies. A minimal sketch of this
seed-and-frontier loop is given below.
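A minimal sketch of such a seed-and-frontier loop in Python (standard library only; the page limit, timeout, and lack of politeness handling are simplifications for illustration):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=10):
    frontier = deque(seeds)            # the crawl frontier: URLs still to visit
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                   # skip pages that cannot be downloaded
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)   # newly discovered URLs grow the frontier
    return visited


# Example with a hypothetical seed: crawl(["https://example.org/"])
```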
4. What Is a Search Engine
“Search engines, such as Google and HotBot, consist of a software package
that crawls the Web, then extracts and organizes the data in a database.
People can then submit a search query using a Web browser. The search engine
locates the appropriate data in the database and displays it via the browser.”
Search engines have three major elements:
• The spider, also called the crawler, harvester, robot or gatherer. The spider visits
a Web page, reads it, and then follows links to other pages within the site. The
spider returns to the site on a regular basis, such as every month or two, to look
for changes.
• The index. Everything the spider finds goes into the index. The index is like a
giant book containing a copy of every web page the spider finds. If a web
page changes, the book is updated with the new information.
• Search engine software. This is the program that sifts through the millions of
pages recorded in the index to find matches to a search and rank them in order of
what it believes is most relevant. Search engine software is also available to run
on a local Web site. The software has the same basic components, but the spider
visits only the local site or a limited number of sites in a community. A toy
sketch of the index and matching step follows this list.
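To make the index and the matching step concrete, here is a toy Python sketch (the page texts and URLs passed in are hypothetical; a real engine also stores word positions, metadata, and ranking signals):

```python
from collections import defaultdict


def build_index(pages):
    """pages: dict mapping URL -> page text (what the spider collected).
    Returns an inverted index: word -> set of URLs containing that word."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return URLs containing every query word: a crude matching step.
    Real engines also rank the matches by estimated relevance."""
    words = query.lower().split()
    if not words:
        return set()
    results = index.get(words[0], set()).copy()
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Example with made-up pages:
# idx = build_index({"http://a.example/": "web archiving with crawlers",
#                    "http://b.example/": "crawlers index the web"})
# search(idx, "crawlers web")   # -> both URLs
```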
5. Web Crawler Behavior
The behavior of a Web crawler is the outcome of a combination of
policies:
• A selection policy that states which pages to download,
• A re-visit policy that states when to check for changes to the
pages,
• A politeness policy that states how to avoid overloading Web
sites (a small illustration follows this list), and
• A parallelization policy that states how to coordinate distributed
Web crawlers.
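One simple form of politeness policy is a per-host minimum delay between requests; a small Python sketch (the two-second default is an arbitrary assumption, and real crawlers also honour robots.txt and crawl-delay directives):

```python
import time
from urllib.parse import urlparse


class PolitenessPolicy:
    """Enforce a minimum delay between successive requests to the same host."""

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.last_request = {}            # host -> time of the last request

    def wait(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        earliest = self.last_request.get(host, 0.0) + self.min_delay
        if now < earliest:
            time.sleep(earliest - now)    # back off before hitting the host again
        self.last_request[host] = time.monotonic()
```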
6. High Level Architecture of a Web Crawler
Web crawlers are a central part of search engines, and details of their algorithms and
architecture are often kept as business secrets.
8. Internet Archive
“The Internet Archive is a non-profit digital library with the stated mission of
‘universal access to all knowledge’. It offers permanent storage and access to
collections of digitized materials, including websites, music, moving images,
and books. The Internet Archive was founded by Brewster Kahle in 1996.”
• With offices located in San Francisco, California, USA and data centers in
San Francisco, Redwood City, and Mountain View, California, USA, the
Archive's largest collection is its web archive, "snapshots of the World
Wide Web."
• The Archive allows the public to both upload and download digital
material to its data cluster, and provides unrestricted online access to that
material at no cost. The Archive also oversees one of the world's largest
book digitization projects. It is a member of the American Library
Association and is officially recognized by the State of California as a
library.
9. Brewster Kahle founded the Archive in 1996 at the same time that he began
the for-profit web crawling company Alexa Internet. The Archive began
archiving the World Wide Web in 1996, but it did not make this collection
available until 2001, when it developed the Wayback Machine. Now the
Internet Archive includes texts, audio, moving images, and software. It hosts a
number of other projects: the NASA Images Archive, the contract crawling
service Archive-It, and the wiki-editable library catalog and book information
site Open Library.
According to its website:
– Most societies place importance on preserving artifacts of their culture
and heritage. Without such artifacts, civilization has no memory and
no mechanism to learn from its successes and failures. Our culture
now produces more and more artifacts in digital form. The Archive's
mission is to help preserve those artifacts and create an Internet
library for researchers, historians, and scholars.
10. Wayback Machine
The Internet Archive's "Wayback Machine" is the service that allows its
archives of the World Wide Web to be searched and accessed. This service
allows users to see archived versions of web pages from the past. Millions
of websites and their
associated data (images, source code, documents, etc.) are
saved in a gigantic database. The service can be used to see
what previous versions of websites used to look like, to grab
original source code from websites that may no longer be
directly available, or to visit websites that no longer even
exist. Not all websites are available, however, because many
website owners choose to exclude their sites.
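For example, the Internet Archive publishes an availability endpoint at https://archive.org/wayback/available that reports the stored snapshot closest to a requested URL; a hedged Python sketch of querying it (the response field names are assumed from the API's documented JSON and may change):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def latest_snapshot(url):
    """Ask the Wayback Machine availability API for the closest snapshot
    of a URL and return the snapshot URL, or None if nothing is archived."""
    query = urlencode({"url": url})
    with urlopen(f"https://archive.org/wayback/available?{query}", timeout=10) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None


# Example: latest_snapshot("example.com")
# -> something like "http://web.archive.org/web/<timestamp>/http://example.com"
```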
11. Web Archiving Techniques
The most common web archiving technique uses web crawlers to automate the
process of collecting web pages. Web crawlers typically view web pages in the
same manner that users with a browser see the Web, and therefore provide a
comparatively simple method of remotely harvesting web content (a minimal
single-page harvesting sketch follows the list below). Examples of web
crawlers frequently used for web archiving include:
• Automated Internet Sessions in biterScripting
• Heritrix
• HTTrack
• Wget
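At the level of a single resource, such harvesting amounts to an ordinary HTTP GET whose raw bytes are stored together with a capture date; a minimal Python sketch (the directory name and filename scheme are arbitrary choices for illustration):

```python
import os
import time
from urllib.request import urlopen


def harvest(url, out_dir="snapshots"):
    """Fetch one web resource the way a browser would (a plain HTTP GET)
    and save the raw bytes under a timestamped name."""
    os.makedirs(out_dir, exist_ok=True)
    with urlopen(url, timeout=10) as resp:
        payload = resp.read()
    stamp = time.strftime("%Y%m%d%H%M%S")               # capture date, e.g. 20060622190110
    safe_name = url.replace("://", "_").replace("/", "_")
    path = os.path.join(out_dir, f"{stamp}_{safe_name}")
    with open(path, "wb") as f:
        f.write(payload)
    return path
```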
12. Heritrix
• Heritrix is the Internet Archive’s web crawler, which
was specially designed for web archiving. It is open-source
and written in Java. The main interface is
accessible using a web browser, and there is a
command-line tool that can optionally be used to
initiate crawls.
• Heritrix was developed jointly by Internet Archive
and the Nordic national libraries on specifications
written in early 2003. The first official release was in
January 2004, and it has been continually improved
by employees of the Internet Archive and other
interested parties.
13. Organizations Using Heritrix
A number of organizations and national libraries are
using Heritrix, among them:
- Bibliothèque nationale de France
- British Library
- National Library of Finland
- National Library of New Zealand
14. Bibliothèque Nationale de France
The Bibliothèque nationale de France (BnF) is the
National Library of France, located in Paris. It is
intended to be the repository of all that is published in
France. The current president of the library is Bruno
Racine.
15. British Library
The British Library is the national library of the United
Kingdom, and one of the world's largest libraries in terms of
total number of items. The library is a major research library,
holding over 150 million items from every country in the
world, in virtually all known languages and in many formats,
both print and digital: books, manuscripts, journals,
newspapers, magazines, sound and music recordings, videos,
play-scripts, patents, databases, maps, stamps, prints,
drawings. The Library's collections include around 14 million
books.
16. ARC File
• Heritrix by default stores the web resources it crawls in an ARC file. This
format has been used by the Internet Archive since 1996 to store its web
archives. The WARC file format, similar to ARC but more precisely
specified and flexible, can also be used. Heritrix can also be configured to
store files in a directory format similar to the Wget crawler, using the
URL to name the directory and filename of each resource.
• An ARC file stores multiple archived resources in a single file in order to
avoid managing a large number of small files. The file consists of a
sequence of URL records, each with a header containing metadata about
how the resource was requested, followed by the HTTP header and the
response, as in the example below (a small parsing sketch follows it).
Example:
• The version block (the first record) describes the file and names the header fields:

  filedesc://IA-2006062.arc 0.0.0.0 20060622190110 text/plain 76
  1 1 InternetArchive
  URL IP-address Archive-date Content-type Archive-length

• A URL record consists of a header line followed by the captured HTTP response:

  http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187
  HTTP/1.1 200 OK
  Date: Thu, 22 Jun 2006 19:01:15 GMT
  Server: Apache
  Last-Modified: Sat, 10 Jun 2006 22:33:11 GMT
  Content-Length: 30
  Content-Type: text/html

  <html> Hello World!!! </html>
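A small Python sketch of reading such records, assuming an uncompressed file whose header lines carry exactly the five fields shown above (real ARC files are usually gzip-compressed per record and allow richer version-2 headers):

```python
def read_arc_records(path):
    """Yield ((url, ip, date, content_type), payload_bytes) for each record
    of a simple, uncompressed ARC v1 file."""
    with open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break                              # end of file
            line = line.strip()
            if not line:
                continue                           # blank separator between records
            # Header: URL IP-address Archive-date Content-type Archive-length
            url, ip, date, ctype, length = line.decode("utf-8").split(" ")
            payload = f.read(int(length))          # HTTP headers + body (or the version block)
            yield (url, ip, date, ctype), payload
```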
17. Screenshot of Heritrix Admin Console
Stable release: 3.0.0 (December 5, 2009)
Written in: Java
Operating system: Linux/Unix-like; Windows (unsupported)
Type: Web crawler
License: GNU Lesser General Public License
Website: http://crawler.archive.
18. Database Archive
Database archiving refers to methods for archiving the underlying content of
database-driven websites. It typically requires the extraction of the database
content into a standard schema, often using XML. Once stored in that standard
format, the archived content of multiple databases can then be made available
using a single access system.
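As a sketch of that extraction step, the following Python code dumps one table of a SQLite database into a simple XML document (the database path, table name, and element names are placeholders rather than any standard schema):

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_table_to_xml(db_path, table, out_path):
    """Write every row of one table as XML: <table><row><field .../></row>...</table>."""
    conn = sqlite3.connect(db_path)
    cursor = conn.execute(f"SELECT * FROM {table}")      # table name is a trusted placeholder
    columns = [desc[0] for desc in cursor.description]

    root = ET.Element("table", name=table)
    for row in cursor:
        row_elem = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            field = ET.SubElement(row_elem, "field", name=column)
            field.text = "" if value is None else str(value)

    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
    conn.close()
```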
Transactional Archiving
Transactional archiving is an event-driven approach, which collects the actual
transactions which take place between a web server and a web browser. It is primarily
used as a means of preserving evidence of the content which was actually viewed on a
particular website, on a given date. This may be particularly important for organizations
which need to comply with legal or regulatory requirements for disclosing and retaining
information.
A transactional archiving system typically operates by intercepting every HTTP request
to, and response from, the web server, filtering each response to eliminate duplicate
content, and permanently storing the responses as bit streams.
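A minimal sketch of this idea as Python WSGI middleware, assuming each response body fits in memory and using a plain content hash as the duplicate filter (a real transactional archiving system also captures request data, headers, and exact timestamps):

```python
import hashlib
import os
import time


class TransactionalArchiver:
    """Wrap a WSGI application and store each unique response body once,
    keyed by the SHA-256 hash of its content."""

    def __init__(self, app, archive_dir="archive"):
        self.app = app
        self.archive_dir = archive_dir
        os.makedirs(archive_dir, exist_ok=True)

    def __call__(self, environ, start_response):
        body = b"".join(self.app(environ, start_response))   # intercept the response
        digest = hashlib.sha256(body).hexdigest()
        path = os.path.join(self.archive_dir, digest)
        if not os.path.exists(path):                          # filter out duplicate content
            with open(path, "wb") as f:
                f.write(body)
        # Record which URL returned which content, and when it was served.
        with open(os.path.join(self.archive_dir, "log.txt"), "a") as log:
            log.write(f"{time.strftime('%Y%m%d%H%M%S')} "
                      f"{environ.get('PATH_INFO', '')} {digest}\n")
        return [body]
```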
19. HTTrack
• HTTrack is a free and open-source Web crawler and offline browser,
developed by Xavier Roche and licensed under the GNU General Public License.
• It allows one to download World Wide Web sites from the Internet to a
local computer. By default, HTTrack arranges the downloaded site by the
original site's relative link-structure. The downloaded (or "mirrored")
website can be browsed by opening a page of the site in a browser; a small
sketch of this URL-to-local-path mapping follows the list.
• HTTrack uses a Web crawler to download a website. Some parts of the
website may not be downloaded by default because of the robots exclusion
protocol, unless that behaviour is disabled during the program run. HTTrack
can follow links that are generated with basic JavaScript and inside applets
or Flash, but not complex links.
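The arrangement by relative link structure can be pictured as a mapping from each URL to a local file path; a hedged Python sketch of that idea (this illustrates the general approach only, not HTTrack's actual rules):

```python
from urllib.parse import urlparse


def to_local_path(url):
    """Map a URL to a local relative path that mirrors the site's structure,
    e.g. http://example.org/docs/a.html -> example.org/docs/a.html."""
    parsed = urlparse(url)
    path = parsed.path or "/index.html"
    if path.endswith("/"):
        path += "index.html"          # directory URLs become index files
    return parsed.netloc + path


# A mirroring tool would then rewrite each href on a saved page to
# to_local_path(urllib.parse.urljoin(page_url, href)) so that the local copy
# can be browsed offline with working links.
```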
21. IIPC
The International Internet Preservation Consortium (IIPC) is an international
organization of libraries and other institutions that coordinates efforts to
preserve Internet content for the future. Membership is open to archives,
museums, libraries, and cultural heritage institutions.
Its membership includes:
• Austrian National Library,
• Biblioteka Narodowa,
• Bibliothèque et Archives nationales du Québec,
• Bibliothèque nationale de France,
• British Library,
• California Digital Library,
• Clementinum,
• German National Library,
• Institut national de l'audiovisuel,
• Internet Archive,
• Koninklijke Bibliotheek, National Library of the Netherlands,
• Library and Archives Canada,
• National and University Library in Zagreb,
• National and University Library of Iceland,
• National and University Library of Slovenia,
• National Diet Library,
• National Library Board,
• National Library of Australia,
• National Library of Catalonia,
• National Library of China,
• National Library of Finland,
• National Library of Israel,
• National Library of Korea,
• National Library of New Zealand,
• National Library of Norway,
• National Library of Poland,
• National Library of Scotland,
• National Library of Sweden,
• Royal Netherlands Academy of Arts and Sciences,
• Swiss National Library,
• The National Archives,
• United States Government Printing Office, and
• WebCite
22. Pandora Archive
• PANDORA - Australia's Web Archive is the national web archive for the
preservation of Australia's online publications. It was established by the
National Library of Australia in 1996, and is now built in collaboration
with a number of other Australian state libraries and cultural collecting
organizations, including the Australian Institute of Aboriginal and Torres
Strait Islander Studies, the Australian War Memorial, and the National
Film and Sound Archive.
• The PANDORA Archive collects selected Australian web resources,
preserves them, and makes them available for viewing. Access to the
archive is made available to the public via the Pandora web site. Web sites
are selected based on their cultural significance and research value in the
long term.
23. Difficulties and Limitations
Crawlers
Web archives that rely on web crawling as their primary means of
collecting the Web are affected by the difficulties of web crawling.
However, it is important to note that a native-format web archive, i.e. a fully
browsable web archive with working links, media, and so on, is only really
possible using crawler technology.
The Web is so large that crawling a significant portion of it takes a large
amount of technical resources, and the Web changes so fast that portions of
a website may change before a crawler has even finished crawling it.
24. Difficulties and Limitations
General limitations
Not only must web archivists deal with the technical challenges of web
archiving, they must also contend with intellectual property laws. Peter
Lyman states that "although the Web is popularly regarded as a public
domain resource, it is copyrighted; thus, archivists have no legal right to
copy the Web". However national libraries in many countries do have a
legal right to copy portions of the web under an extension of a legal
deposit.
Some private non-profit web archives that are made publicly accessible, like
WebCite or the Internet Archive, allow content owners to hide or remove
archived content that they do not want the public to have access to. Other
web archives are only accessible from certain locations or have regulated
usage.