The Development of Web Archiving

Dr. Essam Obaid
Definition of Web Archiving
“Web archiving is the process of collecting portions of the World Wide Web
and ensuring the collection is preserved in an archive, such as an archive
site, for future researchers, historians, and the public.”

Due to the massive size of the Web, web archivists typically employ web
crawlers for automated collection. The largest web archiving organization
based on a crawling approach is the Internet Archive, which strives to
maintain an archive of the entire Web. National libraries, national archives,
and various consortia of organizations are also involved in archiving
culturally important Web content. Commercial web archiving software and
services are also available to organizations that need to archive their own
web content for corporate heritage, regulatory, or legal purposes.
Web Crawlers
A Web crawler is a computer program that browses the World Wide Web in a
methodical, automated manner. Other terms for Web crawlers are ants,
automatic indexers, bots, Web spiders, Web robots, and Web scutters.
• This process is called Web crawling or spidering. Many sites, in particular
   search engines, use spidering as a means of providing up-to-date data. Web
   crawlers are mainly used to create a copy of all the visited pages for later
   processing by a search engine that will index the downloaded pages to
   provide fast searches.
• Also, crawlers can be used to gather specific types of information from
   Web pages, such as harvesting e-mail addresses.
• A Web crawler is one type of bot, or software agent. In general, it starts
   with a list of URLs to visit, called the seeds. As the crawler visits these
   URLs, it identifies all the hyperlinks in the page and adds them to the list
   of URLs to visit, called the crawl frontier. URLs from the frontier are
   recursively visited according to a set of policies.
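The seed and frontier loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular crawler is implemented: the PAGES dictionary is a hypothetical in-memory stand-in for the Web, and the names LinkExtractor, crawl, and max_pages are assumptions made for this example. The only policy applied here is "visit each URL once, breadth-first".

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "Web": URL -> HTML body (no network access needed).
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/">home</a> <a href="/c.html">C</a>',
    "http://example.org/c.html": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit URLs breadth-first, starting from the seeds."""
    frontier = deque(seeds)   # URLs still to visit (the crawl frontier)
    seen = set(seeds)         # URLs already queued, so none is queued twice
    visited = []              # URLs in the order they were fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # page missing from the toy Web
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


order = crawl(["http://example.org/"], PAGES.get)
print(order)
```

Passing the fetch function as a parameter keeps the loop independent of how pages are actually retrieved, so the same sketch could be pointed at a real HTTP client.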
WHAT IS A SEARCH ENGINE
“Search engines, such as Google and HotBot, consist of a software package
that crawls the Web, then extracts and organizes the data in a database.
People can then submit a search query using a Web browser. The search engine
locates the appropriate data in the database and displays it via the browser.”

  Search engines have three major elements:
• The spider, also called the crawler, harvester, robot or gatherer. The spider visits
  a Web page, reads it, and then follows links to other pages within the site. The
  spider returns to the site on a regular basis, such as every month or two, to look
  for changes.
• The index. Everything the spider finds goes into the index. The index is like a
  giant book containing a copy of every web page that the spider finds. If a web
  page changes, then this book is updated with new information.
• Search engine software. This is the program that sifts through the millions of
  pages recorded in the index to find matches to a search and rank them in order of
  what it believes is most relevant. Search engine software is also available to run
  on a local Web site. The software has the same basic components, but the spider
  just visits the local site or a limited number of sites in a community.
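How the index and the search software fit together can be illustrated with a toy inverted index. This is only a sketch under simple assumptions: the CRAWLED dictionary stands in for pages a spider has already fetched, the function names are made up for this example, and ranking by raw word count is a deliberate simplification of what real search engines do.

```python
import re
from collections import Counter, defaultdict

# Hypothetical pages that a spider has already fetched: URL -> text.
CRAWLED = {
    "http://example.org/a": "web archiving preserves web pages",
    "http://example.org/b": "a crawler visits pages",
    "http://example.org/c": "libraries archive books",
}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to a Counter of {url: occurrences}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] += 1
    return index


def search(index, query):
    """Return URLs containing every query word, highest total count first."""
    words = tokenize(query)
    # Check membership first so a lookup never adds empty entries to the index.
    if not words or any(word not in index for word in words):
        return []
    matches = set(index[words[0]])
    for word in words[1:]:
        matches &= set(index[word])

    def score(url):
        return sum(index[word][url] for word in words)

    # Ties are broken alphabetically so the result order is deterministic.
    return sorted(matches, key=lambda url: (-score(url), url))


index = build_index(CRAWLED)
print(search(index, "web pages"))
print(search(index, "pages"))
```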
Web Crawler Behavior
The behavior of a Web crawler is the outcome of a combination of
policies:

• A selection policy that states which pages to download,
• A re-visit policy that states when to check for changes to the
  pages,
• A politeness policy that states how to avoid overloading Web
  sites, and
• A parallelization policy that states how to coordinate distributed
  Web crawlers.
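As an illustration of one of these policies, the sketch below shows a possible politeness policy: a fixed minimum delay between two requests to the same host. The class name PolitenessScheduler, its methods, and the two-second delay are assumptions chosen for this example; real crawlers may use adaptive delays or other rules. Time is passed in explicitly so the example runs without actually sleeping.

```python
from urllib.parse import urlsplit


class PolitenessScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.next_allowed = {}   # host -> earliest permitted request time

    def wait_time(self, url, now):
        """Seconds the crawler should still wait before fetching url."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that url was fetched at time `now`."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.delay


scheduler = PolitenessScheduler(delay_seconds=2.0)
scheduler.record_fetch("http://example.org/a.html", now=10.0)
same_host = scheduler.wait_time("http://example.org/b.html", now=10.5)
other_host = scheduler.wait_time("http://example.net/", now=10.5)
print(same_host, other_host)
```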
High Level Architecture of a Web Crawler

[Figure: high-level architecture of a Web crawler]

Web crawlers are a central part of search engines, and details on their
algorithms and architecture are kept as business secrets.
Web Based Archives
Internet Archive
“The Internet Archive is a non-profit digital library with the stated mission of
"universal access to all knowledge". It offers permanent storage and access to
collections of digitized materials, including websites, music, moving images,
and books. The Internet Archive was founded by Brewster Kahle in 1996.”

•   With offices located in San Francisco, California, USA and data centers in
    San Francisco, Redwood City, and Mountain View, California, USA, the
    Archive's largest collection is its web archive, "snapshots of the World
    Wide Web."

•   The Archive allows the public to both upload and download digital
    material to its data cluster, and provides unrestricted online access to that
    material at no cost. The Archive also oversees one of the world's largest
    book digitization projects. It is a member of the American Library
    Association and is officially recognized by the State of California as a
    library.
Brewster Kahle founded the Archive in 1996 at the same time that he started
the for-profit web crawling company Alexa Internet. The Archive began to
archive the World Wide Web in 1996, but it did not make this collection
available until 2001, when it developed the Wayback Machine. Now the
Internet Archive includes texts, audio, moving images, and software. It hosts a
number of other projects: the NASA Images Archive, the contract crawling
service Archive-It, and the wiki-editable library catalog and book information
site Open Library.

According to its website:
   – Most societies place importance on preserving artifacts of their culture
      and heritage. Without such artifacts, civilization has no memory and
      no mechanism to learn from its successes and failures. Our culture
      now produces more and more artifacts in digital form. The Archive's
      mission is to help preserve those artifacts and create an Internet
      library for researchers, historians, and scholars.
Wayback Machine
The Internet Archive uses the name "Wayback Machine" for its service that
allows archives of the World Wide Web to be searched and accessed. This
service allows users to see archived versions of web pages from the past.
Millions of websites and their associated data (images, source code,
documents, etc.) are saved in a gigantic database. The service can be used
to see what previous versions of websites used to look like, to grab
original source code from websites that may no longer be directly available,
or to visit websites that no longer even exist. Not all websites are
available, however, because many website owners choose to exclude their
sites.
Web Archiving Techniques
The most common web archiving technique uses web crawlers to automate the
process of collecting web pages. Web crawlers typically view web pages in
the same manner that users with a browser see the Web, and therefore provide
a comparatively simple method of remotely harvesting web content. Examples
of web crawlers frequently used for web archiving include:
•   Automated Internet Sessions in biterScripting
•   Heritrix
•   HTTrack
•   Wget
Heritrix
• Heritrix is the Internet Archive’s web crawler, which
  was specially designed for web archiving. It is
  open-source and written in Java. The main interface is
  accessible using a web browser, and there is a
  command-line tool that can optionally be used to
  initiate crawls.
• Heritrix was developed jointly by the Internet Archive
  and the Nordic national libraries on specifications
  written in early 2003. The first official release was in
  January 2004, and it has been continually improved
  by employees of the Internet Archive and other
  interested parties.
Organizations Using Heritrix

A number of organizations and national libraries are
using Heritrix, among them:
- Bibliothèque nationale de France
- British Library
- National Library of Finland
- National Library of New Zealand
Bibliothèque nationale de France
The Bibliothèque nationale de France (BnF) is the
National Library of France, located in Paris. It is
intended to be the repository of all that is published in
France. The current president of the library is Bruno
Racine.
British Library
The British Library is the national library of the United
Kingdom, and one of the world's largest libraries in terms of
total number of items. The library is a major research library,
holding over 150 million items from every country in the
world, in virtually all known languages and in many formats,
both print and digital: books, manuscripts, journals,
newspapers, magazines, sound and music recordings, videos,
play-scripts, patents, databases, maps, stamps, prints, and
drawings. The Library's collections include around 14 million
books.
ARC FILE
• Heritrix by default stores the web resources it crawls in an ARC file. This
  format has been used by the Internet Archive since 1996 to store its web
  archives. The WARC file format, similar to ARC but more precisely
  specified and flexible, can also be used. Heritrix can also be configured to
  store files in a directory format similar to that of the Wget crawler,
  which uses the URL to name the directory and filename of each resource.
• An ARC file stores multiple archived resources in a single file in order to
  avoid managing a large number of small files. The file consists of a
  sequence of URL records, each with a header containing metadata about
  how the resource was requested, followed by the HTTP header and the
  response.
Example:

    filedesc://IA-2006062.arc 0.0.0.0 20060622190110 text/plain 76
    1 1 InternetArchive
    URL IP-address Archive-date Content-type Archive-length

    http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187
    HTTP/1.1 200 OK
    Date: Thu, 22 Jun 2006 19:01:15 GMT
    Server: Apache
    Last-Modified: Sat, 10 Jun 2006 22:33:11 GMT
    Content-Length: 30
    Content-Type: text/html

    <html>
    Hello World!!!
    </html>
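The header line of a URL record, with the fields URL, IP-address, Archive-date, Content-type, and Archive-length, can be read with a few lines of Python. The sketch below is illustrative only: ArcRecordHeader and parse_arc_header are names invented for this example, and it assumes the five space-separated fields shown in the example above, with the date written as YYYYMMDDhhmmss.

```python
from datetime import datetime
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int          # number of bytes that follow the header line


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one space-separated URL record header line."""
    url, ip_address, date, content_type, length = line.split()
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


header = parse_arc_header(
    "http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187"
)
print(header.url, header.archive_date.isoformat(), header.length)
```

A reader of the archive would use the length field to know how many bytes of HTTP header and response to read before the next record header begins.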
Screenshot of Heritrix Admin Console

[Screenshot: Heritrix admin console]

Stable release           3.0.0 / December 5, 2009 (14 months ago)
Written in               Java
Operating system         Linux/Unix-like/Windows(unsupported)
Type                     Web crawler
License                  GNU Lesser General Public License
Website                  http://crawler.archive.
Database Archive
Database archiving refers to methods for archiving the underlying content of
database-driven websites. It typically requires the extraction of the database
content into a standard schema, often using XML. Once stored in that standard
format, the archived content of multiple databases can then be made available
using a single access system.
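The extraction step can be sketched as follows. This is a minimal example under stated assumptions, not a standard archiving schema: the articles table is a hypothetical stand-in for a site's database, and the table/row/field element names and the function table_to_xml are invented for illustration. It uses only Python's built-in sqlite3 and xml.etree.ElementTree modules.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn, table):
    """Export every row of `table` as <row> elements under one <table>."""
    # The table name is interpolated directly, so it must come from a
    # trusted source (identifiers cannot be bound as SQL parameters).
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    root = ET.Element("table", name=table)
    for values in cursor:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, values):
            field = ET.SubElement(row, "field", name=column)
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical database behind a small news site.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles (title) VALUES (?)",
    [("Hello",), ("Web archiving",)],
)
xml_text = table_to_xml(conn, "articles")
print(xml_text)
```

Because every table is written with the same element names, exports from several databases could be loaded by one access system, which is the point made above.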

Transactional Archiving
Transactional archiving is an event-driven approach, which collects the actual
transactions which take place between a web server and a web browser. It is primarily
used as a means of preserving evidence of the content which was actually viewed on a
particular website, on a given date. This may be particularly important for organizations
which need to comply with legal or regulatory requirements for disclosing and retaining
information.
A transactional archiving system typically operates by intercepting every HTTP request
to, and response from, the web server, filtering each response to eliminate duplicate
content, and permanently storing the responses as bit streams.
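The store-and-deduplicate step can be sketched as below. This is a simplified illustration, not a description of any real product: the class TransactionArchive and its methods are hypothetical, interception of live HTTP traffic is left out, and duplicates are detected by comparing SHA-256 digests of the response bodies, which is one possible choice among several.

```python
import hashlib


class TransactionArchive:
    """Stores each distinct response body once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.bodies = {}   # digest -> response bytes (stored once)
        self.log = []      # (timestamp, url, digest) for every transaction

    def record(self, timestamp, url, body):
        digest = hashlib.sha256(body).hexdigest()
        # Duplicate content is filtered out: only unseen bodies are stored.
        self.bodies.setdefault(digest, body)
        self.log.append((timestamp, url, digest))
        return digest

    def viewed(self, url, timestamp):
        """Return the body served for url at exactly this timestamp, if any."""
        for logged_time, logged_url, digest in self.log:
            if logged_url == url and logged_time == timestamp:
                return self.bodies[digest]
        return None


archive = TransactionArchive()
archive.record("2011-02-01T09:00:00", "/terms", b"<html>v1</html>")
archive.record("2011-02-01T09:05:00", "/terms", b"<html>v1</html>")  # duplicate
archive.record("2011-02-02T10:00:00", "/terms", b"<html>v2</html>")
print(len(archive.log), len(archive.bodies))
```

The log keeps one entry per transaction, so the archive can still show what was served at a given moment even though identical bodies are stored only once.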
HTTrack
•   HTTrack is a free and open source Web crawler and offline browser,
    licensed under the GNU General Public License.

•   It allows one to download World Wide Web sites from the Internet to a
    local computer. By default, HTTrack arranges the downloaded site by the
    original site's relative link-structure. The downloaded (or "mirrored")
    website can be browsed by opening a page of the site in a browser.

•   HTTrack uses a Web crawler to download a website. By default, some parts
    of the website may not be downloaded because of the robots exclusion
    protocol, unless that check is disabled in the program. HTTrack can
    follow links that are generated with basic JavaScript and inside Applets
    or Flash, but not complex links.
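The robots exclusion protocol mentioned above can be illustrated with Python's standard urllib.robotparser module. This is a general sketch of how a crawler may consult robots.txt rules, not a description of HTTrack's internals; the robots.txt content, the ExampleBot user agent, and the example.org URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.org.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())   # parse rules without any network call

allowed = robots.can_fetch("ExampleBot", "http://example.org/index.html")
blocked = robots.can_fetch("ExampleBot", "http://example.org/private/a.html")
print(allowed, blocked)
```

A crawler that honors the protocol would skip any URL for which can_fetch returns False, which is why parts of a site may be missing from a mirror.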
IIPC
The International Internet Preservation Consortium (IIPC) is an international
organization of libraries that coordinates efforts to preserve internet content
for the future. Membership is open to archives, museums, libraries, and cultural
heritage institutions.

Its membership includes
• Austrian National Library,
• Biblioteka Narodowa,
• Bibliothèque et Archives nationales du Québec,
• Bibliothèque nationale de France,
• British Library,
• California Digital Library,
• Clementinum,
• German National Library,
• Institut national de l'audiovisuel,
• Internet Archive,
• Koninklijke Bibliotheek, National Library of the Netherlands,
• Library and Archives Canada,
• National and University Library in Zagreb,
• National and University Library of Iceland,
• National and University Library of Slovenia,
• National Diet Library,
• National Library Board,
• National Library of Australia,
• National Library of Catalonia,
• National Library of China,
• National Library of Finland,
• National Library of Israel,
• National Library of Korea,
• National Library of New Zealand,
• National Library of Norway,
• National Library of Poland,
• National Library of Scotland,
• National Library of Sweden,
• Royal Netherlands Academy of Arts and Sciences,
• Swiss National Library,
• The National Archives,
• United States Government Printing Office,
• WebCite
Pandora Archive

• PANDORA, Australia's Web Archive, is the national web archive for the
  preservation of Australia's online publications. It was established by the
  National Library of Australia in 1996, and is now built in collaboration
  with a number of other Australian state libraries and cultural collecting
  organizations, including the Australian Institute of Aboriginal and Torres
  Strait Islander Studies, the Australian War Memorial, and the National
  Film and Sound Archive.

• The PANDORA Archive collects selected Australian web resources,
  preserves them, and makes them available for viewing. Access to the
  archive is made available to the public via the Pandora web site. Web sites
  are selected based on their cultural significance and research value in the
  long term.
Difficulties and Limitations

Crawlers

Web archives which rely on web crawling as their primary means of collecting
the Web are influenced by the difficulties of web crawling:
• The Web is so large that crawling a significant portion of it takes a
  large amount of technical resources.
• The Web is changing so fast that portions of a website may change before
  a crawler has even finished crawling it.

However, it is important to note that a native format web archive, i.e. a
fully browsable web archive with working links, media, etc., is only really
possible using crawler technology.
Difficulties and Limitations

General limitations

Not only must web archivists deal with the technical challenges of web
archiving, they must also contend with intellectual property laws. Peter
Lyman states that "although the Web is popularly regarded as a public
domain resource, it is copyrighted; thus, archivists have no legal right to
copy the Web". However, national libraries in many countries do have a
legal right to copy portions of the web under an extension of a legal
deposit.

Some private non-profit web archives that are made publicly accessible, like
WebCite or the Internet Archive, allow content owners to hide or remove
archived content that they do not want the public to have access to. Other
web archives are only accessible from certain locations or have regulated
usage.

More Related Content

What's hot

Best Practices for Descriptive Metadata
Best Practices for Descriptive MetadataBest Practices for Descriptive Metadata
Best Practices for Descriptive MetadataOCLC
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawlingDenis Shestakov
 
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreAndy Powell
 
Digital resources management_information_outreach_CSE
Digital resources management_information_outreach_CSEDigital resources management_information_outreach_CSE
Digital resources management_information_outreach_CSESrijan Technologies
 
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...Martin Klein
 
Exchange of usage metadata in a network of institutional repositories: the ...
Exchange of usage metadata in a network of institutional repositories: the ...Exchange of usage metadata in a network of institutional repositories: the ...
Exchange of usage metadata in a network of institutional repositories: the ...Benoit Pauwels
 
How Libraries Use Publisher Metadata Redux (Steven Shadle)
How Libraries Use Publisher Metadata Redux (Steven Shadle)How Libraries Use Publisher Metadata Redux (Steven Shadle)
How Libraries Use Publisher Metadata Redux (Steven Shadle)Charleston Conference
 
Handout for Metadata for your Digital Collections
Handout for Metadata for your Digital CollectionsHandout for Metadata for your Digital Collections
Handout for Metadata for your Digital CollectionsJenn Riley
 
UKSG webinar: Making Connections - Creating Linked Open Library Data with Nei...
UKSG webinar: Making Connections - Creating Linked Open Library Data with Nei...UKSG webinar: Making Connections - Creating Linked Open Library Data with Nei...
UKSG webinar: Making Connections - Creating Linked Open Library Data with Nei...UKSG: connecting the knowledge community
 
Building Linked Data Applications
Building Linked Data ApplicationsBuilding Linked Data Applications
Building Linked Data ApplicationsEUCLID project
 
Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)Hector Correa
 
-Open Archives Initiatives(final)
-Open Archives Initiatives(final)-Open Archives Initiatives(final)
-Open Archives Initiatives(final)floyd taag
 
Open for Business Open Archives, OpenURL, RSS and the Dublin Core
Open for Business  Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business  Open Archives, OpenURL, RSS and the Dublin Core
Open for Business Open Archives, OpenURL, RSS and the Dublin CoreAndy Powell
 
The network reshapes the research library collection
The network reshapes the research library collectionThe network reshapes the research library collection
The network reshapes the research library collectionlisld
 
UKSG Conference 2016 Breakout Session - Discovery and linking integrity – do ...
UKSG Conference 2016 Breakout Session - Discovery and linking integrity – do ...UKSG Conference 2016 Breakout Session - Discovery and linking integrity – do ...
UKSG Conference 2016 Breakout Session - Discovery and linking integrity – do ...UKSG: connecting the knowledge community
 

What's hot (17)

Best Practices for Descriptive Metadata
Best Practices for Descriptive MetadataBest Practices for Descriptive Metadata
Best Practices for Descriptive Metadata
 
Access to Content via Link Resolvers
Access to Content via Link ResolversAccess to Content via Link Resolvers
Access to Content via Link Resolvers
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawling
 
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business - Open Archives, OpenURL, RSS and the Dublin Core
Open for Business - Open Archives, OpenURL, RSS and the Dublin Core
 
Digital resources management_information_outreach_CSE
Digital resources management_information_outreach_CSEDigital resources management_information_outreach_CSE
Digital resources management_information_outreach_CSE
 
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
ResourceSync - Overview and Real-World Use Cases for Discovery, Harvesting, a...
 
Exchange of usage metadata in a network of institutional repositories: the ...
Exchange of usage metadata in a network of institutional repositories: the ...Exchange of usage metadata in a network of institutional repositories: the ...
Exchange of usage metadata in a network of institutional repositories: the ...
 
Browser
BrowserBrowser
Browser
 
How Libraries Use Publisher Metadata Redux (Steven Shadle)
How Libraries Use Publisher Metadata Redux (Steven Shadle)How Libraries Use Publisher Metadata Redux (Steven Shadle)
How Libraries Use Publisher Metadata Redux (Steven Shadle)
 
Handout for Metadata for your Digital Collections
Handout for Metadata for your Digital CollectionsHandout for Metadata for your Digital Collections
Handout for Metadata for your Digital Collections
 
UKSG webinar: Making Connections - Creating Linked Open Library Data with Nei...
UKSG webinar: Making Connections - Creating Linked Open Library Data with Nei...UKSG webinar: Making Connections - Creating Linked Open Library Data with Nei...
UKSG webinar: Making Connections - Creating Linked Open Library Data with Nei...
 
Building Linked Data Applications
Building Linked Data ApplicationsBuilding Linked Data Applications
Building Linked Data Applications
 
Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)
 
-Open Archives Initiatives(final)
-Open Archives Initiatives(final)-Open Archives Initiatives(final)
-Open Archives Initiatives(final)
 
Open for Business Open Archives, OpenURL, RSS and the Dublin Core
Open for Business  Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business  Open Archives, OpenURL, RSS and the Dublin Core
Open for Business Open Archives, OpenURL, RSS and the Dublin Core
 
The network reshapes the research library collection
The network reshapes the research library collectionThe network reshapes the research library collection
The network reshapes the research library collection
 
UKSG Conference 2016 Breakout Session - Discovery and linking integrity – do ...
UKSG Conference 2016 Breakout Session - Discovery and linking integrity – do ...UKSG Conference 2016 Breakout Session - Discovery and linking integrity – do ...
UKSG Conference 2016 Breakout Session - Discovery and linking integrity – do ...
 

Viewers also liked

PRESERVATION Web archiving
PRESERVATION  Web archivingPRESERVATION  Web archiving
PRESERVATION Web archivingEssam Obaid
 
publishing production
publishing productionpublishing production
publishing productionEssam Obaid
 
7 شخصيات يجب أن تحذفهم فورا من الفيسبوك
7 شخصيات يجب أن تحذفهم فورا من الفيسبوك7 شخصيات يجب أن تحذفهم فورا من الفيسبوك
7 شخصيات يجب أن تحذفهم فورا من الفيسبوكEssam Obaid
 
تقنيات 6 سيجما فى المؤسسات الاكاديمية والمعلوماتية
تقنيات 6 سيجما فى المؤسسات الاكاديمية والمعلوماتيةتقنيات 6 سيجما فى المؤسسات الاكاديمية والمعلوماتية
تقنيات 6 سيجما فى المؤسسات الاكاديمية والمعلوماتيةEssam Obaid
 
تفاعل ادارة السجلات والوثائق مع مواقع التواصل الاجتماعى
تفاعل ادارة السجلات والوثائق مع مواقع التواصل الاجتماعىتفاعل ادارة السجلات والوثائق مع مواقع التواصل الاجتماعى
تفاعل ادارة السجلات والوثائق مع مواقع التواصل الاجتماعىEssam Obaid
 
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...Essam Obaid
 
ادارة السجلات والارشفة الالكترونية - E archive
ادارة السجلات والارشفة الالكترونية - E archiveادارة السجلات والارشفة الالكترونية - E archive
ادارة السجلات والارشفة الالكترونية - E archiveEssam Obaid
 
ECM نظم إدارة المحتوى المؤسسى
 ECM نظم إدارة المحتوى المؤسسى ECM نظم إدارة المحتوى المؤسسى
ECM نظم إدارة المحتوى المؤسسىEssam Obaid
 
models of e publishing
models of e publishingmodels of e publishing
models of e publishingEssam Obaid
 
الاتجاهات البحثية فى إدارة المعرفة
الاتجاهات البحثية فى إدارة المعرفةالاتجاهات البحثية فى إدارة المعرفة
الاتجاهات البحثية فى إدارة المعرفةEssam Obaid
 
introduction to electronic publishing
 introduction to electronic publishing introduction to electronic publishing
introduction to electronic publishingEssam Obaid
 
E archive ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات
E archive  ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات   E archive  ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات
E archive ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات Essam Obaid
 
content analysis
content analysiscontent analysis
content analysisEssam Obaid
 
1356947482.9353caiibgbmmarketingmngtmodule d
1356947482.9353caiibgbmmarketingmngtmodule d1356947482.9353caiibgbmmarketingmngtmodule d
1356947482.9353caiibgbmmarketingmngtmodule dمحمد الجوري
 
Sustainability Assessment Report – Klean Kanteen
Sustainability Assessment Report – Klean Kanteen Sustainability Assessment Report – Klean Kanteen
Sustainability Assessment Report – Klean Kanteen Connie Kwan
 

Viewers also liked (20)

PRESERVATION Web archiving
PRESERVATION  Web archivingPRESERVATION  Web archiving
PRESERVATION Web archiving
 
publishing production
publishing productionpublishing production
publishing production
 
7 شخصيات يجب أن تحذفهم فورا من الفيسبوك
7 شخصيات يجب أن تحذفهم فورا من الفيسبوك7 شخصيات يجب أن تحذفهم فورا من الفيسبوك
7 شخصيات يجب أن تحذفهم فورا من الفيسبوك
 
تقنيات 6 سيجما فى المؤسسات الاكاديمية والمعلوماتية
تقنيات 6 سيجما فى المؤسسات الاكاديمية والمعلوماتيةتقنيات 6 سيجما فى المؤسسات الاكاديمية والمعلوماتية
تقنيات 6 سيجما فى المؤسسات الاكاديمية والمعلوماتية
 
تفاعل ادارة السجلات والوثائق مع مواقع التواصل الاجتماعى
تفاعل ادارة السجلات والوثائق مع مواقع التواصل الاجتماعىتفاعل ادارة السجلات والوثائق مع مواقع التواصل الاجتماعى
تفاعل ادارة السجلات والوثائق مع مواقع التواصل الاجتماعى
 
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...
Cloud computing دور الحوسبة السحابية فى المكتبات الرقمية ونظم الارشفة الالكتر...
 
ادارة السجلات والارشفة الالكترونية - E archive
ادارة السجلات والارشفة الالكترونية - E archiveادارة السجلات والارشفة الالكترونية - E archive
ادارة السجلات والارشفة الالكترونية - E archive
 
ECM نظم إدارة المحتوى المؤسسى
 ECM نظم إدارة المحتوى المؤسسى ECM نظم إدارة المحتوى المؤسسى
ECM نظم إدارة المحتوى المؤسسى
 
models of e publishing
models of e publishingmodels of e publishing
models of e publishing
 
الاتجاهات البحثية فى إدارة المعرفة
الاتجاهات البحثية فى إدارة المعرفةالاتجاهات البحثية فى إدارة المعرفة
الاتجاهات البحثية فى إدارة المعرفة
 
introduction to electronic publishing
 introduction to electronic publishing introduction to electronic publishing
introduction to electronic publishing
 
E archive ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات
E archive  ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات   E archive  ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات
E archive ادارة السجلات والارشفة الالكترونية - المفاهيم والمصطلحات
 
content analysis
content analysiscontent analysis
content analysis
 
1356947482.9353caiibgbmmarketingmngtmodule d
1356947482.9353caiibgbmmarketingmngtmodule d1356947482.9353caiibgbmmarketingmngtmodule d
1356947482.9353caiibgbmmarketingmngtmodule d
 
دورة صيانة الذات
دورة صيانة الذاتدورة صيانة الذات
دورة صيانة الذات
 
من سيربح المليون للنشر
من سيربح المليون   للنشرمن سيربح المليون   للنشر
من سيربح المليون للنشر
 
Brothers meetting للنشر
Brothers meetting   للنشرBrothers meetting   للنشر
Brothers meetting للنشر
 
خداع البصر
خداع البصرخداع البصر
خداع البصر
 
اكتشف الصورة
اكتشف الصورةاكتشف الصورة
اكتشف الصورة
 
Sustainability Assessment Report – Klean Kanteen
Sustainability Assessment Report – Klean Kanteen Sustainability Assessment Report – Klean Kanteen
Sustainability Assessment Report – Klean Kanteen
 

Similar to The development of web archiving 3

Archtecture of world wide web
Archtecture of world wide webArchtecture of world wide web
Archtecture of world wide webtechlovers3
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniquesTola Odugbesan
 
Internet and its applications
Internet and its applicationsInternet and its applications
Internet and its applicationsBurhan Ahmed
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Roxanne Missingham
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryBiblioteca Nacional de España
 
Can you save the web? Web Archiving!
Can you save the web? Web Archiving!Can you save the web? Web Archiving!
Can you save the web? Web Archiving!Vangelis Banos
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...Micah Altman
 
Presentation on Koha
Presentation on KohaPresentation on Koha
Presentation on KohaNur Ahammad
 
01-Lecture Web System & Technology Introduction.pptx
01-Lecture Web System & Technology  Introduction.pptx01-Lecture Web System & Technology  Introduction.pptx
01-Lecture Web System & Technology Introduction.pptxShoaibRajper1
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012lljohnston
 
Alt search engines_ocallaghan
Alt search engines_ocallaghanAlt search engines_ocallaghan
Alt search engines_ocallaghantamara1066
 
201399627 kovacs-collection-cyberspace
201399627 kovacs-collection-cyberspace201399627 kovacs-collection-cyberspace
201399627 kovacs-collection-cyberspacehomeworkping4
 
UNDERSTANDINGWWW - SEARCH ENGINE[Replica].pdf
UNDERSTANDINGWWW - SEARCH ENGINE[Replica].pdfUNDERSTANDINGWWW - SEARCH ENGINE[Replica].pdf
UNDERSTANDINGWWW - SEARCH ENGINE[Replica].pdfNarmadhaM13
 

The development of web archiving 3

  • 1. The Development of Web Archiving Dr. Essam Obaid
  • 2. Definition of Web Archiving “Web archiving is the process of collecting portions of the World Wide Web and ensuring the collection is preserved in an archive”, such as an archive site, for future researchers, historians, and the public. Due to the massive size of the Web, web archivists typically employ web crawlers for automated collection. The largest web archiving organization based on a crawling approach is the Internet Archive, which strives to maintain an archive of the entire Web. National libraries, national archives, and various consortia of organizations are also involved in archiving culturally important Web content. Commercial web archiving software and services are also available to organizations that need to archive their own web content for corporate heritage, regulatory, or legal purposes.
  • 3. Web Crawlers A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, and Web scutters. • This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. • Crawlers can also be used to gather specific types of information from Web pages, such as harvesting e-mail addresses. • A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
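The seed-and-frontier loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how any particular crawler is built: the `PAGES` dictionary is a hypothetical in-memory stand-in for the Web, `crawl` and `fetch_links` are illustrative names, and first-in, first-out ordering is assumed as the visiting policy.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
PAGES = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Visit URLs starting from the seeds, growing the frontier as links are found."""
    frontier = deque(seeds)   # URLs waiting to be visited (the crawl frontier)
    seen = set(seeds)         # URLs already queued, so none is visited twice
    visited = []              # URLs in the order they were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Assumed link extractor: look the page up in the in-memory dictionary.
order = crawl(["http://example.org/"], lambda url: PAGES.get(url, []))
print(order)
```

A real crawler would replace the dictionary lookup with an HTTP fetch and an HTML link extractor, and would choose the next URL according to its selection and re-visit policies rather than simple queue order.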
  • 4. WHAT IS A SEARCH ENGINE “Search engines, such as Google and HotBot, consist of a software package that crawls the Web and extracts and organizes the data in a database. People can then submit a search query using a Web browser. The search engine locates the appropriate data in the database and displays it via the browser.” Search engines have three major elements: • The spider, also called the crawler, harvester, robot, or gatherer. The spider visits a Web page, reads it, and then follows links to other pages within the site. The spider returns to the site on a regular basis, such as every month or two, to look for changes. • The index. Everything the spider finds goes into the index. The index is like a giant book containing a copy of every web page that the spider finds. If a web page changes, this book is updated with the new information. • Search engine software. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant. Search engine software is also available to run on a local Web site. The software has the same basic components, but the spider visits only the local site or a limited number of sites in a community.
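To make the roles of the index and the search software concrete, here is a small sketch using an inverted index (a word-to-pages mapping), which is one common way to build such an index and is an assumption here rather than something the slide specifies. The page texts, `build_index`, and `search` are hypothetical, and alphabetical ordering stands in for real relevance ranking.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages containing every query word, in alphabetical order."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical pages, as if a spider had already collected them.
pages = {
    "p1": "Web archiving preserves the Web",
    "p2": "Web crawlers collect pages",
}
index = build_index(pages)
print(search(index, "web archiving"))
```

When the spider revisits a changed page, the corresponding entries would be rebuilt, which is the "book is updated" step the slide mentions.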
  • 5. Web Crawler Behavior The behavior of a Web crawler is the outcome of a combination of policies: • A selection policy that states which pages to download, • A re-visit policy that states when to check for changes to the pages, • A politeness policy that states how to avoid overloading Web sites, and • A parallelization policy that states how to coordinate distributed Web crawlers.
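As one illustration of a politeness policy, the sketch below checks a site's robots exclusion rules before fetching, using Python's standard `urllib.robotparser` module. The robots.txt content, the `example-archiver` agent name, and the `allowed` helper are made up for the example; a real crawler would download the file from the site instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="example-archiver"):
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = parser.crawl_delay("example-archiver")

print(allowed("http://example.org/index.html"))
print(allowed("http://example.org/private/report.html"))
print(delay)
```

Honoring the crawl delay between requests to the same host is what keeps a crawler from overloading a site.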
  • 6. High Level Architecture of a Web Crawler Web crawlers are a central part of search engines, and details of their algorithms and architecture are kept as business secrets.
  • 8. Internet Archive “The Internet Archive is a non-profit digital library with the stated mission of "universal access to all knowledge." It offers permanent storage and access to collections of digitized materials, including websites, music, moving images, and books. The Internet Archive was founded by Brewster Kahle in 1996.” • With offices located in San Francisco, California, USA and data centers in San Francisco, Redwood City, and Mountain View, California, USA, the Archive's largest collection is its web archive, "snapshots of the World Wide Web." • The Archive allows the public to both upload and download digital material to its data cluster, and provides unrestricted online access to that material at no cost. The Archive also oversees one of the world's largest book digitization projects. It is a member of the American Library Association and is officially recognized by the State of California as a library.
  • 9. Brewster Kahle founded the Archive in 1996, at the same time that he began the for-profit web crawling company Alexa Internet. The Archive began archiving the World Wide Web in 1996, but it did not make this collection available until 2001, when it developed the Wayback Machine. The Internet Archive now includes texts, audio, moving images, and software. It hosts a number of other projects: the NASA Images Archive, the contract crawling service Archive-It, and the wiki-editable library catalog and book information site Open Library. According to its website: “Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive's mission is to help preserve those artifacts and create an Internet library for researchers, historians, and scholars.”
  • 10. Wayback Machine The "Wayback Machine" is the Internet Archive's service that allows archives of the World Wide Web to be searched and accessed. It lets users see archived versions of web pages from the past. Millions of websites and their associated data (images, source code, documents, etc.) are saved in a gigantic database. The service can be used to see what previous versions of websites looked like, to retrieve original source code from websites that may no longer be directly available, or to visit websites that no longer exist. Not all websites are available, however, because many website owners choose to exclude their sites.
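Archived pages in the Wayback Machine are addressed by a capture timestamp followed by the original URL. The helper below sketches how such an address can be composed; the `https://web.archive.org/web/<timestamp>/<url>` pattern is taken from the service's publicly visible page addresses and is an assumption here, not something stated on the slide, and `wayback_url` is a hypothetical name.

```python
from datetime import datetime


def wayback_url(original_url, when):
    """Compose a Wayback Machine address for a page as of a given moment."""
    # Captures are identified by a 14-digit YYYYMMDDhhmmss timestamp.
    stamp = when.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{original_url}"


print(wayback_url("http://example.org/", datetime(2006, 6, 22, 19, 1, 10)))
```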
  • 11. Web Archiving Techniques The most common web archiving technique uses web crawlers to automate the process of collecting web pages. Web crawlers typically view web pages in the same manner that users with a browser see the Web, and therefore provide a comparatively simple method of remotely harvesting web content. Examples of web crawlers frequently used for web archiving include: • Automated Internet Sessions in biterScripting • Heritrix • HTTrack • Wget
  • 12. Heritrix • Heritrix is the Internet Archive’s web crawler, which was specially designed for web archiving. It is open-source and written in Java. The main interface is accessible using a web browser, and there is a command-line tool that can optionally be used to initiate crawls. • Heritrix was developed jointly by the Internet Archive and the Nordic national libraries on specifications written in early 2003. The first official release was in January 2004, and it has been continually improved by employees of the Internet Archive and other interested parties.
  • 13. Organizations using Heritrix A number of organizations and national libraries are using Heritrix, among them: - Bibliothèque nationale de France - British Library - National Library of Finland - National Library of New Zealand
  • 14. Bibliothèque Nationale de France The Bibliothèque nationale de France (BnF) is the National Library of France, located in Paris. It is intended to be the repository of all that is published in France. The current president of the library is Bruno Racine.
  • 15. British Library The British Library is the national library of the United Kingdom, and one of the world's largest libraries in terms of total number of items. The library is a major research library, holding over 150 million items from every country in the world, in virtually all known languages and in many formats, both print and digital: books, manuscripts, journals, newspapers, magazines, sound and music recordings, videos, play-scripts, patents, databases, maps, stamps, prints, drawings. The Library's collections include around 14 million books.
  • 16. ARC FILE • By default, Heritrix stores the web resources it crawls in an ARC file. This format has been used by the Internet Archive since 1996 to store its web archives. The WARC file format, similar to ARC but more precisely specified and more flexible, can also be used. Heritrix can also be configured to store files in a directory format similar to that of the Wget crawler, which uses the URL to name the directory and filename of each resource. • An ARC file stores multiple archived resources in a single file in order to avoid managing a large number of small files. The file consists of a sequence of URL records, each with a header containing metadata about how the resource was requested, followed by the HTTP header and the response. Example:
      filedesc://IA-2006062.arc 0.0.0.0 20060622190110 text/plain 76
      1 1 InternetArchive
      URL IP-address Archive-date Content-type Archive-length

      http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187
      HTTP/1.1 200 OK
      Date: Thu, 22 Jun 2006 19:01:15 GMT
      Server: Apache
      Last-Modified: Sat, 10 Jun 2006 22:33:11 GMT
      Content-Length: 30
      Content-Type: text/html

      <html> Hello World!!! </html>
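The header line of each URL record follows the field list given in the example (URL, IP-address, Archive-date, Content-type, Archive-length). A minimal sketch of reading one such line in Python follows; `ArcHeader` and `parse_arc_header` are hypothetical names, and the sketch assumes the five space-separated fields shown on the slide, not the full ARC specification.

```python
from collections import namedtuple
from datetime import datetime

# Field names mirror the header description line in the slide's example.
ArcHeader = namedtuple("ArcHeader", "url ip archive_date content_type length")


def parse_arc_header(line):
    """Parse one space-separated ARC URL-record header line."""
    url, ip, date, content_type, length = line.split()
    return ArcHeader(
        url=url,
        ip=ip,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),  # YYYYMMDDhhmmss
        content_type=content_type,
        length=int(length),  # number of bytes in the record body that follows
    )


header = parse_arc_header(
    "http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187"
)
print(header.url, header.archive_date.isoformat(), header.length)
```

A reader would use the length field to know how many bytes to consume before the next record header begins.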
  • 17. Screenshot of Heritrix Admin Console Stable release: 3.0.0 (December 5, 2009; 14 months ago); Written in: Java; Operating system: Linux/Unix-like/Windows (unsupported); Type: Web crawler; License: GNU Lesser General Public License; Website: http://crawler.archive.
  • 18. Database Archive Database archiving refers to methods for archiving the underlying content of database-driven websites. It typically requires the extraction of the database content into a standard schema, often using XML. Once stored in that standard format, the archived content of multiple databases can then be made available using a single access system. Transactional Archiving Transactional archiving is an event-driven approach, which collects the actual transactions which take place between a web server and a web browser. It is primarily used as a means of preserving evidence of the content which was actually viewed on a particular website, on a given date. This may be particularly important for organizations which need to comply with legal or regulatory requirements for disclosing and retaining information. A transactional archiving system typically operates by intercepting every HTTP request to, and response from, the web server, filtering each response to eliminate duplicate content, and permanently storing the responses as bit streams.
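The transactional approach (record each response, filter duplicate content, keep the bytes unchanged) can be sketched as follows. This is an illustrative assumption about one way to do it, using a SHA-256 digest to detect duplicates; the `TransactionalArchive` class and its fields are hypothetical, and interception of live HTTP traffic is left out.

```python
import hashlib


class TransactionalArchive:
    """Store each distinct response body once and log every transaction."""

    def __init__(self):
        self.blobs = {}  # digest -> response body, stored as an unmodified bit stream
        self.log = []    # (timestamp, url, digest) for every response served

    def record(self, timestamp, url, body):
        """Log a served response; return True if its content was not stored before."""
        digest = hashlib.sha256(body).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = body
        self.log.append((timestamp, url, digest))
        return is_new


archive = TransactionalArchive()
first = archive.record("2006-06-22T19:01:15Z", "http://foo.edu/hello.html", b"<html>v1</html>")
repeat = archive.record("2006-06-22T19:05:00Z", "http://foo.edu/hello.html", b"<html>v1</html>")
changed = archive.record("2006-06-23T08:00:00Z", "http://foo.edu/hello.html", b"<html>v2</html>")
print(first, repeat, changed, len(archive.blobs), len(archive.log))
```

Keeping the log separate from the stored bodies preserves evidence of what was served on each date even when the content itself is stored only once.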
  • 19. HTTrack • HTTrack is a free and open source Web crawler and offline browser, licensed under the GNU General Public License. • It allows one to download World Wide Web sites from the Internet to a local computer. By default, HTTrack arranges the downloaded site by the original site's relative link structure. The downloaded (or "mirrored") website can be browsed by opening a page of the site in a browser. • HTTrack uses a Web crawler to download a website. By default, some parts of the website may not be downloaded because of the robots exclusion protocol, unless that behavior is disabled in the program. HTTrack can follow links that are generated with basic JavaScript and links inside Applets or Flash, but not complex links.
  • 21. IIPC The International Internet Preservation Consortium (IIPC) is an international organization of libraries that coordinates efforts to preserve internet content for the future. Membership is open to archives, museums, libraries, and cultural heritage institutions. Its membership includes the Austrian National Library, Biblioteka Narodowa, Bibliothèque et Archives nationales du Québec, Bibliothèque nationale de France, British Library, California Digital Library, Clementinum, German National Library, Institut national de l'audiovisuel, Internet Archive, Koninklijke Bibliotheek, National Library of the Netherlands, Library and Archives Canada, National and University Library in Zagreb, National and University Library of Iceland, National and University Library of Slovenia, National Diet Library, National Library Board, National Library of Australia, National Library of Catalonia, National Library of China, National Library of Finland, National Library of Israel, National Library of Korea, National Library of New Zealand, National Library of Norway, National Library of Poland, National Library of Scotland, National Library of Sweden, Royal Netherlands Academy of Arts and Sciences, Swiss National Library, The National Archives, United States Government Printing Office, and WebCite.
  • 22. Pandora Archive • PANDORA - Australia's Web Archive is the national web archive for the preservation of Australia's online publications. It was established by the National Library of Australia in 1996, and is now built in collaboration with a number of other Australian state libraries and cultural collecting organizations, including the Australian Institute of Aboriginal and Torres Strait Islander Studies, the Australian War Memorial, and the National Film and Sound Archive. • The PANDORA Archive collects selected Australian web resources, preserves them, and makes them available for viewing. Access to the archive is made available to the public via the Pandora web site. Web sites are selected based on their cultural significance and long-term research value.
  • 23. Difficulties and Limitations Crawlers Web archives that rely on web crawling as their primary means of collecting the Web are affected by the difficulties of web crawling: • The Web is so large that crawling a significant portion of it takes a large amount of technical resources. • The Web is changing so fast that portions of a website may change before a crawler has even finished crawling it. However, it is important to note that a native-format web archive, i.e. a fully browsable web archive with working links, media, etc., is only really possible using crawler technology.
  • 24. Difficulties and Limitations General limitations Not only must web archivists deal with the technical challenges of web archiving, they must also contend with intellectual property laws. Peter Lyman states that "although the Web is popularly regarded as a public domain resource, it is copyrighted; thus, archivists have no legal right to copy the Web". However, national libraries in many countries do have a legal right to copy portions of the web under an extension of legal deposit. Some private non-profit web archives that are made publicly accessible, like WebCite or the Internet Archive, allow content owners to hide or remove archived content that they do not want the public to have access to. Other web archives are only accessible from certain locations or have regulated usage.