Slides from Humanities on the Web: Is it working?
Date: Thursday, 19 March 2009, 10-4
Location: Oxford University, Oxford, UK
Webcast URL: http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20090319_275
Slide URL: http://www.slideshare.net/etmeyer/WWWoH

Afternoon Event:
1:30 – 2:45: JISC/NEH Transatlantic Digitisation Collaboration Programme in
conjunction with the Internet Archive: The World Wide Web of Humanities
         OII: Selecting and analysing the sample WWI and WWII collections
                  (Christine Madsen & Dr. Eric Meyer)
         The Internet Archive: Extracting the data (Molly Bragg)
         Hanzo Archives Ltd.: Working with the data (Mark Middleton)
         Discussion and questions

Full details: http://www.oii.ox.ac.uk/events/details.cfm?id=238
Selecting and Analysing
the WWI and WWII
collections

Christine Madsen
Eric Meyer
19 March 2009
Why WWI and WWII?




Many branches of the humanities

• History
• Journalism
• Art
• Art history
• Advertising
• Literature
• Poetry
• Political science
• Military history
Why WWI and WWII?

Well-rounded set of materials
Why WWI and WWII?




• Changes over time
• Differences between WWI and WWII

Language · Doc types · Secondary domains · Top-level domains
Building the Collection



Supplemented with keyword searches in the Archive

Harvested from the Internet Archive

Selected from the live web
Building the Collection




Seeds are: the website or portion of the website that you plan to include in your collection

[Diagram: Initial Collection containing Seed 1, Seed 2 and Seed 3]
Building the Collection



A seed is also a web site from which additional sites can be discovered via the hyperlinks of the site

[Diagram: Expanded Collection, with Seeds 1 to 6 each surrounded by linked www sites]
Building the Collection



Started with WWI




   Too small (under 1,000,000 pages/objects)
           Target was 250 million
Building the Collection



Expanded to WWII




  Final collection: 5,362,425 unique URLs
Building the Collection




‘World War One’
‘World War I’
‘First world war’
‘the great war’
‘Première Guerre Mondiale’
‘World War II’
‘World War Two’
‘zweiter Weltkrieg’
Building the Collection




• Returning to ‘hub’ sites for further analysis
• Record links from first 20 pages of search
• [include dead links]
• Following links
Building the Collection




  Expanding scope



http://www.greatwar.co.uk/westfront/Somme/index.htm




               http://www.greatwar.co.uk
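The scope expansion shown above (from a single deep page to the whole site, or to its containing directory) can be sketched in a few lines of Python. This is only an illustration: the helper names `expand_to_site_root` and `expand_to_directory` are hypothetical and are not part of any tool used in the project.

```python
from urllib.parse import urlsplit


def expand_to_site_root(url: str) -> str:
    """Return scheme://host for a deep URL (hypothetical helper)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def expand_to_directory(url: str) -> str:
    """Return the URL of the directory that contains the page (hypothetical helper)."""
    parts = urlsplit(url)
    # Drop the last path segment (the file name) and keep the trailing slash.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


print(expand_to_site_root("http://www.greatwar.co.uk/westfront/Somme/index.htm"))
print(expand_to_directory("http://www.greatwar.co.uk/westfront/Somme/index.htm"))
```

Whether to widen to the site root or only to a directory remains a curatorial decision; the code just makes the two options explicit.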
Building the Collection




   Expanding scope



memory.loc.gov/ammem/collections/maps/wwii/index.html




      www.memory.loc.gov/ammem/collections/maps/wwii/
Building the Collection




 Dealing with illogical or flat directory structures

    www.eyewitnesstohistory.com/ <= don’t want whole site



www.eyewitnesstohistory.com/blitzkrieg.htm
www.eyewitnesstohistory.com/dday.html
www.eyewitnesstohistory.com/midway.htm
www.eyewitnesstohistory.com/airbattle.htm
www.eyewitnesstohistory.com/dunkirk.htm
www.eyewitnesstohistory.com/francesurrenders.htm
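For sites with flat directory structures like the one above, a seed list has to mix two kinds of rule: whole sites or directories (prefix match) and individually selected pages (exact match). A minimal sketch, assuming a hypothetical `in_scope` function and seed lists chosen here only for illustration:

```python
# Seeds that cover everything below a prefix (whole site or directory).
PREFIX_SEEDS = ["www.greatwar.co.uk/"]

# Seeds that are single pages, because the rest of the site is not wanted.
PAGE_SEEDS = {
    "www.eyewitnesstohistory.com/blitzkrieg.htm",
    "www.eyewitnesstohistory.com/dday.html",
    "www.eyewitnesstohistory.com/midway.htm",
}


def in_scope(url: str, prefix_seeds=PREFIX_SEEDS, page_seeds=PAGE_SEEDS) -> bool:
    """Return True if the URL is covered by a page seed or a prefix seed."""
    bare = url.split("://", 1)[-1]  # ignore the scheme, if any
    if bare in page_seeds:
        return True
    return any(bare.startswith(prefix) for prefix in prefix_seeds)


print(in_scope("http://www.eyewitnesstohistory.com/dday.html"))
print(in_scope("http://www.eyewitnesstohistory.com/"))
print(in_scope("http://www.greatwar.co.uk/westfront/Somme/index.htm"))
```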
Building the Collection




• Stop when most results are redundant
• Narrow in on more specific topics


Broad terms: WWI, WWII, ‘zweiter Weltkrieg’, ‘Great war’
More specific topics: Churchill, Hitler, ‘Battle of the Bulge’, Guadalcanal, Allies, Home front
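The stopping rule "stop when most results are redundant" can be made concrete by measuring what share of a new result list is already in the collection. The sketch below is an assumption about how one might automate it; the slides do not say the rule was applied mechanically, and the 0.8 threshold is an arbitrary illustrative value.

```python
def redundancy(results, collected):
    """Share of search results that are already in the collection."""
    if not results:
        return 1.0  # nothing new was found
    already_seen = sum(1 for url in results if url in collected)
    return already_seen / len(results)


def should_stop(results, collected, threshold=0.8):
    """Stop searching a term once most results are redundant (threshold is assumed)."""
    return redundancy(results, collected) >= threshold


collected = {"a.example/1", "a.example/2", "b.example/1", "c.example/1"}
broad_term = ["a.example/1", "a.example/2", "b.example/1", "c.example/1", "d.example/1"]
narrow_term = ["a.example/1", "x.example/1", "y.example/1"]

print(should_stop(broad_term, collected))   # mostly redundant: move on
print(should_stop(narrow_term, collected))  # still finding new material
```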
Building the Collection



• Materials in foreign languages
   – Focused on German sites
   – Consider local conventions, not just translations
      WWII (zweiter Weltkrieg)
      the period of National Socialism (Zeit des Nationalsozialismus)
      the period in which the Nazis ruled (Nazizeit)
• Other foreign languages were included, but not sought after


   Belarusian; Catalan/Valencian; Chamorro;
   Czech; Danish; German; Dzongkha; English;
   Spanish/Castilian; Finnish; French; Hebrew;
   Hungarian; Italian; Japanese; Luba-Katanga;
   Dutch/Flemish; Polish; Portuguese; Russian;
   Slovenian; Turkish; Ukrainian; Chinese
Building the Collection




Difficult to find and include: museums, libraries, archives

Some improvement through targeted searches
   NYPL (2,100 photographs)
   Harvard Libraries (1,000 WWI pamphlets)

Directory structures still limiting
   http://pds.lib.harvard.edu/pds/view/7845178 (first page of a multipage object)
The World Wide Web of Humanities
     “Extracting The Data”

       St Anne's College, Oxford
            March 19, 2009

               Molly Bragg, Partner Specialist
               Web Group
               The Internet Archive
Agenda

 Brief Introduction to IA’s Web Archives

 Discipline Specific Data Extraction from
 Longitudinal Web Archives: The
 WWWoH Case Study

 Recommendations for Future Research
 and Tools Development Efforts
Brief Introduction to
IA’s Web Archives
The Internet Archive is…
          A digital library of ~4 petabytes of information



     Web Pages
     Educational Courseware
     Films & Videos
     Music & Spoken Word
     Books & Texts
     Software
     Images
The Archive’s combined collections receive
     over 6 million downloads a day!
           www.archive.org
IA Web Archives

1.6+ petabytes of primary data (compressed)

 150+ billion URIs, culled from 85+ million
  sites, harvested from 1996 to the present
 Includes captures from every domain
 Encompasses content in over 40 languages
 As of 2009, IA will add ½ petabyte to 1 petabyte of
  data to these collections each year.
Discipline Specific Data Extraction
 from Longitudinal Web Archives:

    The WWWoH Case Study
WWWoH Case Study




http://neh-access.archive.org/neh/
WWWoH Case Study

 Unique URLs in the collection: 5,362,425


 Total number of captures: 23,006,857

 Captures span: May 1996 to Aug 2008

 Total size of compressed data: ~250 GB
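A quick back-of-the-envelope calculation shows what these totals imply on average. The figures are the ones quoted above; the size is approximate ("~250 GB", taken here as decimal gigabytes, which is an assumption), so the per-capture size is only indicative.

```python
unique_urls = 5_362_425
captures = 23_006_857
compressed_bytes = 250 * 10**9  # "~250 GB", treated as decimal GB (assumption)

# Inclusive month count from May 1996 to August 2008.
months = (2008 - 1996) * 12 + (8 - 5) + 1

captures_per_url = captures / unique_urls
captures_per_month = captures / months
bytes_per_capture = compressed_bytes / captures

print(f"captures per unique URL: {captures_per_url:.2f}")
print(f"months covered: {months}")
print(f"captures per month: {captures_per_month:,.0f}")
print(f"compressed KB per capture: {bytes_per_capture / 1000:.1f}")
```

On average each URL was captured a little over four times across the twelve-year span, which limits how fine-grained any per-URL change-over-time analysis can be.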
The Data Extraction
                  Process
 Oxford Internet Institute selected relevant
 sites/URLs
 Identified all captures related to the seeds
 Identified all files embedded in each capture
 (on & off seed domains) for extraction
 Attempted to locate additional candidate
 seed URLs/domains for inclusion in the
 collection using outbound link data
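The step "identified all files embedded in each capture (on & off seed domains)" can be illustrated with Python's standard HTML parser. This is a simplified sketch, not the Internet Archive's extraction code: the class and function names are hypothetical, the sample HTML is made up, and only a few embedding tags are handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class EmbedFinder(HTMLParser):
    """Collect absolute URLs of resources embedded in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.embedded = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script", "embed") and attrs.get("src"):
            self.embedded.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.embedded.append(urljoin(self.base_url, attrs["href"]))


def embedded_files(html, base_url):
    """Split embedded resources into on-seed-host and off-seed-host lists."""
    finder = EmbedFinder(base_url)
    finder.feed(html)
    seed_host = urlsplit(base_url).netloc
    on_host = [u for u in finder.embedded if urlsplit(u).netloc == seed_host]
    off_host = [u for u in finder.embedded if urlsplit(u).netloc != seed_host]
    return on_host, off_host


# Made-up page content for illustration only.
sample = '<img src="/img/map.gif"><script src="http://stats.example.org/c.js"></script>'
on_host, off_host = embedded_files(sample, "http://www.greatwar.co.uk/index.htm")
print(on_host)
print(off_host)
```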
The Data Extraction Process

 Relevant URLs not identified as seeds
  were not extracted.
   Automatically harvesting ALL outbound links
   can capture relevant non-seed URLs; however, it
   can also introduce a large amount of
   extraneous content into the collection
   Manually curating outbound links excludes
   non-relevant content; however, it can be an
   overwhelming task due to the volume of links
WWWoH Case Study: WWI



 Number of Seeds: 2263


 Unique Hosts: 906

 Number of Links: 143+ million
[Image slides: WWWoH Case Study: WWI; WWI: Example]
WWWoH Case Study:
      WWII
 Number of Seeds: 2592


 Unique Hosts: 1475

 Number of Links: 252+ million
[Image slides: WWWoH Case Study: WWII; WWII: Example]
Challenges


Identifying subject matter-specific
resources of interest for an extraction and
then automating those procedures.

      Tools are missing from the workflow that
     might make the initial scoping of an extraction
     easier to define and revise
      Available tools for collection building and
     access are too technically focused for the
     average humanities scholar
Recommendations for
Future Research and Tools
   Development Efforts
Implications for Future
       Research

 Need link and web graphing tools
 that use inbound and outbound link
 data to identify further resources of
 interest
 Need to experiment with a more
 diverse range of UI navigational
 paradigms that address the
 dimension of time and curatorial input
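One simple form of the link-graph tooling called for above is to rank non-seed hosts by how many distinct seed hosts link to them, so that curators review the most-cited candidates first instead of every outbound link. The sketch below is an illustration under that assumption; the function name and the example hosts are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def rank_candidates(links, seed_hosts):
    """Rank non-seed hosts by the number of distinct seed hosts linking to them.

    `links` is an iterable of (source_url, target_url) pairs.
    """
    linking_seeds = defaultdict(set)
    for source, target in links:
        source_host = urlsplit(source).netloc
        target_host = urlsplit(target).netloc
        if source_host in seed_hosts and target_host and target_host not in seed_hosts:
            linking_seeds[target_host].add(source_host)
    ranked = [(host, len(seeds)) for host, seeds in linking_seeds.items()]
    # Most-linked first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


seed_hosts = {"www.greatwar.co.uk", "www.eyewitnesstohistory.com"}
links = [
    ("http://www.greatwar.co.uk/", "http://candidate-a.example/page"),
    ("http://www.eyewitnesstohistory.com/dday.html", "http://candidate-a.example/"),
    ("http://www.greatwar.co.uk/", "http://candidate-b.example/"),
    ("http://www.greatwar.co.uk/", "http://www.eyewitnesstohistory.com/dday.html"),
]
print(rank_candidates(links, seed_hosts))
```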
Ideas/Concepts to Explore:
Nomination Tools
Opportunities


 Extractions make it easier for humanities
scholars to locate and assemble source
materials of interest.
 These collections can accelerate and/or
augment discipline specific research efforts
 Extractions can encourage distributed
collaboration and cooperation between entities
who might not otherwise be aware of one
another
Thank You!


http://neh-access.archive.org/neh/

   Molly Bragg, Partner Specialist
  The Internet Archive, Web Group
          mbragg@archive.org
Search and Analysis of
Data in WWWoH

Mark Middleton
      www.hanzoarchives.com   ◀ Copyright © 2009 Hanzo Archives Limited.
Agenda
  Brief introduction to Hanzo
  Open Source Search-Tools: a toolkit for implementing analytical
  applications using web archives
  WWWoH — working with the data
  Recommendations for future research
  Recommendations for future tools development
  WWWoH Tools Deliverables




Introduction to Hanzo




Hanzo Archives Limited
  Web Archiving Services
    Company websites and intranets
    Litigation support
    E-Discovery
    IP protection

  Focus on legally defensible web archives of exceptional quality
    Very advanced crawlers and access tools: dynamic html, video, flash, web 2.0
    Some public archives
    Mainly closed archives




Hanzo Archiving Technology
  Need advanced capabilities very quickly — continuous product innovation
    Rapid development of tools
    Create research and open source projects to promote mainstream awareness
    of web archives and web archiving technology

  Open source projects include
    WARC Tools
    Search Tools




WWWoH and Development of
 Open Source Search-Tools




Objectives
  Deliver an open source search engine for web archives that is simple to
  extend, easy to install and deploy
  Integrate with WARC Tools, the open source web archive file manipulation
  tools (Hanzo and IIPC)
  Extend the search engine with interesting directives and options
  Extend the search engine to provide data to analytical tools, develop an
  API, tools, and exemplar analytical tools
  Encourage third party analytical tools to use web archives as their data
  repository
  Migrate WWWoH extraction from ARC to WARC and ingest into Search
  Tools



Full Text Search
  Implemented FT search on top of WARC Tools — the toolkit for
  manipulating ISO-28500 WARC files
  Reviewed several options: Java Lucene (and clones), Xapian, DB indexing
  (Sphinx, OpenFTS), etc.
  Criteria: vibrant development community, extensible (searching web
  archives is different: temporal dimension, duplicate handling, etc.), fast and
  full-featured (boolean, time queries, ability to index multiple fields, query
  language)




Component Architecture
  Full text search engine based
  on Open Source Ferret
  Knowledge Base stores search
  results
  Python application with Django
  model and Django WUI
  Memcache
  Plug-in architecture to support
  multiple analytical applications




Ferret
  Ferret is FAST, both indexing and searching
  Highly scalable, up to 100m documents on a single CPU
  Supports distributed search
  Phrase search, proximity ranking, stemming in several languages, stopwords, multiple document fields
  Ferret Query Language
  [Chart: documents/s. Source: http://ferret.davebalmain.com/trac/wiki/FerretVsLucene]




Advanced Search
  url: (+bbc +wwii) -- search for URLs containing both ‘bbc’ and ‘wwii’
  date: [2001 2002] -- search within date range
  tag: wwwoh -- search content with the tag ‘wwwoh’
  title: (+wilfred +owen) -- search for Wilfred and Owen within the title
  domain: fr -- restrict search to within .fr domain
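To show what the directives above mean, here is a small Python filter that applies the same conditions to in-memory records. It does not use Ferret or the Search Tools API; the `Capture` record, the `matches` function and the sample data are all hypothetical, and the date range is treated as inclusive by assumption.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class Capture:
    url: str
    year: int
    title: str
    tags: set = field(default_factory=set)


def matches(capture, url_terms=(), years=None, tag=None, title_terms=(), domain=None):
    """Apply url:, date:, tag:, title: and domain: style conditions to one record."""
    url = capture.url.lower()
    title = capture.title.lower()
    if not all(term in url for term in url_terms):          # url: (+a +b)
        return False
    if years is not None and not (years[0] <= capture.year <= years[1]):  # date: [a b]
        return False
    if tag is not None and tag not in capture.tags:         # tag: x
        return False
    if not all(term in title for term in title_terms):      # title: (+a +b)
        return False
    if domain is not None:                                   # domain: fr
        host = urlsplit(capture.url).hostname or ""
        if not host.endswith("." + domain):
            return False
    return True


# Made-up record for illustration only.
record = Capture("http://www.bbc.co.uk/history/wwii/", 2001, "WWII history", {"wwwoh"})
print(matches(record, url_terms=("bbc", "wwii"), years=(2001, 2002), tag="wwwoh"))
print(matches(record, domain="fr"))
```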




Working with the Data




Migrating ARC to WARC
  Data extracted from IA in ARC files
  Hanzo WARC Tools and Search Tools projects combined enabled us to
  migrate ARC to WARC files (WARC is the new ISO standard):
    Some challenges: broken ARCs, scale, etc.
    3,264 WARC files
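The core of an ARC-to-WARC migration is rewriting each record header. The sketch below converts one ARC version 1 record line (URL, IP address, 14-digit date, content type, length) into a WARC/1.0 `response` record. It is a simplified illustration and not the WARC Tools implementation: the function name is hypothetical, the sample record is made up, and the ARC content type is simply dropped.

```python
import uuid
from datetime import datetime


def arc_record_to_warc(arc_line: str, payload: bytes) -> bytes:
    """Build a WARC/1.0 response record from an ARC v1 header line and its payload."""
    url, ip, arc_date, _mime, length = arc_line.strip().split(" ")
    if int(length) != len(payload):
        # One symptom of a "broken ARC": declared and actual lengths disagree.
        raise ValueError("ARC length field does not match payload size")
    when = datetime.strptime(arc_date, "%Y%m%d%H%M%S")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-IP-Address: {ip}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(payload)}",
    ]
    head = ("\r\n".join(headers) + "\r\n\r\n").encode("utf-8")
    # A WARC record ends with two CRLF pairs after the content block.
    return head + payload + b"\r\n\r\n"


# Made-up capture for illustration only.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
arc_line = f"http://www.greatwar.co.uk/ 192.0.2.1 20010315120000 text/html {len(payload)}"
warc_record = arc_record_to_warc(arc_line, payload)
print(warc_record.decode("utf-8"))
```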




Programmable Access to Data
  WARC Tools and Search Tools provide a rich collection of programmable
  tools to enable analytics tools developers to use web archives:
    Object-oriented C, REST API, fast iterators
    Command lines for manipulating WARCs, indexing, searching
    Web applications for browsing, searching, demonstrator analytics
    C/C++, Python, Ruby, Perl, … and if you need to, Java, C#

  Demonstration: the web applications




http://wwwoh.hanzoarchives.com/




Analytical Tools
  Frequency Tables for:
    Domains, MIME Types, Countries
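Frequency tables of this kind are straightforward to compute once captures are available as (URL, MIME type) pairs. The sketch below is an illustration, not the Search Tools code; it uses the top-level domain as a rough stand-in for country, which is an assumption (generic domains such as .com carry no country information), and the sample captures are made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def frequency_tables(captures):
    """Count captures by host, by MIME type, and by top-level domain."""
    domains, mime_types, tlds = Counter(), Counter(), Counter()
    for url, mime in captures:
        host = urlsplit(url).hostname or ""
        domains[host] += 1
        mime_types[mime] += 1
        tlds[host.rsplit(".", 1)[-1]] += 1  # TLD as a rough country proxy
    return domains, mime_types, tlds


# Made-up captures for illustration only.
captures = [
    ("http://www.greatwar.co.uk/", "text/html"),
    ("http://www.greatwar.co.uk/img/a.gif", "image/gif"),
    ("http://memory.loc.gov/ammem/", "text/html"),
]
domains, mime_types, tlds = frequency_tables(captures)
print(domains.most_common())
print(mime_types.most_common())
print(tlds.most_common())
```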

  Graphing Tools:
    GUESS -- an exploratory data analysis and visualization tool for graphs and
    networks
    Graphviz -- makes diagrams in several formats: images and SVG for web
    pages, Postscript; or display in an interactive graph browser
    Hypergraph -- provides visualisation of hyperbolic geometry, to handle graphs
    and to layout hyperbolic trees
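Link data can be handed to Graphviz by writing it out in the DOT language. The following sketch shows the idea with a hypothetical `to_dot` function and made-up host-to-host links; it assumes node names contain no double quotes and is not the integration used in the prototype.

```python
def to_dot(edges, name="links"):
    """Serialise (source, target) pairs as a directed graph in DOT syntax."""
    lines = [f"digraph {name} {{"]
    for source, target in sorted(set(edges)):  # de-duplicate, stable order
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Made-up links for illustration only.
edges = [
    ("www.greatwar.co.uk", "candidate-a.example"),
    ("www.eyewitnesstohistory.com", "candidate-a.example"),
    ("www.greatwar.co.uk", "candidate-a.example"),
]
dot_text = to_dot(edges)
print(dot_text)
```

Saved to a file such as links.dot, the output can be rendered with the Graphviz command line, for example `dot -Tsvg links.dot -o links.svg`.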




Graphing Tools




Recommendations for Future
Research and Tools Development




Future Research
  Faster, richer analytics
  Rich API for analytics, to be developed in collaboration with IA, other
  archives, and IIPC
  Temporal analytics and techniques
  Link and network graphing and analytics
  Enhance outreach/dissemination to the mainstream development
  community and research community




Future Tools Development
  Multi-machine indexing and application engine
  Tighter integration of graphing tools, with more user parameters and
  configurations
  Temporal analysis (animation of link graphs over time)
  Enhance WARC Tools integration and investigate interoperability with other
  IIPC toolsets
  Developer documentation
  Analyst/researcher documentation
  Installation tools for Linux, Mac OS X and Windows XP/Vista




Deliverables at End March 2009




Deliverables
  The Search Tools project home is http://code.google.com/p/search-tools/
    Source code
    Documentation
    Issue management
    Mailing list

  The WARC Tools project home is http://code.google.com/p/warc-tools/
  The prototype application is http://wwwoh.hanzoarchives.com/




Thank You
       Hanzo Archives Limited
       +44 20 8816 8226
       www.hanzoarchives.com





 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 

WWWoH

  • 1. Slides from Humanities on the Web: Is it working? Date: Thursday, 19 March 2009, 10-4 Location: Oxford University, Oxford, UK Webcast URL: http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20090319_275 Slide URL: http://www.slideshare.net/etmeyer/WWWoH Afternoon Event: 1:30 – 2:45: JISC/NEH Transatlantic Digitisation Collaboration Programme in conjunction with the Internet Archive: The World Wide Web of Humanities OII: Selecting and analysing the sample WWI and WWII collections (Christine Madsen & Dr. Eric Meyer) The Internet Archive: Extracting the data (Molly Bragg) Hanzo Archives Ltd.: Working with the data (Mark Middleton) Discussion and questions Full details: http://www.oii.ox.ac.uk/events/details.cfm?id=238
  • 2. Selecting and Analysing the WWI and WWII collections Christine Madsen Eric Meyer 19 March 2009
  • 3. Why WWI and WWII? Many branches of the humanities: History, Journalism, Art, Art history, Advertising, Literature, Poetry, Political science, Military history
  • 4. Why WWI and WWII? Well-rounded set of materials
  • 5. Why WWI and WWII? • Changes over time • Differences between WWI and WWII [Diagram: Language, Doc types, Secondary domains, Top-level domains]
  • 6. Building the Collection Supplemented with keyword searches in the Archive Harvested from the Internet Archive Selected from the live web
  • 7. Building the Collection Seeds are: the website or portion of the website that you plan to include in your collection [Diagram: Seeds 1, 2 and 3 make up the Initial Collection]
  • 8. Building the Collection A seed is also a web site from which additional sites can be discovered via the hyperlinks of the site [Diagram: Expanded Collection, with Seeds 1 to 6 each linked to surrounding www sites]
  • 9. Building the Collection Started with WWI Too small (under 1,000,000 pages/objects) Target was 250 million
  • 10. Building the Collection Expanded to WWII Final collection: 5,362,425 unique URLs
  • 11. Building the Collection ‘World War One’ ‘the great war’ ‘World War I’ ‘Première Guerre ‘First world war’ Mondiale’ ‘World War II’ ‘zweiter Weltkrieg’ ‘World War Two’
  • 12. Building the Collection Record links from first 20 pages of search [include dead links]; Following links; Returning to ‘hub’ sites for further analysis
  • 13. Building the Collection Expanding scope http://www.greatwar.co.uk/westfront/Somme/index.htm http://www.greatwar.co.uk
  • 14. Building the Collection Expanding scope memory.loc.gov/ammem/collections/maps/wwii/index.html www.memory.loc.gov/ammem/collections/maps/wwii/
  • 15. Building the Collection Dealing with illogical or flat directory structures www.eyewitnesstohistory.com/ <= don’t want whole site www.eyewitnesstohistory.com/blitzkrieg.htm www.eyewitnesstohistory.com/dday.html www.eyewitnesstohistory.com/midway.htm www.eyewitnesstohistory.com/airbattle.htm www.eyewitnesstohistory.com/dunkirk.htm www.eyewitnesstohistory.com/francesurrenders.htm
  • 16. Building the Collection • Stop when most results are redundant • Narrow in on more specific topics: Churchill, Hitler, ‘zweiter Weltkrieg’, ‘Battle of the Bulge’, ‘Great war’, Guadalcanal, WWI, Allies, WWII, Home front
  • 17. Building the Collection • Materials in Foreign language – Focused on German sites – Consider local conventions, not just translations WWII (zweiter Weltkrieg) the period of National Socialism (Zeit des Nationalsozialismus) the period in which the Nazis ruled (Nazizeit)
  • 18. • Other foreign languages were included, but not sought after Belarusian; Catalan/Valencian; Chamorro; Czech; Danish; German; Dzongkha; English; Spanish/Castilian; Finnish; French; Hebrew; Hungarian; Italian; Japanese; Luba-Katanga; Dutch/Flemish; Polish; Portuguese; Russian; Slovenian; Turkish; Ukrainian; Chinese
  • 19. Building the Collection Difficult to find and include: Museums, libraries, archives Some improvement through targeted searches NYPL (2,100 photographs) Harvard Libraries (1,000 WWI Pamphlets) Directory Structures still limiting http://pds.lib.harvard.edu/pds/view/7845178 (first page of a multipage object)
  • 20. The World Wide Web of Humanities “Extracting The Data” St Anne's College, Oxford March 19, 2009 Molly Bragg, Partner Specialist Web Group The Internet Archive
  • 21. Agenda • Brief Introduction to IA’s Web Archives • Discipline Specific Data Extraction from Longitudinal Web Archives: The WWWoH Case Study • Recommendations for Future Research and Tools Development Efforts
  • 23. The Internet Archive is… A digital library of ~4 petabytes of information • Web Pages • Educational Courseware • Films & Videos • Music & Spoken Word • Books & Texts • Software • Images The Archive’s combined collections receive over 6 mil downloads a day! www.archive.org
  • 24. IA Web Archives 1.6+ petabytes of primary data (compressed) • 150+ billion URIs, culled from 85+ million sites, harvested from 1996 to the present • Includes captures from every domain • Encompasses content in over 40 languages • As of 2009, IA will add ½ petabyte to 1 petabyte of data to these collections each year.
  • 25. Discipline Specific Data Extraction from Longitudinal Web Archives: The WWWoH Case Study
  • 27. WWWoH Case Study • Unique URLs in the collection: 5,362,425 • Total number of captures: 23,006,857 • Captures span: May 1996 to Aug 2008 • Total size of compressed data: ~250 GB
  • 28. The Data Extraction Process • Oxford Internet Institute selected relevant sites/URLs • Identified all captures related to the seeds • Identified all files embedded in each capture (on & off seed domains) for extraction • Attempted to locate additional candidate seed URLs/domains for inclusion in the collection using outbound link data
  • 29. The Data Extraction Process • Relevant URLs not identified as seeds were not extracted. • Automatically harvesting ALL outbound links can capture relevant non-seed URLs; however, it can also introduce a large amount of extraneous content into the collection • Manually curating outbound links excludes non-relevant content; however, it can be an overwhelming task due to the volume of links
  • 30. WWWoH Case Study: WWI • Number of Seeds: 2263 • Unique Hosts: 906 • Number of Links: 143+ mil
  • 37. WWWoH Case Study: WWII • Number of Seeds: 2592 • Unique Hosts: 1475 • Number of Links: 252+ mil
  • 43. Challenges Identifying subject matter-specific resources of interest for an extraction and then automating those procedures. • Tools are missing from the workflow that might make the initial scoping of an extraction easier to define and revise • Available tools for collection building and access are too technically focused for the average humanities scholar
  • 44. Recommendations for Future Research and Tools Development Efforts
  • 45. Implications for Future Research • Need link and web graphing tools that use inbound and outbound link data to identify further resources of interest • Need to experiment with a more diverse range of UI navigational paradigms that address the dimension of time and curatorial input
  • 48. Opportunities • Extractions make it easier for humanities scholars to locate and assemble source materials of interest. • These collections can accelerate and/or augment discipline specific research efforts • Extractions can encourage distributed collaboration and cooperation between entities who might not otherwise be aware of one another
  • 49. Thank You! http://neh-access.archive.org/neh/ Molly Bragg, Partner Specialist The Internet Archive, Web Group mbragg@archive.org
  • 50. Search and Analysis of Data in WWWoH Mark Middleton www.hanzoarchives.com ◀ Copyright © 2009 Hanzo Archives Limited.
  • 51. Agenda Brief introduction to Hanzo Open Source Search-Tools: a toolkit for implementing analytical applications using web archives WWWoH: working with the data Recommendations for future research Recommendations for future tools development WWWoH Tools Deliverables
  • 52. Introduction to Hanzo
  • 53. Hanzo Archives Limited Web Archiving Services Company websites and intranets Litigation support E-Discovery IP protection Focus on legally defensible web archives of exceptional quality Very advanced crawlers and access tools: dynamic html, video, flash, web 2.0 Some public archives Mainly closed archives
  • 54. Hanzo Archiving Technology Need advanced capabilities very quickly: continuous product innovation Rapid development of tools Create research and open source projects to promote mainstream awareness of web archives and web archiving technology Open source projects include WARC Tools Search Tools
  • 55. WWWoH and Development of Open Source Search-Tools
  • 56. Objectives Deliver an open source search engine for web archives that is simple to extend, easy to install and deploy Integrate with WARC Tools, the open source web archive file manipulation tools (Hanzo and IIPC) Extend the search engine with interesting directives and options Extend the search engine to provide data to analytical tools, develop an API, tools, and exemplar analytical tools Encourage third party analytical tools to use web archives as their data repository Migrate WWWoH extraction from ARC to WARC and ingest into Search Tools
  • 57. Full Text Search Implemented FT search on top of WARC Tools, the toolkit for manipulating ISO-28500 WARC files Reviewed several options: Java Lucene (and clones), Xapian, DB indexing (Sphinx, OpenFTS), etc. Criteria: vibrant development community, extensible (searching web archives is different: temporal dimension, duplicate handling, etc.), fast and full-featured (boolean, time queries, ability to index multiple fields, query language)
  • 58. Component Architecture Full text search engine based on Open Source Ferret Knowledge Base stores search results Python application with Django model and Django WUI Memcache Plug-in architecture to support multiple analytical applications
  • 59. Ferret Ferret is FAST, both indexing and searching Highly scalable, up to 100m documents on a single CPU Documents/s Supports distributed search Phrase search, proximity ranking, stemming in several languages, stopwords, multiple document fields Ferret Query Language http://ferret.davebalmain.com/trac/wiki/FerretVsLucene
  • 60. Advanced Search url: (+bbc +wwii) -- search for URLs containing both ‘bbc’ and ‘wwii’; date: [2001 2002] -- search within date range; tag: wwwoh -- search content with the tag ‘wwwoh’; title: (+wilfred +owen) -- search for Wilfred and Owen within the title; domain: fr -- restrict search to within .fr domain
  • 61. Working with the Data
  • 62. Migrating ARC to WARC Data extracted from IA in ARC files Hanzo WARC Tools and Search Tools projects combined enabled us to migrate ARC to WARC files (WARC is the new ISO standard): Some challenges: broken ARCs, scale, etc. 3,264 WARC files
  • 63. Programmable Access to Data WARC Tools and Search Tools provide a rich collection of programmable tools to enable analytics tools developers to use web archives: Object-oriented C, REST API, fast iterators Command lines for manipulating WARCs, indexing, searching Web applications for browsing, searching, demonstrator analytics C/C++, Python, Ruby, Perl, … and if you need to, Java, C# Demonstration: the web applications
  • 64. http://wwwoh.hanzoarchives.com/
  • 68. Analytical Tools Frequency Tables for: Domains, MIME Types, Countries Graphing Tools: GUESS -- an exploratory data analysis and visualization tool for graphs and networks Graphviz -- makes diagrams in several formats: images and SVG for web pages, Postscript; or display in an interactive graph browser Hypergraph -- provides visualisation of hyperbolic geometry, to handle graphs and to layout hyperbolic trees
  • 70. Graphing Tools
  • 71. Recommendations for Future Research and Tools Development
  • 72. Future Research Faster, richer analytics Rich API for analytics, to be developed in collaboration with IA, other archives, and IIPC Temporal analytics and techniques Link and network graphing and analytics Enhance outreach/dissemination to the mainstream development community and research community
  • 73. Future Tools Development Multi-machine indexing and application engine Tighter integration of graphing tools, with more user parameters and configurations Temporal analysis (animation of link graphs over time) Enhance WARC Tools integration and investigate interoperability with other IIPC toolsets Developer documentation Analyst/researcher documentation Installation tools for Linux, Mac OS X and Windows XP/Vista
  • 74. Deliverables at End March 2009
  • 75. Deliverables The Search Tools project home is http://code.google.com/p/search-tools/ Source code Documentation Issue management Mailing list The WARC Tools project home is http://code.google.com/p/warc-tools/ The prototype application is http://wwwoh.hanzoarchives.com/
  • 76. Thank You Hanzo Archives Limited +44 20 8816 8226 www.hanzoarchives.com
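
The frequency tables for domains mentioned on slide 68 can be approximated in a few lines of Python. This is only a minimal sketch, not the Search Tools implementation: the helper name domain_frequencies is hypothetical, and the sample URLs are simply taken from the slides above.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_frequencies(urls):
    """Count hosts and top-level domains in an iterable of URLs.

    Hypothetical helper for illustration; not part of Search Tools.
    """
    hosts = Counter()
    tlds = Counter()
    for url in urls:
        # urlparse only fills in the hostname when '//' precedes it,
        # so bare entries such as 'memory.loc.gov/...' get a prefix.
        if "//" not in url:
            url = "//" + url
        host = urlparse(url).hostname
        if not host:
            continue
        hosts[host] += 1
        # The top-level domain is the last dot-separated label.
        tlds[host.rsplit(".", 1)[-1]] += 1
    return hosts, tlds


sample = [
    "http://www.greatwar.co.uk/westfront/Somme/index.htm",
    "http://www.greatwar.co.uk",
    "memory.loc.gov/ammem/collections/maps/wwii/index.html",
    "www.eyewitnesstohistory.com/dday.html",
]
hosts, tlds = domain_frequencies(sample)
print(hosts.most_common())
print(tlds.most_common())
```

Counting the last host label is a deliberate simplification: it gives top-level domains only, so secondary domains such as co.uk would need a separate rule.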

Editor's notes

  1. Aside from being relevant for transatlantic cooperation, because of the involvement of so many countries, the materials available on the World Wars represent a well-rounded set of humanities materials that will allow us to test the tools against a variety of types of documents and resources. World War collections on the web include materials that fall under the topics of history, journalism, art, art history, advertising, literature, poetry, political science, military history and others.
  2. The types of materials that have been digitized also cover a range of challenges that will allow robust testing of our approach, including multiple formats (text, images of documents, photos, audio), multiple languages (English, German, etc.), and many document types.
  3. All of this started with identifying a set of seed sites. A seed site is a web site from which additional sites can be discovered via the hyperlinks of the site, through in-links to and out-links from the seed site.
  4. Early on in the seed selection process, though, we realized that this selection policy would not result in anything close to the original target of 100-250 million pages, as the first few passes through the collections yielded barely 1 million pages. In the end, our collections are smaller than the total possible limits identified by those responsible for the technological implementation. This was the first lesson for the whole team: even though the data deluge (Hey & Trefethen, 2003) is often identified as a key challenge for researchers across fields, focused collections in the humanities are still relatively unlikely to encompass hundreds of millions of objects.
  6. The seeds were identified in a process that began with topic-based web searches. The searches began with the most general topics, ‘World War I,’ ‘World War II’, being sure to include all variations in language, spelling, and phrasing, such as ‘World War One’ and ‘First World War.’ This was followed by searching regional localizations of the phrases and topics, such as ‘the Great War,’ ‘Première Guerre Mondiale,’ and ‘zweiter Weltkrieg.’
  7. For each search, the first twenty pages of the search results were captured by following links from the search results page and copying and pasting the URLs into a spreadsheet. Sites with lists of links to other relevant sites were bookmarked and returned to at a later time for exploration and capture. As the goal was to gather a collection of archived web sites, links to sites that no longer exist were also recorded. These dead links, which appear to be useless on the live web, represent one advantage to this collection method: if the Internet Archive includes archived versions of these pages, they can still be included in the collection. This represents an improvement over the native interface to the Internet Archive’s Wayback Machine, which requires users to type in a URL and then select from various snapshots of those pages collected over time.
  8. The next step was to generalize the URLs in order to maximize the number of pages in the collection. For each URL copied, references to specific pages were removed and the URL truncated to the root site or most logical directory. For example, it was logical to conclude that the entirety of http://www.greatwar.co.uk was on topic, so all references to specific pages, such as http://www.greatwar.co.uk/westfront/Somme/index.htm, were removed and replaced with http://www.greatwar.co.uk. (Duplicate sites were removed automatically.) Many collections of materials (in particular those from universities, archives, and libraries) were not resident on unique domains. In these cases, the URL could only be truncated as far back as the directory containing the relevant materials. For example, http://memory.loc.gov/ammem/collections/maps/wwii/index.html was truncated to http://memory.loc.gov/ammem/collections/maps/wwii/.
  10. Illogical directory structures were often encountered and were a clear barrier to increasing the number of sites collected. EyeWitnesstoHistory.com contains first-person accounts of historical events, including almost fifty pages dedicated to the First and Second World Wars. Each page file sits in the root directory, though, and so needed to be provided individually. The entire site (http://www.eyewitnesstohistory.com/) could not be included because only a fraction of it falls within the scope of the collection; therefore, individual pages (http://www.eyewitnesstohistory.com/blitzkrieg.htm, ../dday.html, etc.) had to be recorded.
  11. Although this process may seem to result in an almost infinite number of sites, it became clear that after gathering several hundred seeds, most of the resulting sites identified were redundant. At that point, more precise search terms were selected and the process re-initiated. Narrower topic searches were commonly either biographical (Hitler, Churchill, etc.), event-based (Battle of Midway, Guadalcanal campaign, surrender of Japan), or based on subjects that, while technically broader in scope, are commonly associated with one of the two wars (Holocaust, Allies, home front).
  12. Because of the time-consuming nature of the collection-building process, a decision was made to focus the foreign-language part of the collection on German sites, with the idea that it would be more useful to have one language with a deep collection than many with shallow ones. (Sites identified in other languages were included, but not sought after.) Native German speakers were consulted and helped design a search strategy to maximize the number of resulting German sites. This strategy took into account the local convention of speaking not only of World War II (zweiter Weltkrieg), for example, but more commonly of the period in which the Nazis ruled (Nazizeit) or the period of National Socialism (Zeit des Nationalsozialismus). This approach illustrated the need for localization, not just translation, when building a collection of sites in other languages.
  13. As the topics for collection development were narrowed, the collection of seed sites continued to grow, but there were several content areas that remained difficult to include. A majority of the material from museums, libraries, and archives was not findable using the subject searches mentioned above. Most of this material was identified using targeted searches of domains likely to contain relevant content. Many of these institutions use local databases to deliver content that is not publicly indexed by common search engines. The New York Public Library has an extensive digital collection of photographs, over 2,100 of which are relevant to one of the world wars. These materials can only be located by first going to NYPL’s site. Similarly, Harvard University has a collection of almost one thousand digitized pamphlets from World War I. They can only be found by searching in the library’s union catalogue. In each of these cases, knowing that the materials exist, or might exist, is a prerequisite for being able to find them. But even when located, materials in databases remained problematic. There is usually no directory structure that can capture a number of items at once, nor are the URLs generated by database searches commonly stable. URLs to the Harvard materials, for example (http://pds.lib.harvard.edu/pds/view/7845178), only provide access to the first page of the multi-page objects. While NYPL does provide stable URLs for the objects in its database, these need to be identified within each bibliographic record in order to be added to the seed list.
  16. A 501(c)(3) non-profit located in The Presidio, San Francisco, California. Started in 1996 to build an ‘Internet library’ of archived Web pages; expanded in 1999 to include all media, texts, etc. Focus: harvest, storage, management & access to digital content; contribution and use of open source Web archiving software tools and services; access to digital assets in the public domain. Collections: Web (150+ billion objects, ~1.6 petabytes of data compressed); Moving Images (Prelinger, public domain films); Still Images (NASA); Texts (Project Gutenberg, public domain texts, Children’s Digital Library); Audio (LMA, Grateful Dead, public domain audio clips, …); Educational Courseware; Other Collections: Software & Television (subsidiary).
  17. 100’s of thousands of online journals and blogs; millions of digitized texts; 100’s of millions of web sites; 100’s of billions of unique web pages; 100’s of file/MIME types. But too many files to count… A single snapshot of the visible Web now exceeds a petabyte of data…
  19. Nuts and bolts of the data extraction process.
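
The URL generalization described in the notes on expanding scope and on flat directory structures can be sketched in Python. The notes describe a manual, spreadsheet-based process, so this is only an illustration under assumptions: the function name generalize_seed and its mode values ("site", "directory", "page") are hypothetical, and the choice of mode for each URL remains the curator's judgment.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def generalize_seed(url, mode="directory"):
    """Truncate a URL to a broader seed (hypothetical helper).

    mode="site":      keep only the scheme and host (whole site is on topic).
    mode="directory": drop a trailing file name, keep the enclosing directory.
    mode="page":      keep the page itself (flat sites that are mostly off topic).
    """
    parts = urlsplit(url)
    if mode == "site":
        path = ""
    elif mode == "directory":
        head, tail = posixpath.split(parts.path)
        # Treat a last segment containing a dot as a file name and drop it.
        path = head.rstrip("/") + "/" if "." in tail else parts.path
    else:
        path = parts.path
    # Query strings and fragments are discarded in every mode.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


candidates = [
    ("http://www.greatwar.co.uk/westfront/Somme/index.htm", "site"),
    ("http://www.greatwar.co.uk", "site"),
    ("http://memory.loc.gov/ammem/collections/maps/wwii/index.html", "directory"),
    ("http://www.eyewitnesstohistory.com/dday.html", "page"),
]
# dict.fromkeys removes duplicate seeds while preserving their order.
seeds = list(dict.fromkeys(generalize_seed(u, m) for u, m in candidates))
for seed in seeds:
    print(seed)
```

The two greatwar.co.uk entries collapse into a single seed, which mirrors the automatic removal of duplicate sites mentioned in the notes.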