SlideShare une entreprise Scribd logo
1  sur  34
Preserving Public Government
Information: The End of Term
Web Archive

            Abbie Grotke, Library of Congress
        Tracy Seneca, California Digital Library



        END OF TERM ARCHIVE – FDLC Oct. 17, 2011
Outline
   Background
   Nomination of URLs
   Data transfer
   Preparing for Access
   Demo of public interface
   2012: what you can do
   Our questions for you / your questions for
    us!
Themes
   Ad-hoc nature of this project

   Wellspring of web archiving tools

   Testbed for emerging tools
Collaborating Institutions

 Library of Congress
 Internet Archive

 California Digital Library

 University of North Texas

 US Government Printing Office
Why Archive .gov? Why Collaborate?

    Fit with partner missions to collect and
     preserve at-risk (born-digital)
     government information

    Potential for High Research Use/Interest
     in Archives

    It Takes a Village

    Experienced Partners
Project Goals
   Work collaboratively to preserve public
    U.S. Government Web sites at the end of
    the current presidential administration
    ending January 19, 2009.

   Document federal agencies‟ presence on
    the Web during the transition of
    Presidential administrations.

   To enhance the existing research
    collections of the five partner institutions.
URL Nomination Tool
 Facilitates
  collaboration
 Ingest seed lists

from different sources
 Record known
  metadata
     Branch
     Title
     Comment
     Who nominated

 Create seed lists
for crawls
Volunteer Nominators
   Call for volunteers targeted:
     Government information specialists
     Librarians

     Political and social science researchers

     Academics

     Web archivists


   31 individuals signed up to help
Nominator To-Dos
   Nominate the
    most critical URLs
    for capture as "in
    scope"
   Add new URLs not
    already included
    in the list
   Mark irrelevant or
    obsolete sites as
    "out of scope"
   Add minimal URL
    metadata such as
    site title, agency,
    etc.
Nomination Tool
In Scope vs. Out of Scope
   In scope: Federal government Web sites (.gov,
    .mil, etc.) in the Legislative, Executive, or Judicial
    branches of government. Of particular interest
    for prioritization were sites likely to change
    dramatically or disappear during the transition of
    government

   Out of scope: Local or state government Web
    sites, or any other site not part of the above
    federal government domain

   Not captured: intranets, deep web content
Prioritized URLs
   ~500 URLs nominated by volunteers
Selected Researcher/Curator Interests

    Homeland Security
    Department of Labor
    Department of Treasury
    Education/“No Child Left Behind”
    Health Care Reform
    Stem Cell Research
    Bush Administration Budget Justifications
    Federal Program Assessments
     (ExpectMore.gov)
Nomination Tool Lessons Learned

   No coordination of selection or
    “assignments” so likely gaps in collection

   Start with a blank slate rather than pre-
    populate the database?

   Admin tools/Reporting features need more
    work

   Engage more experts to help identify at-
    risk content
Further Lessons Learned
   CDL: hard to respond to nominations to
    shape crawler settings & focus – defaulted
    to getting as much as we could.

   When used for Deepwater Horizon, we
    found it added an extra task for curators –
    used “delicious” instead.

   The metadata we did get (branch,
    description) was really valuable after the
    fact!
Questions for you:
   What do YOU need from the nomination
    tool?

   Is it nomination or curation?
Partner Harvesting Roles
   Internet Archive – Broad, comprehensive harvests

   Library of Congress – In-depth Legislative branch crawls

   University of North Texas – Sites/Agencies that meet current
    UNT interests, e.g. environmental policy, and collections, as
    well as several “deep web” sites

   California Digital Library – Multiple crawls of all seeds in EOT
    database; sites of interest to their curators

   Government Printing Office – Support and analysis of “official
    documents” found in collection
Crawl Schedule




   Two Approaches:
     Broad,  comprehensive crawls
     Prioritized, selective crawls


   Key dates:
     ElectionDay, November 4
     Inauguration Day, January 20
Data Transfer
   Goal: Distribute 15.9 TB of collected
    content among partners

   LC‟s central transfer server used:
       “Pulled” and “pushed” data from and to partners
        via Internet2, May 2009 – Mid 2010

       Common transfer tools, specifications were key

More info here: http://blogs.loc.gov/digitalpreservation/2011/07/the-
    end-of-term-was-only-the-beginning/
Transfer Tools: Bagger
Preparing for Access

     1st Tuesday of each month, 12:00 pm:


Anything                              No,
to report                             nothing to
on                                    report on
public                                public
access?                               access.
Internet Archive also had:
   MODS record extraction tool

   A full copy of the content from all EOT
    partners

   A QA “Playback” tool (takes screen images
    of archived materials)

   An export of the Nomination Tool
    metadata from UNT
CDL had:
CDL and IA revise records to simpler Dublin Core
Caveats
   As with any web archive, the crawler is
    good, but not always perfect!

   Full-text index of 16 TB of data
     Some behaviors designed to help rank and
     navigate such a large body of content
Demo of Beta Interface
Access gateway lessons

   You could easily use this to integrate:
     materials
              from multiple web archives, no
     matter where.

     anydigital content, whether harvested or
     scanned.
Forthcoming
   Internet Archive tools for visualizing web
    archived data (“explore data”)
Tag cloud extracted from metadata
More questions to you:
   How might you use this archive?
     Rediscovering,
                   providing continued access to
     „lost‟ documents?

     Researching,
                 visualizing trends in
     government data?
What you can do
   Provide feedback on the Beta site
   Help nominate URLs for 2012:
       Any nominations welcome, any amount of time you
        can contribute
       Need particular help with:
         Judicial Branch websites
         Important content or subdomains on very large
          websites (such as NASA.gov) that might be related to
          current Presidential policies
         Government content on non-government domains
          (.com, .edu, etc.)
Timeframe for 2012 project
2012
 January 2012: Project team will begin accepting early nominations of
  priority websites via the Nomination Tool.
 Summer 2012: Recruitment of curators/nominators to help identify
  additional websites for prioritized crawling.
 July/August 2012: Bookend (baseline) crawl of government web
  domains begins.
 Summer/Fall 2012: Partners will crawl various aspects of
  government domains at varying frequencies, depending on selection
  polices/interests. Team will determine strategy for crawling
  prioritized websites.
 November - February 2012-13: Crawl of prioritized websites.


2013
 January 2013: Depending on the outcome of the election, focused
  crawls will be conducted as needed during this period.
 Spring or Summer 2013: Bookend crawl, plus additional crawl of
  prioritized websites as determined by team.
Questions?
eotproject@loc.gov

Follow us on twitter! @eotarchive

Contenu connexe

Tendances

BIZ 2401 and the Library
BIZ 2401 and the LibraryBIZ 2401 and the Library
BIZ 2401 and the Library
Traciwm
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniques
Tola Odugbesan
 
Data.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked DataData.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked Data
Matthew Rowe
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Bradley Allen
 

Tendances (20)

BIZ 2401 and the Library
BIZ 2401 and the LibraryBIZ 2401 and the Library
BIZ 2401 and the Library
 
Week 2 computers, web and the internet
Week 2 computers, web and the internetWeek 2 computers, web and the internet
Week 2 computers, web and the internet
 
Policies and Guidelines for Federal Public Websites: ICGI Report Attachments ...
Policies and Guidelines for Federal Public Websites: ICGI Report Attachments ...Policies and Guidelines for Federal Public Websites: ICGI Report Attachments ...
Policies and Guidelines for Federal Public Websites: ICGI Report Attachments ...
 
Lifting the Lid on Linked Data
Lifting the Lid on Linked DataLifting the Lid on Linked Data
Lifting the Lid on Linked Data
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
Lawless-3-jun15
Lawless-3-jun15Lawless-3-jun15
Lawless-3-jun15
 
Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...
Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...
Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizin...
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniques
 
Data.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked DataData.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked Data
 
LIBRIS - Linked Library Data
LIBRIS - Linked Library DataLIBRIS - Linked Library Data
LIBRIS - Linked Library Data
 
Searching the web general
Searching the web generalSearching the web general
Searching the web general
 
Gateway to Oklahoma History Case Study: Structured Data and Metadata Evaluati...
Gateway to Oklahoma History Case Study: Structured Data and Metadata Evaluati...Gateway to Oklahoma History Case Study: Structured Data and Metadata Evaluati...
Gateway to Oklahoma History Case Study: Structured Data and Metadata Evaluati...
 
NISO Virtual Conference: Web-Scale Discovery Services: Transforming Access to...
NISO Virtual Conference: Web-Scale Discovery Services: Transforming Access to...NISO Virtual Conference: Web-Scale Discovery Services: Transforming Access to...
NISO Virtual Conference: Web-Scale Discovery Services: Transforming Access to...
 
Intelligent web crawling
Intelligent web crawlingIntelligent web crawling
Intelligent web crawling
 
Linked Data and Tools
Linked Data and ToolsLinked Data and Tools
Linked Data and Tools
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
 
Hansen-2-jun15
Hansen-2-jun15Hansen-2-jun15
Hansen-2-jun15
 
A Semantic Data Model for Web Applications
A Semantic Data Model for Web ApplicationsA Semantic Data Model for Web Applications
A Semantic Data Model for Web Applications
 
Linked Data in Libraries
Linked Data in LibrariesLinked Data in Libraries
Linked Data in Libraries
 
Webscale Discovery with the Enduser in Mind
Webscale Discovery with the Enduser in Mind Webscale Discovery with the Enduser in Mind
Webscale Discovery with the Enduser in Mind
 

Similaire à Preserving Public Government Information: The End of Term Web Archive

Intranet 2.0 School: Building the essential staff intranet for your library
Intranet 2.0 School: Building the essential staff intranet for your libraryIntranet 2.0 School: Building the essential staff intranet for your library
Intranet 2.0 School: Building the essential staff intranet for your library
Chris Evjy
 
Using Drupal In Libraries
Using Drupal In LibrariesUsing Drupal In Libraries
Using Drupal In Libraries
msfinney
 
datafountainssurvey.doc
datafountainssurvey.docdatafountainssurvey.doc
datafountainssurvey.doc
butest
 
2008-04-24 Enhancing Research Projects with Environmental Informatics and Web...
2008-04-24 Enhancing Research Projects with Environmental Informatics and Web...2008-04-24 Enhancing Research Projects with Environmental Informatics and Web...
2008-04-24 Enhancing Research Projects with Environmental Informatics and Web...
Rudolf Husar
 

Similaire à Preserving Public Government Information: The End of Term Web Archive (20)

Brand Niemann Tutorial12242009
Brand Niemann Tutorial12242009Brand Niemann Tutorial12242009
Brand Niemann Tutorial12242009
 
Put Your Desktop in the Cloud In Support of the Open Government Directive and...
Put Your Desktop in the Cloud In Support of the Open Government Directive and...Put Your Desktop in the Cloud In Support of the Open Government Directive and...
Put Your Desktop in the Cloud In Support of the Open Government Directive and...
 
Put Your Desktop in the Cloud In Support of the Open Government Directive and...
Put Your Desktop in the Cloud In Support of the Open Government Directive and...Put Your Desktop in the Cloud In Support of the Open Government Directive and...
Put Your Desktop in the Cloud In Support of the Open Government Directive and...
 
Crossing Institutional Boundaries to Create Permanent Public Access for Gover...
Crossing Institutional Boundaries to Create Permanent Public Access for Gover...Crossing Institutional Boundaries to Create Permanent Public Access for Gover...
Crossing Institutional Boundaries to Create Permanent Public Access for Gover...
 
GODORT SLDTF 2009 Meeting Outline
GODORT SLDTF 2009 Meeting OutlineGODORT SLDTF 2009 Meeting Outline
GODORT SLDTF 2009 Meeting Outline
 
Web content mining
Web content miningWeb content mining
Web content mining
 
Linked Open Data in Romania
Linked Open Data in RomaniaLinked Open Data in Romania
Linked Open Data in Romania
 
Exploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sourcesExploration, visualization and querying of linked open data sources
Exploration, visualization and querying of linked open data sources
 
Falling Over Free Resources
Falling Over Free ResourcesFalling Over Free Resources
Falling Over Free Resources
 
Warwickshire open data project
Warwickshire open data projectWarwickshire open data project
Warwickshire open data project
 
Week 2 computers, web and the internet
Week 2 computers, web and the internetWeek 2 computers, web and the internet
Week 2 computers, web and the internet
 
Intranet 2.0 School: Building the essential staff intranet for your library
Intranet 2.0 School: Building the essential staff intranet for your libraryIntranet 2.0 School: Building the essential staff intranet for your library
Intranet 2.0 School: Building the essential staff intranet for your library
 
Using Drupal In Libraries
Using Drupal In LibrariesUsing Drupal In Libraries
Using Drupal In Libraries
 
Establishing the Connection: Creating a Linked Data Version of the BNB
Establishing the Connection: Creating a Linked Data Version of the BNBEstablishing the Connection: Creating a Linked Data Version of the BNB
Establishing the Connection: Creating a Linked Data Version of the BNB
 
datafountainssurvey.doc
datafountainssurvey.docdatafountainssurvey.doc
datafountainssurvey.doc
 
EPA OEI Linked Data Process
EPA OEI Linked Data ProcessEPA OEI Linked Data Process
EPA OEI Linked Data Process
 
Online Presentation
Online PresentationOnline Presentation
Online Presentation
 
2008-04-24 Enhancing Research Projects with Environmental Informatics and Web...
2008-04-24 Enhancing Research Projects with Environmental Informatics and Web...2008-04-24 Enhancing Research Projects with Environmental Informatics and Web...
2008-04-24 Enhancing Research Projects with Environmental Informatics and Web...
 
Google analytics
Google analyticsGoogle analytics
Google analytics
 
Bringing a data mindset to your reporting - Brant Houston - Illinois NewsTrai...
Bringing a data mindset to your reporting - Brant Houston - Illinois NewsTrai...Bringing a data mindset to your reporting - Brant Houston - Illinois NewsTrai...
Bringing a data mindset to your reporting - Brant Houston - Illinois NewsTrai...
 

Dernier

The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 

Dernier (20)

General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 

Preserving Public Government Information: The End of Term Web Archive

  • 1. Preserving Public Government Information: The End of Term Web Archive Abbie Grotke, Library of Congress Tracy Seneca, California Digital Library END OF TERM ARCHIVE – FDLC Oct. 17, 2011
  • 2. Outline  Background  Nomination of URLs  Data transfer  Preparing for Access  Demo of public interface  2012: what you can do  Our questions for you / your questions for us!
  • 3. Themes  Ad-hoc nature of this project  Wellspring of web archiving tools  Testbed for emerging tools
  • 4. Collaborating Institutions  Library of Congress  Internet Archive  California Digital Library  University of North Texas  US Government Printing Office
  • 5. Why Archive .gov? Why Collaborate?  Fit with partner missions to collect and preserve at-risk (born-digital) government information  Potential for High Research Use/Interest in Archives  It Takes a Village  Experienced Partners
  • 6. Project Goals  Work collaboratively to preserve public U.S. Government Web sites at the end of the current presidential administration ending January 19, 2009.  Document federal agencies‟ presence on the Web during the transition of Presidential administrations.  To enhance the existing research collections of the five partner institutions.
  • 7. URL Nomination Tool  Facilitates collaboration  Ingest seed lists from different sources  Record known metadata  Branch  Title  Comment  Who nominated  Create seed lists for crawls
  • 8. Volunteer Nominators  Call for volunteers targeted:  Government information specialists  Librarians  Political and social science researchers  Academics  Web archivists  31 individuals signed up to help
  • 9. Nominator To-Dos  Nominate the most critical URLs for capture as "in scope"  Add new URLs not already included in the list  Mark irrelevant or obsolete sites as "out of scope"  Add minimal URL metadata such as site title, agency, etc.
  • 11. In Scope vs. Out of Scope  In scope: Federal government Web sites (.gov, .mil, etc.) in the Legislative, Executive, or Judicial branches of government. Of particular interest for prioritization were sites likely to change dramatically or disappear during the transition of government  Out of scope: Local or state government Web sites, or any other site not part of the above federal government domain  Not captured: intranets, deep web content
  • 12. Prioritized URLs  ~500 URLs nominated by volunteers
  • 13. Selected Researcher/Curator Interests  Homeland Security  Department of Labor  Department of Treasury  Education/“No Child Left Behind”  Health Care Reform  Stem Cell Research  Bush Administration Budget Justifications  Federal Program Assessments (ExpectMore.gov)
  • 14. Nomination Tool Lessons Learned  No coordination of selection or “assignments” so likely gaps in collection  Start with a blank slate rather than pre- populate the database?  Admin tools/Reporting features need more work  Engage more experts to help identify at- risk content
  • 15. Further Lessons Learned  CDL: hard to respond to nominations to shape crawler settings & focus – defaulted to getting as much as we could.  When used for Deepwater Horizon, we found it added an extra task for curators – used “delicious” instead.  The metadata we did get (branch, description) was really valuable after the fact!
  • 16. Questions for you:  What do YOU need from the nomination tool?  Is it nomination or curation?
  • 17. Partner Harvesting Roles  Internet Archive – Broad, comprehensive harvests  Library of Congress – In-depth Legislative branch crawls  University of North Texas – Sites/Agencies that meet current UNT interests, e.g. environmental policy, and collections, as well as several “deep web” sites  California Digital Library – Multiple crawls of all seeds in EOT database; sites of interest to their curators  Government Printing Office – Support and analysis of “official documents” found in collection
  • 18. Crawl Schedule  Two Approaches:  Broad, comprehensive crawls  Prioritized, selective crawls  Key dates:  ElectionDay, November 4  Inauguration Day, January 20
  • 19. Data Transfer  Goal: Distribute 15.9 TB of collected content among partners  LC‟s central transfer server used:  “Pulled” and “pushed” data from and to partners via Internet2, May 2009 – Mid 2010  Common transfer tools, specifications were key More info here: http://blogs.loc.gov/digitalpreservation/2011/07/the- end-of-term-was-only-the-beginning/
  • 21. Preparing for Access 1st Tuesday of each month, 12:00 pm: Anything No, to report nothing to on report on public public access? access.
  • 22. Internet Archive also had:  MODS record extraction tool  A full copy of the content from all EOT partners  A QA “Playback” tool (takes screen images of archived materials)  An export of the Nomination Tool metadata from UNT
  • 24.
  • 25. CDL and IA revise records to simpler Dublin Core
  • 26. Caveats  As with any web archive, the crawler is good, but not always perfect!  Full-text index of 16 TB of data  Some behaviors designed to help rank and navigate such a large body of content
  • 27. Demo of Beta Interface
  • 28. Access gateway lessons  You could easily use this to integrate:  materials from multiple web archives, no matter where.  anydigital content, whether harvested or scanned.
  • 29. Forthcoming  Internet Archive tools for visualizing web archived data (“explore data”)
  • 30. Tag cloud extracted from metadata
  • 31. More questions to you:  How might you use this archive?  Rediscovering, providing continued access to „lost‟ documents?  Researching, visualizing trends in government data?
  • 32. What you can do  Provide feedback on the Beta site  Help nominate URLs for 2012:  Any nominations welcome, any amount of time you can contribute  Need particular help with:  Judicial Branch websites  Important content or subdomains on very large websites (such as NASA.gov) that might be related to current Presidential policies  Government content on non-government domains (.com, .edu, etc.)
  • 33. Timeframe for 2012 project 2012  January 2012: Project team will begin accepting early nominations of priority websites via the Nomination Tool.  Summer 2012: Recruitment of curators/nominators to help identify additional websites for prioritized crawling.  July/August 2012: Bookend (baseline) crawl of government web domains begins.  Summer/Fall 2012: Partners will crawl various aspects of government domains at varying frequencies, depending on selection polices/interests. Team will determine strategy for crawling prioritized websites.  November - February 2012-13: Crawl of prioritized websites. 2013  January 2013: Depending on the outcome of the election, focused crawls will be conducted as needed during this period.  Spring or Summer 2013: Bookend crawl, plus additional crawl of prioritized websites as determined by team.

Notes de l'éditeur

  1. Tracy: a downside of the ad-hoc process… no funding and no devoted staff means that it slips to the back-burnerKris and Tracy racking our brains as time permitted… how do we do this? Do we use the nomination tool itself as an access point?
  2. Tracy sees CDL’s Martin Haye demonstrate a test implementation of the XTF access framework on Merritt, which is currently a dark archive with a preservation management interface but no fine-tuned public interface
  3. Tracy - Access Framework for multiple CDL resources – feed it METS files, EAD, TEI, Dublin Core… you tailor the display to the type of material you’re displaying… why not feed it metadata records about archived websites? It doesn’t matter WHERE the content is.
  4. They just needed putting together
  5. Tracy
  6. Tracy – what we learned from building the access interface
  7. More on the way over the next month