Using DISC to Reindex ~50Tb of URIs


                                   Robert Sanderson
                                              rsanderson@lanl.gov
                                              @azaroth42

                                   Herbert Van de Sompel
                                              herbertv@lanl.gov
                                              @hvdsomp

                                     http://www.mementoweb.org/



Towards Seamless Navigation of the Web of the Past
            Big Data Interest Group, LANL, Feb 21st 2013            1
                            [Unclassified]
Summary


Memento

Project Overview
Participants

The Plan …
… and the Reality

Technical Details

Next Run


Memento Project



   “       Provide web citizens seamless access
                   to the web of the past                      ”
Some accolades:
•  $1M funding from the Library of Congress
•  Winner of 2010 International Digital Preservation Award
•  Accepted by International Internet Preservation Consortium
      as the way forwards for access to web archives
•  Nominated for Best Paper at JCDL 2010
•  Best Poster at JCDL 2012
•  Tim Berners-Lee @ LDOW10: “We absolutely need this”
•  Internet Draft 6 (final call) by May




Old Copies have New Names
Everyone knows the URL for CNN is:               http://www.cnn.com/
Everyone knows the URL for CNN was:              http://www.cnn.com/




Old Copies have New Names
Copies of past versions of http://www.cnn.com/ exist
… but you don’t go there to get them, they have new URLs
… and there’s no way to automatically discover those new URLs




          http://web.archive.org/web/20031013111028/http://www2.cnn.com/
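The archive URL above carries both a 14-digit capture timestamp and the original URL. A minimal Python sketch of pulling the two apart, assuming the `/web/<timestamp>/<original URL>` layout seen in this one example (the function name `split_archive_url` is hypothetical, not part of any archive API):

```python
from datetime import datetime

# Prefix taken from the example URL on the slide.
ARCHIVE_PREFIX = "http://web.archive.org/web/"


def split_archive_url(archive_url):
    """Return (capture datetime, original URL) for a URL shaped like the slide's example."""
    if not archive_url.startswith(ARCHIVE_PREFIX):
        raise ValueError("not an archive URL of the assumed shape")
    rest = archive_url[len(ARCHIVE_PREFIX):]
    # Everything up to the first "/" is the timestamp; the remainder is the original URL.
    stamp, _, original = rest.partition("/")
    # Assumed timestamp layout: YYYYMMDDhhmmss.
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    return captured, original


captured, original = split_archive_url(
    "http://web.archive.org/web/20031013111028/http://www2.cnn.com/"
)
print(captured.isoformat(), original)
```

The sketch only shows that the new name can be decoded once you have it; discovering it from http://www.cnn.com/ alone is the part that is missing.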


Discovery is Hard
People want to find, not search
… and especially not searching by hand!




               http://web.archive.org/web/*/http://www2.cnn.com/


Navigating in the Past
Links to the real resource bring you back out of the past
… or trap you in a single incomplete archive.




[Figure. Labels: "Pentagon"; timeline from "Dec 20 2001, 4:51:00 UTC" to "current"]

                   http://web.archive.org/web/*/http://www2.cnn.com/


Memento: Time Travel for the Web
  TimeGate: Resource that knows the locations of archived copies
  Memento: An archived copy of previous state of a web resource

Link header to TimeGate
                           302 redirect to Memento
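The client side of this exchange can be sketched in Python without touching the network. Two details below are assumptions drawn from the Memento protocol specification rather than from these slides: the `Accept-Datetime` request header and the `rel="timegate"` relation in the `Link` header. The TimeGate URL is a hypothetical placeholder, and the regular expression assumes `rel` is the first parameter after each link target.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Matches <target>; rel="..." pairs. Assumes rel is the first link parameter.
LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def find_timegate(link_header):
    """Return the target of the first link whose rel includes "timegate", else None."""
    for target, rel in LINK_RE.findall(link_header):
        if "timegate" in rel.split():
            return target
    return None


def datetime_request_headers(when):
    """Build request headers asking a TimeGate for the copy closest to `when`.

    `when` must be timezone-aware; it is converted to UTC and formatted
    as an HTTP date.
    """
    utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


# Hypothetical Link header as the original resource might return it.
link = '<http://timegate.example.org/http://www.cnn.com/>; rel="timegate"'
timegate = find_timegate(link)
headers = datetime_request_headers(datetime(2001, 12, 20, 4, 51, 0, tzinfo=timezone.utc))
print(timegate, headers)
```

A real client would then send a GET to `timegate` with these headers and follow the 302 redirect to the Memento.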




  How does the TimeGate know where the appropriate Memento is?


Project Overview

Goal:
•  To aggregate the metadata of the distributed archives
   of the IIPC, and
    •  To provide Memento based access to the holdings
       of open archives
    •  To provide knowledge of the holdings of restricted
       archives
    •  To provide knowledge to IIPC members of the
       holdings of totally closed archives




Experiment Participants

•    Austrian National Library
•    Bibliothèque Nationale de France
•    British Library
•    Institut National de l'Audiovisuel
•    Internet Archive
•    Koninklijke Bibliotheek
•    Library of Congress
•    Netarchive.dk
•    Swiss National Library
•    University of North Texas

• Los Alamos National Laboratory
• Old Dominion University

Data

   Field          Value
   Canonical URL  kosovakosovo.com/photo.php?id=5785
   Datestamp      20090608161553
   Request URL    http://www.kosovakosovo.com/photo.php?id=5785
   MIME type      text/html
   HTTP Status    200
   Checksum       M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN
   Redirect?      -
   Bytes          563096
   Storage File   AIT-1068-20090608161511.warc.gz

   Records like this one add up to 6Tb compressed, ~50Tb uncompressed
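A record like the one above can be loaded into a typed structure. The Python sketch below assumes a whitespace-separated line carrying exactly the nine fields in the order shown on the slide; real CDX files may order or extend the fields differently, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class IndexRecord:
    canonical_url: str
    captured: datetime
    request_url: str
    mime_type: str
    http_status: int
    checksum: str
    redirect: str
    size_bytes: int
    storage_file: str


def parse_record(line):
    """Parse one line holding the nine fields in the slide's order (an assumption)."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    return IndexRecord(
        canonical_url=fields[0],
        captured=datetime.strptime(fields[1], "%Y%m%d%H%M%S"),
        request_url=fields[2],
        mime_type=fields[3],
        http_status=int(fields[4]),
        checksum=fields[5],
        redirect=fields[6],
        size_bytes=int(fields[7]),
        storage_file=fields[8],
    )


record = parse_record(
    "kosovakosovo.com/photo.php?id=5785 20090608161553 "
    "http://www.kosovakosovo.com/photo.php?id=5785 text/html 200 "
    "M36MRHSBVPLKMUN6PFOIEV3AH5ADITAN - 563096 "
    "AIT-1068-20090608161511.warc.gz"
)
print(record.canonical_url, record.captured)
```

For the index described in this talk, only the canonical URL and the datestamp matter; the other fields are carried along for completeness.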



The Plan…

•  To provide fast access to distributed archives, LANL
   would merge the indexes of the holdings of multiple
   archives and provide Memento based access

•  Step 1: Library of Congress gathers CDX files
   Step 2: LANL indexes (a.k.a. “…”)
   Step 3: Profit

•  Data: 6T of gzipped CDX files (mostly from IA)
    •  Shipped on hard drives
•  Computing: 210 node DISC cluster at LANL
    •  2x 2GHz processors, 2x 2T HDD, 8G RAM


… and the Reality

•  Hardware failure killed one of the drives en route
    •  Transferred remaining files via BagIt from LoC

•  DISC has restricted access:
    •  Had to transfer data over intranet
    •  2 weeks to sync (5Mb/sec)
    •  And then 2 weeks to get the processed results off

•  Compute cluster has faulty switch, unreliable nodes:
    •  Ran original processing 15 times without success
       due to hardware failures
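The two-week sync figure can be sanity-checked with simple arithmetic. The Python sketch below assumes the full 6T of compressed data moved at a steady rate and uses decimal units; both are assumptions, since the slides do not say how much was actually synced or whether "Mb" means megabytes or megabits.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_days(total_bytes, bytes_per_second):
    """Days needed to move total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / SECONDS_PER_DAY


data_bytes = 6 * 10**12  # 6T of gzipped CDX files, decimal units assumed

# Reading "5Mb/sec" as 5 megabytes per second.
days_if_megabytes = transfer_days(data_bytes, 5 * 10**6)

# Reading "5Mb/sec" as 5 megabits per second (one eighth of the byte rate).
days_if_megabits = transfer_days(data_bytes, 5 * 10**6 / 8)

print(f"{days_if_megabytes:.1f} days at 5 MB/s, {days_if_megabits:.1f} days at 5 Mbit/s")
```

Under these assumptions the megabyte reading gives just under 14 days, which matches the two weeks quoted; the megabit reading would take several months.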



Processing Design

•  For each CDX file,
    •  For each URI + timestamps,
        •  Map URI to an appropriate database slice
        •  Merge timestamps with those of previous CDXs

•  Possible because:
    •  No need to do truncated search
    •  No need to walk through URIs in order
    •  No need for time based access, only URI

•  Problem is “Embarrassingly Parallel”
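The loop above can be sketched in a few lines of Python. The slides do not say how a URI is mapped to a slice; hashing the URI and taking the remainder is an assumption here, as are the in-memory dictionaries standing in for the database slices and the second, made-up timestamp. The slice count of 3000 is the figure quoted later in the talk.

```python
import hashlib
from collections import defaultdict

NUM_SLICES = 3000  # slice count quoted later in the talk


def slice_for(uri, num_slices=NUM_SLICES):
    """Map a URI to a slice index with a stable hash (assumed partitioning scheme)."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def merge_records(slices, records):
    """Merge (uri, timestamp) pairs from one CDX file into the slices."""
    for uri, timestamp in records:
        slices[slice_for(uri)].setdefault(uri, set()).add(timestamp)


# slice index -> {uri -> set of timestamps}; a stand-in for the real databases.
slices = defaultdict(dict)

uri = "kosovakosovo.com/photo.php?id=5785"
first_cdx = [(uri, "20090608161553")]
second_cdx = [(uri, "20100101000000")]  # hypothetical later capture

merge_records(slices, first_cdx)
merge_records(slices, second_cdx)
print(slice_for(uri), sorted(slices[slice_for(uri)][uri]))
```

Because a URI always lands in the same slice, no slice ever needs data held by another, which is what makes the problem embarrassingly parallel.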



Approach 1: Online Messaging

•  25 read nodes, 1 control node, 150 write nodes
•  Messages (1000 URIs) sent via control node to write




•  Failed 15 times due to hardware issues
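The batching step can be illustrated on its own. The Python sketch below only groups URIs into messages of 1000; it does not reproduce the MPI transport or the control node, and the generator name and sample URIs are hypothetical.

```python
from itertools import islice


def batches(items, size=1000):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical input: 2500 URIs read from a CDX file.
uris = [f"example.org/page{i}" for i in range(2500)]
message_sizes = [len(batch) for batch in batches(uris)]
print(message_sizes)
```

Every message still has to pass through the single control node, so one failed node or link can stall the whole run.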
Approach 1: Implementation

•  Python

•  MPI        :(
•  PyMPI       :( :(
•  No failure detection, no auto restart, no zombie killing,
   no …        :( :( :(
•  Several months of experiments to find issues and fine-tune
   the parameters most likely to let a run complete…




Approach 2: No Interaction

•  43 read/split nodes
•  Phase 1: Read nodes split CDX files to 3000 slices
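A rough Python sketch of what a Phase 1 split could look like on a single read node. The hash-based slice choice, the slice file naming, and the sample lines are assumptions for illustration; the slides give only the slice count.

```python
import hashlib
import os
import tempfile


def slice_for(canonical_url, num_slices):
    """Stable hash of the canonical URL, reduced to a slice index (assumed scheme)."""
    digest = hashlib.md5(canonical_url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def split_cdx_lines(lines, out_dir, num_slices=3000):
    """Append each line to the slice file chosen by its first field; return touched slices."""
    touched = set()
    for line in lines:
        if not line.strip():
            continue
        canonical_url = line.split(None, 1)[0]
        index = slice_for(canonical_url, num_slices)
        path = os.path.join(out_dir, f"slice-{index:04d}.cdx")
        # Reopening the file per line keeps the sketch short; a real run
        # would buffer writes per slice.
        with open(path, "a", encoding="utf-8") as out:
            out.write(line.rstrip("\n") + "\n")
        touched.add(index)
    return touched


# Hypothetical CDX lines reduced to canonical URL and datestamp.
sample_lines = [
    "a.example/ 20090101000000",
    "a.example/ 20100101000000",
    "b.example/ 20090101000000",
]

with tempfile.TemporaryDirectory() as out_dir:
    touched = split_cdx_lines(sample_lines, out_dir)
    a_path = os.path.join(out_dir, f"slice-{slice_for('a.example/', 3000):04d}.cdx")
    with open(a_path, encoding="utf-8") as handle:
        lines_for_a = [line for line in handle if line.startswith("a.example/ ")]

print(sorted(touched), len(lines_for_a))
```

Each read node works only on its own files and writes only local output, so a node failure costs that node's share of the work rather than the whole run.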




Approach 2: No Interaction

•  Phase 2a: Transfer CDX slices to Control node
•  Phase 2b: Transfer CDX slices to Write nodes




Approach 2: No Interaction

•  50 write nodes (* 60 slices each = 3000 slices)
•  Phase 3: Merge slices from nodes to BerkeleyDBs
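The numbers on this slide imply a fixed mapping from slices to write nodes. The Python sketch below assumes contiguous blocks of 60 slices per node and uses a plain dictionary as a stand-in for a BerkeleyDB; neither detail is confirmed by the slides, and the sample data is made up.

```python
import heapq

SLICES_PER_NODE = 60  # 50 write nodes * 60 slices each = 3000 slices


def node_for_slice(slice_index):
    """Write node owning a slice, assuming contiguous blocks of 60 slices."""
    return slice_index // SLICES_PER_NODE


def merge_slice(database, slice_records):
    """Merge (uri, timestamps) pairs into `database`, keeping sorted unique timestamps."""
    for uri, timestamps in slice_records:
        existing = database.get(uri, [])
        merged = []
        # heapq.merge walks both sorted inputs in order; duplicates are skipped.
        for stamp in heapq.merge(existing, sorted(timestamps)):
            if not merged or merged[-1] != stamp:
                merged.append(stamp)
        database[uri] = merged


database = {}  # stand-in for one BerkeleyDB
merge_slice(database, [("a.example/", ["20100101000000", "20090101000000"])])
merge_slice(database, [("a.example/", ["20090101000000", "20110101000000"])])
print(node_for_slice(2999), database)
```

Fixed-width 14-digit timestamps sort correctly as strings, so the merge needs no date parsing.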




Next Steps


•  Re-index, using non-interactive approach
    •  New data: 14 Tb (~120Tb uncompressed?)
    •  Use FileMap for better automation?

•  Two 8T eSATA LaCie cubes and two 3T internal drives to
   install locally

•  More intelligent data partitioning
•  Some sort of error detection and handling!
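The "~120Tb uncompressed?" estimate for the new data can be checked against the first run. The Python sketch below assumes the new CDX files compress at the same ratio as the first batch (6Tb compressed, ~50Tb uncompressed), which the slides do not confirm.

```python
# Figures from the first run.
first_compressed_tb = 6
first_uncompressed_tb = 50

# Assumed to carry over unchanged to the new data.
ratio = first_uncompressed_tb / first_compressed_tb

new_compressed_tb = 14
estimated_uncompressed_tb = new_compressed_tb * ratio

print(f"ratio ~{ratio:.2f}, estimate ~{estimated_uncompressed_tb:.0f} Tb")
```

The result is a little under 120, consistent with the rounded figure and its question mark.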




Memento

                              http://mementoweb.org/

                                          Robert Sanderson
                                        rsanderson@lanl.gov
                                                @azaroth42

                                     Herbert Van de Sompel
                                         herbertv@lanl.gov
                                                @hvdsomp

