Pairtrees for object storage
John Kunze and Stephen Abrams, California Digital Library (CDL)

The deadly embrace
• Digital repositories tend to require a surrender of storage transparency that creates unhealthy system dependency
• Internally objects are often broken up so that they can be difficult to piece together in case of trouble

Fig. 1. Object storage should not need a fearful entanglement with software. Since objects have to be parked in a filesystem before repository software upgrade, what if we left them in there and built our repositories around them?

A pairtree maps ids to paths, two characters at a time
A pairtree is a filesystem hierarchy that uses an identifier string to derive an object directory (or folder) location
• The derivation takes successive pairs of characters and creates a succession of directories, called a pairpath
      ab2def3 ⇒ ab/2d/ef/3/
• A pairpath ends at directory containing an object’s files; most systems do variation of this (is variation needed?)
• Reverse the mapping to find all ids/objects in a pairtree; pairpath termination rules permit variable length ids

Pre-converting problematic characters
Some identifier characters are inconvenient or illegal in filenames and must be hex-encoded (e.g., *→^2a)
      id:  what-the-*@?#!
         → what-the-^2a@^3f#!
         ⇒ wh/at/-t/he/-^/2a/@^/3f/#!
But to keep paths short, 3 common chars are converted to 3 rare chars (at cost of complexity): /→= :→+ .→,
      id:  ark:/13030/xt12t3
         → ark+=13030=xt12t3
         ⇒ ar/k+/=1/30/30/=x/t1/2t/3/

Objects in a pairtree
A pairtree is especially useful if, for each contained object, all of the object’s parts, and nothing but its parts, are enclosed in the object’s directory
Import such a pairtree and, knowing nothing about the objects’ structure and semantics, you can reliably
• Enumerate all objects and their identifiers
• Produce any object by requested id
• Maintain and back it up with ordinary OS tools
• Rebuild the collection in case of database corruption simply by walking the pairtree
To walk a pairtree requires knowing path termination rules
• A pairpath terminates when you reach a file or reach a directory name with 1 char or more than 2 chars

      ab/
      --- cd/
           |--- foo/
           |      | README.txt
           |      | thumbnail.gif
           |      |--- master_images/
           |      |       | ...
           |      |
           |      --- gh/
           --- e/
                --- bar/
                     | metadata
                     | 54321.wav
                     | index.html

Fig. 2. Example pairtree containing two objects: abcd and abcde. The first object is enclosed in directory foo/, the second in bar/. While foo/ does not subsume e/ at the same level, by enclosure, it does subsume the gh/ underneath it.

Sample software implementation
http://search.cpan.org/~jak/Pairtree-0.2/lib/File/Pairtree.pm
A Perl module that implements two mappings: id2ppath() takes an id into a pairpath and ppath2id() performs the inverse mapping.

Summary
Pairtree is the thinnest smear we can add to our very well-understood filesystems and their universal tools (the universal “API”) to create a very well-understood, platform-independent object storage substrate
Pairtree is not a complete repository system, but it is complete for object storage and makes it easier to build systems and to share objects between institutions

Why pairs of characters?
Taking two chars at a time balances path depth and fanout (number of possible entries in any directory)
• Example: ab2def3 ⇒ ab/2d/ef/3/
• Each pair, letters+digits, has 36x36 possibilities
Compared to taking one char at a time
• Only 36 possibilities, but path depth grows rapidly
• Example: ab2def3 ⇒ a/b/2/d/e/f/3/
At another extreme, taking seven characters at a time
• Short paths, but 78 billion (36⁷) possible items
• Example: ab2def3 ⇒ ab2def3/

Pairtree credits and details
Pairtree specification:
      www.ietf.org/internet-drafts/draft-kunze-pairtree-01.txt
      www.cdlib.org/inside/diglib/pairtree/pairtreespec.html
Authors from CDL and University of Michigan (UM): Martin Haye, Erik Hetzner, John Kunze, Mark Reyes, and Cory Snavely; many thanks to Stephen Abrams, Sebastien Korner, Brian Tingle, et al
Pairtree origins include
• Prototype: UCSF tobacco control documents and CDL digitized books
• Early production: digitized books for UM and Hathi Trust

For further information
Please contact jak@ucop.edu or stephen.abrams@ucop.edu
For information on CDL’s Preservation Program, see http://www.cdlib.org/programs/digital_preservation.html
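
The id-to-pairpath mapping and the character pre-conversion described on this poster can be sketched in a few lines of Python. This is an illustration only, not the File::Pairtree Perl module: the function names id2ppath and ppath2id are borrowed from that module's description, while clean_id and unclean_id are hypothetical helpers. The poster shows only * and ? being hex-encoded; the fuller set of hex-encoded characters used here is an assumption based on the Pairtree specification and should be checked against it.

```python
import re

# Characters written as ^hh. The poster shows only '*' and '?'; the rest of
# this set is an assumption based on the Pairtree specification.
_HEX_CHARS = set('"*+,<=>?\\^|')

# Single-character swaps named on the poster: / -> =, : -> +, . -> ,
_SWAP = str.maketrans("/:.", "=+,")
_UNSWAP = str.maketrans("=+,", "/:.")


def clean_id(identifier: str) -> str:
    """Hex-encode awkward bytes first, then apply the three swaps."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_CHARS:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    return "".join(out).translate(_SWAP)


def unclean_id(cleaned: str) -> str:
    """Invert clean_id: undo the swaps, then decode each ^hh sequence."""
    swapped = cleaned.translate(_UNSWAP).encode("ascii")
    data = re.sub(
        rb"\^([0-9a-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        swapped,
    )
    return data.decode("utf-8")


def id2ppath(identifier: str) -> str:
    """Split the cleaned id into successive pairs of characters."""
    if not identifier:
        raise ValueError("identifier must not be empty")
    cleaned = clean_id(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs) + "/"


def ppath2id(ppath: str) -> str:
    """Join the path segments back together and undo the cleaning."""
    cleaned = "".join(part for part in ppath.split("/") if part)
    return unclean_id(cleaned)


if __name__ == "__main__":
    print(id2ppath("ab2def3"))            # ab/2d/ef/3/
    print(id2ppath("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3/
    print(ppath2id("ar/k+/=1/30/30/=x/t1/2t/3/"))  # ark:/13030/xt12t3
```

The order of the two cleaning steps matters: hex-encoding runs first so that a literal =, + or , in an identifier is already written as ^hh before the swaps introduce those characters, which keeps the mapping reversible.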

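
The path termination rule under "Objects in a pairtree" (a pairpath ends at a file, or at a directory whose name has 1 character or more than 2 characters) is enough to enumerate a pairtree's objects with ordinary filesystem calls. The following Python sketch is one possible reading of that rule, not code from the specification: walk_pairtree and _walk are hypothetical names, and treating every subdirectory beneath a 1-character directory as an object directory is an assumption made to keep the example short.

```python
from pathlib import Path
from typing import Iterator, Tuple


def walk_pairtree(root: Path) -> Iterator[Tuple[str, Path]]:
    """Yield (cleaned id, object directory) for each object below root."""
    yield from _walk(Path(root), "", False)


def _walk(directory: Path, prefix: str, ended: bool) -> Iterator[Tuple[str, Path]]:
    has_files = False
    for entry in sorted(directory.iterdir()):
        if not entry.is_dir():
            # A file ends the pairpath: the object lives in this directory.
            has_files = True
        elif len(entry.name) == 2 and not ended:
            # Two characters: the pairpath continues.
            yield from _walk(entry, prefix + entry.name, False)
        elif len(entry.name) == 1 and not ended:
            # One character: last segment of the pairpath.
            yield from _walk(entry, prefix + entry.name, True)
        elif prefix:
            # Longer name (or anything under a 1-char directory): this
            # directory encloses the object and is not descended into.
            yield prefix, entry
    if has_files and prefix:
        yield prefix, directory


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Rebuild the layout of Fig. 2.
        (root / "ab/cd/foo/master_images").mkdir(parents=True)
        (root / "ab/cd/foo/gh").mkdir()
        (root / "ab/cd/foo/README.txt").touch()
        (root / "ab/cd/e/bar").mkdir(parents=True)
        (root / "ab/cd/e/bar/metadata").touch()
        for object_id, object_dir in walk_pairtree(root):
            print(object_id, object_dir.relative_to(root))
```

Run against the layout of Fig. 2, the walk reports abcd in foo/ and abcde in bar/, and never descends into foo/gh/ because foo/ already encloses an object. The ids it yields are still in their cleaned, filesystem-safe form; recovering the original identifiers means reversing the character conversions as a separate step.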