SlideShare une entreprise Scribd logo
1  sur  13
Télécharger pour lire hors ligne
HDF5 ⬌ Zarr
2020 ESIP Summer Meeting
Aleksandar Jelenak
• Zarr ➔ HDF5*
• HDF5 ➔ Zarr
*A portion of this work was supported by the United States Geological Survey (USGS). Any opinions, findings, conclusions, or recommendations
expressed in this material are those of the authors and do not necessarily reflect the views of United States Government.
2Overview
Zarr ➔ HDF5 3
https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314
• Created a new Zarr store: FileChunkStore
• Developed Zarr JSON metadata for (HDF5) chunk file location: .zchunkstore
• Small fixes in the zarr and xarray Python packages to support a separate Zarr store
for file chunks
4Zarr ➔ HDF5: Implementation
"zeta/9.38": {
"offset": 3883848970,
"size": 6907818
},
"zeta/9.39": {
"offset": 3890756788,
"size": 7879493
},
"zeta/9.4": {
"offset": 3648355033,
"size": 6525250
},
"zeta/9.40": {
"offset": 3898636281,
"size": 7132453
}
5Zarr ➔ HDF5: Performance comparison
Zarr reading Zarr Zarr reading HDF5
• Python >= 3.6
• HDF5-1.10.6 library
• pip install git+https://github.com/h5py/h5py.git
• pip install git+https://github.com/ajelenak/xarray.git@zarr-chunkstore
• pip install git+https://github.com/HDFGroup/zarr-python.git@hdf5
• pip install fsspec
• HDF5-to-Zarr translator:
https://gist.github.com/ajelenak/80354a95b449cedea5cca508004f97a9
6Zarr ➔ HDF5: How to try out?
• HDF5 dataset compact layout not supported
• HDF5 dataset data may be written by compressor/filter not supported by Zarr
• Storage system hosting the HDF5 file must allow partial file reading
7Zarr ➔ HDF5: Limitations
• HDF5 API access to Zarr data is provided by the HDF’s Highly Scalabale Data
Service (HSDS)
• Only for Zarr data in AWS S3
• HSDS object store schema is similar to Zarr: a combination of JSON and binary
objects
• HSDS JSON is a superset of HDF5/JSON
• Still work in progress
8HDF5 ➔ Zarr
• Using special HSDS schema chunking layout: H5D_CHUNKED_REF_INDIRECT
• This chunking layout is not supported by the HDF5 library
• Developed to enable HSDS access to chunks in HDF5 files in object storage
• Chunk information for one Zarr array is stored as an anonymous HDF5 compound
dataset
• The compound datatype has 3 fields for: byte offset (always 0), chunk object size,
and chunk object URI
• The HDF5 dataset representing the Zarr array has the
H5D_CHUNKED_REF_INDIRECT layout and its value points to the anonymous HDF5
dataset with chunk location information
9HDF5 ➔ Zarr: Implementation
• Because HSDS does not (yet) support the Blosc compressor, the original Zarr
dataset was copied with the Zlib compressor instead:
10HDF5 ➔ Zarr: Data Wrangling
from sys import stdout
import zarr
import fsspec
from numcodecs import Zlib
src_root = zarr.open(
fsspec.get_mapper('s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d',
anon=False,
requester_pays=True),
mode='r')
dest_root = zarr.open(
fsspec.get_mapper('s3://hdf5-zarr/adcirc_01d.zarr’, anon=False),
mode='w')
zarr.copy_all(src_root, dest_root, shallow=False, without_attrs=False,
log=stdout, if_exists='replace', dry_run=True,
compressor=Zlib(level=6), filters=None)
HDF5 ➔ Zarr: Example translation
Zarr array HDF5 dataset with chunk info
Name : /zeta
Type : zarr.core.Array
Data type : float64
Shape : (720, 9228245)
Chunk shape : (10, 141973)
Compressor : Zlib(level=6)
No. bytes : 53154691200 (49.5G)
Chunks initialized : 4680/4680
Type : h5pyd.Dataset
Data type : compound
Shape : (72, 65)
Value:
[[(0, 1949049, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.0')
(0, 2911533, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.1')
(0, 2506163, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.2') ...
(0, 4344724, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.62')
(0, 5696617, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.63')
(0, 4275725, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.64')]
11
• Zarr array data may be written by compressor/filter not supported by HSDS
• Since both Zarr and HSDS are written in Python, it should be possible to add Zarr’s
numcodecs package to HSDS to resolve this limitation
12HDF5 ➔ Zarr: Limitations
THANK YOU!
ajelenak@hdfgroup.org
13

Contenu connexe

Tendances

STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...The HDF-EOS Tools and Information Center
 
Interoperability with netCDF-4 - Experience with NPP and HDF-EOS5 products
Interoperability with netCDF-4 - Experience with NPP and HDF-EOS5 productsInteroperability with netCDF-4 - Experience with NPP and HDF-EOS5 products
Interoperability with netCDF-4 - Experience with NPP and HDF-EOS5 productsThe HDF-EOS Tools and Information Center
 

Tendances (20)

MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10
 
HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020
 
HDF for the Cloud - Serverless HDF
HDF for the Cloud - Serverless HDFHDF for the Cloud - Serverless HDF
HDF for the Cloud - Serverless HDF
 
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
 
Leveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software TestingLeveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software Testing
 
MODIS Land and HDF-EOS
MODIS Land and HDF-EOSMODIS Land and HDF-EOS
MODIS Land and HDF-EOS
 
Parallel HDF5 Developments
Parallel HDF5 DevelopmentsParallel HDF5 Developments
Parallel HDF5 Developments
 
HDF-EOS 2/5 to netCDF Converter
HDF-EOS 2/5 to netCDF ConverterHDF-EOS 2/5 to netCDF Converter
HDF-EOS 2/5 to netCDF Converter
 
Easy Access of NASA HDF data via OPeNDAP
Easy Access of NASA HDF data via OPeNDAPEasy Access of NASA HDF data via OPeNDAP
Easy Access of NASA HDF data via OPeNDAP
 
NetCDF and HDF5
NetCDF and HDF5NetCDF and HDF5
NetCDF and HDF5
 
HDF Update 2016
HDF Update 2016HDF Update 2016
HDF Update 2016
 
Product Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the WebProduct Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the Web
 
Google Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOSGoogle Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOS
 
HDF Product Designer
HDF Product DesignerHDF Product Designer
HDF Product Designer
 
Efficiently serving HDF5 via OPeNDAP
Efficiently serving HDF5 via OPeNDAPEfficiently serving HDF5 via OPeNDAP
Efficiently serving HDF5 via OPeNDAP
 
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and CapabilitiesMATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
 
NEON HDF5
NEON HDF5NEON HDF5
NEON HDF5
 
Open-source Scientific Computing and Data Analytics using HDF
Open-source Scientific Computing and Data Analytics using HDFOpen-source Scientific Computing and Data Analytics using HDF
Open-source Scientific Computing and Data Analytics using HDF
 
Interoperability with netCDF-4 - Experience with NPP and HDF-EOS5 products
Interoperability with netCDF-4 - Experience with NPP and HDF-EOS5 productsInteroperability with netCDF-4 - Experience with NPP and HDF-EOS5 products
Interoperability with netCDF-4 - Experience with NPP and HDF-EOS5 products
 
HDF Product Designer: Using Templates to Achieve Interoperability
HDF Product Designer: Using Templates to Achieve InteroperabilityHDF Product Designer: Using Templates to Achieve Interoperability
HDF Product Designer: Using Templates to Achieve Interoperability
 

Similaire à HDF5 <-> Zarr

HDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, Datatypes
HDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, DatatypesHDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, Datatypes
HDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, DatatypesThe HDF-EOS Tools and Information Center
 

Similaire à HDF5 <-> Zarr (20)

Cloud-Optimized HDF5 Files
Cloud-Optimized HDF5 FilesCloud-Optimized HDF5 Files
Cloud-Optimized HDF5 Files
 
Creating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 FilesCreating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 Files
 
HDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, Datatypes
HDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, DatatypesHDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, Datatypes
HDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, Datatypes
 
Dimension Scales in HDF-EOS2 and HDF-EOS5
Dimension Scales in HDF-EOS2 and HDF-EOS5 Dimension Scales in HDF-EOS2 and HDF-EOS5
Dimension Scales in HDF-EOS2 and HDF-EOS5
 
HDF5 Tools Updates
HDF5 Tools UpdatesHDF5 Tools Updates
HDF5 Tools Updates
 
HDF5 Tools Update
HDF5 Tools UpdateHDF5 Tools Update
HDF5 Tools Update
 
Hdf5 intro
Hdf5 introHdf5 intro
Hdf5 intro
 
HDF for the Cloud
HDF for the CloudHDF for the Cloud
HDF for the Cloud
 
HDF5 Advanced Topics
HDF5 Advanced TopicsHDF5 Advanced Topics
HDF5 Advanced Topics
 
HDF Update for DAAC Managers (2017-02-27)
HDF Update for DAAC Managers (2017-02-27)HDF Update for DAAC Managers (2017-02-27)
HDF Update for DAAC Managers (2017-02-27)
 
S3 VFD
S3 VFDS3 VFD
S3 VFD
 
Performance Tuning in HDF5
Performance Tuning in HDF5 Performance Tuning in HDF5
Performance Tuning in HDF5
 
Introduction to HDF5
Introduction to HDF5Introduction to HDF5
Introduction to HDF5
 
Advanced HDF5 Features
Advanced HDF5 FeaturesAdvanced HDF5 Features
Advanced HDF5 Features
 
Integrating HDF5 with SRB
Integrating HDF5 with SRBIntegrating HDF5 with SRB
Integrating HDF5 with SRB
 
Introduction to NetCDF-4
Introduction to NetCDF-4Introduction to NetCDF-4
Introduction to NetCDF-4
 
Using HDF5 tools for performance tuning and troubleshooting
Using HDF5 tools for performance tuning and troubleshootingUsing HDF5 tools for performance tuning and troubleshooting
Using HDF5 tools for performance tuning and troubleshooting
 
Introduction to HDF5
Introduction to HDF5Introduction to HDF5
Introduction to HDF5
 
HDF5 Life cycle of data
HDF5 Life cycle of dataHDF5 Life cycle of data
HDF5 Life cycle of data
 
UML Representation of NPOESS Data Products in HDF5
UML Representation of NPOESS Data Products in HDF5UML Representation of NPOESS Data Products in HDF5
UML Representation of NPOESS Data Products in HDF5
 

Plus de The HDF-EOS Tools and Information Center

Plus de The HDF-EOS Tools and Information Center (13)

Accessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDSAccessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDS
 
The State of HDF
The State of HDFThe State of HDF
The State of HDF
 
Highly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance FeaturesHighly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance Features
 
HDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance DiscussionHDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance Discussion
 
Hyrax: Serving Data from S3
Hyrax: Serving Data from S3Hyrax: Serving Data from S3
Hyrax: Serving Data from S3
 
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLABAccessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
 
HDF - Current status and Future Directions
HDF - Current status and Future DirectionsHDF - Current status and Future Directions
HDF - Current status and Future Directions
 
HDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's GuideHDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's Guide
 
HDF Status Update
HDF Status UpdateHDF Status Update
HDF Status Update
 
NASA Terra Data Fusion
NASA Terra Data FusionNASA Terra Data Fusion
NASA Terra Data Fusion
 
HDF Cloud: HDF5 at Scale
HDF Cloud: HDF5 at ScaleHDF Cloud: HDF5 at Scale
HDF Cloud: HDF5 at Scale
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
 

Dernier

Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningVitsRangannavar
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?Watsoo Telematics
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 

Dernier (20)

Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
cybersecurity notes for mca students for learning
cybersecurity notes for mca students for learningcybersecurity notes for mca students for learning
cybersecurity notes for mca students for learning
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?What are the features of Vehicle Tracking System?
What are the features of Vehicle Tracking System?
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 

HDF5 <-> Zarr

  • 1. HDF5 ⬌ Zarr 2020 ESIP Summer Meeting Aleksandar Jelenak
  • 2. • Zarr ➔ HDF5* • HDF5 ➔ Zarr *A portion of this work was supported by the United States Geological Survey (USGS). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of United States Government. 2Overview
  • 3. Zarr ➔ HDF5 3 https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314
  • 4. • Created a new Zarr store: FileChunkStore • Developed Zarr JSON metadata for (HDF5) chunk file location: .zchunkstore • Small fixes in the zarr and xarray Python packages to support a separate Zarr store for file chunks 4Zarr ➔ HDF5: Implementation "zeta/9.38": { "offset": 3883848970, "size": 6907818 }, "zeta/9.39": { "offset": 3890756788, "size": 7879493 }, "zeta/9.4": { "offset": 3648355033, "size": 6525250 }, "zeta/9.40": { "offset": 3898636281, "size": 7132453 }
  • 5. 5Zarr ➔ HDF5: Performance comparison Zarr reading Zarr Zarr reading HDF5
  • 6. • Python >= 3.6 • HDF5-1.10.6 library • pip install git+https://github.com/h5py/h5py.git • pip install git+https://github.com/ajelenak/xarray.git@zarr-chunkstore • pip install git+https://github.com/HDFGroup/zarr-python.git@hdf5 • pip install fsspec • HDF5-to-Zarr translator: https://gist.github.com/ajelenak/80354a95b449cedea5cca508004f97a9 6Zarr ➔ HDF5: How to try out?
  • 7. • HDF5 dataset compact layout not supported • HDF5 dataset data may be written by compressor/filter not supported by Zarr • Storage system hosting the HDF5 file must allow partial file reading 7Zarr ➔ HDF5: Limitations
  • 8. • HDF5 API access to Zarr data is provided by the HDF’s Highly Scalabale Data Service (HSDS) • Only for Zarr data in AWS S3 • HSDS object store schema is similar to Zarr: a combination of JSON and binary objects • HSDS JSON is a superset of HDF5/JSON • Still work in progress 8HDF5 ➔ Zarr
  • 9. • Using special HSDS schema chunking layout: H5D_CHUNKED_REF_INDIRECT • This chunking layout is not supported by the HDF5 library • Developed to enable HSDS access to chunks in HDF5 files in object storage • Chunk information for one Zarr array is stored as an anonymous HDF5 compound dataset • The compound datatype has 3 fields for: byte offset (always 0), chunk object size, and chunk object URI • The HDF5 dataset representing the Zarr array has the H5D_CHUNKED_REF_INDIRECT layout and its value points to the anonymous HDF5 dataset with chunk location information 9HDF5 ➔ Zarr: Implementation
  • 10. • Because HSDS does not (yet) support the Blosc compressor, the original Zarr dataset was copied with the Zlib compressor instead: 10HDF5 ➔ Zarr: Data Wrangling from sys import stdout import zarr import fsspec from numcodecs import Zlib src_root = zarr.open( fsspec.get_mapper('s3://pangeo-data-uswest2/esip/adcirc/adcirc_01d', anon=False, requester_pays=True), mode='r') dest_root = zarr.open( fsspec.get_mapper('s3://hdf5-zarr/adcirc_01d.zarr’, anon=False), mode='w') zarr.copy_all(src_root, dest_root, shallow=False, without_attrs=False, log=stdout, if_exists='replace', dry_run=True, compressor=Zlib(level=6), filters=None)
  • 11. HDF5 ➔ Zarr: Example translation Zarr array HDF5 dataset with chunk info Name : /zeta Type : zarr.core.Array Data type : float64 Shape : (720, 9228245) Chunk shape : (10, 141973) Compressor : Zlib(level=6) No. bytes : 53154691200 (49.5G) Chunks initialized : 4680/4680 Type : h5pyd.Dataset Data type : compound Shape : (72, 65) Value: [[(0, 1949049, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.0') (0, 2911533, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.1') (0, 2506163, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.2') ... (0, 4344724, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.62') (0, 5696617, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.63') (0, 4275725, 's3://hdf5-zarr/adcirc_01d.zarr/zeta/0.64')] 11
  • 12. • Zarr array data may be written by compressor/filter not supported by HSDS • Since both Zarr and HSDS are written in Python, it should be possible to add Zarr’s numcodecs package to HSDS to resolve this limitation 12HDF5 ➔ Zarr: Limitations