Utilizing HDF4 File Content Maps for the Cloud Computing
Hyokyung Joe Lee
The HDF Group
This work was supported by NASA/GSFC under
Raytheon Co. contract number NNG15HZ39C
HDF File Format is for Data.
• PDF for Document, HDF for Data
• Why PDF over MS Word DOC?
– Free, Portable, Sharing & Archiving
• Why HDF over MS Excel XLS(X)?
– Free, Portable, Sharing & Archiving
• HDF: HDF4 & HDF5
HDF4 is an “old” format.
• Old = Large volume over long time
• Old = Limitation (32-bit)
• Old = More difficult to sustain
HDF4 is old. So What?
• Convert it to HDF5.
Any alternative?
Cloudification!
Cloudification - Wiktionary
"The conversion and/or migration of data and application programs in order to make use of cloud computing"
Why Cloud? AI + Big Data + Cloud =
ABC Example: El Nino Detection
Cloudification is cool, but how?
Use the HDF4 File Content Map.
(HDF4 objects covered by the map: Group, Array, Table, Attribute, Palette)
What is an HDF4 Map?
An XML (ASCII) file that maps the content of a binary HDF4 file.
<h4:Array name="c" path="/a/b/" nDimensions="4">
<h4:dataDimensionSizes>180 8 32 4</h4:dataDimensionSizes>
<h4:chunks>
<h4:chunkDimensionSizes>1 8 32 4</h4:chunkDimensionSizes>
<h4:fillValues value="-9999.000000" chunkPositionInArray="[0,0,0,0]"/>
<h4:byteStream offset="70798703" nBytes="2468" chunkPositionInArray="[114,0,0,0]"/>
<h4:byteStream offset="89101024" nBytes="32" chunkPositionInArray="[172,0,0,0]"/>
<h4:byteStream offset="89127527" nBytes="32" chunkPositionInArray="[173,0,0,0]"/>
</h4:chunks>
</h4:Array>
It is a map with addresses.
<h4:Array name="c" path="/a/b/" nDimensions="4">
<h4:dataDimensionSizes>180 8 32 4</h4:dataDimensionSizes>
<h4:chunks>
<h4:chunkDimensionSizes>1 8 32 4</h4:chunkDimensionSizes>
<h4:fillValues value="-9999.000000" chunkPositionInArray="[0,0,0,0]"/>
…
<h4:byteStream offset="70798703" nBytes="2468" chunkPositionInArray="[114,0,0,0]"/>
<h4:byteStream offset="89101024" nBytes="32" chunkPositionInArray="[172,0,0,0]"/>
<h4:byteStream offset="89127527" nBytes="32" chunkPositionInArray="[173,0,0,0]"/>
</h4:chunks>
</h4:Array>
(The chunkPositionInArray values are addresses in the data; the offset/nBytes values are addresses in the file.)
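To make the addresses concrete, here is a minimal sketch (the h4 namespace URI and file names are assumptions, not values from the slides) that parses a content map with Python's standard library and reads one array's chunks straight from the HDF4 binary using only the recorded offset and nBytes:

import xml.etree.ElementTree as ET

# "h4" namespace as used in the map file; this URI is an assumption about the map schema.
H4 = "{http://www.hdfgroup.org/HDF4/XML/schema/HDF4map/1.0.1}"

def read_chunks(map_file, hdf4_file, array_name):
    # Read every stored chunk of one array directly from the HDF4 binary,
    # using only the offsets and byte counts recorded in the content map.
    tree = ET.parse(map_file)
    chunks = []
    with open(hdf4_file, "rb") as f:
        for array in tree.iter(H4 + "Array"):
            if array.get("name") != array_name:
                continue
            for bs in array.iter(H4 + "byteStream"):
                f.seek(int(bs.get("offset")))        # address in the file
                raw = f.read(int(bs.get("nBytes")))  # length of the chunk
                chunks.append((bs.get("chunkPositionInArray"), raw))
    return chunks

# Example with hypothetical file names:
# for pos, raw in read_chunks("granule.hdf.xml", "granule.hdf", "c"):
#     print(pos, len(raw), "bytes")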
<h4:Array name="c" path="/a/b/" nDimensions="4">
<h4:dataDimensionSizes>180 8 32 4</h4:dataDimensionSizes>
<h4:chunks>
<h4:chunkDimensionSizes>1 8 32 4</h4:chunkDimensionSizes>
<h4:fillValues value="-9999.000000" chunkPositionInArray="[0,0,0,0]"/>
…
<h4:byteStream offset="70798703" nBytes="2468" chunkPositionInArray="[114,0,0,0]"/>
<h4:byteStream offset="89101024" nBytes="32" chunkPositionInArray="[172,0,0,0]"/>
<h4:byteStream offset="89127527" nBytes="32" chunkPositionInArray="[173,0,0,0]"/>
</h4:chunks>
</h4:Array>
The byte size in the map is quite useful: bigger chunks may have more information.
• The fill-value chunk at [0,0,0,0] has nothing interesting.
• The 2468-byte chunk at [114,0,0,0] may have useful information.
• The two 32-byte chunks at [172,0,0,0] and [173,0,0,0] may have the same information.
(See the sketch below for a map-only triage of chunks by size.)
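A rough sketch of that triage, reading only the map (the namespace URI and the 64-byte "small" threshold are assumptions, not values from the slides):

import xml.etree.ElementTree as ET

H4 = "{http://www.hdfgroup.org/HDF4/XML/schema/HDF4map/1.0.1}"  # assumed namespace URI

def triage_chunks(map_file, small=64):
    # Classify each array's chunks using only the byte sizes recorded in the map:
    # fill-value chunks hold nothing, small chunks are highly repetitive (well compressed),
    # and the largest chunks are the most likely to carry real information.
    tree = ET.parse(map_file)
    for array in tree.iter(H4 + "Array"):
        fills = len(array.findall(".//" + H4 + "fillValues"))
        sizes = [int(b.get("nBytes")) for b in array.iter(H4 + "byteStream")]
        big = [s for s in sizes if s > small]
        print(array.get("path", "") + array.get("name", ""),
              "fill-only:", fills, "small:", len(sizes) - len(big), "big:", len(big))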
Run data analytics on maps.
Compute checksums and use Elasticsearch & Kibana.
[Figure: frequency distribution of chunk checksums in Kibana.]
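The slides do not show the author's script, but a minimal sketch of the same idea might look like this, assuming the namespace URI, a local Elasticsearch endpoint, and an index named "chunks":

import hashlib
import json
import urllib.request
import xml.etree.ElementTree as ET

H4 = "{http://www.hdfgroup.org/HDF4/XML/schema/HDF4map/1.0.1}"  # assumed namespace URI
ES = "http://localhost:9200/chunks/_doc"                        # assumed Elasticsearch endpoint

def index_chunk_checksums(map_file, hdf4_file):
    # Compute an MD5 checksum for every chunk listed in the map and index one
    # document per chunk, so Kibana can plot the frequency distribution of checksums.
    tree = ET.parse(map_file)
    with open(hdf4_file, "rb") as f:
        for bs in tree.iter(H4 + "byteStream"):
            f.seek(int(bs.get("offset")))
            md5 = hashlib.md5(f.read(int(bs.get("nBytes")))).hexdigest()
            doc = {"file": hdf4_file,
                   "position": bs.get("chunkPositionInArray"),
                   "nBytes": int(bs.get("nBytes")),
                   "md5": md5}
            req = urllib.request.Request(ES, json.dumps(doc).encode(),
                                         {"Content-Type": "application/json"})
            urllib.request.urlopen(req).close()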
Some chunks are repeated.
A single HDF4 file has 160+ chunks of the same data.
(Chunks with the same checksum have the same data.)
At the collection level, it scales up.
Hundreds of HDF4 files have 16K chunks of the same data.
Elasticsearch with maps can help users locate the HDF4 file of interest.
[Figure: HDFView renderings of the smallest chunk (nothing interesting) and the largest chunk (most interesting).]
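Building on the indexing sketch above (same assumed endpoint and index name), a query like the following sorts chunk documents by nBytes so the least and most interesting chunks surface first:

import json
import urllib.request

# Same assumed endpoint and index name as the indexing sketch above.
query = {
    "size": 5,
    "sort": [{"nBytes": "desc"}],  # largest chunks first; use "asc" for the least interesting
    "query": {"match_all": {}},
}
req = urllib.request.Request("http://localhost:9200/chunks/_search",
                             json.dumps(query).encode(),
                             {"Content-Type": "application/json"})
hits = json.load(urllib.request.urlopen(req))["hits"]["hits"]
for hit in hits:
    src = hit["_source"]
    print(src["file"], src["position"], src["nBytes"])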
Store chunks as cloud objects
• Reduce storage cost (e.g., S3) by avoiding redundancy (see the sketch below).
• Make each chunk searchable through a search engine.
• Run cloud computing on chunks of interest.
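One possible realization of the first bullet, assuming boto3 and a hypothetical bucket name: store each distinct chunk once, keyed by its checksum, so chunks repeated within a file or across the collection are uploaded only once.

import hashlib
import boto3  # assumed dependency for talking to S3

s3 = boto3.client("s3")
BUCKET = "hdf4-chunk-store"  # hypothetical bucket name
seen = set()                 # checksums already uploaded

def store_unique_chunks(chunks):
    # Upload each chunk once, keyed by its MD5 checksum, so identical chunks
    # within a file or across the whole collection are stored only once.
    for position, raw in chunks:          # e.g. the (position, bytes) pairs read via the map
        key = hashlib.md5(raw).hexdigest()
        if key in seen:
            continue                      # redundant chunk: no extra storage cost
        s3.put_object(Bucket=BUCKET, Key=key, Body=raw)
        seen.add(key)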
Shallow Web is not Enough
• NASA Earthdata search is too shallow.
• Index HDF4 data using maps and build a deep web.
• Provide a search interface for the deep web.
• Frequently searched data can be cached as cloud objects.
• Users can run cloud computing on cached objects in real time.
• Verify results with HDF4 archives from NASA data centers.
HDF: Antifragile Solution for BACC
(BACC = Big Data Analytics in Cloud Computing)
1. Use the HDF archive as is. Create maps for HDF.
2. Maps can be indexed and searched.
3. ELT (Extract, Load, Transform) only the relevant data from HDF into the cloud.
4. Offset/length-based file I/O is universal - all existing BACC solutions will work, with no dependency on HDF APIs (see the byte-range sketch below).
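As a minimal sketch of that offset/length I/O, any HTTP server (or S3 endpoint) that honors Range headers can serve a single chunk without any HDF4 API; the URL below is hypothetical and the offset and length echo the map excerpt above.

import urllib.request

def read_chunk(url, offset, nbytes):
    # Fetch one chunk with a standard HTTP Range request; no HDF4 library involved.
    req = urllib.request.Request(
        url, headers={"Range": "bytes=%d-%d" % (offset, offset + nbytes - 1)})
    with urllib.request.urlopen(req) as resp:  # expect 206 Partial Content
        return resp.read()

# Hypothetical URL; offset and length are the first byteStream from the map excerpt above:
# raw = read_chunk("https://example.org/data/granule.hdf", 70798703, 2468)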
Future Work
1. HDF5 Mapping Project?
2. Use HDF Product Designer for archiving cloud objects
and analytics results in HDF5.
3. Re-map: To metadata is human, to data is divine. For the same binary object, a user can easily re-define the meaning of the data, re-index it, search it, and analyze it (e.g., serve the same binary data in Chinese, Spanish, Russian, etc.).
This work was supported by
NASA/GSFC under Raytheon Co.
contract number NNG15HZ39C


Editor's Notes

  1. Good morning, everyone! My name is Joe Lee and I'm a software engineer at The HDF Group. Although I have attended past ESIP meetings regularly, I could not travel this summer. The ESIP meeting is a great place to learn and share new ideas and technologies through face-to-face conversation, so I apologize for presenting my new idea over telecon.
  2. Although you may have heard about Hierarchical Data Format, let me start my presentation by giving a very short introduction to HDF. HDF is similar to PDF in many ways as a free and portable binary format, although the brand power of HDF is much weaker than that of PDF. Everybody knows that PDF is for publishing documents. HDF is for publishing data of any size – big or small. For example, NASA has used HDF for several decades to archive big data such as Earth observations because it is good for sharing and archiving. HDF has two incompatible formats called HDF4 and HDF5. As the numbers indicate, HDF4 is the old format and HDF5 is the relatively new format.
  3. The idea that I am going to present today is mainly about HDF4 because HDF4 is old. I cannot tell exactly how old HDF4 is because I don't want to discriminate against any file format based on its age. Old can mean many different things – both good and bad. For example, old means a large volume of Earth data has been archived in HDF4. Old also means that HDF4 has some limits that have already been overcome by today's technology. As technology advances very fast, you'll see fewer tools that support HDF4. I put in an image of a CD player because HDF4 reminds me of the CD player in my 20-year-old car. In 1995, I had to pay extra money for it as a premium car audio option.
  4. Last November, my 20-year-old car finally broke down after racking up 250 thousand miles, so I went shopping for a new car. I was surprised to learn that new cars do not have CD players any more. Instead, they have USB or SD memory card slots and they accept MP3 formats. I'm telling this story because the modernization of HDF4 data is necessary before it gets too old to sustain. Since HDF5 is not backward-compatible with HDF4, HDF5 users need to convert HDF4 to HDF5 if their tools do not support HDF4. The HDF Group already provides the h4toh5 conversion tool. This is a good solution as long as you are willing to convert millions of HDF4 files into HDF5 files.
  5. Thinking about the future alternative, like a Tesla that can stream music from the cloud, I think Earth data streaming from the cloud is the way to go. So, converting HDF4 to HDF5 is an OK solution, but I think there should be an alternative if we'd like to modernize old HDF4 data in the cloud age.
  6. I found the word “Cloudification” and I like it a lot. Wiktionary defines it as “The conversion ….”
  7. Why does cloud computing matter? I think I don’t have to explain it any more thanks to IBM Watson and Google AlphaGo. When combined with AI and big data, cloud computing can do amazing things like beating human experts.
  8. For another example, last winter, I was involved in a project called the data container study. I ran a machine learning experiment with 20 years of NASA sea surface temperature data near Peru, from 1987 to 2008, using the Open Science Data Cloud, and I could detect an anomaly quickly, in a few seconds. The result matched nicely with the El Nino in 1998. The Open Science Data Cloud was very convenient and fast.
  9. What I also learned from the data container study is that efficient I/O is the key. OSDC provides 200 terabytes of public data in HDF4 format. However, they are not directly usable for me because OSDC does not provide any search interface to the collection similar to NASA Earthdata search. OSDC only provides a list of the HDF4 file names available, and all I can do is transfer a collection of HDF4 files "as is" from cloud storage to the computing nodes. This is horribly inefficient because I need a way to search and filter only the relevant data to speed up my data analytics at the collection level. Thus, I came up with an idea to use the HDF4 file content map to maximize the utilization of cloud computing. A single binary HDF4 file can have multiple data objects represented as arrays, groups, tables, attributes, and so on. Each object can be precisely located using the offset from the beginning of the file and the number of bytes to read, as recorded in the HDF4 file content map. The rationale is that if only the relevant objects can be searched and loaded into the data analytics engine, you can reduce the amount of I/O and thus get the result much faster. Without shredding thousands of HDF4 files into objects with HDF4 maps, you must load 200 TB of data into computing nodes, process it, and throw it away. You must repeat this for different analytics jobs. You need to wait days for the I/O operation while the actual data analytics takes only a few seconds.
  10. So what is the HDF4 file content map that I'm talking about? It is an XML file that maps the content of the HDF4 binary file. Unless you're a hacker working for the NSA, it's hard to know what's inside the HDF4 binary file as shown in the slide. An HDF4 binary file is a long stream of bytes, and the HDF4 map file can tell you how to interpret the stream correctly.
  11. Interpreting binary data is possible because the file content map is full of addresses. In HDF, a dataset can be organized into chunks for efficient I/O, and the HDF4 map can tell where you can find a chunk of data. The chunk position in the array is a good indication of where the data is located on Earth if the dataset is a grid. By fully disclosing the offset and the number of bytes to read from the binary file, you don't need HDF4 libraries to access a chunk of data.
  12. If you read the file content map carefully, you can find some interesting patterns from the byte size of each chunk. The fillValues XML tag indicates that there's nothing to be analyzed in the chunk. A small chunk size indicates that the chunk contains a lot of repeated information, so it can be compressed well. A big chunk size indicates that the chunk has more information than the other, compressed chunks.
  13. To find useful data from a huge collection of HDF4 files in OSDC, I ran Elasticsearch on the chunks and visually inspected the frequency distribution of checksums with Kibana after computing the MD5 checksum of each chunk. MD5 checksums on individual chunks are not provided by h4mapwriter yet, so I created a separate script in Python.
  14. Running some analytics on an HDF4 file using the HDF4 map was a lot of fun. It revealed that the same chunk of data is repeated within an HDF4 file.
  15. At the collection level, it scales up nicely. Hundreds of HDF4 files have 16K chunks of the same data. This makes sense because some observations from the Earth will stay the same for a long period of time.
  16. Once the index is built with Elasticsearch, I can easily run a query to find a dataset that I'm interested in using the byte size information. For example, I could sort the datasets by byte size from the smallest to the largest over hundreds of HDF4 files. As expected, the smallest dataset showed almost nothing when it was visualized with HDFView. The largest dataset returned a colorful image.
  17. Based on the HDF4 map information, I learned that it is possible to re-organize the entire collection of HDF4 data to optimize the use of cloud storage. If you optimize the data organization, the ETL time for cloud computing will be shortened and the cost of storage will also be reduced. If you can build a search engine on top of those objects, advanced HDF4 users can run cloud computing directly on the HDF chunks of their interest after filtering out irrelevant data based on the search result. Users can always transform HDF chunks into another format such as Apache Parquet or JSON to meet their cloud computing needs.
  18. From the Elasticsearch experiment with HDF4 maps, I now have a new wish-list item for NASA Earthdata search. Although I like the new and improved NASA Earthdata search, I still think it's too shallow because it does not index what's inside the granules. If Earthdata search could index HDF4 maps and provide a search interface, a collection at the chunk level could be returned for a user's query. I'd like to call such a search service deep web search. For a chunk collection that deep web search returns, the user can stream chunks to the user's cloud storage. Here, the key is to deliver the chunk collection to the user's cloud service provider. Downloading the entire HDF4 data does not make sense in this workflow. Then, the user can run an analytics job using cloud computing on the streamed chunks. If necessary, users can go back to the original HDF4 archives and run the same analytics using the traditional off-cloud method.
  19. In summary, data archived in HDF is ready for big data analytics for the access patterns that data producers prescribed. The prescribed pattern may not match exactly what users need. For such a use case, HDF maps can be indexed and searched to identify the relevant pieces of HDF. I call it an antifragile solution because any big data analytics solution in any computer language in any cloud computing environment will work. For example, I could read data over the network in PHP using an Apache web server that supports byte-range requests, and it worked pretty nicely. I picked PHP because no PHP binding exists for HDF. Relying on a single monolithic library to access data is too fragile.
  20. You may wonder if the same solution can be applied to HDF5 or netCDF-4. Unfortunately, there is no HDF5 mapper tool yet. How can a user save a collection of chunks in HDF5 easily for future use? I think HDF Product Designer is a good candidate for creating a new HDF5 file from chunk objects in the cloud. It can play the role of an h4toh5 conversion tool with on-demand, collection-level subset/aggregation capability. Finally, the HDF4 map idea has great potential as a flexible metadata solution. While binary is forever, metadata doesn't have to be. If you re-map the same binary data with a different dialect, you can serve a wider community that understands that dialect. One example is rewriting the HDF4 file content map in different languages. Then, international users can discover and access Earthdata more easily. Thank you for listening, and I hope that you can use HDF4 maps wisely in your next cloud computing project. Do you have any questions?