This document discusses using HDF4 file content maps to enable cloud computing capabilities for HDF4 files. HDF4 files contain scientific data but their large size and legacy format pose challenges. The document proposes creating XML maps that describe HDF4 file structure and contents, including chunk locations and sizes. These maps could then be indexed and searched to locate relevant data chunks. Only those chunks would need to be extracted to the cloud, avoiding unnecessary data transfers. This would allow HDF4 files to be queried and analyzed using cloud-based tools while reducing storage costs.
This document discusses NEON's use of HDF5 file format for its ecological data. The goals are to implement a fast and efficient file format, develop a standardized data delivery structure, and provide metadata. It describes the HDF5 file structure, metadata inclusion, and an example workflow for processing eddy covariance data into HDF5 files. Future work includes integrating R code for HDF5 file generation and embedding ecological metadata.
This document discusses using MATLAB for working with big data and scientific data formats. It provides an overview of MATLAB's capabilities for scientific data, including interfaces for HDF5 and NetCDF formats. It also describes how MATLAB can be used to access, analyze, and visualize big data from sources like Hadoop, databases, and RESTful web services. As a demonstration, it shows how MATLAB can access HDF5 data stored on an HDF Server through RESTful web requests and analyze the data using in-memory data types and functions.
This document discusses MATLAB support for scientific data formats and analytics workflows. It provides an overview of MATLAB's capabilities for accessing, exploring, and preprocessing large scientific datasets. These include built-in support for HDF5, NetCDF, and other file formats. It also describes datastore objects that allow loading large datasets incrementally for analysis. The document concludes with an example that uses a FileDatastore to access and summarize HDF5 data from NASA ice sheet surveys in a MapReduce workflow.
This document provides information about HDF (Hierarchical Data Format) tools and resources for working with Earth observation data. It summarizes HDF's focus on helping users at different stages of working with data, from initial product design to long-term archiving. It also describes specific HDF tools for viewing, comparing, converting between formats and adding metadata to scientific data files.
The HDF Group provides updates on new features in HDF including faster compression, single writer/multiple reader file access, virtual datasets, and dynamically loaded filters. They also discuss tools like HDFView, nagg for data aggregation, and a new HDF5 ODBC driver. The work is supported by NASA.
This document discusses new features and capabilities in MATLAB for working with scientific data formats and performing technical computing. It highlights enhancements to reading HDF5 and NetCDF files with both high-level and low-level interfaces. It also covers new capabilities for handling dates/times, big data, and accessing web services through RESTful APIs.
This document discusses incorporating ISO metadata standards into HDF files using the HDF Product Designer tool. It describes how the HDF Product Designer allows users to import pre-built ISO metadata components from a separate project into their HDF file designs. This allows essential contextual data or metadata to be stored in HDF5 files according to ISO 19115 standards.
This presentation discusses serving HDF5 data via OPeNDAP for efficient access and visualization. The HDF5 handler with CF and NcML modules translates HDF5 file layout and metadata to comply with CF conventions and provides the data via OPeNDAP to tools like Hyrax. Caching and optimizations improve performance. Future work includes supporting additional data types and file formats in the HDF5 handler.
John Readey presented on HDF5 in the cloud using HDFCloud. HDF5 can provide a cost-effective cloud infrastructure by paying for what is used rather than what may be needed. HDFCloud uses an HDF5 server to enable accessing HDF5 data through a REST API, allowing users to access large datasets without downloading entire files. It maps HDF5 objects to cloud object storage for scalable performance and uses Docker containers for elastic scaling.
This presentation discusses moving data and applications from HDF4 to HDF5/netCDF-4. It covers the differences between HDF4 and HDF5 data models and capabilities, tools for converting HDF4 data to HDF5, advantages of HDF5 like unlimited dimensions and compression, and ways to ensure compatibility with netCDF-4 like avoiding HDF5-specific features. The work was supported by a NASA contract.
This document discusses changes made to the MOPITT instrument's HDF data products over multiple versions. MOPITT measures carbon monoxide from the Terra spacecraft. Version 7 products, which use the HDF-EOS5 format without unlimited dimensions, were found to read faster than Version 6 products despite being larger in size. Experiments comparing read times of Version 7 data with and without unlimited dimensions showed only a small (about 2%) difference in some cases, but up to 40% faster reads without unlimited dimensions on first access, with the gap shrinking over repeated reads. Eliminating unlimited dimensions thus improved access performance for HDF applications such as the MOPITT products.
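As a rough illustration of the design choice involved (not the MOPITT production code), the h5py sketch below creates one dataset with fixed dimensions and one with an unlimited, extendable first dimension; the file names, shapes, and chunk sizes are hypothetical.

```python
import numpy as np
import h5py

data = np.random.rand(100, 50).astype("f4")

with h5py.File("fixed_dims.h5", "w") as f:
    # Fixed-size dataset: extent known up front, simple storage layout
    f.create_dataset("retrievals", data=data)

with h5py.File("unlimited_dims.h5", "w") as f:
    # Unlimited first dimension: requires chunked storage and extra chunk indexing,
    # which is the kind of overhead avoided by dropping unlimited dimensions
    dset = f.create_dataset("retrievals", shape=(100, 50), maxshape=(None, 50),
                            chunks=(100, 50), dtype="f4")
    dset[...] = data
    dset.resize((200, 50))  # can grow later along the unlimited dimension
```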
The document discusses enhancing the Geospatial Data Abstraction Library (GDAL) to improve accessibility and interoperability of NASA data products with GIS tools. It developed plugins for three NASA data products to demonstrate reading multidimensional datasets into GIS applications like ArcGIS. Next steps include providing outreach, enhancing the framework to be more flexible and production-ready, and developing guides to help other data centers build GDAL plugins to address their issues with geospatial data. The overall goal is to improve analysis and visualization of NASA scientific data in GIS tools and web applications.
The document discusses the Product Designer Hub, which aims to take the HDF data format to the web by developing a system with the following key components:
1. A data store to manage user accounts, projects, file structures and metadata.
2. A RESTful server to enable exporting HDF files and data to formats such as JSON and CSV, and importing from sources such as XML, MATLAB, and Python.
3. A web service that can ingest metadata from various XML dialects, transform it to HDF groups and attributes, and generate HDF files, JSON, and code templates from the metadata.
4. A client-server architecture where client software such as web applications can access HDF files and data through the REST API.
ICESat-2 is a NASA satellite mission scheduled to launch in December 2017 that will use photon counting laser altimetry to measure ice sheet and sea ice elevations. It will carry an advanced laser altimeter that splits each laser pulse into 6 beams in a cross-track pattern to provide dense sampling. This will allow for improved elevation estimates over rough terrain. The document discusses ICESat-2's science objectives, measurement concept, data products, processing workflow, and approach to managing metadata across different levels of data products.
Kitware uses HDF as a widely adopted data format for scientific computing and visualization across several domains. HDF supports climate modeling, geospatial data, medical imaging, and more. Kitware is looking to improve HDF support for streaming big data, cloud computing, and web applications to enable more advanced analytics and sharing of scientific data. Future work may include pure JavaScript implementations of HDF tools and optimizing performance for cloud storage.
The document summarizes updates on Hierarchical Data Formats (HDF) software releases and tools. It discusses the latest releases of HDF5 1.8.19 and 1.10.1, compatibility issues when moving to newer versions, updates on tools like HDF-Java and HDFView 3.0, supported compilers and systems, and a new compression library for interoperability. It invites readers to provide feedback on their needs.
This document discusses two HDF5-based file formats for storing Earth observation data:
1. The Sorted Pulse Data (SPD) format stores laser scanning data including pulses and point data with attributes. It was created in 2008 and updated to version 4 to improve flexibility.
2. The KEA image file format implements the GDAL raster data model in HDF5, allowing large raster datasets and attribute tables to be stored together with compression. It was created in 2012 to address limitations of other formats.
Both formats take advantage of HDF5 features like compression but also discuss some limitations and lessons learned for effectively designing scientific data formats.
Aashish Chaudhary gave a presentation on Kitware's work with scientific computing and visualization using HDF. HDF is a widely used data format at Kitware for domains like climate modeling, geospatial visualization, and information visualization. Kitware is looking to improve HDF support for cloud and web environments to enable streaming analytics and web-based data analysis. The company also aims to further open source collaboration and scientific computing.
The document discusses MODIS Land products and their distribution format. MODIS Land products include radiation budget, ecosystem, and land cover variables. Products are distributed in HDF-EOS format with fine resolution grids in Integerized Sinusoidal or Lambert Azimuthal Equal-Area projections and coarse grids in a geographic Climate Modeling Grid. HDF-EOS allows for collaboration and standard geolocation representation but toolkit support is still limited.
Harris Corporation provides geospatial software and analytics tools to access and analyze scientific data from remote sensing platforms. Their ENVI and IDL software support common data formats like HDF and NetCDF and provide capabilities for calibration, bowtie correction, reprojection, and visualization of data from sensors including GOES-16, VIIRS, and ocean and weather satellites. The tools allow scientists and analysts to efficiently process large volumes of earth observation data and extract valuable information to support applications in weather forecasting, agriculture, infrastructure monitoring, and more.
This document discusses how HDF Product Designer (HPD) uses templates to achieve interoperability. HPD is an application for consistently developing interoperable data content in HDF5 files. It has a client-server architecture and desktop app. Templates allow users to copy design examples that incorporate best practices and are curated by the HPD development team. Available templates include NCEI collections and CF templates, with more to be added based on community review and suggestions. Templates allow users to initialize new designs by mixing and matching content from different template examples.
The document describes the HDF Product Designer software tool. It was created to facilitate the design of interoperable scientific data products in HDF5 format. The tool allows intuitive editing of HDF5 objects and supports conventions like CF and ACDD. It also provides validation services to test file compliance. The goal is to help scientists design data products that follow standards and are easy for others to use.
The HDF Group provides software for managing large, complex data and services to support users of this technology. It derives most of its revenue from projects related to earth science, including supporting HDF-EOS, JPSS, and other earth science projects. It maintains various tools for working with HDF files and conducts maintenance, support, and development activities to support new versions and capabilities of HDF libraries and software.
This document summarizes the work done to enhance the Geospatial Data Abstraction Library (GDAL) to better support NASA Earth Observing System (EOS) data products. It describes three phases of work: 1) a proof-of-concept ArcGIS plugin for product-specific HDF drivers, 2) generalized HDF drivers and an XML format, and 3) collaboration with GDAL developers utilizing HDF drivers and a Virtual Format (VRT) specification. The third phase highlights include enhanced generic functions, coordination with GDAL developers, testing across GIS clients, outreach to other data centers, and building tutorials. Future work areas are also outlined.
The document discusses Esri's tools and roadmap for working with multi-dimensional (MD) scientific data in ArcGIS. It outlines Esri's efforts to directly read HDF, GRIB, and netCDF files as raster layers or feature/table views in ArcGIS. MD mosaic datasets allow users to manage variables and dimensions across multiple files and perform on-the-fly computations and visualization of MD data. New functions have been added to improve MD data analysis and visualization, including a vector field renderer to depict raster data as vectors. Esri is also working to better support OPeNDAP data sources.
This document discusses efforts to standardize data products from three NASA laser altimeter missions - ICESat, ICESat-2, and MABEL. It describes designing similar data products for all three missions to promote interoperability. Products are being developed for ICESat data using lessons from MABEL. Code is also being generated to help create products from specifications to reduce development time. The goal is to make the multi-rate point data from all three missions easily accessible and usable.
This document discusses how ArcGIS supports scientific multidimensional data. It can directly ingest data in formats like netCDF, HDF, and GRIB, and represent the data as raster layers, feature layers, or tables. Users can visualize, analyze, and share the data through tools in ArcGIS Desktop and services. Python can also be used to extend analytical capabilities. ArcGIS is evolving to better support scientific data through capabilities like multidimensional raster and feature layers, on-the-fly processing, and disseminating content as web services.
This document discusses a pilot project to incorporate ISO 19115-2 metadata attributes at the granule level for the NASA SWOT mission. The metadata will be stored in HDF5 groups and generated in two ways - via XML serialization or XML style sheet conversions. The project aims to capture essential metadata attributes from the SWOT information architecture and ISO metadata model, and generate an HDF5 structure specification and example metadata snippets.
HDF Cloud Services aims to bring HDF5 to the cloud by defining a REST API for HDF5 and implementing related services. The HDF REST API allows HDF5 data to be accessed via HTTP requests and responses. H5serv is an open source reference implementation of the HDF REST API. The HDF Scalable Data Service (HSDS) is being developed to support large HDF5 repositories in a scalable, cost effective manner using object storage like AWS S3.
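A minimal sketch of what REST-style access can look like, using Python's requests library; the host name, domain path, and dataset UUID below are placeholders, and the exact endpoint layout should be checked against the HDF REST API documentation.

```python
import requests

base = "http://hsds.example.org"            # hypothetical HSDS/h5serv endpoint
domain = {"domain": "/shared/sample.h5"}    # hypothetical HDF5 "file" (domain)

# Get the root group and list its links
root_id = requests.get(f"{base}/", params=domain).json()["root"]
links = requests.get(f"{base}/groups/{root_id}/links", params=domain).json()["links"]

# Read a hyperslab of the first linked object without downloading the whole file
dset_id = links[0]["id"]
resp = requests.get(f"{base}/datasets/{dset_id}/value",
                    params={**domain, "select": "[0:10,0:10]"})
print(resp.json()["value"])
```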
These slides demonstrate how to use visualization and analysis tools such as IDV and GrADS to access HDF data via OPeNDAP.
To see animation in some slides, please visit:
http://hdfeos.org/workshops/ws13/presentations/day1/jxl_opendap_tutorial.ppt
This tutorial is designed for HDF5 users who already have some experience with HDF5.
It will cover advanced features of the HDF5 library for achieving better I/O performance and efficient storage. The following HDF5 features will be discussed: partial I/O, chunked storage layout, compression and other filters including new n-bit and scale+offset filters. Significant time will be devoted to the discussion of complex HDF5 datatypes such as strings, variable-length datatypes, array and compound datatypes.
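To make these features concrete, here is a small h5py sketch (one possible Python analogue of the examples the tutorial uses) showing chunked storage, gzip compression, partial I/O, and a compound datatype; the file and dataset names are invented.

```python
import numpy as np
import h5py

with h5py.File("advanced_features.h5", "w") as f:
    # Chunked, gzip-compressed dataset
    dset = f.create_dataset("temperature", shape=(1000, 1000), dtype="f8",
                            chunks=(100, 100), compression="gzip", compression_opts=4)

    # Partial I/O: write only one hyperslab (a single chunk's worth of data)
    dset[0:100, 0:100] = np.random.rand(100, 100)

    # Compound datatype mixing a fixed-length string, an int, and a float
    station_dt = np.dtype([("name", "S16"), ("id", "i4"), ("elevation", "f4")])
    stations = f.create_dataset("stations", shape=(3,), dtype=station_dt)
    stations[0] = (b"alpha", 1, 120.5)

    # String attribute on the dataset
    dset.attrs["units"] = "kelvin"
```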
This tutorial is designed for new HDF5 users. We will go over a brief history of HDF and HDF5 software, and will cover basic HDF5 Data Model objects and their properties; we will give an overview of the HDF5 Libraries and APIs, and discuss the HDF5 programming model. Simple C and Fortran examples, and Java tool HDFView will be used to illustrate HDF5 concepts.
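For readers who prefer Python to the C and Fortran examples mentioned above, a minimal sketch of the same basic data model (groups, datasets, attributes) might look like this; the file layout is purely illustrative.

```python
import numpy as np
import h5py

with h5py.File("intro_example.h5", "w") as f:
    grp = f.create_group("/experiment")                # a group, like a directory
    dset = grp.create_dataset("measurements",          # a dataset holding an array
                              data=np.arange(10, dtype="i4"))
    dset.attrs["description"] = "toy data"             # an attribute on the dataset

with h5py.File("intro_example.h5", "r") as f:
    print(f["/experiment/measurements"][:])            # read it back
```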
Unidata provides data services, tools, and cyberinfrastructure to advance Earth system science and broaden participation in the geosciences. It was created through a grassroots effort and is funded by the NSF and UCAR. Unidata's work is driven by science, education, technology, and social needs. It provides real-time data from various sources to over 260 sites worldwide and develops standards like netCDF and services like THREDDS to facilitate data sharing and access. Unidata is working to broaden its community through international collaborations and empowering users around the world.
This tutorial is designed for anyone who needs to work with data stored in HDF5 files. The tutorial will cover functionality and useful features of the HDF5 utilities h5dump, h5diff, h5repack, h5stat, h5copy, h5check and h5repart. We will also introduce a prototype of the new h52jpeg conversion tool and the recently released h5perf_serial tool used for performance studies. We will briefly introduce HDFView; details of HDFView and HDF-Java will be discussed in a separate tutorial.
The document discusses how the HDF team is enabling collaboration around data in the cloud while protecting data producers and users. It provides examples of how the US Geological Survey migrated Landsat data to AWS, decreasing processing times. It also outlines HDF's approach to flexible data structures, migration of data to local files, private and public clouds, and client/server architectures to access data across different locations and applications.
In this talk we will discuss what happens to data when it is written from the HDF5 application to an HDF5 file. This knowledge will help developers to write more efficient applications and to avoid performance bottlenecks.
HDF Cloud provides scalable HDF5 data access in the cloud. It uses a RESTful interface to store HDF5 files and metadata on object storage like AWS S3. This allows datasets to be accessed elastically from anywhere, avoiding hardware costs while gaining redundancy, scalability and other cloud benefits. The architecture maps HDF5 objects to individual storage objects, caching frequently used data in memory for high performance. Client libraries provide transparent access whether files are local or remote.
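As one hedged example of such transparent client access, the h5pyd package mirrors the h5py API while talking to HDF Server over REST; the endpoint, domain path, and dataset name below are placeholders.

```python
import h5pyd  # drop-in replacement for h5py that speaks the HDF REST API

# Open a server-side "file" (domain); only the requested slices travel over HTTP
with h5pyd.File("/shared/climate/sample.h5", "r",
                endpoint="http://hsds.example.org") as f:
    dset = f["/grid/temperature"]          # hypothetical dataset path
    subset = dset[0, 100:200, 100:200]     # partial read served from object storage
    print(subset.mean())
```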
Fast partial access to objects from very large files in the SDSC Storage Resource Broker (SRB) can be extremely challenging, even when those objects are small. The HDF-SRB project integrates the SRB and the NCSA Hierarchical Data Format (HDF5) to create an access mechanism within the SRB that can be orders of magnitude more efficient than current methods for accessing object-based file formats.
The project provides interactive and efficient access to datasets, or subsets of datasets, in large files without bringing entire files onto local machines. A new set of data structures and APIs has been implemented in the SRB to support such object-level data access. A working prototype of the HDF5-SRB data system has been developed and tested, and the SRB support is implemented in HDFView as a client application.
This document discusses how to optimize HDF5 files for efficient access in cloud object stores. Key optimizations include using large dataset chunk sizes of 1-4 MiB, consolidating internal file metadata, and minimizing variable-length datatypes. The document recommends creating files with paged aggregation and storing file content information in the user block to enable fast discovery of file contents when stored in object stores.
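A sketch of creating a file along these lines with recent h5py, which exposes the library's file-space strategy and page-size options; the page size and chunk shape below are illustrative, not prescriptive.

```python
import h5py

# Paged aggregation packs internal metadata into fixed-size pages, which makes
# ranged GETs against an object store much more effective.
with h5py.File("cloud_optimized.h5", "w",
               fs_strategy="page", fs_page_size=4 * 1024 * 1024) as f:
    # Large chunks (here just under 4 MiB of float32 per chunk, in line with the
    # 1-4 MiB guidance above) reduce the number of object-store requests per read.
    f.create_dataset("sst", shape=(365, 720, 1440), dtype="f4",
                     chunks=(1, 720, 1440), compression="gzip")
```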
HDFS (Hadoop Distributed File System) is designed to store very large files across commodity hardware in a Hadoop cluster. It partitions files into blocks and replicates blocks across multiple nodes for fault tolerance. The document discusses HDFS design, concepts like data replication, interfaces for interacting with HDFS like command line and Java APIs, and challenges related to small files and arbitrary modifications.
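Besides the command line and Java APIs mentioned above, Python bindings can also reach HDFS; a minimal sketch using pyarrow, assuming a local Hadoop client installation (the namenode host, port, and file path are placeholders):

```python
from pyarrow import fs

# Connect to the namenode (requires libhdfs from a local Hadoop installation)
hdfs = fs.HadoopFileSystem(host="namenode.example.org", port=8020)

# Files are split into large blocks and replicated across datanodes behind the scenes
with hdfs.open_input_stream("/data/observations/granule_0001.dat") as f:
    header = f.read(1024)   # read the first kilobyte

print(hdfs.get_file_info("/data/observations/granule_0001.dat").size)
```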
A preponderance of data from NASA's Earth Observing System (EOS) is archived in the HDF Version 4 (HDF4) format. The long-term preservation of these data is critical for climate and other scientific studies going many decades into the future. HDF4 is very effective for working with the large and complex collection of EOS data products. Unfortunately, because of the complex internal byte layout of HDF4 files, future readability of HDF4 data depends on preserving a complex software library that can interpret that layout. Having a way to access HDF4 data independent of a library could improve its viability as an archive format, and consequently give confidence that HDF4 data will be readily accessible forever, even if the HDF4 library is gone.
To address the need to simplify long-term access to EOS data stored in HDF4, a collaborative project between The HDF Group and NASA Earth Science Data Centers is implementing an approach to accessing data in HDF4 files based on the use of independent maps that describe the data in HDF4 files and tools that can use these maps to recover data from those files. With this approach, relatively simple programs will be able to extract the data from an HDF4 file, bypassing the need for the HDF4 library.
A demonstration project has shown that this approach is feasible. It involved an assessment of NASA's HDF4 data holdings and the development of a prototype XML-based layout mapping language, together with tools that read layout maps and use them to read HDF4 files. Future plans call for a second phase of the project, in which the mapping tools and XML schema are made production quality, the schema is integrated with existing XML metadata files at several data centers, and outreach activities are carried out to encourage and facilitate acceptance of the technology.
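As a rough sketch of the idea, assuming a map that records each data block's byte offset and length (the element and attribute names below are invented, not the actual mapping schema), a small program could pull one object's raw bytes out of an HDF4 file without linking against the HDF4 library:

```python
import xml.etree.ElementTree as ET
import numpy as np

# Hypothetical map layout: <dataset name="..."><block offset="..." nbytes="..."/></dataset>
tree = ET.parse("granule.hdf.map.xml")
blocks = tree.findall(".//dataset[@name='Temperature']/block")

raw = bytearray()
with open("granule.hdf", "rb") as f:
    for b in blocks:                       # read only the byte ranges the map points to
        f.seek(int(b.get("offset")))
        raw += f.read(int(b.get("nbytes")))

# Interpret the bytes using type/shape information the map would also carry
data = np.frombuffer(bytes(raw), dtype=">i2").reshape(720, 1440)
```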
This document provides an overview of HSDS (Highly Scalable Data Service), which is a REST-based service that allows accessing HDF5 data stored in the cloud. It discusses how HSDS maps HDF5 objects like datasets and groups to individual cloud storage objects to optimize performance. The document also describes how HSDS was used to improve access performance for NASA ICESat-2 HDF5 data on AWS S3 by hyper-chunking datasets into larger chunks spanning multiple original HDF5 chunks. Benchmark results showed that accessing the data through HSDS provided over 2x faster performance than other methods like ROS3 or S3FS that directly access the cloud storage.
This presentation discusses putting HDF5 files into Apache Spark for analysis. It describes the differences between traditional file systems and Hadoop's HDFS, and how Spark provides a more accessible way to exploit data parallelism without using MapReduce. The presentation outlines experiments loading HDF5 climate data files into Spark to calculate statistics. It suggests variations like providing a file list instead of traversing directories. The conclusion is that Spark can effectively analyze HDF5 files under the right circumstances but current methods are imperfect, and future work with The HDF Group could build better integrations.
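One plausible shape for such an experiment (not the presenters' actual code): give Spark a list of HDF5 file paths and compute per-file statistics with h5py inside the workers; the file paths and dataset name are placeholders.

```python
import h5py
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdf5-stats").getOrCreate()
sc = spark.sparkContext

files = ["/data/climate/tas_1990.h5", "/data/climate/tas_1991.h5"]  # hypothetical paths

def file_stats(path):
    # Each worker opens its file locally and reduces it to a small summary
    with h5py.File(path, "r") as f:
        arr = f["tas"][:]                      # hypothetical dataset name
    return (path, float(np.mean(arr)), float(np.max(arr)))

stats = sc.parallelize(files, numSlices=len(files)).map(file_stats).collect()
for path, mean, peak in stats:
    print(f"{path}: mean={mean:.2f} max={peak:.2f}")
```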
1) Uber uses Spark and Hadoop to process large amounts of transportation data in real-time and batch. This includes building pipelines to ingest trip data from databases into a data warehouse within 1-2 hours.
2) Paricon is Uber's first Spark application which infers schemas from raw JSON data, converts it to Parquet format for faster querying, and validates the results. It processes over 15TB of data daily.
3) Future work includes building a SQL-based ETL platform on Spark, open sourcing SQL-on-Hadoop, and creating a machine learning platform with Spark and a real-time analytics system called Apollo using Spark Streaming.
The document discusses Hadoop, its components, and how they work together. It covers HDFS, which stores and manages large files across commodity servers; MapReduce, which processes large datasets in parallel; and other tools like Pig and Hive that provide interfaces for Hadoop. Key points are that Hadoop is designed for large datasets and hardware failures, HDFS replicates data for reliability, and MapReduce moves computation instead of data for efficiency.
This document discusses strategies for storing and accessing HDF5 data files in cloud object storage like Amazon S3. It describes an HDF5 Virtual File Driver (VFD) developed by The HDF Group that allows reading HDF5 files directly from S3 without downloading. For better performance, the document recommends optimizing HDF5 files stored in S3 by using chunking, compression, and aggregating smaller files. It also introduces the HDF Cloud Schema which maps HDF5 objects to individual object storage objects for parallel access.
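A hedged sketch of reading straight from S3 with the read-only S3 (ros3) driver through h5py; this assumes an HDF5 build with the ROS3 VFD enabled and a publicly readable object, and the bucket URL and dataset path are placeholders.

```python
import h5py

url = "https://example-bucket.s3.us-east-1.amazonaws.com/granules/sample.h5"

# No download: the driver issues HTTP range requests for just the bytes it needs
with h5py.File(url, "r", driver="ros3") as f:
    dset = f["/science/temperature"]       # hypothetical dataset
    print(dset.shape, dset[0, :10])
```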
Cloudian releases CLOUDIAN HyperStore 5.1, which lets big data be used as "smart data" ahead of the full-scale adoption of IoT/M2M ~ CLOUDIAN HyperStore 5.1 software and appliances officially certified for Hadoop and the Hortonworks Data Platform, enabling petabyte-scale analytics ~
http://cloudian.jp/news/pressrelease_detail/press-release-34.html
Cloudian HyperStore Ushers in Era of Smart Data With Efficient, Scalable Storage for Internet of Things ~ With Hadoop and Hortonworks Data Platform Qualified on HyperStore 5.1 Software and Appliances, Customers Can Perform In-Place Data Analysis at Petabyte-Scale; Cloudian Becomes Hortonworks Certified Technology Partner ~
http://www.cloudian.com/news/press-releases/cloudian-hyperstore-5.1-ushers-in-era-of-smart-data.php
http://hortonworks.com/partner/cloudian/
http://hortonworks.com/wp-content/uploads/2014/08/Cloudian-Hortonworks-Solutions-Brief.pdf
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data (Safe Software)
Once in a while, there really is something new under the sun. The rise of cloud-hosted data has fueled innovation in spatial data storage, enabling a brand new serverless architectural approach to spatial data sharing. Join us in our upcoming webinar to learn all about these new ways to organize your data, and leverage data shared by others. Explore the potential of Cloud Native Geospatial Formats in your workflows with FME, as we introduce five new formats: COGs, COPC, FlatGeoBuf, GeoParquet, STAC and ZARR.
Learn from industry experts Michelle Roby from Radiant Earth and Chris Holmes from Planet about these cloud-native geospatial data formats and how they can make data easier to manage, share, and analyze. To get us started, they’ll explain the goals of the Cloud-Native Geospatial Foundation and provide overviews of cloud-native technologies including the Cloud-Optimized GeoTIFF (COG), SpatioTemporal Asset Catalogs (STAC), and GeoParquet.
Following this, our seasoned FME team will guide you through practical demonstrations, showcasing how to leverage each format to its fullest potential. Learn strategic approaches for seamless integration and transition, along with valuable tips to enhance performance using these formats in FME.
Discover how these formats are reshaping geospatial data handling and how you can seamlessly integrate them into your FME workflows and harness the explosion of cloud-hosted data.
In this talk we will examine how to tune HDF5 performance to improve I/O speed. The talk will focus on chunk and metadata caches, how they affect performance, and which HDF5 APIs that can be used for performance tuning.
Examples of different chunking strategies will be given. We will also discuss how to reduce file overhead by using special properties of the HDF5 groups, datasets and datatypes.
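In the C API the relevant knobs include H5Pset_cache and H5Pset_chunk_cache; h5py exposes the same raw-data chunk cache parameters at file-open time, as in this sketch (the cache sizes and dataset name are illustrative):

```python
import h5py

# Raw-data chunk cache: 64 MiB per open dataset, a prime number of hash slots,
# and a preference for evicting fully read/written chunks first (w0 near 1)
f = h5py.File("chunked_data.h5", "r",
              rdcc_nbytes=64 * 1024 * 1024,
              rdcc_nslots=100003,
              rdcc_w0=0.75)

dset = f["measurements"]          # hypothetical chunked dataset
block = dset[0:1000, 0:1000]      # whole chunks stay cached, avoiding re-decompression
f.close()
```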
This document discusses accelerating Spark workloads on Amazon S3 using Alluxio. It describes the challenges of running Spark interactively on S3 due to its eventual consistency and expensive metadata operations. Alluxio provides a data caching layer that offers strong consistency, faster performance, and API compatibility with HDFS and S3. It also allows data outside of S3 to be analyzed. The document demonstrates how to bootstrap Alluxio on an AWS EMR cluster to accelerate Spark workloads running on S3.
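A rough sketch of what the Spark side can look like once Alluxio fronts S3 (the cluster address and paths are placeholders, and the Alluxio client jar is assumed to be on Spark's classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alluxio-demo").getOrCreate()

# Reads go through the Alluxio cache tier instead of hitting S3 directly,
# giving consistent listings and cheaper repeated metadata operations
df = spark.read.parquet("alluxio://alluxio-master.example.org:19998/datasets/trips/")
print(df.count())

# Write results back through the same namespace
df.limit(1000).write.mode("overwrite").parquet(
    "alluxio://alluxio-master.example.org:19998/tmp/trips_sample/")
```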
A brief introduction to Hadoop distributed file system. How a file is broken into blocks, written and replicated on HDFS. How missing replicas are taken care of. How a job is launched and its status is checked. Some advantages and disadvantages of HDFS-1.x
The document discusses Parallel Computing with HDF Server. The key points are:
1. HDF Server (HSDS) allows efficient access to HDF5 data stored in AWS S3. It runs as containers on Kubernetes and supports parallel access across containers.
2. HSDS uses S3 as the data store for HDF5 files. Individual HDF5 objects like datasets and chunks are stored as separate S3 objects. This allows parallel read/write and only modifying what changes.
3. HDF Kita Lab is a hosted Jupyter environment on AWS that provides access to HSDS for reading and writing HDF5 data on S3. It allows scaling the server and provides tools for HDF5 on S3.
HDFS tiered storage: mounting object stores in HDFS (DataWorks Summit)
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, such as Microsoft Azure, and on-premises object stores, such as Western Digital’s ActiveScale. In these settings, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems for business continuity planning (BCP) and/or supporting hybrid cloud architectures to achieve the required business goals for durability, performance, and coordination.
To resolve this complexity, HDFS-9806 has added a PROVIDED storage tier to HDFS allowing mounting external namespaces, both object stores and other HDFS clusters. Building on this functionality, we can now allow remote namespaces to be synchronized with HDFS, enabling asynchronous writes to the remote storage and the possibility to synchronously and transparently read data back to a local application wanting to access file data which is stored remotely. This talk, which corresponds to the work in progress under HDFS-12090, will present how the Hadoop admin can manage storage tiering between clusters and how that is then handled inside HDFS through the snapshotting mechanism and asynchronously satisfying the storage policy.
Speakers:
Thomas Demoor, Object Storage Architect, Western Digital
Ewan Higgs, Software Engineer, Western Digital
Similar to Utilizing HDF4 File Content Maps for the Cloud Computing (20)
This document summarizes the current status and focus of the HDF Group. It discusses that the HDF Group is located in Champaign, IL and is a non-profit organization focused on developing and maintaining HDF software and data formats. It provides an overview of recent HDF5, HDF4 and HDFView releases and notes areas of focus for software quality improvements, increased transparency, strengthening the community, and modernizing HDF products. It invites support and participation in upcoming user group meetings.
This document provides an overview of HSDS (Highly Scalable Data Service), which allows HDF5 files to be stored and accessed from the cloud. Key points include:
- HSDS maps HDF5 objects like datasets and groups to individual cloud storage objects for scalability and parallelism.
- Features include streaming support, fancy indexing for complex queries, and caching for improved performance.
- HSDS can be deployed on Docker, Kubernetes, or AWS Lambda depending on needs.
- Case studies show HSDS is used by organizations like NREL and NSF to make petabytes of scientific data publicly accessible in the cloud.
This document discusses creating cloud-optimized HDF5 files by rearranging internal structures for more efficient data access in cloud object stores. It describes cloud-native and cloud-optimized storage formats, with the latter involving storing the entire HDF5 file as a single object. The benefits of cloud-optimized HDF5 include fast scanning and using the HDF5 library. Key aspects covered include using optimal chunk sizes, compression, and minimizing variable-length datatypes.
This document discusses updates and performance improvements to the HDF5 OPeNDAP data handler. It provides a history of the handler since 2001 and describes recent updates including supporting DAP4, new data types, and NetCDF data models. A performance study showed that passing compressed HDF5 data through the handler without decompressing/recompressing led to speedups of around 17-30x by leveraging HDF5 direct I/O APIs. This allows outputting HDF5 files as NetCDF files much faster through the handler.
This document provides instructions for using the Hyrax software to serve scientific data files stored on Amazon S3 using the OPeNDAP data access protocol. It describes how to generate ancillary metadata files called DMR++ files using the get_dmrpp tool that provide information about the data file structure and locations. The document explains how to run get_dmrpp inside a Docker container to process data files on S3 and generate customized DMR++ files that the Hyrax server can use to serve the files to clients.
This document provides an overview and examples of accessing cloud data and services using the Earthdata Login (EDL), Pydap, and MATLAB. It discusses some common problems users encounter, such as being unable to access HDF5 data on AWS S3 using MATLAB or read data from OPeNDAP servers using Pydap. Solutions presented include using EDL to get temporary AWS tokens for S3 access in MATLAB and providing code examples on the HDFEOS website to help users access S3 data and OPeNDAP services. The document also notes some limitations, such as tokens being valid for only 1 hour, and workarounds like requesting new tokens or using the MATLAB HDF5 API instead of the netCDF API.
The HDF5 Roadmap and New Features document outlines upcoming changes and improvements to the HDF5 library. Key points include:
- HDF5 1.13.x releases will include new features like selection I/O, the Onion VFD for versioned files, improved VFD SWMR for single-writer multiple-reader access, and subfiling for parallel I/O.
- The Virtual Object Layer allows customizing HDF5 object storage and introduces terminal and pass-through connectors.
- The Onion VFD stores versions of HDF5 files in a separate onion file for versioned access.
- VFD SWMR improves on legacy SWMR by implementing single-writer multiple-reader capabilities
This document discusses user analysis of the HDFEOS.org website and plans for future improvements. It finds that the majority of the site's 100 daily users are "quiet", not posting on forums or other interactive elements. The main user types are locators, who search for examples or data; mergers, who combine or mosaic datasets; and converters, who change file formats. The document outlines recent updates focused on these user types, like adding Python examples for subsetting and calculating latitude and longitude. It proposes future work on artificial intelligence/machine learning uses of HDF files and examples for processing HDF data in the cloud.
This document summarizes a presentation about the current status and future directions of the Hierarchical Data Format (HDF) software. It provides updates on recent HDF5 releases, development efforts including new compression methods and ways to access HDF5 data, and outreach resources. It concludes by inviting the audience to share wishes for future HDF development.
The document describes H5Coro, a new C++ library for reading HDF5 files from cloud storage. H5Coro was created to optimize HDF5 reading for cloud environments by minimizing I/O operations through caching and efficient HTTP requests. Performance tests showed H5Coro was 77-132x faster than the previous HDF5 library at reading HDF5 data from Amazon S3 for NASA's SlideRule project. H5Coro supports common HDF5 elements but does not support writing or some complex HDF5 data types and messages to focus on optimized read-only performance for time series data stored sequentially in memory.
This document summarizes MathWorks' work to modernize MATLAB's support for HDF5. Key points include:
1) MATLAB now supports HDF5 1.10.7 features like single-writer/multiple-reader access and virtual datasets through new and updated low-level functions.
2) Performance benchmarks show some improvements but also regressions compared to the previous HDF5 version, and work continues to optimize code and support future versions.
3) There are compatibility considerations for Linux filter plugins, but interim solutions are provided until MathWorks can ship a single HDF5 version.
HSDS provides HDF as a service through a REST API that can scale across nodes. New releases will enable serverless operation using AWS Lambda or direct client access without a server. This allows HDF data to be accessed remotely without managing servers. HSDS stores each HDF object separately, making it compatible with cloud object storage. Performance on AWS Lambda is slower than a dedicated server but has no management overhead. Direct client access has better performance but limits collaboration between clients.
HDF5 and Zarr are data formats that can be used to store and access scientific data. This presentation discusses approaches to translating between the two formats. It describes how HDF5 files were translated to the Zarr format by creating a separate Zarr store to hold HDF5 file chunks, and storing chunk location metadata. It also discusses an implementation that translates Zarr data to the HDF5 format by using a special chunking layout and storing chunk information in an HDF5 compound dataset. Limitations of the translations include lack of support for some HDF5 dataset properties in Zarr, and lack of support for some Zarr compression methods in the HDF5 implementation.
The document discusses HDF for the cloud, including new features of the HDF Server and what's next. Key points:
- HDF Server uses a "sharded schema" that maps HDF5 objects to individual storage objects, allowing parallel access and updates without transferring entire files.
- Implementations include HSDS software that uses the sharded schema with an API and SDKs for different languages like h5pyd for Python.
- New features of HSDS 0.6 include support for POSIX, Azure, AWS Lambda, and role-based access control.
- Future work includes direct access to storage without a server intermediary for some use cases.
This document compares different methods for accessing HDF and netCDF files stored on Amazon S3, including Apache Drill, THREDDS Data Server (TDS), and HDF5 Virtual File Driver (VFD). A benchmark test of accessing a 24GB HDF5/netCDF-4 file on S3 from Amazon EC2 found that TDS performed the best, responding within 2 minutes, while Apache Drill failed after 7 minutes. The document concludes that TDS 5.0 is the clear winner based on performance and support for role-based access control and HDF4 files, but the best solution depends on use case and software.
This document discusses STARE-PODS, a proposal to NASA/ACCESS-19 to develop a scalable data store for earth science data using the SpatioTemporal Adaptive Resolution Encoding (STARE) indexing scheme. STARE allows diverse earth science data to be unified and indexed, enabling the data to be partitioned and stored in a Parallel Optimized Data Store (PODS) for efficient analysis. The HDF Virtual Object Layer and Virtual Data Set technologies can then provide interfaces to access the data in STARE-PODS in a familiar way. The goal is for STARE-PODS to organize diverse data for alignment and parallel/distributed storage and processing to enable integrative analysis at scale.
This document provides an overview and update on HDF5 and its ecosystem. Key points include:
- HDF5 1.12.0 was recently released with new features like the Virtual Object Layer and external references.
- The HDF5 library now supports accessing data in the cloud using connectors like S3 VFD and REST VOL without needing to modify applications.
- Projects like HDFql and H5CPP provide additional interfaces for querying and working with HDF5 files from languages like SQL, C++, and Python.
- The HDF5 community is moving development to GitHub and improving documentation resources on the HDF wiki site.
This document summarizes new features in HDF5 1.12.0, including support for storing references to objects and attributes across files, new storage backends using a virtual object layer (VOL), and virtual file drivers (VFDs) for Amazon S3 and HDFS. It outlines the HDF5 roadmap for 2019-2022, which includes continued support for HDF5 1.8 and 1.10, and new features in future 1.12.x releases like querying, indexing, and provenance tracking.
The document discusses leveraging cloud resources like Amazon Web Services to improve software testing for the HDF group. Currently HDF software is tested on various in-house systems, but moving more testing to the cloud could provide better coverage of operating systems and distributions at a lower cost. AWS spot instances are being used to run HDF5 build and regression tests across different Linux distributions in around 30 minutes for approximately $0.02 per hour.
Google Colaboratory allows users to write and execute Python code in the cloud using Jupyter notebooks. It provides a free GPU and TPU for accelerating code. The document discusses how HDF-EOS is a standard format for satellite data and provides many examples for converting and processing HDF-EOS data. It then demonstrates how to install necessary packages and run an example zoo code in Colab to plot HDF-EOS data in the cloud without installing anything locally.
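A hedged sketch of that Colab workflow, assuming pyhdf is used to read an HDF-EOS2 (HDF4) granule; the file and dataset names are placeholders rather than the original zoo example:

# In a Colab cell, install the reader first:
#   !pip install pyhdf matplotlib
from pyhdf.SD import SD, SDC
import matplotlib.pyplot as plt

f = SD("MOD_granule.hdf", SDC.READ)      # placeholder HDF-EOS2 file name
sds = f.select("LST")                    # placeholder science dataset name
data = sds.get()
plt.imshow(data)
plt.colorbar()
plt.savefig("lst.png")                   # or plt.show() inside the notebook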
Utilizing HDF4 File Content Maps for the Cloud Computing
Slide 1
Utilizing HDF4 File Content Maps for the Cloud Computing
Hyokyung Joe Lee
The HDF Group
This work was supported by NASA/GSFC under Raytheon Co. contract number NNG15HZ39C
Slide 2
HDF File Format is for Data.
• PDF for Document, HDF for Data
• Why PDF over MS Word DOC?
– Free, Portable, Sharing & Archiving
• Why HDF over MS Excel XLS(X)?
– Free, Portable, Sharing & Archiving
• HDF: HDF4 & HDF5
Slide 3
HDF4 is an “old” format.
• Old = Large volume over long time
• Old = Limitation (32-bit)
• Old = More difficult to sustain
Slide 17
Store chunks as cloud objects
• Reduce storage cost (e.g., S3) by avoiding redundancy.
• Make each chunk searchable through a search engine.
• Run cloud computing on chunks of interest.
Slide 18
Shallow Web is not Enough
• NASA Earthdata search is too shallow.
• Index HDF4 data using maps and build a deep web.
• Provide a search interface for the deep web.
• Frequently searched data can be cached as cloud objects.
• Users can run cloud computing on cached objects in real time.
• Verify results against HDF4 archives from NASA data centers.
Slide 19
HDF: Antifragile Solution for BACC
(BACC = Bigdata Analytics in Cloud Computing)
1. Use the HDF archive as is. Create maps for HDF.
2. Maps can be indexed and searched.
3. ELT (Extract, Load, Transform) only the relevant data into the cloud from HDF.
4. Offset/length-based file I/O is universal: all existing BACC solutions will work. No dependency on HDF APIs.
Slide 20
Future Work
1. HDF5 Mapping Project?
2. Use HDF Product Designer for archiving cloud objects and analytics results in HDF5.
3. Re-map: To metadata is human, to data is divine. For the same binary object, a user can easily re-define the meaning of the data, re-index it, search it, and analyze it (e.g., serve the same binary data in Chinese, Spanish, Russian, etc.).
Good morning, everyone! My name is Joe Lee and I’m a software engineer at The HDF Group. Although I have attended past ESIP meetings regularly, I could not travel this summer. The ESIP meeting is a great place to learn and share new ideas and technologies through face-to-face conversation, so I apologize for presenting my new idea over telecon.
Although you may have heard about Hierarchical Data Format, let me start my presentation by giving a very short introduction to HDF.
HDF is similar to PDF in many ways as a free and portable binary format, although the brand power of HDF is much weaker than that of PDF.
Everybody knows that PDF is for publishing document. HDF is for publishing data of any size – big or small.
For example, NASA has used HDF for several decades to archive big data such as Earth observation data because it is good for sharing and archiving.
HDF has two incompatible formats called HDF4 and HDF5.
As the numbers indicate, HDF4 is the old format and HDF5 is the relatively new format.
The idea that I am going to present today is mainly about HDF4 because HDF4 is old.
I cannot tell you exactly how old HDF4 is because I don’t want to discriminate against any file format based on its age.
Old can mean many different things – both good and bad.
For example, old means that a large volume of Earth data has been archived in HDF4.
Old also means that HDF4 has some limits that are already overcome by today’s technology.
As the technology advances very fast, you’ll see fewer tools that support HDF4.
I put in an image of a CD player because HDF4 reminds me of the CD player in my 20-year-old car.
In 1995, I had to pay extra money for it as a premium car audio option.
Last November, my 20-year-old car finally broke down after racking up 250 thousand miles, so I went shopping for a new car.
I was surprised to learn that new cars do not have CD players any more.
Instead, they have USB or SD memory card slots and accept MP3 files.
I’m telling this story because the modernization of HDF4 data is necessary before it gets too old to sustain.
Since HDF4 is not compatible with HDF5, HDF5 users need to convert HDF4 files to HDF5 if their tools do not support HDF4.
The HDF Group already provides the h4toh5 conversion tool.
This is a good solution as long as you are willing to convert millions of HDF4 files into HDF5 files.
Thinking about future alternatives, like a Tesla that can stream music from the cloud, I think streaming Earth data from the cloud is the way to go.
So converting HDF4 to HDF5 is an OK solution, but I think there should be an alternative if we’d like to modernize old HDF4 data in the cloud age.
I found the word “Cloudification” and I like it a lot. Wiktionary defines it as “The conversion ….”
Why does cloud computing matter? I think I don’t have to explain it any more thanks to IBM Watson and Google AlphaGo. When combined with AI and big data, cloud computing can do amazing things like beating human experts.
For another example, last winter I was involved in a project called the data container study. I ran a machine learning experiment with 20 years of NASA sea surface temperature data near Peru, between 1987 and 2008, using the Open Science Data Cloud, and I could detect an anomaly in a few seconds. The result matched nicely with the 1998 El Niño. The Open Science Data Cloud was very convenient and fast.
What I also learned from data container study is that efficient I/O is the key.
OSDC provides 200 terabytes of public data in HDF4 format.
However, the data is not directly usable for me because OSDC does not provide a search interface to the collection similar to NASA Earthdata Search.
OSDC only provides a list of the available HDF4 file names, and all I can do is transfer a collection of HDF4 files “as is” from cloud storage to the computing nodes.
This is horribly inefficient because I need a way to search and filter only the relevant data to speed up my data analytics at the collection level.
Thus, I came up with an idea to use the HDF4 file content map to maximize the utilization of cloud computing. A single binary HDF4 file can have multiple data objects represented as arrays, groups, tables, attributes, and so on. Each object can be precisely located with the HDF4 file content map, using the offset from the beginning of the file and the number of bytes to read. The rationale is that if only the relevant objects can be searched and loaded into the data analytics engine, you can reduce the amount of I/O and thus get the result much faster. Without shredding thousands of HDF4 files into objects with HDF4 maps, you must load 200 TB of data into computing nodes, process it, and throw it away. You must repeat this for different analytics jobs. You need to wait days for I/O operations while the actual data analytics takes only a few seconds.
So what is the HDF4 file content map that I’m talking about? It is an XML file that maps the content of an HDF4 binary file.
Unless you’re a hacker working for the NSA, it’s hard to know what’s inside the HDF4 binary file shown in the slide.
An HDF4 binary file is a long stream of bytes, and the HDF4 map file can tell you how to decode the stream correctly.
Interpreting the binary data is possible because the file content map is full of addresses.
In HDF, a dataset can be organized into chunks for efficient I/O, and the HDF4 map can tell you where to find a chunk of data.
The chunk position in the array is a good indication of where the data is located on Earth if the dataset is a grid.
By fully disclosing the offset and the number of bytes to read from the binary file, the map lets you access a chunk of data without the HDF4 library.
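As a rough illustration of that offset/length access pattern, here is a sketch that pulls chunk addresses out of a map and fetches one chunk with an HTTP range request; the XML element and attribute names and the URL are illustrative, not the exact h4mapwriter schema:

# Sketch: read (offset, nBytes) pairs from a content map and fetch one raw
# chunk over HTTP without any HDF4 APIs.
import xml.etree.ElementTree as ET
import requests

def chunk_byte_streams(map_path):
    """Yield (offset, nbytes) for every byteStream-like element in the map."""
    root = ET.parse(map_path).getroot()
    for elem in root.iter():
        if elem.tag.endswith("byteStream"):
            yield int(elem.get("offset")), int(elem.get("nBytes"))

def fetch_chunk(url, offset, nbytes):
    """Read one chunk with an HTTP range request."""
    headers = {"Range": f"bytes={offset}-{offset + nbytes - 1}"}
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    return resp.content

for offset, nbytes in chunk_byte_streams("granule.hdf.map"):
    chunk = fetch_chunk("https://example.org/data/granule.hdf", offset, nbytes)
    print(offset, nbytes, len(chunk))
    break                                # fetch just the first chunk here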
If you read the file content map carefully, you can find some interesting patterns in the byte size of each chunk.
The fillValues XML tag indicates that there’s nothing to be analyzed in the chunk.
A small chunk size indicates that the chunk contains a lot of repeated information, so it compresses well.
A large chunk size indicates that the chunk carries more information than the other compressed chunks.
To find useful data in a huge collection of HDF4 files on OSDC, I computed the MD5 checksum of each chunk, indexed the chunks with Elasticsearch, and visually inspected the frequency distribution of checksums with Kibana.
MD5 checksums for individual chunks are not provided by h4mapwriter yet, so I created a separate script in Python.
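A sketch of what such a script might look like, assuming the elasticsearch-py 8.x client; the index name, field names, endpoint, and offsets are illustrative:

# Sketch: compute an MD5 per chunk from (offset, nbytes) pairs taken from a
# content map and index one record per chunk into Elasticsearch.
import hashlib
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_chunks(hdf4_path, byte_streams, granule_id):
    """byte_streams: iterable of (offset, nbytes) pairs from the map."""
    with open(hdf4_path, "rb") as f:
        for offset, nbytes in byte_streams:
            f.seek(offset)
            md5 = hashlib.md5(f.read(nbytes)).hexdigest()
            es.index(index="hdf4-chunks",
                     document={"granule": granule_id, "offset": offset,
                               "nbytes": nbytes, "md5": md5})

# Made-up offsets for illustration; in practice they come from the map file.
index_chunks("granule.hdf", [(294, 16384), (16678, 4096)], "granule_0001")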
Running some analytics on HDF4 files using the HDF4 map was very fun.
It revealed that the same chunk of data is repeated within an HDF4 file.
At the collection level, this scales up nicely.
Hundreds of HDF4 files contain the same 16K chunk of data.
This makes sense because some observations of the Earth will be the same for a long period of time.
Once the index is built with Elasticsearch, I can easily run a query to find a dataset that I’m interested in using the byte size information.
For example, I could sort dataset sizes from the smallest to the largest over hundreds of HDF4 files.
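A sketch of that kind of query against the illustrative "hdf4-chunks" index from the indexing sketch above, sorting chunk records by byte size:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.search(index="hdf4-chunks", sort=[{"nbytes": "asc"}], size=10)
for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(src["granule"], src["offset"], src["nbytes"], src["md5"])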
As expected, the smallest dataset by byte size showed almost nothing when visualized with HDFView.
The largest byte size dataset returned a colorful image.
Based on the HDF4 map information, I learned that it is possible to re-organize the entire collection of HDF4 data to optimize the use of cloud storage.
If you optimize the data organization, ETL time for cloud computing will be shortened and the cost of storage will also be reduced.
If you can build a search engine on top of those objects, advanced HDF4 users can run cloud computing directly on the HDF chunks they are interested in after filtering out irrelevant data based on the search results.
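One way such chunk objects could be stored is sketched below with boto3 and placeholder names, keying each object by its MD5 checksum so that identical chunks across files are stored only once:

# Sketch: store one HDF4 chunk as an S3 object, keyed by its MD5 so that
# duplicate chunks are written to the same key and stored once.
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "example-hdf4-chunks"           # placeholder bucket name

def store_chunk(hdf4_path, offset, nbytes):
    with open(hdf4_path, "rb") as f:
        f.seek(offset)
        chunk = f.read(nbytes)
    key = hashlib.md5(chunk).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=key, Body=chunk)
    return key

store_chunk("granule.hdf", offset=294, nbytes=16384)   # made-up values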
Users can always transform HDF chunks into other formats such as Apache Parquet or JSON to meet their cloud computing needs.
From the Elasticsearch experiment with the HDF4 map, I now have a new wish list for NASA Earthdata Search.
Although I like the new and improved NASA Earthdata Search, I still think it’s too shallow because it does not index what’s inside the granules.
If Earthdata Search could index HDF4 maps and provide a search interface, collections at the chunk level could be returned for a user’s query.
I’d like to call such a search service deep web search.
For a chunk collection that the deep web search returns, a user can stream chunks to their own cloud storage.
Here, the key is to deliver the chunk collection to the user’s cloud service provider.
Downloading entire HDF4 files does not make sense in this workflow.
Then the user can run an analytics job using cloud computing on the streamed chunks.
If necessary, users can go back to the original HDF4 archives and run the same analytics using the traditional off-cloud method.
In summary, data archived in HDF is ready for big data analytics for the access patterns that data producers prescribed.
The prescribed pattern may not match exactly what users need. For such a use case, HDF maps can be indexed and searched to identify the relevant pieces of HDF data.
I call it an antifragile solution because any big data analytics solution in any computer language in any cloud computing environment will work. For example, I could read data over the network in PHP using an Apache web server that supports byte-range requests, and it worked pretty nicely. I picked PHP because a PHP binding doesn’t exist for HDF. Relying on a single monolithic library to access data is too fragile.
You may wonder if the same solution can be applied to HDF5 or netCDF4.
Unfortunately, there is no HDF5 mapper tool yet.
How can a user easily save a collection of chunks in HDF5 for future use? I think HDF Product Designer is a good candidate for creating a new HDF5 file from chunk objects in the cloud. It can play the role of the h4toh5 conversion tool with on-demand, collection-level subset/aggregation capability.
Finally, the HDF4 map idea has great potential as a flexible metadata solution. While the binary data is forever, the metadata doesn’t have to be. If you re-map the same binary data with a different dialect, you can serve a wider community that understands that dialect. One example is rewriting the HDF4 file content map in different languages. Then international users can discover and access Earthdata more easily.
Thank you for listening and I hope that you can use HDF4 map wisely in your next cloud computing project.
Do you have any question?