Caching and Buffering in HDF5
The HDF Group

Nov. 6, 2007

HDF-EOS Workshop XI Tutorial

Software stack and the “magic box”
• Life cycle: what happens to data when it is transferred from an application buffer to an HDF5 file?

[Figure: software stack, top to bottom]
Application: data buffer
Object API: H5Dwrite
Library internals: magic box
Virtual file I/O: unbuffered I/O
File or other “storage”: data in a file

Inside the magic box
• Understanding what happens to data inside the magic box helps you write efficient applications
• The HDF5 library has mechanisms to control behavior inside the magic box
• Goals of this talk:
• Describe some basic operations and data structures and explain how they affect performance and storage sizes
• Give some “recipes” for how to improve performance

Topics
• Dataset metadata and array data storage layouts
• Types of dataset storage layouts
• Factors affecting I/O performance
• I/O with compact datasets
• I/O with contiguous datasets
• I/O with chunked datasets
• Variable length data and I/O

HDF5 dataset metadata and
array data storage layouts

HDF5 Dataset
• Data array
• Ordered collection of identically typed data items distinguished by their indices

• Metadata
• Dataspace: rank, dimensions of dataset array
• Datatype: information on how to interpret data
• Storage properties: how array is organized on disk
• Attributes: user-defined metadata (optional)

Separate Components of a Dataset

[Figure: a dataset consists of a header and a data array. The header contains:]
Dataspace: rank 3; dimensions Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
Datatype: IEEE 32-bit float
Storage info: chunked, compressed
Attributes: Time = 32.4, Pressure = 987, Temp = 56

Metadata cache and array data
• Dataset array data is typically kept in application memory
• Dataset header is kept in a separate space – the metadata cache
[Figure: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes) and the dataset array data sits outside the cache; in the file, HDF5 metadata and dataset array data are stored separately.]

Metadata and metadata cache
• HDF5 metadata
• Information about HDF5 objects used by the library
• Examples: object headers, B-tree nodes for groups, B-tree nodes for chunks, heaps, superblock, etc.
• Usually small compared to raw data sizes (KB vs. MB-GB)

• Metadata cache
• Space allocated to handle pieces of the HDF5 metadata
• Allocated by the HDF5 library in application’s memory
space
• Cache behavior affects overall performance
• Metadata cache implementation prior to HDF5 1.6.5
could cause performance degradation for some
applications
Types of data storage layouts

HDF5 dataset storage layouts
• Contiguous
• Chunked
• Compact

Contiguous storage layout
• Metadata header separate from raw data
• Raw data stored in one contiguous block on disk
[Figure: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes) and the dataset array data sits outside the cache; in the file, the array data is one contiguous block, separate from the header.]

Chunked storage
• Chunking – storage layout where a dataset is partitioned into fixed-size multi-dimensional tiles or chunks
• Used for extendible datasets and datasets with filters applied (checksum, compression)
• The HDF5 library treats each chunk as an atomic object
• Chunking greatly affects performance and file sizes
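
To make the effect on file size concrete, the following C sketch counts the chunks needed to cover a dataset and the elements allocated for them. It assumes that a partial chunk at the edge of the dataset still occupies a full chunk before any filter runs, which is what "fixed-size" implies. The function names and the 2 × 2 × 4 chunk shape used in the note below are illustrative assumptions, not part of the HDF5 API.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed along one dimension (ceiling division). */
size_t chunks_along(size_t dim, size_t chunk)
{
    assert(chunk > 0);
    return (dim + chunk - 1) / chunk;
}

/* Total number of chunks covering a dataset of the given rank. */
size_t total_chunks(int rank, const size_t *dims, const size_t *chunk)
{
    size_t n = 1;
    for (int i = 0; i < rank; i++)
        n *= chunks_along(dims[i], chunk[i]);
    return n;
}

/* Elements allocated if every chunk, including partial edge chunks,
 * occupies a full chunk (assumption: no compression applied yet). */
size_t allocated_elements(int rank, const size_t *dims, const size_t *chunk)
{
    size_t per_chunk = 1;
    for (int i = 0; i < rank; i++)
        per_chunk *= chunk[i];
    return total_chunks(rank, dims, chunk) * per_chunk;
}
```

For a 4 × 5 × 7 dataset with 2 × 2 × 4 chunks this gives 2 × 3 × 2 = 12 chunks and 192 allocated elements for 140 data elements, so a poorly matched chunk shape can add noticeably to storage.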

Chunked storage layout
• Raw data divided into equal sized blocks (chunks).
• Each chunk stored separately as a contiguous block on disk
[Figure: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes) and the chunk index, and the dataset array data is divided into chunks A, B, C, D; in the file, the header and chunk index are followed by the chunks as separate blocks, shown in the order A, C, D, B.]

Compact storage layout
• Data array and metadata stored together in the header
[Figure: the dataset header (datatype, dataspace, attributes) and the array data are held together in the metadata cache in application memory and are stored together in the file*.]

* “File” may in fact be a collection of files, memory, or other storage destination.

Factors affecting I/O
performance

What goes on inside the magic box?
• Operations on data inside the magic box
• Copying to/from internal buffers
• Datatype conversion
• Scattering - gathering
• Data transformation (filters, compression)
• Data structures used
• B-trees (groups, dataset chunks)
• Hash tables
• Local and Global heaps (variable length data: link names,
strings, etc.)
• Other concepts
• HDF5 metadata, metadata cache
• Chunking, chunk cache
Operations on data inside the magic box
• Copying to/from internal buffers
• Datatype conversion, such as
• between float and integer
• between little-endian (LE) and big-endian (BE)
• 64-bit integer to 16-bit integer

• Scattering - gathering
• Data is gathered from application buffers into internal buffers, or scattered from internal buffers back to application buffers, for datatype conversion and partial I/O
• Data transformation (filters, compression)
• Checksum on raw data and metadata (in 1.8.0)
• Algebraic transform
• GZIP and SZIP compression
• User-defined filters
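
The scatter-gather step can be pictured in plain C. In this sketch, `gather` copies a strided selection out of an application array into a small contiguous buffer and `scatter` copies it back. This illustrates the idea only; it is not the library's internal code, and the function names and the stride-based selection are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Gather `count` elements, `stride` apart, starting at src[start],
 * into the contiguous buffer `buf`. */
void gather(const int *src, size_t start, size_t stride, size_t count, int *buf)
{
    assert(stride > 0);
    for (size_t i = 0; i < count; i++)
        buf[i] = src[start + i * stride];
}

/* Scatter the contiguous buffer back to the strided positions in dst. */
void scatter(const int *buf, size_t start, size_t stride, size_t count, int *dst)
{
    assert(stride > 0);
    for (size_t i = 0; i < count; i++)
        dst[start + i * stride] = buf[i];
}
```

With a 3 × 4 row-major array, `gather(a, 1, 4, 3, col)` collects column 1 into `col`, which could then be converted or written as one contiguous block.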
I/O performance depends on
• Storage layouts
• Dataset storage properties
• Chunking strategy
• Metadata cache performance
• Datatype conversion performance
• Other filters, such as compression
• Access patterns

I/O with different storage
layouts

Writing compact dataset

[Figure: the dataset header (datatype, dataspace, attributes) and the array data are held together in the metadata cache in application memory.]
One write stores both the header and the data array in the file.

Writing contiguous dataset – no conversion

[Figure: the dataset header (datatype, dataspace, attributes) is held in the metadata cache; the dataset array data goes from application memory to the file.]

Writing a contiguous dataset with datatype conversion

[Figure: the dataset header (datatype, dataspace, attributes) is held in the metadata cache; the dataset array data passes from application memory through a 1MB conversion buffer on its way to the file.]

Partial I/O with contiguous
datasets

Writing whole dataset – contiguous rows
[Figure: the application data in memory, M rows in all, is written with one I/O operation; the data is contiguous in the file.]

Sub-setting of contiguous dataset
Series of adjacent rows
[Figure: a series of M adjacent rows of the application data in memory is transferred with one I/O operation; the subset is contiguous in the file, as is the entire dataset.]

Sub-setting of contiguous dataset
Adjacent, partial rows
[Figure: M adjacent partial rows of N elements each in application memory require several small I/O operations; the data is scattered in the file in M contiguous blocks.]

Sub-setting of contiguous dataset
Extreme case: writing a column
[Figure: a column of M rows, 1 element per row, in application memory requires several small I/O operations; the subset data is scattered in the file in M different locations.]
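
The subsetting cases above can be summarized by counting how many separate contiguous file blocks a rectangular selection touches. The sketch below assumes C (row-major) ordering and no data sieve buffer; the function name and parameters are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Number of separate contiguous file blocks touched when a
 * sel_rows x sel_cols rectangle is selected from a row-major
 * dataset with `ncols` columns (assumption: no sieve buffer). */
size_t contiguous_blocks(size_t ncols, size_t sel_rows, size_t sel_cols)
{
    assert(sel_cols <= ncols);
    if (sel_rows == 0 || sel_cols == 0)
        return 0;
    /* Full rows are adjacent in the file, so they merge into one block. */
    if (sel_cols == ncols)
        return 1;
    /* Partial rows leave gaps, giving one block per selected row. */
    return sel_rows;
}
```

Adjacent full rows give one block and therefore one I/O operation, while partial rows or a single column give one small block per row.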

Sub-setting of contiguous dataset
Data sieve buffer
[Figure: the column's data, 1 element per row, is gathered with memcopy into a 64K sieve buffer in memory; in the file the data is scattered in M contiguous blocks.]

Performance tuning for contiguous dataset
• Datatype conversion
• Avoid for better performance
• Use H5Pset_buffer function to customize
conversion buffer size

• Partial I/O
• Write/read in big contiguous blocks
• Use H5Pset_sieve_buf_size to improve
performance for complex subsetting
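
A minimal configuration sketch for the two calls named above, assuming the HDF5 C library and its `hdf5.h` header are available. `H5Pset_buffer` acts on a dataset transfer property list and `H5Pset_sieve_buf_size` on a file access property list; the wrapper function names are hypothetical, and suitable buffer sizes depend on the application.

```c
#include <hdf5.h>

/* Returns a dataset transfer property list whose conversion buffer is
 * `conv_bytes` long, or a negative value on failure. */
hid_t make_transfer_plist(size_t conv_bytes)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    if (dxpl < 0)
        return dxpl;
    /* NULL buffers let the library allocate the conversion and
     * background buffers itself. */
    if (H5Pset_buffer(dxpl, conv_bytes, NULL, NULL) < 0) {
        H5Pclose(dxpl);
        return -1;
    }
    return dxpl;
}

/* Returns a file access property list with a sieve buffer of
 * `sieve_bytes`, or a negative value on failure. */
hid_t make_access_plist(size_t sieve_bytes)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    if (fapl < 0)
        return fapl;
    if (H5Pset_sieve_buf_size(fapl, sieve_bytes) < 0) {
        H5Pclose(fapl);
        return -1;
    }
    return fapl;
}
```

The transfer list would be passed as the last property list argument of `H5Dwrite` or `H5Dread`, and the access list to `H5Fcreate` or `H5Fopen`; both should be released with `H5Pclose` when no longer needed.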

I/O with Chunking

Reminder – chunked storage layout

[Figure: as on the earlier chunked storage layout slide: the metadata cache in application memory holds the dataset header and the chunk index, and chunks A, B, C, D are stored as separate blocks in the file.]

Information about chunking
• The HDF5 library treats each chunk as an atomic object
• Compression is applied to each chunk
• Datatype conversion and other filters are applied per chunk

• Chunk size greatly affects performance
• Chunk overhead adds to file size
• Chunk processing involves many steps

• Chunk cache
• Caches chunks for better performance
• Created for each chunked dataset
• Size of chunk cache is set for the file (default size 1MB)
• Each chunked dataset has its own chunk cache
• A chunk may be too big to fit into the cache
• Memory may grow if the application keeps opening datasets
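
The last two points are simple arithmetic, sketched here in C. The 1MB default comes from the slide; the function names are assumptions made for illustration. In the C API the per-file chunk cache size is set on a file access property list with `H5Pset_cache`, a call the slide does not name.

```c
#include <assert.h>
#include <stddef.h>

#define DEFAULT_CHUNK_CACHE_BYTES (1024u * 1024u) /* 1MB default from the slide */

/* Size of one uncompressed chunk in bytes. */
size_t chunk_bytes(int rank, const size_t *chunk_dims, size_t elem_size)
{
    size_t n = elem_size;
    for (int i = 0; i < rank; i++)
        n *= chunk_dims[i];
    return n;
}

/* How many whole chunks fit in a cache of `cache_bytes`; 0 means a
 * single chunk is already too big for the cache. */
size_t chunks_that_fit(size_t cache_bytes, size_t one_chunk_bytes)
{
    assert(one_chunk_bytes > 0);
    return cache_bytes / one_chunk_bytes;
}

/* Upper bound on chunk cache memory when `open_datasets` chunked
 * datasets are open, each with its own cache. */
size_t total_cache_bytes(size_t open_datasets, size_t cache_bytes)
{
    return open_datasets * cache_bytes;
}
```

A 100 × 100 chunk of 4-byte elements (40,000 bytes) fits 26 times into the default cache, whereas a 1000 × 1000 chunk (4,000,000 bytes) does not fit at all; fifty open chunked datasets could hold up to 50MB of chunk cache between them.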

Chunk cache

[Figure: application memory holds the metadata cache (Dataset_1 header … Dataset_N header, chunking B-tree nodes) and the chunk cache, whose default size is 1MB.]

Writing chunked dataset
[Figure: chunks A, B, C of a chunked dataset pass through the chunk cache and then the filter pipeline, and are written to the file as separate blocks.]

• Compression is performed when a chunk is evicted from the chunk cache
• Other filters are applied as data goes through the filter pipeline

Partial I/O with Chunking

Partial I/O for chunked dataset
[Figure: a dataset divided into six chunks; the green subset overlaps the four chunks numbered 1-4.]

• Example: write the green subset from the dataset, converting the data
• The dataset is stored as six chunks in the file.
• The subset spans four chunks, numbered 1-4 in the figure.
• Hence four chunks must be written to the file.
• But first, the four chunks must be read from the file, to preserve those parts of each chunk that are not to be overwritten.
Partial I/O for chunked dataset
• For each of the four chunks:
• Read the chunk from the file into the chunk cache, unless it’s already there.
• Determine which part of the chunk will be replaced by the selection.
• Replace that part of the chunk in the cache with the corresponding elements from the application’s array.
• Move those elements to the conversion buffer and perform the conversion.
• Move those elements back from the conversion buffer to the chunk cache.
• Apply filters (compression) when the chunk is flushed from the chunk cache.

• Three memcopy operations are performed for each element.
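
How many chunks a partial write has to read, modify, and write back can be estimated from the selection's offset and extent. The sketch below handles a rectangular selection; the function names are illustrative assumptions, and the 20 × 30 dataset with 10 × 10 chunks in the note is a made-up layout chosen to give six chunks as in the example.

```c
#include <assert.h>
#include <stddef.h>

/* Chunks overlapped along one dimension by a selection starting at
 * `offset` and covering `count` consecutive elements. */
size_t chunks_spanned(size_t offset, size_t count, size_t chunk)
{
    assert(chunk > 0);
    if (count == 0)
        return 0;
    size_t first = offset / chunk;
    size_t last = (offset + count - 1) / chunk;
    return last - first + 1;
}

/* Chunks overlapped by a rectangular selection of the given rank. */
size_t chunks_touched(int rank, const size_t *offset, const size_t *count,
                      const size_t *chunk)
{
    size_t n = 1;
    for (int i = 0; i < rank; i++)
        n *= chunks_spanned(offset[i], count[i], chunk[i]);
    return n;
}
```

In a 20 × 30 dataset with 10 × 10 chunks, a 10 × 10 selection starting at (5, 5) touches four chunks, while the same selection aligned at (0, 0) touches only one. Aligning selections with chunk boundaries therefore reduces the read-modify-write work.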
Partial I/O for chunked dataset

[Figure: in application memory, elements participating in I/O are gathered by memcopy from the application buffer into the corresponding chunk (chunk 3) in the chunk cache.]

Partial I/O for chunked dataset

[Figure: in application memory, the elements are moved by memcopy from chunk 3 in the chunk cache to the conversion buffer and by a second memcopy back; the chunk is then compressed and written to the file.]

Variable length data and I/O

Examples of variable length data
• String
A[0] “the first string we want to write”
…………………………………
A[N-1] “the N-th string we want to write”

• Each element is a variable-length record
A[0] (1,1,0,0,0,5,6,7,8,9) [length = 10]
A[1] (0,0,110,2005) [length = 4]
………………………..
A[N] (1,2,3,4,5,6,7,8,9,10,11,12,….,M) [length = M]

Variable length data in HDF5
• Variable length description in HDF5 application
typedef struct {
    size_t len;  /* number of elements of the base type */
    void  *p;    /* pointer to the elements */
} hvl_t;

• Base type can be any HDF5 type
H5Tvlen_create(base_type)

• ~ 20 bytes overhead for each element
• Data cannot be compressed
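
As a sketch of how an application buffer of variable-length elements is laid out, the C code below fills two records matching the earlier example. So that it compiles without the HDF5 headers it uses a local stand-in struct, `vl_elem`, with the same two fields as `hvl_t`. The struct name and the helper function are assumptions; a real program would use `hvl_t` together with a datatype from `H5Tvlen_create`.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in with the same two fields as HDF5's hvl_t, so the
 * sketch compiles without the HDF5 headers. */
typedef struct {
    size_t len; /* number of base-type elements */
    void  *p;   /* pointer to the elements */
} vl_elem;

/* Total number of base-type elements referenced by n VL elements. */
size_t vl_total_elements(const vl_elem *a, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(a[i].len == 0 || a[i].p != NULL);
        total += a[i].len;
    }
    return total;
}
```

Each element of the buffer carries only a length and a pointer; the records themselves live elsewhere in memory, which is why they are stored separately from the dataset's own raw data.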

How variable length data is stored in HDF5

[Figure: in the file, the dataset header describes a dataset with variable length elements; each element holds a pointer into a global heap, where the actual variable length data is stored.]

Variable length datasets and I/O
• When writing variable length data, elements in application
buffer point to global heaps in the metadata cache where
actual data is stored.
[Figure: elements of the application buffer point to raw data held in a global heap.]

There may be more than one global heap

[Figure: elements of the application buffer point to raw data held in two global heaps.]

Variable length datasets and I/O
[Figure: the raw data and the global heaps are written to the file.]

VL chunked dataset in a file

[Figure: the file contains the dataset header, the chunk B-tree, the dataset chunks, and the heaps with VL data.]

Writing chunked VL datasets
[Figure: in application memory, the metadata cache holds the dataset header, B-tree nodes, and the global heap with raw data; the selected region of the VL chunked dataset passes through the chunk cache, conversion buffer, and filter pipeline on its way to the file.]

Hints for variable length data I/O
• Avoid closing/opening a file while writing VL datasets
• Global heap information is lost
• Global heaps may have unused space

• Avoid alternately writing different VL datasets
• Data from different datasets will go into the same heap

• If maximum length of the record is known, consider
using fixed-length records and compression
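
A rough way to weigh the last hint is to compare estimated storage for the two approaches. The sketch below uses the figure of about 20 bytes of overhead per VL element quoted earlier and ignores heap fragmentation and compression. It is an estimate under those assumptions, not a description of the exact file format, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define VL_OVERHEAD_BYTES 20 /* approximate per-element overhead quoted earlier */

/* Estimated bytes for n VL records whose lengths are given in `len`. */
size_t vl_bytes(const size_t *len, size_t n, size_t elem_size)
{
    assert(len != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += len[i] * elem_size + VL_OVERHEAD_BYTES;
    return total;
}

/* Bytes for n fixed-length records padded to max_len, before compression. */
size_t fixed_bytes(size_t n, size_t max_len, size_t elem_size)
{
    return n * max_len * elem_size;
}
```

For the two example records of lengths 10 and 4 with 4-byte elements, the VL estimate is 96 bytes against 80 bytes for fixed-length records padded to 10 elements, and the padding in the fixed-length layout can additionally be compressed.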

Thank you!

Questions?


Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 

Caching and Buffering in HDF5

  • 1. Caching and Buffering in HDF5. The HDF Group. Nov. 6, 2007, HDF-EOS Workshop XI Tutorial.
  • 2. Software stack and the “magic box”
      • Life cycle: what happens to data when it is transferred from the application buffer to an HDF5 file?
      [Figure: software stack. Application (data buffer) → Object API (H5Dwrite) → Library internals (magic box) → Virtual file I/O (unbuffered I/O) → File or other “storage” (data in a file)]
  • 3. Inside the magic box
      • Understanding what happens to data inside the magic box helps in writing efficient applications
      • The HDF5 library has mechanisms to control behavior inside the magic box
      • Goals of this talk:
          • Describe some basic operations and data structures and explain how they affect performance and storage sizes
          • Give some “recipes” for how to improve performance
  • 4. Topics
      • Dataset metadata and array data storage layouts
      • Types of dataset storage layouts
      • Factors affecting I/O performance
      • I/O with compact datasets
      • I/O with contiguous datasets
      • I/O with chunked datasets
      • Variable length data and I/O
  • 5. HDF5 dataset metadata and array data storage layouts
  • 6. HDF5 Dataset
      • Data array
          • Ordered collection of identically typed data items distinguished by their indices
      • Metadata
          • Dataspace: rank and dimensions of the dataset array
          • Datatype: information on how to interpret the data
          • Storage properties: how the array is organized on disk
          • Attributes: user-defined metadata (optional)
  • 7. Separate Components of a Dataset
      [Figure: a dataset consists of a header and a data array. Header contents: Dataspace (Rank = 3; Dim_1 = 4, Dim_2 = 5, Dim_3 = 7); Datatype (IEEE 32-bit float); Storage info (Chunked, Compressed); Attributes (Time = 32.4, Pressure = 987, Temp = 56)]
  • 8. Metadata cache and array data
      • Dataset array data is typically kept in application memory
      • Dataset header is kept in a separate space: the metadata cache
      [Figure: the dataset header (datatype, dataspace, attributes, …) sits in the metadata cache and the dataset array data in application memory; the file holds HDF5 metadata and dataset array data]
  • 9. Metadata and metadata cache
      • HDF5 metadata
          • Information about HDF5 objects used by the library
          • Examples: object headers, B-tree nodes for groups, B-tree nodes for chunks, heaps, superblock, etc.
          • Usually small compared to raw data sizes (KB vs. MB-GB)
      • Metadata cache
          • Space allocated to handle pieces of the HDF5 metadata
          • Allocated by the HDF5 library in the application’s memory space
          • Cache behavior affects overall performance
          • The metadata cache implementation prior to HDF5 1.6.5 could cause performance degradation for some applications
  • 10. Types of data storage layouts
  • 11. HDF5 dataset storage layouts
      • Contiguous
      • Chunked
      • Compact
  • 12. Contiguous storage layout
      • Metadata header is separate from raw data
      • Raw data is stored in one contiguous block on disk
      [Figure: dataset header (datatype, dataspace, attributes, …) in the metadata cache; dataset array data in application memory; both written to the file]
  • 13. Chunked storage
      • Chunking: a storage layout where a dataset is partitioned into fixed-size multi-dimensional tiles, or chunks
      • Used for extendible datasets and datasets with filters applied (checksum, compression)
      • HDF5 library treats each chunk as an atomic object
      • Greatly affects performance and file sizes
  • 14. Chunked storage layout
      • Raw data is divided into equal-sized blocks (chunks)
      • Each chunk is stored separately as a contiguous block on disk
      [Figure: dataset array data in application memory divided into chunks A, B, C, D; the dataset header in the metadata cache refers to a chunk index; in the file, the header, the chunk index, and the chunks are stored as separate blocks (chunks shown in the order A, C, D, B)]
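      The arithmetic behind "equal-sized blocks" can be sketched in C. This is a minimal illustration, not library code: `count_chunks` is a hypothetical helper (it is not an HDF5 call), and it assumes that chunks on the edge of the dataset are counted as whole chunks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the HDF5 API): number of chunks needed
 * to cover a dataset of the given rank, using ceiling division in each
 * dimension. Edge chunks are counted as whole chunks (an assumption). */
size_t count_chunks(size_t rank, const size_t *dims, const size_t *chunk_dims)
{
    size_t total = 1;
    for (size_t i = 0; i < rank; i++) {
        assert(chunk_dims[i] > 0);
        /* ceil(dims[i] / chunk_dims[i]) without floating point */
        total *= (dims[i] + chunk_dims[i] - 1) / chunk_dims[i];
    }
    return total;
}
```

      For the 4 x 5 x 7 dataset from slide 7 and an assumed chunk shape of 2 x 2 x 4, this gives 2 * 3 * 2 = 12 chunks, several of which are only partly filled with data.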
  • 15. Compact storage layout
      • Data array and metadata are stored together in the header
      [Figure: the dataset header (datatype, dataspace, attributes, …, array data) is held in the metadata cache; array data comes from application memory; header and data go to the file*]
      * “File” may in fact be a collection of files, memory, or other storage destination.
  • 16. Factors affecting I/O performance
  • 17. What goes on inside the magic box?
      • Operations on data inside the magic box
          • Copying to/from internal buffers
          • Datatype conversion
          • Scattering and gathering
          • Data transformation (filters, compression)
      • Data structures used
          • B-trees (groups, dataset chunks)
          • Hash tables
          • Local and global heaps (variable length data: link names, strings, etc.)
      • Other concepts
          • HDF5 metadata, metadata cache
          • Chunking, chunk cache
  • 18. Operations on data inside the magic box
      • Copying to/from internal buffers
      • Datatype conversion, such as
          • float ↔ integer
          • LE ↔ BE
          • 64-bit integer to 16-bit integer
      • Scattering and gathering
          • Data is scattered/gathered from/to application buffers into internal buffers for datatype conversion and partial I/O
      • Data transformation (filters, compression)
          • Checksum on raw data and metadata (in 1.8.0)
          • Algebraic transform
          • GZIP and SZIP compression
          • User-defined filters
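      To make the conversion examples concrete, here is a small C sketch of two of them: a byte-order swap and a 64-bit to 16-bit narrowing. Both functions are hypothetical illustrations, not HDF5 internals, and the clamping of out-of-range values is an overflow policy assumed for this sketch.

```c
#include <stdint.h>

/* Illustrative only: reverse the byte order of a 32-bit value (LE <-> BE). */
uint32_t swap_bytes32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Illustrative only: narrow a 64-bit integer to 16 bits, clamping values
 * that do not fit (an assumed overflow policy for this sketch). */
int16_t narrow_i64_to_i16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}
```

      Either operation has to touch every element, which is why the slides treat datatype conversion as a cost to avoid where possible.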
  • 19. I/O performance depends on
      • Storage layouts
      • Dataset storage properties
      • Chunking strategy
      • Metadata cache performance
      • Datatype conversion performance
      • Other filters, such as compression
      • Access patterns
  • 20. I/O with different storage layouts
  • 21. Writing compact dataset
      [Figure: the dataset header (datatype, dataspace, attributes, …, array data) is held in the metadata cache, with array data coming from application memory; one write stores header and data array in the file]
  • 22. Writing contiguous dataset, no conversion
      [Figure: dataset header (datatype, dataspace, attributes, …) in the metadata cache; dataset array data goes from application memory to the file]
  • 23. Writing a contiguous dataset with datatype conversion
      [Figure: dataset header (datatype, dataspace, attribute 1, attribute 2, …) in the metadata cache; dataset array data passes from application memory through a 1 MB conversion buffer on its way to the file]
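      A rough way to see why the conversion buffer size matters is to estimate how many passes through the buffer a write needs. The C sketch below is an assumption-laden estimate, not the library's algorithm: `conversion_passes` is a hypothetical helper, and it assumes each element occupies the larger of its source and destination sizes while in the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimate (not an HDF5 call): number of passes needed to
 * convert nelmts elements through a fixed-size conversion buffer, assuming
 * each element occupies max(src_size, dst_size) bytes while in the buffer. */
size_t conversion_passes(size_t nelmts, size_t src_size, size_t dst_size,
                         size_t buf_bytes)
{
    size_t elem_bytes = src_size > dst_size ? src_size : dst_size;
    assert(elem_bytes > 0 && buf_bytes >= elem_bytes);
    size_t per_pass = buf_bytes / elem_bytes;   /* elements per pass */
    return (nelmts + per_pass - 1) / per_pass;  /* ceiling division  */
}
```

      Under these assumptions, converting one million 8-byte elements to 4-byte elements through a 1 MB buffer takes 8 passes; a larger buffer reduces the number of passes.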
  • 24. Partial I/O with contiguous datasets
  • 25. Writing whole dataset: contiguous rows
      [Figure: application data in memory, M rows of N elements, is written with one I/O operation; the data is contiguous in the file]
  • 26. Sub-setting of contiguous dataset: series of adjacent rows
      [Figure: a subset of adjacent full rows from application data in memory (M rows of N elements) is written with one I/O operation; the subset is contiguous in the file, as is the entire dataset]
  • 27. Sub-setting of contiguous dataset: adjacent, partial rows
      [Figure: a subset of M partial rows from application data in memory requires several small I/O operations; the data is scattered in the file in M contiguous blocks]
  • 28. Sub-setting of contiguous dataset: extreme case, writing a column
      [Figure: a column of M rows, 1 element each, requires several small I/O operations; the subset data is scattered in the file in M different locations]
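      The three sub-setting cases above reduce to a simple count of contiguous blocks in the file. The following C sketch assumes a two-dimensional dataset stored in row-major order; `contiguous_blocks` is a hypothetical helper for illustration, not an HDF5 function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of contiguous file blocks touched by a
 * rectangular selection of sel_rows x sel_cols elements in a row-major
 * dataset with ncols elements per row. */
size_t contiguous_blocks(size_t ncols, size_t sel_rows, size_t sel_cols)
{
    assert(sel_cols <= ncols);
    if (sel_rows == 0 || sel_cols == 0) return 0;
    /* Adjacent full rows form a single block in the file. */
    if (sel_cols == ncols) return 1;
    /* Partial rows leave gaps, so each selected row is its own block. */
    return sel_rows;
}
```

      Without any buffering, each block corresponds to at least one I/O operation, so a single column of M rows is the worst case: M operations of one element each.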
  • 29. Sub-setting of contiguous dataset: data sieve buffer
      [Figure: a column of M rows, 1 element each, is gathered by memcopy into a 64K sieve buffer in memory instead of being transferred element by element; the data is scattered in the file in M contiguous blocks]
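      The benefit of the sieve buffer can be illustrated with a toy model in C. This is not how the library is implemented; `sieve_fills_for_column` is a hypothetical function that assumes reading one column of a row-major dataset, and that each refill loads a full buffer starting at the first byte not already buffered.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (not HDF5 code): count how many times a sieve buffer of
 * sieve_bytes must be refilled from the file to read column 0 of a
 * row-major dataset. Each refill loads sieve_bytes starting at the first
 * requested byte that is not already buffered. */
size_t sieve_fills_for_column(size_t nrows, size_t ncols, size_t elem_bytes,
                              size_t sieve_bytes)
{
    assert(elem_bytes > 0 && sieve_bytes >= elem_bytes);
    size_t fills = 0;
    size_t win_start = 0, win_end = 0;  /* buffered range [win_start, win_end) */
    for (size_t r = 0; r < nrows; r++) {
        size_t off = r * ncols * elem_bytes;  /* file offset of the element */
        if (fills == 0 || off < win_start || off + elem_bytes > win_end) {
            win_start = off;
            win_end = off + sieve_bytes;
            fills++;
        }
    }
    return fills;
}
```

      In this model, a column of 1000 rows from a dataset with 100 four-byte elements per row needs 7 buffer fills with a 64K buffer, compared with 1000 separate one-element accesses. When a single row is longer than the buffer, every element needs its own fill and the buffer gives no benefit.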
  • 30. Performance tuning for contiguous dataset
      • Datatype conversion
          • Avoid for better performance
          • Use the H5Pset_buffer function to customize the conversion buffer size
      • Partial I/O
          • Write/read in big contiguous blocks
          • Use H5Pset_sieve_buf_size to improve performance for complex subsetting
  • 31. I/O with Chunking
  • 32. Reminder: chunked storage layout
      [Figure: same as slide 14. Dataset array data in application memory divided into chunks A, B, C, D; dataset header in the metadata cache; in the file, the header, the chunk index, and the chunks are stored as separate blocks]
  • 33. Information about chunking
      • HDF5 library treats each chunk as an atomic object
          • Compression is applied to each chunk
          • Datatype conversion and other filters are applied per chunk
      • Chunk size greatly affects performance
          • Chunk overhead adds to file size
          • Chunk processing involves many steps
      • Chunk cache
          • Caches chunks for better performance
          • Created for each chunked dataset
          • Size of chunk cache is set for the file (default size 1MB)
          • Each chunked dataset has its own chunk cache
          • A chunk may be too big to fit into the cache
          • Memory may grow if the application keeps opening datasets
  • 34. Chunk cache
      [Figure: application memory holds the metadata cache (Dataset_1 header … Dataset_N header, chunking B-tree nodes) and the chunk cache, whose default size is 1MB]
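      Whether a chunk fits in the chunk cache is a quick calculation worth doing before choosing a chunk shape. The C sketch below uses a hypothetical helper, `chunks_fitting_in_cache`, and treats the cache as a plain byte budget, which is a simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many whole chunks fit in a chunk cache of
 * cache_bytes. A result of 0 means a single chunk is too big for the cache. */
size_t chunks_fitting_in_cache(size_t rank, const size_t *chunk_dims,
                               size_t elem_bytes, size_t cache_bytes)
{
    size_t chunk_bytes = elem_bytes;
    for (size_t i = 0; i < rank; i++) {
        assert(chunk_dims[i] > 0);
        chunk_bytes *= chunk_dims[i];
    }
    assert(chunk_bytes > 0);
    return cache_bytes / chunk_bytes;
}
```

      With the default 1MB cache, 26 chunks of 100 x 100 four-byte elements fit, while a single 1000 x 1000 chunk of the same type (about 4 MB) does not fit at all.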
  • 35. Writing chunked dataset
      • Compression is performed when a chunk is evicted from the chunk cache
      • Other filters are applied as data goes through the filter pipeline
      [Figure: chunks A, B, C of a chunked dataset move through the chunk cache and the filter pipeline into the file]
  • 36. Partial I/O with Chunking
  • 37. Partial I/O for chunked dataset
      • Example: write the green subset of the dataset, converting the data
      • The dataset is stored as six chunks in the file.
      • The subset spans four chunks, numbered 1-4 in the figure.
      • Hence four chunks must be written to the file.
      • But first, the four chunks must be read from the file, to preserve those parts of each chunk that are not to be overwritten.
  • 38. Partial I/O for chunked dataset
      • For each of the four chunks (1-4):
          • Read the chunk from the file into the chunk cache, unless it is already there.
          • Determine which part of the chunk will be replaced by the selection.
          • Replace that part of the chunk in the cache with the corresponding elements from the application’s array.
          • Move those elements to the conversion buffer and perform the conversion.
          • Move those elements back from the conversion buffer to the chunk cache.
          • Apply filters (compression) when the chunk is flushed from the chunk cache.
      • Three memcopies are performed for each element
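      The number of chunks that take part in a partial write can be computed from the selection and the chunk shape. The C sketch below assumes a single rectangular (hyperslab-style) selection given by a start offset and an element count per dimension; `chunks_touched` is a hypothetical helper, not an HDF5 call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of chunks intersected by a rectangular
 * selection that starts at start[i] and covers count[i] elements in each
 * dimension i. */
size_t chunks_touched(size_t rank, const size_t *start, const size_t *count,
                      const size_t *chunk_dims)
{
    size_t total = 1;
    for (size_t i = 0; i < rank; i++) {
        assert(chunk_dims[i] > 0);
        if (count[i] == 0) return 0;
        size_t first = start[i] / chunk_dims[i];                  /* first chunk index */
        size_t last  = (start[i] + count[i] - 1) / chunk_dims[i]; /* last chunk index  */
        total *= last - first + 1;
    }
    return total;
}
```

      As an assumed example matching the six-chunk picture, an 8 x 12 dataset with 4 x 4 chunks has six chunks, and a 4 x 4 selection starting at element (2, 2) touches four of them. The same 4 x 4 selection aligned to a chunk boundary touches only one, so it avoids three of the four read-modify-write cycles.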
  • 39. Partial I/O for chunked dataset
      [Figure: in application memory, elements participating in I/O are gathered by memcopy from the application buffer into the corresponding chunk (chunk 3) in the chunk cache]
  • 40. Partial I/O for chunked dataset
      [Figure: in application memory, elements of chunk 3 are moved by memcopy from the chunk cache to the conversion buffer and by memcopy back again; the chunk is then compressed and written to the file]
  • 41. Variable length data and I/O
  • 42. Examples of variable length data
      • String
          • A[0] “the first string we want to write”
          • …
          • A[N-1] “the N-th string we want to write”
      • Each element is a record of variable length
          • A[0] (1,1,0,0,0,5,6,7,8,9) [length = 10]
          • A[1] (0,0,110,2005) [length = 4]
          • …
          • A[N] (1,2,3,4,5,6,7,8,9,10,11,12,…,M) [length = M]
  • 43. Variable length data in HDF5
      • Variable length description in an HDF5 application: typedef struct { size_t len; void *p; } hvl_t;
      • Base type can be any HDF5 type: H5Tvlen_create(base_type)
      • ~ 20 bytes overhead for each element
      • Data cannot be compressed
  • 44. How variable length data is stored in HDF5
      [Figure: in the file, the dataset header points to a dataset with variable length elements; each element holds a pointer into a global heap, where the actual variable length data is stored]
  • 45. Variable length datasets and I/O
      • When writing variable length data, elements in the application buffer point to global heaps in the metadata cache where the actual data is stored.
      [Figure: raw data in the application buffer pointing to a global heap]
  • 46. There may be more than one global heap
      [Figure: raw data in the application buffer pointing to two global heaps]
  • 47. Variable length datasets and I/O
      [Figure: raw data and two global heaps written to the file]
  • 48. VL chunked dataset in a file
      [Figure: in the file, the dataset header refers to a chunk B-tree, which locates the dataset chunks; heaps with VL data are stored separately]
  • 49. Writing chunked VL datasets
      [Figure: application memory holds the metadata cache (dataset header, B-tree nodes, global heap) and the chunk cache; raw data for the selected region of the VL chunked dataset passes through the chunk cache, conversion buffer, and filter pipeline into the file]
  • 50. Hints for variable length data I/O
      • Avoid closing/opening a file while writing VL datasets
          • Global heap information is lost
          • Global heaps may have unused space
      • Avoid alternately writing different VL datasets
          • Data from different datasets will go into the same heap
      • If the maximum length of the record is known, consider using fixed-length records and compression
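      The last hint can be weighed with a back-of-the-envelope comparison in C. Everything here is an estimate under stated assumptions: `vl_elem_t` is a local stand-in modeled on the hvl_t layout so the sketch compiles without hdf5.h, the 20-byte overhead is the approximate figure quoted on slide 43, and both functions are hypothetical helpers.

```c
#include <stddef.h>

/* Local stand-in for a variable-length element descriptor, modeled on the
 * hvl_t layout; defined here so the sketch compiles without hdf5.h. */
typedef struct {
    size_t len;  /* number of base-type items */
    void  *p;    /* pointer to the items */
} vl_elem_t;

/* Assumed per-element overhead, taken from the "~20 bytes" figure on slide 43. */
#define VL_OVERHEAD_BYTES 20

/* Estimated bytes needed to store n variable-length elements. */
size_t vl_storage_bytes(const vl_elem_t *elems, size_t n, size_t base_bytes)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += elems[i].len * base_bytes + VL_OVERHEAD_BYTES;
    return total;
}

/* Bytes needed to store the same n records as fixed-length records of
 * max_len items each, before any compression. */
size_t fixed_storage_bytes(size_t n, size_t max_len, size_t base_bytes)
{
    return n * max_len * base_bytes;
}
```

      For the two records from slide 42 (lengths 10 and 4) with a 4-byte base type, the estimate is 96 bytes as variable length data against 80 bytes as fixed-length records of 10 items, and the fixed-length form can additionally be compressed.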
  • 51. Thank you! Questions?