Advanced HDF5 Features

HDF5 Advanced Topics

October 15, 2008

HDF and HDF-EOS Workshop XII

1

Outline
• Part I
• Overview of HDF5 datatypes

• Part II
• Partial I/O in HDF5
• Hyperslab selection
• Dataset region references

• Chunking and compression

• Part III
• Performance issues (how to do it right)

October 15, 2008


2

Part I
HDF5 Datatypes
Quick overview of the most
difficult topics

October 15, 2008


3

HDF5 Datatypes
• HDF5 has a rich set of pre-defined datatypes and
supports the creation of an unlimited variety of
complex user-defined datatypes.
• Datatype definitions are stored in the HDF5 file
with the data.
• Datatype definitions include information such as
byte order (endianess), size, and floating point
representation to fully describe how the data is
stored and to insure portability across platforms.
• Datatype definitions can be shared among objects
in an HDF file, providing a powerful and efficient
mechanism for describing data.
October 15, 2008


4

Example
Array of of integers on Linux platform
Native integer is little-endian, 4 bytes

Array of of integers on Solaris platform
Native integer is big-endian,
Fortran compiler uses -i8 flag to set
integer to 8 bytes

H5T_NATIVE_INT
H5T_NATIVE_INT

Little-endian 4 bytes integer

H5Dwrite

H5Dread
H5Dwrite
H5T_SDT_I32LE

October 15, 2008


VAX G-floating

5

Storing Variable Length
Data in HDF5

October 15, 2008


6

HDF5 Fixed and Variable Length Array Storage

•Data
•Data

Time
•Data
•Data

•Data
•Data

Time
•Data
•Data
•Data

October 15, 2008


7

Storing Strings in HDF5
• Array of characters
• Access to each character
• Extra work to access and interpret each string
• Fixed length
string_id = H5Tcopy(H5T_C_S1);
H5Tset_size(string_id, size);
• Overhead for short strings
• Can be compressed
• Variable length
string_id = H5Tcopy(H5T_C_S1);
H5Tset_size(string_id, H5T_VARIABLE);
• Overhead as for all VL datatypes
• Compression will not be applied to actual data

October 15, 2008


8

Storing Variable Length Data in HDF5
• Each element is represented by C structure
typedef struct {
size_t length;
void
*p;
} hvl_t;

• Base type can be any HDF5 type
H5Tvlen_create(base_type)

October 15, 2008


9

Example
hvl_t

data[LENGTH];

for(i=0; i<LENGTH; i++) {
data[i].p=HDmalloc(
(i+1)*sizeof(unsigned int));
data[i].len=i+1;
}
tvl = H5Tvlen_create (H5T_NATIVE_UINT);
data[0].p

•Data
•Data
•Data
•Data

data[4].len
October 15, 2008

•Data


10

Reading HDF5 Variable Length Array
On read HDF5 Library allocates memory to read data in,
application only needs to allocate array of hvl_t elements
(pointers and lengths).
hvl_t

rdata[LENGTH];

/* Discover the type in the file */
tvl = H5Tvlen_create (H5T_NATIVE_UINT);
ret = H5Dread(dataset,tvl,H5S_ALL,H5S_ALL,
H5P_DEFAULT, rdata);
/* Reclaim the read VL data */
H5Dvlen_reclaim(tvl,H5S_ALL,H5P_DEFAULT,rdata
);
October 15, 2008


11

Storing Tables in HDF5 file

October 15, 2008


12

Example
a_name
(integer)

b_name
(float)

c_name
(double)

0

0.

1.0000

1

1.

0.5000

2

4.

0.3333

3

9.

0.2500

4

16.

0.2000

5

25.

0.1667

6

36.

0.1429

7

49.

0.1250

8

64.

0.1111

9

81.

0.1000

October 15,
2008

Multiple ways to store a table
Dataset for each field
Dataset with compound datatype
If all fields have the same type:
2-dim array
1-dim array of array datatype
continued…..

Choose to achieve your goal!
How much overhead each type
of storage will create?
Do I always read all fields?
Do I need to read some fields more
often?
Do I want to use compression?
Do I want to access some records?


13

HDF5 Compound Datatypes
• Compound types
• Comparable to C structs
• Members can be atomic or compound
types
• Members can be multidimensional
• Can be written/read by a field or set of
fields
• Not all data filters can be applied (shuffling,
SZIP)

October 15, 2008


14

HDF5 Compound Datatypes
•

Which APIs to use?
• H5TB APIs
•
•
•
•

Create, read, get info and merge tables
Add, delete, and append records
Insert and delete fields
Limited control over table’s properties (i.e. only GZIP
compression, level 6, default allocation time for table, extendible,
etc.)

• PyTables http://www.pytables.org
• Based on H5TB
• Python interface
• Indexing capabilities

• HDF5 APIs
• H5Tcreate(H5T_COMPOUND), H5Tinsert calls to create a
compound datatype
• H5Dcreate, etc.
• See H5Tget_member* functions for discovering properties of the
HDF5 compound datatype
October 15, 2008


15

Creating and Writing Compound Dataset
h5_compound.c example
typedef struct s1_t {
int a;
float b;
double c;
} s1_t;
s1_t

October 15, 2008

s1[LENGTH];


16

/* Create datatype in memory. */
s1_tid = H5Tcreate (H5T_COMPOUND, sizeof(s1_t));
H5Tinsert(s1_tid, "a_name", HOFFSET(s1_t, a),
H5T_NATIVE_INT);
H5Tinsert(s1_tid, "c_name", HOFFSET(s1_t, c),
H5T_NATIVE_DOUBLE);
H5Tinsert(s1_tid, "b_name", HOFFSET(s1_t, b),
H5T_NATIVE_FLOAT);

Note:
• Use HOFFSET macro instead of calculating offset by hand.
• Order of H5Tinsert calls is not important if HOFFSET is used.

October 15, 2008


17

/* Create dataset and write data */
dataset = H5Dcreate(file, DATASETNAME, s1_tid, space,
H5P_DEFAULT);
status = H5Dwrite(dataset, s1_tid, H5S_ALL, H5S_ALL,
H5P_DEFAULT, s1);
Note:
• In this example memory and file datatypes are the same.
• Type is not packed.
• Use H5Tpack to save space in the file.
s2_tid = H5Tpack(s1_tid);
status = H5Dcreate(file, DATASETNAME, s2_tid, space,
H5P_DEFAULT);
October 15, 2008


18

File Content with h5dump
HDF5 "SDScompound.h5" {
GROUP "/" {
DATASET "ArrayOfStructures" {
DATATYPE {
H5T_STD_I32BE "a_name";
H5T_IEEE_F32BE "b_name";
H5T_IEEE_F64BE "c_name"; }
DATASPACE { SIMPLE ( 10 ) / ( 10 ) }
DATA {
{
[ 0 ],
[ 0 ],
[ 1 ]
},
{
[ 1 ],
…
October 15, 2008


19

Reading Compound Dataset
/* Create datatype in memory and read data. */
dataset
= H5Dopen(file, DATSETNAME);
s2_tid
= H5Dget_type(dataset);
mem_tid
= H5Tget_native_type (s2_tid);
s1 = malloc((sizeof(mem_tid)*number_of_elements)
status
= H5Dread(dataset, mem_tid, H5S_ALL,
H5S_ALL, H5P_DEFAULT, s1);
Note:
• We could construct memory type as we did in writing example.
• For general applications we need to discover the type in the
file, find out corresponding memory type, allocate space and do
read.

October 15, 2008


20

Reading Compound Dataset by Fields
typedef struct s2_t {
double c;
int
a;
} s2_t;
s2_t s2[LENGTH];
…
s2_tid = H5Tcreate (H5T_COMPOUND, sizeof(s2_t));
H5Tinsert(s2_tid, "c_name", HOFFSET(s2_t, c),
H5T_NATIVE_DOUBLE);
H5Tinsert(s2_tid, “a_name", HOFFSET(s2_t, a),
H5T_NATIVE_INT);
…
status = H5Dread(dataset, s2_tid, H5S_ALL,
H5S_ALL, H5P_DEFAULT, s2);

October 15, 2008


21

New Way of Creating Datatypes
Another way to create a compound datatype
#include H5LTpublic.h
…..
s2_tid = H5LTtext_to_dtype(
"H5T_COMPOUND
{H5T_NATIVE_DOUBLE "c_name";
H5T_NATIVE_INT "a_name";
}",
H5LT_DDL);

October 15, 2008


22

Need Help with Datatypes?
Check our support web pages

http://www.hdfgroup.uiuc.edu/UserSupport/example

http://www.hdfgroup.uiuc.edu/UserSupport/example

October 15, 2008


23

Part II
Working with subsets

October 15, 2008


24

Collect data one way ….
Array of images (3D)

October 15, 2008


25

Display data another way …

Stitched image (2D array)

October 15, 2008


26

Data is too big to read….

October 15, 2008


27

Refer to a region…
Need to select and access the same
elements of a dataset

October 15, 2008


28

HDF5 Library Features
• HDF5 Library provides capabilities to
• Describe subsets of data and perform write/read
operations on subsets
• Hyperslab selections and partial I/O

• Store descriptions of the data subsets in a file
• Object references
• Region references

• Use efficient storage mechanism to achieve good
performance while writing/reading subsets of data
• Chunking, compression

October 15, 2008


29

Partial I/O in HDF5

October 15, 2008


30

How to Describe a Subset in HDF5?
• Before writing and reading a subset of data
one has to describe it to the HDF5 Library.
• HDF5 APIs and documentation refer to a
subset as a “selection” or “hyperslab
selection”.
• If specified, HDF5 Library will perform I/O on a
selection only and not on all elements of a
dataset.

October 15, 2008


31

Types of Selections in HDF5
• Two types of selections
• Hyperslab selection
• Regular hyperslab
• Simple hyperslab
• Result of set operations on hyperslabs (union,
difference, …)

• Point selection

• Hyperslab selection is especially important for
doing parallel I/O in HDF5 (See Parallel HDF5
Tutorial)

October 15, 2008


32

Regular Hyperslab

Collection of regularly spaced equal size blocks
October 15, 2008


33

Simple Hyperslab

Contiguous subset or sub-array
October 15, 2008


34

Hyperslab Selection

Result of union operation on three simple hyperslabs
October 15, 2008


35

Hyperslab Description
• Offset - starting location of a hyperslab (1,1)
• Stride - number of elements that separate each
block (3,2)
• Count - number of blocks (2,6)
• Block - block size (2,1)
• Everything is “measured” in number of elements

October 15, 2008


36

Simple Hyperslab Description
• Two ways to describe a simple hyperslab
• As several blocks
• Stride – (1,1)
• Count – (2,6)
• Block – (2,1)

• As one block
• Stride – (1,1)
• Count – (1,1)
• Block – (4,6)

No performance penalty for
one way or another
October 15, 2008


37

H5Sselect_hyperslab Function

space_id Identifier of dataspace
op
Selection operator
H5S_SELECT_SET or H5S_SELECT_OR
offset
Array with starting coordinates of hyperslab
stride
Array specifying which positions along a dimension
to select
count
Array specifying how many blocks to select from the
dataspace, in each dimension
block
Array specifying size of element block
(NULL indicates a block size of a single element in
a dimension)

October 15, 2008


38

Reading/Writing Selections
Programming model for reading from a dataset in
a file
1. Open a dataset.
2. Get file dataspace handle of the dataset and specify
subset to read from.
a. H5Dget_space returns file dataspace handle
a.

File dataspace describes array stored in a file (number of
dimensions and their sizes).

b. H5Sselect_hyperslab selects elements of the array
that participate in I/O operation.

3. Allocate data buffer of an appropriate shape and size

October 15, 2008


39

Reading/Writing Selections
Programming model (continued)
4. Create a memory dataspace and specify subset to write
to.
1.
2.
3.

Memory dataspace describes data buffer (its rank and
dimension sizes).
Use H5Screate_simple function to create memory
dataspace.
Use H5Sselect_hyperslab to select elements of the data
buffer that participate in I/O operation.

4. Issue H5Dread or H5Dwrite to move the data between
file and memory buffer.
5. Close file dataspace and memory dataspace when
done.
October 15, 2008


40

Example : Reading Two Rows
1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

-1

-1

-1

Data in a file
4x6 matrix

Buffer in memory
1-dim array of length 14
-1

-1

October 15, 2008

-1

-1

-1

-1

-1


-1

-1
41

-1

-1

Example: Reading Two Rows
1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

offset
count
block
stride

24

filespace = H5Dget_space (dataset);
H5Sselect_hyperslab (filespace, H5S_SELECT_SET,
offset, NULL, count, NULL)

October 15, 2008


42

=
=
=
=

{1,0}
{2,6}
{1,1}
{1,1}

offset = {1}
count = {12}

-1

-1

-1

-1

-1

-1

-1

-1

-1

-1

-1

-1

-1

memspace = H5Screate_simple(1, 14, NULL);
H5Sselect_hyperslab (memspace, H5S_SELECT_SET,
offset, NULL, count, NULL)

October 15, 2008


43

-1

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

H5Dread (…, …, memspace, filespace, …, …);

-1

7

October 15, 2008

8

9

10 11 12 13 14 15 16 17 18 -1

44

Things to Remember
• Number of elements selected in a file and in a
memory buffer should be the same
• H5Sget_select_npoints returns number of
selected elements in a hyperslab selection

• HDF5 partial I/O is tuned to move data between
selections that have the same dimensionality;
avoid choosing subsets that have different ranks
(as in example above)
• Allocate a buffer of an appropriate size when
reading data; use H5Tget_native_type and
H5Tget_size to get the correct size of the data
element in memory.
October 15, 2008


45

Things to Remember
• When calling H5Sselect_hyperslab in a loop
close the obtained dataspace handle in a loop to
avoid application memory growth.
Only offset parameter is changing;
block and stride parameters
stay the same.

offset
October 15, 2008


46

Example
offset[0] = 0;
offset[1] = 0;
fspace_id = H5Dget_space(...);
for (k=0; k < DIM3; k++) { /* Start for loop */
offset[2] = k;
…
tmp_id = H5Sselect_hyperslab(fspace_id, …, offset, …);
H5Dwrite(dset_id, type_id, H5S_ALL, tmp_id, ..);
H5Sclose(tmp_id);
…
} /* End for loop */
October 15, 2008


47

HDF5 Region References
and Selections

October 15, 2008


48

Saving Selected Region in a File
Need to select and access the same
elements of a dataset

October 15, 2008


49

Reference Datatype
• Reference to an HDF5 object
• Pointer to a group or a dataset in a file
• Predefined datatype H5T_STD_REG_OBJ
describe object references

• Reference to a dataset region (or to
selection)
• Pointer to the dataspace selection
• Predefined datatype
H5T_STD_REF_DSETREG to describe regions

October 15, 2008


50

Reference to Dataset Region
REF_REG.h5
Root

Matrix

Object References

1 1 2 3 3 4 5 5 6
1 2 2 3 4 4 5 6 6

October 15, 2008


51

Example
dsetr_id = H5Dcreate(file_id,
“REGION REFERENCES”, H5T_STD_REF_DSETREG,
…);
H5Sselect_hyperslab(space_id,
H5S_SELECT_SET, start, NULL, …);
H5Rcreate(&ref[0], file_id, “MATRIX”,
H5R_DATASET_REGION, space_id);
H5Dwrite(dsetr_id, H5T_STD_REF_DSETREG,
H5S_ALL, H5S_ALL, H5P_DEFAULT,ref);
October 15, 2008


52

HDF5 "REF_REG.h5" {
GROUP "/" {
DATASET "MATRIX" {
……
}
DATASET "REGION_REFERENCES" {
DATATYPE H5T_REFERENCE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): DATASET /MATRIX {(0,3)-(1,5)},
(1): DATASET /MATRIX {(0,0), (1,6), (0,8)}
}
}
}
}

October 15, 2008


53

Chunking in HDF5

October 15, 2008


54

HDF5 Chunking
• Dataset data is divided into equally sized blocks (chunks).
• Each chunk is stored separately as a contiguous block in
HDF5 file.
Metadata cache

Dataset data

Dataset header
………….
Datatype
Dataspace
………….
Attributes
…

File
October 15, 2008

A

B

C

D

Chunk
index

Application memory

header

Chunk
index

A

C


D

B
55

HDF5 Chunking
• Chunking is needed for
• Enabling compression and other filters
• Extendible datasets

October 15, 2008


56

HDF5 Chunking
• If used appropriately chunking improves partial
I/O for big datasets

Only two chunks are involved in I/O

October 15, 2008


57

HDF5 Chunking
• Chunk has the same rank as a dataset
• Chunk’s dimensions do not need to be factors of
dataset’s dimensions

October 15, 2008


58

Creating Chunked Dataset
1.
2.
3.

Create a dataset creation property list.
Set property list to use chunked storage layout.
Create dataset with the above property list.

crp_id = H5Pcreate(H5P_DATASET_CREATE);
rank = 2;
ch_dims[0] = 100;
ch_dims[1] = 100;
H5Pset_chunk(crp_id, rank, ch_dims);
dset_id = H5Dcreate (…, crp_id);
H5Pclose(crp_id);

October 15, 2008


59

Writing or Reading Chunked Dataset
1.
2.

Chunking mechanism is transparent to application.
Use the same set of operation as for contiguous
dataset, for example,
H5Dopen(…);
H5Sselect_hyperslab (…);
H5Dread(…);

3.

Selections do not need to coincide precisely with the
chunks boundaries.

October 15, 2008


60

HDF5 Filters
•
•

HDF5 filters modify data during I/O operations
Available filters:
1.
2.
3.
4.

Checksum (H5Pset_fletcher32)
Shuffling filter (H5Pset_shuffle)
Data transformation (in 1.8.*)
Compression
•
•
•
•

October 15, 2008

Scale + offset (in 1.8.*)
N-bit (in 1.8.*)
GZIP (deflate), SZIP (H5Pset_deflate, H5Pset_szip)
User-defined filters (BZIP2)
• Example of a user-defined compression filter can be
found
http://www.hdfgroup.uiuc.edu/papers/papers/bzip2/


61

Creating Compressed Dataset
1.
2.
3.
4.

Create a dataset creation property list
Set property list to use chunked storage layout
Set property list to use filters
Create dataset with the above property list

crp_id = H5Pcreate(H5P_DATASET_CREATE);
rank = 2;
ch_dims[0] = 100;
ch_dims[1] = 100;
H5Pset_chunk(crp_id, rank, ch_dims);
H5Pset_deflate(crp_id, 9);
dset_id = H5Dcreate (…, crp_id);
H5Pclose(crp_id);

October 15, 2008


62

Writing Compressed Dataset
Chunked dataset
A
C

Chunk cache (per dataset)
C

B

Filter pipeline

File

B

A

…………..

C

Default chunk cache size is 1MB.
Filters including compression are applied when chunk is evicted from
cache.
Chunks in the file may have different sizes
October 15, 2008


63

Chunking Basics to Remember
•
•

•

Chunking creates storage overhead in the file.
Performance is affected by
• Chunking and compression parameters
• Chunking cache size (H5Pset_cache call)
Some hints for getting better performance
• Use chunk size not smaller than block size (4k) on
a file system.
• Use compression method appropriate for your
data.
• Avoid using selections that do not coincide with
the chunking boundaries.

October 15, 2008


64

Example
Creates a compressed 1000x20 integer dataset in a file
%h5dump –p –H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
GROUP "Data" {
DATASET "Compressed_Data" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 1000, 20 )………
STORAGE_LAYOUT {
CHUNKED ( 20, 20 )
SIZE 5316
}

October 15, 2008


65

Example (continued)
FILTERS {
COMPRESSION DEFLATE { LEVEL 6 }
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE 0
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_INCR
}
}
}
}
}

October 15, 2008


66

Example (bigger chunk)
Creates a compressed integer dataset 1000x20 in a
file; better compression ratio is achieved.
h5dump –p –H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
GROUP "Data" {
DATASET "Compressed_Data" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 1000, 20 )………
STORAGE_LAYOUT {
CHUNKED ( 200, 20 )
SIZE 2936
}
October 15, 2008


67

Part III
Performance Issues
(How to Do it Right)

October 15, 2008


68

Performance of Serial I/O Operations
• Next slides show the performance effects of using
different access patterns and storage layouts.
• We use three test cases which consist of writing a
selection to an array of characters.
• Data is stored in a row-major order.
• Tests were executed on THG Linux x86_64 box
using h5perf_serial and HDF5 version 1.8.0

October 15, 2008


69

Serial Benchmarking Tool
• Benchmarking tool, h5perf_serial, introduced in
1.8.1 release.
• Features inlcude:
• Support for POSIX and HDF5 I/O calls.
• Support for datasets and buffers with multiple
dimensions.
• Entire dataset access using a single or several I/O
operations.
• Selection of contiguous and chunked storage for HDF5
operations.

October 15, 2008


70

Contiguous Storage (Case 1)
• Rectangular dataset of size 48K x
48K, with write selections of 512 x
48K.
• HDF5 storage layout is contiguous.
• Good I/O pattern for POSIX and
HDF5 because each selection is
contiguous.
• POSIX: 5.19 MB/s
• HDF5: 5.36 MB/s

October 15, 2008


1
2
3
4

1

2

71

3

4

Contiguous Storage (Case 2)
• Rectangular dataset of 48K x
48K, with write selections of 48K
x 512.
• HDF5 storage layout is
contiguous.
• Bad I/O pattern for POSIX and
HDF5 because each selection is
noncontiguous.
• HDF5: 0.05 MB/s

October 15, 2008


1

1

2

3

4

2

1

3

2

72

4

3

4

…….

Chunked Storage
• Rectangular dataset of 48K x
48K, with write selections of 48K
x 512.
• HDF5 storage layout is chunked.
Chunks and selections sizes are
equal.
• Bad I/O case for POSIX because
selections are noncontiguous.
• Good I/O case for HDF5 since
selections are contiguous due to
chunking layout settings.
• HDF5: 5.58 MB/s

1


3

4

POSIX
1

2

3

4

1

2

3

4

…….

HDF5
1

October 15, 2008

2

2

3

73

4

Conclusions
• Access patterns with small I/O operations incur
high latency and overhead costs many times.
• Chunked storage may improve I/O performance
by affecting the contiguity of the data selection.

October 15, 2008


74

Writing Chunked Dataset
• 1000x100x100 dataset
• 4 byte integers
• Random values 0-99

• 50x100x100 chunks (20 total)
• Chunk size: 2 MB

• Write the entire dataset using 1x100x100 slices
• Slices are written sequentially

October 15, 2008


75

Test Setup
• 20 Chunks
• 1000 slices
• Chunk size is 2MB

October 15, 2008


76

Test Setup (continued)
• Tests performed with 1 MB and 5MB chunk cache
size
• Cache size set with H5Pset_cache function
H5Pget_cache (fapl, NULL, &rdcc_nelmts,
&rdcc_nbytes, &rdcc_w0);
H5Pset_cache (fapl, 0, rdcc_nelmts,
5*1024*1024, rdcc_w0);

• Tests performed with no compression and with
gzip (deflate) compression

October 15, 2008


77

Effect of Chunk Cache Size on Write
No compression
Cache size

I/O operations

Total data
written

File size

1 MB (default)

1002

75.54 MB

38.15 MB

5 MB

22

38.16 MB

38.15 MB

Gzip compression
Cache size

I/O operations

Total data
written

File size

1 MB (default)

1982

335.42 MB
(322.34 MB read)

13.08 MB

5 MB

22

13.08 MB

13.08 MB

October 15, 2008


78

• With the 1 MB cache size, a chunk will not fit into
the cache
• All writes to the dataset must be immediately
written to disk
• With compression, the entire chunk must be read
and rewritten every time a part of the chunk is
written to
• Data must also be decompressed and
recompressed each time
• Non sequential writes could result in a larger file

• Without compression, the entire chunk must be
written when it is first written to the file
• If the selection were not contiguous on disk, it could
require as much as 1 I/O operation for each element
October 15, 2008


79

• With the 5 MB cache size, the chunk is written
only after it is full
• Drastically reduces the number of I/O operations
• Reduces the amount of data that must be written
(and read)
• Reduces processing time, especially with the
compression filter

October 15, 2008


80

Conclusion
• It is important to make sure that a chunk will fit
into the raw data chunk cache
• If you will be writing to multiple chunks at once,
you should increase the cache size even more
• Try to design chunk dimensions to minimize the
number you will be writing to at once

October 15, 2008


81

Reading Chunked Dataset
• Read the same dataset, again by slices, but the
slices cross through all the chunks
• 2 orientations for read plane
• Plane includes fastest changing dimension
• Plane does not include fastest changing dimension

• Measure total read operations, and total size read
• Chunk sizes of 50x100x100, and 10x100x100
• 1 MB cache

October 15, 2008


82

Test Setup
• Chunks
• Read slices
• Vertical and horizontal

October 15, 2008


83

Results
• Read slice includes fastest changing dimension

Chunk size

Compression

I/O operations

Total data read

50

Yes

2010

1307 MB

10

Yes

10012

1308 MB

50

No

100010

38 MB

10

No

10012

3814 MB

October 15, 2008


84

Results (continued)
• Read slice does not include fastest changing
dimension
Chunk size

Compression

I/O operations

Total data read

50

Yes

2010

1307 MB

10

Yes

10012

1308 MB

50

No

10000010

38 MB

10

No

10012

3814 MB

October 15, 2008


85

Effect of Cache Size on Read
• When compression is enabled, the library must
always read each entire chunk once for each call
to H5Dread.
• When compression is disabled, the library’s
behavior depends on the cache size relative to
the chunk size.
• If the chunk fits in cache, the library reads each
entire chunk once for each call to H5Dread
• If the chunk does not fit in cache, the library reads
only the data that is selected
• More read operations, especially if the read plane
does not include the fastest changing dimension
• Less total data read
October 15, 2008


86

Conclusion
• In this case cache size does not matter when
reading if compression is enabled.
• Without compression, a larger cache may not be
beneficial, unless the cache is large enough to
hold all of the chunks.
• The optimum cache size depends on the exact
shape of the data, as well as the hardware.

October 15, 2008


87

Questions?

October 15, 2008


88

Acknowledgement
• This Tutorial is based upon work supported in part
by a Cooperative Agreement with the National
Aeronautics and Space Administration (NASA)
under NASA Awards NNX06AC83A and
NNX08AO77A. Any opinions, findings, and
conclusions or recommendations expressed in
this material are those of the author(s) and do not
necessarily reflect the views of the National
Aeronautics and Space Administration.

October 15, 2008


89

Advanced HDF5 Features

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (20)

Similaire à Advanced HDF5 Features

Similaire à Advanced HDF5 Features (20)

Plus de The HDF-EOS Tools and Information Center

Plus de The HDF-EOS Tools and Information Center (20)

Dernier

Dernier (20)

Advanced HDF5 Features

Notes de l'éditeur