HDF5 Advanced Topics

October 15, 2008

HDF and HDF-EOS Workshop XII

Outline
• Part I
• Overview of HDF5 datatypes

• Part II
• Partial I/O in HDF5
• Hyperslab selection
• Dataset region references

• Chunking and compression

• Part III
• Performance issues (how to do it right)

Part I
HDF5 Datatypes
Quick overview of the most
difficult topics

HDF5 Datatypes
• HDF5 has a rich set of pre-defined datatypes and
supports the creation of an unlimited variety of
complex user-defined datatypes.
• Datatype definitions are stored in the HDF5 file
with the data.
• Datatype definitions include information such as
byte order (endianness), size, and floating-point
representation to fully describe how the data is
stored and to ensure portability across platforms.
• Datatype definitions can be shared among objects
in an HDF file, providing a powerful and efficient
mechanism for describing data.
Example
Array of integers on a Linux platform:
  native integer is little-endian, 4 bytes (H5T_NATIVE_INT)
Array of integers on a Solaris platform:
  native integer is big-endian; the Fortran compiler's -i8 flag
  sets integers to 8 bytes (H5T_NATIVE_INT)

Both platforms H5Dwrite to, and H5Dread from, the same dataset
stored as a little-endian 4-byte integer (H5T_STD_I32LE); the
library converts between the native and file representations
(even for a platform type such as VAX G-floating).
Storing Variable Length
Data in HDF5

HDF5 Fixed and Variable Length Array Storage

[Figure: fixed-length storage holds the same number of data
elements at every time step; variable-length storage holds a
different number of data elements at each time step.]
Storing Strings in HDF5
• Array of characters
• Access to each character
• Extra work to access and interpret each string
• Fixed length
string_id = H5Tcopy(H5T_C_S1);
H5Tset_size(string_id, size);
• Overhead for short strings
• Can be compressed
• Variable length
string_id = H5Tcopy(H5T_C_S1);
H5Tset_size(string_id, H5T_VARIABLE);
• Overhead as for all VL datatypes
• Compression will not be applied to actual data
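The trade-off between the two string layouts can be sketched with plain arithmetic. The sketch below is illustrative only (it is not HDF5's exact on-disk layout; VL data actually lives in a global heap with its own bookkeeping): fixed-length strings are padded to a common size, while variable-length strings store only their bytes plus a per-element descriptor.

```c
#include <stddef.h>
#include <string.h>

/* Bytes used by n fixed-length strings: every element is padded
 * to the same fixed size. */
static size_t fixed_bytes(size_t nstrings, size_t fixed_size)
{
    return nstrings * fixed_size;
}

/* Bytes used by n variable-length strings: actual characters plus
 * a per-element descriptor, approximated here by hvl_t's layout
 * (a length and a pointer). */
static size_t vl_bytes(const char *strs[], size_t nstrings)
{
    size_t total = 0;
    for (size_t i = 0; i < nstrings; i++)
        total += strlen(strs[i]) +
                 sizeof(size_t) + sizeof(void *);
    return total;
}
```

For many short strings the fixed-size padding dominates; for a few long strings the per-element descriptor is negligible.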

Storing Variable Length Data in HDF5
• Each element is represented by a C structure
    typedef struct {
        size_t len;   /* length of the VL element */
        void   *p;    /* pointer to the VL data   */
    } hvl_t;

• Base type can be any HDF5 type
    H5Tvlen_create(base_type)

Example
    hvl_t data[LENGTH];

    for(i = 0; i < LENGTH; i++) {
        data[i].p = HDmalloc((i+1)*sizeof(unsigned int));
        data[i].len = i+1;
    }
    tvl = H5Tvlen_create (H5T_NATIVE_UINT);

[Figure: data[0].p points to a single element; data[4].len is 5
and data[4].p points to five elements.]
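The ragged-array pattern above can be exercised without the HDF5 library. In this standalone sketch, hvl_t is redeclared locally so the code compiles without the HDF5 headers (the layout mirrors HDF5's hvl_t), and free_vl plays the role that H5Dvlen_reclaim plays for data the library allocated.

```c
#include <stdlib.h>

/* Local stand-in for HDF5's hvl_t, so this compiles standalone. */
typedef struct {
    size_t len;
    void  *p;
} hvl_t;

enum { LENGTH = 5 };

/* Fill data[i] with i+1 unsigned ints (0, 1, ..., i), as in the
 * slide's write example. */
static void fill_vl(hvl_t data[LENGTH])
{
    for (size_t i = 0; i < LENGTH; i++) {
        unsigned *buf = malloc((i + 1) * sizeof(unsigned));
        for (size_t j = 0; j <= i; j++)
            buf[j] = (unsigned)j;
        data[i].p = buf;
        data[i].len = i + 1;
    }
}

/* Counterpart of H5Dvlen_reclaim for this sketch. */
static void free_vl(hvl_t data[LENGTH])
{
    for (size_t i = 0; i < LENGTH; i++)
        free(data[i].p);
}
```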
Reading HDF5 Variable Length Array
On read, the HDF5 Library allocates memory to read the data into;
the application only needs to allocate the array of hvl_t
elements (pointers and lengths).

    hvl_t rdata[LENGTH];

    /* Discover the type in the file */
    tvl = H5Tvlen_create (H5T_NATIVE_UINT);
    ret = H5Dread(dataset, tvl, H5S_ALL, H5S_ALL,
                  H5P_DEFAULT, rdata);
    /* Reclaim the read VL data */
    H5Dvlen_reclaim(tvl, H5S_ALL, H5P_DEFAULT, rdata);
Storing Tables in HDF5 file

Example

    a_name      b_name    c_name
    (integer)   (float)   (double)
    0           0.        1.0000
    1           1.        0.5000
    2           4.        0.3333
    3           9.        0.2500
    4           16.       0.2000
    5           25.       0.1667
    6           36.       0.1429
    7           49.       0.1250
    8           64.       0.1111
    9           81.       0.1000

Multiple ways to store a table:
• Dataset for each field
• Dataset with compound datatype
• If all fields have the same type:
  • 2-dim array
  • 1-dim array of array datatype

Choose to achieve your goal!
• How much overhead will each type of storage create?
• Do I always read all fields?
• Do I need to read some fields more often?
• Do I want to use compression?
• Do I want to access some records?
HDF5 Compound Datatypes
• Compound types
• Comparable to C structs
• Members can be atomic or compound
types
• Members can be multidimensional
• Can be written/read by a field or set of
fields
• Not all data filters can be applied (shuffling,
SZIP)

HDF5 Compound Datatypes
• Which APIs to use?
  • H5TB APIs
    • Create, read, get info and merge tables
    • Add, delete, and append records
    • Insert and delete fields
    • Limited control over table's properties (i.e. only GZIP
      compression, level 6, default allocation time for table,
      extendible, etc.)
  • PyTables http://www.pytables.org
    • Based on H5TB
    • Python interface
    • Indexing capabilities
  • HDF5 APIs
    • H5Tcreate(H5T_COMPOUND), H5Tinsert calls to create a
      compound datatype
    • H5Dcreate, etc.
    • See H5Tget_member* functions for discovering properties
      of the HDF5 compound datatype
Creating and Writing Compound Dataset
h5_compound.c example
    typedef struct s1_t {
        int    a;
        float  b;
        double c;
    } s1_t;

    s1_t s1[LENGTH];
Creating and Writing Compound Dataset
/* Create datatype in memory. */
s1_tid = H5Tcreate (H5T_COMPOUND, sizeof(s1_t));
H5Tinsert(s1_tid, "a_name", HOFFSET(s1_t, a),
H5T_NATIVE_INT);
H5Tinsert(s1_tid, "c_name", HOFFSET(s1_t, c),
H5T_NATIVE_DOUBLE);
H5Tinsert(s1_tid, "b_name", HOFFSET(s1_t, b),
H5T_NATIVE_FLOAT);

Note:
• Use HOFFSET macro instead of calculating offset by hand.
• Order of H5Tinsert calls is not important if HOFFSET is used.
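The HOFFSET advice can be checked with plain C: HOFFSET is essentially a wrapper around the standard offsetof macro, and the compiler may insert padding between members, which is exactly why hand-computed offsets are fragile. A standalone sketch using the s1_t struct from the previous slide:

```c
#include <stddef.h>

/* The struct from the h5_compound.c example. */
typedef struct s1_t {
    int    a;
    float  b;
    double c;
} s1_t;

/* Sum of the member sizes, ignoring any compiler padding. */
static size_t packed_size(void)
{
    return sizeof(int) + sizeof(float) + sizeof(double);
}
```

offsetof (and therefore HOFFSET) reports the real member offsets including padding; sizeof(s1_t) can exceed packed_size(), which is what H5Tpack later removes from the file representation.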

Creating and Writing Compound Dataset
/* Create dataset and write data */
dataset = H5Dcreate(file, DATASETNAME, s1_tid, space,
H5P_DEFAULT);
status = H5Dwrite(dataset, s1_tid, H5S_ALL, H5S_ALL,
H5P_DEFAULT, s1);
Note:
• In this example memory and file datatypes are the same.
• Type is not packed.
• Use H5Tpack to save space in the file. H5Tpack modifies the
  datatype in place, so pack a copy:
    s2_tid = H5Tcopy(s1_tid);
    status = H5Tpack(s2_tid);
    dataset = H5Dcreate(file, DATASETNAME, s2_tid, space,
                        H5P_DEFAULT);
File Content with h5dump
HDF5 "SDScompound.h5" {
GROUP "/" {
DATASET "ArrayOfStructures" {
DATATYPE {
H5T_STD_I32BE "a_name";
H5T_IEEE_F32BE "b_name";
H5T_IEEE_F64BE "c_name"; }
DATASPACE { SIMPLE ( 10 ) / ( 10 ) }
DATA {
{
[ 0 ],
[ 0 ],
[ 1 ]
},
{
[ 1 ],
…
Reading Compound Dataset
/* Open dataset, discover its type, and read data. */
dataset = H5Dopen(file, DATASETNAME);
s2_tid  = H5Dget_type(dataset);
mem_tid = H5Tget_native_type(s2_tid, H5T_DIR_ASCEND);
s1 = malloc(H5Tget_size(mem_tid) * number_of_elements);
status  = H5Dread(dataset, mem_tid, H5S_ALL,
                  H5S_ALL, H5P_DEFAULT, s1);
Note:
• We could construct the memory type as we did in the writing
  example.
• For general applications we need to discover the type in the
  file, find the corresponding memory type, allocate space, and
  read.

Reading Compound Dataset by Fields
typedef struct s2_t {
    double c;
    int    a;
} s2_t;
s2_t s2[LENGTH];
…
s2_tid = H5Tcreate (H5T_COMPOUND, sizeof(s2_t));
H5Tinsert(s2_tid, "c_name", HOFFSET(s2_t, c),
          H5T_NATIVE_DOUBLE);
H5Tinsert(s2_tid, "a_name", HOFFSET(s2_t, a),
          H5T_NATIVE_INT);
…
status = H5Dread(dataset, s2_tid, H5S_ALL,
                 H5S_ALL, H5P_DEFAULT, s2);

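What the library does during a read by fields can be mimicked in plain C: given full s1_t records in memory, copy just the c and a members into the smaller s2_t records. This is only a sketch of the gather step HDF5's type conversion performs for you; s1_t and s2_t are the structs from the surrounding slides.

```c
#include <stddef.h>

typedef struct s1_t { int a; float b; double c; } s1_t;
typedef struct s2_t { double c; int a; } s2_t;

/* Copy only the 'a' and 'c' fields of each full record into the
 * reduced record, the way HDF5 gathers a field subset on read. */
static void gather_fields(const s1_t *src, s2_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i].c = src[i].c;
        dst[i].a = src[i].a;
    }
}
```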
New Way of Creating Datatypes
Another way to create a compound datatype
#include "H5LTpublic.h"
…
s2_tid = H5LTtext_to_dtype(
    "H5T_COMPOUND {H5T_NATIVE_DOUBLE \"c_name\"; "
    "H5T_NATIVE_INT \"a_name\";}",
    H5LT_DDL);

Need Help with Datatypes?
Check our support web pages

http://www.hdfgroup.uiuc.edu/UserSupport/example
Part II
Working with subsets

Collect data one way ….
Array of images (3D)

Display data another way …

Stitched image (2D array)

Data is too big to read….

Refer to a region…
Need to select and access the same
elements of a dataset

HDF5 Library Features
• HDF5 Library provides capabilities to
• Describe subsets of data and perform write/read
operations on subsets
• Hyperslab selections and partial I/O

• Store descriptions of the data subsets in a file
• Object references
• Region references

• Use efficient storage mechanism to achieve good
performance while writing/reading subsets of data
• Chunking, compression

Partial I/O in HDF5

How to Describe a Subset in HDF5?
• Before writing and reading a subset of data
one has to describe it to the HDF5 Library.
• HDF5 APIs and documentation refer to a
subset as a “selection” or “hyperslab
selection”.
• If specified, HDF5 Library will perform I/O on a
selection only and not on all elements of a
dataset.

Types of Selections in HDF5
• Two types of selections
• Hyperslab selection
• Regular hyperslab
• Simple hyperslab
• Result of set operations on hyperslabs (union,
difference, …)

• Point selection

• Hyperslab selection is especially important for
doing parallel I/O in HDF5 (See Parallel HDF5
Tutorial)

Regular Hyperslab

Collection of regularly spaced equal size blocks
Simple Hyperslab

Contiguous subset or sub-array
Hyperslab Selection

Result of union operation on three simple hyperslabs
Hyperslab Description
• Offset - starting location of a hyperslab (1,1)
• Stride - number of elements that separate each
block (3,2)
• Count - number of blocks (2,6)
• Block - block size (2,1)
• Everything is “measured” in number of elements

Simple Hyperslab Description
• Two ways to describe a simple hyperslab
• As several blocks
• Stride – (2,1)
• Count – (2,6)
• Block – (2,1)

• As one block
• Stride – (1,1)
• Count – (1,1)
• Block – (4,6)

No performance penalty for
one way or another
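The equivalence of the two descriptions can be checked per dimension in plain C. Note that HDF5 requires the stride to be at least the block size in each dimension (blocks cannot overlap), so a 4-row extent splits into two 2-row blocks with stride 2, or into one 4-row block. This sketch expands a 1-D offset/stride/count/block description into a selection mask; it is illustrative, not HDF5's internal representation.

```c
#include <stdbool.h>
#include <string.h>

#define DIM 8

/* Mark the elements a 1-D hyperslab description selects:
 * count blocks of block elements, block starts stride apart. */
static void select_1d(bool mask[DIM], int offset, int stride,
                      int count, int block)
{
    memset(mask, 0, DIM * sizeof(bool));
    for (int c = 0; c < count; c++)
        for (int b = 0; b < block; b++)
            mask[offset + c * stride + b] = true;
}
```

Expanding both descriptions of the row dimension shows they select exactly the same elements, which is why there is no performance penalty for either form.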
H5Sselect_hyperslab Function

space_id  Identifier of dataspace
op        Selection operator
          H5S_SELECT_SET or H5S_SELECT_OR
offset    Array with starting coordinates of hyperslab
stride    Array specifying which positions along a dimension
          to select
count     Array specifying how many blocks to select from the
          dataspace, in each dimension
block     Array specifying size of element block
          (NULL indicates a block size of a single element in
          a dimension)
Reading/Writing Selections
Programming model for reading from a dataset in a file:
1. Open a dataset.
2. Get file dataspace handle of the dataset and specify the
   subset to read from.
   a. H5Dget_space returns the file dataspace handle.
      The file dataspace describes the array stored in the file
      (number of dimensions and their sizes).
   b. H5Sselect_hyperslab selects the elements of the array
      that participate in the I/O operation.
3. Allocate a data buffer of an appropriate shape and size.
Reading/Writing Selections
Programming model (continued):
4. Create a memory dataspace and specify the subset to write to.
   a. The memory dataspace describes the data buffer (its rank
      and dimension sizes).
   b. Use H5Screate_simple to create the memory dataspace.
   c. Use H5Sselect_hyperslab to select the elements of the data
      buffer that participate in the I/O operation.
5. Issue H5Dread or H5Dwrite to move the data between the file
   and the memory buffer.
6. Close the file dataspace and memory dataspace when done.
Example: Reading Two Rows

Data in a file: 4x6 matrix

     1   2   3   4   5   6
     7   8   9  10  11  12
    13  14  15  16  17  18
    19  20  21  22  23  24

Buffer in memory: 1-dim array of length 14

    -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
Example: Reading Two Rows

Select the second and third rows of the 4x6 matrix:

    offset = {1,0}
    count  = {2,6}
    block  = {1,1}
    stride = {1,1}

    filespace = H5Dget_space (dataset);
    H5Sselect_hyperslab (filespace, H5S_SELECT_SET,
                         offset, NULL, count, NULL);
Example: Reading Two Rows

Select 12 elements of the memory buffer, starting at offset 1:

    offset = {1}
    count  = {12}
    dims   = {14}

    memspace = H5Screate_simple(1, dims, NULL);
    H5Sselect_hyperslab (memspace, H5S_SELECT_SET,
                         offset, NULL, count, NULL);
Example: Reading Two Rows

    H5Dread (…, …, memspace, filespace, …, …);

Buffer in memory after the read:

    -1   7   8   9  10  11  12  13  14  15  16  17  18  -1
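The data movement in this example can be simulated in plain C with no HDF5 at all: copy the {1,0}/{2,6} file selection of a 4x6 matrix into a 14-element buffer starting at offset 1. This sketch only mimics what the library does when the memspace and filespace selections are paired.

```c
/* Simulate H5Dread with the selections from the example:
 * file selection   = rows 1..2 of a 4x6 matrix (12 elements),
 * memory selection = elements 1..12 of a 14-element buffer. */
enum { ROWS = 4, COLS = 6, BUF = 14 };

static void read_two_rows(int file[ROWS][COLS], int mem[BUF])
{
    int dst = 1;                      /* memory offset = {1}  */
    for (int r = 1; r <= 2; r++)      /* file offset = {1,0}, */
        for (int c = 0; c < COLS; c++)    /* count = {2,6}    */
            mem[dst++] = file[r][c];
}
```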
Things to Remember
• Number of elements selected in a file and in a
memory buffer should be the same
• H5Sget_select_npoints returns number of
selected elements in a hyperslab selection

• HDF5 partial I/O is tuned to move data between
selections that have the same dimensionality;
avoid choosing subsets that have different ranks
(as in example above)
• Allocate a buffer of an appropriate size when
reading data; use H5Tget_native_type and
H5Tget_size to get the correct size of the data
element in memory.
Things to Remember
• When selecting hyperslabs in a loop, close the dataspace
  handle obtained inside the loop (e.g. from H5Dget_space) to
  avoid application memory growth.
  Only the offset parameter changes; the block and stride
  parameters stay the same.
Example
    offset[0] = 0;
    offset[1] = 0;
    for (k=0; k < DIM3; k++) { /* Start for loop */
        offset[2] = k;
        …
        fspace_id = H5Dget_space(dset_id);
        H5Sselect_hyperslab(fspace_id, H5S_SELECT_SET,
                            offset, NULL, count, NULL);
        H5Dwrite(dset_id, type_id, H5S_ALL, fspace_id,
                 H5P_DEFAULT, buf);
        H5Sclose(fspace_id);
        …
    } /* End for loop */

Note: H5Sselect_hyperslab modifies the dataspace in place and
returns herr_t; the handle to close each iteration is the one
returned by H5Dget_space.
HDF5 Region References
and Selections

Saving Selected Region in a File
Need to select and access the same
elements of a dataset

Reference Datatype
• Reference to an HDF5 object
  • Pointer to a group or a dataset in a file
  • Predefined datatype H5T_STD_REF_OBJ
    describes object references
• Reference to a dataset region (or to a selection)
  • Pointer to the dataspace selection
  • Predefined datatype H5T_STD_REF_DSETREG
    describes region references
Reference to Dataset Region
[Figure: file REF_REG.h5; under the root group, a dataset
"Matrix" and a dataset of region references pointing into it:
    1 1 2 3 3 4 5 5 6
    1 2 2 3 4 4 5 6 6 ]
Reference to Dataset Region
Example
    dsetr_id = H5Dcreate(file_id,
        "REGION_REFERENCES", H5T_STD_REF_DSETREG, …);
    H5Sselect_hyperslab(space_id,
        H5S_SELECT_SET, start, NULL, …);
    H5Rcreate(&ref[0], file_id, "MATRIX",
        H5R_DATASET_REGION, space_id);
    H5Dwrite(dsetr_id, H5T_STD_REF_DSETREG,
        H5S_ALL, H5S_ALL, H5P_DEFAULT, ref);
Reference to Dataset Region
HDF5 "REF_REG.h5" {
GROUP "/" {
DATASET "MATRIX" {
……
}
DATASET "REGION_REFERENCES" {
DATATYPE H5T_REFERENCE
DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
DATA {
(0): DATASET /MATRIX {(0,3)-(1,5)},
(1): DATASET /MATRIX {(0,0), (1,6), (0,8)}
}
}
}
}

Chunking in HDF5

HDF5 Chunking
• Dataset data is divided into equally sized blocks (chunks).
• Each chunk is stored separately as a contiguous block in the
  HDF5 file.

[Figure: the dataset header (datatype, dataspace, attributes, …)
points to a chunk index; the index locates chunks A, B, C, D,
which may sit anywhere in the file in any order. The metadata
cache in application memory holds the header and chunk index.]
HDF5 Chunking
• Chunking is needed for
• Enabling compression and other filters
• Extendible datasets

HDF5 Chunking
• If used appropriately chunking improves partial
I/O for big datasets

Only two chunks are involved in I/O

HDF5 Chunking
• Chunk has the same rank as a dataset
• Chunk’s dimensions do not need to be factors of
dataset’s dimensions

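Because chunk dimensions need not divide the dataset dimensions evenly, the number of chunks along each dimension is a ceiling division; a partial edge chunk still occupies a whole chunk. A small sketch of that arithmetic (illustrative only, not an HDF5 API):

```c
/* Number of chunks needed along one dimension: ceiling division,
 * since a partial edge chunk still counts as a whole chunk. */
static unsigned long chunks_along(unsigned long dim_size,
                                  unsigned long chunk_size)
{
    return (dim_size + chunk_size - 1) / chunk_size;
}
```

The total chunk count is the product over all dimensions; for the 1000x20 dataset with 200x20 chunks used later in the examples, that is 5 x 1 = 5 chunks.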
Creating Chunked Dataset
1. Create a dataset creation property list.
2. Set property list to use chunked storage layout.
3. Create dataset with the above property list.

    crp_id = H5Pcreate(H5P_DATASET_CREATE);
    rank = 2;
    ch_dims[0] = 100;
    ch_dims[1] = 100;
    H5Pset_chunk(crp_id, rank, ch_dims);
    dset_id = H5Dcreate (…, crp_id);
    H5Pclose(crp_id);
Writing or Reading Chunked Dataset
1. Chunking mechanism is transparent to the application.
2. Use the same set of operations as for a contiguous dataset,
   for example:
       H5Dopen(…);
       H5Sselect_hyperslab (…);
       H5Dread(…);
3. Selections do not need to coincide precisely with the chunk
   boundaries.
HDF5 Filters
• HDF5 filters modify data during I/O operations
• Available filters:
  1. Checksum (H5Pset_fletcher32)
  2. Shuffling filter (H5Pset_shuffle)
  3. Data transformation (in 1.8.*)
  4. Compression
     • Scale + offset (in 1.8.*)
     • N-bit (in 1.8.*)
     • GZIP (deflate), SZIP (H5Pset_deflate, H5Pset_szip)
     • User-defined filters (BZIP2)
       • An example of a user-defined compression filter can be
         found at
         http://www.hdfgroup.uiuc.edu/papers/papers/bzip2/
Creating Compressed Dataset
1. Create a dataset creation property list.
2. Set property list to use chunked storage layout.
3. Set property list to use filters.
4. Create dataset with the above property list.

    crp_id = H5Pcreate(H5P_DATASET_CREATE);
    rank = 2;
    ch_dims[0] = 100;
    ch_dims[1] = 100;
    H5Pset_chunk(crp_id, rank, ch_dims);
    H5Pset_deflate(crp_id, 9);
    dset_id = H5Dcreate (…, crp_id);
    H5Pclose(crp_id);
Writing Compressed Dataset

[Figure: chunks of a chunked dataset pass through a per-dataset
chunk cache and the filter pipeline on their way to the file.]

• Default chunk cache size is 1 MB.
• Filters, including compression, are applied when a chunk is
  evicted from the cache.
• Chunks in the file may have different sizes.
Chunking Basics to Remember
• Chunking creates storage overhead in the file.
• Performance is affected by
  • Chunking and compression parameters
  • Chunk cache size (H5Pset_cache call)
• Some hints for getting better performance
  • Use a chunk size not smaller than the block size (4k) of
    the file system.
  • Use a compression method appropriate for your data.
  • Avoid using selections that do not coincide with the chunk
    boundaries.
Example
Creates a compressed 1000x20 integer dataset in a file
% h5dump -p -H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
GROUP "Data" {
DATASET "Compressed_Data" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 1000, 20 )………
STORAGE_LAYOUT {
CHUNKED ( 20, 20 )
SIZE 5316
}

Example (continued)
FILTERS {
COMPRESSION DEFLATE { LEVEL 6 }
}
FILLVALUE {
FILL_TIME H5D_FILL_TIME_IFSET
VALUE 0
}
ALLOCATION_TIME {
H5D_ALLOC_TIME_INCR
}
}
}
}
}

Example (bigger chunk)
Creates a compressed integer dataset 1000x20 in a
file; better compression ratio is achieved.
h5dump –p –H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
GROUP "Data" {
DATASET "Compressed_Data" {
DATATYPE H5T_STD_I32BE
DATASPACE SIMPLE { ( 1000, 20 )………
STORAGE_LAYOUT {
CHUNKED ( 200, 20 )
SIZE 2936
}
Part III
Performance Issues
(How to Do it Right)

Performance of Serial I/O Operations
• Next slides show the performance effects of using
different access patterns and storage layouts.
• We use three test cases which consist of writing a
selection to an array of characters.
• Data is stored in a row-major order.
• Tests were executed on THG Linux x86_64 box
using h5perf_serial and HDF5 version 1.8.0

Serial Benchmarking Tool
• Benchmarking tool, h5perf_serial, introduced in
1.8.1 release.
• Features include:
• Support for POSIX and HDF5 I/O calls.
• Support for datasets and buffers with multiple
dimensions.
• Entire dataset access using a single or several I/O
operations.
• Selection of contiguous and chunked storage for HDF5
operations.

Contiguous Storage (Case 1)
• Rectangular dataset of size 48K x 48K, with write selections
  of 512 x 48K.
• HDF5 storage layout is contiguous.
• Good I/O pattern for POSIX and HDF5 because each selection is
  contiguous.
• POSIX: 5.19 MB/s
• HDF5:  5.36 MB/s

[Figure: row-block selections 1-4 map to contiguous runs 1-4 in
the file.]
Contiguous Storage (Case 2)
• Rectangular dataset of 48K x 48K, with write selections of
  48K x 512.
• HDF5 storage layout is contiguous.
• Bad I/O pattern for POSIX and HDF5 because each selection is
  noncontiguous.
• POSIX: 1.24 MB/s
• HDF5:  0.05 MB/s

[Figure: each column-block selection scatters into interleaved
pieces 1, 2, 3, 4, … throughout the file.]
Chunked Storage
• Rectangular dataset of 48K x 48K, with write selections of
  48K x 512.
• HDF5 storage layout is chunked. Chunk and selection sizes are
  equal.
• Bad I/O case for POSIX because selections are noncontiguous.
• Good I/O case for HDF5 since selections are contiguous due to
  the chunked layout.
• POSIX: 1.51 MB/s
• HDF5:  5.58 MB/s

[Figure: POSIX sees each selection as scattered pieces 1, 2, 3,
4, …; in the HDF5 file each chunk 1-4 is a single contiguous
block.]
Conclusions
• Access patterns made of many small I/O operations pay the
  latency and overhead costs many times over.
• Chunked storage may improve I/O performance by making the
  data selection contiguous on disk.
Writing Chunked Dataset
• 1000x100x100 dataset
• 4 byte integers
• Random values 0-99

• 50x100x100 chunks (20 total)
• Chunk size: 2 MB

• Write the entire dataset using 1x100x100 slices
• Slices are written sequentially

Test Setup
• 20 Chunks
• 1000 slices
• Chunk size is 2MB

Test Setup (continued)
• Tests performed with 1 MB and 5MB chunk cache
size
• Cache size set with H5Pset_cache function
H5Pget_cache (fapl, NULL, &rdcc_nelmts,
&rdcc_nbytes, &rdcc_w0);
H5Pset_cache (fapl, 0, rdcc_nelmts,
5*1024*1024, rdcc_w0);

• Tests performed with no compression and with
gzip (deflate) compression

Effect of Chunk Cache Size on Write

No compression
Cache size       I/O operations   Total data written   File size
1 MB (default)   1002             75.54 MB             38.15 MB
5 MB             22               38.16 MB             38.15 MB

Gzip compression
Cache size       I/O operations   Total data written   File size
1 MB (default)   1982             335.42 MB            13.08 MB
                                  (322.34 MB read)
5 MB             22               13.08 MB             13.08 MB
Effect of Chunk Cache Size on Write
• With the 1 MB cache size, a chunk will not fit into the cache
  • All writes to the dataset must be immediately written to
    disk
  • With compression, the entire chunk must be read and
    rewritten every time a part of the chunk is written to
    • Data must also be decompressed and recompressed each time
    • Non-sequential writes could result in a larger file
  • Without compression, the entire chunk must be written when
    it is first written to the file
    • If the selection were not contiguous on disk, it could
      require as much as one I/O operation for each element
Effect of Chunk Cache Size on Write
• With the 5 MB cache size, the chunk is written
only after it is full
• Drastically reduces the number of I/O operations
• Reduces the amount of data that must be written
(and read)
• Reduces processing time, especially with the
compression filter

October 15, 2008

HDF and HDF-EOS Workshop XII

80
Conclusion
• It is important to make sure that a chunk will fit
into the raw data chunk cache
• If you will be writing to multiple chunks at once,
you should increase the cache size even more
• Try to design chunk dimensions to minimize the
number you will be writing to at once

October 15, 2008

HDF and HDF-EOS Workshop XII

81
Reading Chunked Dataset
• Read the same dataset, again by slices, but the
slices cross through all the chunks
• 2 orientations for read plane
• Plane includes fastest changing dimension
• Plane does not include fastest changing dimension

• Measure total read operations, and total size read
• Chunk sizes of 50x100x100, and 10x100x100
• 1 MB cache

October 15, 2008

HDF and HDF-EOS Workshop XII

82
Test Setup
• Chunks
• Read slices
• Vertical and horizontal

Results
• Read slice includes fastest changing dimension

Chunk size   Compression   I/O operations   Total data read
50           Yes           2010             1307 MB
10           Yes           10012            1308 MB
50           No            100010           38 MB
10           No            10012            3814 MB
84
Results (continued)
• Read slice does not include fastest changing dimension

Chunk size   Compression   I/O operations   Total data read
50           Yes           2010             1307 MB
10           Yes           10012            1308 MB
50           No            10000010         38 MB
10           No            10012            3814 MB
85
Effect of Cache Size on Read
• When compression is enabled, the library must
always read each entire chunk once for each call
to H5Dread.
• When compression is disabled, the library’s
behavior depends on the cache size relative to
the chunk size.
• If the chunk fits in cache, the library reads each
entire chunk once for each call to H5Dread
• If the chunk does not fit in cache, the library reads
only the data that is selected
• More read operations, especially if the read plane
does not include the fastest changing dimension
• Less total data read
Conclusion
• In this case cache size does not matter when
reading if compression is enabled.
• Without compression, a larger cache may not be
beneficial, unless the cache is large enough to
hold all of the chunks.
• The optimum cache size depends on the exact
shape of the data, as well as the hardware.

October 15, 2008

HDF and HDF-EOS Workshop XII

87
Questions?

Acknowledgement
• This Tutorial is based upon work supported in part
by a Cooperative Agreement with the National
Aeronautics and Space Administration (NASA)
under NASA Awards NNX06AC83A and
NNX08AO77A. Any opinions, findings, and
conclusions or recommendations expressed in
this material are those of the author(s) and do not
necessarily reflect the views of the National
Aeronautics and Space Administration.



Advanced HDF5 Features
Advanced HDF5 FeaturesAdvanced HDF5 Features
Advanced HDF5 Features
 
HDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, Datatypes
HDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, DatatypesHDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, Datatypes
HDF5 Advanced Topics - Object's Properties, Storage Methods, Filters, Datatypes
 
HDF5 Advanced Topics
HDF5 Advanced TopicsHDF5 Advanced Topics
HDF5 Advanced Topics
 
Hdf5 parallel
Hdf5 parallelHdf5 parallel
Hdf5 parallel
 
Parallel HDF5 Introductory Tutorial
Parallel HDF5 Introductory TutorialParallel HDF5 Introductory Tutorial
Parallel HDF5 Introductory Tutorial
 
HDF5 iRODS
HDF5 iRODSHDF5 iRODS
HDF5 iRODS
 
Introduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIsIntroduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIs
 
Introduction to HDF5
Introduction to HDF5Introduction to HDF5
Introduction to HDF5
 
UML Representation of NPOESS Data Products in HDF5
UML Representation of NPOESS Data Products in HDF5UML Representation of NPOESS Data Products in HDF5
UML Representation of NPOESS Data Products in HDF5
 
Using HDF5 tools for performance tuning and troubleshooting
Using HDF5 tools for performance tuning and troubleshootingUsing HDF5 tools for performance tuning and troubleshooting
Using HDF5 tools for performance tuning and troubleshooting
 
HDF Cloud: HDF5 at Scale
HDF Cloud: HDF5 at ScaleHDF Cloud: HDF5 at Scale
HDF Cloud: HDF5 at Scale
 
HDF5 Life cycle of data
HDF5 Life cycle of dataHDF5 Life cycle of data
HDF5 Life cycle of data
 
HDF Update for DAAC Managers (2017-02-27)
HDF Update for DAAC Managers (2017-02-27)HDF Update for DAAC Managers (2017-02-27)
HDF Update for DAAC Managers (2017-02-27)
 
HDF5 Tools Updates
HDF5 Tools UpdatesHDF5 Tools Updates
HDF5 Tools Updates
 
HDF Update
HDF UpdateHDF Update
HDF Update
 
Performance Tuning in HDF5
Performance Tuning in HDF5 Performance Tuning in HDF5
Performance Tuning in HDF5
 
H5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only LibraryH5Coro: The Cloud-Optimized Read-Only Library
H5Coro: The Cloud-Optimized Read-Only Library
 
Dimension Scales in HDF-EOS2 and HDF-EOS5
Dimension Scales in HDF-EOS2 and HDF-EOS5 Dimension Scales in HDF-EOS2 and HDF-EOS5
Dimension Scales in HDF-EOS2 and HDF-EOS5
 
HDF for the Cloud - New HDF Server Features
HDF for the Cloud - New HDF Server FeaturesHDF for the Cloud - New HDF Server Features
HDF for the Cloud - New HDF Server Features
 
Cloud-Optimized HDF5 Files
Cloud-Optimized HDF5 FilesCloud-Optimized HDF5 Files
Cloud-Optimized HDF5 Files
 

Plus de The HDF-EOS Tools and Information Center

STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...The HDF-EOS Tools and Information Center
 

Plus de The HDF-EOS Tools and Information Center (20)

Accessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDSAccessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDS
 
The State of HDF
The State of HDFThe State of HDF
The State of HDF
 
Highly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance FeaturesHighly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance Features
 
Creating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 FilesCreating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 Files
 
HDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance DiscussionHDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance Discussion
 
Hyrax: Serving Data from S3
Hyrax: Serving Data from S3Hyrax: Serving Data from S3
Hyrax: Serving Data from S3
 
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLABAccessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
 
HDF - Current status and Future Directions
HDF - Current status and Future DirectionsHDF - Current status and Future Directions
HDF - Current status and Future Directions
 
HDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and FutureHDFEOS.org User Analsys, Updates, and Future
HDFEOS.org User Analsys, Updates, and Future
 
HDF - Current status and Future Directions
HDF - Current status and Future Directions HDF - Current status and Future Directions
HDF - Current status and Future Directions
 
MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10
 
HDF for the Cloud - Serverless HDF
HDF for the Cloud - Serverless HDFHDF for the Cloud - Serverless HDF
HDF for the Cloud - Serverless HDF
 
HDF5 <-> Zarr
HDF5 <-> ZarrHDF5 <-> Zarr
HDF5 <-> Zarr
 
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
Apache Drill and Unidata THREDDS Data Server for NASA HDF-EOS on S3
 
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
 
HDF5 and Ecosystem: What Is New?
HDF5 and Ecosystem: What Is New?HDF5 and Ecosystem: What Is New?
HDF5 and Ecosystem: What Is New?
 
HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020HDF5 Roadmap 2019-2020
HDF5 Roadmap 2019-2020
 
Leveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software TestingLeveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software Testing
 
Google Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOSGoogle Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOS
 
Parallel Computing with HDF Server
Parallel Computing with HDF ServerParallel Computing with HDF Server
Parallel Computing with HDF Server
 

Dernier

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Dernier (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Advanced HDF5 Features

  • 1. HDF5 Advanced Topics October 15, 2008 HDF and HDF-EOS Workshop XII 1
  • 2. Outline • Part I • Overview of HDF5 datatypes • Part II • Partial I/O in HDF5 • Hyperslab selection • Dataset region references • Chunking and compression • Part III • Performance issues (how to do it right) October 15, 2008 HDF and HDF-EOS Workshop XII 2
  • 3. Part I HDF5 Datatypes Quick overview of the most difficult topics October 15, 2008 HDF and HDF-EOS Workshop XII 3
  • 4. HDF5 Datatypes • HDF5 has a rich set of pre-defined datatypes and supports the creation of an unlimited variety of complex user-defined datatypes. • Datatype definitions are stored in the HDF5 file with the data. • Datatype definitions include information such as byte order (endianness), size, and floating-point representation to fully describe how the data is stored and to ensure portability across platforms. • Datatype definitions can be shared among objects in an HDF5 file, providing a powerful and efficient mechanism for describing data. October 15, 2008 HDF and HDF-EOS Workshop XII 4
  • 5. Example Array of integers on Linux platform Native integer is little-endian, 4 bytes Array of integers on Solaris platform Native integer is big-endian, Fortran compiler uses -i8 flag to set integer to 8 bytes H5T_NATIVE_INT H5T_NATIVE_INT Little-endian 4-byte integer H5Dwrite H5Dread H5Dwrite H5T_STD_I32LE October 15, 2008 HDF and HDF-EOS Workshop XII VAX G-floating 5
  • 6. Storing Variable Length Data in HDF5 October 15, 2008 HDF and HDF-EOS Workshop XII 6
  • 7. HDF5 Fixed and Variable Length Array Storage •Data •Data Time •Data •Data •Data •Data Time •Data •Data •Data October 15, 2008 HDF and HDF-EOS Workshop XII 7
  • 8. Storing Strings in HDF5 • Array of characters • Access to each character • Extra work to access and interpret each string • Fixed length string_id = H5Tcopy(H5T_C_S1); H5Tset_size(string_id, size); • Overhead for short strings • Can be compressed • Variable length string_id = H5Tcopy(H5T_C_S1); H5Tset_size(string_id, H5T_VARIABLE); • Overhead as for all VL datatypes • Compression will not be applied to actual data October 15, 2008 HDF and HDF-EOS Workshop XII 8
  • 9. Storing Variable Length Data in HDF5 • Each element is represented by a C structure typedef struct { size_t len; void *p; } hvl_t; • Base type can be any HDF5 type H5Tvlen_create(base_type) October 15, 2008 HDF and HDF-EOS Workshop XII 9
  • 10. Example hvl_t data[LENGTH]; for(i=0; i<LENGTH; i++) { data[i].p=HDmalloc( (i+1)*sizeof(unsigned int)); data[i].len=i+1; } tvl = H5Tvlen_create (H5T_NATIVE_UINT); data[0].p •Data •Data •Data •Data data[4].len October 15, 2008 •Data HDF and HDF-EOS Workshop XII 10
  • 11. Reading HDF5 Variable Length Array On read, the HDF5 Library allocates memory for the data; the application only needs to allocate an array of hvl_t elements (pointers and lengths). hvl_t rdata[LENGTH]; /* Create the matching memory datatype */ tvl = H5Tvlen_create (H5T_NATIVE_UINT); ret = H5Dread(dataset, tvl, H5S_ALL, H5S_ALL, H5P_DEFAULT, rdata); /* Reclaim the read VL data */ H5Dvlen_reclaim(tvl, H5S_ALL, H5P_DEFAULT, rdata); October 15, 2008 HDF and HDF-EOS Workshop XII 11
  • 12. Storing Tables in HDF5 file October 15, 2008 HDF and HDF-EOS Workshop XII 12
  • 13. Example a_name (integer) b_name (float) c_name (double) 0 0. 1.0000 1 1. 0.5000 2 4. 0.3333 3 9. 0.2500 4 16. 0.2000 5 25. 0.1667 6 36. 0.1429 7 49. 0.1250 8 64. 0.1111 9 81. 0.1000 October 15, 2008 Multiple ways to store a table Dataset for each field Dataset with compound datatype If all fields have the same type: 2-dim array 1-dim array of array datatype continued….. Choose to achieve your goal! How much overhead each type of storage will create? Do I always read all fields? Do I need to read some fields more often? Do I want to use compression? Do I want to access some records? HDF and HDF-EOS Workshop XII 13
  • 14. HDF5 Compound Datatypes • Compound types • Comparable to C structs • Members can be atomic or compound types • Members can be multidimensional • Can be written/read by a field or set of fields • Not all data filters can be applied (shuffling, SZIP) October 15, 2008 HDF and HDF-EOS Workshop XII 14
  • 15. HDF5 Compound Datatypes • Which APIs to use? • H5TB APIs • • • • Create, read, get info and merge tables Add, delete, and append records Insert and delete fields Limited control over table’s properties (i.e. only GZIP compression, level 6, default allocation time for table, extendible, etc.) • PyTables http://www.pytables.org • Based on H5TB • Python interface • Indexing capabilities • HDF5 APIs • H5Tcreate(H5T_COMPOUND), H5Tinsert calls to create a compound datatype • H5Dcreate, etc. • See H5Tget_member* functions for discovering properties of the HDF5 compound datatype October 15, 2008 HDF and HDF-EOS Workshop XII 15
  • 16. Creating and Writing Compound Dataset h5_compound.c example typedef struct s1_t { int a; float b; double c; } s1_t; s1_t October 15, 2008 s1[LENGTH]; HDF and HDF-EOS Workshop XII 16
  • 17. Creating and Writing Compound Dataset /* Create datatype in memory. */ s1_tid = H5Tcreate (H5T_COMPOUND, sizeof(s1_t)); H5Tinsert(s1_tid, "a_name", HOFFSET(s1_t, a), H5T_NATIVE_INT); H5Tinsert(s1_tid, "c_name", HOFFSET(s1_t, c), H5T_NATIVE_DOUBLE); H5Tinsert(s1_tid, "b_name", HOFFSET(s1_t, b), H5T_NATIVE_FLOAT); Note: • Use HOFFSET macro instead of calculating offset by hand. • Order of H5Tinsert calls is not important if HOFFSET is used. October 15, 2008 HDF and HDF-EOS Workshop XII 17
  • 18. Creating and Writing Compound Dataset /* Create dataset and write data */ dataset = H5Dcreate(file, DATASETNAME, s1_tid, space, H5P_DEFAULT); status = H5Dwrite(dataset, s1_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s1); Note: • In this example memory and file datatypes are the same. • Type is not packed. • Use H5Tpack to save space in the file (H5Tpack modifies a type in place, so pack a copy): s2_tid = H5Tcopy(s1_tid); status = H5Tpack(s2_tid); dataset = H5Dcreate(file, DATASETNAME, s2_tid, space, H5P_DEFAULT); October 15, 2008 HDF and HDF-EOS Workshop XII 18
  • 19. File Content with h5dump HDF5 "SDScompound.h5" { GROUP "/" { DATASET "ArrayOfStructures" { DATATYPE { H5T_STD_I32BE "a_name"; H5T_IEEE_F32BE "b_name"; H5T_IEEE_F64BE "c_name"; } DATASPACE { SIMPLE ( 10 ) / ( 10 ) } DATA { { [ 0 ], [ 0 ], [ 1 ] }, { [ 1 ], … October 15, 2008 HDF and HDF-EOS Workshop XII 19
  • 20. Reading Compound Dataset /* Create datatype in memory and read data. */ dataset = H5Dopen(file, DATASETNAME); s2_tid = H5Dget_type(dataset); mem_tid = H5Tget_native_type(s2_tid, H5T_DIR_ASCEND); s1 = malloc(H5Tget_size(mem_tid) * number_of_elements); status = H5Dread(dataset, mem_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s1); Note: • We could construct the memory type as we did in the writing example. • For general applications we need to discover the type in the file, find the corresponding memory type, allocate space, and do the read. October 15, 2008 HDF and HDF-EOS Workshop XII 20
  • 21. Reading Compound Dataset by Fields typedef struct s2_t { double c; int a; } s2_t; s2_t s2[LENGTH]; … s2_tid = H5Tcreate (H5T_COMPOUND, sizeof(s2_t)); H5Tinsert(s2_tid, "c_name", HOFFSET(s2_t, c), H5T_NATIVE_DOUBLE); H5Tinsert(s2_tid, "a_name", HOFFSET(s2_t, a), H5T_NATIVE_INT); … status = H5Dread(dataset, s2_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s2); October 15, 2008 HDF and HDF-EOS Workshop XII 21
  • 22. New Way of Creating Datatypes Another way to create a compound datatype #include "H5LTpublic.h" ….. s2_tid = H5LTtext_to_dtype( "H5T_COMPOUND {H5T_NATIVE_DOUBLE \"c_name\"; H5T_NATIVE_INT \"a_name\"; }", H5LT_DDL); October 15, 2008 HDF and HDF-EOS Workshop XII 22
  • 23. Need Help with Datatypes? Check our support web pages http://www.hdfgroup.uiuc.edu/UserSupport/example October 15, 2008 HDF and HDF-EOS Workshop XII 23
  • 24. Part II Working with subsets October 15, 2008 HDF and HDF-EOS Workshop XII 24
  • 25. Collect data one way …. Array of images (3D) October 15, 2008 HDF and HDF-EOS Workshop XII 25
  • 26. Display data another way … Stitched image (2D array) October 15, 2008 HDF and HDF-EOS Workshop XII 26
  • 27. Data is too big to read…. October 15, 2008 HDF and HDF-EOS Workshop XII 27
  • 28. Refer to a region… Need to select and access the same elements of a dataset October 15, 2008 HDF and HDF-EOS Workshop XII 28
  • 29. HDF5 Library Features • HDF5 Library provides capabilities to • Describe subsets of data and perform write/read operations on subsets • Hyperslab selections and partial I/O • Store descriptions of the data subsets in a file • Object references • Region references • Use efficient storage mechanism to achieve good performance while writing/reading subsets of data • Chunking, compression October 15, 2008 HDF and HDF-EOS Workshop XII 29
  • 30. Partial I/O in HDF5 October 15, 2008 HDF and HDF-EOS Workshop XII 30
  • 31. How to Describe a Subset in HDF5? • Before writing and reading a subset of data one has to describe it to the HDF5 Library. • HDF5 APIs and documentation refer to a subset as a “selection” or “hyperslab selection”. • If specified, HDF5 Library will perform I/O on a selection only and not on all elements of a dataset. October 15, 2008 HDF and HDF-EOS Workshop XII 31
  • 32. Types of Selections in HDF5 • Two types of selections • Hyperslab selection • Regular hyperslab • Simple hyperslab • Result of set operations on hyperslabs (union, difference, …) • Point selection • Hyperslab selection is especially important for doing parallel I/O in HDF5 (See Parallel HDF5 Tutorial) October 15, 2008 HDF and HDF-EOS Workshop XII 32
  • 33. Regular Hyperslab Collection of regularly spaced equal size blocks October 15, 2008 HDF and HDF-EOS Workshop XII 33
  • 34. Simple Hyperslab Contiguous subset or sub-array October 15, 2008 HDF and HDF-EOS Workshop XII 34
  • 35. Hyperslab Selection Result of union operation on three simple hyperslabs October 15, 2008 HDF and HDF-EOS Workshop XII 35
  • 36. Hyperslab Description • Offset - starting location of a hyperslab (1,1) • Stride - number of elements that separate each block (3,2) • Count - number of blocks (2,6) • Block - block size (2,1) • Everything is “measured” in number of elements October 15, 2008 HDF and HDF-EOS Workshop XII 36
  • 37. Simple Hyperslab Description • Two ways to describe a simple hyperslab • As several blocks • Stride – (2,1) • Count – (2,6) • Block – (2,1) • As one block • Stride – (1,1) • Count – (1,1) • Block – (4,6) No performance penalty for one way or another October 15, 2008 HDF and HDF-EOS Workshop XII 37
  • 38. H5Sselect_hyperslab Function space_id Identifier of dataspace op Selection operator H5S_SELECT_SET or H5S_SELECT_OR offset Array with starting coordinates of hyperslab stride Array specifying which positions along a dimension to select count Array specifying how many blocks to select from the dataspace, in each dimension block Array specifying size of element block (NULL indicates a block size of a single element in a dimension) October 15, 2008 HDF and HDF-EOS Workshop XII 38
  • 39. Reading/Writing Selections Programming model for reading from a dataset in a file 1. Open a dataset. 2. Get file dataspace handle of the dataset and specify subset to read from. a. H5Dget_space returns file dataspace handle a. File dataspace describes array stored in a file (number of dimensions and their sizes). b. H5Sselect_hyperslab selects elements of the array that participate in I/O operation. 3. Allocate data buffer of an appropriate shape and size October 15, 2008 HDF and HDF-EOS Workshop XII 39
  • 40. Reading/Writing Selections Programming model (continued) 4. Create a memory dataspace and specify subset to write to. 1. 2. 3. Memory dataspace describes data buffer (its rank and dimension sizes). Use H5Screate_simple function to create memory dataspace. Use H5Sselect_hyperslab to select elements of the data buffer that participate in I/O operation. 4. Issue H5Dread or H5Dwrite to move the data between file and memory buffer. 5. Close file dataspace and memory dataspace when done. October 15, 2008 HDF and HDF-EOS Workshop XII 40
  • 41. Example : Reading Two Rows 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 -1 -1 -1 Data in a file 4x6 matrix Buffer in memory 1-dim array of length 14 -1 -1 October 15, 2008 -1 -1 -1 -1 -1 HDF and HDF-EOS Workshop XII -1 -1 41 -1 -1
  • 42. Example: Reading Two Rows 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 offset count block stride 24 filespace = H5Dget_space (dataset); H5Sselect_hyperslab (filespace, H5S_SELECT_SET, offset, NULL, count, NULL) October 15, 2008 HDF and HDF-EOS Workshop XII 42 = = = = {1,0} {2,6} {1,1} {1,1}
  • 43. Example: Reading Two Rows offset = {1} count = {12} -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 dims[0] = 14; memspace = H5Screate_simple(1, dims, NULL); H5Sselect_hyperslab (memspace, H5S_SELECT_SET, offset, NULL, count, NULL) October 15, 2008 HDF and HDF-EOS Workshop XII 43 -1
  • 44. Example: Reading Two Rows 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 H5Dread (…, …, memspace, filespace, …, …); -1 7 October 15, 2008 8 9 10 11 12 13 14 15 16 17 18 -1 HDF and HDF-EOS Workshop XII 44
  • 45. Things to Remember • Number of elements selected in a file and in a memory buffer should be the same • H5Sget_select_npoints returns number of selected elements in a hyperslab selection • HDF5 partial I/O is tuned to move data between selections that have the same dimensionality; avoid choosing subsets that have different ranks (as in example above) • Allocate a buffer of an appropriate size when reading data; use H5Tget_native_type and H5Tget_size to get the correct size of the data element in memory. October 15, 2008 HDF and HDF-EOS Workshop XII 45
  • 46. Things to Remember • When calling H5Sselect_hyperslab in a loop close the obtained dataspace handle in a loop to avoid application memory growth. Only offset parameter is changing; block and stride parameters stay the same. offset October 15, 2008 HDF and HDF-EOS Workshop XII 46
  • 47. Example offset[0] = 0; offset[1] = 0; fspace_id = H5Dget_space(...); for (k = 0; k < DIM3; k++) { /* Start for loop */ offset[2] = k; … tmp_id = H5Scopy(fspace_id); H5Sselect_hyperslab(tmp_id, …, offset, …); H5Dwrite(dset_id, type_id, H5S_ALL, tmp_id, …); H5Sclose(tmp_id); … } /* End for loop */ October 15, 2008 HDF and HDF-EOS Workshop XII 47
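The loop pattern above can be mimicked without HDF5. The plain-Python sketch below writes one k-plane per iteration into a small 3-D array, varying only the third offset coordinate, exactly as the loop varies offset[2] (the DIM1 x DIM2 x DIM3 shape is an illustrative stand-in):

```python
# Stand-in for the loop above: write one k-plane of a
# DIM1 x DIM2 x DIM3 array per iteration, varying only offset[2].
DIM1, DIM2, DIM3 = 2, 3, 4
dataset = [[[-1] * DIM3 for _ in range(DIM2)] for _ in range(DIM1)]

for k in range(DIM3):                       # offset = {0, 0, k}
    plane = [[k * 100 + i * DIM2 + j for j in range(DIM2)]
             for i in range(DIM1)]          # data for this iteration
    for i in range(DIM1):
        for j in range(DIM2):
            dataset[i][j][k] = plane[i][j]

# Every element of the dataset was written exactly once
assert all(dataset[i][j][k] == k * 100 + i * DIM2 + j
           for i in range(DIM1) for j in range(DIM2) for k in range(DIM3))
```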
  • 48. HDF5 Region References and Selections October 15, 2008 HDF and HDF-EOS Workshop XII 48
  • 49. Saving Selected Region in a File Need to select and access the same elements of a dataset October 15, 2008 HDF and HDF-EOS Workshop XII 49
  • 50. Reference Datatype • Reference to an HDF5 object • Pointer to a group or a dataset in a file • Predefined datatype H5T_STD_REF_OBJ describes object references • Reference to a dataset region (or to a selection) • Pointer to the dataspace selection • Predefined datatype H5T_STD_REF_DSETREG describes dataset regions October 15, 2008 HDF and HDF-EOS Workshop XII 50
  • 51. Reference to Dataset Region [Figure: in the file REF_REG.h5, the root group holds a 2x9 dataset “MATRIX” of small integers and a dataset of references pointing to regions of it.] October 15, 2008 HDF and HDF-EOS Workshop XII 51
  • 52. Reference to Dataset Region Example dsetr_id = H5Dcreate(file_id, “REGION_REFERENCES”, H5T_STD_REF_DSETREG, …); H5Sselect_hyperslab(space_id, H5S_SELECT_SET, start, NULL, …); H5Rcreate(&ref[0], file_id, “MATRIX”, H5R_DATASET_REGION, space_id); H5Dwrite(dsetr_id, H5T_STD_REF_DSETREG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ref); October 15, 2008 HDF and HDF-EOS Workshop XII 52
  • 53. Reference to Dataset Region HDF5 "REF_REG.h5" { GROUP "/" { DATASET "MATRIX" { …… } DATASET "REGION_REFERENCES" { DATATYPE H5T_REFERENCE DATASPACE SIMPLE { ( 2 ) / ( 2 ) } DATA { (0): DATASET /MATRIX {(0,3)-(1,5)}, (1): DATASET /MATRIX {(0,0), (1,6), (0,8)} } } } } October 15, 2008 HDF and HDF-EOS Workshop XII 53
  • 54. Chunking in HDF5 October 15, 2008 HDF and HDF-EOS Workshop XII 54
  • 55. HDF5 Chunking • Dataset data is divided into equally sized blocks (chunks). • Each chunk is stored separately as a contiguous block in the HDF5 file. [Diagram: the dataset header (datatype, dataspace, attributes, chunk index) lives in the metadata cache; chunks A, B, C, D are contiguous in application memory but are located through the chunk index and may be scattered in the file.] October 15, 2008 HDF and HDF-EOS Workshop XII 55
  • 56. HDF5 Chunking • Chunking is needed for • Enabling compression and other filters • Extendible datasets October 15, 2008 HDF and HDF-EOS Workshop XII 56
  • 57. HDF5 Chunking • If used appropriately, chunking improves partial I/O for big datasets. Only two chunks are involved in the I/O. October 15, 2008 HDF and HDF-EOS Workshop XII 57
  • 58. HDF5 Chunking • Chunk has the same rank as a dataset • Chunk’s dimensions do not need to be factors of dataset’s dimensions October 15, 2008 HDF and HDF-EOS Workshop XII 58
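Because chunk dimensions need not be factors of the dataset's dimensions, the number of chunks per dimension is a ceiling division, with partial chunks at the edges. A plain-Python sketch of that arithmetic (the 1000x20 shapes match the h5dump examples later in this section):

```python
def n_chunks(dset_dims, chunk_dims):
    """Chunks allocated per dimension: ceil(dataset_dim / chunk_dim)."""
    assert len(dset_dims) == len(chunk_dims)   # chunk rank == dataset rank
    total = 1
    for d, c in zip(dset_dims, chunk_dims):
        total *= -(-d // c)                    # ceiling division
    return total

# 1000x20 dataset: 20x20 chunks -> 50 chunks; 200x20 chunks -> 5 chunks
print(n_chunks((1000, 20), (20, 20)))   # 50
print(n_chunks((1000, 20), (200, 20)))  # 5
# Edge chunks may be partial: a 9x9 dataset with 4x4 chunks needs 3x3 = 9
print(n_chunks((9, 9), (4, 4)))         # 9
```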
  • 59. Creating Chunked Dataset 1. Create a dataset creation property list. 2. Set the property list to use chunked storage layout. 3. Create the dataset with the above property list. crp_id = H5Pcreate(H5P_DATASET_CREATE); rank = 2; ch_dims[0] = 100; ch_dims[1] = 100; H5Pset_chunk(crp_id, rank, ch_dims); dset_id = H5Dcreate (…, crp_id); H5Pclose(crp_id); October 15, 2008 HDF and HDF-EOS Workshop XII 59
  • 60. Writing or Reading Chunked Dataset 1. The chunking mechanism is transparent to the application. 2. Use the same set of operations as for a contiguous dataset, for example: H5Dopen(…); H5Sselect_hyperslab (…); H5Dread(…); 3. Selections do not need to coincide precisely with chunk boundaries. October 15, 2008 HDF and HDF-EOS Workshop XII 60
  • 61. HDF5 Filters • HDF5 filters modify data during I/O operations • Available filters: 1. Checksum (H5Pset_fletcher32) 2. Shuffling filter (H5Pset_shuffle) 3. Data transformation (in 1.8.*) 4. Compression • Scale + offset (in 1.8.*) • N-bit (in 1.8.*) • GZIP (deflate), SZIP (H5Pset_deflate, H5Pset_szip) • User-defined filters (e.g. BZIP2) • An example of a user-defined compression filter can be found at http://www.hdfgroup.uiuc.edu/papers/papers/bzip2/ October 15, 2008 HDF and HDF-EOS Workshop XII 61
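The benefit of the shuffle filter can be demonstrated with the standard zlib module alone. The sketch below byte-shuffles an array of small 4-byte integers the way H5Pset_shuffle does (byte 0 of every element first, then byte 1, and so on) and compares deflate sizes; the data pattern is illustrative, not from the slides:

```python
import struct
import zlib

# 1000 small 4-byte integers: the two high bytes of every element are
# zero, but they are interleaved with the varying low bytes.
values = list(range(1000))
raw = struct.pack("<1000i", *values)

# Byte-shuffle as H5Pset_shuffle does: gather byte 0 of every element,
# then byte 1, etc., so the runs of zero bytes become contiguous.
shuffled = bytes(raw[e * 4 + b] for b in range(4) for e in range(1000))

plain = len(zlib.compress(raw, 6))
shuf = len(zlib.compress(shuffled, 6))
print(plain, shuf)
assert shuf < plain   # shuffling improves deflate on data like this
```

Shuffling is lossless and reversible; it only reorders bytes so that the subsequent compression filter sees longer runs of similar bytes.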
  • 62. Creating Compressed Dataset 1. Create a dataset creation property list. 2. Set the property list to use chunked storage layout. 3. Set the property list to use filters. 4. Create the dataset with the above property list. crp_id = H5Pcreate(H5P_DATASET_CREATE); rank = 2; ch_dims[0] = 100; ch_dims[1] = 100; H5Pset_chunk(crp_id, rank, ch_dims); H5Pset_deflate(crp_id, 9); dset_id = H5Dcreate (…, crp_id); H5Pclose(crp_id); October 15, 2008 HDF and HDF-EOS Workshop XII 62
  • 63. Writing Compressed Dataset [Diagram: chunks of a chunked dataset pass from the per-dataset chunk cache through the filter pipeline into the file.] The default chunk cache size is 1 MB. Filters, including compression, are applied when a chunk is evicted from the cache. Chunks in the file may have different sizes. October 15, 2008 HDF and HDF-EOS Workshop XII 63
  • 64. Chunking Basics to Remember • Chunking creates storage overhead in the file. • Performance is affected by • Chunking and compression parameters • Chunk cache size (H5Pset_cache call) • Some hints for getting better performance • Use a chunk size not smaller than the file system block size (typically 4 KB). • Use a compression method appropriate for your data. • Avoid selections that do not coincide with chunk boundaries. October 15, 2008 HDF and HDF-EOS Workshop XII 64
  • 65. Example Creates a compressed 1000x20 integer dataset in a file %h5dump –p –H zip.h5 HDF5 "zip.h5" { GROUP "/" { GROUP "Data" { DATASET "Compressed_Data" { DATATYPE H5T_STD_I32BE DATASPACE SIMPLE { ( 1000, 20 )……… STORAGE_LAYOUT { CHUNKED ( 20, 20 ) SIZE 5316 } October 15, 2008 HDF and HDF-EOS Workshop XII 65
  • 66. Example (continued) FILTERS { COMPRESSION DEFLATE { LEVEL 6 } } FILLVALUE { FILL_TIME H5D_FILL_TIME_IFSET VALUE 0 } ALLOCATION_TIME { H5D_ALLOC_TIME_INCR } } } } } October 15, 2008 HDF and HDF-EOS Workshop XII 66
  • 67. Example (bigger chunk) Creates a compressed integer dataset 1000x20 in a file; better compression ratio is achieved. h5dump –p –H zip.h5 HDF5 "zip.h5" { GROUP "/" { GROUP "Data" { DATASET "Compressed_Data" { DATATYPE H5T_STD_I32BE DATASPACE SIMPLE { ( 1000, 20 )……… STORAGE_LAYOUT { CHUNKED ( 200, 20 ) SIZE 2936 } October 15, 2008 HDF and HDF-EOS Workshop XII 67
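The better ratio with larger chunks (2936 bytes for 200x20 chunks vs. 5316 bytes for 20x20) can be imitated with per-chunk deflate in plain Python. The actual data in the example is not shown, so the smooth pattern below is only an illustrative stand-in; the per-chunk compression overhead and the larger context available to deflate are what make fewer, bigger chunks win:

```python
import struct
import zlib

# A 1000x20 int32 dataset with a smooth, repetitive pattern
# (illustrative stand-in for the data in the slides).
def row(r):
    return struct.pack("<20i", *([r % 100] * 20))

def compressed_size(chunk_rows):
    """Deflate each chunk_rows x 20 chunk independently, as HDF5
    compresses each chunk on its own."""
    total = 0
    for start in range(0, 1000, chunk_rows):
        chunk = b"".join(row(r) for r in range(start, start + chunk_rows))
        total += len(zlib.compress(chunk, 6))
    return total

small = compressed_size(20)    # 50 chunks of 20x20
big = compressed_size(200)     # 5 chunks of 200x20
print(small, big)
assert big < small             # fewer, larger chunks -> better ratio
```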
  • 68. Part III Performance Issues (How to Do it Right) October 15, 2008 HDF and HDF-EOS Workshop XII 68
  • 69. Performance of Serial I/O Operations • Next slides show the performance effects of using different access patterns and storage layouts. • We use three test cases which consist of writing a selection to an array of characters. • Data is stored in a row-major order. • Tests were executed on THG Linux x86_64 box using h5perf_serial and HDF5 version 1.8.0 October 15, 2008 HDF and HDF-EOS Workshop XII 69
  • 70. Serial Benchmarking Tool • Benchmarking tool, h5perf_serial, introduced in 1.8.1 release. • Features include: • Support for POSIX and HDF5 I/O calls. • Support for datasets and buffers with multiple dimensions. • Entire dataset access using a single or several I/O operations. • Selection of contiguous and chunked storage for HDF5 operations. October 15, 2008 HDF and HDF-EOS Workshop XII 70
  • 71. Contiguous Storage (Case 1) • Rectangular dataset of size 48K x 48K, with write selections of 512 x 48K. • HDF5 storage layout is contiguous. • Good I/O pattern for POSIX and HDF5 because each selection is contiguous. • POSIX: 5.19 MB/s • HDF5: 5.36 MB/s October 15, 2008 HDF and HDF-EOS Workshop XII 71
  • 72. Contiguous Storage (Case 2) • Rectangular dataset of 48K x 48K, with write selections of 48K x 512. • HDF5 storage layout is contiguous. • Bad I/O pattern for POSIX and HDF5 because each selection is noncontiguous. • POSIX: 1.24 MB/s • HDF5: 0.05 MB/s October 15, 2008 HDF and HDF-EOS Workshop XII 72
  • 73. Chunked Storage • Rectangular dataset of 48K x 48K, with write selections of 48K x 512. • HDF5 storage layout is chunked. Chunk and selection sizes are equal. • Bad I/O case for POSIX because selections are noncontiguous. • Good I/O case for HDF5 since selections are contiguous due to the chunked layout settings. • POSIX: 1.51 MB/s • HDF5: 5.58 MB/s October 15, 2008 HDF and HDF-EOS Workshop XII 73
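The three cases differ in how many contiguous byte runs each selection maps to in a row-major layout, which is what drives the throughput numbers. A small sketch of that arithmetic (plain Python, assuming 1-byte elements as in the character-array tests):

```python
def runs(sel_rows, sel_cols, row_len):
    """Contiguous runs of a sel_rows x sel_cols selection in a
    row-major array whose rows are row_len elements long.
    Returns (number_of_runs, elements_per_run)."""
    if sel_cols == row_len:           # whole rows -> one contiguous block
        return 1, sel_rows * sel_cols
    return sel_rows, sel_cols         # one short run per selected row

K = 1024
# Case 1: 512 x 48K selection of a 48K-wide array -> a single 24 MB run
print(runs(512, 48 * K, 48 * K))      # (1, 25165824)
# Case 2: 48K x 512 selection -> 49152 scattered runs of 512 bytes each
print(runs(48 * K, 512, 48 * K))      # (49152, 512)
# Case 3: same selection, but a chunk equal to the selection makes the
# data one contiguous block inside the HDF5 file.
```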
  • 74. Conclusions • Access patterns with small I/O operations incur high latency and overhead costs many times. • Chunked storage may improve I/O performance by affecting the contiguity of the data selection. October 15, 2008 HDF and HDF-EOS Workshop XII 74
  • 75. Writing Chunked Dataset • 1000x100x100 dataset • 4 byte integers • Random values 0-99 • 50x100x100 chunks (20 total) • Chunk size: 2 MB • Write the entire dataset using 1x100x100 slices • Slices are written sequentially October 15, 2008 HDF and HDF-EOS Workshop XII 75
  • 76. Test Setup • 20 Chunks • 1000 slices • Chunk size is 2MB October 15, 2008 HDF and HDF-EOS Workshop XII 76
  • 77. Test Setup (continued) • Tests performed with 1 MB and 5MB chunk cache size • Cache size set with H5Pset_cache function H5Pget_cache (fapl, NULL, &rdcc_nelmts, &rdcc_nbytes, &rdcc_w0); H5Pset_cache (fapl, 0, rdcc_nelmts, 5*1024*1024, rdcc_w0); • Tests performed with no compression and with gzip (deflate) compression October 15, 2008 HDF and HDF-EOS Workshop XII 77
  • 78. Effect of Chunk Cache Size on Write
No compression:
Cache size | I/O operations | Total data written | File size
1 MB (default) | 1002 | 75.54 MB | 38.15 MB
5 MB | 22 | 38.16 MB | 38.15 MB
Gzip compression:
Cache size | I/O operations | Total data written | File size
1 MB (default) | 1982 | 335.42 MB (322.34 MB read) | 13.08 MB
5 MB | 22 | 13.08 MB | 13.08 MB
October 15, 2008 HDF and HDF-EOS Workshop XII 78
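The sizes in the table follow directly from the test setup on slide 75; a quick check of the arithmetic (the tool reports MB in units of MiB, which is why 40,000,000 bytes appears as 38.15 MB):

```python
MiB = 1024 * 1024

dataset_bytes = 1000 * 100 * 100 * 4   # 1000x100x100 of 4-byte integers
chunk_bytes = 50 * 100 * 100 * 4       # one 50x100x100 chunk
n_chunks = 1000 // 50                  # 20 chunks in the dataset
plane_bytes = 1 * 100 * 100 * 4        # one 1x100x100 slice

print(round(dataset_bytes / MiB, 2))   # 38.15 -> the raw-data file size
print(round(chunk_bytes / MiB, 2))     # 1.91  -> the "2 MB" chunk

# With the small cache and no compression, each chunk is written in
# full on first touch and then one plane at a time for the remaining
# 49 planes, so roughly twice the dataset size participates in I/O:
total_written = n_chunks * chunk_bytes + n_chunks * 49 * plane_bytes
print(round(total_written / MiB, 1))   # ~75.5, close to the 75.54 MB row
```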
  • 79. Effect of Chunk Cache Size on Write • With the 1 MB cache size, a chunk will not fit into the cache • All writes to the dataset must be immediately written to disk • With compression, the entire chunk must be read and rewritten every time a part of the chunk is written to • Data must also be decompressed and recompressed each time • Non-sequential writes could result in a larger file • Without compression, the entire chunk must be written when it is first written to the file • If the selection were not contiguous on disk, it could require as much as 1 I/O operation for each element October 15, 2008 HDF and HDF-EOS Workshop XII 79
  • 80. Effect of Chunk Cache Size on Write • With the 5 MB cache size, the chunk is written only after it is full • Drastically reduces the number of I/O operations • Reduces the amount of data that must be written (and read) • Reduces processing time, especially with the compression filter October 15, 2008 HDF and HDF-EOS Workshop XII 80
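These two slides predict the raw-data write counts in the table on slide 78. A minimal model of the no-compression case (the remaining couple of operations in the table are metadata writes, per the editor's notes at the end of this deck):

```python
def count_writes(cache_bytes, chunk_bytes=2_000_000, n_chunks=20,
                 slices_per_chunk=50):
    """Raw-data writes when 1000 slices are written sequentially.
    A chunk that fits in the cache is flushed once, when complete;
    otherwise every slice goes straight to disk."""
    if chunk_bytes <= cache_bytes:
        return n_chunks                   # one write per completed chunk
    return n_chunks * slices_per_chunk    # one write per slice

print(count_writes(1 * 1024 * 1024))   # 1000 (table: 1002 incl. metadata)
print(count_writes(5 * 1024 * 1024))   # 20   (table: 22 incl. metadata)
```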
  • 81. Conclusion • It is important to make sure that a chunk will fit into the raw data chunk cache • If you will be writing to multiple chunks at once, you should increase the cache size even more • Try to design chunk dimensions to minimize the number you will be writing to at once October 15, 2008 HDF and HDF-EOS Workshop XII 81
  • 82. Reading Chunked Dataset • Read the same dataset, again by slices, but the slices cross through all the chunks • 2 orientations for read plane • Plane includes fastest changing dimension • Plane does not include fastest changing dimension • Measure total read operations, and total size read • Chunk sizes of 50x100x100, and 10x100x100 • 1 MB cache October 15, 2008 HDF and HDF-EOS Workshop XII 82
  • 83. Test Setup • Chunks • Read slices • Vertical and horizontal October 15, 2008 HDF and HDF-EOS Workshop XII 83
  • 84. Results • Read slice includes fastest changing dimension
Chunk size | Compression | I/O operations | Total data read
50 | Yes | 2010 | 1307 MB
10 | Yes | 10012 | 1308 MB
50 | No | 100010 | 38 MB
10 | No | 10012 | 3814 MB
October 15, 2008 HDF and HDF-EOS Workshop XII 84
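The operation counts in this table follow from the chunk layout and cache behavior (see the editor's notes at the end of this deck); the arithmetic can be checked directly. The roughly 10 leftover operations per row are small metadata reads:

```python
n_planes = 100          # 100 read slices, each crossing all chunks
chunk_bytes_50 = 50 * 100 * 100 * 4     # ~1.9 MB: does NOT fit in 1 MB cache
chunk_bytes_10 = 10 * 100 * 100 * 4     # ~0.38 MB: fits in the cache

# 50-deep chunks, no compression: the cache is bypassed and the library
# reads only the selected data, one 100-element row at a time
reads_50_nocomp = 1000 * n_planes                 # 100000 (table: 100010)

# 10-deep chunks: 100 chunks, and each whole chunk is read per plane
reads_10 = 100 * n_planes                         # 10000 (table: 10012)
data_10_nocomp_mb = reads_10 * chunk_bytes_10 / 2**20

# 50-deep chunks with compression: each of the 20 chunks must be read
# (and decompressed) once for every plane
reads_50_comp = 20 * n_planes                     # 2000 (table: 2010)

print(reads_50_nocomp, reads_10, reads_50_comp, int(data_10_nocomp_mb))
```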
  • 85. Results (continued) • Read slice does not include fastest changing dimension
Chunk size | Compression | I/O operations | Total data read
50 | Yes | 2010 | 1307 MB
10 | Yes | 10012 | 1308 MB
50 | No | 10000010 | 38 MB
10 | No | 10012 | 3814 MB
October 15, 2008 HDF and HDF-EOS Workshop XII 85
  • 86. Effect of Cache Size on Read • When compression is enabled, the library must always read each entire chunk once for each call to H5Dread. • When compression is disabled, the library’s behavior depends on the cache size relative to the chunk size. • If the chunk fits in cache, the library reads each entire chunk once for each call to H5Dread • If the chunk does not fit in cache, the library reads only the data that is selected • More read operations, especially if the read plane does not include the fastest changing dimension • Less total data read October 15, 2008 HDF and HDF-EOS Workshop XII 86
  • 87. Conclusion • In this case cache size does not matter when reading if compression is enabled. • Without compression, a larger cache may not be beneficial, unless the cache is large enough to hold all of the chunks. • The optimum cache size depends on the exact shape of the data, as well as the hardware. October 15, 2008 HDF and HDF-EOS Workshop XII 87
  • 88. Questions? October 15, 2008 HDF and HDF-EOS Workshop XII 88
  • 89. Acknowledgement • This Tutorial is based upon work supported in part by a Cooperative Agreement with the National Aeronautics and Space Administration (NASA) under NASA Awards NNX06AC83A and NNX08AO77A. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration. October 15, 2008 HDF and HDF-EOS Workshop XII 89

Editor's Notes

  1. H5Sselect_hyperslab: 1: dataspace identifier 2: selection operator that determines how the new selection is combined with the already existing selection for the dataspace. Currently only the H5S_SELECT_SET operator is supported, which replaces the existing selection with the parameters from this call. Overlapping blocks are not supported with H5S_SELECT_SET. 3: start - starting coordinates; 0 = beginning. 4: stride - how many elements to move in each dimension. A stride of zero is not supported; NULL means 1 in every dimension. 5: count - determines how many blocks to select in each dimension. 6: block - determines the size of the element block; if NULL, defaults to a single element in each dimension. H5Sselect_elements: 1: dataspace identifier 2: H5S_SELECT_SET (see above) 3: num_elements: number of elements selected 4: coord: 2-D array of size (dataspace rank) by (number of elements). The order of the element coordinates in the coord array also specifies the order in which the array elements are iterated when I/O is performed.
  2. Dumping DS with references may be slow since library has to dereference each element; on our to-do list
  3. More data will be written in this case Ghost zones are filled with fill value unless fill value is disabled
  4. Chunk remains in the cache until evicted Filters applied on a way to a file only Different chunks may have different sizes in a file
  5. A 2 MB chunk doesn’t fit into the 1 MB cache. The chunk is allocated in the file and written once with the first frame in the chunk; then HDF5 writes one frame at a time: 20x49 + 20 = 1000 writes, plus small I/O for metadata. Almost twice the dataset size participates in I/O. A 2 MB chunk does fit into the 5 MB cache: we write 50 frames at once when we fill the chunk, and do it only 20 times! With compression the situation is even worse: we need to write the chunk every time we write a frame, then read it back to write another frame, etc. In order to modify planes 2-50 (49 of them) in one chunk, we have to read the chunk 49 times x 20 times (the number of chunks), so we get 1000 writes + 980 reads for raw data. The bigger cache works nicely.
  6. First case: to read one plane, we need to read each chunk (20 I/Os) and do it 100 times (number of rows) (doesn’t fit into cache) Second case: to read one plane we need to read each chunk (100 I/O) and do it 100 times (for each row) Third case: cache is bypassed; reads from the file 1000 x 100 (for each row) = 100000 Fourth case: chunk fits into cache, so for one plane we do 100 reads to bring in all chunks, then do it 100 times for each row
  7. No difference for the first two cases and fourth one (for first two we always bring chunk into memory and uncompress), the third one fits into cache In the third case, chunk doesn’t fit into cache and library reads directly from the file getting one element at a time (1000x 100 (# of rows) x 100 (# columns) = 10000000)