In this talk we will discuss caching and buffering strategies in HDF5. The information presented will help developers write more efficient applications and avoid performance bottlenecks.
Caching and Buffering in HDF5
1. Caching and Buffering in HDF5
The HDF Group
Nov. 6, 2007
HDF-EOS Workshop XI Tutorial
2. Software stack and the “magic box”
• Life cycle: What happens to data when it is transferred from the application buffer to an HDF5 file?
[Figure: software stack, top to bottom]
Application: data buffer
Object API: H5Dwrite
Library internals: “magic box”
Virtual file I/O: unbuffered I/O
File or other “storage”: data in a file
3. Inside the magic box
• Understanding what happens to data inside the magic box helps in writing efficient applications
• The HDF5 library has mechanisms to control behavior inside the magic box
• Goals of this talk:
  • Describe some basic operations and data structures and explain how they affect performance and storage sizes
  • Give some “recipes” for how to improve performance
4. Topics
• Dataset metadata and array data storage layouts
• Types of dataset storage layouts
• Factors affecting I/O performance
  • I/O with compact datasets
  • I/O with contiguous datasets
  • I/O with chunked datasets
  • Variable length data and I/O
5. HDF5 dataset metadata and array data storage layouts
6. HDF5 Dataset
• Data array
  • Ordered collection of identically typed data items distinguished by their indices
• Metadata
  • Dataspace: Rank, dimensions of dataset array
  • Datatype: Information on how to interpret data
  • Storage Properties: How array is organized on disk
  • Attributes: User-defined metadata (optional)
7. Separate Components of a Dataset
[Figure: a dataset consists of a header and a data array]
Header
  • Dataspace: Rank = 3; Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  • Datatype: IEEE 32-bit float
  • Storage info: Chunked, Compressed
  • Attributes: Time = 32.4, Pressure = 987, Temp = 56
Data array
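As a rough illustration of how the dataspace and datatype in the header determine the size of the data array, the sketch below multiplies the dimension sizes by the element size. The function name is hypothetical and not part of the HDF5 API; it ignores chunking and compression.

```c
#include <stddef.h>

/* Hypothetical helper (not an HDF5 call): bytes needed for the raw,
 * uncompressed array described by a dataspace (rank, dims) and a
 * datatype (elem_size). */
size_t raw_array_bytes(const size_t *dims, int rank, size_t elem_size)
{
    size_t n = 1;
    for (int i = 0; i < rank; i++)
        n *= dims[i];          /* total number of elements */
    return n * elem_size;
}
```

For the 4 x 5 x 7 array of 32-bit floats on this slide, the helper gives 140 elements, or 560 bytes.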
8. Metadata cache and array data
• Dataset array data typically kept in application memory
• Dataset header in separate space – metadata cache
[Figure: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes, …) apart from the dataset array data; the file holds HDF5 metadata and dataset array data]
9. Metadata and metadata cache
• HDF5 metadata
  • Information about HDF5 objects used by the library
  • Examples: object headers, B-tree nodes for groups, B-tree nodes for chunks, heaps, superblock, etc.
  • Usually small compared to raw data sizes (KB vs. MB-GB)
• Metadata cache
  • Space allocated to handle pieces of the HDF5 metadata
  • Allocated by the HDF5 library in the application’s memory space
  • Cache behavior affects overall performance
  • Metadata cache implementation prior to HDF5 1.6.5 could cause performance degradation for some applications
10. Types of data storage layouts
12. Contiguous storage layout
• Metadata header separate from raw data
• Raw data stored in one contiguous block on disk
[Figure: application memory (metadata cache with the dataset header: datatype, dataspace, attributes, …; dataset array data) and file]
13. Chunked storage
• Chunking – storage layout where a dataset is partitioned into fixed-size multi-dimensional tiles or chunks
• Used for extendible datasets and datasets with filters applied (checksum, compression)
• HDF5 library treats each chunk as an atomic object
• Greatly affects performance and file sizes
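To see why chunking affects file size, the following sketch counts how many fixed-size chunks cover a dataset and how many bytes they occupy before any filter runs. It assumes that chunks on the dataset edge are allocated at full chunk size; the function names are hypothetical and not HDF5 calls.

```c
#include <stddef.h>

/* Hypothetical helper: number of chunks covering the dataset.
 * Each dimension needs ceil(dim / chunk_dim) chunks. */
size_t chunk_count(const size_t *dims, const size_t *chunk_dims, int rank)
{
    size_t n = 1;
    for (int i = 0; i < rank; i++)
        n *= (dims[i] + chunk_dims[i] - 1) / chunk_dims[i];
    return n;
}

/* Hypothetical helper: unfiltered bytes taken by all chunks, assuming
 * partial edge chunks are stored at full chunk size. */
size_t chunked_storage_bytes(const size_t *dims, const size_t *chunk_dims,
                             int rank, size_t elem_size)
{
    size_t per_chunk = elem_size;
    for (int i = 0; i < rank; i++)
        per_chunk *= chunk_dims[i];
    return chunk_count(dims, chunk_dims, rank) * per_chunk;
}
```

A 10 x 10 dataset of 4-byte elements with 4 x 4 chunks needs 9 chunks and 576 bytes under this model, compared with 400 bytes for the raw array, which is one way poorly chosen chunk dimensions inflate a file.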
14. Chunked storage layout
• Raw data divided into equal-sized blocks (chunks)
• Each chunk stored separately as a contiguous block on disk
[Figure: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes, …) and the chunk index, and the dataset array data is divided into chunks A, B, C, D; in the file, the header and chunk index are followed by the chunks in the order A, C, D, B]
15. Compact storage layout
• Data array and metadata stored together in the header
[Figure: application memory (metadata cache with the dataset header: datatype, dataspace, attributes, …, and the array data) and File*, where the array data is stored inside the header]
* “File” may in fact be a collection of files, memory, or other storage destination.
17. What goes on inside the magic box?
• Operations on data inside the magic box
  • Copying to/from internal buffers
  • Datatype conversion
  • Scattering - gathering
  • Data transformation (filters, compression)
• Data structures used
  • B-trees (groups, dataset chunks)
  • Hash tables
  • Local and global heaps (variable length data: link names, strings, etc.)
• Other concepts
  • HDF5 metadata, metadata cache
  • Chunking, chunk cache
18. Operations on data inside the magic box
• Copying to/from internal buffers
• Datatype conversion, such as
  • float ↔ integer
  • LE ↔ BE (little-endian ↔ big-endian)
  • 64-bit integer to 16-bit integer
• Scattering - gathering
  • Data is scattered/gathered from/to application buffers into internal buffers for datatype conversion and partial I/O
• Data transformation (filters, compression)
  • Checksum on raw data and metadata (in 1.8.0)
  • Algebraic transform
  • GZIP and SZIP compression
  • User-defined filters
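To give a feel for the per-element work a datatype conversion implies, here is a minimal sketch of an LE ↔ BE conversion for 32-bit values. It is a stand-alone illustration, not the library's conversion code, and the function names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit value (LE <-> BE). */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) |
           ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) |
           (v << 24);
}

/* Convert a whole buffer in place; every element is touched once,
 * which is why avoiding conversion saves time on large arrays. */
void swap32_buffer(uint32_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = swap32(buf[i]);
}
```

Applying the swap twice returns the original value, so the same routine serves both directions.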
19. I/O performance depends on
• Storage layouts
• Dataset storage properties
• Chunking strategy
• Metadata cache performance
• Datatype conversion performance
• Other filters, such as compression
• Access patterns
20. I/O with different storage layouts
21. Writing compact dataset
[Figure: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes, …) together with the array data]
• One write to store header and data array in the file
23. Writing a contiguous dataset with datatype conversion
[Figure: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attribute 1, attribute 2, …); the dataset array data passes through a 1 MB conversion buffer on its way to the file]
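The 1 MB conversion buffer means a large array is converted in several passes. The sketch below estimates the number of passes under a simplifying assumption: each element occupies the larger of its source and destination sizes while it sits in the buffer. The function is hypothetical and only models the idea; it is not how the library reports or schedules its work.

```c
#include <stddef.h>

/* Hypothetical model: passes needed to push n_elems elements through a
 * conversion buffer of buf_bytes bytes. Returns 0 if the buffer cannot
 * hold even one element. */
size_t conversion_passes(size_t n_elems, size_t src_size, size_t dst_size,
                         size_t buf_bytes)
{
    size_t widest = src_size > dst_size ? src_size : dst_size;
    if (widest == 0)
        return 0;
    size_t per_pass = buf_bytes / widest;   /* elements per pass */
    if (per_pass == 0)
        return 0;
    return (n_elems + per_pass - 1) / per_pass;
}
```

Under this model, converting one million 8-byte doubles to 4-byte floats through a 1 MB (1048576-byte) buffer takes 8 passes; a larger buffer reduces the count.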
24. Partial I/O with contiguous datasets
25. Writing whole dataset – contiguous rows
[Figure: application data in memory, an array with dimensions labeled M and N; all M rows are written to the file]
• One I/O operation
• Data is contiguous in the file
26. Sub-setting of contiguous dataset
Series of adjacent rows
[Figure: application data in memory (dimensions labeled M and N); a subset of M adjacent rows is written to the file]
• One I/O operation
• Subset is contiguous in the file
• Entire dataset is contiguous in the file
27. Sub-setting of contiguous dataset
Adjacent, partial rows
[Figure: application data in memory; a subset of M adjacent partial rows, N elements each, is written to the file]
• Several small I/O operations
• Data is scattered in the file in M contiguous blocks
28. Sub-setting of contiguous dataset
Extreme case: writing a column
[Figure: application data in memory; a subset of M rows, 1 element each, is written to the file]
• Several small I/O operations
• Subset data is scattered in the file in M different locations
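The three sub-setting cases (full rows, partial rows, a single column) differ only in how many contiguous file blocks the selection maps to. The sketch below models this for a two-dimensional, row-major contiguous dataset. The type and function names are hypothetical, and the model assumes one I/O operation per contiguous block with no sieve buffer.

```c
#include <stddef.h>

/* Hypothetical result type: number of contiguous file blocks and the
 * number of elements in each block. */
typedef struct {
    size_t blocks;
    size_t block_elems;
} io_plan_t;

/* Model a selection of sel_rows adjacent rows by sel_cols adjacent
 * columns in a row-major dataset that has ncols columns. */
io_plan_t plan_contiguous_io(size_t ncols, size_t sel_rows, size_t sel_cols)
{
    io_plan_t p = {0, 0};
    if (sel_rows == 0 || sel_cols == 0)
        return p;                       /* empty selection */
    if (sel_cols == ncols) {            /* whole rows: one block */
        p.blocks = 1;
        p.block_elems = sel_rows * ncols;
    } else {                            /* partial rows: one block per row */
        p.blocks = sel_rows;
        p.block_elems = sel_cols;
    }
    return p;
}
```

For a dataset with 100 columns, 10 whole rows give 1 block of 1000 elements, 10 partial rows of 20 elements give 10 blocks, and a 10-element column gives 10 blocks of a single element each.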
29. Sub-setting of contiguous dataset
Data sieve buffer
[Figure: application data in memory; a subset of M rows, 1 element each]
• Data is gathered in a 64 KB sieve buffer in memory (memcopy)
• Data is scattered in the file in M contiguous blocks
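A simplified model shows how a sieve buffer can turn many tiny file accesses into a few larger ones when reading a column. This sketch assumes the buffer is refilled with one file read starting at the first requested offset that falls outside the current window; the actual library logic is more involved, and the function name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical model: number of file reads needed to fetch the first
 * element of each of nrows rows when a sieve buffer of sieve_bytes
 * bytes is used. row_bytes is the size of one full row in the file. */
size_t sieve_refills(size_t row_bytes, size_t elem_bytes, size_t nrows,
                     size_t sieve_bytes)
{
    size_t refills = 0;
    size_t win_start = 0, win_end = 0;   /* empty window to begin with */

    for (size_t r = 0; r < nrows; r++) {
        size_t off = r * row_bytes;      /* file offset of the element */
        if (off < win_start || off + elem_bytes > win_end) {
            win_start = off;             /* refill: one larger file read */
            win_end = off + sieve_bytes;
            refills++;
        }
        /* otherwise the element is copied from the buffer (memcopy) */
    }
    return refills;
}
```

With 400-byte rows, 4-byte elements, and 1000 rows, a 64 KB (65536-byte) buffer needs 7 reads in this model, whereas a buffer that holds only one element needs 1000.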
30. Performance tuning for contiguous dataset
• Datatype conversion
  • Avoid for better performance
  • Use the H5Pset_buffer function to customize the conversion buffer size
• Partial I/O
  • Write/read in big contiguous blocks
  • Use H5Pset_sieve_buf_size to improve performance for complex subsetting
32. Reminder – chunked storage layout
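[Figure: same chunked layout as on slide 14: the metadata cache holds the dataset header (datatype, dataspace, attributes, …) and the chunk index, the dataset array data in application memory is divided into chunks A, B, C, D, and the file stores the header, the chunk index, and the chunks in the order A, C, D, B]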
33. Information about chunking
• HDF5 library treats each chunk as an atomic object
• Compression is applied to each chunk
• Datatype conversion, other filters applied per chunk
• Chunk size greatly affects performance
• Chunk overhead adds to file size
• Chunk processing involves many steps
• Chunk cache
  • Caches chunks for better performance
  • Created for each chunked dataset
  • Size of chunk cache is set for file (default size 1 MB)
  • Each chunked dataset has its own chunk cache
  • Chunk may be too big to fit into cache
  • Memory may grow if application keeps opening datasets
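Whether a chunk fits into the chunk cache is simple arithmetic on the chunk dimensions and element size. The sketch below uses hypothetical helper names and treats the 1 MB default from the slide as 1048576 bytes.

```c
#include <stddef.h>

/* Hypothetical helper: size in bytes of one unfiltered chunk. */
size_t chunk_bytes(const size_t *chunk_dims, int rank, size_t elem_size)
{
    size_t n = elem_size;
    for (int i = 0; i < rank; i++)
        n *= chunk_dims[i];
    return n;
}

/* Hypothetical helper: how many whole chunks fit in a cache of
 * cache_bytes bytes. A result of 0 means a chunk is too big to cache. */
size_t chunks_that_fit(size_t cache_bytes, size_t one_chunk_bytes)
{
    return one_chunk_bytes ? cache_bytes / one_chunk_bytes : 0;
}
```

A 1000 x 1000 chunk of 4-byte floats is 4,000,000 bytes and does not fit in a 1 MB cache at all, while a 100 x 100 chunk is 40,000 bytes and 26 of them fit.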
35. Writing chunked dataset
[Figure: chunks A, B, C of a chunked dataset pass from the chunk cache through the filter pipeline to the file]
• Compression performed when chunk evicted from the chunk cache
• Other filters applied as data goes through filter pipeline
36. Partial I/O with Chunking
37. Partial I/O for chunked dataset
[Figure: dataset divided into six chunks; the selected (green) subset overlaps the chunks numbered 1-4]
• Example: write the green subset from the dataset, converting the data
• Dataset is stored as six chunks in the file.
• The subset spans four chunks, numbered 1-4 in the figure.
• Hence four chunks must be written to the file.
• But first, the four chunks must be read from the file, to preserve those parts of each chunk that are not to be overwritten.
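The number of chunks a rectangular selection touches can be computed from its start, its extent, and the chunk dimensions. The sketch below is a stand-alone model with a hypothetical function name; the dataset and selection sizes in the note after it are invented to reproduce the "six chunks, four touched" situation and are not taken from the slide.

```c
#include <stddef.h>

/* Hypothetical helper: number of chunks overlapped by a rectangular
 * selection given by start[] and count[] (in elements) per dimension. */
size_t chunks_touched(const size_t *start, const size_t *count,
                      const size_t *chunk_dims, int rank)
{
    size_t n = 1;
    for (int i = 0; i < rank; i++) {
        if (count[i] == 0)
            return 0;                               /* empty selection */
        size_t first = start[i] / chunk_dims[i];    /* first chunk index */
        size_t last = (start[i] + count[i] - 1) / chunk_dims[i];
        n *= last - first + 1;
    }
    return n;
}
```

For an assumed 8 x 12 dataset with 4 x 4 chunks (six chunks in total), a 4 x 4 selection starting at element (2, 2) touches 4 chunks, each of which must be read, modified, and written back. Aligning the same selection to start at (0, 0) would touch only 1.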
38. Partial I/O for chunked dataset
• For each of the four chunks (numbered 1-4 in the figure):
  • Read chunk from file into chunk cache, unless it’s already there.
  • Determine which part of the chunk will be replaced by the selection.
  • Replace that part of the chunk in the cache with the corresponding elements from the application’s array.
  • Move those elements to conversion buffer and perform conversion.
  • Move those elements back from conversion buffer to chunk cache.
  • Apply filters (compression) when chunk is flushed from chunk cache.
• Three memcopy operations are performed for each element
39. Partial I/O for chunked dataset
[Figure: in application memory, elements are copied (memcopy) from the application buffer into chunk 3 in the chunk cache]
• Elements participating in I/O are gathered into corresponding chunk
40. Partial I/O for chunked dataset
[Figure: in application memory, elements of chunk 3 are copied (memcopy) from the chunk cache to the conversion buffer and back (memcopy); the chunk is then compressed and written to the file]
42. Examples of variable length data
• String
  A[0] “the first string we want to write”
  …………………………………
  A[N-1] “the N-th string we want to write”
• Each element is a record of variable length
  A[0] (1,1,0,0,0,5,6,7,8,9) [length = 10]
  A[1] (0,0,110,2005) [length = 4]
  ………………………..
  A[N] (1,2,3,4,5,6,7,8,9,10,11,12,….,M) [length = M]
43. Variable length data in HDF5
• Variable length description in HDF5 application
  typedef struct {
      size_t len;  /* number of base-type elements */
      void  *p;    /* pointer to the element data */
  } hvl_t;
• Base type can be any HDF5 type
  H5Tvlen_create(base_type)
• ~ 20 bytes overhead for each element
• Data cannot be compressed
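To make the per-element overhead concrete, the sketch below estimates the storage for a set of variable length records. It defines its own record type that mirrors the layout of hvl_t so that it compiles without the HDF5 headers; the type name, function name, and the use of exactly 20 bytes for the approximate overhead quoted on the slide are assumptions of this example.

```c
#include <stddef.h>

/* Stand-in for hvl_t so the example needs no HDF5 headers. */
typedef struct {
    size_t len;   /* number of base-type elements */
    void  *p;     /* pointer to the element data */
} vl_record_t;

/* Approximate per-element overhead quoted on the slide (assumed exact). */
#define VL_OVERHEAD_BYTES 20

/* Hypothetical estimate: payload bytes plus per-record overhead. */
size_t vl_storage_estimate(const vl_record_t *recs, size_t n,
                           size_t base_size)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += recs[i].len * base_size + VL_OVERHEAD_BYTES;
    return total;
}
```

Two records of 10 and 4 four-byte integers carry 56 bytes of payload and, under this estimate, 40 bytes of overhead, for 96 bytes in total. The shorter the records, the larger the share taken by overhead.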
44. How variable length data is stored in HDF5
[Figure: in the file, the dataset header describes a dataset with variable length elements; each element is a pointer into a global heap, which holds the actual variable length data]
45. Variable length datasets and I/O
• When writing variable length data, elements in the application buffer point to global heaps in the metadata cache, where the actual data is stored.
[Figure: raw data in the application buffer pointing into a global heap]
46. There may be more than one global heap
[Figure: raw data in the application buffer pointing into two global heaps]
47. Variable length datasets and I/O
[Figure: raw data and two global heaps written to the file]
48. VL chunked dataset in a file
[Figure: file containing the dataset header, the chunk B-tree, the dataset chunks, and heaps with VL data]
49. Writing chunked VL datasets
[Figure: writing a selected region of a VL chunked dataset. Application memory holds the metadata cache (B-tree nodes, dataset header, global heap), the raw data, the chunk cache, and the conversion buffer; data passes through the filter pipeline to the file]
50. Hints for variable length data I/O
• Avoid closing/opening a file while writing VL datasets
  • Global heap information is lost
  • Global heaps may have unused space
• Avoid alternately writing different VL datasets
  • Data from different datasets will go into the same heap
• If maximum length of the record is known, consider using fixed-length records and compression
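The last hint trades padding for compressibility: fixed-length records waste space on short records, but that padding is regular and the dataset can then be compressed. The sketch below, with hypothetical function names, computes the unfiltered size of the fixed-length layout and how much of it is padding.

```c
#include <stddef.h>

/* Hypothetical helper: unfiltered bytes for n fixed-length records,
 * each padded to max_len elements of base_size bytes. */
size_t fixed_storage_bytes(size_t n, size_t max_len, size_t base_size)
{
    return n * max_len * base_size;
}

/* Hypothetical helper: bytes of padding in that layout, given the real
 * record lengths. Lengths above max_len are treated as truncated. */
size_t padding_bytes(const size_t *lens, size_t n, size_t max_len,
                     size_t base_size)
{
    size_t pad = 0;
    for (size_t i = 0; i < n; i++)
        if (lens[i] < max_len)
            pad += (max_len - lens[i]) * base_size;
    return pad;
}
```

For records of length 10 and 4 with 4-byte elements and a maximum length of 10, the fixed layout takes 80 bytes, of which 24 are padding that a compression filter can typically shrink.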