The HDF Group

HDF5 Advanced Topics
Elena Pourmal
The HDF Group
The 13th HDF and HDF-EOS Workshop
November 3-5, 2009


Chunking in HDF5

Goal
• To help you understand how HDF5 chunking works, so you can efficiently store and retrieve data from HDF5

Recall from Intro: HDF5 Dataset

(Figure: an HDF5 dataset consists of metadata and dataset data)
• Dataspace: rank 3; Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
• Datatype: IEEE 32-bit float
• Storage info: chunked, compressed
• Attributes: Time = 32.4, Pressure = 987, Temp = 56
Contiguous storage layout
• Metadata header separate from dataset data
• Data is stored in one contiguous block in the HDF5 file
(Figure: the dataset header — datatype, dataspace, attributes — is cached in application memory; the dataset data is stored as one contiguous block in the file)
What is HDF5 Chunking?
• Data is stored in chunks of predefined size
• The two-dimensional case is sometimes referred to as data tiling
• The HDF5 library always writes/reads whole chunks
(Figure: contiguous vs. chunked storage layouts)
What is HDF5 Chunking?
• Dataset data is divided into equally sized blocks (chunks)
• Each chunk is stored separately as a contiguous block in the HDF5 file
(Figure: a chunk index in the dataset header maps chunks A, B, C, D to their locations in the file; chunks need not be stored in order)
Why HDF5 Chunking?
• Chunking is required for several HDF5 features:
• Enabling compression and other filters, such as checksum
• Extendible datasets

Why HDF5 Chunking?
• Used appropriately, chunking improves partial I/O for big datasets

Only two chunks are involved in I/O

November 3-5, 2009

HDF/HDF-EOS Workshop XIII

9

www.hdfgroup.org
Creating Chunked Dataset

1. Create a dataset creation property list.
2. Set the property list to use chunked storage layout.
3. Create the dataset with the above property list.
dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
rank = 2;
ch_dims[0] = 100;
ch_dims[1] = 200;
H5Pset_chunk(dcpl_id, rank, ch_dims);
dset_id = H5Dcreate (…, dcpl_id);
H5Pclose(dcpl_id);

November 3-5, 2009

HDF/HDF-EOS Workshop XIII

10

www.hdfgroup.org
Creating Chunked Dataset
• Things to remember:
• A chunk always has the same rank as the dataset
• Chunk dimensions do not need to be factors of the dataset's dimensions
• Caution: partial edge chunks may cause more I/O than desired, since whole chunks are always read and written
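The caution above can be made concrete with a little arithmetic: edge chunks are allocated in full, so chunk dimensions that do not divide the dataset's dimensions inflate the data actually stored and transferred. A minimal sketch of that bookkeeping (the function names are ours, not HDF5 API):

```python
import math

def chunk_grid(dset_dims, chunk_dims):
    """Chunks along each dimension; a partial edge chunk still counts as a whole chunk."""
    return [math.ceil(d / c) for d, c in zip(dset_dims, chunk_dims)]

def stored_elements(dset_dims, chunk_dims):
    """Elements actually allocated: every chunk, including edge chunks, is stored in full."""
    total = 1
    for n, c in zip(chunk_grid(dset_dims, chunk_dims), chunk_dims):
        total *= n * c
    return total

# A 1000 x 1000 dataset with 300 x 300 chunks needs a 4 x 4 grid of chunks,
# so 1,440,000 elements are allocated to hold 1,000,000 elements of real data.
```

In this sketch, 44% extra elements pass through I/O whenever the edge chunks are touched; choosing 250 x 250 chunks instead gives an exact 4 x 4 tiling with no overhang.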

November 3-5, 2009

HDF/HDF-EOS Workshop XIII

11

www.hdfgroup.org
Quiz time
• Why shouldn’t I make a chunk with dimension
sizes equal to one?
• Can I change the chunk size after the dataset was created?

November 3-5, 2009

HDF/HDF-EOS Workshop XIII

12

www.hdfgroup.org
Writing or Reading Chunked Dataset
1. The chunking mechanism is transparent to the application.
2. Use the same set of operations as for a contiguous dataset, for example:
H5Dopen(…);
H5Sselect_hyperslab(…);
H5Dread(…);
3. Selections do not need to coincide precisely with chunk boundaries.

HDF5 Chunking and compression
Chunking is required for compression and other filters.
HDF5 filters modify data during I/O operations.
Filters provided by HDF5:
• Checksum (H5Pset_fletcher32)
• Data transformation (in 1.8.*)
• Shuffling filter (H5Pset_shuffle)
Compression methods (also implemented as filters) in HDF5:
• Scale + offset (in 1.8.*) (H5Pset_scaleoffset)
• N-bit (in 1.8.*) (H5Pset_nbit)
• GZIP (deflate) (H5Pset_deflate)
• SZIP (H5Pset_szip)
HDF5 Third-Party Filters
• Compression methods supported by the HDF5 user community:
http://wiki.hdfgroup.org/Community-Support-for-HDF5
• LZO lossless compression (PyTables)
• BZIP2 lossless compression (PyTables)
• BLOSC lossless compression (PyTables)
• LZF lossless compression (H5Py)
Creating Compressed Dataset
1. Create a dataset creation property list
2. Set property list to use chunked storage layout
3. Set property list to use filters
4. Create dataset with the above property list

crp_id = H5Pcreate(H5P_DATASET_CREATE);
rank = 2;
ch_dims[0] = 100;
ch_dims[1] = 100;
H5Pset_chunk(crp_id, rank, ch_dims);
H5Pset_deflate(crp_id, 9);
dset_id = H5Dcreate (…, crp_id);
H5Pclose(crp_id);

November 3-5, 2009

HDF/HDF-EOS Workshop XIII

16

www.hdfgroup.org

Performance Issues
or
What everyone needs to know
about chunking, compression and
chunk cache

Accessing a row in contiguous dataset

One seek is needed to find the starting location of the row of data.
The data is read/written in one disk access.

Accessing a row in chunked dataset

Five seeks are needed to find the five chunks, and the data is read/written
in five disk accesses. For this access pattern, chunked storage is less
efficient than contiguous storage.
Quiz time
• How might I improve this situation, if it is
common to access my data in this way?

November 3-5, 2009

HDF/HDF-EOS Workshop XIII

20

www.hdfgroup.org
Accessing data in contiguous dataset

M rows

M seeks are needed to find the starting location of each element, and the
data is read/written in M disk accesses. Performance may be very bad.

November 3-5, 2009

HDF/HDF-EOS Workshop XIII

21

www.hdfgroup.org
Motivation for chunking storage

M rows

Two seeks are needed to find the two chunks, and the data is read/written
in two disk accesses. For this access pattern, chunking helps I/O
performance.
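The seek counts on the last few slides follow from a simple model: contiguous row-major storage needs one access per contiguous run, while chunked storage needs one access per chunk the selection crosses. A rough 2-D sketch of that model (our own illustration, not library code):

```python
import math

def accesses_contiguous(nrows, ncols, axis):
    """Disk accesses to read one full row (axis=0) or column (axis=1), row-major layout."""
    # a row is one contiguous run; a column touches one run per row -> M seeks
    return 1 if axis == 0 else nrows

def accesses_chunked(nrows, ncols, ch_rows, ch_cols, axis):
    """Every chunk crossed by the selection is read or written in full."""
    return math.ceil(ncols / ch_cols) if axis == 0 else math.ceil(nrows / ch_rows)

# Row read across 5 chunks: 5 accesses vs. 1 for contiguous (the "five seeks" case).
# Column read down 2 chunk rows: 2 accesses vs. nrows for contiguous (the "M rows" case).
```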
Quiz time
• If I know I shall always access a column at a time,
what size and shape should I make my chunks?

November 3-5, 2009

HDF/HDF-EOS Workshop XIII

23

www.hdfgroup.org
Motivation for chunk cache
(Figure: a selection spanning chunks A and B, written by two H5Dwrite calls)

The selection shown is written by two H5Dwrite calls (one for each row).
Chunks A and B are each accessed twice (once per row). If both chunks fit
into the cache, only two I/O accesses are needed to write the selection.
Motivation for chunk cache
(Figure: the same selection over chunks A and B, written by two H5Dwrite calls)

Question: What happens if there is space for only one chunk at a time?

November 3-5, 2009

HDF/HDF-EOS Workshop XIII

25

www.hdfgroup.org
HDF5 raw data chunk cache
• Improves performance whenever the same chunks are read or written multiple times.
• The current implementation doesn't adjust parameters automatically (cache size, size of hash table).
• Chunks are indexed with a simple hash table.
• Hash function = (cindex mod nslots), where cindex is the linear index into a hypothetical array of chunks and nslots is the size of the hash table.
• Only one of several chunks with the same hash value stays in cache.
• nslots should be a prime number to minimize the number of hash value collisions.
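The hash scheme above is easy to simulate. The sketch below (our own illustration) shows why two chunks whose linear indices are congruent modulo nslots keep evicting each other even when the cache has plenty of free bytes:

```python
def chunk_slot(cindex, nslots):
    # the hash function from the slide: linear chunk index modulo hash-table size
    return cindex % nslots

def evictions(access_pattern, nslots):
    """Count evictions when only one chunk may occupy each hash slot."""
    slots = {}
    evicted = 0
    for cindex in access_pattern:
        slot = chunk_slot(cindex, nslots)
        if slot in slots and slots[slot] != cindex:
            evicted += 1  # collision: the resident chunk is pushed out
        slots[slot] = cindex
    return evicted

# With nslots = 10, chunks 3 and 13 share slot 3 and evict each other on every access;
# with nslots = 11 they land in different slots and both stay cached.
```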

HDF5 Chunk Cache APIs
• H5Pset_chunk_cache sets raw data chunk
cache parameters for a dataset
H5Pset_chunk_cache (dapl, rdcc_nslots,
rdcc_nbytes, rdcc_w0);

• H5Pset_cache sets raw data chunk cache
parameters for all datasets in a file
H5Pset_cache (fapl, 0, nslots,
5*1024*1024, rdcc_w0);

Hints for Chunk Settings
• Chunk dimension sizes should align as closely as possible with the hyperslab dimensions used for reads/writes
• The chunk cache size (rdcc_nbytes) should be large enough to hold all the chunks in a selection
• If this is not possible, it may be best to disable chunk caching altogether (set rdcc_nbytes to 0)
• rdcc_nslots should be a prime number that is at least 10 to 100 times the number of chunks that can fit into rdcc_nbytes
• rdcc_w0 should be set to 1 if chunks that have been fully read/written will never be accessed again
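The hints above can be turned into a small sizing helper. This is our own heuristic sketch of the recommendation (the function names are not HDF5 API): size rdcc_nbytes to hold all chunks in the selection, and pick rdcc_nslots as the first prime at least 100 times the number of chunks that fit.

```python
import math

def is_prime(n):
    return n >= 2 and all(n % k for k in range(2, math.isqrt(n) + 1))

def suggest_cache_params(chunk_bytes, chunks_in_selection, slot_factor=100):
    """Heuristic rdcc_nbytes / rdcc_nslots following the hints above (assumed)."""
    rdcc_nbytes = chunk_bytes * chunks_in_selection
    chunks_that_fit = rdcc_nbytes // chunk_bytes
    rdcc_nslots = slot_factor * chunks_that_fit
    while not is_prime(rdcc_nslots):  # a prime table size minimizes collisions
        rdcc_nslots += 1
    return rdcc_nbytes, rdcc_nslots

# 20 chunks of 2 MB each -> a 40 MB cache and the first prime >= 2000 slots (2003).
```

The resulting values would then be passed to H5Pset_chunk_cache on the dataset access property list.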
The Good and The Ugly: Reading a row
(Figure: M rows spanning chunks A and B; each row is read by a separate call to H5Dread)

The Good: if both chunks fit into the cache, 2 disk accesses are needed to
read the data.
The Ugly: if only one chunk fits into the cache, 2M disk accesses are
needed (compare with M accesses for contiguous storage).
Case study: Writing Chunked Dataset
• 1000x100x100 dataset
• 4 byte integers
• Random values 0-99

• 50x100x100 chunks (20 total)
• Chunk size: 2 MB

• Write the entire dataset using 1x100x100
slices
• Slices are written sequentially

• Chunk cache size of 1 MB (the default) compared with a chunk cache size of 5 MB
Test Setup

• 20 Chunks
• 1000 slices
• Chunk size ~ 2MB
• Total size ~ 40MB
• Each plane ~ 40KB

Aside: Writing dataset with contiguous storage

• 1000 disk accesses to write 1000 planes
• Total size written 40MB

Writing chunked dataset
• Example: Chunk fits into cache
• Chunk is filled in cache and then written to
disk
• 20 disk accesses are needed
• Total size written 40MB

(vs. 1000 disk accesses for contiguous storage)
Writing chunked dataset
• Example: Chunk doesn't fit into cache
• For each chunk (20 total):
1. Fill the chunk in memory with the first plane and write it to the file
2. Write the 49 remaining planes directly to the file
• Total disk accesses: 20 × (1 + 49) = 1000
• Total data written: ~80 MB (vs. 40 MB)
Writing compressed chunked dataset
• Example: Chunk fits into cache
• For each chunk (20 total):
1. Fill the chunk in memory, compress it, and write it to the file
• Total disk accesses: 20
• Total data written: less than 40 MB
Writing compressed chunked dataset
• Example: Chunk doesn't fit into cache
• For each chunk (20 total):
• Fill the chunk with the first plane, compress, write to the file
• For each new plane (49 planes):
• Read the chunk back
• Fill the chunk with the plane
• Compress
• Write the chunk to the file
• Total disk accesses: 20 × (1 + 2×49) = 1980
• Total data written and read: ? (see next slide)
• Note: HDF5 could probably detect such behavior and increase the cache size
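The access counts in this case study reduce to a formula. The model below (our own sketch of the bookkeeping, not library code) reproduces all three write scenarios: chunk fits in the cache, uncompressed chunk too big for the cache, and compressed chunk too big for the cache:

```python
def write_accesses(n_chunks, planes_per_chunk, fits_in_cache, compressed):
    """Disk accesses to write a chunked dataset one plane at a time."""
    if fits_in_cache:
        return n_chunks  # each chunk is assembled in cache and written once
    if not compressed:
        # the first plane writes the whole chunk; remaining planes go straight to disk
        return n_chunks * (1 + (planes_per_chunk - 1))
    # compressed: every later plane must read the chunk back, then rewrite it
    return n_chunks * (1 + 2 * (planes_per_chunk - 1))

# Case study: 20 chunks, 50 planes per chunk.
# Fits in cache: 20 accesses; uncompressed, no fit: 1000; compressed, no fit: 1980.
```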

Effect of Chunk Cache Size on Write
No compression, chunk size is 2 MB:

Cache size     | I/O operations | Total data written         | File size
1 MB (default) | 1002           | 75.54 MB                   | 38.15 MB
5 MB           | 22             | 38.16 MB                   | 38.15 MB

Gzip compression:

Cache size     | I/O operations | Total data written         | File size
1 MB (default) | 1982           | 335.42 MB (322.34 MB read) | 13.08 MB
5 MB           | 22             | 13.08 MB                   | 13.08 MB
Effect of Chunk Cache Size on Write
• With the 1 MB cache size, a chunk will not fit into the cache
• All writes to the dataset must be immediately written to disk
• With compression, the entire chunk must be read and rewritten every time a part of the chunk is written
• The data must also be decompressed and recompressed each time
• Non-sequential writes could result in a larger file
• Without compression, the entire chunk must be written when it is first written to the file
• If the selection is not contiguous on disk, it could require as much as one disk access for each element
Effect of Chunk Cache Size on Write
• With the 5 MB cache size, the chunk is written
only after it is full
• Drastically reduces the number of I/O
operations
• Reduces the amount of data that must be
written (and read)
• Reduces processing time, especially with the
compression filter

Conclusion
• It is important to make sure that a chunk will fit
into the raw data chunk cache
• If you will be writing to multiple chunks at once,
you should increase the cache size even more
• Try to design chunk dimensions to minimize the number of chunks you will be writing to at once

November 3-5,
2009

HDF/HDF-EOS Workshop XIII

40

www.hdfgroup.org
Reading Chunked Dataset
• Read the same dataset, again by slices, but
the slices cross through all the chunks
• 2 orientations for read plane
• Plane includes fastest changing dimension
• Plane does not include fastest changing
dimension

• Measure total read operations, and total size
read
• Chunk sizes of 50x100x100, and 10x100x100
• 1 MB cache
Test Setup
• Chunks
• Read slices: vertical and horizontal
(Figure: the 1000 x 100 x 100 dataset showing the chunks and the two read-slice orientations)
Aside: Reading from contiguous dataset

• Repeat for each plane (100 total):
• Repeat 1000 times:
• Read a row
• Seek to the beginning of the next read
• Total: 100,000 (10^5) disk accesses
Reading chunked dataset
• No compression; chunk fits into cache
• For each plane (100 total):
• For each chunk (20 total):
• Read the chunk
• Extract 50 rows
• Total: 2000 disk accesses
• Chunk doesn't fit into cache:
• Data is read directly from the file
• 100,000 (10^5) disk accesses
Reading chunked dataset
• Compression; cache size doesn't matter in this case
• For each plane (100 total):
• For each chunk (20 total):
• Read the chunk, uncompress
• Extract 50 rows
• Total: 2000 disk accesses
Results
• The read slice includes the fastest changing dimension

Chunk size | Compression | I/O operations | Total data read
50         | Yes         | 2010           | 1307 MB
10         | Yes         | 10012          | 1308 MB
50         | No          | 100010         | 38 MB
10         | No          | 10012          | 3814 MB
Aside: Reading from contiguous dataset

• Repeat for each plane (100 total):
• Repeat for each column (1000 total):
• Repeat for each element (100 total):
• Read the element
• Seek to the next one
• Total: 10,000,000 (10^7) disk accesses
Reading chunked dataset
• No compression; chunk fits into cache
• For each plane (100 total):
• For each chunk (20 total):
• Read the chunk
• Extract 50 columns
• Total: 2000 disk accesses
• Chunk doesn't fit into cache:
• Data is read directly from the file
• 10,000,000 (10^7) disk operations
Reading chunked dataset
• Compression; cache size doesn't matter
• For each plane (100 total):
• For each chunk (20 total):
• Read the chunk, uncompress
• Extract 50 columns
• Total: 2000 disk accesses

Results (continued)
• The read slice does not include the fastest changing dimension

Chunk size | Compression | I/O operations | Total data read
50         | Yes         | 2010           | 1307 MB
10         | Yes         | 10012          | 1308 MB
50         | No          | 10000010       | 38 MB
10         | No          | 10012          | 3814 MB
Effect of Cache Size on Read
• When compression is enabled, the library must always read the entire chunk once for each call to H5Dread (unless it is already in the cache)
• When compression is disabled, the library's behavior depends on the cache size relative to the chunk size:
• If the chunk fits in the cache, the library reads the entire chunk once for each call to H5Dread
• If the chunk does not fit in the cache, the library reads only the data that is selected
• More read operations, especially if the read plane does not include the fastest changing dimension
• Less total data read
Conclusion
• On read, the cache size does not matter when compression is enabled.
• Without compression, the cache must be large enough to hold all of the chunks to get good performance.
• The optimum cache size depends on the exact shape of the data, as well as the hardware and the access pattern.


Thank You!

Acknowledgements
This work was supported by cooperative agreement number NNX08AO77A from
the National Aeronautics and Space Administration (NASA). Any opinions,
findings, conclusions, or recommendations expressed in this material are
those of the author[s] and do not necessarily reflect the views of the
National Aeronautics and Space Administration.


Questions/comments?


Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

HDF5 Advanced Topics - Chunking

  • 1. The HDF Group HDF5 Advanced Topics Elena Pourmal The HDF Group The 13th HDF and HDF-EOS Workshop November 3-5, 2009 HDF/HDF-EOS Workshop XIII www.hdfgroup.org
  • 2. The HDF Group Chunking in HDF5
  • 3. Goal • To help you understand how HDF5 chunking works, so you can efficiently store and retrieve data from HDF5
  • 4. Recall from Intro: HDF5 Dataset • Metadata: Dataspace (rank 3; Dim_1 = 4, Dim_2 = 5, Dim_3 = 7), Datatype (IEEE 32-bit float), Storage info (chunked, compressed), Attributes (Time = 32.4, Pressure = 987, Temp = 56) • Dataset data
  • 5. Contiguous storage layout • Metadata header separate from dataset data • Data stored in one contiguous block in HDF5 file • (Diagram: dataset header with datatype, dataspace, and attributes in the metadata cache; dataset data in application memory and in the file)
  • 6. What is HDF5 Chunking? • Data is stored in chunks of predefined size • The two-dimensional case is sometimes referred to as data tiling • The HDF5 library always writes/reads a whole chunk
  • 7. What is HDF5 Chunking? • Dataset data is divided into equally sized blocks (chunks) • Each chunk is stored separately as a contiguous block in the HDF5 file • (Diagram: dataset header and chunk index in the metadata cache; chunks A, B, C, D in application memory; header, chunk index, and chunks A, C, D, B in the file)
  • 8. Why HDF5 Chunking? • Chunking is required for several HDF5 features • Enabling compression and other filters, like checksum • Extendible datasets
  • 9. Why HDF5 Chunking? • If used appropriately, chunking improves partial I/O for big datasets • Only two chunks are involved in I/O
  • 10. Creating Chunked Dataset 1. Create a dataset creation property list. 2. Set the property list to use chunked storage layout. 3. Create the dataset with the above property list. dcpl_id = H5Pcreate(H5P_DATASET_CREATE); rank = 2; ch_dims[0] = 100; ch_dims[1] = 200; H5Pset_chunk(dcpl_id, rank, ch_dims); dset_id = H5Dcreate(…, dcpl_id); H5Pclose(dcpl_id);
  • 11. Creating Chunked Dataset • Things to remember: • A chunk always has the same rank as the dataset • A chunk's dimensions do not need to be factors of the dataset's dimensions • Caution: this may cause more I/O than desired (see the white portions of the chunks below)
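A quick way to see the effect of non-factor chunk dimensions is to count how many chunks the library must allocate. The sketch below is illustrative Python, not the HDF5 API; the 500 x 500 dataset size is a made-up example paired with the 100 x 200 chunks from the previous slide.

```python
from math import ceil

def chunk_grid(dset_dims, chunk_dims):
    # One chunk per (possibly partial) tile along each dimension;
    # partial chunks at the edges still occupy a full chunk on disk.
    return tuple(ceil(d / c) for d, c in zip(dset_dims, chunk_dims))

# Hypothetical 500 x 500 dataset with 100 x 200 chunks:
# 200 does not divide 500, so the last chunk column is only partly used.
print(chunk_grid((500, 500), (100, 200)))  # (5, 3)
```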
  • 12. Quiz time • Why shouldn't I make a chunk with dimension sizes equal to one? • Can I change the chunk size after the dataset has been created?
  • 13. Writing or Reading Chunked Dataset 1. The chunking mechanism is transparent to the application. 2. Use the same set of operations as for a contiguous dataset, for example: H5Dopen(…); H5Sselect_hyperslab(…); H5Dread(…); 3. Selections do not need to coincide precisely with chunk boundaries.
  • 14. HDF5 Chunking and Compression • Chunking is required for compression and other filters • HDF5 filters modify data during I/O operations • Filters provided by HDF5: Checksum (H5Pset_fletcher32), Data transformation (in 1.8.*), Shuffling filter (H5Pset_shuffle) • Compression filters in HDF5: Scale + offset (in 1.8.*) (H5Pset_scaleoffset), N-bit (in 1.8.*) (H5Pset_nbit), GZIP (deflate) (H5Pset_deflate), SZIP (H5Pset_szip)
  • 15. HDF5 Third-Party Filters • Compression methods supported by the HDF5 user community: http://wiki.hdfgroup.org/Community-Support-for-HDF5 • LZO lossless compression (PyTables) • BZIP2 lossless compression (PyTables) • BLOSC lossless compression (PyTables) • LZF lossless compression (h5py)
  • 16. Creating Compressed Dataset 1. Create a dataset creation property list. 2. Set the property list to use chunked storage layout. 3. Set the property list to use filters. 4. Create the dataset with the above property list. crp_id = H5Pcreate(H5P_DATASET_CREATE); rank = 2; ch_dims[0] = 100; ch_dims[1] = 100; H5Pset_chunk(crp_id, rank, ch_dims); H5Pset_deflate(crp_id, 9); dset_id = H5Dcreate(…, crp_id); H5Pclose(crp_id);
  • 17. The HDF Group Performance Issues, or What everyone needs to know about chunking, compression and chunk cache
  • 18. Accessing a row in contiguous dataset • One seek is needed to find the starting location of the row of data. • Data is read/written using one disk access.
  • 19. Accessing a row in chunked dataset • Five seeks are needed, one to find each chunk. • Data is read/written using five disk accesses. • For this access pattern, chunked storage is less efficient than contiguous storage.
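The "five seeks" follow from how many chunks a full row crosses. A small model (illustrative Python, not the HDF5 API; the 500-element row and 100-element chunk width are assumed from the figure):

```python
from math import ceil

def chunks_crossed_by_row(row_len, chunk_width):
    # Reading one full row touches one chunk per chunk-column,
    # and each chunk needs its own seek and disk access.
    return ceil(row_len / chunk_width)

print(chunks_crossed_by_row(500, 100))  # 5 (vs. 1 access for contiguous storage)
```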
  • 20. Quiz time • How might I improve this situation, if it is common to access my data in this way?
  • 21. Accessing data in contiguous dataset (M rows) • M seeks are needed to find the starting location of the elements. • Data is read/written using M disk accesses. • Performance may be very bad.
  • 22. Motivation for chunking storage (M rows) • Two seeks are needed to find the two chunks. • Data is read/written using two disk accesses. • For this pattern, chunking helps with I/O performance.
  • 23. Quiz time • If I know I shall always access a column at a time, what size and shape should I make my chunks?
  • 24. Motivation for chunk cache (chunks A and B) • The selection shown is written by two H5Dwrite calls (one for each row). • Chunks A and B are accessed twice (once for each row). • If both chunks fit into cache, only two I/O accesses are needed to write the shown selections.
  • 25. Motivation for chunk cache (chunks A and B) • Question: What happens if there is space for only one chunk at a time?
  • 26. HDF5 raw data chunk cache • Improves performance whenever the same chunks are read or written multiple times. • The current implementation doesn't adjust parameters automatically (cache size, size of hash table). • Chunks are indexed with a simple hash table. • Hash function = (cindex mod nslots), where cindex is the linear index into a hypothetical array of chunks and nslots is the size of the hash table. • Only one of several chunks with the same hash value stays in cache. • nslots should be a prime number to minimize the number of hash value collisions.
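The hash-table behavior above can be seen with a few lines of Python (a model of the hash function only, not the actual cache code; the 10-chunk-wide grid and the nslots values are made-up examples):

```python
def chunk_slot(cindex, nslots):
    # The chunk cache's hash function: linear chunk index mod hash-table size.
    return cindex % nslots

# With nslots = 10, chunks 0, 10, 20, 30 (e.g. the first chunk of each row in a
# 10-chunk-wide grid) all hash to slot 0, so they keep evicting one another
# even if the cache has room for all of them.
print([chunk_slot(i, 10) for i in (0, 10, 20, 30)])   # [0, 0, 0, 0]

# A prime nslots such as 521 spreads the same chunks over distinct slots.
print([chunk_slot(i, 521) for i in (0, 10, 20, 30)])  # [0, 10, 20, 30]
```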
  • 27. HDF5 Chunk Cache APIs • H5Pset_chunk_cache sets raw data chunk cache parameters for a dataset: H5Pset_chunk_cache(dapl, rdcc_nslots, rdcc_nbytes, rdcc_w0); • H5Pset_cache sets raw data chunk cache parameters for all datasets in a file: H5Pset_cache(fapl, 0, nslots, 5*1024*1024, rdcc_w0);
  • 28. Hints for Chunk Settings • Chunk dimension sizes should align as closely as possible with hyperslab dimensions for read/write • Chunk cache size (rdcc_nbytes) should be large enough to hold all the chunks in a selection • If this is not possible, it may be best to disable chunk caching altogether (set rdcc_nbytes to 0) • rdcc_nslots should be a prime number that is at least 10 to 100 times the number of chunks that can fit into rdcc_nbytes • rdcc_w0 should be set to 1 if chunks that have been fully read/written will never be read/written again
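The rdcc_nslots rule of thumb above is easy to automate. The helper names below are hypothetical, and the sketch is illustrative Python rather than anything provided by the HDF5 library:

```python
def next_prime(n):
    # Smallest prime >= n (trial division; fine for cache-sized numbers).
    def is_prime(k):
        if k < 2:
            return False
        i = 2
        while i * i <= k:
            if k % i == 0:
                return False
            i += 1
        return True
    while not is_prime(n):
        n += 1
    return n

def suggest_rdcc_nslots(chunks_in_cache, factor=100):
    # Rule of thumb from the slide: a prime that is 10-100x the number
    # of chunks that fit into rdcc_nbytes.
    return next_prime(factor * chunks_in_cache)

print(suggest_rdcc_nslots(5))  # 503
```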
  • 29. The Good and The Ugly: Reading a row (chunks A and B; M rows) • Each row is read by a separate call to H5Dread • The Good: If both chunks fit into cache, 2 disk accesses are needed to read the data. • The Ugly: If only one chunk fits into cache, 2M disk accesses are needed to read the data (compare with M accesses for contiguous storage).
  • 30. Case study: Writing Chunked Dataset • 1000x100x100 dataset • 4-byte integers • Random values 0-99 • 50x100x100 chunks (20 total) • Chunk size: 2 MB • Write the entire dataset using 1x100x100 slices • Slices are written sequentially • Chunk cache size of 1 MB (default) compared with chunk cache size of 5 MB
  • 31. Test Setup • 20 chunks • 1000 slices • Chunk size ~ 2 MB • Total size ~ 40 MB • Each plane ~ 40 KB
  • 32. Aside: Writing dataset with contiguous storage • 1000 disk accesses to write 1000 planes • Total size written 40 MB
  • 33. Writing chunked dataset • Example: Chunk fits into cache • Chunk is filled in cache and then written to disk • 20 disk accesses are needed (vs. 1000 disk accesses for contiguous) • Total size written 40 MB
  • 34. Writing chunked dataset • Example: Chunk doesn't fit into cache • For each chunk (20 total): 1. Fill the chunk in memory with the first plane and write it to the file 2. Write the 49 remaining planes to the file directly • Total disk accesses: 20 x (1 + 49) = 1000 • Total data written ~80 MB (vs. 40 MB)
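The access counts on slides 33-34 can be reproduced with a small model (illustrative Python; the defaults are the case study's 20 chunks of 50 planes each):

```python
def write_accesses(n_chunks=20, planes_per_chunk=50, chunk_fits_cache=True):
    if chunk_fits_cache:
        # The full chunk is assembled in cache and written once.
        return n_chunks
    # The first plane writes the newly allocated chunk; every later plane
    # goes straight to the file as a separate access.
    return n_chunks * (1 + (planes_per_chunk - 1))

print(write_accesses(chunk_fits_cache=True))   # 20
print(write_accesses(chunk_fits_cache=False))  # 1000
```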
  • 35. Writing compressed chunked dataset • Example: Chunk fits into cache • For each chunk (20 total): 1. Fill the chunk in memory, compress it and write it to the file • Total disk accesses: 20 • Total data written: less than 40 MB
  • 36. Writing compressed chunked dataset • Example: Chunk doesn't fit into cache • For each chunk (20 total): • Fill the chunk with the first plane, compress, write to the file • For each new plane (49 planes): read the chunk back, fill the chunk with the plane, compress, write the chunk to the file • Total disk accesses: 20 x (1 + 2 x 49) = 1980 • Total data written and read? (see next slide) • Note: HDF5 can probably detect such behavior and increase the cache size
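With compression the penalty doubles, because each later plane forces a read-back of the compressed chunk before the rewrite. The same kind of model as before (illustrative Python, case-study parameters):

```python
def compressed_write_accesses(n_chunks=20, planes_per_chunk=50,
                              chunk_fits_cache=True):
    if chunk_fits_cache:
        # Compress the finished chunk in cache and write it once.
        return n_chunks
    # First plane: one write. Each remaining plane must read the compressed
    # chunk back and rewrite it: two accesses per plane.
    return n_chunks * (1 + 2 * (planes_per_chunk - 1))

print(compressed_write_accesses(chunk_fits_cache=True))   # 20
print(compressed_write_accesses(chunk_fits_cache=False))  # 1980
```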
  • 37. Effect of Chunk Cache Size on Write • No compression, chunk size is 2 MB: with 1 MB cache (default), 1002 I/O operations, 75.54 MB total data written, 38.15 MB file size; with 5 MB cache, 22 I/O operations, 38.16 MB total data written, 38.15 MB file size • Gzip compression: with 1 MB cache (default), 1982 I/O operations, 335.42 MB total data written (322.34 MB read), 13.08 MB file size; with 5 MB cache, 22 I/O operations, 13.08 MB total data written, 13.08 MB file size
  • 38. Effect of Chunk Cache Size on Write • With the 1 MB cache size, a chunk will not fit into the cache • All writes to the dataset must be immediately written to disk • With compression, the entire chunk must be read and rewritten every time a part of the chunk is written to • Data must also be decompressed and recompressed each time • Non-sequential writes could result in a larger file • Without compression, the entire chunk must be written when it is first written to the file • If the selection were not contiguous on disk, it could require as much as one disk access for each element
  • 39. Effect of Chunk Cache Size on Write • With the 5 MB cache size, the chunk is written only after it is full • Drastically reduces the number of I/O operations • Reduces the amount of data that must be written (and read) • Reduces processing time, especially with the compression filter
  • 40. Conclusion • It is important to make sure that a chunk will fit into the raw data chunk cache • If you will be writing to multiple chunks at once, you should increase the cache size even more • Try to design chunk dimensions to minimize the number of chunks you will be writing to at once
  • 41. Reading Chunked Dataset • Read the same dataset, again by slices, but the slices cross through all the chunks • 2 orientations for the read plane • Plane includes the fastest changing dimension • Plane does not include the fastest changing dimension • Measure total read operations and total size read • Chunk sizes of 50x100x100 and 10x100x100 • 1 MB cache
  • 42. Test Setup • Chunks • Read slices • Vertical and horizontal (diagram: 1000 x 100 x 100 dataset)
  • 43. Aside: Reading from contiguous dataset • For each plane (100 total): for each row (1000 total): read the row, then seek to the beginning of the next read • Total 10^5 disk accesses
  • 44. Reading chunked dataset • No compression; chunk fits into cache • For each plane (100 total): for each chunk (20 total): read the chunk, extract 50 rows • Total 2000 disk accesses • Chunk doesn't fit into cache • Data is read directly from the file • 10^5 disk accesses
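The two read counts above can be modeled the same way as the write counts (illustrative Python; 20 chunks crossed per slice and 1000 rows per slice are assumed from the test geometry):

```python
def reads_per_slice(chunks_crossed, rows_in_slice, chunk_fits_cache=True):
    if chunk_fits_cache:
        # Each chunk the slice crosses is read (or re-read) once per slice.
        return chunks_crossed
    # Cache is bypassed: every row of the slice is a separate read.
    return rows_in_slice

n_slices = 100
print(n_slices * reads_per_slice(20, 1000, chunk_fits_cache=True))   # 2000
print(n_slices * reads_per_slice(20, 1000, chunk_fits_cache=False))  # 100000
```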
  • 45. Reading chunked dataset • Compression • Cache size doesn't matter in this case • For each plane (100 total): for each chunk (20 total): read the chunk, uncompress, extract 50 rows • Total 2000 disk accesses
  • 46. Results • Read slice includes the fastest changing dimension • Chunk size 50, compression: 2010 I/O operations, 1307 MB total data read • Chunk size 10, compression: 10012 I/O operations, 1308 MB total data read • Chunk size 50, no compression: 100010 I/O operations, 38 MB total data read • Chunk size 10, no compression: 10012 I/O operations, 3814 MB total data read
  • 47. Aside: Reading from contiguous dataset • For each plane (100 total): for each column (1000 total): for each element (100 total): read the element, then seek to the next one • Total 10^7 disk accesses
  • 48. Reading chunked dataset • No compression; chunk fits into cache • For each plane (100 total): for each chunk (20 total): read the chunk, extract 50 columns • Total 2000 disk accesses • Chunk doesn't fit into cache • Data is read directly from the file • 10^7 disk operations
  • 49. Reading chunked dataset • Compression; cache size doesn't matter • For each plane (100 total): for each chunk (20 total): read the chunk, uncompress, extract 50 columns • Total 2000 disk accesses
  • 50. Results (continued) • Read slice does not include the fastest changing dimension • Chunk size 50, compression: 2010 I/O operations, 1307 MB total data read • Chunk size 10, compression: 10012 I/O operations, 1308 MB total data read • Chunk size 50, no compression: 10000010 I/O operations, 38 MB total data read • Chunk size 10, no compression: 10012 I/O operations, 3814 MB total data read
  • 51. Effect of Cache Size on Read • When compression is enabled, the library must always read the entire chunk once for each call to H5Dread (unless it is in cache) • When compression is disabled, the library's behavior depends on the cache size relative to the chunk size • If the chunk fits in cache, the library reads the entire chunk once for each call to H5Dread • If the chunk does not fit in cache, the library reads only the data that is selected • More read operations, especially if the read plane does not include the fastest changing dimension • Less total data read
  • 52. Conclusion • On read, cache size does not matter when compression is enabled. • Without compression, the cache must be large enough to hold all of the chunks to get good performance. • The optimum cache size depends on the exact shape of the data, as well as the hardware and the access pattern.
  • 53. The HDF Group Thank You!
  • 54. Acknowledgements This work was supported by cooperative agreement number NNX08AO77A from the National Aeronautics and Space Administration (NASA). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author[s] and do not necessarily reflect the views of the National Aeronautics and Space Administration.
  • 55. The HDF Group Questions/comments?

Editor's notes

  1. More data will be written in this case. Ghost zones are filled with the fill value unless the fill value is disabled.
  2. A 2 MB chunk doesn't fit into the 1 MB cache. The chunk is allocated in the file and written once with the first frame; HDF5 then writes one frame at a time, for 20 x 49 + 20 = 1000 writes, plus small I/O for metadata. Almost twice the data size participates in I/O. A 2 MB chunk does fit into the 5 MB cache: we write 50 frames at once, when we fill the chunk, and do it only 20 times! With compression the situation is even worse: we need to write the chunk every time we write a frame, then read it back to write another frame, etc. To modify planes 2-50 (49 of them) in one chunk, we have to read the chunk back 49 times x 20 chunks, so we get 1000 writes + 980 reads for raw data. A bigger cache works nicely.
  3. First case: to read one plane, we need to read each chunk (20 I/Os) and do it 100 times (number of rows); the chunk doesn't fit into cache. Second case: to read one plane we need to read each chunk (100 I/Os) and do it 100 times (for each row). Third case: the cache is bypassed; the library reads from the file 1000 x 100 (for each row) = 100000 times. Fourth case: the chunk fits into cache, so for one plane we do 100 reads to bring in all chunks, then do it 100 times for each row.
  4. No difference for the first, second, and fourth cases (for the first two we always bring the chunk into memory and uncompress; the fourth fits into cache). In the third case, the chunk doesn't fit into cache and the library reads directly from the file, getting one element at a time (1000 x 100 (# of rows) x 100 (# of columns) = 10000000).