1. Using HDF5 and Python: The H5py module
Daniel Kahn
Science Systems and Applications, Inc.
Acknowledgement: Thanks to Ed Masuoka, NASA Contract NNG06HX18C
HDF & HDF-EOS Workshop XV 17 April 2012
2. Python has lists:
>>> for elem in ['FirstItem','SecondItem','ThirdItem']:
...     print elem
...
FirstItem
SecondItem
ThirdItem
>>>
We can assign the list to a variable.
>>> MyList = ['FirstItem','SecondItem','ThirdItem']
>>> for elem in MyList:
...     print elem
...
FirstItem
SecondItem
ThirdItem
>>>
3. Lists can contain a mix of objects:
>>> MixedList = ['MyString',5,[72, 99.44]]
>>> for elem in MixedList:
...     print elem
...
MyString
5
[72, 99.44]
The last element, [72, 99.44], is a list inside a list.
Lists can be addressed by index:
>>> MixedList[0]
'MyString'
>>> MixedList[2]
[72, 99.44]
4. A note about Python lists:
Python lists are one dimensional.
Arithmetic operators do not act element-wise on them
(+ concatenates and * repeats).
Don’t be tempted to use them for scientific array-based
data sets. More on the ‘right way’ later...
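To make the point concrete, here is a minimal sketch contrasting a list with a Numpy array. Numpy is only introduced later in these slides, and the variable names are made up for illustration:

```python
import numpy as np

values = [1, 2, 3]

# On a list, * repeats and + concatenates; neither is element-wise.
repeated = values * 2          # [1, 2, 3, 1, 2, 3]
joined = values + [10]         # [1, 2, 3, 10]

# On a Numpy array the same operators act on every element.
array = np.array(values)
doubled = array * 2            # array([2, 4, 6])
shifted = array + 10           # array([11, 12, 13])

print(repeated, joined, doubled.tolist(), shifted.tolist())
```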
5. Python has dictionaries.
Dictionaries are key,value pairs
>>> Dictionary = {'FirstKey':'FirstValue',
...               'SecondKey':'SecondValue',
...               'ThirdKey':'ThirdValue'}
>>> Dictionary
{'SecondKey': 'SecondValue', 'ThirdKey': 'ThirdValue', 'FirstKey': 'FirstValue'}
>>>
Notice that Python prints the key, value pairs in a different
order than I typed them.
In the Python 2 shown here, the key, value pairs in a dictionary are
unordered. (Since Python 3.7, dictionaries preserve insertion order.)
6. Dictionaries are not lists; however, we can easily create a list
of the dictionary keys:
>>> list(Dictionary)
['SecondKey', 'ThirdKey', 'FirstKey']
>>>
We can use a dictionary in a loop without additional
elaboration:
>>> for Key in Dictionary:
...     print Key,"---->",Dictionary[Key]
...
SecondKey ----> SecondValue
ThirdKey ----> ThirdValue
FirstKey ----> FirstValue
>>>
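If a predictable order is needed, one option is to loop over the sorted keys. A small sketch, reusing the dictionary from the slides (the `lines` list is only there for illustration):

```python
Dictionary = {'FirstKey': 'FirstValue',
              'SecondKey': 'SecondValue',
              'ThirdKey': 'ThirdValue'}

# sorted() returns the keys as a list in alphabetical order,
# regardless of how the dictionary stores them internally.
lines = []
for Key in sorted(Dictionary):
    lines.append(Key + " ----> " + Dictionary[Key])

print("\n".join(lines))
```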
7. HDF5 is made of “Dictionaries”: a dataset name is the key,
and the array is the value.
HDFView is a tool which shows us the keys (TreeView) and the
values (TableView) of an HDF5 file.
[Figure: HDFView screenshot, annotated “Keys” and “Value”]
8. Andrew Collette’s H5py module allows us to use Python and
HDF5 together.
We can use H5py to manipulate HDF5 files as if they were
Python Dictionaries
>>> import h5py
>>> in_fid = h5py.File('DansExample1.h5','r')
>>> for DS in in_fid:
...     print DS,"------->",in_fid[DS]
...
FirstDataset -------> <HDF5 dataset "FirstDataset": shape (25,), type "<i4">
SecondDataset -------> <HDF5 dataset "SecondDataset": shape (3, 3), type "<i4">
ThirdDataset -------> <HDF5 dataset "ThirdDataset": shape (5, 5), type "<i4">
>>>
In the output, the names on the left are the keys and the dataset
objects on the right are the values.
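Other dictionary idioms carry over as well. The sketch below is written for Python 3 and builds its own small stand-in file in a temporary directory, since DansExample1.h5 is not available here; the file name and contents are assumptions for illustration only:

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small stand-in file (hypothetical; not the DansExample1.h5 from the slides).
path = os.path.join(tempfile.mkdtemp(), 'standin.h5')
with h5py.File(path, 'w') as out_fid:
    out_fid['FirstDataset'] = np.arange(25, dtype='int32')
    out_fid['SecondDataset'] = np.zeros((3, 3), dtype='int32')

with h5py.File(path, 'r') as in_fid:
    names = sorted(in_fid.keys())              # dataset names, like dict keys
    has_first = 'FirstDataset' in in_fid       # membership test, like a dict
    shapes = {name: ds.shape for name, ds in in_fid.items()}

print(names, has_first, shapes)
```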
9. So What? We need to be able to manipulate the arrays, not
just the file.
The Numpy module by Travis Oliphant allows the manipulation
of arrays in Python.
We will see examples of writing arrays later, but to get arrays
from the H5py object we index it with the ellipsis, [...].
>>> import h5py
>>> fid = h5py.File('DansExample1.h5','r')
>>> fid['FirstDataset']
<HDF5 dataset "FirstDataset": shape (25,), type "<i4">
>>> fid['FirstDataset'][...]
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])
>>> type(fid['FirstDataset'][...])
<type 'numpy.ndarray'>
>>>
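The ellipsis reads the whole dataset, but ordinary slice syntax can be used to read only part of it. A minimal Python 3 sketch, again using a hypothetical stand-in file created on the spot rather than the file from the slides:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in file with one 25-element integer dataset.
path = os.path.join(tempfile.mkdtemp(), 'slices.h5')
with h5py.File(path, 'w') as fid:
    fid.create_dataset('FirstDataset', data=np.arange(25, dtype='int32'))

with h5py.File(path, 'r') as fid:
    dataset = fid['FirstDataset']      # an h5py Dataset; nothing is read yet
    everything = dataset[...]          # whole dataset as a numpy.ndarray
    first_five = dataset[:5]           # only elements 0..4 are read
    every_fifth = dataset[::5]         # strided read: 0, 5, 10, 15, 20

print(type(everything).__name__, first_five.tolist(), every_fifth.tolist())
```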
10. Reasons to use Python and HDF5 instead of C or Fortran
The basic Python Dictionary object has a close similarity to
the HDF5 Group. The object-oriented and dynamic nature of
Python allows the existing Dictionary syntax to be repurposed
for HDF5 manipulation.
In short, working with HDF5 in Python requires much less
code than C or Fortran, which means faster development and
fewer errors.
11. Comparison to C, h5_gzip:
                        # Lines of code
C                       106
Python from THG site    37
Fewer lines of code means fewer places to make mistakes.
The 37 line h5_gzip.py example is a “direct” translation of the
C version. Some more advanced techniques offer insight into the
advantages of Python/H5py programming. Text in the next
slides is color coded to help match code with the same functionality.
First writing a file…
12. Original h5_gzip.py
# This example creates and writes GZIP compressed dataset.
import h5py
import numpy as np
# Create gzip.h5 file.
#
file = h5py.File('gzip.h5','w')
#
# Create /DS1 dataset; in order to use compression, dataset has to be chunked.
#
dataset = file.create_dataset('DS1',(32,64),'i',chunks=(4,8),compression='gzip',compression_opts=9)
#
# Initialize data.
#
data = np.zeros((32,64))
for i in range(32):
    for j in range(64):
        data[i][j]= i*j-j
# Write data.
print "Writing data..."
dataset[...] = data
file.close()
Pythonic h5_gzip.py
#!/usr/bin/env python
# It's a UNIX thing.....
from __future__ import print_function # Code will work with python 3 as well....
# This example creates and writes GZIP compressed dataset.
import h5py # load the HDF5 interface module
import numpy as np # Load the array processing module
# Initialize data. Note the numbers 32 and 64 only appear ONCE in the code!
LeftVector = np.arange(-1,32-1,dtype='int32')
RightVector = np.arange(64,dtype='int32')
DataArray = np.outer(LeftVector,RightVector) # create 32x64 array of i*j-j
# The _with_ construct will automatically create and close the HDF5 file
with h5py.File('gzip-pythonic.h5','w') as h5_fid:
    # Create and write /DS1 dataset; in order to use compression, dataset has to be chunked.
    h5_fid.create_dataset('DS1',data=DataArray,chunks=(4,8),compression='gzip',compression_opts=9)
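The outer product works because i*j-j equals (i-1)*j. As a quick check, the sketch below builds the array both ways and compares them; the `looped` array and its integer dtype are choices made here so the two results can be compared directly:

```python
import numpy as np

# Loop version, as in the original example (integer dtype chosen here for comparison).
looped = np.zeros((32, 64), dtype='int32')
for i in range(32):
    for j in range(64):
        looped[i][j] = i*j - j

# Vectorised version: i*j - j == (i - 1)*j, an outer product of two vectors.
LeftVector = np.arange(-1, 32 - 1, dtype='int32')   # values i - 1 for i = 0..31
RightVector = np.arange(64, dtype='int32')          # values j for j = 0..63
DataArray = np.outer(LeftVector, RightVector)

print(DataArray.shape, np.array_equal(looped, DataArray))
```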
13. Reading data….
Original h5_gzip.py
# Read data back; display compression properties and dataset max value.
#
file = h5py.File('gzip.h5','r')
dataset = file['DS1']
print "Compression method is", dataset.compression
print "Compression parameter is", dataset.compression_opts
data = dataset[...]
print "Maximum value in", dataset.name, "is:", max(data.ravel())
file.close()
Pythonic h5_gzip.py
# Read data back; display compression properties and dataset max value.
#
with h5py.File('gzip-pythonic.h5','r') as h5_fid:
    dataset = h5_fid['DS1']
    print("Compression method is", dataset.compression)
    print("Compression parameter is", dataset.compression_opts)
    # dataset[...] reads the whole array (the older dataset.value was removed in h5py 3.0)
    print("Maximum value in", dataset.name, "is:", dataset[...].max())
14. And finally, just to see what the file looks like…
15. Real world example: Table Comparison
Background:
For the OMPS Instruments we need to design binary
arrays to be uploaded to the satellite to sub-sample the
CCD to reduce the data rate.
For ground processing use we store these arrays in
HDF5.
As part of the design process we want to be able to
compare arrays in two different files.
16. Here is an example of a Sample Table
17. Here is another example:
18. Here is the “difference” of the arrays. Red pixels are
unique to the first array.
19. The code: CompareST.py
#!/usr/bin/env python
""" Documentation """
from __future__ import print_function,division
import h5py
import numpy
import ViewFrame
def CompareST(ST1,ST2,IntTime):
    with h5py.File(ST1,'r') as st1_fid,h5py.File(ST2,'r') as st2_fid:
        # [...] reads each table into a Numpy array (replaces the older .value)
        ST1 = st1_fid['/DATA/'+IntTime+'/SampleTable'][...]
        ST2 = st2_fid['/DATA/'+IntTime+'/SampleTable'][...]
    # The arrays are in memory now, so both files can be closed.
    ST1[ST1!=0] = 1
    ST2[ST2!=0] = 1
    Diff = (ST1 - ST2)
    ST1[Diff == 1] = 2 # pixels unique to the first table
    ViewFrame.ViewFrame(ST1)
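The masking steps can be tried without any HDF5 files. The sketch below applies the same four array operations to two tiny made-up tables; the values and the signed integer dtype are assumptions for illustration, not OMPS data:

```python
import numpy as np

# Two small made-up sample tables (illustrative values only).
ST1 = np.array([[0, 7, 7],
                [3, 0, 0]], dtype='int32')
ST2 = np.array([[0, 7, 0],
                [0, 0, 5]], dtype='int32')

ST1[ST1 != 0] = 1      # reduce both tables to 0/1 masks
ST2[ST2 != 0] = 1
Diff = ST1 - ST2       # 1 where only ST1 is set, -1 where only ST2 is set
ST1[Diff == 1] = 2     # mark pixels unique to the first table

print(ST1)
```

After this, a 2 marks a pixel set only in the first table, a 1 marks a pixel set in both, and a 0 marks a pixel not set in the first table.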
20. ..and the command line argument parsing.
if __name__ == "__main__":
    import argparse
    OptParser = argparse.ArgumentParser(description = __doc__)
    OptParser.add_argument("--ST1",help="SampleTableFile1")
    OptParser.add_argument("--ST2",help="SampleTableFile2")
    OptParser.add_argument("--IntTime",help="Integration Time",
                           default='Long')
    options = OptParser.parse_args()
    CompareST(options.ST1,options.ST2,options.IntTime)
21. Recursive descent into HDF5 file
Print group names, number of children and dataset names.
#!/usr/bin/env python
from __future__ import print_function
import h5py
def print_num_children(obj):
    if isinstance(obj,h5py.Group):
        print(obj.name,"Number of Children:",len(obj))
        for ObjName in obj: # ObjName will be a string
            print_num_children(obj[ObjName])
    else:
        print(obj.name,"Not a group")
with h5py.File('OMPS-NPP-NPP-LP_STB', 'r') as f: # read-only: nothing is written
    print_num_children(f)
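H5py can also do the walking for us: a group's visititems method calls a function once for every object below that group. The sketch below is an alternative to the hand-written recursion, not the method used in the slides; the stand-in file, the `describe` function and the `report` dictionary are all made up for illustration:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in file with a small group hierarchy.
path = os.path.join(tempfile.mkdtemp(), 'tree.h5')
with h5py.File(path, 'w') as f:
    # Intermediate groups (/DATA, /DATA/Long, ...) are created automatically.
    f.create_dataset('DATA/Long/Gain', data=np.ones(4))
    f.create_dataset('DATA/Short/Gain', data=np.ones(4))

report = {}

def describe(name, obj):
    # name is the path relative to the starting group; obj is a Group or Dataset.
    if isinstance(obj, h5py.Group):
        report[obj.name] = len(obj)    # number of children
    else:
        report[obj.name] = None        # not a group

with h5py.File(path, 'r') as f:
    f.visititems(describe)   # visits every object below the root, not the root itself

for name in sorted(report):
    print(name, report[name])
```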
22. The Result….
ssai-s01033@dkahn: ~/python % ./print_num_children.py
/ Number of Children: 1
/DATA Number of Children: 10
/DATA/AutoSplitLong Not a group
/DATA/AutoSplitShort Not a group
/DATA/AuxiliaryData Number of Children: 6
/DATA/AuxiliaryData/FeatureNames Not a group
/DATA/AuxiliaryData/InputSpecification Not a group
/DATA/AuxiliaryData/LongLowEndSaturationEstimate Not a group
/DATA/AuxiliaryData/ShortLowEndSaturationEstimate Not a group
/DATA/AuxiliaryData/Timings Number of Children: 2
/DATA/AuxiliaryData/Timings/Long Not a group
/DATA/AuxiliaryData/Timings/Short Not a group
/DATA/AuxiliaryData/dummy Not a group
/DATA/Long Number of Children: 14
/DATA/Long/BadPixelTable Not a group
/DATA/Long/BinTransitionTable Not a group
/DATA/Long/FeatureNamesIndexes Not a group
/DATA/Long/Gain Not a group
/DATA/Long/InverseOMPSColumns Not a group
23. Summary
Python with the H5py and Numpy modules makes developing
programs to manipulate HDF5 files and perform calculations
with HDF5 arrays simpler, which increases development
speed and reduces errors.