An update on HDF, including recent changes to the software, new releases, THG collaborations, and future plans. The session includes an overview of the HDF4.2r2, HDF5 1.6.6, and HDF5 1.8.0 releases, as well as updates on completed and ongoing THG projects, including crash-proofing HDF5, efficient appends to HDF5 datasets, and indexing in HDF5.
4. THG, the Company
• Spun off from University of Illinois July 2006
• Non-profit
• 20+ scientific, technology, professional staff
• Intellectual property:
− THG owns HDF4 and HDF5
− HDF formats and libraries to remain open
− Libraries have BSD-type license
• Continue ties to U of I and NCSA
02/18/14
The HDF Group
4
5. The mission of The HDF Group
is to ensure long-term
accessibility of HDF data through
sustainable development and
support of HDF technologies.
6. Goals
• Maintain, evolve HDF for sponsors and
communities that depend on it
• Do consulting, training, tuning, development,
research
• Sustain The HDF Group for long term to assure
data access over time
7. THG Services
• Helpdesk and Mailing Lists
− Available to all users as a first level of support
• Standard Support
− Rapid issue resolution support
• Consulting
− Needs assessment, troubleshooting, design reviews, etc.
• Enterprise Support
− Coordinating HDF activities across divisions
• Special Projects
− Adapting customer applications to HDF
− New features and tools, with changes normally incorporated into
open source product
− Research and Development
• Training
− Tutorials and hands-on practical experience
11. New features and changes
• New APIs added to the SD and GR interfaces:
− SDreset_maxopenfiles, SDget_maxopenfiles: Modify and report the
maximum allowable number of open files
− SDget_numopenfiles: Gets number of open files
− SDgetcompinfo, GRgetcompinfo: Get compression info
− SDgetfilename: Retrieves name of file, given its ID
− SDgetnamelen: Retrieves length of object name, given its ID
• SZIP compression
− Now can be invoked by Fortran API
− Now available for raster images via GR interface
• SDS, Vgroup names no longer limited to 64 characters
12. New features and changes
• HDF configuration changes
− --enable-netcdf flag introduced
− Autotools versions updated
• Many bug fixes made to hrepack and hdiff
• See RELEASE.txt for a full list of changes
13. Platforms to drop/add next release
• Drop
− Windows XP with MSVC++ 6.0
− Linux 2.4
− IRIX64 6.5
− SunOS 5.8, 5.9
• Add
− Windows 64-bit (32 and 64-bit binaries)
14. Platforms tested
• Systems
− AIX 5.3 (32-bit, 64-bit)
− FreeBSD 6.2 (32-bit, 64-bit)*
− HP-UX B.11.23 (32-bit, 64-bit)*
− IRIX64 v6.5 (32-bit, 64-bit)
− Linux 2.4, 2.6*
− Linux ia64
− Linux x86_64
− SunOS 5.8, 5.10* (32-bit, 64-bit)
− SunOS 5.10 on Intel
− Windows XP, Vista
− Mac OS X Intel*
• Compilers
− IBM C and Fortran compilers
− GNU gcc 3.4* and GNU Fortran
− HPUX C and Fortran compilers
− GNU gcc 3.4 and 4.*
− Intel C and Fortran versions 9.1 and 10.00
− SUN WorkShop C and Fortran
− Visual Studio .NET and 2005 and Intel Fortran
− Visual Studio 2005 (no Fortran)
− GNU gcc 4.0.1 with gfortran and g95
* New platforms
For detailed info, see RELEASE.txt
17. HDF5 1.6.6 release
• Primarily a bug-fix release
• Some tool changes (see later slide)
• http://hdfgroup.org/HDF5/release/obtain5.html
18. Platforms dropped
• Operating systems
− AIX 5.3
− Solaris 2.8 and 2.9
− OSF1
− Windows XP with MSVC++ 6.0
• Compilers
− PGI 6.5-*
http://www.hdfgroup.org/HDF5/release/alpha/obtain518.html
19. Platforms added
• Systems
− Alpha Open VMS
− Mac OS X 10.4 (Intel)
− Solaris 2.* on Intel
− Cray XT3
− Windows 64-bit (32 and 64-bit)
− BG/L
• Compilers
− PGI V. 7.*
− Intel 10.*
− MPICH 1.2.7
− MPICH2
21. HDF5 1.8 new library features
• Datatype and dataspace features
− Create datatype from text description
− Integer to float conversions during I/O
− Compact storage for N-bit datatypes
− Offset+size storage filter, saving space
− “Null” dataspace – datasets with no elements
− Data transformation filter
22. HDF5 1.8 – new library features
• Group improvements
− Creation order access
− Compact groups – small groups take less space
− Large group storage improvements
− Intermediate group creation
• Link improvements
− Unicode names allowed
− External links – to objects in another file
− User defined links – create own kinds of links
23. HDF5 1.8 – new library features
• Attribute improvements
− Improved storage for large number of attributes
− Iterate or look up by creation order
− Unicode names allowed
• Support for Unicode UTF-8 character set
• Shared header information, possibly saving space
• Metadata cache improvements – faster I/O on
files with many objects
• Better UNIX/Linux portability
24. HDF5 1.8 – new APIs
• New extendible error-handling API
• New APIs to copy objects between files quickly
• Dimension scale model and API
• “HDFpacket” API, to read/write packets efficiently
25. HDF5 1.8 – Backward and
Forward Compatibility
26. HDF5 1.8 and 1.6
• Differences between 1.8 and 1.6.x
− Some file format changes
− Several new routines added
− Old APIs deprecated – may be removed in later
release
• Consequences
− Applications requiring 1.8 format changes will
generate objects that cannot be read by 1.6 library
− To exploit 1.8 changes, applications need to be
rewritten
27. “The art of progress is to
preserve order amid change, and
to preserve change amid order.”
Alfred North Whitehead
28. Principle of
Maximum File Format Compatibility
Unless instructed otherwise, the HDF5 library will write objects
using the earliest version of the format possible for describing
the information.
Assures older library versions are forward compatible whenever
possible:
− Objects in new files can be read with old versions of the library,
if the objects are “known” to the old libraries.
− New versions of the library can always read objects in files
written with older versions.
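The principle above can be pictured with a small model. This is only a sketch under stated assumptions, not how the HDF5 library is implemented: the `feature_t` record, both function names, and the version numbers used with them are hypothetical. Each feature records the earliest format version able to describe it, and an object is written with the lowest version that covers every feature it uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: a feature and the earliest file-format version
   that can describe it. The version numbers are illustrative only. */
typedef struct {
    const char *name;
    int min_format_version;
} feature_t;

/* Earliest format version able to describe every feature the object uses:
   the largest of the per-feature minimums, never below base_version. */
int earliest_usable_version(const feature_t *used, size_t n, int base_version)
{
    int version = base_version;

    assert(used != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (used[i].min_format_version > version)
            version = used[i].min_format_version;
    }
    return version;
}

/* A library can read an object only if it knows the object's version. */
int readable_by(int library_max_version, int object_version)
{
    return object_version <= library_max_version;
}
```

In this model, an object that uses only old features keeps the old version and stays readable by an older library, while an object that uses one new feature is bumped to the newer version and is not.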
30. New features for existing tools
• -V option for all tools
− Prints HDF5 library version number used by tool
• h5repack: -L option
− Use latest version of file format to create objects
• h5dump: dumps groups/attributes in creation or
name order
− -q Q, --sort_by=Q Sort groups and attributes by index Q
− -z Z, --sort_order=Z Sort groups and attributes by order Z
31. New command line tools
• h5mkgrp
− Creates new groups and group hierarchies in an HDF5 file
• h5stat
− Provides statistics regarding the file, such as number of
objects per group, sizes of datasets, amount of free space in
file
• h5copy
− Copies objects within a file or across files
• h5check
− Verifies an HDF5 file against the defined HDF5 File Format
Specification
− Completed for 1.6.
− In progress for 1.8
32. Tool work in the pipeline
• Export numeric data formatted in several different
ways (such as MS Excel, XML, etc.)
• Import ASCII data that conforms to a certain format
• Use a common text format for h5import and
h5dump
• Support NaN in tools such as h5diff.
Challenges:
− NaN is platform specific
− NaN can have different values for the same
machine
− Checking NaN can be a performance hit
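The NaN challenges listed above can be illustrated with a short sketch. It assumes IEEE 754 64-bit doubles, and the helper names are hypothetical; this is not how h5diff handles NaN. Two NaNs can carry different bit patterns, so a bitwise comparison reports a difference, and the extra `isnan` checks are the per-element cost mentioned above.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Build a double from a raw 64-bit pattern (illustration only). */
double double_from_bits(uint64_t bits)
{
    double d;

    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Bitwise comparison: two NaNs with different payloads look different. */
int bits_equal(double a, double b)
{
    return memcmp(&a, &b, sizeof a) == 0;
}

/* NaN-aware comparison: any NaN matches any NaN; everything else uses
   ordinary equality. */
int values_match(double a, double b)
{
    if (isnan(a) || isnan(b))
        return isnan(a) && isnan(b);
    return a == b;
}
```

Whether two NaNs should count as equal at all is a design choice; the sketch treats them as equal so that identical missing-value markers do not show up as differences.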
34. HDF5 Java is Growing UP
35. HDFView changes
• HDFView 2.4 released
• Many new features, such as
− Support for compound datatypes of 2D+ arrays
− Support for "filtering fill value" in Image Viewer
− Effective handling of large 3D images
− Support large fonts in GUI components
− New autogain algorithm for image Brightness/Contrast
• New platforms
− Mac Intel
− Linux 64-bit AMD
− Solaris 64-bit
36. Other Java products
• 36 new enhancements and 44 bugs fixed
• Test suite (using junit testing framework)
− Tests all public methods in the object package
− Added “make check” to run the test suite
• Enhanced documentation
− All public methods in the object package are fully
documented
37. Future work for Java
• Update HDF5 JNI APIs for HDF5 1.8 release
• Release HDFView with bug fixes/new features
with HDF5 1.8 release
• Port HDF5-SRB model to HDF5-iRODS model
• Writing capability for HDF5-iRODS model
42. Goals
• A framework for performance regression testing
• A tool for
− Testing on multiple platforms
− Testing different versions
− Long term regression testing
− Assistance in debugging
43. Solution
[Diagram: components shown are HDF5 1.6, HDF5 1.8, cron, a user’s benchmark, the Performance Library, a database, a PHP web server (www), and graph/text output]
44. Sample Usage
H5Perf_startTimer(&time);
for (i = 0; i < 1000; i++) {
    // Add groups
    H5Gcreate(fileid, group_name, (size_t)0);
}
H5Perf_endTimer(&time);
H5Perf_addInstance(db_host, date, time);
Cron entry:
00 21 * * * /home/local/hyoklee/src/chicago/test-perf-hdfdap-3.sh
Database record:
| 178820 | 2007-08-17 21:51:14 | 10000 groups | creating 10000 empty groups | 1.8.0 | hdfdap | 0.670198 | 4384 |
Labeled fields: Timestamp (2007-08-17 21:51:14), Instance Name (10000 groups), Version (1.8.0), Platform (hdfdap), Time (0.670198)
46. Crash Survivability in HDF5
• Problem:
− Data in HDF5 files susceptible to corruption in the
event of an application or system crash.
− Corruption possible if structural metadata is being
written when the crash occurs.
• Initial Objective:
− Guarantee an HDF5 file with consistent metadata
can be reconstructed in the event of a crash.
− No guarantee on state of raw data – contains
whatever made it to disk prior to crash.
47. Crash Survivability in HDF5
• Approach: Metadata Journaling
− When a piece of metadata is modified and in a
consistent state, make a journal note.
− If the application crashes, a recovery program can
replay the journal by applying in order all metadata
writes until the end of the last completed
transaction written to the journal file.
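The replay step described above can be sketched with a toy journal. This is a minimal illustration, not the HDF5 journal format or recovery tool: the record layout, `NUM_BLOCKS`, and `replay_journal` are hypothetical, and the sketch assumes transactions are written one after another (never interleaved), so every write before the last commit marker belongs to a completed transaction.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BLOCKS 8  /* size of the toy metadata store */

typedef enum { J_WRITE, J_COMMIT } entry_kind_t;

/* One journal record: a metadata write, or a marker that ends a transaction. */
typedef struct {
    entry_kind_t kind;
    int txn;    /* transaction id */
    int block;  /* metadata block index (J_WRITE only) */
    int value;  /* new block contents (J_WRITE only) */
} journal_entry_t;

/* Replay the journal into store: apply, in order, every write up to the
   last commit marker and discard anything after it.
   Returns the number of writes applied. */
size_t replay_journal(const journal_entry_t *journal, size_t n,
                      int store[NUM_BLOCKS])
{
    size_t end = 0;      /* one past the last commit marker */
    size_t applied = 0;

    for (size_t i = 0; i < n; i++) {
        if (journal[i].kind == J_COMMIT)
            end = i + 1;
    }
    for (size_t i = 0; i < end; i++) {
        if (journal[i].kind == J_WRITE) {
            assert(journal[i].block >= 0 && journal[i].block < NUM_BLOCKS);
            store[journal[i].block] = journal[i].value;
            applied++;
        }
    }
    return applied;
}
```

A journal that ends with a write but no commit marker models a crash in the middle of a transaction: that trailing write is ignored, so the store reflects only completed transactions.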
49. Fast Data Appends
• Problem: Metadata operations limit the rate at
which HDF5 can append data to datasets.
• Solution: new data structure for indexing chunks:
− Allows constant time extend, shrink and lookup of
chunks in datasets with single unlimited dimension
− # of metadata I/O operations to append to dataset
is independent of # of chunks
− Allows single-writer/multiple-reader access
• Details at:
http://www.hdfgroup.uiuc.edu/RFC/HDF5/SkipListChunkIndex/SkipListChunkIndex.html
51. netCDF-4 Project
• Enhanced NetCDF-4 Interface to HDF5
− Combine features of netCDF and HDF5
− Take advantage of their separate strengths
• Collaboration between NCSA, THG, Unidata
• Currently in beta release
• Will be released after HDF5 1.8
54. Project description
• Investigate integrated DAP-aware HDF5 library
that can provide seamless access to both
local and remote data
• A NASA ROSES NRA project
• See Kent Yang’s talk and poster
55. NOAA – Science Data
Stewardship
56. NOAA – Science Data Stewardship
• Use HDF5 Archival Information Package (AIP) to
archive HDF EOS2 data
• A collaboration between NSIDC and THG
• See Ruth Duerr and Kent Yang’s poster
58. Why .NET?
• The Microsoft .NET framework is used by most
new applications created for Windows.
− Makes it easier to develop applications
− Reduces application vulnerability to security threats
• Supports development in multiple programming
languages, in particular C#.
• Increased level of interest in .NET from users of
HDF5.
59. HDF and .NET Status
• Received funding to implement prototype .NET
wrapper API for Windows XP
− Based on HDF5 C API
− Focus on C# binding
− Functionality limited to subset of API routines
• If funded, we would like to move beyond the
prototype to
− Create .NET wrappers for all HDF C functions
− Offer full support for .NET wrappers with HDF5 1.8
62. Sequencing
• Next Gen Sequencing platforms produce ~1500 X more data than
CE (Sanger)
• A single Next Gen instrument can produce 20 times more data in a
single run than a day’s operation of a genome center with 100 CE
instruments
63. An email on Sept 21…
“… A little background, we're doing genetic
association studies, these result in large 2-d matrices
(40K x 1M before applying threshholds). Each of
the cells in this matrix has ~10 numerical
statistics (e.g. some sort of pvalue)… ”
40K x 1M x 10 x 4 = 1,600,000,000,000 (1.6 TB)
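The estimate above can be checked with a few lines of C. This is a sketch that reads the final factor of 4 as bytes per value, which is an assumption because the slide does not label it; the function names are hypothetical. The arithmetic must be done in 64 bits, since the result is far beyond the range of a 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two sizes, asserting that the product fits in 64 bits. */
static uint64_t mul_checked(uint64_t a, uint64_t b)
{
    assert(b == 0 || a <= UINT64_MAX / b);
    return a * b;
}

/* Bytes needed for a dense rows x cols matrix holding `stats` values per
   cell, each `bytes_per_value` bytes wide. */
uint64_t matrix_bytes(uint64_t rows, uint64_t cols, uint64_t stats,
                      uint64_t bytes_per_value)
{
    return mul_checked(mul_checked(mul_checked(rows, cols), stats),
                       bytes_per_value);
}

/* Convert a byte count to decimal terabytes (1 TB = 10^12 bytes). */
double bytes_to_decimal_tb(uint64_t bytes)
{
    return (double)bytes / 1e12;
}
```

Calling `matrix_bytes(40000, 1000000, 10, 4)` reproduces the 1,600,000,000,000 bytes on the slide, and `bytes_to_decimal_tb` turns that into 1.6 TB.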
65. Product data
• HDF5 proposed to ISO as binary representation
for product data representation and exchange
• Would be a binary option to the STEP format
• ISO/NWI-CD 10303-026, STEP Part 26
67. SQL Server and HDF5
• THG discussing possible project with Microsoft
• Microsoft envisions a dream environment for
scientists that would encompass both computing
and data management
• Possible SQL Server solution
− Combine RDBMS and scientific analysis tools in a
single integrated system
− Use HDF5 to manage scientific objects not handled
well by traditional database
68. HDF5 in SQL server
[Diagram: layered architecture. Top layer: visualization, libraries (MATLAB, …), web services (XML, REST, RSS), OLAP and data mining, reporting. Middle layers: .NET languages with Language Integrated Query; Entity Framework (EDM, eSQL, O-R mapping); HDF5 EDM model. Bottom layer: SQL Server, with components labeled HDF5 TVFs, HDF5 index, HDF5 type, HDF5 files, and HDF5 FS blob]
70. Acknowledgement
This report is based upon work supported in part by a
Cooperative Agreement with NASA under NASA
NNG05GC60A. Any opinions, findings, and conclusions
or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the
views of the National Aeronautics and Space
Administration.
72. Information Sources
• HDF website
http://hdfgroup.org/
• HDF5 Information Center
http://hdfgroup.org/HDF5/
• HDF Helpdesk
hdfhelp@hdfgroup.org
• HDF users mailing list
hdfnews@ncsa.uiuc.edu
coming soon: news@hdfgroup.org
Editor's notes
Why
Increasing need for support, services, quick response
Not a good model for a University R&D project
Who
11 software engineers and several students: develop, maintain HDF software, work on special projects, manage projects
3 tech support staff: helpdesk, doc, sysadmin.
Management team
President
Director of Technical Services and Operations
Director of Software Development
Director of Business Operations
Managers responsible for tools, applications
Other THG staff include seven full-time software engineers who develop and maintain the HDF software, as well as working on special projects, and three technical support staff who provide helpdesk support, documentation, and system administration. The HDF group also generally employs students from the University Computer Science and Engineering departments.
The R&D mission
Maintain and evolve HDF for high end science apps
Maintain HDF4 and HDF5 and tools at supercomputing centers, TeraGrid
Support academic science
Cutting edge data management research
Adapt to leading edge, experimental architectures
Integrate with new middleware technologies, parallel file systems
The “Support and Sustain” mission
Maintain, evolve for communities, sponsors
Provide proprietary consulting, tuning, development
Sustain for long term, maintain data access over time
I get all mixed up with the terms backward and forward compatibility. I did a little investigation of the definitions and their use when talking with Frank about his compatibility matrix a while back, and I still don’t have a good grasp of what is meant; my conclusion was that there is no consistent use. It seems most, like MathWorks, use “compatibility” without the forward/backward words. I made a change here. Is this what you meant in the original?
And I don’t know if it’s worth saying, but: new versions can always read objects in files written with older versions (unless there’s a bug in the writer!). Then we’ll offer the best solution we can.
Maybe Objective bullets do belong on later slide… not sure.
Is it only limited for unlimited / chunked datasets? Or is it that way for all but we’re just fixing it for limited / unchunked cases?
Contrasts with B-tree index:
- B-tree has O(log n) extend, shrink and lookup of chunks
- B-tree has ~logarithmic # of metadata I/O operations as chunks appended
Will be optimizing chunked dataset indexing for datasets with no unlimited dimensions (with array index) and multiple unlimited dimensions (with v2 B-tree) as part of project in the next year also.
I’ve changed this considerably. I don’t think it’s necessary to say who has funded work to date, exactly what that entails, or that the prototype is available. The important message (to me) is that we have experience and interest in this area, and are willing to do more if it’s funded. If not, then that’s the end of the story.
First bullet – let them know it may or may not happen… not a done deal
Not sure I got the “translation” from first version of text to this one right…
Dropped “& other formats” (let them give those presentations)