
Parallel HDF5

Why must attributes be written collectively?

Attributes in general are meant to be very small. Attributes (both the attribute information and the data it holds) are considered to be metadata on an object. Because of this, they are held in the metadata cache. The HDF5 library requires that all metadata updates be done collectively so that all processes see the same stream of metadata updates. This is how HDF5 was designed. Relaxing the collective requirement for metadata updates has been discussed previously, and we know that it would be worth having for certain scenarios, but it is not possible at the moment without a lot of research and funding.

Attribute data is treated as metadata because it is perceived as something that is present on all processes and not really generated by one process and sent to other processes. An example would be a label indicating that a given dataset is stored as of timestep 1 or at a given setting.

If you want to avoid sending the attribute's data to all processes, you may wish to consider using a dataset instead. Like attributes, datasets can be created with any dimensions, and can even be created with a scalar dataspace to hold a single element (see H5Screate).

How to improve performance with Parallel HDF5

Tuning parallel HDF5 for a specific application on a specific system requires experimenting with many tunable parameters, many of which are specific to certain platforms. Not all hints are applicable to all platforms, and some hints may be ignored even if they can be applied. The best practice is to consult each system's documentation on how to tune I/O parameters. For example, Hopper, a Cray XE6 supercomputer at NERSC, has a webpage specifically on how to tune parallel I/O parameters for specific file systems.


Here are some general parameters that users should consider tuning when they see slow I/O performance from HDF5:

HDF5 parameters:

  1. Chunk size and dimensions: If the application is using chunked dataset storage, performance usually varies depending on the chunk size and how the chunks are aligned with block boundaries of the underlying parallel filesystem. Extra care must be taken on how the application is accessing the data to be able to set the chunk dimensions.

  2. Metadata cache: It is usually a good idea to increase the metadata cache size if possible to avoid small writes to the file system. See H5Pset_mdc_config.


  3. Alignment properties: For MPI-IO and other parallel systems, choose an alignment which is a multiple of the disk block size. See H5Pset_alignment.
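
The arithmetic behind item 3 can be sketched in plain C. This is an illustration only: round_up_to_block and is_block_aligned are hypothetical helpers, and the block size used with them is whatever your file system reports, not a value suggested here. The rounded result is the kind of number one would pass as the alignment argument of H5Pset_alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: smallest multiple of block_size that is >= nbytes.
 * block_size must be non-zero. */
static uint64_t round_up_to_block(uint64_t nbytes, uint64_t block_size)
{
    assert(block_size > 0);
    uint64_t remainder = nbytes % block_size;
    return remainder == 0 ? nbytes : nbytes + (block_size - remainder);
}

/* Hypothetical helper: 1 when offset falls exactly on a block boundary,
 * 0 otherwise. */
static int is_block_aligned(uint64_t offset, uint64_t block_size)
{
    assert(block_size > 0);
    return offset % block_size == 0;
}
```

For instance, with an assumed 4096-byte block, a requested alignment of 1000000 bytes would be rounded up to 1003520 bytes, which is 245 whole blocks.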


MPI-IO parameters:

There are several MPI-I/O parameters to tune. Usually this is done by setting info keys in the info object passed to HDF5. Some implementations might allow other ways to pass those hints to the MPI library. The MPI standard reserves some key values. An implementation is not required to interpret these key values, but if it does interpret a key value, it must provide the functionality described. Again, the best thing to do is to consult the documentation for the specific MPI implementation and system in use to see what parameters are available to tune. For example, ROMIO in MPICH provides a user guide with a section describing the hints that are available to tune.


Here are some general parameters that are usually tunable:

  1. cb_block_size (integer): This hint specifies the block size to be used for collective buffering file access. Target nodes access data in chunks of this size. The chunks are distributed among target nodes in a round-robin (CYCLIC) pattern.

  2. cb_buffer_size (integer): This hint specifies the total buffer space that can be used for collective buffering on each target node, usually a multiple of cb_block_size.

  3. cb_nodes (integer): This hint specifies the number of target nodes to be used for collective buffering.
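
The round-robin distribution described for cb_block_size and cb_nodes can be modelled with a few lines of C. This is a simplified sketch of the arithmetic the hint descriptions imply, not ROMIO's actual code; target_node_for_offset is a hypothetical name, and real implementations may number or place aggregators differently.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the round-robin (CYCLIC) pattern: block k of the
 * file is handled by target node k % cb_nodes. Hypothetical helper,
 * not part of any MPI library. */
static int target_node_for_offset(uint64_t offset, uint64_t cb_block_size,
                                  int cb_nodes)
{
    assert(cb_block_size > 0 && cb_nodes > 0);
    uint64_t block_index = offset / cb_block_size;
    return (int)(block_index % (uint64_t)cb_nodes);
}
```

Under this model, with a 1 MiB block size and 4 target nodes, byte offsets in the first mebibyte map to node 0, the second mebibyte to node 1, and the fifth mebibyte wraps around to node 0 again.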

MPI implementations other than ROMIO might provide a way to tune those parameters, but not necessarily through info hints. OMPIO (an Open MPI native MPI-IO implementation) for example uses OMPI MCA parameters to tune those hints.

Parallel File System parameters:

Depending on the parallel file system and what version it is, there are several ways to tune performance. It is very hard to come up with a general list of tunable parameters for all file systems, since there are not many common ones. Users should individually check the documentation for the particular file system they are using.

For most parallel file systems the two parameters that are usually tunable and very important to consider are:

  1. Stripe size: Controls the striping unit (in bytes).
  2. Stripe count: Controls the number of I/O devices to stripe across.
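
To see how these two parameters interact with the size and position of a request, here is a small sketch. It assumes plain round-robin striping, where stripe unit k lives on device k % stripe_count; actual file systems may lay data out differently, and devices_touched is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct I/O devices a contiguous request touches, assuming
 * round-robin striping (an assumption, not a property of every file
 * system). Hypothetical helper. */
static uint64_t devices_touched(uint64_t offset, uint64_t length,
                                uint64_t stripe_size, uint64_t stripe_count)
{
    assert(stripe_size > 0 && stripe_count > 0);
    if (length == 0)
        return 0;
    uint64_t first_unit = offset / stripe_size;
    uint64_t last_unit = (offset + length - 1) / stripe_size;
    uint64_t units = last_unit - first_unit + 1;
    /* Consecutive units land on consecutive devices until they wrap. */
    return units < stripe_count ? units : stripe_count;
}
```

With a 1 MiB stripe size, a 1 MiB write starting at offset 0 touches one device, while the same write starting at offset 512 KiB straddles a stripe boundary and touches two. This is one reason aligned accesses tend to behave better.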

For Blue Gene/P and Blue Gene/Q, one can set the environment variable BGLOCKLESSMPIO_F_TYPE to 0x47504653 (the GPFS file system magic number). ROMIO will then pretend GPFS is like PVFS and not issue any fcntl() lock commands.

Some IBM specific hints:


Some Cray specific hints:


GPFS Optimizations

If encountering issues with performance on GPFS, there are two parameters that you can tune:

  1. The "bglockless" prefix: https://press3.mcs.anl.gov/romio/2013/08/05/bglockless/

    With MPICH-3.1.1 and beyond this is no longer a problem: http://press3.mcs.anl.gov/romio/2014/06/05/new-romio-optimizations-for-blue-gene-q/

  2. "bg_nodes_pset" controls the number of aggregators. (See How to pass hints to MPI from HDF5.)

How to pass hints to MPI from HDF5

To set hints for MPI using HDF5, see: https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetFaplMpio

You use the 'info' parameter to pass these kinds of low-level MPI-IO tuning tweaks. In C, the calls are like this:

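   MPI_Info info;
   hid_t plist_id, file_id;

   /* the info object must be created before any hints are set on it */
   MPI_Info_create(&info);

   /* strange thing about MPI hints: the key and value are strings */
   MPI_Info_set(info, "bg_nodes_pset", "1");

   /* create a file access property list and attach the communicator and hints */
   plist_id = H5Pcreate(H5P_FILE_ACCESS);
   H5Pset_fapl_mpio(plist_id, MPI_COMM_WORLD, info);

   /* and now pass plist_id to H5Fopen or H5Fcreate */
   file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);

   /* HDF5 keeps its own copy of the info object, so it can be freed here */
   MPI_Info_free(&info);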

How do you set up HDF5 so that only one MPI process (rank 0) does I/O?

Several scientific HDF5 applications use this approach, and we know it works very well. You should use the sequential HDF5 library.

Pros: one HDF5 file
Cons: probably a lot of communication, since all data has to pass through rank 0.

How would you create separate files for each compute node in a cluster using HDF5?

  1. If you use the Parallel HDF5 library, open a communicator for each node and add just one process to that communicator.

  2. You may also try to use the sequential library, but then you have to make sure that only a particular process creates/modifies/closes the file that is written to. This approach has not been tested.

    Pros: should be faster than 1)
    Cons: multiple files to handle, not tested

    Also there can be other approaches if you decide to go with multiple files. For example, each process writes a flat binary file and one can use the HDF5 external datasets storage feature to create a wrapper HDF5 file to combine all data. See the section on Dataset Creation Property Lists under Property Lists in the HDF5 Tutorial.

Performance: Parallel I/O with Chunking Storage

With chunked storage, HDF5 has to map the coordinates of the data from the file space to the memory space. If the memory space and the file space have the same shape, HDF5 can optimize the mapping process significantly. Otherwise, a general mapping routine maps the coordinates one by one. We recommend that applications use the same shape for both the memory space and the file space.

For example, the following case may cause bad performance:

If you change it as follows:

Then, the performance may be much improved.
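
To give a feel for what mapping coordinates one by one involves, the sketch below converts between a coordinate and a row-major offset for an arbitrary shape. It is an illustration of the idea only, not HDF5's internal mapping routine, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major offset of the element at coord[] in an array of shape dims[]. */
static size_t linear_offset(int rank, const size_t *dims, const size_t *coord)
{
    size_t offset = 0;
    for (int d = 0; d < rank; d++) {
        assert(coord[d] < dims[d]);
        offset = offset * dims[d] + coord[d];
    }
    return offset;
}

/* Inverse operation: recover coord[] from a row-major offset. */
static void coords_from_offset(int rank, const size_t *dims, size_t offset,
                               size_t *coord)
{
    for (int d = rank - 1; d >= 0; d--) {
        coord[d] = offset % dims[d];
        offset /= dims[d];
    }
}
```

When the two shapes are identical, a coordinate in one space is the same coordinate in the other and no translation is needed. When they differ, for example a 3 x 4 file space and a 2 x 6 memory space, each element needs a round trip like the one above: element (1, 2) of the 3 x 4 space has offset 6, which corresponds to element (1, 0) of the 2 x 6 space.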

What if Parallel HDF5 tests fail with a ROMIO error: File locking failed in ADIOI_Set_lock ... ?

This means that ROMIO, the MPI-I/O implementation used in MPICH, Open MPI, and many other MPI implementations, is attempting to use file locking when it is not supported by your file system. To resolve this, first try to rebuild your MPI library with file locking disabled. This is the best way to resolve this error for your HDF5 application or any MPI-I/O application on your file system.

If that is not possible, there is a manual way to do this within your program. Unfortunately, this will require updating your programs and also updating the internal parallel HDF5 tests if you want them to succeed. The following steps are required:

Testing ph5diff ... Expected result differs from actual result

When I run make check, it fails with errors similar to this:

  Testing ph5diff -v -p 0.05 --use-system-epsilon h5diff_basic1.h5 h5dif*FAILED*
  ====Expected result (expect_sorted) differs from actual result (actual_sorted)

What can I do to resolve these errors?

These are not valid errors. The test is comparing saved output in HDF5 with the output from running the test, and the two do not match.

When running the tests, ignore the errors by either specifying "make -i" or setting the HDF5_Make_Ignore environment variable. Also, redirect the output to a file. For example:

  env HDF5_Make_Ignore=yes gmake check >& check.out

Then edit the resulting check.out file and search for:

  *** Error ignored

If the only tests that fail are those that compare saved output with the test output, then your installation should be okay. You can run ph5diff manually from the command line, to be certain it is working properly.

How do you write to a single file in parallel in which different processes write to separate datasets?

All processes have to call H5Dcreate() to create a dataset, even if the dataset will be accessed by one process.

If you want to create a dataset for every process you could do something like this:

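for(i=0 ; i < mpi_size; i++) {
    char dataSetName[256];
    /* every process runs the whole loop, so each create call is collective */
    snprintf(dataSetName, sizeof(dataSetName), "a%d", i + 1);
    printf("Creating dataset %s ... \n", dataSetName);

    dset_id = H5Dcreate2(file_id, dataSetName, H5T_NATIVE_INT, filespace,
                        H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* release the handle (or store it in an array if it is needed for later writes) */
    H5Dclose(dset_id);
}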

This will create n datasets, where n is the number of processes in the communicator.

Error: "The MPI_Comm_free() function was called after MPI_FINALIZE was invoked"

I obtain the following error at the end of my program:

   *** The MPI_Comm_free() function was called after MPI_FINALIZE was invoked.
   *** This is disallowed by the MPI standard. 
   *** Your MPI job will now abort. 

What should I look for to resolve this issue?

Make sure that all open objects are closed before calling MPI_FINALIZE. This includes not just the file, groups, and datasets, but also property lists, dataspaces, etc. If you are sure that all objects are closed and you still get this error, then please send an example program that reproduces the issue to the HDF Helpdesk.

Closing my HDF5 file, I get a segfault with an error "MPI_FILE_SET_SIZE(76): Inconsistent arguments to collective routine"

This indicates that you have created datasets or groups or attributes in the file "uncollectively", meaning either not all processes called the create, or the creation was done with different parameters.

For example, a common mistake is to create a dataset with chunk dimensions (using H5Pset_chunk) that are not the same on all processes. Mistakes like that cause the processes to disagree about the size of the file, so MPI_File_set_size is called with different arguments on different ranks and fails.

Last modified: 14 October 2016