Variable-Length Datatypes in HDF5

Introduction

Variable-length (VL) datatypes offer a great deal of flexibility, but they can be overused or misused. VL datatypes are ideal for capturing the notion that elements in an HDF5 dataset (or attribute) can hold different amounts of information (VL strings are the canonical example), but they have some drawbacks that this document attempts to address.

Background

Because fast random access to dataset elements requires that each element be a fixed size, what is stored for each VL datatype element is a fixed-size record that locates the VL information, not the VL information itself.
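The indirection described above can be illustrated with a small model. This is only a sketch: the 16-byte (offset, length) record and the names DESCRIPTOR, pack_ragged, and read_element are assumptions made for illustration and do not reflect the actual HDF5 file format.

```python
import struct

# Hypothetical fixed-size locator: 8-byte heap offset + 8-byte length.
# This is NOT the HDF5 on-disk layout, only an illustration of the idea.
DESCRIPTOR = struct.Struct("<QQ")


def pack_ragged(items):
    """Return (descriptors, heap) for a list of byte strings."""
    descriptors = bytearray()
    heap = bytearray()
    for item in items:
        # Each element contributes one fixed-size record...
        descriptors += DESCRIPTOR.pack(len(heap), len(item))
        # ...while its variable-length payload goes to a separate area.
        heap += item
    return bytes(descriptors), bytes(heap)


def read_element(descriptors, heap, index):
    """Random access: one fixed-size lookup, then one fetch from the heap."""
    offset, length = DESCRIPTOR.unpack_from(descriptors, index * DESCRIPTOR.size)
    return heap[offset:offset + length]


descriptors, heap = pack_ragged([b"a", b"variable", b"", b"length"])
print(read_element(descriptors, heap, 1))  # b'variable'
```

Because every record has the same size, element i is always found at byte i * DESCRIPTOR.size, no matter how long the payloads are.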

When to use VL datatypes

VL datatypes are designed to allow the amount of data stored in each element of a dataset to vary. The variation can be over time, as new values with different lengths are written to an element. Or it can be over "space" (the dataset's space), with each element in the dataset having the same fundamental type but a different length. "Ragged arrays" are the classic example of elements that vary over the "space" of the dataset. If the elements of a dataset are not going to vary in length over "space" or time, a VL datatype should probably not be used.

Access Time Penalty

Accessing VL information requires reading the element in the file, then using that element's location information to retrieve the VL information itself. In the worst case, this doubles the number of disk accesses required to access the VL information.

However, in order to avoid this extra disk access overhead, the HDF5 library groups VL information together into larger blocks on disk and performs I/O only on those larger blocks. Additionally, these blocks of information are cached in memory as long as possible. For most access patterns, this amortizes the extra disk accesses over enough pieces of VL information to hide the extra overhead involved.
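The effect of grouping can be sketched with a simple counting model. The function name count_vl_reads, the block size of 64 items, and the unbounded cache are all assumptions chosen for illustration. The library's real block sizes and cache policy are not modeled here.

```python
def count_vl_reads(access_order, items_per_block):
    """Count block reads when every block stays cached after its first read.

    Assumes an unbounded cache and a fixed number of VL items per block;
    both are simplifications made for this sketch.
    """
    cached_blocks = set()
    reads = 0
    for item in access_order:
        block = item // items_per_block
        if block not in cached_blocks:
            cached_blocks.add(block)
            reads += 1
    return reads


sequential = range(1000)
print(count_vl_reads(sequential, 1))   # 1000: one read per VL item
print(count_vl_reads(sequential, 64))  # 16: one read per block of 64 items
```

Under these assumptions, a sequential scan of 1000 items needs 16 block reads instead of 1000 separate reads. An access pattern that touches one item per block would see no benefit in this model.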

Storage Space Penalty

Because VL information must be located and retrieved from another location in the file, extra information must be stored in the file to locate each item of VL information (for example, each element in a dataset or each VL field in a compound datatype). Currently, that extra information amounts to 32 bytes per VL item.

With some judicious re-architecting of the library and file format, this could be reduced to 18 bytes per VL item with no loss in functionality or additional time penalties. With some additional effort, the space could perhaps be pushed down as low as 8-10 bytes per VL item with no loss in functionality, but potentially with a small time penalty.
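To get a feel for these per-item figures, the following sketch computes the locator overhead for a hypothetical dataset. The dataset size (one million items) and the average payload (10 bytes) are assumptions chosen for illustration. Only the 32-byte and 18-byte figures come from the text above.

```python
def vl_locator_overhead(num_items, bytes_per_item):
    """Total bytes spent on locating VL items, excluding the payload."""
    return num_items * bytes_per_item


num_items = 1_000_000   # hypothetical dataset size
avg_payload = 10        # hypothetical average payload, in bytes
payload_bytes = num_items * avg_payload

current = vl_locator_overhead(num_items, 32)   # current per-item cost
proposed = vl_locator_overhead(num_items, 18)  # cost after re-architecting

print(current, current / payload_bytes)    # 32000000 3.2
print(proposed, proposed / payload_bytes)  # 18000000 1.8
```

For short payloads like these, the locators take more space than the data itself, which is why VL datatypes are a poor fit when the elements are small or nearly uniform in length.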

Chunking and Filters

Storing data as VL information has some effects on chunked storage and on the filters that can be applied to chunked data. Because the data stored in each chunk is the information needed to locate the VL information, the actual VL information is not broken up into chunks in the same way as other data stored in chunks. Additionally, because the actual VL information is not stored in the chunk, any filter that operates on a chunk will operate on the locating information, not on the VL information itself. A compression filter, for example, compresses the locators and leaves the VL information uncompressed.
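The consequence for filters can be sketched with a toy model in which zlib stands in for a chunk filter. The 16-byte locator layout and the variable names are assumptions made for illustration and are not the HDF5 file format.

```python
import struct
import zlib

# Hypothetical fixed-size locator: 8-byte offset + 8-byte length.
DESCRIPTOR = struct.Struct("<QQ")

# 100 highly compressible VL payloads of 1000 bytes each.
payloads = [b"A" * 1000 for _ in range(100)]
heap = b"".join(payloads)  # stored outside the chunk

# The chunk holds only the locators, one per dataset element.
chunk = b"".join(DESCRIPTOR.pack(i * 1000, 1000) for i in range(100))

# A chunk filter sees the locators only, so only they are compressed.
filtered_chunk = zlib.compress(chunk)

print(len(chunk), len(filtered_chunk))  # locator bytes before and after
print(len(heap))                        # 100000: payload stays unfiltered
```

In this model the payload would compress extremely well, yet the filter never sees it. Only the 1600 bytes of locators pass through zlib.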

File Drivers

Because the parallel I/O file drivers (MPI-I/O and MPI-posix) don't allow objects with varying sizes to be created in the file, attempting to create a dataset or attribute with a VL datatype in a file managed by those drivers will cause the creation call to fail.

Additionally, VL datatypes used with the 'multi' and 'split' file drivers may not behave as desired. The HDF5 library currently categorizes the "blocks of VL information" stored in the file as a type of metadata, which means that they may not be stored with the other raw data for the file.

Rewriting

When VL information in the file is rewritten, the old VL information must be released, space for the new VL information must be allocated, and the new VL information must be written to the file. This may cause additional I/O accesses.
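The release, allocate, and write sequence can be sketched with a toy heap. The class VLHeap and the function rewrite are hypothetical names, and the append-only allocation strategy is an assumption made to keep the sketch short. It does not describe how the library manages space.

```python
class VLHeap:
    """Toy append-only heap that tracks released space (hypothetical model)."""

    def __init__(self):
        self.data = bytearray()
        self.free_bytes = 0

    def allocate(self, payload):
        """Allocate space at the end of the heap and write the payload."""
        offset = len(self.data)
        self.data += payload
        return offset, len(payload)

    def release(self, locator):
        """Mark the space behind a locator as no longer in use."""
        self.free_bytes += locator[1]


def rewrite(heap, locator, new_payload):
    heap.release(locator)               # 1. release the old VL information
    return heap.allocate(new_payload)   # 2. allocate space, 3. write new data


heap = VLHeap()
loc = heap.allocate(b"short")
loc = rewrite(heap, loc, b"a much longer value")
print(loc, heap.free_bytes)  # (5, 19) 5
```

Each rewrite yields a new locator, so the dataset element holding the locator must be updated as well. This is one source of the additional I/O mentioned above.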

Last modified: 9 September 2003