
14.4 Block Size Tuning Issues

A key to I/O performance in HDF is the number of disk accesses that must be made during any I/O operation. If you can significantly reduce the number of disk accesses required, you may be able to improve performance correspondingly. In this section we examine two such strategies for improving HDF I/O performance.

14.4.1 Tuning Data Descriptor Block Size to Enhance Performance

HDF objects are identified in HDF files by 12-byte headers called data descriptors (DDs). Most composite HDF objects, such as SDSs, are made up of many small HDF objects, so it is not unusual to have a large number of DDs in an HDF file. DDs are stored in blocks called data descriptor blocks (DD blocks).

When an HDF file is created, the file's DD block size is specified. The default size is 16 DDs per DD block. When you start putting objects into an HDF file, their DDs are inserted into the first DD block. When the DD block gets filled up, a new DD block is created, stored at some other location in the file, and linked with the previous DD block. If a large number of objects are stored in an HDF file whose DD block size is small, a large number of DD blocks will be needed, and each DD block is likely to be stored on a different disk page.

Consider, for example, an HDF file with 1,000 SDSs and a DD block size of 16. Each SDS could easily require 10 DDs to describe all the objects comprising the SDS, so the entire file might contain 10,000 DDs. This would require 625 (10,000/16) DD blocks, each stored on a different disk page.

Whenever an HDF file is opened, all of the DDs are read into memory. Hence, in our example, 625 disk accesses might be required just to open the file.

Fortunately, there is a way we can use this kind of information to improve performance. When we create an HDF file, we can specify the DD block size. If we know that the file will have many objects stored in it, we should choose a large DD block size so that each disk access will read in a large number of DDs, and hence there will be fewer disk accesses. In our example, we might have chosen the DD block size to be 10,000, resulting in only one disk access. (Of course, this example goes deliberately to a logical extreme. For a variety of reasons, a more common approach would be to set the DD block size to something between 1,000 and 5,000 DDs.)
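In code, the DD block size is given to the low-level H interface when the file is created: it is the third argument to Hopen. The minimal sketch below assumes an illustrative file name and a block size of 2,048 DDs; the file can afterwards be populated through the higher-level interfaces.

    #include "hdf.h"

    int main(void)
    {
        /* Create the file with 2048 DDs per DD block instead of the
         * default 16, so the descriptors of a file holding many objects
         * can be read into memory with far fewer disk accesses.
         * The file name and the value 2048 are illustrative only. */
        int32 file_id = Hopen("many_objects.hdf", DFACC_CREATE, 2048);
        if (file_id == FAIL)
            return 1;

        /* ... add objects through the SD, Vdata, or other interfaces ... */

        Hclose(file_id);
        return 0;
    }

The third argument to Hopen matters only when the file is created; opening an existing file does not change its DD block size.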

From this discussion we can derive a simple rule of thumb for achieving good performance by altering the DD block size: if you know a file will contain many objects, specify a large DD block size when the file is created, so that all of the DDs can be read with only a few disk accesses; if the file will contain only a few objects, the default block size of 16 DDs is usually sufficient.

14.4.2 Tuning Linked Block Size to Enhance Performance

Linked blocks are created whenever compression, chunking, external files, or appendable datasets are used. They provide a means of linking new data blocks to a pre-existing data element. If you have ever examined an HDF file and seen Special Scientific Data or Linked Block Indicator tags with strange tag values, these tags are the mechanism used to specify linked blocks. As with DD blocks, the linked block size can affect both storage efficiency and I/O performance.

You can change the linked block size for SDSs by use of the function SDsetblocksize. To change the linked block size for Vdatas, you must edit the hlimits.h file, change the value of HDF_APPENDABLE_BLOCK_LEN, and re-build the HDF library. Changing the linked block size only affects the size of the linked blocks used after the change is made; it does not affect the size of blocks that have already been written.
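As a sketch of how this might look for an SDS (the file name, data set name, and 65,536-byte block size are illustrative assumptions), SDsetblocksize is called on the SD interface identifier before the data set is created and written; the unlimited dimension makes the data set appendable, so its data is stored in linked blocks.

    #include "mfhdf.h"

    int main(void)
    {
        int32   sd_id, sds_id;
        int32   dims[1]  = {SD_UNLIMITED};  /* appendable => linked blocks */
        int32   start[1] = {0};
        int32   edges[1] = {1000};
        float32 data[1000];
        int     i;

        for (i = 0; i < 1000; i++)
            data[i] = (float32)i;

        /* Create the file through the SD interface. */
        sd_id = SDstart("appendable.hdf", DFACC_CREATE);
        if (sd_id == FAIL)
            return 1;

        /* Request 65536-byte linked blocks for data written after this
         * call; the value is illustrative and should be tuned to the
         * typical access size on the target platform. */
        SDsetblocksize(sd_id, 65536);

        /* The appendable (unlimited-dimension) SDS below is stored in
         * linked blocks, so the requested block size applies to it. */
        sds_id = SDcreate(sd_id, "appendable_data", DFNT_FLOAT32, 1, dims);
        SDwritedata(sds_id, start, NULL, edges, (VOIDP)data);

        SDendaccess(sds_id);
        SDend(sd_id);
        return 0;
    }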

There is a certain amount of overhead associated with each linked block: every linked block added to a file costs additional block accesses, disk space, and reference numbers. Increasing the linked block size reduces the number of blocks needed and, with it, the number of block accesses, the storage overhead, and the number of reference numbers consumed. Conserving reference numbers is sometimes important in its own right, because the number of reference numbers available in a file is limited.

Linked block size can also affect I/O performance, depending on how the data is accessed. If the data will typically be read or written in large pieces, a large linked block size could improve performance. If the data will be accessed in small pieces, particularly small amounts read at random, a small linked block size is usually the better choice.

Ideally, the linked block size would equal the amount of data typically accessed in one operation. In practice, however, other factors also affect performance, such as the operating system being used, the sector size of the disk being accessed, the amount of memory available, and the access patterns.

To summarize the rules of thumb for specifying linked block size: larger blocks reduce per-block overhead and conserve reference numbers; blocks roughly matching the amount of data typically accessed in one operation tend to perform best; and small, randomly accessed reads favor smaller blocks.

Unfortunately, block size interacts with so many factors that there is no simple formula for deciding what the linked block size should be. A little experimentation on the target platform can help a great deal in determining the ideal block size for your situation.



