If correctly applied, data chunking can reduce the number of seeks through the SDS data array needed to find the data to be read or written, thereby improving I/O performance. If incorrectly applied, however, chunking can significantly reduce the performance of reading from or writing to an SDS. Knowledge of how chunked SDSs are created and accessed, and application-specific knowledge of how data will be read from them, is necessary to avoid situations where chunking works against the goal of I/O performance optimization.
The following figure illustrates the difference between a non-chunked SDS and a chunked SDS.
Specifically, the issues that affect the process of reading from chunked SDSs are
· Compression
· Subsetting
· Chunk sizing
· Chunk cache sizing
The issues that affect the process of writing to chunked SDSs are
· Compression
· Chunk cache sizing
The main consideration to keep in mind when subsetting from chunked and non-chunked SDSs is that if the subset can be accessed in the same order as it was stored, subsetting will be efficient. If not, subsetting may result in less-than-optimal performance considering the number of elements to be accessed.
To illustrate this, the instance of subsetting in non-chunked SDSs will first be described. Consider the example of a non-chunked, two-dimensional, 2,000 x 1,600 SDS array of integer data. The following figure shows how this array is filled with data in a row-wise fashion. (Each square in the array shown represents 100 x 100 integers.)
FIGURE 13k Number of Seeks Needed to Access a Row of Data in a Non-Chunked SDS
Because the array is stored in row-wise order, reading one complete 1,600-integer row requires only one seek. If, however, the subset to be read is one 2,000-integer column, then 2,000 seeks are required to complete the operation, one per row of the array. This is a highly inefficient way to read the subset, as nearly every array location is passed over while seeking to a relatively small number of target locations.
FIGURE 13l Number of Seeks Needed to Access a Column of Data in a Non-Chunked SDS
Now suppose this SDS is chunked, with a chunk size of 400 x 400 integers, and the same row is read. In this case, four seeks are needed, one for each of the four chunks that contain the target locations. This is less efficient than the single seek needed in the non-chunked SDS.
FIGURE 13m Number of Seeks Needed to Access a Row of Data in a Chunked SDS
To read the aforementioned column of data, five chunks must be read into memory in order to access the 2,000 locations of the subset. Therefore, five seeks, one to the starting location of each of these chunks, are necessary to complete the read operation, far fewer than the 2,000 needed in the non-chunked SDS.
FIGURE 13n Number of Seeks Needed to Access a Column of Data in a Chunked SDS
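The seek counts in these examples follow from simple arithmetic: a subset touches one chunk for every chunk-sized interval it spans in each dimension, and one seek is needed per chunk touched. The following is an illustrative sketch only, not part of the HDF interface; the function name is invented.

```python
def chunks_touched(start, count, chunk):
    """Number of chunks intersected by a subset of a chunked array.

    start, count, chunk are per-dimension tuples: the subset begins at
    `start`, spans `count` elements, and each chunk has edge length `chunk`.
    """
    n = 1
    for s, c, k in zip(start, count, chunk):
        first = s // k              # index of the first chunk touched
        last = (s + c - 1) // k     # index of the last chunk touched
        n *= last - first + 1
    return n

# The 2,000 x 1,600 SDS above, chunked into 400 x 400 chunks:
print(chunks_touched((0, 0), (1, 1600), (400, 400)))  # 4 seeks for a row
print(chunks_touched((0, 0), (2000, 1), (400, 400)))  # 5 seeks for a column
# Non-chunked, row-major storage behaves like 1 x 1,600 "chunks" (rows),
# so reading a column costs one seek per row:
print(chunks_touched((0, 0), (2000, 1), (1, 1600)))   # 2000 seeks
```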
These examples show that, in many cases, chunking can be used to reduce the I/O overhead of subsetting, but in certain cases, chunking can impair I/O performance.

13.3.4 Chunking with Compression
Chunking can be particularly effective when used in conjunction with compression. It allows subsets to be read (or written) without having to uncompress (or compress) the entire array.
FIGURE 13o Compressing and Writing Chunks of Data to a Compressed and Tiled SDS
When it becomes necessary to read a subset of the image data, the application passes in the location of a tile, reads the entire tile into a buffer, and extracts the data of interest from that buffer.
FIGURE 13p Extracting a Subset from a Compressed and Tiled SDS
In a compressed and non-tiled SDS, retrieving a subset of the compressed image data necessitates reading the entire contents of the SDS array into a memory buffer and uncompressing it in-core. (See Figure 13q.) The subset is then extracted from this buffer. (Keep in mind that, even though the illustrations show two-dimensional data tiles for clarity, this process can be extended to data chunks of any number of dimensions.)
FIGURE 13q Extracting a Subset from a Compressed Non-Tiled SDS
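The difference between the two read paths can be simulated in a few lines. This is an illustrative sketch only, not the HDF library's API: it stores a one-dimensional byte array as independently deflated chunks, so a subset read inflates a single chunk, whereas the non-tiled case must inflate the whole array first. The function names are invented.

```python
import zlib

CHUNK = 100
data = bytes(i % 256 for i in range(1000))

# Tiled + compressed: each chunk is deflated on its own.
chunks = [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def read_element(index):
    """Inflate only the one chunk holding `index`, then extract the byte."""
    chunk = zlib.decompress(chunks[index // CHUNK])
    return chunk[index % CHUNK]

# Non-tiled + compressed: the entire array must be inflated for any subset.
whole = zlib.compress(data)

def read_element_untiled(index):
    return zlib.decompress(whole)[index]

print(read_element(250) == data[250])  # True; only chunk 2 was inflated
```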
As compressed image files can be as large as hundreds of megabytes in size, and a gigabyte or more uncompressed, it is clear that the I/O requirements of reading from and writing to non-tiled, compressed SDSs can be immense, if not prohibitive. Add to this the additional I/O burden inherent in situations where portions of several image files must be read at the same time for comparison, and the benefits of tiling become even more apparent.

13.3.5 Effect of Chunk Size on Performance
The main concern in modelling data for chunking is that the chunk size be approximately equal to the average expected size of the data block needed by the application.

13.3.6 How Insufficient Chunk Cache Space can Impair Chunking Performance
The HDF library provides for the caching of chunks. This can substantially improve I/O performance when a particular chunk must be accessed more than once.
FIGURE 13r Example 4 x 12 Element Scientific Data Set
Suppose this dataset is untiled, and the subset shown in the following figure must be read.
FIGURE 13s 2 x 8 Element Subset of the 4 x 12 Scientific Data Set
As this dataset is untiled, the numbers are stored in linear order. SDreaddata finds the longest contiguous stream of numbers and requests that the lower levels of the library read it into memory. The two rows of the subset are read in turn:
3 4 5 6 7 8 9 10
23 24 25 26 27 28 29 30
This involves two reads (two disk accesses), with sixteen numbers read in total.
FIGURE 13t 4 x 12 Element Data Set with 2 x 2 Element Tiles
Now suppose the dataset is tiled with 2 x 2 element tiles, as shown above, and the chunk cache size is set to 2.
Next, consider a larger example: a 300 x 1,000 element subset must be read from a 3,000 x 8,400 element dataset. If the dataset is untiled, the numbers are read into memory row-by-row. This involves 300 disk accesses, one for each of the 300 rows, with each access reading in 1,000 numbers. The total number of numbers read is 300,000.
Each square in the following figure represents one 100 x 100 element region of the dataset. Five tiles span the 300 x 1,000 target subset. For the purposes of this example, they will be labelled A, B, C, D and E.
FIGURE 13u Five 200 x 300 Element Tiles Labelled A, B, C, D and E
First, the higher-level code instructs the lower-level code to read in the first row of subset numbers. The lower-level code must read all five tiles (A through E) into memory, as they all contain numbers in the first row. Tiles A and B are read into the cache without problem, then the following set of cache overwrites occurs: tile C overwrites tile A, tile D overwrites tile B, and tile E overwrites tile C, leaving tiles D and E in the cache.
The second row is then read. The higher-level code first requests tile A; however, the cache is full, so tile A must overwrite tile D. Then the following set of cache overwrites occurs: tile B overwrites tile E, tile C overwrites tile A, tile D overwrites tile B, and tile E overwrites tile C. Every tile must be reread from disk for every row of the subset.
Essentially, five times more disk accesses are being performed (1,500 instead of 300) and 300 times more data is being read (each of the 1,500 accesses reads a full 60,000-number tile, or 90,000,000 numbers in total, instead of 300,000) than with the untiled 3,000 x 8,400 dataset. The severity of the performance degradation increases in a non-linear fashion as the size of the dataset increases.
From this example it should be apparent that, to prevent this kind of chunk cache "thrashing" from occurring, the size of the chunk cache should be made equal to, or greater than, the number of chunks along the fastest-varying dimension of the dataset. In this case, the chunk cache size should be set to 42 (8,400 divided by the tile width of 200).
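The thrashing arithmetic above can be reproduced with a small cache simulation. This sketch assumes a least-recently-used eviction policy, an assumption consistent with the overwrites described in the example but not a statement about the HDF library's actual internals; tile_loads is an invented name.

```python
from collections import OrderedDict

def tile_loads(rows, tiles_per_row, cache_size):
    """Count disk reads when every subset row touches the same tiles in
    the same order, with an LRU chunk cache of `cache_size` entries."""
    cache = OrderedDict()
    loads = 0
    for _ in range(rows):
        for t in range(tiles_per_row):
            if t in cache:
                cache.move_to_end(t)        # cache hit: refresh recency
            else:
                loads += 1                  # cache miss: read tile from disk
                cache[t] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
    return loads

# 300 subset rows, 5 tiles (A..E) per row:
print(tile_loads(300, 5, 2))  # 1500: every tile is reread for every row
print(tile_loads(300, 5, 5))  # 5: each tile is read from disk exactly once
```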
When a chunked SDS is opened for reading or writing, the default cache size is set to the number of chunks along the fastest-varying dimension of the SDS. This will prevent cache thrashing from occurring in situations where the user doesn't set the size of the chunk cache. Caution should be exercised by the user when altering this default chunk cache size.