Arithmetic Data Transforms
Leon Arber, Albert Cheng, William Wendling
Data can be stored and represented in many different ways. In most fields of science, for example, the metric system is used for storing all data. However, many fields of engineering still use the English system. In such scenarios, there needs to be a way to easily perform arbitrary scaling of data. The data transforms provide just such functionality. They allow arbitrary arithmetic expressions to be applied to a dataset during read and write operations. This means that data can be stored in Celsius in a data file, but read in and automatically converted to Fahrenheit. Alternatively, data that is obtained in Fahrenheit can be written out to the data file in Celsius.
Although a user can always manually modify the data they read and write, having the data transform as a property means that the user doesn't have to worry about forgetting to call the conversion function, or even about writing it in the first place.
The data transform functionality is implemented as a property that is set on a dataset transfer property list. There are two functions available: one for setting the transform and another for finding out what transform, if any, is currently set.
The function for setting the transform is:
herr_t H5Pset_data_transform(hid_t plist_id, const char* expression)
plist_id is the identifier of the dataset transfer property list on which the data transform property should be set.
expression is a pointer to a string of the form (5/9.0)*(x-32) which describes the transform.
The function for getting the transform is:
ssize_t H5Pget_data_transform(hid_t plist_id, char* expression, size_t size)
plist_id is the identifier of the dataset transfer property list which will be queried for its data transform property.
expression is either NULL or a pointer to memory where the data transform string, if present, will be copied.
size is the number of bytes to copy from the transform string into expression. H5Pget_data_transform will never copy more than the length of the transform expression.
Data transforms are set by passing a pointer to a string, which is the data transform expression. This string describes the arithmetic transform to be applied during a read or write data transfer. The string is a standard mathematical expression, as would be entered into a tool such as MATLAB.
Expressions are defined by the following context-free grammar:
expr := term | term + term | term - term
term := factor | factor * factor | factor / factor
factor := number | symbol | - factor | + factor | ( expr )
symbol := [a-zA-Z][a-zA-Z0-9]*
number := INT | FLOAT
where INT is interpreted as a C long int and FLOAT is interpreted as a C double.
This grammar allows for order of operations (multiplication and division take precedence over addition and subtraction), floating-point and integer constants, and grouping of terms by way of parentheses. Although the grammar allows symbols to be arbitrary strings, this documentation will always use x for symbols.
Within a transform expression, the symbol represents a variable which contains the data to be manipulated. For this reason, the terms symbol and variable will be used interchangeably. Furthermore, in the current implementation of data transforms, all symbols appearing in an expression are interpreted as referring to the same dataset. So, an expression such as alpha + 5 is equivalent to x+5 and an expression such as alpha + 3*beta + 5 is equivalent to alpha + 3*alpha + 5 which is equivalent to 4*x + 5.
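The grammar and the symbol rule above can be illustrated with a small recursive-descent evaluator. This is only a sketch under stated assumptions, not the library's parser: the names value, parser, combine, and eval_transform are hypothetical, every symbol is bound to a single element passed in as a double, and chains such as a + b + c are accepted even though the grammar as written lists only a single operator per rule.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* A value is either an INT (C long) or a FLOAT (C double). */
typedef struct { int is_float; long i; double d; } value;

/* Parser state: current position and the element bound to every symbol. */
typedef struct { const char *p; value x; } parser;

static value parse_expr(parser *ps);

static double as_double(value v) { return v.is_float ? v.d : (double)v.i; }

/* Apply one binary operator; integer arithmetic only if both sides are INT. */
static value combine(value a, char op, value b)
{
    value r = { 0, 0, 0.0 };
    if (a.is_float || b.is_float) {
        double l = as_double(a), m = as_double(b);
        r.is_float = 1;
        r.d = op == '+' ? l + m : op == '-' ? l - m : op == '*' ? l * m : l / m;
    } else {
        assert(op != '/' || b.i != 0); /* sketch only: no error recovery */
        r.i = op == '+' ? a.i + b.i : op == '-' ? a.i - b.i
            : op == '*' ? a.i * b.i : a.i / b.i;
    }
    return r;
}

static void skip_ws(parser *ps) { while (isspace((unsigned char)*ps->p)) ps->p++; }

/* factor := number | symbol | - factor | + factor | ( expr ) */
static value parse_factor(parser *ps)
{
    value v = { 0, 0, 0.0 };
    skip_ws(ps);
    if (*ps->p == '-' || *ps->p == '+') {
        char sign = *ps->p++;
        v = parse_factor(ps);
        if (sign == '-') { v.i = -v.i; v.d = -v.d; }
    } else if (*ps->p == '(') {
        ps->p++;
        v = parse_expr(ps);
        skip_ws(ps);
        assert(*ps->p == ')');
        ps->p++;
    } else if (isalpha((unsigned char)*ps->p)) {
        /* Any symbol name refers to the same data element. */
        while (isalnum((unsigned char)*ps->p)) ps->p++;
        v = ps->x;
    } else {
        /* A number is FLOAT if strtod consumes more text than strtol. */
        char *end_i, *end_d;
        long i = strtol(ps->p, &end_i, 10);
        double d = strtod(ps->p, &end_d);
        assert(end_d != ps->p);
        if (end_d > end_i) { v.is_float = 1; v.d = d; } else { v.i = i; }
        ps->p = end_d;
    }
    return v;
}

/* term := factor followed by any number of * factor or / factor */
static value parse_term(parser *ps)
{
    value v = parse_factor(ps);
    for (;;) {
        skip_ws(ps);
        if (*ps->p != '*' && *ps->p != '/') return v;
        char op = *ps->p++;
        v = combine(v, op, parse_factor(ps));
    }
}

/* expr := term followed by any number of + term or - term */
static value parse_expr(parser *ps)
{
    value v = parse_term(ps);
    for (;;) {
        skip_ws(ps);
        if (*ps->p != '+' && *ps->p != '-') return v;
        char op = *ps->p++;
        v = combine(v, op, parse_term(ps));
    }
}

/* Evaluate a transform expression for one element x (assumed to be a double). */
double eval_transform(const char *expression, double x)
{
    parser ps = { expression, { 1, 0, x } };
    value v = parse_expr(&ps);
    skip_ws(&ps);
    assert(*ps.p == '\0'); /* the whole string must be consumed */
    return as_double(v);
}
```

With this sketch, eval_transform("alpha + 3*beta + 5", 2.0) and eval_transform("4*x + 5", 2.0) give the same result, since both symbols are bound to the same element.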
When the data transform property of a dataset transfer property list is set, a parse tree of the expression is immediately generated and its root is saved in the property list. The generation of the parse tree involves several steps.
First, the expression is reduced, so as to simplify the final parse tree and speed up the transform operations. Expressions such as (5/9.0)*(x-32) will be reduced to .555555*(x-32). While further simplification is algebraically possible, the data transform code will only reduce simple arithmetic operations.
Then, this reduced expression is parsed into a set of tokens, from which the parse tree is generated. From the expression (5/9.0)*(x-32), for example, the resulting parse tree has the * operator at its root, with the constant .555555 and the - operator as its children; the - node in turn has x and 32 as its children.
When a read is performed with a dataset transfer property list that has the data transform property set, the following sequence of events occurs:
Step 2 works like this:
If the transform expression is (5/9.0)*(x-32), with the parse tree shown above and the buffer contains [-10 0 10 50 100], then the intermediate steps involved in the transform are:
Note that the original data in the file was not modified.
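The way a parse tree is applied to a buffer can be sketched as follows. The node layout and the names node, eval_node, apply_transform, and fahrenheit_to_celsius are hypothetical and are not the library's internal data structures; for simplicity the sketch does all arithmetic in double and builds the reduced tree for (5/9.0)*(x-32) by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout; not the HDF5 library's internal structure. */
typedef enum { NODE_NUM, NODE_SYM, NODE_ADD, NODE_SUB, NODE_MUL, NODE_DIV } node_kind;

typedef struct node {
    node_kind kind;
    double value;                 /* used only by NODE_NUM */
    const struct node *lhs, *rhs; /* used only by operator nodes */
} node;

/* Evaluate the tree for one data element x. */
double eval_node(const node *n, double x)
{
    assert(n != NULL);
    switch (n->kind) {
    case NODE_NUM: return n->value;
    case NODE_SYM: return x;
    case NODE_ADD: return eval_node(n->lhs, x) + eval_node(n->rhs, x);
    case NODE_SUB: return eval_node(n->lhs, x) - eval_node(n->rhs, x);
    case NODE_MUL: return eval_node(n->lhs, x) * eval_node(n->rhs, x);
    case NODE_DIV: return eval_node(n->lhs, x) / eval_node(n->rhs, x);
    }
    return 0.0; /* not reached for well-formed trees */
}

/* Apply the transform in place to a buffer of n elements. */
void apply_transform(const node *root, double *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = eval_node(root, buf[i]);
}

/* Build the reduced tree for (5/9.0)*(x-32) and apply it to buf. */
void fahrenheit_to_celsius(double *buf, size_t n)
{
    const node scale = { NODE_NUM, 5 / 9.0, NULL, NULL };
    const node x     = { NODE_SYM, 0.0, NULL, NULL };
    const node c32   = { NODE_NUM, 32.0, NULL, NULL };
    const node diff  = { NODE_SUB, 0.0, &x, &c32 };
    const node root  = { NODE_MUL, 0.0, &scale, &diff };
    apply_transform(&root, buf, n);
}
```

For the buffer [-10 0 10 50 100], the subtraction node yields [-42 -32 -22 18 68], and the multiplication node then gives approximately [-23.33 -17.78 -12.22 10 37.78].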
The process of a write works much the same way, but in the reverse order. When a file is written out with a dataset transfer property list that has the data transform property set:
Step 2 works exactly as in the read example. Note that the user's data is not modified. Also, since the transform property is not saved with the dataset, a user who wants to recover the original data must know the inverse of the transform that was applied. In the case of (5/9.0)*(x-32) this inverse would be (9/5.0)*x + 32. Reading a data file that had been written with the transform string (5/9.0)*(x-32), using the transform string (9/5.0)*x + 32 on the read, would effectively recover the original data the author of the file had been using.
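The round trip through a transform and its inverse can be checked with plain C arithmetic. The sketch below does not call the library; the function names f_to_c and c_to_f are hypothetical stand-ins for the two transform expressions, and float storage is an assumption chosen to mirror a floating-point dataset.

```c
#include <assert.h>

/* (5/9.0)*(x-32): the transform applied on write in this example. */
float f_to_c(float x)
{
    return (float)((5 / 9.0) * (x - 32));
}

/* (9/5.0)*x + 32: its algebraic inverse, applied on read. */
float c_to_f(float x)
{
    return (float)((9 / 5.0) * x + 32);
}
```

Because each step is rounded to the storage type, c_to_f(f_to_c(x)) should be expected to match x only to within floating-point rounding, not bit for bit.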
Because the data transform modifies data as it moves between the file space and the memory space, the type conversions involved in these operations can have various side effects. In addition, because constants in the data transform expression can be either INT or FLOAT, the data transform itself can be a source of truncation.
In the example above, the transform expression is always written as (5/9.0)*(x-32) because, if it were written without a floating-point constant, it would always evaluate to 0. The expression (5/9)*(x-32) would, when set, get reduced to 0*(x-32) because both 5 and 9 would be read as C long ints and, when divided, the result would be truncated to 0. The resulting expression, 0*(x-32), would cause any data read or written to be saved as an array of all 0s.
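The difference between the two spellings of the constant follows directly from C arithmetic on long and double operands. A minimal sketch, with hypothetical function names and no library calls:

```c
#include <assert.h>

/* 5/9 with both constants read as C long ints: integer division. */
long reduce_int_constants(void)
{
    long five = 5, nine = 9;
    return five / nine;
}

/* 5/9.0: the FLOAT constant promotes the division to double. */
double reduce_mixed_constants(void)
{
    long five = 5;
    double nine = 9.0;
    return five / nine;
}
```

The first function returns 0, which is why (5/9)*(x-32) collapses to 0*(x-32); the second returns roughly 0.555556.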
Another source of unpredictability caused by truncation occurs when intermediate data is of a type that is more precise than the destination memory type. For example, if the transform expression (1/2.0)*x is applied to data read from a file that is being read into an integer memory buffer, the results can be unpredictable. If the source array is [1 2 3 4], then the resulting array could be either [0 1 1 2] or [0 0 1 1], depending on the floating-point unit of the processor. Note that this result is independent of the source data type. It doesn't matter if the source data is integer or floating point because the 2.0 in the data transform expression will cause everything to be evaluated in a floating-point context.
When setting transform expressions, care must be taken to ensure that the truncation does not adversely affect the data. A workaround for the possible effects of a transform such as (1/2.0)*x would be to use the transform expression (1/2.0)*x + 0.5 instead of the original. This will ensure that all truncation rounds up, with the possible exception of a boundary condition.
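The effect of the + 0.5 workaround can be seen with an ordinary C cast from double to int. This is a sketch only: halve_into_int is a hypothetical helper, and it assumes the conversion into the integer memory buffer behaves like a C cast, which truncates toward zero.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate (1/2.0)*x + offset in double, then truncate into an int buffer
 * the way a C cast from double to int does. */
void halve_into_int(const int *src, int *dst, size_t n, double offset)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (int)((1 / 2.0) * src[i] + offset);
}
```

On a platform with IEEE 754 doubles these particular values are exact, so for the source array [1 2 3 4] the sketch gives [0 1 1 2] with an offset of 0.0 and [1 1 2 2] with an offset of 0.5. It does not reproduce the platform-dependent results described above.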
The following code snippet shows an example using a data transform, where the data transform property is set and a write is performed. Then, a read is performed with no data transform property set. It is assumed that dataset is a dataset that has been opened and windchillF and windchillC are both arrays that hold floating-point data. The result of this snippet is to fill windchillC with the data in windchillF, converted to Celsius.
hid_t dxpl_id_f_to_c;
herr_t status;
const char* f_to_c = "(5/9.0)*(x-32)";
/* Create the dataset transfer property list */
dxpl_id_f_to_c = H5Pcreate(H5P_DATASET_XFER);
/* Set the data transform to be used on the write */
H5Pset_data_transform(dxpl_id_f_to_c, f_to_c);
/* Write the data to the dataset using the f_to_c transform */
status = H5Dwrite(dataset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, dxpl_id_f_to_c, windchillF);
/* Read the data back with no data transform set */
status = H5Dread(dataset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, windchillC);
/* Release the property list */
H5Pclose(dxpl_id_f_to_c);
Querying the data transform string of a dataset transfer property list requires the use of the H5Pget_data_transform function. This function provides the ability to both query the size of the string stored and retrieve part or all of it. Note that H5Pget_data_transform will return the expression that was set by H5Pset_data_transform. The reduced transform string, computed when H5Pset_data_transform is called, is not stored in string form and is not available to the user.
In order to ascertain the size of the string, a NULL expression should be passed to the function. This will make the function return the length of the transform string (not including the terminating \0 character).
To actually retrieve the string, a pointer to a valid memory location should be passed in for expression and the number of bytes from the string that should be copied to that memory location should be passed in as size.
Some additional functionality can still be added to the data transform. Currently the most important missing feature is support for additional operators, such as exponentiation and the trigonometric functions. Although exponentiation can be carried out explicitly with a transform expression such as x*x*x, it may be easier to support expressions like x^3. Also lacking are the commonly used trigonometric functions, such as sin, cos, and tan.
Popular constants could also be added, such as π or e.
More advanced functionality, such as the ability to perform a transform on multiple datasets, is also a possibility, but such a feature is more a completely new addition than an extension to data transforms.
Mr. Wendling, who was involved in the initial design and implemented the expression parser, has left NCSA.
See the h5_dtransform.c example in the examples directory of the HDF5 library for an illustration of recovering the original data with an inverse transform.