Encode and Decode HDF5 Objects

                                    Raymond Lu & Quincey Koziol

                                    July 27, 2004

 

I.  Document’s Audience

 

Current HDF5 library designers and knowledgeable external developers.

 

II.  Functionality

 

The functions described in this document encode the description of an HDF5 object, identified by its ID, into a binary buffer, and decode such a buffer back into an object.  At this stage, these functions handle only datatype and dataspace objects.  In the future, we may extend them to other kinds of objects.

 

III.  Motivations and Use Cases

 

These functions are motivated by the need to transmit an object description between two tasks: the first task encodes the object, and the second task reconstructs it from the encoding and obtains a new ID for it.  This can happen among different processes in an MPI program.  Another use is to compute a checksum of the object from its encoding.

 

A user’s program may also decode the binary description of the object itself if there is a need to do so.  See below for a description of the encoding of the objects.

 

Below are some use cases (borrowed from Robb Matzke):

 

1.  In a parallel program (plain old PHDF5 and not FPHDF5) datasets must be created collectively over the file communicator, but sometimes conceptually only one MPI task (call it the "root" task) wants to create the object.  Therefore the non-root tasks need to jump into the H5Dcreate() call too, and they all need to supply various parameters, including a datatype.  HDF5 datatypes have such rich expressive power that it would be easiest if HDF5 provided some way to help the root task broadcast the datatype to the other tasks, rather than having the programmer dissect the datatype, broadcast its description, and then reconstruct it on the other tasks.  That is where the encode/decode functions come in: the encode function takes a datatype and converts it into a task-independent representation that can be broadcast to the other tasks, which then decode that representation back into an hid_t handle to a datatype.

 

2.  This second use case is similar in that one MPI task has a datatype that it would like to save in a file.  Normally H5Tcommit() is a file-collective operation.  However, writing to an existing dataset is independent, so a programmer could encode a datatype into the task-independent representation and independently write that into an existing dataset.  Obviously, in this situation a reader program would have to know that the sequence of bytes is an encoded datatype in order to do anything useful with it.

 

3.  To check whether two datatypes on different tasks are equivalent, you could byte-compare their task-independent encodings.

 

4.  If you have lots of datatypes to compare (e.g., sorting an array of datatypes), it would probably be faster to encode them all first and then sort based on the encoding, since memcmp() is almost certainly faster than H5T_cmp().

 

5.  If you have lots of datatypes to save in a file and those types are distributed among MPI tasks, then it is probably fastest to have each task encode its datatypes, MPI_Gather() them to a single I/O task, and then have that task independently write the encoded types to a pre-existing dataset.
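
Use cases 3 and 4 rely on comparing encodings byte by byte.  A minimal sketch of that idea in plain C follows.  The encoded_obj type and the encoded_cmp() and encoded_sort() helpers are hypothetical names introduced here for illustration and are not part of HDF5.  The sketch assumes the buffers were produced earlier by the encode functions and that equivalent datatypes yield identical bytes, as the use cases above take for granted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One encoded object: a pointer to its bytes plus the byte count.
 * (Hypothetical helper type, not part of the HDF5 API.) */
typedef struct {
    const unsigned char *bytes;
    size_t               len;
} encoded_obj;

/* Order two encodings byte-wise.  If one buffer is a prefix of the
 * other, the shorter one sorts first.  Returns <0, 0 or >0. */
int encoded_cmp(const void *pa, const void *pb)
{
    const encoded_obj *a = (const encoded_obj *)pa;
    const encoded_obj *b = (const encoded_obj *)pb;
    size_t n = (a->len < b->len) ? a->len : b->len;
    int    c = (n > 0) ? memcmp(a->bytes, b->bytes, n) : 0;

    if (c != 0)
        return c;
    return (a->len > b->len) - (a->len < b->len);
}

/* Sort an array of encodings in place using the ordering above. */
void encoded_sort(encoded_obj *objs, size_t count)
{
    assert(objs != NULL || count == 0);
    if (count > 1)
        qsort(objs, count, sizeof objs[0], encoded_cmp);
}
```

Two encodings are treated as equal when encoded_cmp() returns 0.  The resulting order is arbitrary but consistent, which is all that sorting or duplicate detection needs.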

 

IV.  Release

 

These functions are for release 1.8 of the library.

 

V.  Algorithms & Format of Data

 

For the internal library design, we are going to borrow the algorithm used to encode and decode object header messages.  It will be helpful to look at the data object part (Part IV) of the HDF5 File Format document.  The data objects of interest here are datatypes and dataspaces.

 

Format of the datatype information:

            The encoded datatype is composed of the following information:

·        A single byte indicating that the buffer holds a datatype.  Its value (3) is the same as the datatype message ID in object headers in the file format.

·        A single byte indicating the version of the datatype information, currently set to 0.

·        A sequence of bytes that have the same format as the datatype message in the object header.  See the HDF5 file format document for the exact format.
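
As an illustration of this layout, the sketch below checks the two leading bytes of a buffer and locates the datatype message bytes that follow.  The function name enc_datatype_check() and the two macros are hypothetical and introduced only for this example; the values 3 and 0 come from the list above.  The datatype message itself is not interpreted.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the format description above (hypothetical names). */
#define ENC_DATATYPE_ID      3  /* same as the datatype message ID */
#define ENC_DATATYPE_VERSION 0  /* current version of the encoding */

/* Return 1 if buf starts with the expected encoded-datatype header,
 * otherwise 0.  On success, *msg points at the first byte of the
 * datatype message and *msg_len holds the number of bytes remaining. */
int enc_datatype_check(const unsigned char *buf, size_t len,
                       const unsigned char **msg, size_t *msg_len)
{
    assert(msg != NULL && msg_len != NULL);

    if (buf == NULL || len < 2)
        return 0;                       /* too short to hold the header */
    if (buf[0] != ENC_DATATYPE_ID)
        return 0;                       /* not an encoded datatype */
    if (buf[1] != ENC_DATATYPE_VERSION)
        return 0;                       /* unknown encoding version */

    *msg     = buf + 2;
    *msg_len = len - 2;
    return 1;
}
```

A reader program could run such a check before handing a buffer to H5Tdecode(), since the decode function itself is not given the buffer length.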

 

Format of the dataspace information:

            The encoded dataspace is composed of the following information:

·        A single byte indicating that the buffer holds a dataspace.  Its value (1) is the same as the dataspace message ID in object headers in the file format.

·        A single byte indicating the version of the dataspace information, currently set to 0.

·        A single byte indicating the size in bytes of the “size of lengths” information for the rest of the information encoded in the buffer.

·        A 16-bit value indicating the size in bytes of the dataspace extent information.

·        A sequence of bytes that store the extent of the dataspace and that have the same format as the dataspace message in the object header.  See the HDF5 file format document for the exact format.

·        A sequence of bytes that store the selection within the dataspace and that have the same format as the region portion of a region reference.  See the HDF5 file format document for the exact format.
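
The field order above can be sketched as a small header parser.  This is only an illustration under stated assumptions: the “size of lengths” field is taken to be one byte wide, the 16-bit extent size is assumed to be stored little-endian (the list does not give a byte order), and the names enc_dataspace_hdr and enc_dataspace_parse() are hypothetical.  The extent and selection bytes are located but not interpreted.

```c
#include <assert.h>
#include <stddef.h>

/* Parsed view of an encoded dataspace (hypothetical helper type). */
typedef struct {
    unsigned             version;         /* encoding version, currently 0 */
    unsigned             size_of_lengths; /* "size of lengths" in bytes    */
    unsigned             extent_size;     /* bytes of extent information   */
    const unsigned char *extent;          /* start of the extent bytes     */
    const unsigned char *selection;       /* start of the selection bytes  */
    size_t               selection_len;   /* bytes left for the selection  */
} enc_dataspace_hdr;

/* Split buf into its header fields.  Returns 0 on success, -1 if the
 * buffer is too short, has the wrong type byte, or an unknown version. */
int enc_dataspace_parse(const unsigned char *buf, size_t len,
                        enc_dataspace_hdr *out)
{
    assert(out != NULL);

    if (buf == NULL || len < 5)
        return -1;                 /* header is 1 + 1 + 1 + 2 bytes */
    if (buf[0] != 1 || buf[1] != 0)
        return -1;                 /* type must be 1, version must be 0 */

    out->version         = buf[1];
    out->size_of_lengths = buf[2];
    /* ASSUMPTION: 16-bit value stored least significant byte first */
    out->extent_size     = (unsigned)buf[3] | ((unsigned)buf[4] << 8);

    if (out->extent_size > len - 5)
        return -1;                 /* extent would run past the buffer */

    out->extent        = buf + 5;
    out->selection     = buf + 5 + out->extent_size;
    out->selection_len = len - 5 - out->extent_size;
    return 0;
}
```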

 

VI.  Examples

 

Here is an example of H5Tencode() and H5Tdecode().  The code creates a compound datatype and encodes it.  It then decodes the buffer, which returns a new object ID.

 

    struct s1 {
        int    a;
        float  b;
    };

    hid_t          tid1, decoded_tid1;
    unsigned char *cmpd_buf = NULL;
    size_t         cmpd_buf_size = 0;
        :

    /* Create a compound datatype */
    if((tid1=H5Tcreate(H5T_COMPOUND, sizeof(struct s1)))<0)
        goto error;

    if(H5Tinsert(tid1, "a", HOFFSET(struct s1, a), H5T_NATIVE_INT)<0)
        goto error;

    if(H5Tinsert(tid1, "b", HOFFSET(struct s1, b), H5T_NATIVE_FLOAT)<0)
        goto error;

    /* Find out the size of buffer needed for encoding */
    if(H5Tencode(tid1, NULL, &cmpd_buf_size)<0)
        goto error;

    /* Allocate the buffer; give up if the allocation fails */
    if(cmpd_buf_size>0)
        cmpd_buf = (unsigned char*)calloc(1, cmpd_buf_size);
    if(cmpd_buf==NULL)
        goto error;

    /* Encode compound type in the buffer */
    if(H5Tencode(tid1, cmpd_buf, &cmpd_buf_size)<0)
        goto error;

    /* Decode from the compound buffer and return an object handle */
    if((decoded_tid1=H5Tdecode(cmpd_buf))<0)
        goto error;

 

Below is an example of encoding and decoding a simple dataspace using H5Sencode() and H5Sdecode().  Notice that a dataspace selection (a hyperslab in this case) can also be encoded and decoded.

 

    /* CHECK, FAIL and SPACE1_RANK are assumed to be defined elsewhere
     * (for example, in a test harness); SPACE1_RANK must be 3 to match dims1 */
    hid_t          sid1, decoded_sid1;
    herr_t         ret;
    unsigned char *sbuf = NULL;
    size_t         sbuf_size = 0;
    hsize_t        dims1[] = {3, 15, 13};
    hsize_t        start[] = {0, 0, 0};
    hsize_t        stride[] = {2, 5, 3};
    hsize_t        count[] = {2, 2, 2};
    hsize_t        block[] = {1, 3, 1};

    /* Create a simple data space of 3x15x13 */
    sid1 = H5Screate_simple(SPACE1_RANK, dims1, NULL);
    CHECK(sid1, FAIL, "H5Screate_simple");

    /* Make a hyperslab selection */
    ret = H5Sselect_hyperslab(sid1, H5S_SELECT_SET, start, stride,
        count, block);
    CHECK(ret, FAIL, "H5Sselect_hyperslab");

    /* Find out the size of buffer needed for encoding */
    ret = H5Sencode(sid1, NULL, &sbuf_size);
    CHECK(ret, FAIL, "H5Sencode");

    /* Allocate the buffer */
    if(sbuf_size>0)
        sbuf = (unsigned char*)calloc(1, sbuf_size);

    /* Encode simple data space in the buffer */
    ret = H5Sencode(sid1, sbuf, &sbuf_size);
    CHECK(ret, FAIL, "H5Sencode");

    /* Decode from the dataspace buffer and return an object handle */
    decoded_sid1=H5Sdecode(sbuf);
    CHECK(decoded_sid1, FAIL, "H5Sdecode");

 

VII.  API Functions

 

Name: H5Tencode

Signature:

herr_t  H5Tencode(hid_t obj_id, unsigned char* buf, size_t* nalloc)

Purpose:

Encode a data type object description into a binary buffer.

Description:

Given a data type ID, H5Tencode converts the data type description into binary form in a buffer.  From this binary form, the data type object can be reconstructed with H5Tdecode, which returns a new object handle (hid_t) for the data type.

 

A preliminary H5Tencode call can be made to find out the size of the buffer needed.  This value is returned through nalloc.  A buffer of that size can then be allocated and passed, together with nalloc, to a second H5Tencode call, which retrieves the actual encoded object.

 

If the library finds that nalloc is not big enough for the object, it simply returns the size of the buffer needed through nalloc without encoding into the provided buffer.

Parameters:

hid_t obj_id

IN: Identifier of the object to be encoded.

unsigned char* buf

IN/OUT: Buffer for the object to be encoded into.  If the provided buffer is NULL, only the size of buffer needed is returned through nalloc.

size_t* nalloc

IN: The size of the allocated buffer.

OUT: The size of the buffer needed.

Returns:

Returns a non-negative value if successful; otherwise returns a negative value.
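
The size query followed by the real call can be wrapped in a helper.  The sketch below is a model of the contract described above, not library code: encode_fn, encode_alloc() and toy_encode() are hypothetical names, the object identifier is represented by a plain long, and toy_encode() is a stand-in that merely mimics the NULL-buffer and too-small-buffer behaviour so the helper can be exercised without the library.

```c
#include <assert.h>
#include <stdlib.h>

/* Shape of an encode routine that follows the contract above.
 * `obj` stands in for the object identifier (hypothetical typedef). */
typedef int (*encode_fn)(long obj, unsigned char *buf, size_t *nalloc);

/* Query the size, allocate, then encode.  Returns a buffer that the
 * caller must free(), or NULL on failure.  *size_out receives the
 * number of encoded bytes. */
unsigned char *encode_alloc(encode_fn encode, long obj, size_t *size_out)
{
    size_t         need = 0;
    unsigned char *buf;

    assert(encode != NULL && size_out != NULL);

    /* First call: NULL buffer, so only the needed size comes back */
    if (encode(obj, NULL, &need) < 0 || need == 0)
        return NULL;

    buf = (unsigned char *)calloc(1, need);
    if (buf == NULL)
        return NULL;

    /* Second call: the buffer is large enough, so the object is encoded */
    *size_out = need;
    if (encode(obj, buf, size_out) < 0 || *size_out != need) {
        free(buf);
        return NULL;
    }
    return buf;
}

/* Toy stand-in encoder, NOT HDF5: "encodes" obj as the three bytes
 * {3, 0, obj & 0xff} while following the nalloc contract. */
int toy_encode(long obj, unsigned char *buf, size_t *nalloc)
{
    const size_t need = 3;

    if (nalloc == NULL)
        return -1;
    if (buf == NULL || *nalloc < need) {
        *nalloc = need;            /* report the size, write nothing */
        return 0;
    }
    buf[0] = 3;
    buf[1] = 0;
    buf[2] = (unsigned char)(obj & 0xff);
    *nalloc = need;
    return 0;
}
```

With the library, the callback would presumably be a thin wrapper that forwards its arguments to H5Tencode.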

 

Name: H5Tdecode

Signature:

hid_t  H5Tdecode(unsigned char* buf)

Purpose:

Decode a binary object description of a data type and return a new object handle.

Description:

Given a binary description of a data type in a buffer, H5Tdecode reconstructs the HDF5 data type object and returns a new object handle for it.  The binary description must have been produced by H5Tencode.  The user is responsible for passing in the right buffer.

Parameters:

unsigned char* buf

IN: Buffer for the data type object to be decoded.

Returns:

Returns an object ID (non-negative) if successful; otherwise returns a negative value.

 

Name: H5Sencode

Signature:

herr_t  H5Sencode(hid_t obj_id, unsigned char* buf, size_t* nalloc)

Purpose:

Encode a data space object description into a binary buffer.

Description:

Given a data space ID, H5Sencode converts the data space description into binary form in a buffer.  From this binary form, the data space object can be reconstructed with H5Sdecode, which returns a new object handle (hid_t) for the data space.

 

A preliminary H5Sencode call can be made to find out the size of the buffer needed.  This value is returned through nalloc.  A buffer of that size can then be allocated and passed, together with nalloc, to a second H5Sencode call, which retrieves the actual encoded object.

 

If the library finds that nalloc is not big enough for the object, it simply returns the size of the buffer needed through nalloc without encoding into the provided buffer.

 

The types of data space addressed by this function are null, scalar, and simple.  For a simple data space, the selection information (for example, a hyperslab selection) is also encoded and decoded.  Complex data spaces have not been implemented in the library.

Parameters:

hid_t obj_id

IN: Identifier of the object to be encoded.

unsigned char* buf

IN/OUT: Buffer for the object to be encoded into.  If the provided buffer is NULL, only the size of buffer needed is returned through nalloc.

size_t* nalloc

IN: The size of the allocated buffer.

OUT: The size of the buffer needed.

Returns:

Returns a non-negative value if successful; otherwise returns a negative value.

 

Name: H5Sdecode

Signature:

hid_t  H5Sdecode(unsigned char* buf)

Purpose:

Decode a binary object description of a data space and return a new object handle.

Description:

Given a binary description of a data space in a buffer, H5Sdecode reconstructs the HDF5 data space object and returns a new object handle for it.  The binary description must have been produced by H5Sencode.  The user is responsible for passing in the right buffer.

 

The types of data space addressed by this function are null, scalar, and simple.  For a simple data space, the selection information (for example, a hyperslab selection) is also encoded and decoded.  Complex data spaces have not been implemented in the library.

Parameters:

unsigned char* buf

IN: Buffer for the data space object to be decoded.

Returns:

Returns an object ID (non-negative) if successful; otherwise returns a negative value.