Conversion Between Text and Datatype

                                    Quincey Koziol & Raymond Lu

                             

                                                Aug 11, 2004

                                        Revised on Sep 21, 2004

 

I.                   Document’s Audience

 

Current HDF5 library designers and knowledgeable external developers.

 

 

II.                Requirements and Use Cases

 

1.  Some users have asked for a way to create an HDF5 data type in a single step.  Although this saves little for an atomic data type, it can collapse the multiple steps needed to create more complex data types, such as compound and array types, into one.  The ASCI people have been trying to use a macro to encapsulate a data type text description so that a data type can be defined at run time in a single place.

 

2.  On the other hand, some users have asked for a text description of an HDF5 data type for debugging purposes.  This description can be in different language formats.  This is quite similar to what h5dump does.

 

3.  Another request is to make the private HDF5 function H5T_cmp public.  It compares two data types in the library’s predefined order, which is useful for sorting a table of data types.

 

 

III.             Functionality

 

We are going to add two new major functions to the library to satisfy these requirements: H5Ttext_to_type and H5Ttype_to_text.  H5Ttext_to_type converts a text description of a data type into an HDF5 datatype object and returns its ID.  H5Ttype_to_text returns a text description of the HDF5 datatype identified by a given ID.  In both cases, the text must comply with the grammar we have predefined for the language in use.  More details of these two functions can be found in Section VIII of this document.

 

There will be another function called H5Tcmp.  It compares two data types according to an ordering that the library predefines; parts of this ordering are arbitrary.

 

 

IV.              Library Design

 

1.  Conversion from text to data type.

 

Internally, converting a text description of a data type into an HDF5 data type object involves several steps, which are illustrated in the following diagram.

 

 

[Diagram not reproduced; its final stage is labeled H5Tcreate.]

                                    Figure 1.  Conversion from text to data type.

 

From the diagram, we can see there are three major steps: text input, text analysis, and parsing.  The input text has to match the format we have defined for the language in use.  We currently support three languages: C, DDL, and Fortran.  A full description of these three formats is in Section VI of this document.  The input text is passed in through the new function H5Ttext_to_type.  A very simple example is “unsigned int”.  For the C language, we defined the text format to look like C source code.  Therefore the text “unsigned int” should be passed in to create a data type of unsigned native integer (H5T_NATIVE_UINT) in the HDF5 library.

 

The text analyzer’s job is to recognize the expressions in the input text according to the predefined rules for each language.  It then creates tokens and symbols representing these expressions and passes them on to the parser.  For the same example above, the text “unsigned int” is read by the text analyzer and matched against two expressions, “unsigned” and “int”.  One token is created for each word and passed to the parser.

 

The parser takes the tokens and symbols generated by the text analyzer and checks whether they are valid according to the language grammar we have predefined.  Once the parser sees a valid expression, it takes an action.  A token for “unsigned” followed by the token for “int” is considered valid C, so an HDF5 action (H5Tcreate) occurs: a data type object is created and its ID is returned.  A token for “float” following the token for “unsigned” is illegal.
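The analyzer and parser steps for the “unsigned int” example can be sketched in plain C as follows.  This is only an illustration under stated assumptions: it is hand-written rather than generated by Lex and Yacc, it covers just three keywords, and the names token_t, next_token, and parse_type_text are hypothetical.  Instead of calling H5Tcreate, the action simply returns the name of the matching predefined type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Token kinds for a tiny subset of the C format (hypothetical names). */
typedef enum { TOK_UNSIGNED, TOK_INT, TOK_FLOAT, TOK_END, TOK_BAD } token_t;

/* Text analyzer: skip blanks, return the token for the next word,
 * and advance *text past that word. */
static token_t next_token(const char **text)
{
    static const struct { const char *word; token_t tok; } words[] = {
        { "unsigned", TOK_UNSIGNED }, { "int", TOK_INT }, { "float", TOK_FLOAT }
    };
    const char *p = *text;
    size_t      len, i;

    p += strspn(p, " \t\n");
    if (*p == '\0') {
        *text = p;
        return TOK_END;
    }
    len = strcspn(p, " \t\n");
    *text = p + len;
    for (i = 0; i < sizeof words / sizeof words[0]; i++)
        if (strlen(words[i].word) == len && strncmp(p, words[i].word, len) == 0)
            return words[i].tok;
    return TOK_BAD;
}

/* Parser: accept "int", "unsigned", "unsigned int", or "float" and return
 * the name of the matching predefined type.  Anything else, for example
 * "unsigned float", is rejected with NULL. */
const char *parse_type_text(const char *text)
{
    token_t     tok;
    const char *name;

    assert(text != NULL);
    tok = next_token(&text);
    if (tok == TOK_UNSIGNED) {
        tok = next_token(&text);
        if (tok == TOK_END)
            return "H5T_NATIVE_UINT";   /* bare "unsigned" */
        if (tok != TOK_INT)
            return NULL;                /* e.g. "unsigned float" */
        name = "H5T_NATIVE_UINT";
    } else if (tok == TOK_INT) {
        name = "H5T_NATIVE_INT";
    } else if (tok == TOK_FLOAT) {
        name = "H5T_NATIVE_FLOAT";
    } else {
        return NULL;
    }
    /* Nothing may follow a complete type. */
    return next_token(&text) == TOK_END ? name : NULL;
}
```

With this sketch, parse_type_text("unsigned int") yields "H5T_NATIVE_UINT", while parse_type_text("unsigned float") yields NULL, mirroring the legal and illegal token sequences described above.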

 

2. Conversion from HDF5 data type to text

 

This part is relatively simple.  Function H5Ttype_to_text produces the text description of an HDF5 data type according to the language format.  It is very similar to what h5dump does, except that h5dump only prints the text in DDL format, whereas H5Ttype_to_text supports C, DDL, and Fortran.

 

                                                                                                                                                                                                                                                                                                   

V.                 Design Issues to be Considered

 

1.  Modules for languages

 

The library currently supports C, DDL, and Fortran formats.  In the future, we may want to support other languages, like C++.  The library design should be modular to make future additions simple.

 

It is possible to provide a tool built with Lex and Yacc.  Library designers and users can simply provide a text file consisting of the syntax and grammar for a certain language.  Running this tool would automatically generate the rules for Lex and the grammar for Yacc, which could then be plugged into the HDF5 library.  This could be the second stage of this project if there is enough demand.

 

2.  Availability of Lex library on systems

 

There is no need for the Yacc library.  The Lex library is optional when compiling this part of the HDF5 library: as long as we define at least one function to override the default Lex function, we do not need to link against any Lex library.  The way we use the Lex and Yacc tools is similar to GNU Autoconf: once we run Lex and Yacc to generate the desired .c and .h files, the generated code is supposed to be portable.
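The text above does not name the function to be overridden.  Assuming it is yywrap, which is the routine that traditional Lex implementations normally take from the Lex library, a minimal definition looks like this:

```c
/* Called by the generated scanner when it reaches the end of its input.
 * Returning 1 tells the scanner there is no further input, so it stops.
 * Supplying this definition ourselves removes the dependency on the
 * Lex library's default version. */
int yywrap(void)
{
    return 1;
}
```

With GNU Flex, the same effect can also be obtained by placing `%option noyywrap` in the scanner specification, in which case no yywrap definition is needed at all.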

 

3.  Different kinds of Lex and Yacc

 

For Lex, there are AT&T Lex, GNU Flex, POSIX Lex, and others.  For Yacc, there are AT&T Yacc, Berkeley Yacc, GNU Bison, and so on.  Each may differ somewhat from the others.  We do not have to handle these syntax differences, because we can simply standardize on GNU Flex and Bison.

 

4.  Error report

 

A good error report is needed if there are errors in the input text.  Both Lex and Yacc have some mechanisms to report errors.  We need to combine them with the library’s own error reporting.
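One possible way to connect the two, sketched here under assumptions, is to have the parser’s error callback record its message so that library code can later push it onto the library error stack.  yyerror is the callback that Yacc-generated parsers conventionally invoke; the buffer, the line counter, and the two accessor functions are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record of the most recent parse error. */
static char last_parse_error[128] = "";

/* Line of the input text being scanned.  A real scanner would update
 * this as it reads; it stays at 1 in this sketch. */
static int current_line = 1;

/* Yacc-generated parsers call yyerror() with a short message
 * (typically "syntax error") when the input does not match the grammar. */
void yyerror(const char *msg)
{
    assert(msg != NULL);
    snprintf(last_parse_error, sizeof last_parse_error,
             "datatype text, line %d: %s", current_line, msg);
}

/* Hypothetical accessors: library code could call these after a failed
 * parse and hand the message to its own error-reporting routines. */
int parse_error_pending(void)
{
    return strlen(last_parse_error) > 0;
}

const char *get_last_parse_error(void)
{
    return last_parse_error;
}
```

Keeping the message in one place lets the library report Lex and Yacc failures in the same form as its other errors, instead of having the generated code print directly to stderr.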

 

5.  Supported data types

 

For C and DDL, we will support all atomic data types and the compound, enumeration, and array types, including their nested cases.  We do not support variable-length and opaque data types because the C language has no equivalent data types for them, and it would be difficult for the text analyzer and parser to distinguish them from other data types.

 

For Fortran, we will only support four atomic types.  Fortran users can still use a DDL description to create data types.

 

 

VI.              Language Formats

 

1.  C

 

The text description for C should be the same as the C language itself, with some minor differences.  For example, the text “unsigned long long” will create the data type H5T_NATIVE_ULLONG.

 

For a complete list of data type definitions in the C language, please refer to Appendix B, Syntax of the C Language, in the book C: A Reference Manual.  The one difference here is the array type, which we added as a data type.

 

The data types we support are defined in BNF as follows.  A trailing “opt” on an element marks that element as optional.

 

type-specifier:

      enumeration-type-specifier

      floating-type-specifier

      integer-type-specifier

      structure-type-specifier

      typedef-name

 

The enumeration types are defined as the following,

 

enumeration-type-specifier:

      enumeration-type-definition

      enumeration-type-reference

 

enumeration-type-definition:

      enum enumeration-tag_opt { enumeration-definition-list }

 

enumeration-type-reference:

      enum enumeration-tag

 

enumeration-tag:

      identifier

 

enumeration-definition-list:

      enumeration-constant-definition

      enumeration-definition-list , enumeration-constant-definition

 

enumeration-constant-definition:

      enumeration-constant

      enumeration-constant = integer-constant

 

enumeration-constant:

      identifier

 

The floating-point data types are defined as,

 

floating-type-specifier:

      float

      double

      long double

 

The integer data types are defined as,

 

integer-type-specifier:

      signed-type-specifier

      unsigned-type-specifier

      character-type-specifier

 

signed-type-specifier:

      short or short int or signed short or signed short int

      int or signed int or signed

      long or long int or signed long or signed long int

      long long or long long int or signed long long or signed long long int

 

unsigned-type-specifier:

      unsigned short int_opt

      unsigned int_opt

      unsigned long int_opt

      unsigned long long int_opt

 

character-type-specifier:

      char

      signed char

      unsigned char

 

The structure data types are defined as,

 

structure-type-specifier:

      structure-type-definition

      structure-type-reference

 

structure-type-definition:

      struct structure-tag_opt { field-list }

 

structure-type-reference:

      struct structure-tag   

 

structure-tag:

      identifier

 

field-list:

      component-declaration

      field-list component-declaration

 

component-declaration:

      type-specifier component-declaration-list ;

 

component-declaration-list:

      component-declarator

      component-declaration-list , component-declarator

 

component-declarator:

      simple-component

      bit-field

 

simple-component:

      declarator

 

bit-field:

      declarator_opt : width

 

width:

      expression

 

The typedef is defined as,

 

typedef-name:

      identifier

 

We also support array as a data type in HDF5, which is different from C.  Below is the definition of array,

 

array-declarator:

      type-specifier simple-declarator_opt [ constant-expression ]

 

simple-declarator:

      identifier

 

 

Below is a list of examples of C data types, 

 

Integer types

            “char”          “unsigned char”

            “short”         “unsigned short”

            “int”           “unsigned”

            “long;”         “unsigned long”

            “long long”     “unsigned long long”

 

Floating-point types

            “float”         “double”         “long double”

 

Structures

            “struct s {int a; float b;};”

            “typedef struct s {int a; float b;} s_t;”

 

Arrays

            “int [16];”

            “typedef struct s {int a; float b;} s_t; s_t [16][32];”

 

Enumerations

            “enum {Bob=0, Elena, Quincey, Frank};”

 

2.  DDL

 

This format is basically the DDL definition for HDF5; see the last chapter of the User’s Guide for HDF5, DDL for HDF5.  The part of the data type definition that concerns this project is as follows,

 

<datatype> ::= <atomic_type> | <compound_type> | <array_type> |
               <variable_length_type>
 
<atomic_type> ::= <integer>  | <float>  | <time>      | <string> |
                  <bitfield> | <opaque> | <reference> | <enum>
 
<integer> ::=  H5T_STD_I8BE     | H5T_STD_I8LE      |
               H5T_STD_I16BE    | H5T_STD_I16LE     |
               H5T_STD_I32BE    | H5T_STD_I32LE     |
               H5T_STD_I64BE    | H5T_STD_I64LE     |
               H5T_STD_U8BE     | H5T_STD_U8LE      |
               H5T_STD_U16BE    | H5T_STD_U16LE     |
               H5T_STD_U32BE    | H5T_STD_U32LE     |
               H5T_STD_U64BE    | H5T_STD_U64LE     |
               H5T_NATIVE_CHAR  | H5T_NATIVE_UCHAR  |
               H5T_NATIVE_SHORT | H5T_NATIVE_USHORT |
               H5T_NATIVE_INT   | H5T_NATIVE_UINT   |
               H5T_NATIVE_LONG  | H5T_NATIVE_ULONG  |
               H5T_NATIVE_LLONG | H5T_NATIVE_ULLONG
 
<float> ::= H5T_IEEE_F32BE   | H5T_IEEE_F32LE     |
            H5T_IEEE_F64BE   | H5T_IEEE_F64LE     |
            H5T_NATIVE_FLOAT | H5T_NATIVE_DOUBLE  |
            H5T_NATIVE_LDOUBLE
 
<time> ::= TBD
 
<string> ::= H5T_STRING { STRSIZE <strsize> ;
               STRPAD <strpad> ;
               CSET <cset> ;
               CTYPE <ctype> ; }  
<strsize> ::= <int_value>
<strpad> ::= H5T_STR_NULLTERM | H5T_STR_NULLPAD | H5T_STR_SPACEPAD
<cset> ::= H5T_CSET_ASCII
<ctype> ::= H5T_C_S1 | H5T_FORTRAN_S1
 
<bitfield> ::= TBD
 
<opaque> ::= H5T_OPAQUE { <identifier> }
 
<reference> ::= H5T_REFERENCE { <ref_type> }
<ref_type> ::= H5T_STD_REF_OBJECT | H5T_STD_REF_DSETREG
 
<compound_type> ::= H5T_COMPOUND { <member_type_def>+ }
<member_type_def> ::= <datatype> <field_name> <offset>opt ;
<field_name> ::= <identifier>
<offset> ::= : <int_value>
 
<variable_length_type> ::= H5T_VLEN { <datatype> }

 

<array_type> ::= H5T_ARRAY { <dim_sizes> <datatype> }
<dim_sizes> ::= `['<dimsize>`]' | `['<dimsize>`]'<dim_sizes>
<dimsize> ::= <int_value>

 

<enum> ::= H5T_ENUM { <enum_base_type> <enum_def>+  }
<enum_base_type> ::= <integer>
// Currently enums can only hold integer type data, but they may be expanded in the future to hold any datatype
<enum_def> ::= <enum_symbol> <enum_val>;
<enum_symbol> ::= <identifier>
<enum_val> ::= <int_value>

 

A few examples of datatypes in DDL are as follows,

 

“H5T_ENUM { H5T_NATIVE_INT;

“Bob”       0;

“Elena”     1;

“Quincey”   2;

“Frank”     3;    }”

 

“H5T_COMPOUND {

      H5T_ARRAY { [4] H5T_STD_I32BE } “int_array”;

      H5T_ARRAY { [5][6] H5T_IEEE_F32BE } “float_array”; }”

 

“H5T_COMPOUND {

      H5T_STD_I16LE     “16_bit”  :  0;

      H5T_STD_U32BE     “32_bit”  :  16; }”
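As an illustration of how one DDL rule maps onto code, the <dim_sizes> rule of the grammar above (as used in “[5][6]” in the array example) can be parsed with a short loop.  This is a hand-written sketch, not the Lex and Yacc code the design calls for; the function name parse_dim_sizes and its parameters are hypothetical, and for simplicity it does not allow white space inside or between the brackets.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Parse text that starts with one or more "[<int_value>]" groups,
 * following
 *     <dim_sizes> ::= '[' <dimsize> ']' | '[' <dimsize> ']' <dim_sizes>
 * The sizes are stored in dims[] and *rest is set to the first character
 * after the last ']'.  Returns the number of dimensions found, or -1 on
 * a syntax error or if more than max_rank dimensions are present. */
int parse_dim_sizes(const char *text, unsigned long dims[], int max_rank,
                    const char **rest)
{
    int rank = 0;

    assert(text != NULL && dims != NULL && rest != NULL);
    while (*text == '[') {
        char *end;

        /* A digit must follow '[' and there must be room for the size. */
        if (rank == max_rank || !isdigit((unsigned char)text[1]))
            return -1;
        dims[rank++] = strtoul(text + 1, &end, 10);
        if (*end != ']')
            return -1;
        text = end + 1;
    }
    *rest = text;
    return rank > 0 ? rank : -1;
}
```

Given the text “[5][6] H5T_IEEE_F32BE”, the function returns 2, stores 5 and 6 in dims, and leaves rest pointing at the remaining text, which a fuller parser would then hand to the <datatype> rule.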

 

3.  Fortran

 

To be decided.

 

 

VII.           Examples

 

The simple example below shows how to create, from C-format text, an array datatype whose elements are of a compound type.  The compound data type has two fields, one integer and one float.  The program then converts the HDF5 data type just created back into a text description.

 

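hid_t    dtype;
size_t   tsize = 0;
char*    text_buf = NULL;
         :

/* Create the data type from C text.  Adjacent string literals are
 * concatenated by the compiler, so the description can span lines. */
if((dtype = H5Ttext_to_type("typedef struct foo {"
                            "    int a;"
                            "    float b;"
                            "} foo_t; "
                            "foo_t [12];"))<0)
    goto error;

/* First call: pass NULL to find out the buffer size needed */
if(H5Ttype_to_text(dtype, NULL, H5T_C, &tsize)<0)
    goto error;

if(tsize>0) {
    /* Allocate the buffer */
    if((text_buf = (char*)calloc(1, tsize))==NULL)
        goto error;

    /* Second call: convert the data type back to text */
    if(H5Ttype_to_text(dtype, text_buf, H5T_C, &tsize)<0)
        goto error;
}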

 

                       

VIII.        API Functions

 

Name: H5Ttext_to_type

Signature:

hid_t  H5Ttext_to_type(const char* str)

Purpose:

Creates an HDF5 datatype from a text description of the data type.

Description:

Given a text description of a data type, this function creates an HDF5 datatype.  The text description has to comply with the format defined for one of the supported languages.  The currently supported languages are C, DDL, and Fortran.  An example of a C text description is

                        typedef struct foo {
                                    int a;
                                    float b;
                         } foo_t;
                         foo_t [12];

When this C definition of a data type is passed in as str, this function creates an HDF5 datatype that is a 12-element array of a compound datatype.  This compound datatype has an integer field and a float field.

Parameters:

const char* str

IN: a character string describing the data type to be created.

Returns:

Returns the datatype ID (non-negative) if successful; otherwise returns a negative value.

 

Name: H5Ttype_to_text

Signature:

herr_t  H5Ttype_to_text(hid_t datatype, char* str, H5T_lang_t lang_type, size_t* len)

Purpose:

Creates a text description of a datatype.

Description:

Given a datatype ID, this function creates a text description of the datatype in the format selected by the language type.  If lang_type is H5T_C, the text description will be in C format.  If it is H5T_DDL, the description will be in HDF5 DDL format.  An example in C format is

                        typedef struct foo {
                                    int a;
                                    float b;
                         } foo_t;
                         foo_t [12];

which is a 12-element array of a compound datatype.  This compound datatype has an integer field and a float field.

 

A preliminary H5Ttype_to_text call (for example with str set to NULL, as in Section VII) can be made to find out the size of the buffer needed.  This value is returned through len.  That value can then be passed in len for a second H5Ttype_to_text call, which will retrieve the actual text description of the data type.

 

If the library finds that len is not big enough for the description, it simply returns the size of the buffer needed through len without writing to the provided buffer.

Parameters:

hid_t datatype

            IN: ID of the datatype to be converted.

char* str

OUT: Buffer for the text description of the data type.

H5T_lang_t lang_type

IN: the language used to describe the data type.  Currently supported languages are H5T_C, H5T_DDL, H5T_FORTRAN.  Other languages might be added later.

size_t* len

            IN/OUT: on input, the size of the buffer str; on output, the size of buffer needed to store the text description.

Returns:

Returns non-negative if successful; otherwise returns a negative value.
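The two-call protocol described for len can be tried out without the HDF5 library by using a stand-in function that follows the same rules.  In the sketch below, mock_type_to_text is a hypothetical replacement for H5Ttype_to_text that always describes the same fixed type, and type_text_alloc is a hypothetical helper showing the caller’s side: query the size, allocate, then fetch the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for H5Ttype_to_text (hypothetical).  It follows the len rules
 * described above: the text is copied only when a buffer is supplied and
 * *len is large enough; in every case *len is set to the size needed. */
static int mock_type_to_text(char *str, size_t *len)
{
    static const char text[] = "int [16];";
    size_t            needed = sizeof text;   /* includes the final NUL */

    if (len == NULL)
        return -1;
    if (str != NULL && *len >= needed)
        memcpy(str, text, needed);
    *len = needed;
    return 0;
}

/* Caller's side of the protocol.  Returns a newly allocated string that
 * the caller must free, or NULL on failure. */
char *type_text_alloc(void)
{
    size_t len = 0;
    char  *buf;

    /* Preliminary call: no buffer, just ask for the size. */
    if (mock_type_to_text(NULL, &len) < 0 || len == 0)
        return NULL;
    if ((buf = calloc(1, len)) == NULL)
        return NULL;
    /* Second call: retrieve the actual text. */
    if (mock_type_to_text(buf, &len) < 0) {
        free(buf);
        return NULL;
    }
    assert(buf[len - 1] == '\0');   /* sanity check on the stand-in */
    return buf;
}
```

Because the required size is always reported through len, a caller that guesses a buffer size can also detect a buffer that is too small by comparing the returned len with the size it passed in.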

 

Name: H5Tcmp

Signature:

int  H5Tcmp(hid_t dtype1, hid_t dtype2)

Purpose:

Compare two data types.

Description:

Given the data type IDs, this function compares two data types.  The library compares data types in the following way,

 

           

Data types
            Comparison

Different types
            H5T_ARRAY > H5T_VLEN > H5T_ENUM > H5T_REFERENCE > H5T_COMPOUND > H5T_OPAQUE > H5T_BITFIELD > H5T_STRING > H5T_TIME > H5T_FLOAT > H5T_INTEGER

Integers
            Big endian > little endian, higher precision > lower precision, greater offset > lesser offset (bit position of the least significant bit), greater least significant padding > lesser least significant padding, greater most significant padding > lesser most significant padding, unsigned > signed.

Floats
            Big endian > little endian, higher precision > lower precision, greater offset > lesser offset (bit position of the least significant bit), greater least significant padding > lesser least significant padding, greater most significant padding > lesser most significant padding, greater bit position of sign bit > lesser bit position of sign bit, greater position of least significant bit of exponent > lesser position of least significant bit of exponent, greater exponent size > lesser exponent size, greater exponent bias > lesser exponent bias, greater mantissa size > lesser mantissa size, most significant bit of mantissa is always 1 > most significant bit of mantissa is implied (normalization), padding set to background value > padding set to 1 > padding set to 0.

Times
            Big endian > little endian, higher precision > lower precision, greater offset > lesser offset (bit position of the least significant bit), greater least significant padding > lesser least significant padding, greater most significant padding > lesser most significant padding.

Strings
            Big endian > little endian, higher precision > lower precision, greater offset > lesser offset (bit position of the least significant bit), greater least significant padding > lesser least significant padding, greater most significant padding > lesser most significant padding, padding with space for extra bytes > padding with nulls > null-terminated.

Bit fields
            Big endian > little endian, higher precision > lower precision, greater offset > lesser offset (bit position of the least significant bit), greater least significant padding > lesser least significant padding, greater most significant padding > lesser most significant padding.

Compounds
            More members > fewer members, greater member name > lesser member name (similar to string comparison), greater member offset > lesser member offset, greater member size > lesser member size.

References
            Big endian > little endian, higher precision > lower precision, greater offset > lesser offset (bit position of the least significant bit), greater least significant padding > lesser least significant padding, greater most significant padding > lesser most significant padding, internal reference > dataset region reference > object reference, data located on disk > data located in memory (for object reference).

Enumerations
            Greater parent > lesser parent, more members > fewer members, greater member name > lesser member name.

Variable-lengths
            String > sequence, data located on disk > data located in memory, greater file object address > lesser file object address (if they are in different files).

Opaques
            H5T_ARRAY > H5T_VLEN > H5T_ENUM > H5T_REFERENCE > H5T_COMPOUND > H5T_OPAQUE > H5T_BITFIELD > H5T_STRING > H5T_TIME > H5T_FLOAT > H5T_INTEGER, greater tag > lesser tag (similar to string comparison).

Arrays
            More dimensions > fewer dimensions, bigger dimensions > smaller dimensions, greater parent > lesser parent.

 

This function provides the convenience of sorting data types although some of the comparisons are arbitrary.  

Parameters:

hid_t dtype1

            IN: ID of the first data type to be compared.

hid_t dtype2

IN: ID of the second data type to be compared.

Returns:

Returns a positive value if the first data type is greater, a negative value if the second data type is greater, and 0 (zero) if they are equal.
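The sorting use case can be sketched with the C library’s qsort.  The code below does not use HDF5: mock_type_t is a hypothetical, much-reduced stand-in for a datatype that records only its class and precision, and mock_type_cmp applies just two of the rules listed above (the class order, then higher precision > lower precision) while keeping the same return convention as H5Tcmp.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical datatype classes, listed from least to greatest so that
 * the enumerator values follow the class order given for H5Tcmp. */
typedef enum {
    CLS_INTEGER, CLS_FLOAT, CLS_TIME, CLS_STRING, CLS_BITFIELD, CLS_OPAQUE,
    CLS_COMPOUND, CLS_REFERENCE, CLS_ENUM, CLS_VLEN, CLS_ARRAY
} type_class_t;

/* Hypothetical stand-in for a datatype: class and precision only. */
typedef struct {
    type_class_t cls;
    unsigned     precision;   /* in bits */
} mock_type_t;

/* Same return convention as the proposed H5Tcmp: positive if a is
 * greater, negative if b is greater, zero if they are equal. */
int mock_type_cmp(const mock_type_t *a, const mock_type_t *b)
{
    assert(a != NULL && b != NULL);
    if (a->cls != b->cls)
        return a->cls > b->cls ? 1 : -1;
    if (a->precision != b->precision)
        return a->precision > b->precision ? 1 : -1;
    return 0;
}

/* Adapter with the signature that qsort() expects. */
static int cmp_for_qsort(const void *a, const void *b)
{
    return mock_type_cmp((const mock_type_t *)a, (const mock_type_t *)b);
}

/* Sort a table of (mock) data types into ascending order. */
void sort_types(mock_type_t *types, size_t n)
{
    qsort(types, n, sizeof types[0], cmp_for_qsort);
}
```

With real datatype IDs, the adapter would instead dereference two hid_t values and pass them to H5Tcmp; the three-way return value is what makes the function usable directly as a sort comparator.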

 

 


Last modified: 23 January 2012 (Links only)