UTF-8 Character Encoding in HDF5

James Laird
Robert E. McGrath
Revised 4 May, 2005

Motivation

The NetCDF team would like HDF5 to support strings with UTF-8 Unicode character encoding.

Currently, HDF5 officially supports only strings encoded in standard US ASCII. However, the HDF5 File Format Specification and other documentation are ambiguous, and the library does not check the encoding of strings.

What is UTF-8?

Joel Spolsky has written an introduction to character sets and Unicode entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.”

Briefly: standard ASCII defines characters for byte values between 0 and 127. Values 128 through 255 are not defined by standard ASCII. UTF-8 byte values between 0 and 127 represent the same characters as ASCII--standard ASCII is a subset of UTF-8. In "extended ASCII" encodings, values 128-255 represent different characters depending on which "code page" is currently loaded. UTF-8 uses these same byte values, in multibyte sequences, to represent characters outside of unaccented American English.
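
The code page dependence can be illustrated with a single byte. The following is a minimal sketch, not part of HDF5; it uses only Python's built-in codecs, and the two code pages chosen (ISO 8859-1 and Windows-1251) are arbitrary examples.

```python
# One byte value above 127, interpreted under different encodings.
raw = b"\xe9"

as_latin1 = raw.decode("latin-1")  # ISO 8859-1, Western European
as_cp1251 = raw.decode("cp1251")   # Windows-1251, Cyrillic

# The two code pages assign different characters to the same byte.
print(repr(as_latin1), repr(as_cp1251))

# On its own, the same byte is not a complete UTF-8 sequence.
try:
    raw.decode("utf-8")
    valid_alone = True
except UnicodeDecodeError:
    valid_alone = False

print(valid_alone)
```

In UTF-8, a byte in the range 128-255 only has meaning as part of a multibyte sequence, so the character depends on the surrounding bytes rather than on a loaded code page.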

A convenient side effect of this is that any UTF-8 string is also a valid string of 8-bit (extended ASCII) bytes, although not necessarily one with the same meaning. Any string consisting of only standard ASCII characters (unaccented American English characters) is identical in ASCII and UTF-8 encodings.
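
Both halves of this claim can be checked directly. This is an illustrative sketch only; the names `ascii_name` and `accented` are made up for the example.

```python
# An ASCII-only string has the same bytes in both encodings.
ascii_name = "dataset_01"
same_bytes = ascii_name.encode("ascii") == ascii_name.encode("utf-8")

# A UTF-8 string read under an 8-bit code page still decodes,
# but the non-ASCII character changes meaning.
accented = "caf\u00e9"                  # "cafe" with an acute accent on the e
utf8_bytes = accented.encode("utf-8")   # 5 bytes: the last character takes 2
misread = utf8_bytes.decode("latin-1")  # 5 characters, last two are wrong

print(same_bytes, utf8_bytes, repr(misread))
```

Software that treats the bytes as opaque stores and returns them unchanged; only software that displays them needs to know which encoding was intended.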

UTF-8 does store some characters as multiple bytes (up to four bytes); all multibyte characters use byte values of 128-255. NULL-termination, space padding, and C string routines all operate identically on UTF-8 as they do on ASCII characters and strings. The fact that UTF-8 has multibyte characters means that the number of bytes in a string is not necessarily the number of characters in that string, but the number of bytes is usually the important factor for storing and manipulating the string.
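
A short sketch of these properties follows. The helper `c_strlen` is a hypothetical stand-in that mimics what C's `strlen` would report; it is not an HDF5 or Python library function.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C strlen: count the bytes before the first NUL byte."""
    return buf.index(0)

text = "na\u00efve \u20ac5"      # contains one 2-byte and one 3-byte character
encoded = text.encode("utf-8")

char_count = len(text)           # number of characters
byte_count = len(encoded)        # number of bytes actually stored

# NULL-termination: no byte of a multibyte character is zero,
# so the terminator is still the first zero byte.
terminated = encoded + b"\x00"

# Space padding to a fixed width and stripping it again is lossless.
padded = encoded.ljust(16, b" ")

# Encoded width of a 1-, 2-, 3-, and 4-byte character.
widths = {ch: len(ch.encode("utf-8"))
          for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e")}

print(char_count, byte_count, c_strlen(terminated))
print(padded.rstrip(b" ") == encoded, sorted(widths.values()))
```

The character count and byte count differ, but every byte-oriented operation (length, termination, padding) behaves as it would for an ASCII string.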

ASCII and UTF-8 strings must be displayed differently for characters outside of the standard ASCII set, but this is the responsibility of the displaying software, not of HDF5. ASCII and UTF-8 strings can be stored and manipulated identically by HDF5.

Proposed action to investigate further:

Write tests to ensure that UTF-8 characters can be used in the library. The tests would ensure that strings including non-ASCII characters don't break any functionality and can be returned to the user unaltered. This would need to be tested everywhere the library uses character strings. Table 1 lists where strings are used in the HDF5 API.

Table 1. HDF5 Library API String Usage

Object | Uses | Comments
-------|------|---------
Data (i.e., contents of an attribute or dataset that are an array of strings) | define datatype, get datatype | User data should conform to the encoding specified by the datatype, but the library doesn't check this.
Object names and paths (links to Group, Dataset, Named Datatype) | create object, open object, iterate group, get name from ID |
Reference to object or region | create reference | Should be same as path names.
Soft link linkval | set/get value | Should be same as path names.
Attribute names | create, open, get name |
Compound datatype field names | define datatype, get datatype, select fields |
ENUM type names | define datatype, retrieved from nameof, getmember |
Opaque datatype tag | define datatype, retrieved from gettag |
Error strings | define error strings, push messages, retrieve stack | Details changed in 1.8, waiting for documentation.
Property class name, filter name | register, get name, get by name | Predefined names do not need a UTF-8 option. Probably do not need to support UTF-8 for these.
Property name | create property, set/get property value |
File names | create, open, get name, is_hdf5, mount/unmount | Depends on file system?
File names: external file, multi file, split file | set, get file names | Depends on file system?
Comment on filters | set/get | No need for UTF-8
Comment on groups | set/get | No need for UTF-8
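
The proposed tests amount to a byte-for-byte round-trip check at each place where the library accepts a string. The sketch below shows only the shape of such a check. Everything in it is hypothetical: `find_altered`, `SAMPLE_NAMES`, and the two stand-in functions are invented for illustration, and a real test would replace the stand-ins with actual library calls (for example, creating an object and then getting its name back).

```python
# Sample names covering 1-, 2-, 3-, and 4-byte UTF-8 characters.
SAMPLE_NAMES = [
    "plain_ascii",
    "temp\u00e9rature",   # contains a 2-byte character
    "\u6e29\u5ea6",       # two 3-byte characters
    "clef_\U0001d11e",    # contains a 4-byte character
]

def find_altered(round_trip, names=SAMPLE_NAMES):
    """Return the names whose UTF-8 bytes are changed by round_trip()."""
    altered = []
    for name in names:
        original = name.encode("utf-8")
        if round_trip(original) != original:
            altered.append(name)
    return altered

def identity_round_trip(raw: bytes) -> bytes:
    """Stand-in for code that stores and returns bytes untouched."""
    return raw

def ascii_only_round_trip(raw: bytes) -> bytes:
    """Stand-in for code that mangles every byte above 127."""
    return bytes(b if b < 0x80 else ord("?") for b in raw)

print(find_altered(identity_round_trip))
print(find_altered(ascii_only_round_trip))
```

A code path that treats strings as opaque bytes reports no altered names; one that assumes 7-bit ASCII fails on every non-ASCII sample.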

What would UTF-8 support mean?

This document assumes that the tests will confirm that using UTF-8 will not break the HDF5 API or library. If so, these tests can be checked in to CVS if appropriate.

Adding support for UTF-8 has several aspects.

  1. Adding means for the user to set and retrieve a character encoding for all the cases in Table 1 that we cover.
  2. Adding a new encoding to the character encodings for a String datatype (first row).
  3. Checking the validity of UTF-8 encoded strings when UTF-8 is selected (optional?).
  4. Updating the Format Specification to correctly specify the use of US ASCII, extended ASCII, and UTF-8.
  5. Documenting the correct use of UTF-8, and backward compatibility issues with existing files and applications.

An important design decision is which objects in Table 1 will support alternative encodings, and at what granularity the encoding will be specified.
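
For the optional validity check (item 3 in the list above), the logic is small. The following is a minimal sketch, assuming a strict decoder is an acceptable definition of "valid"; `is_valid_utf8` is a hypothetical helper, not an HDF5 function, and the library itself would need an equivalent check in C.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 under Python's strict decoder."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

candidates = {
    "ascii": b"group_1",
    "three-byte character": "\u20ac".encode("utf-8"),
    "lone high byte": b"\xe9",
    "truncated sequence": b"\xe2\x82",
    "overlong encoding": b"\xc0\x80",
}

for label, raw in candidates.items():
    print(label, is_valid_utf8(raw))
```

Python's strict decoder also rejects overlong forms and encoded surrogates, which is stricter than only checking lead and continuation byte patterns. Whether HDF5 should be that strict, or check at all, is part of the open question.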

Data values (i.e., data of type String) are easy to handle via a simple extension of the current datatypes. As is currently the case, the HDF5 library would not check the encoding of user data values.

A simple option is to have a single file creation property that selects ASCII or UTF-8 for the whole file (except for data values).

Finer-grained control is more difficult because only a few operations have property lists that can be used to specify the encoding, at least without changing the existing APIs. How this could be done is an open question.


Last modified: 23 January 2012 (Formatting, links, and citations only)