UTF-8 Character Encoding in HDF5

James Laird
Robert E. McGrath
Revised 4 May, 2005

Motivation

The NetCDF team would like HDF5 to support strings with UTF-8 Unicode character encoding.

Currently, HDF5 officially supports only strings encoded in standard US ASCII. However, the HDF5 File Format Specification and other documentation are ambiguous, and the library does not check the encoding of strings.

What is UTF-8?

Joel Spolsky has written an introduction to character sets and Unicode entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.”

Briefly: standard ASCII defines characters for byte values between 0 and 127; values 128 through 255 are not defined by standard ASCII. UTF-8 byte values between 0 and 127 represent the same characters as ASCII, so standard ASCII is a subset of UTF-8. In "extended ASCII" encodings, values 128-255 represent different characters depending on which "code page" is currently loaded. UTF-8 also uses these byte values, in multibyte sequences, to represent characters outside of unaccented American English.

The convenient side-effect of this is that any UTF-8 string is also a valid extended ASCII string (a sequence of single bytes), although not necessarily one with the same meaning. Any string consisting of only standard ASCII characters (unaccented American English characters) is identical in ASCII and UTF-8 encodings.
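As an illustration of the point above, a string is byte-for-byte identical in ASCII and UTF-8 exactly when every byte is in the range 0-127. The following is a minimal sketch of such a check; the function name is_pure_ascii is hypothetical and is not part of the HDF5 API.

```c
#include <stddef.h>

/* Hypothetical helper (not an HDF5 function).
 * Returns 1 if every byte of the NUL-terminated string s is in 0..127,
 * meaning the string is identical in ASCII and UTF-8; returns 0 otherwise. */
int is_pure_ascii(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    for (; *p != '\0'; ++p) {
        if (*p > 127)
            return 0;   /* byte outside the standard ASCII range */
    }
    return 1;
}
```

For example, is_pure_ascii("dataset_1") returns 1, while a name containing the two-byte UTF-8 sequence 0xC3 0xA9 (the letter e with an acute accent) returns 0.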

UTF-8 does store some characters as multiple bytes (up to four bytes); every byte of a multibyte character has a value in the range 128-255. NULL-termination, space padding, and C string routines all operate identically on UTF-8 as they do on ASCII characters and strings. The fact that UTF-8 has multibyte characters means that the number of bytes in a string is not necessarily the number of characters in that string, but the number of bytes is usually the important factor for storing and manipulating the string.
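The difference between byte count and character count can be sketched as follows. The standard strlen routine reports bytes, while counting characters requires skipping UTF-8 continuation bytes (those of the form 10xxxxxx). The function name utf8_char_count is hypothetical, and the sketch assumes its input is already well-formed UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an HDF5 function).
 * Counts the characters (code points) in a NUL-terminated, well-formed
 * UTF-8 string by counting every byte that is not a continuation byte. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; ++s) {
        /* Continuation bytes have the bit pattern 10xxxxxx. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the five-byte string "caf\xC3\xA9", strlen returns 5 and utf8_char_count returns 4. Buffer sizes, padding, and termination depend only on the first figure.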

ASCII and UTF-8 strings must be displayed differently for characters outside of the standard ASCII set, but this is the responsibility of the displaying software, not of HDF5. ASCII and UTF-8 strings can be stored and manipulated identically by HDF5.

Proposed action to investigate further:

Write tests to ensure that UTF-8 characters can be used in the library. The tests would ensure that strings including non-ASCII characters don't break any functionality, and can be returned to the user unaltered. This would need to be tested everywhere the library uses character strings. Table 1 gives a list of where strings are used in the HDF5 API.

Table 1. HDF5 Library API String Usage

Object	Uses	Comments
Data (i.e., contents of an attribute or dataset that are an array of strings)	define datatype, get datatype	User data should conform to the encoding specified by the datatype, but the library doesn't check this.
Object names and paths (links to Group, Dataset, Named Datatype)	create object, open object, iterate group, get name from ID
Reference to object or region	create reference	Should be same as path names.
Soft link linkval	set/get value	Should be same as path names.
Attribute names	create, open, get name
Compound datatype field names	define datatype, get datatype, select fields
ENUM type names	define datatype, retrieved from nameof, getmember
opaque data type tag	define datatype, retrieved from gettag
error strings	define error strings, push messages, retrieve stack	Details changed in 1.8, waiting for documentation.
property class name, filter name	register, get name, get by name	Predefined names do not need a UTF-8 option. Probably do not need to support UTF-8 for these.
property name	create property, set/get property value
file names	create, open, get name, is_hdf5, mount/unmount	Depends on file system?
file names: external file, multi file, split file	set, get file names	Depends on file system?
comment on filters	set/get	No need for UTF-8
comment on groups	set/get	No need for UTF-8

What would UTF-8 support mean?

This document assumes that the tests will confirm that using UTF-8 will not break the HDF5 API or library. If so, these tests can be checked in to CVS if appropriate.

Adding support for UTF-8 has several aspects.

Adding means for the user to set and retrieve a character encoding for all the cases in Table 1 that we cover.
Adding a new encoding to the character encodings for a String datatype (first row of Table 1).
Checking the validity of UTF-8 encoded strings, when UTF-8 is selected (optional?)
Updating the Format Specification to correctly specify the use of US ASCII, extended ASCII, and UTF-8.
Documenting the correct use of UTF-8, and backward compatibility issues with existing files and applications.
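If the optional validity check listed above were adopted, it could look roughly like the following sketch. This is an assumption about one possible approach, not a description of library behavior: the function name utf8_is_valid is hypothetical, and the rules applied (reject truncated sequences, stray continuation bytes, overlong forms, surrogates, and values above U+10FFFF) follow the general UTF-8 definition.

```c
#include <stddef.h>

/* Hypothetical helper (not an HDF5 function).
 * Returns 1 if the first len bytes of s are well-formed UTF-8, else 0. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;        /* continuation bytes expected after c */
        size_t k;
        unsigned long cp;    /* decoded code point */
        unsigned long min;   /* smallest code point allowed at this length */

        if (c < 0x80) {                      /* single-byte (ASCII) */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {     /* 110xxxxx: 2-byte sequence */
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {     /* 1110xxxx: 3-byte sequence */
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {     /* 11110xxx: 4-byte sequence */
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;   /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < extra)
            return 0;   /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;   /* expected a 10xxxxxx continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return 0;   /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;   /* surrogate code points are not valid in UTF-8 */
        if (cp > 0x10FFFF)
            return 0;   /* beyond the Unicode range */

        i += extra + 1;
    }
    return 1;
}
```

The function takes an explicit length rather than relying on NUL termination so that it could also be applied to fixed-length, space-padded strings. Whether such a check is worth its cost on every call is the open question flagged by "(optional?)" above.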

An important design decision is which objects in Table 1 will support alternative encodings, and at what granularity the encoding will be specified.

Data values (i.e., data of type String) are easy to handle via a simple extension of the current datatypes. As is the case now, the HDF5 library would not check the encoding of user data values.

A simple option is to have a single file creation property that selects ASCII or UTF-8. Whichever is selected applies to the whole file, except for data values.

Advantages:

simple
feasible

Disadvantages:

inflexible, especially in the future if additional encodings are added

Finer-grained control is more difficult because only a few operations have property lists that can be used to specify the encoding, at least without changing the existing APIs. How this could be done is an open question.

Last modified: 23 January 2012 (Formatting, links, and citations only)