When and why should I consider switching from a directory/file structure in a file system to using an HDF5 smart data container?
Never change a winning team. If you are happy with your current solution, don't bother! (If you aren't happy with your solution, we'd like to hear more about that...)
If, with a simplistic definition such as this,
"A file system is something that stores named byte streams (files) which can be logically organized into (named) directories"
we make three substitutions:
file system -> HDF5 container,
files -> datasets,
directories -> groups,
then we arrive at a workable first description of HDF5. However, the difference is more than syntactic.
The question can be addressed at any level of technical detail. At the lowest level of detail, the distinction can be summarized as follows:
- An HDF5 file is a container. A file system is not a container. (The definition of 'container' implies portability.)
- A file in a file system is an uninterpreted stream of bytes. An HDF5 dataset in an HDF5 container is well defined HDF5 object.
- File systems do not have a portable layer for user metadata, HDF5 containers do.
Let's elaborate on that:
An HDF5 data container is a standardized, highly customizable data receptacle designed for portability. Unless your definition of 'container' is extremely broad, file systems are not commonly considered containers.
File systems aren't portable: For example, you might be able to mount an NTFS file system on an AIX machine, but the integers or floating-point numbers written on an Intel processor will turn out to be garbage when read on an IBM Power processor, because the two architectures use different byte orders.
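The "garbage" effect comes from byte order, and it can be reproduced on a single machine. The following is a minimal sketch using only Python's standard `struct` module; it is an illustration of the problem, not something taken from the HDF5 library. It assumes the writer is a little-endian Intel machine and the reader interprets the same bytes as big-endian, as a Power machine running AIX would.

```python
import struct

# The four bytes a little-endian (Intel) machine writes for the 32-bit integer 1.
raw = struct.pack("<i", 1)

# Reading with the writer's byte order recovers the original value.
as_written = struct.unpack("<i", raw)[0]

# Reading the very same bytes as big-endian yields a different number.
misread = struct.unpack(">i", raw)[0]

print(as_written)  # 1
print(misread)     # 16777216
```

A plain file carries no record of which byte order was used, so the reader has to know it out of band. A self-describing format stores that information alongside the data.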
HDF5 achieves portability by separating its "cargo" (data) from its environment (file system, processor architecture, etc.) and by encoding it in a self-describing file format. The HDF5 library serves the dual purpose of being a parser/encoder of this format and an API for user-level objects (datasets, groups, attributes, etc.).
Although it is possible to emulate a file system in an HDF5 file, i.e., store raw byte streams (files) in HDF5 datasets and organize them in HDF5 groups (directories), this is not what people commonly do with HDF5 containers.
The data stored in HDF5 datasets is both shaped and typed. Datasets have (logically) the shape of multi-dimensional rectilinear arrays. All elements in a given dataset are of the same type, and HDF5 has one of the most extensive type systems, one that is also user-extensible.
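As a rough illustration of "shaped and typed", here is a sketch that assumes the third-party `h5py` and `numpy` Python packages are installed. The file name, the dataset names (`temperature`, `records`), and the record layout are hypothetical. The file is kept in memory through the `core` driver so that nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory HDF5 file: the "core" driver with backing_store=False writes nothing to disk.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    # A 3 x 4 dataset of little-endian 32-bit floats (hypothetical name and shape).
    temperature = f.create_dataset("temperature", shape=(3, 4), dtype="<f4")
    temperature[...] = np.arange(12, dtype="<f4").reshape(3, 4)

    # A user-defined compound (record) type: every element has an id and a value.
    record_type = np.dtype([("id", "<i4"), ("value", "<f8")])
    records = f.create_dataset("records", shape=(2,), dtype=record_type)
    records[0] = (7, 0.5)

    # Shape and type are stored in the file and can be queried by any reader.
    shape = temperature.shape
    element_type = temperature.dtype
    corner = float(temperature[2, 3])
    field_names = records.dtype.names
    first_id = int(records[0]["id"])

print(shape, element_type, corner, field_names, first_id)
```

Because the element type, including its byte order, is recorded in the file, a reader on a different architecture gets the same values back without extra conversion code.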
The (portable) metadata layer of an HDF5 container consists mainly of HDF5 groups and HDF5 attributes. Yes, groups are akin to directories in file systems, but in HDF5 they can be decorated with arbitrary user-defined attributes. (In a file system, all there is in terms of metadata are date/time stamps, access control lists, etc. Have you ever thought about decorating a directory with a calibration parameter or other typed invariants that apply to all the files in that directory?) The same goes for HDF5 datasets. They can be decorated with arbitrary user-defined attributes. File systems store file and directory metadata, not user metadata.
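The calibration-parameter idea can be sketched as follows, again assuming the third-party `h5py` and `numpy` packages. The group name `run_001`, the dataset name `counts`, and the attribute names and values are all hypothetical and chosen only for illustration.

```python
import h5py
import numpy as np

# In-memory HDF5 file, so the sketch leaves nothing on disk.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    # A group plays the role of a directory ...
    run = f.create_group("run_001")
    # ... but it can carry a typed, user-defined attribute that applies to everything in it.
    run.attrs["calibration_gain"] = np.float64(1.25)

    # A dataset inside the group, decorated with its own attribute.
    counts = run.create_dataset("counts", data=np.array([10, 20, 30], dtype="<i4"))
    counts.attrs["units"] = "counts per second"

    # Data and metadata travel together and are read back through the same API.
    gain = run.attrs["calibration_gain"]
    units = counts.attrs["units"]
    calibrated = counts[...] * gain

print(gain, units, calibrated)
```

The attribute lives in the same container as the data it describes, so copying the file to another machine copies the calibration parameter with it.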
Returning to the three main points and the original question, here's some food for thought:
If you are looking for portability, a file system's got nothing to do with it. (It's the thumb drive that's portable, not the file system on it.) If a TAR archive works for you as a container, no need to change it!
If you are looking to type and shape the data in a file, a file format is a first step. There's no other vehicle in a file system to achieve that, other than non-portable extensions such as NTFS streams, extended attributes, etc.
Creating your own file format is easy. Maintaining it is another story. You can still have your own file format, by using HDF5 as a "file format toolkit."
If you need a portable user metadata layer, the file system story is a short one: there is none. HDF5 has the facilities to tell your user metadata story, and keep data and metadata together.
Last modified: 11 March 2014