NCSA HDF Specification and Developer's Guide

I Introduction

Overview

The Hierarchical Data Format (HDF) was designed to be an easy, straightforward, and self-describing means of sharing scientific data among people, projects, and types of computers. An extensible header and carefully crafted internal layers provide a system that can grow as scientific data-handling needs evolve.

This document, the NCSA HDF Specification and Developer's Guide, fully defines HDF and its interfaces, discusses criteria employed in its development, and provides guidelines for developers working on HDF itself or building applications that employ HDF. This introduction provides a brief overview of HDF capabilities and design.

Why HDF?

A fundamental requirement of scientific data management is the ability to access as much information, in as many ways, as quickly and easily as possible. A data storage and retrieval system that facilitates these capabilities must provide the following features:

Support for scientific data and metadata
Scientific data is characterized by a variety of data types and representations, data sets (including images) that can be extremely large and complex, and the need to attach accompanying attributes, parameters, notebooks, and other metadata. Metadata, supplementary data that describes the basic data, includes information such as the dimensions of an array, the number type of the elements of a record, or a color lookup table (LUT).

Support for a range of hardware platforms
Data can originate on one machine only to be used later on many different machines. Scientists must be able to access data and metadata on as many hardware platforms as possible.

Support for a range of software tools
Scientists need a variety of software tools and utilities for easily searching, analyzing, archiving, and transporting the data and metadata. These tools range from a library of routines for reading and writing data and metadata, to small utilities that simply display an image on a console, to full-blown database retrieval systems that provide multiple views of thousands of sets of data and metadata.

Rapid data transfer
Both the size and the dispersion of scientific data sets require that mechanisms exist to move the data from place to place rapidly.

Extendibility
As new types of information are generated and new kinds of science are done, a means must be provided to support them.

What is HDF?

The HDF Structure

HDF is a self-describing, extensible file format that uses tagged objects with standard meanings. The idea is to store both a known format description and the data in the same file. HDF tags describe the format of the data because each tag is assigned a specific meaning: the tag DFTAG_LUT stands for color palette, the tag DFTAG_RI stands for 8-bit raster image, and so on (see Figure I.1). A program that has been written to understand a certain set of tag types can scan the file for those tags and process the data. Such a program can also ignore any data that is beyond its scope.

Figure I.1 Raster Image Set in an HDF File. The set has three data objects with different tags representing three different types of data. The palette and dimension objects contain metadata.

The set of available data objects encompasses both primary data and metadata. Most HDF objects are machine- and medium-independent, physical representations of data and metadata.
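To make the idea of a tag scan concrete, the fragment below sketches one way such a program might walk through a file. It is an illustration, not part of the specification: it uses low-level routines of the general purpose interface described in Chapter 3 (Hopen, Hstartread, Hinquire, Hnextread), and the file name, header name, and exact argument types shown are assumptions.

    #include <stdio.h>
    #include "hdf.h"    /* assumed name of the HDF library header */

    /* Sketch: list the tag and reference number of every data object
     * in a file.  A program that understands only certain tags would
     * compare each tag against its own list and skip the rest. */
    int main(void)
    {
        int32  fid, aid;
        uint16 tag, ref;
        int32  length;

        fid = Hopen("example.hdf", DFACC_READ, 0);  /* file name assumed */
        if (fid == FAIL)
            return 1;

        aid = Hstartread(fid, DFTAG_WILDCARD, DFREF_WILDCARD);
        if (aid != FAIL) {
            do {
                Hinquire(aid, NULL, &tag, &ref, &length,
                         NULL, NULL, NULL, NULL);
                printf("tag %u, ref %u, %ld bytes\n",
                       (unsigned)tag, (unsigned)ref, (long)length);
            } while (Hnextread(aid, DFTAG_WILDCARD, DFREF_WILDCARD,
                               DF_CURRENT) != FAIL);
            Hendaccess(aid);
        }
        Hclose(fid);
        return 0;
    }

Because every object carries its tag in the file's descriptor records, such a scan never needs to interpret data it does not recognize.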
HDF Tags

The HDF design assumes that we cannot know a priori what types of data objects will be needed in the future, nor can we know how scientists will want to view that data. As science progresses, people will discover new types of information and new relationships among existing data. New types of data objects, and hence new tags, will be created to meet these expanding needs. To avoid unnecessary proliferation of tags and to ensure that all tags are available to potential users who need to share data, a portable public domain library is available that interprets all public tags. The library contains user interfaces designed to provide views of the data that are most natural for users. As we learn more about the ways scientists need to view their data, we can add user interfaces that reflect data models consistent with those views.

Types of Data and Structures

HDF currently supports the most common types of data and metadata that scientists use, including multidimensional gridded data, 2-dimensional raster images, polygonal mesh data, multivariate data sets, finite-element data, non-Cartesian coordinate data, and text. In the future there will almost certainly be a need to incorporate new types of data, such as voice and video, some of which might actually be stored on media other than the central file itself. Under such circumstances, it may become desirable to employ the concept of a virtual file. A virtual file functions like a regular file but does not fit our normal notion of a monolithic sequence of bits stored entirely on a single disk or tape.

HDF also makes it possible for the user to include annotations, titles, and specific descriptions of the data in the file. Thus, files can be archived with human-readable information about the data and its origins.

One collection of HDF tags supports a hierarchical grouping structure called Vset that allows scientists to organize data objects within HDF files to fit their views of how the objects go together, much as a person in an office or laboratory organizes information in folders, drawers, journal boxes, and on desktops.

Backward and Forward Compatibility

An important goal of HDF is to maximize backward and forward compatibility among its interfaces. This is not always achievable, because data formats must sometimes change to enhance performance, to correct errors, or for other reasons. Whenever possible, however, HDF files should not become out of date. For example, suppose a site falls far behind in the HDF standard, so that its users can work only with portions of the specification that are three years old. Users at this site might produce files with their old HDF software, then read them with newer software designed to work with more advanced data files. The newer software should still be able to read the old files. Conversely, if the site receives files that contain objects its HDF software does not understand, it should still be able to list the types of data in the file. It should also be able to access all of the older types of data objects that it understands, even though those objects are mixed in with new kinds of data. In addition, if the more advanced site uses the text annotation facilities of HDF effectively, the files will arrive with complete human-readable descriptions of how to decipher the new tag types.
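As a concrete illustration of those annotation facilities, the following sketch attaches a label and a description to an existing scientific data set. It is an example only, using the DFAN annotation calls described in Chapter 5; the file name, reference number, tag choice, description text, and header name are all assumptions.

    #include <string.h>
    #include "hdf.h"    /* assumed name of the HDF library header */

    /* Sketch: attach a human-readable label and description to the
     * scientific data set whose reference number is `ref'.  All
     * specifics here (file name, text, tag) are illustrative. */
    void annotate(uint16 ref)
    {
        static char desc[] =
            "Hourly wind velocity over the model grid, in m/s.";

        /* DFTAG_NDG identifies an old-style scientific data group. */
        DFANputlabel("example.hdf", DFTAG_NDG, ref, "wind velocity");
        DFANputdesc("example.hdf", DFTAG_NDG, ref,
                    desc, (int32)strlen(desc));
    }

A receiving site whose software cannot yet interpret the data objects themselves can still retrieve annotations like these and learn what the file contains.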
Calling Interfaces

To present a user interface more convenient than a list of tag types with their associated data requirements, HDF supports multiple calling interfaces.

The low-level calling interfaces are used to manipulate tags and raw data, for error handling, and to control the physical storage of data. These interfaces are designed to be used by developers who are providing the higher level interfaces for applications like raster image storage or scientific data archiving.

The application interfaces, at the next level, include several modules specifically designed to simplify the process of storing and accessing specific types of data. For example, the palette interface is designed to handle color palettes and lookup tables, while the scientific data interface is designed to handle arrays of scientific data. If you are primarily interested in reading or writing data to HDF files, you will spend most of your time working with the application interfaces.

The HDF utilities and NCSA applications, at the top level, are special purpose programs designed to handle specific tasks or solve specific problems. The utilities provide a command line interface for data management. The applications provide solutions for problems in specific application areas and often include a graphical user interface. Several third-party applications are also available at this level.

Machine Independence

An important issue in data file design is that of machine independence, or transportability. The HDF design defines standard representations for storing all data types that it supports. When data is written to a file, it is typically written in the standard HDF representation. The conversion is handled by the HDF software and need not concern the user. Users may override this convention and install their own conversion routines, or they may write data to a file in the native format of the machine on which it was generated.

Some History

In 1987 a group of users and software developers at NCSA searched for a file format that would satisfy NCSA's data needs. There were some interesting candidates, but none that were in the public domain, were targeted to scientific data, and yet were sufficiently general and extensible. In the course of several months, borrowing concepts from several existing formats, the group designed HDF.

The first version of HDF was implemented in the spring and summer of 1988. It included a general purpose interface and an 8-bit raster image interface. In the fall of 1988, a scientific data set interface was designed and implemented, enabling HDF users to store multidimensional arrays and related data. Soon thereafter, interfaces were implemented for storing color palettes, 24-bit raster images, and annotations.

In 1989, it became clear that there was a need to support a general grouping structure and unstructured data such as that used to represent polyhedra in graphical applications. This led to Vsets, whose interface routines were implemented as a separate HDF library. Also in 1989, it became clear that the existing general purpose layer was not sufficiently powerful to meet anticipated future needs and that the coding could use a substantial overhaul. Thus began the long process of redesigning the lower layers of HDF. The first version incorporating extended tags and the new lower layers was released in the summer of 1992 as HDF Version 3.2.
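Before turning to the current release, a short sketch may make the application-level interfaces described under "Calling Interfaces" more concrete. The fragment below stores a small two-dimensional array with the scientific data interface and reads it back; conversion to and from the standard HDF representation happens inside the library, as described under "Machine Independence." It is an illustration only: the DFSD calls shown follow the HDF 3.x scientific data set interface, and the file name, header name, and array contents are assumptions.

    #include "hdf.h"    /* assumed name of the HDF library header */

    #define NROWS 20
    #define NCOLS 10

    /* Sketch: write a 2-dimensional float32 array as a scientific
     * data set, then read it back.  The library converts between the
     * machine's native number format and the standard HDF
     * representation; the caller never handles the conversion. */
    int main(void)
    {
        float32 outdata[NROWS][NCOLS], indata[NROWS][NCOLS];
        int32   dims[2];
        intn    rank;
        int     i, j;

        dims[0] = NROWS;
        dims[1] = NCOLS;
        for (i = 0; i < NROWS; i++)
            for (j = 0; j < NCOLS; j++)
                outdata[i][j] = (float32)(i + j);

        /* One call stores the array and its dimension metadata. */
        if (DFSDadddata("example.hdf", 2, dims, (VOIDP)outdata) == FAIL)
            return 1;

        /* Recover the rank and dimensions, then the data itself. */
        if (DFSDgetdims("example.hdf", &rank, dims, 2) == FAIL)
            return 1;
        if (DFSDgetdata("example.hdf", rank, dims, (VOIDP)indata) == FAIL)
            return 1;
        return 0;
    }

Note how the application interface hides both the tags involved (the data, dimension, and group objects of a scientific data set) and the number format conversion.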
This release, HDF Version 3.3, provides alternative physical storage methods (external and linked-block data elements) through extended tags, JPEG data compression, changes to some Vset interface functions, access to netCDF files through a complete netCDF interface,[1] hyperslab access routines for old-style SDS objects, and various performance improvements.

About This Document

This document is designed for software developers who are designing applications or routines for use with HDF files and for users who need detailed information about HDF. Users who are interested in using HDF to store or manipulate their data will not normally need the kind of detail presented in this manual. They should instead consult one of the user-level documents:

Versions 3.2 and earlier
    NCSA HDF Calling Interfaces and Utilities
    NCSA HDF Vset

Version 3.3
    Getting Started with NCSA HDF
    NCSA HDF User's Guide
    NCSA HDF Reference Manual

Someone using third-party software that uses HDF may also have to consult a manual for that software.

Document Contents

The NCSA HDF Specification and Developer's Guide contains the following chapters and appendix:

Chapter 1: Basic Structure of HDF Files
    Introduces and describes the components and organization of HDF files

Chapter 2: Software Overview
    Describes the organization of the software layers that make up the basic HDF library and provides guidelines for writing HDF software

Chapter 3: General Purpose Interface
    Describes the low-level HDF routines that make up the general purpose interface

Chapter 4: Sets and Groups
    Explains the roles of sets and groups in an HDF file, and describes raster image sets, scientific data sets, and Vsets

Chapter 5: Annotations
    Explains the use of annotations in HDF files

Chapter 6: Tag Specifications
    Describes the tag identification space, the extended tag structure, and all of the NCSA-supported tags

Chapter 7: Portability Issues
    Describes the measures taken to maximize HDF portability across platforms and to ensure that HDF routines are available to both C and FORTRAN programs

Appendix A: Tags and Extended Tag Labels
    Presents a list of NCSA-supported HDF tags and a list of labels used with extended tags

Conventions Used in This Document

Most of the descriptive text in this guide is printed in 10 point New Century Schoolbook. Other typefaces have specific meanings that will help the reader understand the functionality being described.

New concepts are sometimes presented in italics on their first occurrence to indicate that they are defined within the paragraph.

Cross references within the specification include the title of the referenced section or chapter enclosed in quotation marks. (E.g., see Chapter 1, "The Basic Structure of HDF Files," for a description of the basic HDF file structure.) References to documents italicize the title of the document. (E.g., see the guide Getting Started with NCSA HDF to familiarize yourself with the basic principles of using HDF.)

Literal expressions and variables often appear in the discussion. Literal expressions are presented in Courier, while variables are presented in italic Courier. A literal expression is any expression that would be entered exactly as presented, e.g., commands, command options, literal strings, and data. A variable is an expression that serves as a placeholder for some other text that would be entered. Consider the expression cp file1 file2. cp is a command name and would be entered exactly as it appears, so it is printed in Courier.
But file1 and file2 are variables, placeholders for the names of real files, so they are printed in italic Courier; the user would enter the actual filenames.

This guide frequently offers sample command lines. Sometimes these are examples of what might be done; other times they are specific instructions to the user. Command lines may appear within running text, as in the preceding paragraph, or on a separate line, as follows:

    cp file1 file2

Command lines always include one or more literal expressions and may include one or more variables, so they are printed in Courier and italic Courier as described above.

Keys that are labeled with more than one character, such as the RETURN key, are identified with all uppercase letters. Keys that are to be pressed simultaneously or in succession are linked with a hyphen. For example, "press CONTROL-A" means to press the CONTROL key then, without releasing the CONTROL key, press the A key. Similarly, "press CONTROL-SHIFT-A" means to press the CONTROL and SHIFT keys then, without releasing either of those, press the A key.

Table I.1 summarizes the use of typefaces in the technical discussion (i.e., everything except references and cross references).

Table I.1 Meaning of entry format notations

    Type                  Appearance        Example     Entry Method
    --------------------  ----------------  ----------  -------------------------------
    Literal expression    Courier           dothis      Enter the expression exactly
    (commands, literal                                  as it appears.
    strings, data)
    Variables             Italic Courier    filename    Enter the name of the file or
                                                        the specific data that this
                                                        expression represents.
    Special keys          Uppercase         RETURN      Press the key indicated.
    Key combinations      Uppercase with    CONTROL-A   While holding down the first
                          hyphens between               one or two keys, press the
                          key names                     last key.

Program listings and screen listings are presented in a boxed display in Courier type, such as in Figure I.2, "Sample Screen Listing." When the listing is intended as a sample that the reader will use for an exercise or model, variables that the reader will change are printed in italic Courier.

Figure I.2 Sample screen listing

    mars_53% ls -F
    MinMaxer/             net.source
    mars_54% cd MinMaxer
    mars_55% ls -F
    list.MinMaxer         minmaxer.v1.04/
    mars_56% cd minmaxer.v1.04
    mars_57% ls -F
    COPYRIGHT             minmaxer.bin/         source.minmaxer/
    README                sample/               source.triangulation/
    mars_58%

[1] NetCDF is a network-transparent derivative of the original CDF (Common Data Format) developed by the National Aeronautics and Space Administration (NASA). It is used widely in atmospheric sciences and other disciplines requiring very large data structures. NetCDF is in the public domain and was developed at the Unidata Program Center in Boulder, Colorado.