NCSA HDF Specification and Developer's Guide

I Introduction

Overview

The Hierarchical Data Format (HDF) was designed to be an easy, straightforward, and self-describing means of sharing scientific data among people, projects, and types of computers. An extensible header and carefully crafted internal layers provide a system that can grow as scientific data-handling needs evolve.

This document, the NCSA HDF Specification and Developer's Guide, fully defines HDF and its interfaces, discusses criteria employed in its development, and provides guidelines for developers working on HDF itself or building applications that employ HDF. This introduction provides a brief overview of HDF capabilities and design.

Why HDF?

A fundamental requirement of scientific data management is the ability to access as much information, in as many ways, as quickly and easily as possible. A data storage and retrieval system that facilitates these capabilities must provide the following features:

Support for scientific data and metadata
Scientific data is characterized by a variety of data types and representations, data sets (including images) that can be extremely large and complex, and the need to attach accompanying attributes, parameters, notebooks, and other metadata. Metadata, supplementary data that describes the basic data, includes information such as the dimensions of an array, the number type of the elements of a record, or a color lookup table (LUT).

Support for a range of hardware platforms
Data can originate on one machine only to be used later on many different machines. Scientists must be able to access data and metadata on as many hardware platforms as possible.

Support for a range of software tools
Scientists need a variety of software tools and utilities for easily searching, analyzing, archiving, and transporting the data and metadata. These tools range from a library of routines for reading and writing data and metadata, to small utilities that simply display an image on a console, to full-blown database retrieval systems that provide multiple views of thousands of sets of data and metadata.

Rapid data transfer
Both the size and the dispersion of scientific data sets require that mechanisms exist to move the data from place to place rapidly.

Extendibility
As new types of information are generated and new kinds of science are done, a means must be provided to support them.

What is HDF?

The HDF Structure

HDF is a self-describing, extensible file format that uses tagged objects with standard meanings. The idea is to store both a known format description and the data in the same file. HDF tags describe the format of the data because each tag is assigned a specific meaning: the tag DFTAG_LUT stands for color palette, the tag DFTAG_RI stands for 8-bit raster image, and so on (see Figure I.1). A program that has been written to understand a certain set of tag types can scan the file for those tags and process the data. Such a program can also ignore any data that is beyond its scope.

Figure I.1 Raster Image Set in an HDF File. The set has three data objects with different tags representing three different types of data. The palette and dimension objects contain metadata.

The set of available data objects encompasses both primary data and metadata. Most HDF objects are machine- and medium-independent, physical representations of data and metadata.
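To make the idea of a tag scan concrete, the fragment below sketches one way such a program might walk through a file. It is an illustration, not part of the specification: it uses low-level routines of the general purpose interface described in Chapter 3 (Hopen, Hstartread, Hinquire, Hnextread), and the file name, header name, and exact argument types shown are assumptions.

    #include <stdio.h>
    #include "hdf.h"    /* assumed name of the HDF library header */

    /* Sketch: list the tag and reference number of every data object
     * in a file.  A program that understands only certain tags would
     * compare each tag against its own list and skip the rest. */
    int main(void)
    {
        int32  fid, aid;
        uint16 tag, ref;
        int32  length;

        fid = Hopen("example.hdf", DFACC_READ, 0);  /* file name assumed */
        if (fid == FAIL)
            return 1;

        aid = Hstartread(fid, DFTAG_WILDCARD, DFREF_WILDCARD);
        if (aid != FAIL) {
            do {
                Hinquire(aid, NULL, &tag, &ref, &length,
                         NULL, NULL, NULL, NULL);
                printf("tag %u, ref %u, %ld bytes\n",
                       (unsigned)tag, (unsigned)ref, (long)length);
            } while (Hnextread(aid, DFTAG_WILDCARD, DFREF_WILDCARD,
                               DF_CURRENT) != FAIL);
            Hendaccess(aid);
        }
        Hclose(fid);
        return 0;
    }

Because every object carries its tag in the file's descriptor records, such a scan never needs to interpret data it does not recognize.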
HDF Tags

The HDF design assumes that we cannot know a priori what types of data objects will be needed in the future, nor can we know how scientists will want to view that data. As science progresses, people will discover new types of information and new relationships among existing data. New types of data objects, and hence new tags, will be created to meet these expanding needs. To avoid unnecessary proliferation of tags and to ensure that all tags are available to potential users who need to share data, a portable public domain library is available that interprets all public tags. The library contains user interfaces designed to provide views of the data that are most natural for users. As we learn more about the ways scientists need to view their data, we can add user interfaces that reflect data models consistent with those views.

Types of Data and Structures

HDF currently supports the most common types of data and metadata that scientists use, including multidimensional gridded data, 2-dimensional raster images, polygonal mesh data, multivariate data sets, finite-element data, non-Cartesian coordinate data, and text. In the future there will almost certainly be a need to incorporate new types of data, such as voice and video, some of which might actually be stored on media other than the central file itself. Under such circumstances, it may become desirable to employ the concept of a virtual file. A virtual file functions like a regular file but does not fit our normal notion of a monolithic sequence of bits stored entirely on a single disk or tape.

HDF also makes it possible for the user to include annotations, titles, and specific descriptions of the data in the file. Thus, files can be archived with human-readable information about the data and its origins.

One collection of HDF tags supports a hierarchical grouping structure called Vset that allows scientists to organize data objects within HDF files to fit their views of how the objects go together, much as a person in an office or laboratory organizes information in folders, drawers, journal boxes, and on desktops.

Backward and Forward Compatibility

An important goal of HDF is to maximize backward and forward compatibility among its interfaces. This is not always achievable, because data formats must sometimes change to enhance performance, to correct errors, or for other reasons. Whenever possible, however, HDF files should not become out of date. For example, suppose a site falls far behind in the HDF standard, so that its users can work only with portions of the specification that are three years old. Users at this site might produce files with their old HDF software, then read them with newer software designed to work with more advanced data files. The newer software should still be able to read the old files. Conversely, if the site receives files that contain objects its HDF software does not understand, it should still be able to list the types of data in the file. It should also be able to access all of the older types of data objects that it understands, even though those objects are mixed in with new kinds of data. In addition, if the more advanced site uses the text annotation facilities of HDF effectively, the files will arrive with complete human-readable descriptions of how to decipher the new tag types.
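As a concrete illustration of those annotation facilities, the following sketch attaches a label and a description to an existing scientific data set. It is an example only, using the DFAN annotation calls described in Chapter 5; the file name, reference number, tag choice, description text, and header name are all assumptions.

    #include <string.h>
    #include "hdf.h"    /* assumed name of the HDF library header */

    /* Sketch: attach a human-readable label and description to the
     * scientific data set whose reference number is `ref'.  All
     * specifics here (file name, text, tag) are illustrative. */
    void annotate(uint16 ref)
    {
        static char desc[] =
            "Hourly wind velocity over the model grid, in m/s.";

        /* DFTAG_NDG identifies an old-style scientific data group. */
        DFANputlabel("example.hdf", DFTAG_NDG, ref, "wind velocity");
        DFANputdesc("example.hdf", DFTAG_NDG, ref,
                    desc, (int32)strlen(desc));
    }

A receiving site whose software cannot yet interpret the data objects themselves can still retrieve annotations like these and learn what the file contains.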
Calling Interfaces

To present a user interface more convenient than a list of tag types with their associated data requirements, HDF supports multiple calling interfaces.

The low-level calling interfaces are used to manipulate tags and raw data, for error handling, and to control the physical storage of data. These interfaces are designed to be used by developers who are providing the higher level interfaces for applications like raster image storage or scientific data archiving.

The application interfaces, at the next level, include several modules specifically designed to simplify the process of storing and accessing specific types of data. For example, the palette interface is designed to handle color palettes and lookup tables, while the scientific data interface is designed to handle arrays of scientific data. If you are primarily interested in reading or writing data to HDF files, you will spend most of your time working with the application interfaces.

The HDF utilities and NCSA applications, at the top level, are special purpose programs designed to handle specific tasks or solve specific problems. The utilities provide a command line interface for data management. The applications provide solutions for problems in specific application areas and often include a graphical user interface. Several third-party applications are also available at this level.

Machine Independence

An important issue in data file design is that of machine independence, or transportability. The HDF design defines standard representations for storing all data types that it supports. When data is written to a file, it is typically written in the standard HDF representation. The conversion is handled by the HDF software and need not concern the user. Users may override this convention and install their own conversion routines, or they may write data to a file in the native format of the machine on which it was generated.

Some History

In 1987 a group of users and software developers at NCSA searched for a file format that would satisfy NCSA's data needs. There were some interesting candidates, but none that were in the public domain, were targeted to scientific data, and yet were sufficiently general and extensible. In the course of several months, borrowing concepts from several existing formats, the group designed HDF.

The first version of HDF was implemented in the spring and summer of 1988. It included a general purpose interface and an 8-bit raster image interface. In the fall of 1988, a scientific data set interface was designed and implemented, enabling HDF users to store multidimensional arrays and related data. Soon thereafter, interfaces were implemented for storing color palettes, 24-bit raster images, and annotations.

In 1989, it became clear that there was a need to support a general grouping structure and unstructured data such as that used to represent polyhedra in graphical applications. This led to Vsets, whose interface routines were implemented as a separate HDF library. Also in 1989, it became clear that the existing general purpose layer was not sufficiently powerful to meet anticipated future needs and that the coding could use a substantial overhaul. Thus began the long process of redesigning the lower layers of HDF. The first version incorporating extended tags and the new lower layers was released in the summer of 1992 as HDF Version 3.2.
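Before turning to the current release, a short sketch may make the application-level interfaces described under "Calling Interfaces" more concrete. The fragment below stores a small two-dimensional array with the scientific data interface and reads it back; conversion to and from the standard HDF representation happens inside the library, as described under "Machine Independence." It is an illustration only: the DFSD calls shown follow the HDF 3.x scientific data set interface, and the file name, header name, and array contents are assumptions.

    #include "hdf.h"    /* assumed name of the HDF library header */

    #define NROWS 20
    #define NCOLS 10

    /* Sketch: write a 2-dimensional float32 array as a scientific
     * data set, then read it back.  The library converts between the
     * machine's native number format and the standard HDF
     * representation; the caller never handles the conversion. */
    int main(void)
    {
        float32 outdata[NROWS][NCOLS], indata[NROWS][NCOLS];
        int32   dims[2];
        intn    rank;
        int     i, j;

        dims[0] = NROWS;
        dims[1] = NCOLS;
        for (i = 0; i < NROWS; i++)
            for (j = 0; j < NCOLS; j++)
                outdata[i][j] = (float32)(i + j);

        /* One call stores the array and its dimension metadata. */
        if (DFSDadddata("example.hdf", 2, dims, (VOIDP)outdata) == FAIL)
            return 1;

        /* Recover the rank and dimensions, then the data itself. */
        if (DFSDgetdims("example.hdf", &rank, dims, 2) == FAIL)
            return 1;
        if (DFSDgetdata("example.hdf", rank, dims, (VOIDP)indata) == FAIL)
            return 1;
        return 0;
    }

Note how the application interface hides both the tags involved (the data, dimension, and group objects of a scientific data set) and the number format conversion.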
This release, HDF Version 3.3, provides alternative physical storage methods (external and linked-block data elements) through extended tags, JPEG data compression, changes to some Vset interface functions, access to netCDF files through a complete netCDF interface,[1] hyperslab access routines for old-style SDS objects, and various performance improvements.

About This Document

This document is designed for software developers who are designing applications or routines for use with HDF files and for users who need detailed information about HDF. Users who are interested in using HDF to store or manipulate their data will not normally need the kind of detail presented in this manual. They should instead consult one of the user-level documents:

Versions 3.2 and earlier
    NCSA HDF Calling Interfaces and Utilities
    NCSA HDF Vset

Version 3.3
    Getting Started with NCSA HDF
    NCSA HDF User's Guide
    NCSA HDF Reference Manual

Someone using third-party software that uses HDF may also have to consult a manual for that software.

Document Contents

The NCSA HDF Specification and Developer's Guide contains the following chapters and appendix:

Chapter 1: Basic Structure of HDF Files
    Introduces and describes the components and organization of HDF files

Chapter 2: Software Overview
    Describes the organization of the software layers that make up the basic HDF library and provides guidelines for writing HDF software

Chapter 3: General Purpose Interface
    Describes the low-level HDF routines that make up the general purpose interface

Chapter 4: Sets and Groups
    Explains the roles of sets and groups in an HDF file, and describes raster image sets, scientific data sets, and Vsets

Chapter 5: Annotations
    Explains the use of annotations in HDF files

Chapter 6: Tag Specifications
    Describes the tag identification space, the extended tag structure, and all of the NCSA-supported tags

Chapter 7: Portability Issues
    Describes the measures taken to maximize HDF portability across platforms and to ensure that HDF routines are available to both C and FORTRAN programs

Appendix A: Tags and Extended Tag Labels
    Presents a list of NCSA-supported HDF tags and a list of labels used with extended tags

Conventions Used in This Document

Most of the descriptive text in this guide is printed in 10 point New Century Schoolbook. Other typefaces have specific meanings that will help the reader understand the functionality being described.

New concepts are sometimes presented in italics on their first occurrence to indicate that they are defined within the paragraph.

Cross references within the specification include the title of the referenced section or chapter enclosed in quotation marks. (E.g., see Chapter 1, "The Basic Structure of HDF Files," for a description of the basic HDF file structure.) References to documents italicize the title of the document. (E.g., see the guide Getting Started with NCSA HDF to familiarize yourself with the basic principles of using HDF.)

Literal expressions and variables often appear in the discussion. Literal expressions are presented in Courier, while variables are presented in italic Courier. A literal expression is any expression that would be entered exactly as presented, e.g., commands, command options, literal strings, and data. A variable is an expression that serves as a placeholder for some other text that would be entered. Consider the expression cp file1 file2. cp is a command name and would be entered exactly as it appears, so it is printed in Courier.
But file1 and file2 are variables, placeholders for the names of real files, so they are printed in italic Courier; the user would enter the actual filenames.

This guide frequently offers sample command lines. Sometimes these are examples of what might be done; other times they are specific instructions to the user. Command lines may appear within running text, as in the preceding paragraph, or on a separate line, as follows:

    cp file1 file2

Command lines always include one or more literal expressions and may include one or more variables, so they are printed in Courier and italic Courier as described above.

Keys that are labeled with more than one character, such as the RETURN key, are identified with all uppercase letters. Keys that are to be pressed simultaneously or in succession are linked with a hyphen. For example, "press CONTROL-A" means to press the CONTROL key then, without releasing the CONTROL key, press the A key. Similarly, "press CONTROL-SHIFT-A" means to press the CONTROL and SHIFT keys then, without releasing either of those, press the A key.

Table I.1 summarizes the use of typefaces in the technical discussion (i.e., everything except references and cross references).

Table I.1 Meaning of entry format notations

    Type                  Appearance        Example     Entry Method
    --------------------  ----------------  ----------  -------------------------------
    Literal expression    Courier           dothis      Enter the expression exactly
    (commands, literal                                  as it appears.
    strings, data)
    Variables             Italic Courier    filename    Enter the name of the file or
                                                        the specific data that this
                                                        expression represents.
    Special keys          Uppercase         RETURN      Press the key indicated.
    Key combinations      Uppercase with    CONTROL-A   While holding down the first
                          hyphens between               one or two keys, press the
                          key names                     last key.

Program listings and screen listings are presented in a boxed display in Courier type, such as in Figure I.2, "Sample Screen Listing." When the listing is intended as a sample that the reader will use for an exercise or model, variables that the reader will change are printed in italic Courier.

Figure I.2 Sample screen listing

    mars_53% ls -F
    MinMaxer/             net.source
    mars_54% cd MinMaxer
    mars_55% ls -F
    list.MinMaxer         minmaxer.v1.04/
    mars_56% cd minmaxer.v1.04
    mars_57% ls -F
    COPYRIGHT             minmaxer.bin/         source.minmaxer/
    README                sample/               source.triangulation/
    mars_58%

[1] NetCDF is a network-transparent derivative of the original CDF (Common Data Format) developed by the National Aeronautics and Space Administration (NASA). It is used widely in atmospheric sciences and other disciplines requiring very large data structures. NetCDF is in the public domain and was developed at the Unidata Program Center in Boulder, Colorado.