HDF5 Archival Information Package (AIP): A METS Implementation
April 26, 2006
By Peter Cao
NCSA, University of Illinois at Urbana-Champaign
This work is part of the HDF5/SRB project, a joint project with SDSC. The HDF5/SRB project is sponsored by the NCSA/SDSC-led CyberInfrastructure Partnership (CIP) and the National Laboratory for Advanced Data Research (NLADR), an NSF PACI project in support of NCSA-SDSC collaboration.
This document discusses three issues related to the implementation of an HDF5 Archival Information Package (AIP) in a METS file:
- 1) Extracting metadata from an HDF5 file
- 2) Organizing that metadata
- 3) Creating a tool to generate metadata for the HDF5 AIP
This white paper is an informal working document and is subject to change. However, it will provide a framework for discussion and will be used as a guideline for the HDF5 AIP implementation.
HDF5 refers to the Hierarchical Data Format developed by the National Center for Supercomputing Applications (NCSA). AIP refers to one of the three subtypes of the Information Package in the Open Archival Information System (OAIS) Reference Model: the Archival Information Package (AIP), the Submission Information Package (SIP), and the Dissemination Information Package (DIP). The Metadata Encoding & Transmission Standard (METS) is a standard for encoding descriptive, administrative, and structural metadata about objects within a digital library, expressed in XML. METS is being developed by the Digital Library Federation (DLF) and is maintained by the Library of Congress.
Table of Contents
- 2.1 Why choose METS
- 2.2 METS document structure
- 2.3 METS elements and named complex types
- 2.4 A simple example
- 3.1 HDF5 AIP design
- 3.2 The descriptive metadata
- 3.2.1 External descriptive metadata (mdRef)
- 3.2.2 Internal descriptive metadata (mdWrap)
- 3.3 The administrative metadata
- 3.4 The file inventory
- 3.5 The structural map
The Hierarchical Data Format (HDF) formats are multi-object file formats for storing and transferring scientific data. The HDF formats are widely used for scientific data management, and there are enormous amounts of data stored in HDF. Because of their stability and flexibility, HDF products are used as archival formats for many scientific projects, including the Earth Observing System (EOS) and the National Polar-orbiting Operational Environmental Satellite System (NPOESS). With rapidly increasing volumes of data in HDF, scientists are facing problems of effective and efficient data access. Metadata technology will be a key element in simplifying access to this data and ensuring its long-term usability.
Some limited work has been done to explore the issue of archiving data in HDF formats. There was a discussion about HDF as an archive format at the Digital Archive Directions (DADs) Workshop, June 22-26, 1998. The discussion concerned the strengths and weaknesses of HDF as an archive format and the expected relationship with the OAIS reference model. An earlier white paper, titled Thoughts on HDF-EOS Metadata, discussed structural metadata for HDF-EOS. As this discussion and paper illustrate, the complexity of the native structure of HDF file content makes archiving HDF files for long-term use a difficult task. As new technologies such as XML emerge and as metadata standards such as METS mature, archiving complex data such as HDF will become much easier.
Two versions of HDF are currently in widespread use: HDF4 and HDF5. HDF4 is derived from the original HDF format. HDF5 is a completely new product launched in 1998 to take improved advantage of the features of modern computing systems such as high speed CPUs, large amounts of memory, and parallel environments. This paper proposes an HDF5 AIP in METS document files.
Maintaining a library or data pool of HDF5 files requires maintaining metadata about those files. This metadata is necessary for the successful management and use of digital data. In this document we propose METS as the standard for archiving HDF5 metadata.
In the following sections, we will present:
- A brief introduction to the Metadata Encoding & Transmission Standard (METS)
- A METS document file for HDF5 AIP
- h5ingest - a tool to generate metadata for the HDF5 AIP
2. Metadata Encoding & Transmission Standard (METS)
The METS schema is a standard for encoding descriptive, administrative, and structural metadata about objects within a digital library. The schema is expressed using the XML schema language of the World Wide Web Consortium. As the CoverPages site states,
METS is intended to provide a standardized XML format for transmission of complex digital library objects between systems. As such, it can be seen as filling a role similar to that defined for the Submission Information Package (SIP), Archival Information Package (AIP) and Dissemination Information Package (DIP) in the Reference Model for an Open Archival Information System. METS is flexible, modular, extensible, expressive, and is an open (non-proprietary) standard.
2.1 Why choose METS
We recommend METS as the standard for HDF5 metadata for the following reasons. First, METS offers a coherent overall structure for encoding all relevant types of metadata (descriptive, administrative, and structural). Second, METS is a widely accepted standard designed specifically for digital library metadata. Third, because METS is written in XML, tools and software to create and administer METS files are freely available. Fourth, METS has already been adopted by many government and research organizations.
2.2 METS document structure
A METS document consists of six possible sub-elements. Only the structural map is required in a METS document. Other elements are optional. The following is a brief introduction to each of the elements. For details, please visit the METS website at http://www.loc.gov/standards/mets/METSOverview.v2.html
- File Header (optional) - Metadata describing the METS document itself, such as creator, editor, etc.
- Descriptive Metadata (optional) - External descriptive metadata to the METS document, or internal descriptive metadata, or both.
- Administrative Metadata (optional) - Information regarding how the files were created and stored, intellectual property rights, etc.
- File Inventory (optional) - List of all files containing content which make up the electronic version of the digital object.
- Structural Map (required) - Hierarchical structure for the digital library object, and links from the elements of that structure to content files and metadata that pertain to each element.
- Behavior Metadata (optional) - Information to associate executable behaviors with content in the METS object.
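To make the "only the structural map is required" rule concrete, the following is a minimal sketch of a METS skeleton built with Python's standard xml.etree.ElementTree module. The function names, the LABEL value, and the TYPE strings ("logical", "hdf5-file") are illustrative assumptions, not values fixed by METS or by this design.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def mets_tag(name):
    """Return the namespace-qualified tag for a METS element."""
    return f"{{{METS_NS}}}{name}"


def build_minimal_mets(label):
    """Build a METS root that holds only the required structural map."""
    root = ET.Element(mets_tag("mets"), {"LABEL": label})
    # The structural map is the one mandatory section; it wraps a root <div>.
    struct_map = ET.SubElement(root, mets_tag("structMap"), {"TYPE": "logical"})
    # TYPE value below is an assumption chosen for illustration.
    ET.SubElement(struct_map, mets_tag("div"), {"TYPE": "hdf5-file", "LABEL": label})
    return root


if __name__ == "__main__":
    doc = build_minimal_mets("test_hdf5.h5")
    print(ET.tostring(doc, encoding="unicode"))
```

The optional sections (header, descriptive and administrative metadata, file inventory, behavior metadata) would be added as further children of the root element.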
2.3 METS elements and named complex types
There are 36 elements and 13 named complex types defined in the METS schema. The following diagram, generated by Altova XMLSpy, shows part of the METS schema in the schema design view.
2.4 A simple example
The following simple example is taken from http://www.loc.gov/standards/mets/harvard/citation129277.xml. It shows the basic structure of a METS document.
3. METS document for HDF5 AIP
An HDF5 AIP is a package for submission or archival storage of HDF5 files. It contains both a digital object and metadata about that object. The digital object is an HDF5 data file or a group of HDF5 files. The metadata component is an XML file or a set of XML files. METS is used as the primary XML schema for the metadata. The METS document contains a file group, a structural map, and sections for extension schemas that provide descriptive and administrative metadata. In the following sections, we discuss the HDF5 AIP design and HDF5 metadata in more detail.
3.1 HDF5 AIP design
In the proposed AIP design, the HDF5 AIP contains two parts: the data file and the metadata file. In most cases, the raw data is a single HDF5 file. The AIP metadata includes all the metadata described in section 2.2. The four major components of the HDF5 AIP metadata are descriptive metadata, administrative metadata, file inventory, and structural map. This design is based on the concepts of the OAIS Reference Model and the METS metadata model.
A single HDF5 file may in practice be stored as multiple files on disk. For the purpose of this discussion, however, we can speak of any HDF5 file as though it were a single file.
3.2 Descriptive metadata
The descriptive metadata describes what the archived HDF5 data is about. It is contained in a <dmdSec> element. The <dmdSec> element may contain a pointer to external metadata (an <mdRef> element), internally embedded metadata (within an <mdWrap> element), or both. Descriptive metadata is optional in the HDF5 AIP.
3.2.1 External descriptive metadata (mdRef)
An external reference provides a uniform resource identifier (URI), which may be used to retrieve the external metadata. External metadata can be any information that is related to the data file but not included in the METS document. There is no requirement on what kinds of external references should be included; for example, the following metadata reference points to the HDF5 File Format Specification.
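A minimal sketch of such a reference, written in Python with the standard library, might look like the following. The URL is a placeholder rather than the real address of the specification, and the ID, LABEL, and OTHERMDTYPE values are assumptions chosen for illustration; LOCTYPE, MDTYPE, and xlink:href are the attributes METS defines for <mdRef>.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

# Placeholder address; substitute the real location of the specification.
SPEC_URL = "http://example.org/hdf5/H5.format.html"


def build_external_dmd(dmd_id, url, label):
    """Build a <dmdSec> whose only child is an <mdRef> pointing at url."""
    dmd_sec = ET.Element(f"{{{METS_NS}}}dmdSec", {"ID": dmd_id})
    ET.SubElement(dmd_sec, f"{{{METS_NS}}}mdRef", {
        "LOCTYPE": "URL",
        "MDTYPE": "OTHER",
        "OTHERMDTYPE": "HDF5-FORMAT-SPEC",  # assumed label, not a registered type
        "LABEL": label,
        f"{{{XLINK_NS}}}href": url,
    })
    return dmd_sec


if __name__ == "__main__":
    section = build_external_dmd("DMD1", SPEC_URL, "HDF5 File Format Specification")
    print(ET.tostring(section, encoding="unicode"))
```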
3.2.2 Internal descriptive metadata (mdWrap)
An internal metadata wrapper packages descriptive metadata associated with the object either as Base64-encoded binary data or as XML. METS does not require a particular scheme for this description. The following extension schemas for descriptive metadata have been endorsed by the METS Editorial Board for use with METS:
- Dublin Core
- Metadata Object Description Schema (MODS)
- MARCXML MARC 21 Schema (MARCXML)
We use the Metadata Object Description Schema (MODS) element set for the HDF5 AIP because it allows richly detailed descriptions. The following diagram shows the top-level elements in the MODS schema.
A great deal of information can be packed into the descriptive metadata section. We will leave it to the users to decide what information to include in the AIP. The following is an example of how MODS can be used in an HDF5 METS file.
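As a hedged illustration of embedded MODS, the sketch below wraps a small MODS record in <mdWrap>/<xmlData> using Python's standard library. Only two MODS elements (titleInfo/title and abstract) are shown; which elements an archive records is left to its users, and the function name and sample values are assumptions.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("mods", MODS_NS)


def build_mods_dmd(dmd_id, title, abstract):
    """Build a <dmdSec> that embeds a small MODS record via <mdWrap>."""
    dmd_sec = ET.Element(f"{{{METS_NS}}}dmdSec", {"ID": dmd_id})
    md_wrap = ET.SubElement(dmd_sec, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "MODS"})
    xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
    mods = ET.SubElement(xml_data, f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    ET.SubElement(mods, f"{{{MODS_NS}}}abstract").text = abstract
    return dmd_sec


if __name__ == "__main__":
    # Sample values are made up for the demonstration.
    section = build_mods_dmd("DMD2", "Sample HDF5 data set", "Illustrative record only.")
    print(ET.tostring(section, encoding="unicode"))
```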
3.3 Administrative metadata
As mentioned in section 2.2, the administrative metadata element, <amdSec>, contains information regarding how the files were created and stored, intellectual property rights, etc. Administrative metadata is optional in the HDF5 AIP. There are four types of administrative metadata:
- Technical Metadata, <techMD>: information regarding files' creation, format and use characteristics
- Intellectual Property Rights Metadata, <rightsMD>: copyright and license information
- Source Metadata, <sourceMD>: information regarding the analog source from which a digital library object derives
- Digital Provenance Metadata, <digiprovMD>: information regarding source/destination relationships between files, including master/derivative relationships between files and information regarding migrations/transformations employed on files between the original digitization of an artifact and its current incarnation as a digital library object.
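The sketch below shows one way a <techMD> entry could be assembled in Python. This paper does not define a technical metadata schema for HDF5, so the wrapped element names (hdf5Tech, formatVersion, sizeBytes) and the OTHERMDTYPE value are purely hypothetical stand-ins; only <amdSec>, <techMD>, <mdWrap>, and <xmlData> come from METS itself.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def build_tech_md(tech_id, format_version, size_bytes):
    """Build an <amdSec> holding one <techMD> with hypothetical HDF5 fields."""
    amd_sec = ET.Element(f"{{{METS_NS}}}amdSec")
    tech_md = ET.SubElement(amd_sec, f"{{{METS_NS}}}techMD", {"ID": tech_id})
    md_wrap = ET.SubElement(tech_md, f"{{{METS_NS}}}mdWrap", {
        "MDTYPE": "OTHER",
        "OTHERMDTYPE": "HDF5-TECH",  # hypothetical type name
    })
    xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
    # The three elements below are invented for illustration only.
    tech = ET.SubElement(xml_data, "hdf5Tech")
    ET.SubElement(tech, "formatVersion").text = format_version
    ET.SubElement(tech, "sizeBytes").text = str(size_bytes)
    return amd_sec


if __name__ == "__main__":
    print(ET.tostring(build_tech_md("TECH1", "1.6", 2048), encoding="unicode"))
```

The other three types (<rightsMD>, <sourceMD>, <digiprovMD>) follow the same wrapper pattern and would appear after <techMD> inside the same <amdSec>.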
3.4 File inventory
The file section (<fileSec>) contains one or more <fileGrp> elements used to group together related files. A <fileGrp> element lists all of the files which make up a single electronic version of the digital library object. For example, an HDF5 file with two external datasets has a master file (the HDF5 file) and two raw data files:
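A sketch of that inventory, generated with Python's standard library, is given here. The file names, the ID scheme, and the USE labels ("master", "raw data") are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_file_sec(master, raw_files):
    """Build a <fileSec> with one <fileGrp> listing a master file and its raw files."""
    file_sec = ET.Element(f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"ID": "FG1"})
    entries = [("master", master)] + [("raw data", name) for name in raw_files]
    for index, (use, name) in enumerate(entries):
        file_el = ET.SubElement(
            file_grp, f"{{{METS_NS}}}file", {"ID": f"F{index}", "USE": use}
        )
        # Each <file> carries an <FLocat> that says where the content lives.
        ET.SubElement(file_el, f"{{{METS_NS}}}FLocat", {
            "LOCTYPE": "URL",
            f"{{{XLINK_NS}}}href": name,
        })
    return file_sec


if __name__ == "__main__":
    # Hypothetical file names for a master file with two external datasets.
    inventory = build_file_sec("test_hdf5.h5", ["raw_data_1.dat", "raw_data_2.dat"])
    print(ET.tostring(inventory, encoding="unicode"))
```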
3.5 Structural map
The structural map section defines a hierarchical structure of the digital object. METS uses the <div> element to present the hierarchical structure. Each <div> carries attribute information specifying what kind of division it is and may contain multiple METS pointers (<mptr>) and file pointers (<fptr>) to identify the content it includes.
The HDF5 AIP uses the HDF5 XML schema to represent the file hierarchical structure instead of using a nested series of <div> elements for three reasons. First, the hierarchical structure of an HDF5 file can be very complicated. A simple nested series of METS <div> elements is not sufficient to express a complex file's hierarchy. Second, a tool for dumping an HDF5 XML document, h5dump, is already available. Using h5dump to generate this XML dump will save a great deal of implementation work. Third, using a separate file to store the file structure will keep the METS document relatively small.
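The second reason can be sketched as a thin Python wrapper around h5dump, assuming the h5dump executable is installed and on the PATH. The helper names are ours, and the sketch relies only on h5dump's --xml option; it writes the captured output to a separate structure file.

```python
import subprocess


def h5dump_xml_command(hdf5_path):
    """Return the argument list that asks h5dump for XML output."""
    return ["h5dump", "--xml", hdf5_path]


def dump_structure(hdf5_path, xml_path):
    """Run h5dump and save its XML output to xml_path.

    Raises subprocess.CalledProcessError if h5dump exits with an error,
    and FileNotFoundError if h5dump is not installed.
    """
    result = subprocess.run(
        h5dump_xml_command(hdf5_path),
        check=True,
        capture_output=True,
        text=True,
    )
    with open(xml_path, "w", encoding="utf-8") as out:
        out.write(result.stdout)
    return xml_path
```

A call such as dump_structure("test_hdf5.h5", "test_hdf5.xml") would produce the structure file that the METS document then points to.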
The following example uses an <fptr> element to point to the file structure stored in the XML file specified in the file section. The structural map section of the METS document does not contain the actual structure of the HDF5 file; instead, the file structure is stored in an XML file, test_hdf5.xml, that conforms to the HDF5 XML schema.
The HDF5 XML schema defines a valid description for HDF5 files and specifies rules for the structure of HDF5 XML documents. It provides a list of the elements, tags, attributes, and entities contained in an HDF5 XML document and their relationships to each other. NCSA tools, such as h5gen and h5dump, read and write XML that conforms to the rules defined by the schema. Other tools should conform to this standard as well. The following figure presents the components in an HDF5 file based on the HDF5 XML schema.
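A minimal sketch of this pointer arrangement in Python follows. The file ID "HDF5_STRUCTURE" and the TYPE value are assumptions; the key point is that the FILEID attribute of <fptr> must match the ID of a <file> entry in the file section.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

STRUCTURE_FILE_ID = "HDF5_STRUCTURE"  # assumed ID, any unique XML ID would do


def build_structure_pointer(structure_file):
    """Build a METS root whose structMap points at the external structure file."""
    root = ET.Element(f"{{{METS_NS}}}mets")
    # File section: register the XML dump so that it has an ID to point to.
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {"ID": STRUCTURE_FILE_ID})
    ET.SubElement(file_el, f"{{{METS_NS}}}FLocat", {
        "LOCTYPE": "URL",
        f"{{{XLINK_NS}}}href": structure_file,
    })
    # Structural map: a single <div> whose <fptr> refers to that file by ID.
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"TYPE": "hdf5-file"})
    ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": STRUCTURE_FILE_ID})
    return root


if __name__ == "__main__":
    print(ET.tostring(build_structure_pointer("test_hdf5.xml"), encoding="unicode"))
```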
4. h5ingest - a tool for metadata ingestion for the HDF5 AIP
One major task for the HDF5 AIP is to develop a toolkit for metadata ingestion. There are many existing tools for metadata ingestion, but each deals with different metadata. Since the HDF5 AIP uses METS, we will develop a tool called h5ingest that generates a METS document for HDF5 metadata.
The METS website lists several tools and utilities for developing with METS. We chose the METS Java Toolkit to help build h5ingest. The METS Java Toolkit is a Java binding framework for the procedural construction, validation, marshalling, and unmarshalling of METS documents. The toolkit was developed by Harvard University Library and is made available under the terms of the GNU Lesser General Public License (LGPL).
h5ingest is a Java-based visual tool, built using the METS Java Toolkit, for viewing and editing HDF5 METS documents. h5ingest can also be used as a command-line tool to create and validate HDF5 METS documents. The following summarizes the usage of h5ingest.
5. Work plan
This work will be divided into two stages: prototype and production. Because of limited resources, this project will implement only the prototype of the AIP. The production release will depend on the availability of future resources.
5.1 Stage one: prototype release
The GUI editor/viewer of h5ingest is not included in stage one. Users will be able to view and edit HDF5 METS documents with existing XML editors such as Altova XMLSpy and the XRay XML editor. The prototype HDF5 AIP will include the following tasks:
- Publishing the design document of the HDF5 AIP
- Publishing a standard HDF5 METS template file, hdf5_mets_template.xml
- Implementing command line h5ingest with the following options
- -t [-template] Create a new template in command-line mode
- -v [-validate] Validate the document in command-line mode
5.2 Stage two: production release
The production release will add more features, including:
- Adding a GUI editor for HDF5 METS documents
- Updating parts of the METS file. An example is changing the administrative metadata
- Reading the METS file as a remote procedure. An application is the extraction of the METS metadata for loading into a metadata catalog
- Validating the content of an HDF5 AIP. An example is verifying that the structure file is still correctly linked to the AIP.
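The link check named in the last item could be approached along the following lines. This is a Python sketch under our own assumptions, not the planned h5ingest implementation (which is Java-based): it only confirms that every <fptr> FILEID in a METS document resolves to a <file> ID, and does not check that the referenced file exists on disk.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"


def find_broken_pointers(mets_xml):
    """Return the FILEID values of <fptr> elements that match no <file> ID."""
    root = ET.fromstring(mets_xml)
    file_ids = {f.get("ID") for f in root.iter(f"{{{METS_NS}}}file")}
    return [
        pointer.get("FILEID")
        for pointer in root.iter(f"{{{METS_NS}}}fptr")
        if pointer.get("FILEID") not in file_ids
    ]


if __name__ == "__main__":
    sample = (
        '<mets xmlns="http://www.loc.gov/METS/">'
        '<fileSec><fileGrp><file ID="F1"/></fileGrp></fileSec>'
        '<structMap><div><fptr FILEID="F1"/></div></structMap>'
        "</mets>"
    )
    print(find_broken_pointers(sample))
```

An empty result means every pointer resolves; any returned IDs identify links that need repair.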
 - /
 - http://nost.gsfc.nasa.gov/isoas/
 - http://www.loc.gov/standards/mets/
 - Mike Folk, June, 1998. HDF as an Archive Format. http://nost.gsfc.nasa.gov/isoas/dads/DADS16.html
 - Brand Fortner, Doug Ilg, December 1995. Thoughts on HDF-EOS Metadata. http://edhs1.gsfc.nasa.gov/waisdata/docsw/html/wp1700201.html
 - Metadata Object Description Schema (MODS) http://www.loc.gov/standards/mods/
 - The HDF XML Schema http://hdf.ncsa.uiuc.edu/DTDs/HDF5-File.xsd
 - METS Java Toolkit, http://hul.harvard.edu/mets/
 - The CoverPages, http://xml.coverpages.org/mets.html