Integrating HDF5 with the SRB to Achieve Fast Access to High Volume, Remote Data
(An NCSA-SDSC Collaboration)
Imagine that you have ...just produced an HDF5 file that contains 100 10-gigabyte images from a simulation on a remote server. You want to have a quick look to see if a certain event occurred in the lower right-hand corner of a specific image. You have a nice little program that does this. You could ftp the file back to your site and run your local program to view the result. This may take many hours, if you're lucky, and would use up valuable space on your local disk. Imagine instead that you can run that same little quick-look program immediately, on the dataset right where it sits. You get your information immediately, decide what to do next, and go from there, perhaps changing some parameters and submitting another run.
This is one of the services that the HDF5-SRB model provides for you. Scientists who store their data in HDF5 and manage it in the SRB repository used to have to transfer the whole file to whatever site was doing post-processing on their data, even if that post-processing needed to access only a small part of the file. Now, there is a new, experimental technology that allows clients to access data in place at remote sites, without having to endure potentially long and error-prone network transfers, and without having to make local storage available for duplicate copies of the data. If this looks like something you might want to do, read on.
The Hierarchical Data Format (HDF5) ...is a general purpose library and file format that offers a wide range of features to help acquire, organize, query, and access challenging data. HDF5 is designed for high volume, complex data of any kind. From large simulations on massively parallel systems to complex experiments involving heterogeneous collections of diverse data, HDF5 provides the kinds of efficient storage and I/O methods needed by scientists and engineers working in high performance, data intensive computing environments. HDF5 is supported by a suite of free and open source software, including the HDF5 I/O library, HDFView, and several utilities. Many additional products, both free and commercial, also support HDF5. HDF5 has over two million users across a variety of engineering, scientific, and even some non-technical fields. Data stored in HDF5 is used for a wide range of applications, from computational fluid dynamics to film making. For more about HDF5, go to HDF5.
The Storage Resource Broker (SRB) ...is data management middleware that provides users with a global virtual file system for accessing remote heterogeneous storage resources across the network. It hides from users differences such as physical location, protocols, and authentication across administrative domains. The global user name space of the SRB provides users with single sign-on to the system. It provides many useful data management functions, including data replication across resources, synchronization of data stored in various resources, and movement of data across resources for staging and archival purposes. It provides many performance enhancement features, including parallel data transfer, third party transfer, and bulk transfer (for transferring a large number of small files). The Metadata Catalog (MCAT) of the SRB provides support for system and application level metadata. The application level metadata is free-form and user-defined, providing users with a powerful browsing and discovery capability. The software is being used by over a hundred institutions around the world.
The HDF5-SRB model ...was proposed by NCSA and SDSC to take advantage of two powerful and complementary data management technologies, HDF5 and the SRB. The HDF5-SRB model uses the SRB as middleware to transfer data between the server and client, and HDF5 serves as the file format for storing and accessing data content. The HDF5-SRB model uses a standard set of data objects to provide object-level data access, so that a client does not have to retrieve the whole file. The following figure shows the basic architecture of the HDF5-SRB model.
* The project is sponsored by the NCSA/SDSC-led Cyber-Infrastructure Partnership (CIP) and the National Laboratory for Advanced Data Research (NLADR), an NSF PACI project in support of NCSA-SDSC collaboration.
FIGURE 1 -- A SIMPLIFIED VIEW OF THE HDF5-SRB MODEL
The benefits of the HDF5-SRB model include:
- Central data management. Data replication, synchronization, authentication, and other functions are all provided by the SRB system. Also, since the SRB is a distributed storage system, large volumes of data can be stored and shared by many users.
- Efficient data storage and I/O. Data stored in HDF5 can be compressed and accessed efficiently through the HDF5 library APIs.
- Support for subsetting. Users will be able to retrieve part of a dataset, such as a slice cut from a large multidimensional dataset.
- Different resolutions of data content. A data array such as a 100,000 by 100,000 matrix can be too large to fit into memory or to display in a visualization tool. Using the HDF5-SRB model, users will be able to load the whole array at a lower resolution and then select part of it for full-resolution display.
- Browsing data objects and metadata. Users will be able to browse data objects by examining the structure of the file, without loading the data content. This can be very efficient for files with large numbers of data objects, or with very large objects.
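The subsetting and reduced-resolution ideas in the list above can be illustrated with a small, self-contained sketch. It does not use the HDF5 library or the SRB: a plain row-major binary file stands in for a stored dataset, and the names `write_matrix`, `read_window`, and `demo` are hypothetical helpers invented for this illustration. The point is only that a reader who knows the layout can seek to the bytes it needs rather than loading the whole file.

```python
import os
import struct
import tempfile

# Assumption for this sketch: each element is a little-endian 32-bit integer.
ITEM = struct.Struct("<i")


def write_matrix(path, rows, cols):
    """Write a rows x cols row-major matrix where element (r, c) = r * cols + c."""
    with open(path, "wb") as f:
        for r in range(rows):
            values = range(r * cols, (r + 1) * cols)
            f.write(struct.pack("<%di" % cols, *values))


def read_window(path, cols, row0, col0, nrows, ncols, step=1):
    """Read an nrows x ncols window starting at (row0, col0).

    Only every `step`-th row of the window is visited, and for each visited
    row the file is positioned with seek() so that rows outside the selection
    are never read. Within a row, the contiguous span is read and then
    strided in memory.
    """
    window = []
    with open(path, "rb") as f:
        for r in range(row0, row0 + nrows, step):
            f.seek((r * cols + col0) * ITEM.size)
            raw = f.read(ncols * ITEM.size)
            row = struct.unpack("<%di" % ncols, raw)
            window.append(list(row[::step]))
    return window


def demo():
    """Build a small 8 x 8 file, then read a corner and a coarse overview."""
    rows, cols = 8, 8
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        write_matrix(path, rows, cols)
        # Subsetting: the 2 x 2 lower right-hand corner at full resolution.
        corner = read_window(path, cols, 6, 6, 2, 2)
        # Lower resolution: the whole matrix, every 4th element per dimension.
        overview = read_window(path, cols, 0, 0, rows, cols, step=4)
    finally:
        os.remove(path)
    return corner, overview


if __name__ == "__main__":
    corner, overview = demo()
    print("corner:", corner)
    print("overview:", overview)
```

In a real deployment the selection would be expressed through the HDF5 library on the server side and only the selected values would travel over the network; the sketch mirrors that access pattern on a local file.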