The BioHDF project is a collaborative effort to address the bioinformatics data deluge problem. Based on the established open-source HDF5 binary data storage technology, BioHDF strives to help biologists come to terms with the flood of data that the latest instrumentation can produce. The current focus of BioHDF is on next-generation sequencing (NGS) data storage.
As we envision it, there are three key parts to BioHDF:
- The data model and file organization.
- The C application programming interface (API) and library.
- The command-line tools.
The data model and file organization determine which data will be stored, how they will be arranged in the data file and how they will be queried. Data will be stored as fundamental building blocks such as "sequences", "alignments" and "MS/MS spectra". Unlike most file formats, which are set in stone, BioHDF files are self-describing, flexible and extensible because they are based on HDF5.
The C API and library will provide the basic means for manipulating the data stored in a BioHDF file. C is a useful language for the basic BioHDF API since it allows for easy interfacing with the HDF5 API, can be ported easily to many operating systems and can interoperate with most higher-level languages. Much bioinformatics work is done in higher-level languages, however, so we intend to make the BioHDF API easy to wrap for these languages using packages like SWIG and XS.
Command-line tools are provided for data I/O and manipulation. Interoperability with existing bioinformatics tools will be provided by functions that import data from, and export data to, existing bioinformatics file formats.
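As a rough illustration of the import/export idea, the sketch below reads one record from FASTA text into an in-memory "sequence" building block and writes it back out. FASTA is our choice of example format here, and the names `seq_record`, `seq_import_fasta`, `seq_export_fasta` and `seq_free` are hypothetical: they are not part of the BioHDF API, and nothing is written to an HDF5 file.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory "sequence" building block (not a BioHDF type). */
typedef struct {
    char *name;      /* text of the FASTA header line, without the '>' */
    char *residues;  /* sequence letters with line breaks removed */
} seq_record;

/* Copy n bytes of s into a freshly allocated NUL-terminated string. */
static char *copy_span(const char *s, size_t n)
{
    char *out = malloc(n + 1);
    if (out == NULL) return NULL;
    memcpy(out, s, n);
    out[n] = '\0';
    return out;
}

/* Import: parse the first FASTA record in text.
 * Returns 0 on success, -1 on malformed input or allocation failure. */
int seq_import_fasta(const char *text, seq_record *rec)
{
    assert(text != NULL && rec != NULL);
    rec->name = rec->residues = NULL;
    if (text[0] != '>') return -1;          /* a record must start with '>' */

    size_t name_len = strcspn(text + 1, "\r\n");
    rec->name = copy_span(text + 1, name_len);
    if (rec->name == NULL) return -1;

    const char *p = text + 1 + name_len;
    /* The sequence can never be longer than the remaining input. */
    rec->residues = malloc(strlen(p) + 1);
    if (rec->residues == NULL) {
        free(rec->name);
        rec->name = NULL;
        return -1;
    }

    size_t n = 0;
    for (; *p != '\0' && *p != '>'; p++) {  /* stop at the next record */
        if (*p != '\n' && *p != '\r') rec->residues[n++] = *p;
    }
    rec->residues[n] = '\0';
    return 0;
}

/* Export: write rec as FASTA into buf, wrapping residues at width columns.
 * Returns the number of bytes written (excluding the NUL terminator),
 * or -1 if buf is too small. */
int seq_export_fasta(const seq_record *rec, size_t width, char *buf, size_t cap)
{
    assert(rec != NULL && buf != NULL && width > 0);
    int n = snprintf(buf, cap, ">%s\n", rec->name);
    if (n < 0 || (size_t)n >= cap) return -1;

    size_t used = (size_t)n;
    size_t len = strlen(rec->residues);
    for (size_t i = 0; i < len; i += width) {
        size_t chunk = (len - i < width) ? len - i : width;
        if (used + chunk + 2 > cap) return -1;  /* chunk + '\n' + NUL */
        memcpy(buf + used, rec->residues + i, chunk);
        used += chunk;
        buf[used++] = '\n';
    }
    buf[used] = '\0';
    return (int)used;
}

/* Release the strings owned by rec. */
void seq_free(seq_record *rec)
{
    free(rec->name);
    free(rec->residues);
    rec->name = rec->residues = NULL;
}
```

A caller would pass the contents of a FASTA file to `seq_import_fasta`, work with the resulting record, and call `seq_export_fasta` with a line width such as 60 to produce text that other tools can read. Keeping the interface to plain C strings and a simple struct is also the kind of shape that wrapper generators tend to handle well.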
We believe that a key factor in the success of BioHDF is the participation of interested parties in the development of the data model and API. If you are drowning in data and would like to take part in the development of BioHDF, we encourage you to follow our progress on this website and to subscribe to our mailing list (contact link on the left). We welcome your input!