Checksumming and PGP signing HDF5 releases

James Laird - August 22, 2005

User Request (bug #387)

I was hoping I could get the release process modified to include a md5file checksum and a PGP signature of the release.

This would go a long way to ensure that the distribution that is downloaded is actually the software that was released by the NCSA.

For example:

    hdf5.1.6.4.tar.gz - compressed tarfile
    hdf5.1.6.4.tar.gz.md5 - md5 checksum of the compressed tarfile
    hdf5.1.6.4.tar.gz.asc - PGP signature of the compressed tarfile

This is a common model for ensuring the integrity of open source distributions that are downloaded from multiple mirrors.

Here is a link to the Apache web site that is a good reference:
http://httpd.apache.org/dev/verification.html

Any chance that I can persuade the developers to add this feature to the current release process?

Regards,
Pete Rothermel

What is an MD5 checksum?
Explanation from the MD5 Homepage (unofficial):

MD5 was developed by Professor Ronald L. Rivest of MIT. What it does, to quote the executive summary of rfc1321, is:

[The MD5 algorithm] takes as input a message of arbitrary length and produces as output a 128-bit "fingerprint" or "message digest" of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given prespecified target message digest. The MD5 algorithm is intended for digital signature applications, where a large file must be "compressed" in a secure manner before being encrypted with a private (secret) key under a public-key cryptosystem such as RSA.

In essence, MD5 is a way to verify data integrity, and is much more reliable than checksum and many other commonly used methods.

The "fingerprint" that MD5 produces is (usually) unique to a given message. To use this in HDF5, we would run MD5 on the release tarball and post the resulting file (hdf5.tar.md5) on our website along with the tarball itself. If an attacker attempted to replace the real hdf5.tar archive with a fake, the fake archive would have a different MD5 checksum and this could be detected by the user. A user could also download the hdf5 tarball from a mirror FTP site, then compare that tarball's MD5 checksum against the correct value on the main HDF5 website.
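
As a rough sketch of the user's side of this comparison, the shell function below computes a tarball's MD5 digest and compares it with a value copied from a trusted page. The function name check_md5 and its messages are illustrative assumptions, not part of any HDF5 script.

```shell
#!/bin/sh
# Compare a downloaded tarball's MD5 digest against a published value.
# Usage: check_md5 <file> <expected-hex-digest>
check_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file matches the published checksum"
    else
        echo "MISMATCH: $file has $actual, expected $expected" >&2
        return 1
    fi
}
```

A user who fetched hdf5.tar from a mirror would call check_md5 hdf5.tar followed by the digest shown on the main HDF5 website; a non-zero exit status means the two values differ.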

An MD5 checksum by itself does not protect against an attacker who replaces both the archive and its checksum with fakes, so there must be a secure channel to transmit the correct checksum. However, the checksum file is very small (about 50 bytes), so it is easier to transmit than the full hdf5 tarball. It might be wise to post the MD5 checksum on the HDF5 website in cleartext as well as adding it to the FTP site, to make it more difficult for an attacker to replace it with a false checksum.

MD5 is known to be vulnerable to certain collision attacks, but these would not affect the HDF5 release process; given an existing MD5 checksum, it is still extremely difficult to create another file with the same checksum.

What is a PGP signature?
For an in-depth discussion of PGP key signing, see the comp.security.pgp FAQ.

Briefly, a PGP signature allows a user to vouch for a message's trustworthiness. The signature uses a hash function to ensure that the message has not changed since it was signed (the same principle as MD5). Furthermore, the signature confirms that the message has come from the user who signed it.

PGP signatures rely on asymmetric keys. A signature is created with a private key, while the public key is widely published. Only the user with the private key could have created the signature and encoded the result of the hash function, but anyone can decode the signature to check its validity and that the message is unchanged. This would prevent an attacker from substituting a fake hdf5.tar archive, since the fake archive would have a different hash value--it would be clear that the signature corresponded to a different file. In addition, the attacker could not create a fake signature to accompany the fake archive (as they could with MD5), since they do not have HDF5's private key. The signature would ensure that no-one could masquerade as the HDF group.

PGP's main drawback is that a signature can only be trusted as far as it is known to belong to its owner. If an attacker creates an entirely new key and claims it belongs to HDF, users have no way of knowing which is the correct key and which is the fake. The solution to this problem is to be connected to the "web of trust," which involves having a key signed by others who can vouch for the authenticity of the key. If HDF5 doesn't expend some effort to enter this web of trust (and doesn't have any members with well-known public keys), users may not be able to tell which copy of HDF5 is genuine, but it will be easy to detect that an attack is taking place, since there will be two different keys in existence!
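
One way a user could guard against a substituted key is to check not only that the signature is valid but also that it was made by the key whose fingerprint they obtained through a separate channel. The sketch below assumes GnuPG (gpg) and that the signer's public key has already been imported; the function name verify_release is hypothetical, and the expected fingerprint must come from the user.

```shell
#!/bin/sh
# Verify a detached signature AND confirm which key produced it.
# Usage: verify_release <signature-file> <tarball> <expected-fingerprint>
verify_release() {
    sig=$1
    file=$2
    # Accept fingerprints written with spaces or in lower case.
    expected=$(printf '%s' "$3" | tr -d ' ' | tr 'a-f' 'A-F')

    # --status-fd 1 makes gpg print machine-readable "[GNUPG:]" lines on
    # standard output; a failed verification gives a non-zero exit status.
    status=$(gpg --batch --status-fd 1 --verify "$sig" "$file" 2>/dev/null) || return 1

    # A VALIDSIG line carries the fingerprint of the key that made the signature.
    signer=$(printf '%s\n' "$status" | awk '$2 == "VALIDSIG" { print $3; exit }')

    [ -n "$signer" ] && [ "$signer" = "$expected" ]
}
```

The function succeeds only when gpg accepts the signature and the signing key's fingerprint equals the expected one, so a valid signature made with an attacker's look-alike key is rejected.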

Short-term and long-term costs
First, the MD5 checksum and PGP signature would take a small amount of work to create. Doing this by hand is easy--it took me about ten minutes to create a PGP key, and the commands to create the checksum and signature are one line each. If we wanted to create both a checksum and a signature for every platform-specific tarball, these two lines would need to be run for each tarball. Both of the tools needed (pgp and md5sum) are already installed on heping.

The creation of an MD5 checksum can be automated to take place when the bin/reconfigure script is run. PGP signing should not be automated, any more than a pen-and-paper signature should be automated. It might be easiest to have snapshots involve only an MD5 checksum, while full releases use a PGP signature.
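
To illustrate what the automated MD5 step might look like, the sketch below writes a checksum file next to every tarball in a release directory. It is not taken from bin/reconfigure or any existing HDF5 script; the function name write_md5_files and the directory layout are assumptions.

```shell
#!/bin/sh
# Write a .md5 file next to every tarball in a directory.
# Usage: write_md5_files <release-directory>
write_md5_files() {
    dir=$1
    for tarball in "$dir"/*.tar "$dir"/*.tar.gz; do
        # Skip patterns that matched nothing.
        [ -f "$tarball" ] || continue
        name=$(basename "$tarball")
        # Run md5sum inside the directory so the .md5 file records a bare
        # file name, which lets "md5sum --check" work wherever the two
        # files are downloaded together.
        (cd "$dir" && md5sum "$name" > "$name.md5")
    done
}
```

Recording the bare file name rather than a build-machine path is the main design choice here: a checksum file that names /some/build/path/hdf5.tar would fail to check on a user's machine.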

The resulting MD5 checksum and PGP signature files would have to be distributed along with the main HDF5 tarball. This would probably involve a small alteration to the website and would use a trivial amount of space.

In the long term, HDF5's PGP key would need to be maintained. It would be most appropriate to use a single person's key (rather than a "group" key), so this person would need to examine each release that was signed. Even if we don't worry about the "web of trust" in the short term, users will probably request that we have the key signed by other users eventually, so we may need to have our keyholder attend "key-signing parties" or attend conferences with his or her key and photo ID. We will not be forced to expend any more effort on this than we want to, of course, and a key that has not been signed by other users will still be more useful than no key at all.

There should be few other long-term costs beyond the effort required to create and keep track of the MD5 and PGP files. We will need to ensure that we have working versions of md5sum and gpg (both of which are distributed under the GNU General Public License) and keep track of our PGP keyring.

To verify MD5 checksums or PGP signatures, users would need the appropriate utilities. md5sum and gpg are free for Linux, and utilities exist for Windows. Both MD5 and PGP would create separate files containing the checksum or signature, so users who did not want to deal with these files, or were unable to, could safely ignore them and simply untar the archive as usual.

To create a signature file:

Create a key:
        gpg --gen-key   (follow the instructions)

Create a separate sig file:
        gpg -b --armor hdf5.tar

Which can then be verified (naming both the signature and the signed file) with:
        gpg --verify hdf5.tar.asc hdf5.tar

To create an MD5 checksum:

Find the checksum:
        md5sum hdf5.tar > hdf5.tar.md5

Which can then be verified:
        md5sum --check hdf5.tar.md5

Alternatively, or in addition, we can simply post the output
of md5sum hdf5.tar on our website and let users read it
and compare it themselves.

Last modified: 19 September 2005