Hashes come in many different classes for different purposes, not all of them for encryption. What we refer to as hash datatypes in Perl uses a hash algorithm internally to map a lookup string to a memory location. This algorithm is faster and shorter than the ones we use for encryption. Other hashes protect data from accidental changes, such as CRC or Adler. These are often used in networking protocols or inside file formats to check that the data has remained intact. They rarely go over 32-bits in length, and won’t withstand an attacker deliberately trying to change the data.
The kind of hashes we’re interested in here are cryptographic strength, which are designed to stop such a deliberate attack. They also have an “avalanche effect”, where changing just one bit of input creates a completely different output. Finally, it has to be difficult to figure out the original data if you only have the hash.
These goals are complicated by something called The Birthday Problem. Let’s say you’re planning a party, and want to know how many people you need to invite to have a 50% chance of someone having the same birthday as you (assuming your friends all have evenly distributed birthdays throughout the year). If you work it all out, you’ll find that the answer is 253 people.
Let’s change the problem slightly. Instead of choosing a person ahead of time (yourself) to find a match, we want there to be a 50% chance of any two people sharing a birthday. In this case, we only need 23 people.
The difference of selecting a datapoint to match ahead of time, versus allowing any two datapoints to match, makes a huge difference. Hashes face the same difference in these two problems:
- Having a specific dataset, and wanting to find another document that matches its hash
- Finding any two datasets that match its hash
You may wonder if a 256-bit hash is just as hard to brute force as a 256-bit block cipher. Due to the Birthday Problem, we have to double the size; all else being equal, a 512-bit hash is equivalent to a 256-bit block cipher.
Cryptographic hashes have a number of failed attempts over the years. MD4 was published in 1990 as a 128-bit hash. It strongly influenced both MD5 (also 128-bit) and SHA1 (160-bit), which were also hugely popular for a number of years. MD4 has since been completely blown away as a viable hash algorithm. By the late 90s, MD5 had also been exposed as weak, and while it wasn’t as crushing as MD4, attacks were only going to get better. Cryptographers at that time tended to recommend SHA1.
Backing up a little bit, NIST had published SHA-1 as an official US government standard in 1993. Unlike the open contests that produced DES and AES, the algorithm was originally developed internally by the NSA. Although SHA1 is similar to MD5 and MD4, it contained a number of improvements that resisted attacks against its predecessors. While cryptographers are cautious about trusting the NSA, it held up to the best public scrutiny they could throw against it for many years.
That changed in 2005, when new attacks showed a weakness in SHA-1. This sent cryptographers scrambling for an alternative. As a stop-gap measure, the SHA-2 standards started to be recommended (which had already been published by NIST in 2001). SHA-2 is actually a group of algorithms that stretch from 224 to 512-bits.
There’s still some similarity to SHA-1 inside SHA-2. Since the whole family going back to MD4 has had problems, cryptographers were still nervous. So this time, NIST put out a public contest for a new hash function, which would become SHA-3. Unfortunately, these things take time, since the new algorithms have to be validated through several rounds of consideration, with papers published in cryptographic journals that analyze their security. By the end of 2010, five finalists were selected, and the winner was finally announced in 2012, which was Keccak.
In the meantime, cryptographers had still been chipping away at SHA-1 and 2. As the years of the NIST contest went by, nobody was making much progress on generalizing SHA-1 attacks to SHA-2. It seems that the initial paranoia over SHA-2 was unjustified (this time). Before the SHA-3 winner was announced, Bruce Schneier suggested that there be no award.
Regardless, we now have a bunch of new cryptographic hashes around, so if one of them goes down, we can hop to a new one. It’s good to have options.
CPAN has all of these available under the Digest:: namespace. Not all of the modules under there are cryptographic digests–Digest::CRC and Digest::Adler32 are two in particular. These have their place, but not in cryptography. Older algorithms like Digest::MD4 and Digest::MD5 are in there, too. Please, don’t use them for anything new, and strongly consider migrating anything old.
All the SHA-3 contest finalist ciphers–BLAKE, Grøstl, JH, Keccak and Skein–are all in there. Unlike what happened with AES/Rijndael in the Crypt:: namespace, Keccak is now available as Digest::SHA3.
Nearly everything in the Digest:: namespace has a consistent interface. There is a functional interface of
$ perl -MDigest::SHA3=sha3_512_hex -E 'say sha3_512_hex( "foobarbaz" )' 97b46f09a2ce323f1cb840d66785c4e993b8328f3c4b28714416d848fbbb49858feae65183251cf17c6e0514f0b7a67f5cbb6e444398f2dec88ba59318987401
And an object oriented interface. In OO, the constructor is called with the hash size. You can then use
addfile() to add to the data that’s going to be hashed. These can be called multiple times to keep adding data. Finally, call
b64digest() for the hash value in the encoding you want:
$ perl -MDigest::SHA3 -E '$d = Digest::SHA3->new(512); $d->add("foo"); $d->add("barbaz"); say $d->hexdigest;' 97b46f09a2ce323f1cb840d66785c4e993b8328f3c4b28714416d848fbbb49858feae65183251cf17c6e0514f0b7a67f5cbb6e444398f2dec88ba59318987401
Note that as soon as you call for the digest, the internal state is flushed, so if you call it again, you’ll only hash what was added after the last call:
$ perl -MDigest::SHA3 -E '$d = Digest::SHA3->new(512); $d->add("foo"); $d->hexdigest; $d->add("barbaz"); say $d->hexdigest;' 9f5c484cffce693ac0a4a2f1d7b27b972ab5013fed6df2ad9fca994c53686a0547b54af7f0fafe5aabbf666ce54052d8a000ca8ce7a52c1338f28ef60e53ddab
Cryptographic hashes find many uses in cryptography for verifying the integrity of data. One place where you shouldn’t be using them is for storing passwords. We’ll discuss why in the next post.