PBKDF2 is an algorithm used to hash passwords.

import os
from pbkdf2 import PBKDF2

# random 16-byte salt
salt = os.urandom(16)

# 100,000 iterations (about 100 ms), 32-byte output
passwordHash = PBKDF2(password, salt, 100000).read(32)

The above code is based on real password hashing code that I found in the wild. There’s a subtle weakness in this hashing mechanism – can you see it?

The What

Technically, I’m being unfair here – I didn’t give you all the information. The missing piece is that this code uses a library called “python-pbkdf2”, which defaults to using HMAC-SHA-1.

OK, so what? HMAC-SHA-1 isn’t broken, so what actually went wrong? Interestingly enough, due to a subtle aspect of PBKDF2’s design, this code actually hashes the password 400,000 times, of which 200,000 are a complete waste of time.

This is a problem because programmers set the number of iterations based on how much CPU time they’re willing to spend on password hashing. Password hashing forces you to tradeoff performance for security. In practice, there are limits to how much performance can be sacrificed for security. If the hashing code wastes time, then the programmer is forced to lower the security level in order to meet their standard for performance.

The How

But why is the password hashed 400,000 times instead of 200,000? Let’s take a quick detour to look at how PBKDF2 actually works.

Password-Based Key Derivation Functions (“PBKDFs”) transform passwords into cryptographic keys. PBKDF2 is a particular PBKDF design from the late ’90s.

You might be thinking that PBKDFs sound unrelated to password hashing. It turns out that these two problems are mostly the same. A good PBKDF is slow to prevent brute force attacks, just like a password hash. A good PBKDF needs to preserve as much of the password’s entropy as possible, just like a password hash. Etc.

An important feature of PBKDF2 is its ability to produce any size output. Since the output of PBKDF2 is used as an encryption key, PBKDF2 needs to able to produce whatever size of key the encryption algorithm is expecting. This feature doesn’t make much sense for password hashing. With a password hash, the output just needs to be big enough to be secure – the specific size isn’t particularly important – so it’s typical to pick some arbitrary number, like 32 bytes.

In order to produce outputs of any size, PBKDF2 does something like this:

output = iteratedHashing(password, salt, iterationCount, 1)
	   + iteratedHashing(password, salt, iterationCount, 2)
	   + iteratedHashing(password, salt, iterationCount, 3)
	   + ... # as many as needed

From a performance perspective, this is alarming: each block of output causes the password to be hashed all over again!

For generating keys, this design is kind of acceptable. For password hashing, though, it’s a complete waste. Let’s suppose we choose to use four blocks of output. If an attacker gains access to the hash, they can pick a single block and crack it individually to find the password. Once they have the password, they no longer need to crack the other blocks.

If we look back at the hashing code from earlier, we can see that it asks for 32 bytes of output. Since it uses HMAC-SHA-1, which has a 20-byte output, PBKDF2 will perform the hashing twice in order to produce enough output. The second block of output adds no security benefit.

The Fix

It’s straightforward to work around this design flaw. Any time PBKDF2 is used for password hashing, the output length should match the output length of the HMAC function used:

HMAC-MD5: 128 bits (16 bytes)
HMAC-SHA-1: 160 bits (20 bytes)
HMAC-SHA-256: 256 bits (32 bytes)

That said, it would probably be wise to use a better password hashing algorithm (such as bcrypt or scrypt) if possible.