(These are notes adapted from a presentation I gave at the LakeHub workshop this week. They owe a lot of debt to this article, which inspired the talk. If you already know that you should just use bcrypt or something similar, and why, you can just skip to the ‘conclusions’ section.)
So let’s suppose you’ve just made a hot new website from which you’ll make a million dollars a year. You get to the point of creating a database for all of your users who will be logging in and doing things like buying airplanes, so you put together a database table. Maybe it looks something like this:
A few weeks after launch, you have two million users, and someone breaks into your server and steals the database. Of course, they don’t tell you that they did this; they’re much happier to keep the database, pull out a name and password, log in as someone else, and use your site to steal lots of money and undermine the basic building blocks of democracy and common decency. After sending apologies to the userbase, you decide that your database structure was flawed.
At this point you start doing some reading, and find out about password hashing. The very basic idea of password hashing is to take any password and turn it into a nonsense-looking string of characters; this is called the hash of the password. This hashed password is then what’s stored in the database. When a person tries to log in, their typed password is turned into another nonsense string of characters, and checked against the nonsense string in the database; if the nonsense matches, then they probably typed the right password, and are allowed to log in. So now your database looks like this:
So how do you make a hashed password? The right answer is to use a pre-made hashing function. There are numerous libraries for hash functions available in whatever language you’re using, php, python, whatever. You should use these instead of making your own because making a good hash function is incredibly difficult to do right. Here are some of the requirements of a good hash:
- A hash is a deterministic function. If you give it the same input, it produces the same output every time.
- It’s hard to go backwards. In other words, if someone finds the hash “827ccb0eea8a706c4c34a16891f84e7b” in your database, it should be hard to figure out that it came from the password “12345”. Mathematically, this says means (roughly speaking) that the hash is a one-way function.
- Given a hash, it should be really hard to produce a word that gives that hash. If it were easy to do this, then a hacker could just find SOME word (maybe really long and nonsensical) that gives the hash “827ccb0eea8a706c4c34a16891f84e7b”, and not even care that your ACTUAL password was “12345”.
- In particular, small changes to the input should make big changes in the hash. This makes it hard to find a password which gives something close to the right hash and then hone in on the actual hash from there.
- The hash should always be the same length. This keeps anyone from knowing whether your password has one character or thirty.
- The hash function should be resistant to all known means of reversing hash functions.
Finding a function that does all of these things is extremely difficult. And that’s why you should use a pre-made hash function. The good ones are designed with the notion that the hashes and the algorithm for generating the hashes are known to everyone; this means they’re designed to be incredibly hard to break.
But even so, there are good hash functions and bad hash functions; in fact, some of the hash functions that were good in the past are now bad because someone has figured out a way to break one of the six requirements above. A prime example of this is the md5 hash function, which has numerous weaknesses that have come to light over the years.
So you go ahead and use a nice, modern hash function, like SHA256. So now your table looks like this:
Now let’s imagine what happens when a hacker gets your database. They also manage to look at your source code and figure out that you’re using the SHA-256 algorithm to keep your hashes. That should be fine. But they also know a few thousand common passwords, and use this to get the passwords of some of your users.
To find out how easy this is, I downloaded a text file containing md5 hashes of passwords from eHarmony from a hack that occurred some years ago. I also downloaded a file with about 3000 common passwords. I then wrote some python code to go through the 3000 passwords, hash them with md5, and see if the hashes show up in the eHarmony list. This takes about 7.5 seconds on my cheap laptop, and turns up one password: ‘NIGHTWIND’. This clearly isn’t good enough, so I wrote another function to add digits on either side of a common password, and check each ‘new’ password. This starts turning up passwords quite quickly, thousands in fact.
Now, in fact there exists software that people have written exactly to speed this process. (Look up ‘password recovery tools’ if you’re interested.) This software includes lots and lots and lots of common ‘rules’ that people use for their passwords, like using 133t-sp34k, keyboard patterns (qeadzc13, for example), alternating numbers and letters, and so on and so forth. In one test carried out by Ars Technica, a skilled hacker broke over 80% of the passwords in a particular database.
This brings us to the topic of ‘salt’. Salt is some random characters added to each password in the database. You store the salt in the database along with the user password; when someone tries to log in, the salt is added automatically to the password and the hash is checked. Now your database looks like this:
The upshot is that now the hacker has to add the salt – which is different for each user – for every password check. Effectively, this means they need to do their whole ‘common password search’ separately for each user, vastly slowing down the operation. Hopefully enough to allow your userbase to change their passwords….
But the problem in the last few years is that people are now using fancy, high-powered graphics processors (GPU’s) for this sort of thing, and suddenly just adding some salt isn’t good enough anymore. With a setup that can check millions of passwords a minute, it actually isn’t a big deal to do the search on every password.
The response has been a new generation of password protection algorithms. These new algorithms use really complicated algorithms to produce the hash, complicated enough that it slows down the process of creating an individual hash. This means that even with the big rig full of GPU’s, the hacker can’t get through the individual users at a reasonable speed. The best-known of the new-generation hashing functions (usually called ‘key generation functions’ instead of hashes, these days) is bcrypt. Another competitor is scrypt.
So, as of this writing, these are some good tools to use in protecting your database. For now! It’s unknown whether someone will discover a vulnerability in bcrypt; there hasn’t been much research on the algorithm yet, so there’s a good chance someone will find a vulnerability eventually. And then we’ll need to move to the new best thing to keep our passwords safe!
One of the things that I kind of love about all of this is that, in the end, the design of the system probably matters more than the stupid password choices of individual users. This simple lesson is actually applicable in all kinds of situations: You’ll usually get more bang for your buck from designing better systems than you will from trying to change human nature. The applications to politics are obvious: instead of trying to change the nature of greedy politicians, we should instead try to get systems in place that make it impossible (or at least extremely difficult and risky) to be greedy. Of course, it’s a hard sell when the same greedy ministers have to sign off on those systems, but that’s a topic for another blog, perhaps…