The major tradeoffs between using UUIDs and sequential integers as primary keys fall into two categories:
- Usability tradeoffs, and
- Performance tradeoffs
And I'll say up front: I believe UUIDs are rarely a good idea, both for the reasons I've listed here, and the fact that I've almost never seen them used in the wild.
First and most importantly, the usability tradeoffs when dealing with UUIDs:
If the surrogate key will be included in URLs, UUIDs will be disgusting. Ex, this is what URLs would look like if StackExchange used UUIDs instead of sequential integers:
http://dba.stackexchange.com/questions/21448a7fc00b436ea61c3b98516f5e41/what-are-the-drawbacks-with-using-uuid-or-guid-as-a-primary-key
Or base64 encoded:
http://dba.stackexchange.com/questions/EHMXJqGrSbipNK5c3qzHuQ/what-are-the-drawbacks-with-using-uuid-or-guid-as-a-primary-key
Versus:
http://dba.stackexchange.com/questions/322/what-are-the-drawbacks-with-using-uuid-or-guid-as-a-primary-key
It's impossible to remember or recognize UUIDs, which can frustrate debugging, reading logs, and looking through DB dumps.
UUIDs are totally meaningless. A sequential ID tells you a little bit about the row it's referencing (ex, a small user ID was an early adopter, an especially large user ID is a recent signup). This isn't a huge benefit, but I've found it to be pleasantly useful.
Sequential IDs can leak information. You might now want a user with ID=39 to know that there are only 38 other people on your site. UUIDs do avoid this, but there are other solutions: starting the IDs at a random large number, or obfuscating IDs before they are sent to the user (I like to use IDs which are:
hex(id) || hex(crc32(hex(id)))[:4]
: they look big, and have the built-in benefit of sanity checking, so I can reject invalid IDs without even looking at the database).
- Sequential IDs are easily to trivially enumerable. Obfuscation can help, but UUIDs provide provide a "free" layer of security in this regard (although it should be noted that an indexed UUID column could be added to an existing table to serve a similar role).
- UUIDs are universally unique, so it's impossible to accidentally mix up primary keys in a JOIN, and you can uniquely reference any entity using just its ID instead of the combination of its ID and table.
- UUIDs can be generated offline, which can drastically simplify batch operations.
- Merging databases with sequential IDs is a huge pain (
ON UPDATE CASCADE
only goes so far), but UUIDs make merges trivial.
(upfront warning: as with all performance related things, take all of this with a grain of salt) the two major performance tradeoffs that come to mind are:
Temporal locality. Usually data inserted at about the same time will be accessed at about the same time (for example, tweets from today will be accessed more frequently than tweets from last year). With sequential IDs, rows inserted at similar times will have similar IDs, where as there will be no correlation with UUIDs. This has implications for indexed lookup speed, because each lookup on a UUID will essentially be a random walk through the index's b-tree, where sequential IDs will be somewhat predictable. This isn't much of an issue if the index is always entirely in memory... but if the index is too large to fit in memory, sequential IDs will be much more likely to hit the pages which have been cached in memory (ex, because the index pages for tweets from today will likely be cached, where index pages for tweets from last year probably won't be).
Note that Postgres stores data on disk in roughly insert order, and
CLUSTER
http://www.postgresql.org/docs/9.4/static/sql-cluster.html can be used to re-order on-disk data.UUIDs are potentially much slower to insert. See, ex, this benchmark: https://gist.github.com/jgaskins/8467016
In the past, I've struck a balance between the two by using a 64 bit ID where the first 32 bits are a timestamp, and the last 32 are a random number (which has a 0.01% chance of collision after 1,000 iterations — contrast with UUIDs — so not totally suitable for situations where you're generating a lot of IDs, but worked for my application):
And when these values are base36 encoded, they are moderately human-friendly:
1azbff2sxolpj 1azbff45p7iu8 1azbff31dpx6i 1azbff3ydfbt8 1azbff3nw5des 1azbff3gmm5gz 1azbff31zdqon 1azbff3ju65dt 1azbff41kbwpt 1azbff4jv6lj4
Other interesting solutions
- Instagram's 64 bit IDs which contain a timestamp, shard ID, and shard-specific sequence: http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
- Mongo's 96 bit ObjectID, which contains a timestamp, machine + process IDs, and a counter: http://docs.mongodb.org/manual/reference/object-id/
(thanks @tavisrudd)
Distributed ID generating systems (such as Flickr's and Twitter's) are also worth mentioning, if only to note that they use some combination of timestamp, shard ID, and shard-specific sequence.
Sensible and clear, thank you.