Skip to content

Instantly share code, notes, and snippets.

@wolever
Last active December 9, 2021 14:31
Show Gist options
  • Save wolever/90498ec8812400770999 to your computer and use it in GitHub Desktop.
Save wolever/90498ec8812400770999 to your computer and use it in GitHub Desktop.
UUIDs as primary keys

The major tradeoffs between using UUIDs and sequential integers as primary keys fall into two categories:

  1. Usability tradeoffs, and
  2. Performance tradeoffs

And I'll say up front: I believe UUIDs are rarely a good idea, both for the reasons I've listed here, and the fact that I've almost never seen them used in the wild.

First and most importantly, the usability tradeoffs when dealing with UUIDs:

  1. If the surrogate key will be included in URLs, UUIDs will be disgusting. Ex, this is what URLs would look like if StackExchange used UUIDs instead of sequential integers:

    http://dba.stackexchange.com/questions/21448a7fc00b436ea61c3b98516f5e41/what-are-the-drawbacks-with-using-uuid-or-guid-as-a-primary-key
    

    Or base64 encoded:

    http://dba.stackexchange.com/questions/EHMXJqGrSbipNK5c3qzHuQ/what-are-the-drawbacks-with-using-uuid-or-guid-as-a-primary-key
    

    Versus:

    http://dba.stackexchange.com/questions/322/what-are-the-drawbacks-with-using-uuid-or-guid-as-a-primary-key
    
  2. It's impossible to remember or recognize UUIDs, which can frustrate debugging, reading logs, and looking through DB dumps.

  3. UUIDs are totally meaningless. A sequential ID tells you a little bit about the row it's referencing (ex, a small user ID was an early adopter, an especially large user ID is a recent signup). This isn't a huge benefit, but I've found it to be pleasantly useful.

  4. Sequential IDs can leak information. You might now want a user with ID=39 to know that there are only 38 other people on your site. UUIDs do avoid this, but there are other solutions: starting the IDs at a random large number, or obfuscating IDs before they are sent to the user (I like to use IDs which are: hex(id) || hex(crc32(hex(id)))[:4]: they look big, and have the built-in benefit of sanity checking, so I can reject invalid IDs without even looking at the database).

  1. Sequential IDs are easily to trivially enumerable. Obfuscation can help, but UUIDs provide provide a "free" layer of security in this regard (although it should be noted that an indexed UUID column could be added to an existing table to serve a similar role).
  1. UUIDs are universally unique, so it's impossible to accidentally mix up primary keys in a JOIN, and you can uniquely reference any entity using just its ID instead of the combination of its ID and table.
  1. UUIDs can be generated offline, which can drastically simplify batch operations.
  2. Merging databases with sequential IDs is a huge pain (ON UPDATE CASCADE only goes so far), but UUIDs make merges trivial.

(upfront warning: as with all performance related things, take all of this with a grain of salt) the two major performance tradeoffs that come to mind are:

  1. Temporal locality. Usually data inserted at about the same time will be accessed at about the same time (for example, tweets from today will be accessed more frequently than tweets from last year). With sequential IDs, rows inserted at similar times will have similar IDs, where as there will be no correlation with UUIDs. This has implications for indexed lookup speed, because each lookup on a UUID will essentially be a random walk through the index's b-tree, where sequential IDs will be somewhat predictable. This isn't much of an issue if the index is always entirely in memory... but if the index is too large to fit in memory, sequential IDs will be much more likely to hit the pages which have been cached in memory (ex, because the index pages for tweets from today will likely be cached, where index pages for tweets from last year probably won't be).

    Note that Postgres stores data on disk in roughly insert order, and CLUSTER http://www.postgresql.org/docs/9.4/static/sql-cluster.html can be used to re-order on-disk data.

  2. UUIDs are potentially much slower to insert. See, ex, this benchmark: https://gist.github.com/jgaskins/8467016

In the past, I've struck a balance between the two by using a 64 bit ID where the first 32 bits are a timestamp, and the last 32 are a random number (which has a 0.01% chance of collision after 1,000 iterationscontrast with UUIDs — so not totally suitable for situations where you're generating a lot of IDs, but worked for my application):

And when these values are base36 encoded, they are moderately human-friendly:

1azbff2sxolpj
1azbff45p7iu8
1azbff31dpx6i
1azbff3ydfbt8
1azbff3nw5des
1azbff3gmm5gz
1azbff31zdqon
1azbff3ju65dt
1azbff41kbwpt
1azbff4jv6lj4

Other interesting solutions

(thanks @tavisrudd)

Distributed ID generating systems (such as Flickr's and Twitter's) are also worth mentioning, if only to note that they use some combination of timestamp, shard ID, and shard-specific sequence.

@hlyakhovich
Copy link

Sensible and clear, thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment