You have a system backed by a Postgres database that uses UUIDs as primary keys for everything. This is nice:
- You can generate ids on clients before transmitting writes to servers if you wish.
- You don't have to worry about exhausting the integer size allocated to your auto-incrementing primary keys.[^1]
- You don't have to worry about enumeration attacks from auto-incrementing primary keys.
- You can identify and index on data that has no natural key, without the challenges of auto-incrementing primary keys.
However, now you have large, illegible UUIDs permeating your system's logs. This sucks when interpreting logs and debugging errors, especially for data that does have natural keys you could have chosen as a primary key, like any table with a global uniqueness constraint (e.g. `user.email`). Homogeneity is attractive, but having to guess which table a record lives in, based on its UUID alone, is not.
This is especially salient if you expose UUIDs in URLs, your Customer Support team requires URLs in bug reports, and some of your URLs are polymorphic on ID: for example, an `example.com/edit/<thing-id>` URL that can operate on data from different tables.
This debugging and observability challenge can be addressed with better logging—but that is true of most debugging and observability challenges. You could also implement your own custom UUID generation method in all your systems and give up some of those collision-avoiding bits to tag them with metadata. Instead, let's explore solving it at the storage level.
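For contrast, here is a sketch of what that rejected custom-generation alternative could look like. Everything in it is hypothetical: `TABLE_TAGS`, `tagged_uuid`, and `table_of` are illustrative names, and it spends the first byte (8 of a v4 UUID's 122 random bits) on a table tag.

```python
import uuid

# Hypothetical tag registry; every service generating ids must share it.
TABLE_TAGS = {"user": 0x01, "order": 0x02}
TAGS_TO_TABLES = {tag: name for name, tag in TABLE_TAGS.items()}

def tagged_uuid(table: str) -> uuid.UUID:
    """Generate a v4 UUID whose first byte names the source table."""
    random_bytes = uuid.uuid4().bytes
    # Overwrite the first byte with the tag; the remaining 15 bytes
    # (including the version/variant bits) are kept as generated.
    return uuid.UUID(bytes=bytes([TABLE_TAGS[table]]) + random_bytes[1:])

def table_of(u: uuid.UUID) -> str:
    """Recover the table name embedded in a tagged UUID."""
    return TAGS_TO_TABLES[u.bytes[0]]
```

Note the trade: every id-generating client must agree on the registry, and each tag byte you spend shrinks the collision-avoiding random space.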
Use Postgres inheritance and a generated column to automatically maintain a registry of all the UUIDs in your system.[^2]
Let's create a base table named `entity` where we can discover the location of any record in our system:

```sql
CREATE TABLE entity (
    -- Every inheritor of this table gets a uuid PK
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    -- Every inheritor of this table auto-generates the table name it comes from
    source_table text GENERATED ALWAYS AS (tableoid::regclass) STORED
);
```
We can now create tables that inherit from the entity registry for this core functionality:
```sql
-- "user" is a reserved word in Postgres, so it must be quoted;
-- citext assumes the citext extension is installed
CREATE TABLE "user" (
    email citext PRIMARY KEY
) INHERITS (entity);
```
Existing tables (per the docs; I have not tried this on a non-greenfield database) can sign up to the registry:

```sql
ALTER TABLE existing_table INHERIT entity;
```
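One caveat, per the `ALTER TABLE` docs: `INHERIT` requires the child to already contain every column of the parent, with matching names and types (check the docs for the full compatibility rules). So an existing table may first need a migration along these lines; this is a sketch mirroring the `entity` definition above, not something I've run against a production table:

```sql
-- Sketch: give existing_table the columns entity expects before INHERIT.
-- The definitions must mirror the entity table above.
ALTER TABLE existing_table
    ADD COLUMN id uuid NOT NULL DEFAULT gen_random_uuid(),
    ADD COLUMN source_table text GENERATED ALWAYS AS (tableoid::regclass) STORED;

ALTER TABLE existing_table INHERIT entity;
```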
Now if you want to know what table a UUID in your system comes from, you can query the entity registry:
```sql
-- Found uuid 'd4c0fc8c-72b7-44ab-94a7-a070d59357e0' in a log file?
SELECT * FROM entity WHERE id = 'd4c0fc8c-72b7-44ab-94a7-a070d59357e0';
-- | id                                   | source_table |
-- |--------------------------------------|--------------|
-- | d4c0fc8c-72b7-44ab-94a7-a070d59357e0 | user         |
```
With the `source_table` in hand, you can check out the data associated with that UUID:

```sql
SELECT * FROM "user" WHERE id = 'd4c0fc8c-72b7-44ab-94a7-a070d59357e0';
-- | id                                   | source_table | email             |
-- |--------------------------------------|--------------|-------------------|
-- | d4c0fc8c-72b7-44ab-94a7-a070d59357e0 | user         | [email protected] |
```
- Per the footnote above[^1], this approach further exacerbates the probabilistic risks of using UUIDs as primary keys. You now have to worry about UUID generation collisions in write activity across your entire table-space, instead of per-table. If you have a system with a write volume where this risk concerns you, hopefully you have already studied how to calculate the collision probabilities arising from the birthday paradox, and can apply them to this approach.[^3]
- You have to pay the cost of storing the table name for every record. Postgres generated columns only support the `STORED` qualifier for `GENERATED ALWAYS AS` columns, and AFAICT in 2023 there are no plans to introduce support for a hypothetical `VIRTUAL` qualifier. This cost can be mitigated by limiting the size of the `source_table` column via something other than `text`.
If you're interested in enforcing audit timestamps for every record in your system, you can now add these columns to your base `entity` table.
Unfortunately, triggers are not inherited from the base `entity` table, so you must remember to execute DDL that attaches these triggers to every table in your system, new or existing.
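If remembering is a worry, one mitigation (a sketch I haven't hardened; it assumes the `update_updated_at_timestamp()` trigger function defined below already exists) is a `DO` block that walks the `pg_inherits` catalog and attaches the trigger to every current inheritor of `entity`:

```sql
DO $$
DECLARE
    child record;
BEGIN
    FOR child IN
        SELECT c.oid::regclass AS tbl, c.relname AS name
        FROM pg_inherits i
        JOIN pg_class c ON c.oid = i.inhrelid
        WHERE i.inhparent = 'entity'::regclass
    LOOP
        EXECUTE format(
            'CREATE TRIGGER %I BEFORE UPDATE ON %s
             FOR EACH ROW EXECUTE PROCEDURE update_updated_at_timestamp()',
            'update_updated_at_timestamp_on_' || child.name,
            child.tbl);
    END LOOP;
END
$$;
```

Note this only covers tables that exist when it runs, and re-running it will error on tables that already have the trigger, so it complements rather than replaces the per-table DDL discipline.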
- Add a new column to your base table:

  ```sql
  ALTER TABLE entity ADD COLUMN inserted_at timestamptz DEFAULT CURRENT_TIMESTAMP NOT NULL;
  ```

  Existing records in inheriting tables will receive values corresponding to the time the DDL was run. New records will receive values corresponding to when they were inserted.
- Add a new column to your base table:

  ```sql
  ALTER TABLE entity ADD COLUMN updated_at timestamptz DEFAULT CURRENT_TIMESTAMP NOT NULL;
  ```

  Existing records in inheriting tables will receive values corresponding to the time the DDL was run. New records will receive values corresponding to when they were inserted.
- Create a function that any table can use to update its `updated_at` timestamp:

  ```sql
  CREATE OR REPLACE FUNCTION update_updated_at_timestamp() RETURNS TRIGGER AS $$
  BEGIN
      NEW.updated_at = CURRENT_TIMESTAMP;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;
  ```
- Add the trigger to your inheriting `<existing_table>`s, including every time you create one:

  ```sql
  CREATE TRIGGER update_updated_at_timestamp_on_<existing_table>
  BEFORE UPDATE ON <existing_table>
  FOR EACH ROW EXECUTE PROCEDURE update_updated_at_timestamp();
  ```
- Add a new column to your base table:

  ```sql
  ALTER TABLE entity ADD COLUMN soft_deleted_at timestamptz DEFAULT NULL;
  ```
- Enforce inserting a timestamp into this field when your application wants to "soft delete" data. To query only not-soft-deleted records, select only where `soft_deleted_at IS NULL` everywhere, including in joins.
- Add a partial index on the null-ness of this field (`WHERE soft_deleted_at IS NULL`). This way, if soft deletes accrete and become a disproportionate presence in your table, the not-soft-deleted rows remain rapidly queryable.
- Consider a periodic sweep of soft-deleted records. This must align with your data retention and right-to-be-forgotten policies. It must also support your "undo" and audit log features. Within those constraints, however, regularly purge soft-deleted records to keep your dataset minimal.
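The soft-delete steps above might be sketched like so for a single inheriting table. The index name and the 90-day window are assumptions to adapt, not part of the approach:

```sql
-- Partial index: only live rows are indexed, so scans filtered on
-- soft_deleted_at IS NULL stay fast as soft-deleted rows accumulate.
CREATE INDEX user_live_idx ON "user" (id) WHERE soft_deleted_at IS NULL;

-- Soft delete a record instead of removing it.
UPDATE "user" SET soft_deleted_at = CURRENT_TIMESTAMP
WHERE id = 'd4c0fc8c-72b7-44ab-94a7-a070d59357e0';

-- Periodic sweep: hard-delete rows soft-deleted more than 90 days ago.
DELETE FROM "user"
WHERE soft_deleted_at IS NOT NULL
  AND soft_deleted_at < CURRENT_TIMESTAMP - INTERVAL '90 days';
```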
Footnotes

[^1]: Technically, you now have to worry about your write volume exhausting the probabilistic "guarantees" of UUIDs.
[^2]: Fair warning: I have only tried this approach in one meaningfully write-heavy production system, and I've since re-implemented it in several hobby projects. After a year and a half in production before I left that team, it was still working well, and I still really love using it in my side projects.
[^3]: If you are even an entire order of magnitude under the write volumes that could produce even occasional probabilistic UUID generation collisions across your database, let alone within single tables, you should be anxiously researching non-table-based multi-tenancy or sharding approaches for your storage.