Skip to content

Instantly share code, notes, and snippets.

@manzanit0
Last active April 16, 2022 09:38
Show Gist options
  • Save manzanit0/9f424ec5b6f134c0147b694e30cbe1eb to your computer and use it in GitHub Desktop.
Save manzanit0/9f424ec5b6f134c0147b694e30cbe1eb to your computer and use it in GitHub Desktop.
Notes: SQL Antipatterns, by Bill Karwin

Notes: SQL Antipatterns, by Bill Karwin

Chapter 2: Jaywalking

TLDR: use constraints. They are your friends. If you push data validation to the application layer, the moment you add another consumer of the data (might be a manual script), you're in for a wild ride.

Chapter 3: Naive Trees

Problem: create a recursive, unlimited-depth, relationship, i.e: comments in a thread.

Usual Solution: Add a parent_id to the table, also called Adjacency List.

Pitfall: To retrieve the full tree, it requires adding manually a new join per depth level. Alternatively, you could query all the records and sort them in-memory.

Note: Using Common Table Expressions could aid in the pitfall. However, not all database engines/versions support them. This would allow for creating recursive queries.

Alternative solutions:

  • Path Enumeration: keep a unix-like path in a column to keep track of the depth. Querying could then look like: SELECT * FROM comments WHERE path LIKE '4/%';
  • Nested Sets: sorts the data in a tree structure and keeps track of right and left nodes through foreign keys. Makes manipulating the data hard, but querying simple and fast.
  • Closure Table: keep track of the parent-child relationship in an external table. Main advantage of this is that it's dead-cheap to do any operation; potentially even delegating the tree maintenance to a trigger.

Chapter 4: ID Required

"Conventions are good only if they are helpful"

  • A Primary Key is a constraint not a data type. It's what marks that two records are the same or not.
  • The most common pattern is to add a pseudokey or surrogate key to all tables. It's the typical id which is an integer or an UUID.
  • However, just like Eric Evans points out in DDD: do we have a natural way of identifying a particular entity as unique? Just like a house might use an address, or a parcel uses a reference. These make for natural keys, making pseudokeys irrelevant.
  • Sometimes compound keys can make for natural keys. This not just conveys domain knowledge, but also makes the model simpler. Don't not use them because they "seem" harder.
  • Why do we use id as the primary key and not comment_id? This is usually due to framework conventions. However, by using a unique name (such as comment_id for comments) we're (1) conveying more information and (2) potentially making our queries simpler (see USING keyword). Conventions aren't always good.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment