TLDR: use constraints. They are your friends. If you push data validation to the application layer, the moment you add another consumer of the data (might be a manual script), you're in for a wild ride.
Problem: create a recursive, unlimited-depth, relationship, i.e: comments in a thread.
Usual Solution: Add a parent_id
to the table, also called Adjacency
List.
Pitfall: To retrieve the full tree, it requires adding manually a new join per depth level. Alternatively, you could query all the records and sort them in-memory.
Note: Using Common Table Expressions could aid in the pitfall. However, not all database engines/versions support them. This would allow for creating recursive queries.
Alternative solutions:
- Path Enumeration: keep a unix-like path in a column to keep track of the
depth. Querying could then look like:
SELECT * FROM comments WHERE path LIKE '4/%';
- Nested Sets: sorts the data in a tree structure and keeps track of right and left nodes through foreign keys. Makes manipulating the data hard, but querying simple and fast.
- Closure Table: keep track of the parent-child relationship in an external table. Main advantage of this is that it's dead-cheap to do any operation; potentially even delegating the tree maintenance to a trigger.
"Conventions are good only if they are helpful"
- A Primary Key is a constraint not a data type. It's what marks that two records are the same or not.
- The most common pattern is to add a pseudokey or surrogate key to all
tables. It's the typical
id
which is an integer or an UUID. - However, just like Eric Evans points out in DDD: do we have a natural way of identifying a particular entity as unique? Just like a house might use an address, or a parcel uses a reference. These make for natural keys, making pseudokeys irrelevant.
- Sometimes compound keys can make for natural keys. This not just conveys domain knowledge, but also makes the model simpler. Don't not use them because they "seem" harder.
- Why do we use
id
as the primary key and notcomment_id
? This is usually due to framework conventions. However, by using a unique name (such ascomment_id
forcomments
) we're (1) conveying more information and (2) potentially making our queries simpler (seeUSING
keyword). Conventions aren't always good.