This purpose of this is to document all the tricks that I've learned over the last couple of decades (good grief, has it really been that long?) from interacting with and developing for databases.
I have organized this as an FAQ of sorts rather than by technical concept, e.g., scalar subuery, chasm trap, etc., because when you don't know how to express something in code, you can't know what the solution type is called or named. My other endeavor is to capture the more thornier types of queries that can come up as requirements, instead of just itemizing trivial, common examples.
This is still a work in progress! And suggestions or comments are welcome!
All of the SQL statements in this can be used against the canonical supplier/parts database schema used in many of C.J. Date's books: https://wiki.c2.com/?SupplierPartsDatabase
The tables used are:
id name status city
----------------------------
S1 Smith 20 London
S2 Jones 10 Paris
S3 Blake 30 Paris
S4 Clark 20 London
S5 Adams 30 Athens
id name color weight city
----------------------------------
P1 Nut Red 12.0 London
P2 Bolt Green 17.0 Paris
P3 Screw Blue 17.0 Oslo
P4 Screw Red 14.0 London
P5 Cam Blue 12.0 Paris
P6 Cog Red 19.0 London
supplier_id part_id quantity
------------------------------
S1 P1 300
S1 P2 200
S1 P3 400
S1 P4 200
S1 P5 100
S1 P6 100
S2 P1 300
S2 P2 400
S3 P2 200
S4 P2 200
S4 P4 300
S4 P5 400
A SQL script is attached this gist to allow you to create the schema in your own database.
Let's say that we want to know which parts are available and the number of suppliers for each. In this case, we want to group by the part_id
, and simply count the number of supplier_id
s:
select part_id,
count(supplier_id)
from suppliers_parts
group by part_id;
part_id | count
---------+-------
P2 | 4
P1 | 2
P6 | 1
P4 | 2
P5 | 2
P3 | 1
(6 rows)
In this case, we still want to group by the part ID but we want to sum the quantity for each:
select part_id,
sum(quantity)
from suppliers_parts
group by part_id;
part_id | sum
---------+------
P2 | 1000
P1 | 600
P6 | 100
P4 | 500
P5 | 500
P3 | 400
(6 rows)
What if you wanted to only include those parts for which at least a certain about, say 500, were available? Then we can filter records out using the having
clause. You can think of it like where
: select
:: having
: group by
:
select part_id,
sum(quantity)
from suppliers_parts
group by part_id
having sum(quantity) > 500;
part_id | sum
---------+------
P2 | 1000
P1 | 600
(2 rows)
Let's say you wanted get a list of all the supplier names for each part. You can take advantage of string_agg
here just like you would any other SQL aggregate function like max
or sum
:
select sp.part_id,
string_agg(s.name, ', ') as supplier_list
from suppliers_parts sp
join suppliers s
on sp.supplier_id = s.id
group by sp.part_id;
part_id | supplier_list
---------+----------------------------
P2 | Smith, Jones, Blake, Clark
P1 | Smith, Jones
P6 | Smith
P4 | Smith, Clark
P5 | Smith, Clark
P3 | Smith
(6 rows)
It's very common to need to join two or more tables but wind up dropping rows because certain rows from the first table don't have corresponding entries in the other. For instance, let's see we want to get a total count of quantities for all known parts, even if there are parts for which none is available. If we do this with a simple join
one of the suppliers drops out:
select s.id,
s.name,
sum(sp.quantity) as total_quantity
from suppliers s
join suppliers_parts sp
on s.id = sp.supplier_id
group by s.id,
s.name
order by s.id
id | name | total_quantity
----+-------+----------------
S1 | Smith | 1300
S2 | Jones | 700
S3 | Blake | 200
S4 | Clark | 900
(4 rows)
This is exactly what left outer join
is useful for:
select s.id,
s.name,
sum(sp.quantity) as total_quantity
from suppliers s
left outer join suppliers_parts sp
on s.id = sp.supplier_id
group by s.id,
s.name
order by s.id
id | name | total_quantity
----+-------+----------------
S1 | Smith | 1300
S2 | Jones | 700
S3 | Blake | 200
S4 | Clark | 900
S5 | Adams |
(5 rows)
How do I join two tables and ensure I get all rows back if some rows from one are missing from the other and vice versa?
This is a rather contrived example, but let's say we want to see more clearly which suppliers and parts are from the same city, and which suppliers lack an accompanying part and which parts don't have a supplier in the same city, all at once. For something like this, we can do a full outer join on the city
column:
select p.name,
p.city,
s.name,
s.city
from parts p
full outer join suppliers s
on p.city = s.city
name | city | name | city
-------+--------+-------+--------
Cog | London | Smith | London
Screw | London | Smith | London
Nut | London | Smith | London
Cam | Paris | Jones | Paris
Bolt | Paris | Jones | Paris
Cam | Paris | Blake | Paris
Bolt | Paris | Blake | Paris
Cog | London | Clark | London
Screw | London | Clark | London
Nut | London | Clark | London
| | Adams | Athens
Screw | Oslo | |
(12 rows)
You can clearly see that Adams has no part in Athens and the Screw has no supplier in Oslo.
One way to do this is indeed to use an exists
clause but if the two sets of records have the same structure, an easier way to express this is to use except
in PostgreSQL (or minus
in Oracle). Let's say we want to find out which cities making parts are not home to any suppliers?
select p.city
from parts p
except
select s.city
from suppliers s;
city
------
Oslo
(1 row)
You can easily answer the reverse question too:
select s.city
from suppliers s
except
select p.city
from parts p;
city
--------
Athens
(1 row)
Note that the shapes and types of the rows from each select
must be the same. You can also achieve this sort of thing using not exists
, but this pattern here is more commonly used with the two tables/record sets are similar in structure.
Oftentimes, you would like to join the results of one query to another table but can't do so easily. It is tempting to create a temporary table to do this sort of thing, but you don't need to with the SQL in most databases. Instead you can use what is called an inline view to produce a "temporary" result set and then join to that.
In most cases, you don't really need an inline view but sometimes it's easier to express things in an inline view first rather writing one top level query to properly aggregrate everything all at once. Again, needing to do this is sometimes symptomatic of a non-optimal schema.
Let's say we want to first get the set of cities and number of parts made for each, and then merge that to the set of suppliers; we can do this:
select s.id,
s.name,
s.city,
c.part_count
from suppliers s
join (select p.city,
count(p.id) as part_count
from parts p
group by p.city) as c
on s.city = c.city
id | name | city | part_count
----+-------+--------+------------
S1 | Smith | London | 3
S2 | Jones | Paris | 2
S3 | Blake | Paris | 2
S4 | Clark | London | 3
(4 rows)
Note that one of the suppliers drops out above. One way around that is to use a so-called scalar subquery instead:
select s.id,
s.name,
s.city,
(select count(p.id)
from parts p
where p.city = s.city) as part_count
from suppliers s
id | name | city | count
----+-------+--------+-------
S1 | Smith | London | 3
S2 | Jones | Paris | 2
S3 | Blake | Paris | 2
S4 | Clark | London | 3
S5 | Adams | Athens | 0
(5 rows)
There are times which this is the easiest and most expressive way to obtain what you want but beware that this query is less performant as the scalar subquery is "executed" once per row in the outer query.
Let's say that ultimately we want a list of the suppliers, the count of parts available in the same city as each supplier, and the total quantities of all parts available for each supplier. We can first get suppliers and part counts with this query:
select s.id,
s.name,
s.city,
c.part_count
from suppliers s
join (select p.city,
count(p.id) as part_count
from parts p
group by p.city) as c
on s.city = c.city;
id | name | city | part_count
----+-------+--------+------------
S1 | Smith | London | 3
S2 | Jones | Paris | 2
S3 | Blake | Paris | 2
S4 | Clark | London | 3
(4 rows)
If we then add logic to get part quantities for each supplier we can arrive at this:
select s.id,
s.name,
count(p.id) as part_count,
sum(sp.quantity) as total_quantity
from suppliers s
join parts p
on s.city = p.city
join suppliers_parts sp
on s.id = sp.supplier_id
group by s.id,
s.name;
id | name | part_count | total_quantity
----+-------+------------+----------------
S3 | Blake | 2 | 400
S1 | Smith | 18 | 3900
S2 | Jones | 4 | 1400
S4 | Clark | 9 | 2700
(4 rows)
Uh oh... what happened? Why did the part counts suddenly go up? It's due to something called chasm trap, which is what happens when you join one table to two other tables and try aggregating on them both at the same time. By joining all three tables at onces like this, we effectively create a Cartesian product of records, thereby counting/summing column values more than once.
The way out of this problem is to inline each aggregation separately and then join them:
select pc.id,
pc.name,
pc.part_count,
tq.total_quantity
from (select s.id,
s.name,
count(p.id) as part_count
from suppliers s
join parts p
on s.city = p.city
group by s.id,
s.name) pc
join (select sp.supplier_id,
sum(sp.quantity) as total_quantity
from suppliers_parts sp
group by sp.supplier_id) as tq
on pc.id = tq.supplier_id;
id | name | part_count | total_quantity
----+-------+------------+----------------
S2 | Jones | 2 | 700
S1 | Smith | 3 | 1300
S3 | Blake | 2 | 200
S4 | Clark | 3 | 900
(4 rows)
At this point you may want to take advantage of a technique called subquery refactoring which allows you to move parts of the main query out into a with
clause
with part_counts as
(select s.id,
s.name,
count(p.id) as part_count
from suppliers s
join parts p
on s.city = p.city
group by s.id,
s.name),
total_quantities as
(select sp.supplier_id,
sum(sp.quantity) as total_quantity
from suppliers_parts sp
group by sp.supplier_id)
select pc.id,
pc.name,
pc.part_count,
tq.total_quantity
from part_counts pc
join total_quantities tq
on pc.id = tq.supplier_id;
This makes the main portion of the query more succuinct and elucidate what you actually want to do at the top level.
There's two ways to do this, one of which is more expressive of the other in my humble opinion. Let's say you wanted to figure out which of the suppliers has no parts? One way is to scan through the list of records in suppliers
and leverage a not exists
clause:
select s.id,
s.name
from suppliers s
where not exists
(select 'x'
from suppliers_parts sp
where sp.supplier_id = s.id);
id | name
----+-------
S5 | Adams
(1 row)
Note that the inner subquery just needs to return some value; it doesn't matter what it is. Another way is to do an outer join from suppliers
to suppliers_parts
and see which rows have a null
value for the part_id
:
select s.id,
s.name
from suppliers s
left outer join suppliers_parts sp
on s.id = sp.supplier_id
where sp.supplier_id is null;
id | name
----+-------
S5 | Adams
(1 row)
I feel like the first query has clearer intent.
There's again two ways to do this. Let's say you wanted to know which of the suppliers does offer parts? Likewise, one way is to scan through the list of records in suppliers
and leverage an exists
clause this time:
select s.id,
s.name
from suppliers s
where exists
(select 'x'
from suppliers_parts sp
where sp.supplier_id = s.id);
id | name
----+-------
S1 | Smith
S2 | Jones
S3 | Blake
S4 | Clark
(4 rows)
... and indeed we get the other four suppliers. You can also do this with an inner join from suppliers
to suppliers_parts
but this time you have to make sure you deduplicate the results:
select distinct s.id,
s.name
from suppliers s
join suppliers_parts sp
on s.id = sp.supplier_id;
id | name
----+-------
S3 | Blake
S4 | Clark
S2 | Jones
S1 | Smith
(4 rows)
Again, I feel like the first query has significantly clearer intent.
Often you have to deal with legacy schemas which lack explicit data but there are on which you can infer/compute those data. For example, let's say that we want to choose a supplier for each part based on a rank that's not directly available in any table. But we do have the names for each supplier and we know which ones are preferred over others by name. Imagine we have the following preference logic for suppliers: if Jones carries it, then choose that; if not and Clark has it then choose that, then Smith, then Adams, and then lastly Blake. We exploit case
to set up a ranking score and then min
to select the best (in this case the minimum rank):
select sp.part_id,
sp.supplier_id,
s.name
from suppliers_parts sp
join suppliers s
on sp.supplier_id = s.id
where case
when s.name = 'Jones' then 1
when s.name = 'Clark' then 2
when s.name = 'Smith' then 3
when s.name = 'Adams' then 4
when s.name = 'Blake' then 5
end =
(select min(case
when s2.name = 'Jones' then 1
when s2.name = 'Clark' then 2
when s2.name = 'Smith' then 3
when s2.name = 'Adams' then 4
when s2.name = 'Blake' then 5
end)
from suppliers_parts sp2
join suppliers s2
on sp2.supplier_id = s2.id
where sp2.part_id = sp.part_id)
order by sp.part_id;
---------+-------------+-------
P1 | S2 | Jones
P2 | S2 | Jones
P3 | S1 | Smith
P4 | S4 | Clark
P5 | S4 | Clark
P6 | S1 | Smith
(6 rows)
If you look at all the rows in suppliers_parts
, you can verify that the suppliers listed above are the preferable ones according to the ranking of supplier names.
Let's say you wanted to tabulate which suppliers carried each part, with a column dedicated to each possible supplier. In this case, you want to effectively flip part of the suppliers_parts
table on its side, and select
ing each supplier:
select sp.part_id,
max(case when sp.supplier_id = 'S1' then 'X' else null end) as s1,
max(case when sp.supplier_id = 'S2' then 'X' else null end) as s2,
max(case when sp.supplier_id = 'S3' then 'X' else null end) as s3,
max(case when sp.supplier_id = 'S4' then 'X' else null end) as s4,
max(case when sp.supplier_id = 'S5' then 'X' else null end) as s5
from suppliers_parts sp
group by sp.part_id
order by sp.part_id;
part_id | s1 | s2 | s3 | s4 | s5
---------+----+----+----+----+----
P1 | X | X | | |
P2 | X | X | X | X |
P3 | X | | | |
P4 | X | | | X |
P5 | X | | | X |
P6 | X | | | |
(6 rows)
Let's say you wanted to get a list of all parts and suppliers, and wanted to sort and index each entry in suppliers_parts
by the supplier name. You can take advantage of windows functions again here
select sp.part_id,
p.name,
sp.supplier_id,
s.name,
row_number() over (partition by sp.part_id order by s.name) as sort_num
from suppliers_parts sp
join suppliers s
on sp.supplier_id = s.id
join parts p
on sp.part_id = p.id;
part_id | name | supplier_id | name | sort_num
---------+-------+-------------+-------+----------
P1 | Nut | S2 | Jones | 1
P1 | Nut | S1 | Smith | 2
P2 | Bolt | S3 | Blake | 1
P2 | Bolt | S4 | Clark | 2
P2 | Bolt | S2 | Jones | 3
P2 | Bolt | S1 | Smith | 4
P3 | Screw | S1 | Smith | 1
P4 | Screw | S4 | Clark | 1
P4 | Screw | S1 | Smith | 2
P5 | Cam | S4 | Clark | 1
P5 | Cam | S1 | Smith | 2
P6 | Cog | S1 | Smith | 1
(12 rows)
You can see above that the counts start over for each part ID.
There are times when row_number
is not sufficient to derive a proper rank; it's ok for sorting records but not good if you want to see when there are column values that repeat within a grouping. Consider the following:
select sp.supplier_id,
sp.part_id,
sp.quantity,
row_number() over (partition by sp.supplier_id order by sp.quantity desc) as sort_number
from suppliers_parts sp
supplier_id | part_id | quantity | sort_number
-------------+---------+----------+-------------
S1 | P3 | 400 | 1
S1 | P1 | 300 | 2
S1 | P4 | 200 | 3
S1 | P2 | 200 | 4
S1 | P5 | 100 | 5
S1 | P6 | 100 | 6
S2 | P2 | 400 | 1
S2 | P1 | 300 | 2
S3 | P2 | 200 | 1
S4 | P5 | 400 | 1
S4 | P4 | 300 | 2
S4 | P2 | 200 | 3
(12 rows)
For some of the suppliers above, there parts for which the quantities are the same and so you might want them ranked the same. If we want such duplicates taken into consideration, then we can use rank
instead:
select sp.supplier_id,
sp.part_id,
sp.quantity,
rank() over (partition by sp.supplier_id order by sp.quantity desc) as quantity_rank
from suppliers_parts sp;
supplier_id | part_id | quantity | quantity_rank
-------------+---------+----------+---------------
S1 | P3 | 400 | 1
S1 | P1 | 300 | 2
S1 | P4 | 200 | 3
S1 | P2 | 200 | 3
S1 | P5 | 100 | 5
S1 | P6 | 100 | 5
S2 | P2 | 400 | 1
S2 | P1 | 300 | 2
S3 | P2 | 200 | 1
S4 | P5 | 400 | 1
S4 | P4 | 300 | 2
S4 | P2 | 200 | 3
(12 rows)
We can further refine this query if we don't want to skip over ranks. That is, if for S1
above, we may want parts P5
, and P6
ranked 4 instead of 5. For that, we can use dense_rank
:
select sp.supplier_id,
sp.part_id,
sp.quantity,
dense_rank() over (partition by sp.supplier_id order by sp.quantity desc) as quantity_rank
from suppliers_parts sp;
supplier_id | part_id | quantity | quantity_rank
-------------+---------+----------+---------------
S1 | P3 | 400 | 1
S1 | P1 | 300 | 2
S1 | P4 | 200 | 3
S1 | P2 | 200 | 3
S1 | P5 | 100 | 4
S1 | P6 | 100 | 4
S2 | P2 | 400 | 1
S2 | P1 | 300 | 2
S3 | P2 | 200 | 1
S4 | P5 | 400 | 1
S4 | P4 | 300 | 2
S4 | P2 | 200 | 3
(12 rows)
There are times when you may want to display records corresponding to columns that have no rows in the database. For example, you may have to list sales or production data for a series of days or months, and for the instances that there are no such figures, you'd like to display a zero value nonetheless. It's situations like this that you'd like to somehow produce rows containing a range of values.
Different databases provide different means of doing this; in PostgreSQL, you can use the table function, generate_series
, to accomplish this. Since there are no data columns in our suppliers/parts schema, let's manufacture an example whereby we want to display the number of parts per weight (in this case from 10 to 20), but with the additional requirement that we want a zero count for weights that don't exist in the parts
table:
generate_series vs. recursive query
with all_weights(weight) as
(select *
from generate_series(10, 20))
select aw.weight,
count(p.id) as part_count
from all_weights aw
left outer join parts p
on p.weight = aw.weight
group by aw.weight
order by aw.weight;
weight | part_count
--------+------------
10 | 0
11 | 0
12 | 2
13 | 0
14 | 1
15 | 0
16 | 0
17 | 2
18 | 0
19 | 1
20 | 0
(11 rows)
There are times when it is really useful to be able to select current and previous values in the same row. Let's say we want to look at all suppliers and for each one show their respective quantities and parts, as well as ythe
select supplier_id,
part_id,
quantity,
lag(part_id) over (partition by supplier_id order by quantity) as previous_part_id,
lag(quantity) over (partition by supplier_id order by quantity) as previous_quantity
from suppliers_parts
order by supplier_id,
quantity;
supplier_id | part_id | quantity | previous_part_id | previous_quantity
-------------+---------+----------+------------------+-------------------
S1 | P5 | 100 | |
S1 | P6 | 100 | P5 | 100
S1 | P2 | 200 | P6 | 100
S1 | P4 | 200 | P2 | 200
S1 | P1 | 300 | P4 | 200
S1 | P3 | 400 | P1 | 300
S2 | P1 | 300 | |
S2 | P2 | 400 | P1 | 300
S3 | P2 | 200 | |
S4 | P2 | 200 | |
S4 | P4 | 300 | P2 | 200
S4 | P5 | 400 | P4 | 300
You can also get the next values within each row :
select supplier_id,
part_id,
quantity,
lead(part_id) over (partition by supplier_id order by quantity) as next_part_id,
lead(quantity) over (partition by supplier_id order by quantity) as next_quantity
from suppliers_parts
order by supplier_id,
quantity;
supplier_id | part_id | quantity | next_part_id | next_quantity
-------------+---------+----------+--------------+---------------
S1 | P5 | 100 | P6 | 100
S1 | P6 | 100 | P2 | 200
S1 | P2 | 200 | P4 | 200
S1 | P4 | 200 | P1 | 300
S1 | P1 | 300 | P3 | 400
S1 | P3 | 400 | |
S2 | P1 | 300 | P2 | 400
S2 | P2 | 400 | |
S3 | P2 | 200 | |
S4 | P2 | 200 | P4 | 300
S4 | P4 | 300 | P5 | 400
S4 | P5 | 400 | |
(12 rows)
Note that for the instances where there is no previous or next value, SQL will generate a null
.