bcobb/01_SQL tricks.md

SQL Tricks

Purpose

This purpose of this is to document all the tricks that I've learned over the last couple of decades (good grief, has it really been that long?) from interacting with and developing for databases.

I have organized this as an FAQ of sorts rather than by technical concept, e.g., scalar subuery, chasm trap, etc., because when you don't know how to express something in code, you can't know what the solution type is called or named. My other endeavor is to capture the more thornier types of queries that can come up as requirements, instead of just itemizing trivial, common examples.

Nota bene

This is still a work in progress! And suggestions or comments are welcome!

Example schema

All of the SQL statements in this can be used against the canonical supplier/parts database schema used in many of C.J. Date's books: https://wiki.c2.com/?SupplierPartsDatabase

The tables used are:

Suppliers

  id  name   status   city
  ----------------------------
  S1  Smith  20       London
  S2  Jones  10       Paris
  S3  Blake  30       Paris
  S4  Clark  20       London
  S5  Adams  30       Athens

Parts

  id  name   color  weight  city
  ----------------------------------
  P1  Nut    Red    12.0    London
  P2  Bolt   Green  17.0    Paris
  P3  Screw  Blue   17.0    Oslo
  P4  Screw  Red    14.0    London
  P5  Cam    Blue   12.0    Paris
  P6  Cog    Red    19.0    London

Supplier parts

  supplier_id  part_id  quantity
  ------------------------------
  S1           P1       300
  S1           P2       200
  S1           P3       400
  S1           P4       200
  S1           P5       100
  S1           P6       100
  S2           P1       300
  S2           P2       400
  S3           P2       200
  S4           P2       200
  S4           P4       300
  S4           P5       400

A SQL script is attached this gist to allow you to create the schema in your own database.

Selecting records

How do I get a list of items and the count of each in a table?

Let's say that we want to know which parts are available and the number of suppliers for each. In this case, we want to group by the part_id, and simply count the number of supplier_ids:

select  part_id,
        count(supplier_id)
from    suppliers_parts
group by part_id;

 part_id | count
---------+-------
 P2      |     4
 P1      |     2
 P6      |     1
 P4      |     2
 P5      |     2
 P3      |     1
(6 rows)

How do I get a list of items and the sum of another column in a table?

In this case, we still want to group by the part ID but we want to sum the quantity for each:

select  part_id,
        sum(quantity)
from    suppliers_parts
group by part_id;

 part_id | sum
---------+------
 P2      | 1000
 P1      |  600
 P6      |  100
 P4      |  500
 P5      |  500
 P3      |  400
(6 rows)

How do I get a list of items and filter the result set on some aggregate?

What if you wanted to only include those parts for which at least a certain about, say 500, were available? Then we can filter records out using the having clause. You can think of it like where : select :: having : group by:

select  part_id,
        sum(quantity)
from    suppliers_parts
group by part_id
having   sum(quantity) > 500;

 part_id | sum
---------+------
 P2      | 1000
 P1      |  600
(2 rows)

How do I concatenate a list of string values each relavant to a row?

Let's say you wanted get a list of all the supplier names for each part. You can take advantage of string_agg here just like you would any other SQL aggregate function like max or sum:

select  sp.part_id,
        string_agg(s.name, ', ') as supplier_list
from    suppliers_parts sp
join    suppliers s
on      sp.supplier_id = s.id
group by sp.part_id;

 part_id |       supplier_list
---------+----------------------------
 P2      | Smith, Jones, Blake, Clark
 P1      | Smith, Jones
 P6      | Smith
 P4      | Smith, Clark
 P5      | Smith, Clark
 P3      | Smith
(6 rows)

How do I join two tables and make sure rows don't drop out?

It's very common to need to join two or more tables but wind up dropping rows because certain rows from the first table don't have corresponding entries in the other. For instance, let's see we want to get a total count of quantities for all known parts, even if there are parts for which none is available. If we do this with a simple join one of the suppliers drops out:

select  s.id,
        s.name,
        sum(sp.quantity) as total_quantity
from    suppliers s
join    suppliers_parts sp
on      s.id = sp.supplier_id
group by s.id,
         s.name
order by s.id

 id | name  | total_quantity
----+-------+----------------
 S1 | Smith |           1300
 S2 | Jones |            700
 S3 | Blake |            200
 S4 | Clark |            900
(4 rows)

This is exactly what left outer join is useful for:

select  s.id,
        s.name,
        sum(sp.quantity) as total_quantity
from    suppliers s
left outer join suppliers_parts sp
on      s.id = sp.supplier_id
group by s.id,
         s.name
order by s.id

 id | name  | total_quantity
----+-------+----------------
 S1 | Smith |           1300
 S2 | Jones |            700
 S3 | Blake |            200
 S4 | Clark |            900
 S5 | Adams |
(5 rows)

How do I join two tables and ensure I get all rows back if some rows from one are missing from the other and vice versa?

This is a rather contrived example, but let's say we want to see more clearly which suppliers and parts are from the same city, and which suppliers lack an accompanying part and which parts don't have a supplier in the same city, all at once. For something like this, we can do a full outer join on the city column:

select  p.name,
        p.city,
        s.name,
        s.city
from    parts p
full outer join suppliers s
on      p.city = s.city

 name  |  city  | name  |  city
-------+--------+-------+--------
 Cog   | London | Smith | London
 Screw | London | Smith | London
 Nut   | London | Smith | London
 Cam   | Paris  | Jones | Paris
 Bolt  | Paris  | Jones | Paris
 Cam   | Paris  | Blake | Paris
 Bolt  | Paris  | Blake | Paris
 Cog   | London | Clark | London
 Screw | London | Clark | London
 Nut   | London | Clark | London
       |        | Adams | Athens
 Screw | Oslo   |       |
(12 rows)

You can clearly see that Adams has no part in Athens and the Screw has no supplier in Oslo.

How do I compare two tables/record sets and see which rows exist in one but not the either?

One way to do this is indeed to use an exists clause but if the two sets of records have the same structure, an easier way to express this is to use except in PostgreSQL (or minus in Oracle). Let's say we want to find out which cities making parts are not home to any suppliers?

select  p.city
from    parts p
except
select  s.city
from    suppliers s;

 city
------
 Oslo
(1 row)

You can easily answer the reverse question too:

select  s.city
from    suppliers s
except
select  p.city
from    parts p;

  city
--------
 Athens
(1 row)

Note that the shapes and types of the rows from each select must be the same. You can also achieve this sort of thing using not exists, but this pattern here is more commonly used with the two tables/record sets are similar in structure.

How do I join a table to the results of another query?

Oftentimes, you would like to join the results of one query to another table but can't do so easily. It is tempting to create a temporary table to do this sort of thing, but you don't need to with the SQL in most databases. Instead you can use what is called an inline view to produce a "temporary" result set and then join to that.

In most cases, you don't really need an inline view but sometimes it's easier to express things in an inline view first rather writing one top level query to properly aggregrate everything all at once. Again, needing to do this is sometimes symptomatic of a non-optimal schema.

Let's say we want to first get the set of cities and number of parts made for each, and then merge that to the set of suppliers; we can do this:

select  s.id,
        s.name,
        s.city,
        c.part_count
from    suppliers s
join   (select  p.city,
                count(p.id) as part_count
        from    parts p
        group by p.city) as c
on      s.city = c.city

 id | name  |  city  | part_count
----+-------+--------+------------
 S1 | Smith | London |          3
 S2 | Jones | Paris  |          2
 S3 | Blake | Paris  |          2
 S4 | Clark | London |          3
(4 rows)

Note that one of the suppliers drops out above. One way around that is to use a so-called scalar subquery instead:

select  s.id,
        s.name,
        s.city,
       (select  count(p.id)
        from    parts p
        where   p.city = s.city) as part_count
from    suppliers s

 id | name  |  city  | count
----+-------+--------+-------
 S1 | Smith | London |     3
 S2 | Jones | Paris  |     2
 S3 | Blake | Paris  |     2
 S4 | Clark | London |     3
 S5 | Adams | Athens |     0
(5 rows)

There are times which this is the easiest and most expressive way to obtain what you want but beware that this query is less performant as the scalar subquery is "executed" once per row in the outer query.

How do I compute multiple aggregations at the same time without counting rows more than once?

Let's say that ultimately we want a list of the suppliers, the count of parts available in the same city as each supplier, and the total quantities of all parts available for each supplier. We can first get suppliers and part counts with this query:

select  s.id,
        s.name,
        s.city,
        c.part_count
from    suppliers s
join   (select  p.city,
                count(p.id) as part_count
        from    parts p
        group by p.city) as c
on      s.city = c.city;

 id | name  |  city  | part_count
----+-------+--------+------------
 S1 | Smith | London |          3
 S2 | Jones | Paris  |          2
 S3 | Blake | Paris  |          2
 S4 | Clark | London |          3
(4 rows)

If we then add logic to get part quantities for each supplier we can arrive at this:

select  s.id,
        s.name,
        count(p.id) as part_count,
        sum(sp.quantity) as total_quantity
from    suppliers s
join    parts p
on      s.city = p.city
join    suppliers_parts sp
on      s.id = sp.supplier_id
group by s.id,
         s.name;

 id | name  | part_count | total_quantity
----+-------+------------+----------------
 S3 | Blake |          2 |            400
 S1 | Smith |         18 |           3900
 S2 | Jones |          4 |           1400
 S4 | Clark |          9 |           2700
(4 rows)

Uh oh... what happened? Why did the part counts suddenly go up? It's due to something called chasm trap, which is what happens when you join one table to two other tables and try aggregating on them both at the same time. By joining all three tables at onces like this, we effectively create a Cartesian product of records, thereby counting/summing column values more than once.

The way out of this problem is to inline each aggregation separately and then join them:

select  pc.id,
        pc.name,
        pc.part_count,
        tq.total_quantity
from   (select  s.id,
                s.name,
                count(p.id) as part_count
        from    suppliers s
        join    parts p
        on      s.city = p.city
        group by s.id,
                 s.name) pc
join   (select  sp.supplier_id,
                sum(sp.quantity) as total_quantity
        from    suppliers_parts sp
        group by sp.supplier_id) as tq
on      pc.id = tq.supplier_id;

 id | name  | part_count | total_quantity
----+-------+------------+----------------
 S2 | Jones |          2 |            700
 S1 | Smith |          3 |           1300
 S3 | Blake |          2 |            200
 S4 | Clark |          3 |            900
(4 rows)

At this point you may want to take advantage of a technique called subquery refactoring which allows you to move parts of the main query out into a with clause

with part_counts as
       (select  s.id,
                s.name,
                count(p.id) as part_count
        from    suppliers s
        join    parts p
        on      s.city = p.city
        group by s.id,
                 s.name),
total_quantities as
       (select  sp.supplier_id,
                sum(sp.quantity) as total_quantity
        from    suppliers_parts sp
        group by sp.supplier_id)
select  pc.id,
        pc.name,
        pc.part_count,
        tq.total_quantity
from    part_counts pc
join    total_quantities tq
on      pc.id = tq.supplier_id;

This makes the main portion of the query more succuinct and elucidate what you actually want to do at the top level.

How do get a list of rows for which a certain column value does not exist in a table?

There's two ways to do this, one of which is more expressive of the other in my humble opinion. Let's say you wanted to figure out which of the suppliers has no parts? One way is to scan through the list of records in suppliers and leverage a not exists clause:

select  s.id,
        s.name
from    suppliers s
where not exists
       (select  'x'
        from    suppliers_parts sp
        where   sp.supplier_id = s.id);

 id | name
----+-------
 S5 | Adams
(1 row)

Note that the inner subquery just needs to return some value; it doesn't matter what it is. Another way is to do an outer join from suppliers to suppliers_parts and see which rows have a null value for the part_id:

select  s.id,
        s.name
from    suppliers s
left outer join suppliers_parts sp
on      s.id = sp.supplier_id
where   sp.supplier_id is null;

 id | name
----+-------
 S5 | Adams
(1 row)

I feel like the first query has clearer intent.

How do get a list of rows for which a certain ID does exist in another table?

There's again two ways to do this. Let's say you wanted to know which of the suppliers does offer parts? Likewise, one way is to scan through the list of records in suppliers and leverage an exists clause this time:

select  s.id,
        s.name
from    suppliers s
where exists
       (select  'x'
        from    suppliers_parts sp
        where   sp.supplier_id = s.id);

 id | name
----+-------
 S1 | Smith
 S2 | Jones
 S3 | Blake
 S4 | Clark
(4 rows)

... and indeed we get the other four suppliers. You can also do this with an inner join from suppliers to suppliers_parts but this time you have to make sure you deduplicate the results:

select  distinct s.id,
        s.name
from    suppliers s
join    suppliers_parts sp
on      s.id = sp.supplier_id;

 id | name
----+-------
 S3 | Blake
 S4 | Clark
 S2 | Jones
 S1 | Smith
(4 rows)

Again, I feel like the first query has significantly clearer intent.

How do I select records based on ranking of column values?

Often you have to deal with legacy schemas which lack explicit data but there are on which you can infer/compute those data. For example, let's say that we want to choose a supplier for each part based on a rank that's not directly available in any table. But we do have the names for each supplier and we know which ones are preferred over others by name. Imagine we have the following preference logic for suppliers: if Jones carries it, then choose that; if not and Clark has it then choose that, then Smith, then Adams, and then lastly Blake. We exploit case to set up a ranking score and then min to select the best (in this case the minimum rank):

select  sp.part_id,
        sp.supplier_id,
        s.name
from    suppliers_parts sp
join    suppliers s
on      sp.supplier_id = s.id
where   case
          when s.name = 'Jones' then 1
          when s.name = 'Clark' then 2
          when s.name = 'Smith' then 3
          when s.name = 'Adams' then 4
          when s.name = 'Blake' then 5
        end =
       (select  min(case
                      when s2.name = 'Jones' then 1
                      when s2.name = 'Clark' then 2
                      when s2.name = 'Smith' then 3
                      when s2.name = 'Adams' then 4
                      when s2.name = 'Blake' then 5
                    end)
        from    suppliers_parts sp2
        join    suppliers s2
        on      sp2.supplier_id = s2.id
        where   sp2.part_id = sp.part_id)
order by sp.part_id;

---------+-------------+-------
 P1      | S2          | Jones
 P2      | S2          | Jones
 P3      | S1          | Smith
 P4      | S4          | Clark
 P5      | S4          | Clark
 P6      | S1          | Smith
(6 rows)

If you look at all the rows in suppliers_parts, you can verify that the suppliers listed above are the preferable ones according to the ranking of supplier names.

How do I pivot row values to column values?

Let's say you wanted to tabulate which suppliers carried each part, with a column dedicated to each possible supplier. In this case, you want to effectively flip part of the suppliers_parts table on its side, and selecting each supplier:

select  sp.part_id,
        max(case when sp.supplier_id = 'S1' then 'X' else null end) as s1,
        max(case when sp.supplier_id = 'S2' then 'X' else null end) as s2,
        max(case when sp.supplier_id = 'S3' then 'X' else null end) as s3,
        max(case when sp.supplier_id = 'S4' then 'X' else null end) as s4,
        max(case when sp.supplier_id = 'S5' then 'X' else null end) as s5
from    suppliers_parts sp
group by sp.part_id
order by sp.part_id;

 part_id | s1 | s2 | s3 | s4 | s5
---------+----+----+----+----+----
 P1      | X  | X  |    |    |
 P2      | X  | X  | X  | X  |
 P3      | X  |    |    |    |
 P4      | X  |    |    | X  |
 P5      | X  |    |    | X  |
 P6      | X  |    |    |    |
(6 rows)

How do I generate a sort index per aggregate?

Let's say you wanted to get a list of all parts and suppliers, and wanted to sort and index each entry in suppliers_parts by the supplier name. You can take advantage of windows functions again here

select  sp.part_id,
        p.name,
        sp.supplier_id,
        s.name,
        row_number() over (partition by sp.part_id order by s.name) as sort_num
from    suppliers_parts sp
join    suppliers s
on      sp.supplier_id = s.id
join    parts p
on      sp.part_id = p.id;

 part_id | name  | supplier_id | name  | sort_num
---------+-------+-------------+-------+----------
 P1      | Nut   | S2          | Jones |        1
 P1      | Nut   | S1          | Smith |        2
 P2      | Bolt  | S3          | Blake |        1
 P2      | Bolt  | S4          | Clark |        2
 P2      | Bolt  | S2          | Jones |        3
 P2      | Bolt  | S1          | Smith |        4
 P3      | Screw | S1          | Smith |        1
 P4      | Screw | S4          | Clark |        1
 P4      | Screw | S1          | Smith |        2
 P5      | Cam   | S4          | Clark |        1
 P5      | Cam   | S1          | Smith |        2
 P6      | Cog   | S1          | Smith |        1
(12 rows)

You can see above that the counts start over for each part ID.

How do I get a more accurate ranking if there are "ties"?

There are times when row_number is not sufficient to derive a proper rank; it's ok for sorting records but not good if you want to see when there are column values that repeat within a grouping. Consider the following:

select  sp.supplier_id,
        sp.part_id,
        sp.quantity,
        row_number() over (partition by sp.supplier_id order by sp.quantity desc) as sort_number
from    suppliers_parts sp

 supplier_id | part_id | quantity | sort_number
-------------+---------+----------+-------------
 S1          | P3      |      400 |           1
 S1          | P1      |      300 |           2
 S1          | P4      |      200 |           3
 S1          | P2      |      200 |           4
 S1          | P5      |      100 |           5
 S1          | P6      |      100 |           6
 S2          | P2      |      400 |           1
 S2          | P1      |      300 |           2
 S3          | P2      |      200 |           1
 S4          | P5      |      400 |           1
 S4          | P4      |      300 |           2
 S4          | P2      |      200 |           3
(12 rows)

For some of the suppliers above, there parts for which the quantities are the same and so you might want them ranked the same. If we want such duplicates taken into consideration, then we can use rank instead:

select  sp.supplier_id,
        sp.part_id,
        sp.quantity,
        rank() over (partition by sp.supplier_id order by sp.quantity desc) as quantity_rank
from    suppliers_parts sp;

 supplier_id | part_id | quantity | quantity_rank
-------------+---------+----------+---------------
 S1          | P3      |      400 |             1
 S1          | P1      |      300 |             2
 S1          | P4      |      200 |             3
 S1          | P2      |      200 |             3
 S1          | P5      |      100 |             5
 S1          | P6      |      100 |             5
 S2          | P2      |      400 |             1
 S2          | P1      |      300 |             2
 S3          | P2      |      200 |             1
 S4          | P5      |      400 |             1
 S4          | P4      |      300 |             2
 S4          | P2      |      200 |             3
(12 rows)

We can further refine this query if we don't want to skip over ranks. That is, if for S1 above, we may want parts P5, and P6 ranked 4 instead of 5. For that, we can use dense_rank:

select  sp.supplier_id,
        sp.part_id,
        sp.quantity,
        dense_rank() over (partition by sp.supplier_id order by sp.quantity desc) as quantity_rank
from    suppliers_parts sp;

 supplier_id | part_id | quantity | quantity_rank
-------------+---------+----------+---------------
 S1          | P3      |      400 |             1
 S1          | P1      |      300 |             2
 S1          | P4      |      200 |             3
 S1          | P2      |      200 |             3
 S1          | P5      |      100 |             4
 S1          | P6      |      100 |             4
 S2          | P2      |      400 |             1
 S2          | P1      |      300 |             2
 S3          | P2      |      200 |             1
 S4          | P5      |      400 |             1
 S4          | P4      |      300 |             2
 S4          | P2      |      200 |             3
(12 rows)

How do I generate rows for which there are no values for a specific column?

There are times when you may want to display records corresponding to columns that have no rows in the database. For example, you may have to list sales or production data for a series of days or months, and for the instances that there are no such figures, you'd like to display a zero value nonetheless. It's situations like this that you'd like to somehow produce rows containing a range of values.

Different databases provide different means of doing this; in PostgreSQL, you can use the table function, generate_series, to accomplish this. Since there are no data columns in our suppliers/parts schema, let's manufacture an example whereby we want to display the number of parts per weight (in this case from 10 to 20), but with the additional requirement that we want a zero count for weights that don't exist in the parts table:

generate_series vs. recursive query

with all_weights(weight) as
       (select  *
        from    generate_series(10, 20))
select  aw.weight,
        count(p.id) as part_count
from    all_weights aw
left outer join parts p
on      p.weight = aw.weight
group by aw.weight
order by aw.weight;

 weight | part_count
--------+------------
     10 |          0
     11 |          0
     12 |          2
     13 |          0
     14 |          1
     15 |          0
     16 |          0
     17 |          2
     18 |          0
     19 |          1
     20 |          0
(11 rows)

How do I not only sort records but see their previous or next values in the same row?

There are times when it is really useful to be able to select current and previous values in the same row. Let's say we want to look at all suppliers and for each one show their respective quantities and parts, as well as ythe

select  supplier_id,
        part_id,
        quantity,
        lag(part_id) over (partition by supplier_id order by quantity) as previous_part_id,
        lag(quantity) over (partition by supplier_id order by quantity) as previous_quantity
from suppliers_parts
order by supplier_id,
         quantity;

 supplier_id | part_id | quantity | previous_part_id | previous_quantity
-------------+---------+----------+------------------+-------------------
 S1          | P5      |      100 |                  |
 S1          | P6      |      100 | P5               |               100
 S1          | P2      |      200 | P6               |               100
 S1          | P4      |      200 | P2               |               200
 S1          | P1      |      300 | P4               |               200
 S1          | P3      |      400 | P1               |               300
 S2          | P1      |      300 |                  |
 S2          | P2      |      400 | P1               |               300
 S3          | P2      |      200 |                  |
 S4          | P2      |      200 |                  |
 S4          | P4      |      300 | P2               |               200
 S4          | P5      |      400 | P4               |               300

You can also get the next values within each row :

select  supplier_id,
        part_id,
        quantity,
        lead(part_id) over (partition by supplier_id order by quantity) as next_part_id,
        lead(quantity) over (partition by supplier_id order by quantity) as next_quantity
from suppliers_parts
order by supplier_id,
         quantity;

 supplier_id | part_id | quantity | next_part_id | next_quantity
-------------+---------+----------+--------------+---------------
 S1          | P5      |      100 | P6           |           100
 S1          | P6      |      100 | P2           |           200
 S1          | P2      |      200 | P4           |           200
 S1          | P4      |      200 | P1           |           300
 S1          | P1      |      300 | P3           |           400
 S1          | P3      |      400 |              |
 S2          | P1      |      300 | P2           |           400
 S2          | P2      |      400 |              |
 S3          | P2      |      200 |              |
 S4          | P2      |      200 | P4           |           300
 S4          | P4      |      300 | P5           |           400
 S4          | P5      |      400 |              |
(12 rows)

Note that for the instances where there is no previous or next value, SQL will generate a null.

	create table suppliers
	(id varchar(2),
	name varchar(20) not null,
	status numeric not null,
	city varchar(20) not null,
	primary key (id));

	insert into suppliers
	(id, name, status, city)
	values ('S1', 'Smith', 20, 'London'),
	('S2', 'Jones', 10, 'Paris'),
	('S3', 'Blake', 30, 'Paris'),
	('S4', 'Clark', 20, 'London'),
	('S5', 'Adams', 30, 'Athens');

	create table parts
	(id varchar(2),
	name varchar(20) not null,
	color varchar(20) not null,
	weight numeric not null,
	city varchar(20) not null,
	primary key (id));

	insert into parts
	(id, name, color, weight, city)
	values ('P1', 'Nut', 'Red', 12.0, 'London'),
	('P2', 'Bolt', 'Green', 17.0, 'Paris'),
	('P3', 'Screw', 'Blue', 17.0, 'Oslo'),
	('P4', 'Screw', 'Red', 14.0, 'London'),
	('P5', 'Cam', 'Blue', 12.0, 'Paris'),
	('P6', 'Cog', 'Red', 19.0, 'London');

	create table suppliers_parts
	(supplier_id varchar(2),
	part_id varchar(2),
	quantity numeric not null,
	foreign key (supplier_id) references suppliers (id),
	foreign key (part_id) references parts (id));

	insert into suppliers_parts
	(supplier_id, part_id, quantity)
	values ('S1', 'P1', 300),
	('S1', 'P2', 200),
	('S1', 'P3', 400),
	('S1', 'P4', 200),
	('S1', 'P5', 100),
	('S1', 'P6', 100),
	('S2', 'P1', 300),
	('S2', 'P2', 400),
	('S3', 'P2', 200),
	('S4', 'P2', 200),
	('S4', 'P4', 300),
	('S4', 'P5', 400);