Skip to content

Instantly share code, notes, and snippets.

@davetapley
Last active August 29, 2015 14:02
Show Gist options
  • Save davetapley/31da5a225f0e790f8084 to your computer and use it in GitHub Desktop.
Save davetapley/31da5a225f0e790f8084 to your computer and use it in GitHub Desktop.
Consolidating ranges across a union in Postgresql

Two tables share a user_id, a tstzrange called period, then each have another column called category and place_id respectively. Both have constraints allowing a user to only have one 'other column' value at any time:

 user_id    | integer                  | not null
 period     | tstzrange                | not null
 category   | character varying(16)    |

    "category_events_user_id_period_excl" EXCLUDE USING gist (user_id WITH =, period WITH &&)
 user_id       | integer                  | not null
 period        | tstzrange                | not null
 place_id      | integer                  |

    "dwell_events_user_id_period_excl" EXCLUDE USING gist (user_id WITH =, period WITH &&)

I can use union to join them to merge both tables for a user, thus:

SELECT period, place_id AS place , '' AS category
  FROM dwell_events WHERE user_id = 14845
UNION SELECT period, null, category
  FROM category_events WHERE user_id = 14845;

                       period                        | place | category
-----------------------------------------------------+-------+----------
 ["2014-06-10 22:43:26+00","2014-06-10 23:04:43+00"] |  1022 |
 ["2014-06-04 18:14:37+00","2014-06-04 18:32:02+00"] |  1022 |
 ["2014-06-10 22:43:26+00","2014-06-10 23:04:43+00"] |       | work
 ["2014-06-07 23:44:41+00","2014-06-08 00:00:14+00"] |  3381 |
 ["2014-06-04 18:14:37+00","2014-06-04 18:32:02+00"] |       | work
 ["2014-06-07 23:44:41+00","2014-06-08 00:00:14+00"] |       | social

Next I added group and a dubious use of MAX to merge and sort them, thus:

SELECT period, MAX(place) AS place, MAX(category) AS category FROM (
  SELECT user_id, period, place_id AS place , '' AS category
    FROM dwell_events
  UNION SELECT user_id, period, null, category
    FROM category_events
) t

WHERE user_id = 14845
GROUP BY period
ORDER BY period;

-----------------------------------------------------+-------+----------
 ["2014-06-04 18:14:37+00","2014-06-04 18:32:02+00"] |  1022 | work
 ["2014-06-07 23:44:41+00","2014-06-08 00:00:14+00"] |  3381 | social
 ["2014-06-10 22:43:26+00","2014-06-10 23:04:43+00"] |  1022 | work

Unfortunately it's not always the case that the periods in the two tables will line up so well. In this next example the later work category row has been deleted, and the preceding social one has been expanded in to its range. You can see that my current naive implementation starts to fall over:

-----------------------------------------------------+------+--------
 ["2014-06-04 18:14:37+00","2014-06-04 18:32:02+00"] | 1022 | work
 ["2014-06-07 23:44:41+00","2014-06-08 00:00:14+00"] | 3381 |
 ["2014-06-07 23:44:41+00","2014-06-10 23:04:43+00"] |      | social
 ["2014-06-10 22:43:26+00","2014-06-10 23:04:43+00"] | 1022 |

In this situation I'd like the nulls to be filled in with the corresponding overlapping range value, thus:

-----------------------------------------------------+------+--------
 ["2014-06-04 18:14:37+00","2014-06-04 18:32:02+00"] | 1022 | work
 ["2014-06-07 23:44:41+00","2014-06-08 00:00:14+00"] | 3381 | social
 ["2014-06-10 22:43:26+00","2014-06-10 23:04:43+00"] | 1022 | social

And this is where I'm stuck!

In some cases we may also not have a complete matching range available in either table. For example here I have reduced the length of the range for the latter place 1022 row by 4 minutes, producing this output with my current naive query:

-----------------------------------------------------+------+--------
 ["2014-06-04 18:14:37+00","2014-06-04 18:32:02+00"] | 1022 | work
 ["2014-06-07 23:44:41+00","2014-06-08 00:00:14+00"] | 3381 |
 ["2014-06-07 23:44:41+00","2014-06-10 23:04:43+00"] |      | social
 ["2014-06-10 22:43:26+00","2014-06-10 23:00:00+00"] | 1022 |

In this case I'd like to minimize the duration of the range which contain a null. So we see that we have a consolidated range during which 1022 and social is true, then we have a row with a null for the place for the remaining 4 minutes (but see social).

-----------------------------------------------------+------+--------
 ["2014-06-04 18:14:37+00","2014-06-04 18:32:02+00"] | 1022 | work
 ["2014-06-07 23:44:41+00","2014-06-08 00:00:14+00"] | 3381 | social
 ["2014-06-10 22:43:26+00","2014-06-10 23:00:00+00"] | 1022 | social
 ["2014-06-07 23:00:00+00","2014-06-10 23:04:43+00"] |      | social

There should never be a case when both place_id and category are null.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment