Finding long running queries in Postgres.md

Finding long running queries in Postgres

There have been a few times in the last several months at $DAY_JOB during which the following query, or some slight variation of it, has been used to identify long-running and potentially problematic queries:

select  pid,
        now() - query_start as duration,
        query,
        state
from    pg_stat_activity

I believe that this can yield misleading results as other things need to be taken into consideration. To illustrate this, I'm going to iterate on the above query in one session, and issue other test queries in another one.

OK... let's start with a clean slate and run this query for the first time; on my local machine I happen to see the following:

  pid  |                  query                  | state  |    duration
-------+-----------------------------------------+--------+-----------------
 84306 | SHOW TRANSACTION ISOLATION LEVEL        | idle   | 00:25:31.513976
 84307 | SHOW TRANSACTION ISOLATION LEVEL        | idle   | 00:25:31.509142
 84318 | SHOW TRANSACTION ISOLATION LEVEL        | idle   | 00:25:01.31746
 84319 | SHOW TRANSACTION ISOLATION LEVEL        | idle   | 00:25:01.311403
 83673 | select  pid,                           +| active | 00:00:00
       |         query,                         +|        |
       |         state,                         +|        |
       |         now() - query_start as duration+|        |
       | from    pg_stat_activity;               |        |
(5 rows)

This is all stuff we can ignore for this demonstration so I'm going to alter the query a little bit to exclude them:

select  pid,
        now() - query_start as duration,
        query,
        state
from    pg_stat_activity
where   query not ilike '%pg_stat_activity%'
and     query not ilike '%transaction%';

This is a little better:

pid | duration | query | state
-----+----------+-------+-------
(0 rows)

Let's issue a query from another session:

genome_development=# \timing
Timing is on.
genome_development=# select 'Hi, I''m not a long-running query!';
             ?column?
-----------------------------------
 Hi, I'm not a long-running query!
(1 row)

Time: 0.314 ms

... and then wait a few moments before looking at things in our main session:

genome_development=# select  pid,
        now() - query_start as duration,
        query,
        state
from    pg_stat_activity
where   query not ilike '%pg_stat_activity%'
and     query not ilike '%transaction%';
  pid  |    duration    |                    query                     | state
-------+----------------+----------------------------------------------+-------
 85188 | 00:01:08.68768 | select 'Hi, I''m not a long-running query!'; | idle
(1 row)

What's this? This says that the duration is over 1 minute... but psql in the other session said that our query took 0.314 ms. What's happening here? The problem here is that there are other fields in this view that we need to look at to determine what's actually taking a long time. Notice that the state of this row is idle. That means the process that this row represents is not currently executing the query. To see this, let's issue this which is purposefully constructed to take 30 seconds to run:

select  'But _I_ am indeed a long-running query.' from pg_sleep(30);

... and quickly switch back to the main session to see what the view is tells us:

genome_development=# select  pid,
        now() - query_start as duration,
        query,
        state
        pg_stat_activity
where   query not ilike '%pg_stat_activity%'
and     query not ilike '%transaction%';
  pid  |    duration     |                                query                                | state
-------+-----------------+---------------------------------------------------------------------+--------
 85188 | 00:00:03.551807 | select 'But _I_ am indeed a long-running query.' from pg_sleep(30); | active
(1 row)

Wait... is this saying that the query took just under 4 seconds? No... because the process state is still active and the query is still running. If we wait for the full thirty seconds to pass, we should see a state change:

genome_development=# select  pid,
        now() - query_start as duration,
        query,
        state
        pg_stat_activity
where   query not ilike '%pg_stat_activity%'
and     query not ilike '%transaction%';
  pid  |    duration     |                                query                                | state
-------+-----------------+---------------------------------------------------------------------+-------
 85188 | 00:00:45.365226 | select 'But _I_ am indeed a long-running query.' from pg_sleep(30); | idle
(1 row)

Aha... the process is idle now... but wait... the duration is wrong again. What is going on? The problem this time is that we need to look at something in addition to the query_start field to determine how long a query actually took to run. It turns out that the pg_stat_activity view also has a state_change field that we can exploit to properly figure this out. According to the documentation, the state_change field is the "Time when the state was last changed". (Duh.) So, if a process completes executing a query, then its state will go from active to idle, and thus we should see that the difference between the state_change and query_start field will be the actual execution time: Let's see:

genome_development=# select  pid,
        query_start,
        state_change,
        query,
        state
from    pg_stat_activity
where   query not ilike '%pg_stat_activity%'
and     query not ilike '%transaction%';
  pid  |          query_start          |         state_change          |                                query                                | state
-------+-------------------------------+-------------------------------+---------------------------------------------------------------------+-------
 85188 | 2017-06-23 13:48:23.776142-04 | 2017-06-23 13:48:53.778372-04 | select 'But _I_ am indeed a long-running query.' from pg_sleep(30); | idle
(1 row)

Well, look at that... they're 30 seconds apart, just as we expected.

So... what did we learn here? What we need to do is make sure that we only look at processes that are currently active; those are the ones that are actually running queries.

quephird/Finding long running queries in Postgres.md

Finding long running queries in Postgres