There have been a few times in the last several months at $DAY_JOB during which the following query, or some slight variation of it, has been used to identify long-running and potentially problematic queries:
select pid,
now() - query_start as duration,
query,
state
from pg_stat_activity
I believe that this can yield misleading results as other things need to be taken into consideration. To illustrate this, I'm going to iterate on the above query in one session, and issue other test queries in another one.
OK... let's start with a clean slate and run this query for the first time; on my local machine I happen to see the following:
pid | query | state | duration
-------+-----------------------------------------+--------+-----------------
84306 | SHOW TRANSACTION ISOLATION LEVEL | idle | 00:25:31.513976
84307 | SHOW TRANSACTION ISOLATION LEVEL | idle | 00:25:31.509142
84318 | SHOW TRANSACTION ISOLATION LEVEL | idle | 00:25:01.31746
84319 | SHOW TRANSACTION ISOLATION LEVEL | idle | 00:25:01.311403
83673 | select pid, +| active | 00:00:00
| query, +| |
| state, +| |
| now() - query_start as duration+| |
| from pg_stat_activity; | |
(5 rows)
This is all stuff we can ignore for this demonstration so I'm going to alter the query a little bit to exclude them:
select pid,
now() - query_start as duration,
query,
state
from pg_stat_activity
where query not ilike '%pg_stat_activity%'
and query not ilike '%transaction%';
This is a little better:
pid | duration | query | state
-----+----------+-------+-------
(0 rows)
Let's issue a query from another session:
genome_development=# \timing
Timing is on.
genome_development=# select 'Hi, I''m not a long-running query!';
?column?
-----------------------------------
Hi, I'm not a long-running query!
(1 row)
Time: 0.314 ms
... and then wait a few moments before looking at things in our main session:
genome_development=# select pid,
now() - query_start as duration,
query,
state
from pg_stat_activity
where query not ilike '%pg_stat_activity%'
and query not ilike '%transaction%';
pid | duration | query | state
-------+----------------+----------------------------------------------+-------
85188 | 00:01:08.68768 | select 'Hi, I''m not a long-running query!'; | idle
(1 row)
What's this? This says that the duration is over 1 minute...
but psql
in the other session said that our query took 0.314 ms.
What's happening here?
The problem here is that there are other fields in this view that we need to look at to determine what's actually taking a long time.
Notice that the state of this row is idle
.
That means the process that this row represents is not currently executing the query.
To see this, let's issue this which is purposefully constructed to take 30 seconds to run:
select 'But _I_ am indeed a long-running query.' from pg_sleep(30);
... and quickly switch back to the main session to see what the view is tells us:
genome_development=# select pid,
now() - query_start as duration,
query,
state
pg_stat_activity
where query not ilike '%pg_stat_activity%'
and query not ilike '%transaction%';
pid | duration | query | state
-------+-----------------+---------------------------------------------------------------------+--------
85188 | 00:00:03.551807 | select 'But _I_ am indeed a long-running query.' from pg_sleep(30); | active
(1 row)
Wait... is this saying that the query took just under 4 seconds?
No... because the process state is still active
and the query is still running.
If we wait for the full thirty seconds to pass, we should see a state change:
genome_development=# select pid,
now() - query_start as duration,
query,
state
pg_stat_activity
where query not ilike '%pg_stat_activity%'
and query not ilike '%transaction%';
pid | duration | query | state
-------+-----------------+---------------------------------------------------------------------+-------
85188 | 00:00:45.365226 | select 'But _I_ am indeed a long-running query.' from pg_sleep(30); | idle
(1 row)
Aha... the process is idle
now... but wait... the duration is wrong again.
What is going on?
The problem this time is that we need to look at something in addition to the query_start
field to determine how long a query actually took to run.
It turns out that the pg_stat_activity
view also has a state_change
field that we can exploit to properly figure this out.
According to the documentation, the state_change
field is the "Time when the state was last changed".
(Duh.)
So, if a process completes executing a query, then its state will go from active
to idle
, and thus we should see that the difference between the state_change
and query_start
field will be the actual execution time:
Let's see:
genome_development=# select pid,
query_start,
state_change,
query,
state
from pg_stat_activity
where query not ilike '%pg_stat_activity%'
and query not ilike '%transaction%';
pid | query_start | state_change | query | state
-------+-------------------------------+-------------------------------+---------------------------------------------------------------------+-------
85188 | 2017-06-23 13:48:23.776142-04 | 2017-06-23 13:48:53.778372-04 | select 'But _I_ am indeed a long-running query.' from pg_sleep(30); | idle
(1 row)
Well, look at that... they're 30 seconds apart, just as we expected.
So... what did we learn here?
What we need to do is make sure that we only look at processes that are currently active
;
those are the ones that are actually running queries.