It was taking too long for Bucardo to import certain tables. I decided to look into it by querying the new database to see what was so slow.
First step was checking pg_stat_activity:
> select query, query_start, wait_event from pg_stat_activity where usename = 'bucardo_replicator';
/* Bucardo 5.6.0 */COPY public.some_table("id","...","...","created_at","updated_at") FROM STDIN | 2024-05-02 15:36:13.816752+00 | IO::DataFileRead
This COPY FROM query had been running for several days. The wait_event was IO::DataFileRead.
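For reference, something like this shows how long the statement has been running and what it is waiting on (same bucardo_replicator role as in the query above):
> select pid, now() - query_start as running_for, wait_event_type, wait_event
  from pg_stat_activity
  where usename = 'bucardo_replicator' and state = 'active';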
I checked pg_stat_progress_copy to be sure it wasn't stuck.
> select command, type, tuples_processed from pg_stat_progress_copy;
  command  | type | tuples_processed
-----------+------+------------------
 COPY FROM | PIPE |         49396655
The tuples_processed figure is the number of rows that have been copied (not always, but it was in our case). Running this query multiple times showed the figure increasing, so the copy wasn't stuck, just slow.
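Rather than re-running the query by hand, psql's \watch can do the polling, something along these lines (the 60-second interval is arbitrary):
> select tuples_processed from pg_stat_progress_copy;
> \watch 60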
Taking a look at Amazon RDS Performance Insights also showed lots of waits on IO::DataFileRead.


AWS has a guide on this:
The IO:DataFileRead event occurs when a connection waits on a backend process to read a required page from storage because the page isn't available in shared memory.
So, at its core, this is a memory problem: the pages PostgreSQL needs aren't in shared memory, so it has to keep reading them from storage.
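One way to put a rough number on that is the buffer cache hit ratio for the database. The counters in pg_stat_database are cumulative since the last stats reset, so treat it as a signal rather than a precise measurement:
> select blks_read, blks_hit,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as hit_pct
  from pg_stat_database
  where datname = current_database();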
The guide mentions possible causes of increased waits. That, in combination with the AWS guide on tuning PostgreSQL RDS, led me to think that the shared_buffers setting was too small for our copy.
I wondered if we were still on the default (32 MB as per the PostgreSQL docs), but checking the setting confirmed this was not the case.
> SHOW shared_buffers;
1933 MB
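If you want to do the maths yourself, pg_settings shows the raw value and its unit, which for shared_buffers is a count of 8kB pages:
> select name, setting, unit from pg_settings where name = 'shared_buffers';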
So I decided to investigate the usage of the buffers. To do this, I had to add the pg_buffercache extension:
> CREATE EXTENSION pg_buffercache;
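One caveat: the pg_buffercache_summary() function used below only ships with pg_buffercache 1.4 (PostgreSQL 16); this shows which version is installed:
> select extversion from pg_extension where extname = 'pg_buffercache';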
Then:
> select buffers_used, buffers_unused from pg_buffercache_summary();
 buffers_used | buffers_unused
--------------+---------------
       241653 |              2
(1 row)
buffers_unused was 2! This meant we were using practically all of the shared buffers.
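Each buffer is one block_size page (8 kB by default), so the counts translate directly into memory, for example:
> select buffers_used,
       pg_size_pretty(buffers_used * current_setting('block_size')::bigint) as used
  from pg_buffercache_summary();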
Now to find out where our buffers were being used:
> SELECT n.nspname, c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND
       b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
  JOIN pg_namespace n ON n.oid = c.relnamespace
  GROUP BY n.nspname, c.relname
  ORDER BY 3 DESC
  LIMIT 10;
 nspname |                   relname                   | buffers
---------+---------------------------------------------+---------
 public  | some_table_pkey                             |   57792
 public  | index_some_table_on_some_id                 |   53626
 public  | index_some_table_on_some_id_and_some_field  |   35264
 public  | index_some_table_on_some_other_id           |   33086
 public  | index_some_table_on_user_id                 |   30236
 public  | index_some_other_table_on_some_field_and_id |   27712
 public  | some_table                                  |    1704
 public  | index_some_table_on_collect_at              |    1041
 public  | index_some_table_on_some_other_field        |     408
 public  | index_some_table_on_deleted_at              |     377
(10 rows)
This showed a pattern: most of the buffers were being used by indexes.
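As a sanity check, the same join can be grouped by relation kind instead of by name (relkind 'i' is an index, 'r' an ordinary table):
> SELECT c.relkind, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND
       b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
  GROUP BY c.relkind
  ORDER BY 2 DESC;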
From here, it was an easy fix: I killed the copy and dropped the indexes, and the copy finished in 40 minutes. Adding the indexes back afterwards took about 20 minutes. All done!
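For completeness, the shape of that fix in SQL. This is only a sketch: the pid, index name, and column are placeholders, not the real ones from this migration.
-- save the definitions of the existing indexes so they can be replayed later
> select indexdef from pg_indexes
  where schemaname = 'public' and tablename = 'some_table';
-- stop the running COPY (pid taken from pg_stat_activity)
> select pg_terminate_backend(12345);
-- drop a secondary index; the primary key index has to go via
-- ALTER TABLE ... DROP CONSTRAINT instead
> drop index public.index_some_table_on_some_id;
-- once the copy has finished, replay the saved definitions, e.g.
> create index index_some_table_on_some_id on public.some_table (some_id);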