@MichaelDBA
Created August 16, 2016 19:43
replication problems
Same problem, same WAL file:
2016-08-16 02:58:28.984 CDT replicator@[unknown][17378:idle] 10.100.128.210(49089) walreceiver : ERROR: requested WAL segment 00000002000022F400000096 has already been removed
The oldest WAL on the master is dated yesterday, August 15, at 3:17pm. The backup started yesterday at 8:40pm, over 5 hours after the oldest WAL on the master, as shown below:
-rw-r--r-- 1 postgres postgres 33 May 12 2015 00000002.history
-rw------- 1 postgres postgres 16777216 Aug 15 15:17 000000020000230F000000E2
-rw------- 1 postgres postgres 16777216 Aug 15 15:17 000000020000230F000000E3
-rw------- 1 postgres postgres 16777216 Aug 15 15:18 000000020000230F000000E4
-rw------- 1 postgres postgres 16777216 Aug 15 15:18 000000020000230F000000E5
-rw------- 1 postgres postgres 16777216 Aug 15 15:18 000000020000230F000000E6
-rw------- 1 postgres postgres 16777216 Aug 15 15:18 000000020000230F000000E7
-rw------- 1 postgres postgres 16777216 Aug 15 15:18 000000020000230F000000E8
-rw------- 1 postgres postgres 16777216 Aug 15 15:19 000000020000230F000000E9
-rw------- 1 postgres postgres 16777216 Aug 15 15:19 000000020000230F000000EA
-rw------- 1 postgres postgres 16777216 Aug 15 15:19 000000020000230F000000EB
-rw------- 1 postgres postgres 16777216 Aug 15 15:20 000000020000230F000000EC
-rw------- 1 postgres postgres 16777216 Aug 15 15:20 000000020000230F000000ED
-rw------- 1 postgres postgres 16777216 Aug 15 15:20 000000020000230F000000EE
-rw------- 1 postgres postgres 16777216 Aug 15 15:20 000000020000230F000000EF
-rw------- 1 postgres postgres 16777216 Aug 15 15:20 000000020000230F000000F0
-rw------- 1 postgres postgres 16777216 Aug 15 15:21 000000020000230F000000F1
-rw------- 1 postgres postgres 16777216 Aug 15 15:21 000000020000230F000000F2
-rw------- 1 postgres postgres 16777216 Aug 15 15:21 000000020000230F000000F3
-rw------- 1 postgres postgres 16777216 Aug 15 15:22 000000020000230F000000F4
-rw------- 1 postgres postgres 16777216 Aug 15 15:22 000000020000230F000000F5
-rw------- 1 postgres postgres 16777216 Aug 15 15:22 000000020000230F000000F6
(listing truncated)
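For reference, a listing like the one above comes from sorting the WAL directory oldest-first, along these lines (assuming the default pg_xlog location under the data directory):

ls -ltr $PGDATA/pg_xlog | head -25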
After the slave was started, it went through all the WALs from the archive, starting with:
2016-08-16 02:50:38.727 T00 INFO: get WAL segment 000000020000230E00000015
2016-08-16 02:50:38.898 CDT [unknown]@[unknown][15758:] [local] [unknown] : LOG: connection received: host=[local]
2016-08-16 02:50:38.898 CDT postgres@postgres[15758:] [local] [unknown] : FATAL: the database system is starting up
2016-08-16 02:50:39.053 T00 INFO: archive-get stop
2016-08-16 02:50:39.095 CDT @[15744:] : LOG: restored log file "000000020000230E00000015" from archive
2016-08-16 02:50:39.165 CDT @[15744:] : LOG: redo starts at 230E/15002B38
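The archive-get lines are pgBackRest being invoked as the restore_command. A minimal recovery.conf sketch consistent with these logs (the stanza name and the replicator user are taken from the logs; the master address is a placeholder, and sslmode is an assumption based on the SSL seen in the replication connections):

# sketch only: host is a placeholder, sslmode assumed
standby_mode = 'on'
restore_command = 'pgbackrest --stanza=lenderx archive-get %f "%p"'
primary_conninfo = 'host=<master-address> user=replicator sslmode=require'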
Here is the snippet at the end where the error first occurs:
2016-08-16 02:56:02.908 T00 INFO: archive-get start: --buffer-size=8388608 --compress --db-path=/pgdata/lenderx --log-level-console=info --log-level-file=info --log-path=/pgarchive/pgbackrest/log/lenderx --repo-path=/pgarchive/pgbackrest --stanza=lenderx
2016-08-16 02:56:02.908 T00 INFO: get WAL segment 000000020000230F0000008E
2016-08-16 02:56:03.350 T00 INFO: archive-get stop
2016-08-16 02:56:03.393 CDT @[15744:] : LOG: restored log file "000000020000230F0000008E" from archive
2016-08-16 02:56:03.521 CDT [unknown]@[unknown][17049:] 10.100.128.210(49031) [unknown] : LOG: connection received: host=10.100.128.210 port=49031
2016-08-16 02:56:03.650 CDT replicator@[unknown][17049:authentication] 10.100.128.210(49031) [unknown] : LOG: replication connection authorized: user=replicator SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=on)
2016-08-16 02:56:03.708 CDT replicator@[unknown][17049:idle] 10.100.128.210(49031) walreceiver : ERROR: requested WAL segment 00000002000022F400000096 has already been removed
2016-08-16 02:56:03.736 CDT replicator@[unknown][17049:idle] 10.100.128.210(49031) walreceiver : LOG: disconnection: session time: 0:00:00.214 user=replicator database= host=10.100.128.210 port=49031
2016-08-16 02:56:03.838 T00 INFO: archive-get start: --buffer-size=8388608 --compress --db-path=/pgdata/lenderx --log-level-console=info --log-level-file=info --log-path=/pgarchive/pgbackrest/log/lenderx --repo-path=/pgarchive/pgbackrest --stanza=lenderx
2016-08-16 02:56:03.839 T00 INFO: get WAL segment 000000020000230F0000008F
2016-08-16 02:56:04.198 T00 INFO: archive-get stop
2016-08-16 02:56:04.235 CDT @[15744:] : LOG: restored log file "000000020000230F0000008F" from archive
2016-08-16 02:56:04.735 T00 INFO: archive-get start: --buffer-size=8388608 --compress --db-path=/pgdata/lenderx --log-level-console=info --log-level-file=info --log-path=/pgarchive/pgbackrest/log/lenderx --repo-path=/pgarchive/pgbackrest --stanza=lenderx
2016-08-16 02:56:04.736 T00 INFO: get WAL segment 000000020000230F00000090
2016-08-16 02:56:05.069 T00 INFO: archive-get stop
2016-08-16 02:56:05.110 CDT @[15744:] : LOG: restored log file "000000020000230F00000090" from archive
2016-08-16 02:56:05.562 T00 INFO: archive-get start: --buffer-size=8388608 --compress --db-path=/pgdata/lenderx --log-level-console=info --log-level-file=info --log-path=/pgarchive/pgbackrest/log/lenderx --repo-path=/pgarchive/pgbackrest --stanza=lenderx
2016-08-16 02:56:05.562 T00 INFO: get WAL segment 000000020000230F00000091
2016-08-16 02:56:05.903 T00 INFO: archive-get stop
Then it continues to get WALs up to the timeline streaming message:
2016-08-16 02:57:07.438 T00 INFO: archive-get start: --buffer-size=8388608 --compress --db-path=/pgdata/lenderx --log-level-console=info --log-level-file=info --log-path=/pgarchive/pgbackrest/log/lenderx --repo-path=/pgarchive/pgbackrest --stanza=lenderx
2016-08-16 02:57:07.438 T00 INFO: get WAL segment 000000020000230F000000DF
2016-08-16 02:57:07.443 T00 INFO: unable to find 000000020000230F000000DF in the archive
2016-08-16 02:57:07.444 T00 INFO: archive-get stop
2016-08-16 02:57:07.515 CDT @[17284:] : LOG: fetching timeline history file for timeline 2 from primary server
2016-08-16 02:57:07.524 CDT @[17284:] : LOG: started streaming WAL from primary at 230F/DE000000 on timeline 2
2016-08-16 02:57:08.944 CDT [unknown]@[unknown][17285:] 10.100.128.210(49064) [unknown] : LOG: connection received: host=10.100.128.210 port=49064
2016-08-16 02:57:09.066 CDT replicator@[unknown][17285:authentication] 10.100.128.210(49064) [unknown] : LOG: replication connection authorized: user=replicator SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=on)
2016-08-16 02:57:09.120 CDT replicator@[unknown][17285:idle] 10.100.128.210(49064) walreceiver : ERROR: requested WAL segment 00000002000022F400000096 has already been removed
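For context: this error means the master has already recycled segment 00000002000022F400000096 out of its pg_xlog, so streaming from that point can never succeed. The usual guards against this (not claiming either is configured here) are a larger wal_keep_segments in postgresql.conf or, on 9.4+, a physical replication slot, e.g.:

-- on the master; the slot name is made up for illustration
SELECT pg_create_physical_replication_slot('standby_slot');
-- then point the standby at it in recovery.conf:
--   primary_slot_name = 'standby_slot'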
There is no problem with space for pg_xlog on the master or the slave:
/dev/mapper/vg02-pgxloglv 60G 24G 37G 39% /pgxlog
/dev/mapper/vg01-pgxloglv 60G 7.6G 53G 13% /pgxlog
There is nothing marked .ready on the master.
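Pending archives show up as .ready files under pg_xlog/archive_status, so a check along these lines (assuming the default layout) comes back empty:

ls $PGDATA/pg_xlog/archive_status/ | grep '\.ready$'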
There is another slave pulling from the master (mine is 10.80.129.46):
postgres=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+------------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
12874 | 30315692 | replicator | walreceiver | 10.80.129.46 | | 57843 | 2016-08-16 02:57:07.494827-05 | | streaming | 230F/FB4994C8 | 230F/FB4994C8 | 230F/FB4994C8 | 230F/FB4992A8 | 0 | async
5075 | 30315692 | replicator | walreceiver | 10.100.128.209 | | 46847 | 2016-08-13 13:12:48.218157-05 | | streaming | 230F/FB4994C8 | 230F/FB4994C8 | 230F/FB4994C8 | 230F/FB4994C8 | 0 | async
(2 rows)
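For scale: the segment named in the error, 00000002000022F400000096, decodes (assuming 9.3+ segment naming) to timeline 2 starting at LSN 22F4/96000000, while both slaves above are at 230F/FB4994C8. The gap can be computed directly; a worked example using the values from this gist, which comes to about 117,663,438,024 bytes, roughly 110 GB of WAL:

select pg_xlog_location_diff('230F/FB4994C8', '22F4/96000000');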
On the slave I run the following (note: log_cnt_diff comes back empty below, presumably because pg_stat_replication has no rows on this standby, since nothing cascades from it):
postgres=# select now() as now,
                  now() - pg_last_xact_replay_timestamp() as lagtime,
                  pg_last_xact_replay_timestamp() as replay_timestamp,
                  pg_last_xlog_receive_location() as receive_loc,
                  pg_last_xlog_replay_location() as replay_loc,
                  CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
                       ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
                  END AS log_delay,
                  (select pg_xlog_location_diff(write_location, replay_location) as log_cnt_diff
                     from pg_stat_replication);
now | lagtime | replay_timestamp | receive_loc | replay_loc | log_delay | log_cnt_diff
-------------------------------+-----------------+-------------------------------+---------------+---------------+-----------+--------------
2016-08-16 03:54:36.145632-05 | 00:00:00.040109 | 2016-08-16 03:54:36.105523-05 | 230F/FC184DD0 | 230F/FC184DD0 | 0 |
(1 row)