Created
August 16, 2016 19:43
-
-
Save MichaelDBA/7a8a05e31ab240a6ddf48cee8d57679e to your computer and use it in GitHub Desktop.
replication problems
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Same problem same WAL file: | |
016-08-16 02:58:28.984 CDT replicator@[unknown][17378:idle] 10.100.128.210(49089) walreceiver : ERROR: requested WAL segment 00000002000022F400000096 has already been removed | |
The oldest WAL on the master is dated yesterday august 15 at 3:17pm. The backup started yesterday at 8:40pm, over 5 hours after the oldest WAL on the master as shown below: | |
-rw-r--r-- 1 postgres postgres 33 May 12 2015 00000002.history | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:17 000000020000230F000000E2 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:17 000000020000230F000000E3 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:18 000000020000230F000000E4 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:18 000000020000230F000000E5 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:18 000000020000230F000000E6 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:18 000000020000230F000000E7 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:18 000000020000230F000000E8 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:19 000000020000230F000000E9 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:19 000000020000230F000000EA | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:19 000000020000230F000000EB | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:20 000000020000230F000000EC | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:20 000000020000230F000000ED | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:20 000000020000230F000000EE | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:20 000000020000230F000000EF | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:20 000000020000230F000000F0 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:21 000000020000230F000000F1 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:21 000000020000230F000000F2 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:21 000000020000230F000000F3 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:22 000000020000230F000000F4 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:22 000000020000230F000000F5 | |
-rw------- 1 postgres postgres 16777216 Aug 15 15:22 000000020000230F000000F6 | |
--More-- | |
After starting the slave, it went through all the WALs starting with: | |
2016-08-16 02:50:38.727 T00 INFO: get WAL segment 000000020000230E00000015 | |
2016-08-16 02:50:38.898 CDT [unknown]@[unknown][15758:] [local] [unknown] : LOG: connection received: host=[local] | |
2016-08-16 02:50:38.898 CDT postgres@postgres[15758:] [local] [unknown] : FATAL: the database system is starting up | |
2016-08-16 02:50:39.053 T00 INFO: archive-get stop | |
2016-08-16 02:50:39.095 CDT @[15744:] : LOG: restored log file "000000020000230E00000015" from archive | |
2016-08-16 02:50:39.165 CDT @[15744:] : LOG: redo starts at 230E/15002B38 | |
Here is the snippet at the end where the error first occurs | |
2016-08-16 02:56:02.908 T00 INFO: archive-get start: --buffer-size=8388608 --compress --db-path=/pgdata/lenderx --log-level-console=info --log-level-file=info --log-path=/pgarchive/pgbackrest/log/lenderx --repo-path=/pgarchive/pgbackrest --stanza=lenderx | |
2016-08-16 02:56:02.908 T00 INFO: get WAL segment 000000020000230F0000008E | |
2016-08-16 02:56:03.350 T00 INFO: archive-get stop | |
2016-08-16 02:56:03.393 CDT @[15744:] : LOG: restored log file "000000020000230F0000008E" from archive | |
2016-08-16 02:56:03.521 CDT [unknown]@[unknown][17049:] 10.100.128.210(49031) [unknown] : LOG: connection received: host=10.100.128.210 port=49031 | |
2016-08-16 02:56:03.650 CDT replicator@[unknown][17049:authentication] 10.100.128.210(49031) [unknown] : LOG: replication connection authorized: user=replicator SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=on) | |
2016-08-16 02:56:03.708 CDT replicator@[unknown][17049:idle] 10.100.128.210(49031) walreceiver : ERROR: requested WAL segment 00000002000022F400000096 has already been removed | |
2016-08-16 02:56:03.736 CDT replicator@[unknown][17049:idle] 10.100.128.210(49031) walreceiver : LOG: disconnection: session time: 0:00:00.214 user=replicator database= host=10.100.128.210 port=49031 | |
2016-08-16 02:56:03.838 T00 INFO: archive-get start: --buffer-size=8388608 --compress --db-path=/pgdata/lenderx --log-level-console=info --log-level-file=info --log-path=/pgarchive/pgbackrest/log/lenderx --repo-path=/pgarchive/pgbackrest --stanza=lenderx | |
2016-08-16 02:56:03.839 T00 INFO: get WAL segment 000000020000230F0000008F | |
2016-08-16 02:56:04.198 T00 INFO: archive-get stop | |
2016-08-16 02:56:04.235 CDT @[15744:] : LOG: restored log file "000000020000230F0000008F" from archive | |
2016-08-16 02:56:04.735 T00 INFO: archive-get start: --buffer-size=8388608 --compress --db-path=/pgdata/lenderx --log-level-console=info --log-level-file=info --log-path=/pgarchive/pgbackrest/log/lenderx --repo-path=/pgarchive/pgbackrest --stanza=lenderx | |
2016-08-16 02:56:04.736 T00 INFO: get WAL segment 000000020000230F00000090 | |
2016-08-16 02:56:05.069 T00 INFO: archive-get stop | |
2016-08-16 02:56:05.110 CDT @[15744:] : LOG: restored log file "000000020000230F00000090" from archive | |
2016-08-16 02:56:05.562 T00 INFO: archive-get start: --buffer-size=8388608 --compress --db-path=/pgdata/lenderx --log-level-console=info --log-level-file=info --log-path=/pgarchive/pgbackrest/log/lenderx --repo-path=/pgarchive/pgbackrest --stanza=lenderx | |
2016-08-16 02:56:05.562 T00 INFO: get WAL segment 000000020000230F00000091 | |
2016-08-16 02:56:05.903 T00 INFO: archive-get stop | |
Then it continues to get WALs up to the timeline streaming message: | |
2016-08-16 02:57:07.438 T00 INFO: archive-get start: --buffer-size=8388608 --compress --db-path=/pgdata/lenderx --log-level-console=info --log-level-file=info --log-path=/pgarchive/pgbackrest/log/lenderx --repo-path=/pgarchive/pgbackrest --stanza=lenderx | |
2016-08-16 02:57:07.438 T00 INFO: get WAL segment 000000020000230F000000DF | |
2016-08-16 02:57:07.443 T00 INFO: unable to find 000000020000230F000000DF in the archive | |
2016-08-16 02:57:07.444 T00 INFO: archive-get stop | |
2016-08-16 02:57:07.515 CDT @[17284:] : LOG: fetching timeline history file for timeline 2 from primary server | |
2016-08-16 02:57:07.524 CDT @[17284:] : LOG: started streaming WAL from primary at 230F/DE000000 on timeline 2 | |
2016-08-16 02:57:08.944 CDT [unknown]@[unknown][17285:] 10.100.128.210(49064) [unknown] : LOG: connection received: host=10.100.128.210 port=49064 | |
2016-08-16 02:57:09.066 CDT replicator@[unknown][17285:authentication] 10.100.128.210(49064) [unknown] : LOG: replication connection authorized: user=replicator SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=on) | |
2016-08-16 02:57:09.120 CDT replicator@[unknown][17285:idle] 10.100.128.210(49064) walreceiver : ERROR: requested WAL segment 00000002000022F400000096 has already been removed | |
There is not problem with space for pg_xlog on master or slave: | |
/dev/mapper/vg02-pgxloglv 60G 24G 37G 39% /pgxlog | |
/dev/mapper/vg01-pgxloglv 60G 7.6G 53G 13% /pgxlog | |
There is nothing marked .ready on the master. | |
There is another slave pulling from the master (mine is 10.80.129.46): | |
postgres=# select * from pg_stat_replication; | |
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state | |
-------+----------+------------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------ | |
12874 | 30315692 | replicator | walreceiver | 10.80.129.46 | | 57843 | 2016-08-16 02:57:07.494827-05 | | streaming | 230F/FB4994C8 | 230F/FB4994C8 | 230F/FB4994C8 | 230F/FB4992A8 | 0 | async | |
5075 | 30315692 | replicator | walreceiver | 10.100.128.209 | | 46847 | 2016-08-13 13:12:48.218157-05 | | streaming | 230F/FB4994C8 | 230F/FB4994C8 | 230F/FB4994C8 | 230F/FB4994C8 | 0 | async | |
(2 rows) | |
On the slave I run: | |
postgres=# select now() as now, now()-pg_last_xact_replay_timestamp() as lagtime, pg_last_xact_replay_timestamp() as replay_timestamp, pg_last_xlog_receive_location() as receive_loc, | |
pg_last_xlog_replay_location() as replay_loc, CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay, (select pg_xlog_location_diff(write_location, replay_location) as log_cnt_diff from pg_stat_replication); | |
now | lagtime | replay_timestamp | receive_loc | replay_loc | log_delay | log_cnt_diff | |
-------------------------------+-----------------+-------------------------------+---------------+---------------+-----------+-------------- | |
2016-08-16 03:54:36.145632-05 | 00:00:00.040109 | 2016-08-16 03:54:36.105523-05 | 230F/FC184DD0 | 230F/FC184DD0 | 0 | | |
(1 row) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment