This document describes a procedure for recovering an auto-failover cluster that has more than one failed node.

When multiple nodes need to be re-added to a cluster, the etcd state must be carefully managed to maintain member and quorum consistency throughout the recovery.
The two inputs that need to be managed throughout the recovery process are:
- Expected cluster members

  This is the list of hosts defined under the cluster policy (typically `conjur/cluster/conjur`). During recovery, this list must contain only one non-running node at a time. If the list contains more than one failed node, attempting to re-enroll nodes will result in an error such as:

  ```
  C | etcdmain: error validating peerURLs {...}: member count is unequal
  ```

  This error indicates that the expected cluster nodes (the list from the policy) do not match the actual nodes present (i.e. the remaining active nodes plus the one being re-enrolled).
- Current cluster members

  This is the list of cluster members in the etcd metadata returned by the command `evoke cluster member list`. Similar to the issue above, attempting to add more than one restored Standby node at a time will result in an `error: 500 Internal Server Error` when running the `evoke cluster member add ..` command, and the log will contain messages like:

  ```
  W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
  W | etcdserver: not enough started members, rejecting member add {ID:f069f0abc99c2e98 RaftAttributes:{PeerURLs:[http://dr-xa-primary-3.itd.conjur.net:2380]} Attributes:{Name: ClientURLs:[]}}
  E | etcdserver/api/v2http: error adding member f069f0abc99c2e98 (etcdserver: re-configuration failed due to not enough started members)
  E | etcdserver/api/etcdhttp: etcdserver: re-configuration failed due to not enough started members
  ```
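Before each re-enrollment, it helps to confirm that these two lists agree. Below is a minimal sketch of such a check; it assumes (for illustration only) that the policy file lives at `policy/cluster.yaml` on your workstation and that `jq` is installed:

```sh
# Members currently recorded in etcd, as reported by the active master
ssh my-user@primary-1 docker exec dap evoke cluster member list | jq -r '.[].name' | sort

# Hosts currently active (not commented out) in the local copy of the cluster policy
grep -E '^\s*- !host' policy/cluster.yaml | awk '{print $3}' | sort
```

The two outputs should differ by at most the single node you are about to restore.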
Assuming a 5 node auto-failover cluster with the following state:

- `primary-1`: Active Master
- `primary-2`: Healthy Standby
- `primary-3`: Unhealthy Standby
- `primary-4`: Unhealthy Standby
- `primary-5`: Unhealthy Standby
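To confirm this starting state for yourself, you can probe each node's health endpoint. A minimal sketch, assuming the hostnames resolve from your workstation and `jq` is installed:

```sh
# Print each node's reported cluster status, or "unreachable" if it does not respond
for node in primary-1 primary-2 primary-3 primary-4 primary-5; do
  status=$(curl -sk --max-time 5 "https://${node}/health" | jq -r '.cluster.status')
  echo "${node}: ${status:-unreachable}"
done
```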
- Reload cluster policy removing `primary-4` and `primary-5` using `--replace`

  If the original cluster policy is:

  ```yaml
  - !policy
    id: cluster/conjur
    body:
    - !layer

    - &hosts
      - !host primary-1
      - !host primary-2
      - !host primary-3
      - !host primary-4
      - !host primary-5

    - !grant
      role: !layer
      member: *hosts
  ```

  Then you may comment out these hosts:

  ```yaml
  # policy/cluster.yaml
  - !policy
    id: cluster/conjur
    body:
    - !layer

    - &hosts
      - !host primary-1
      - !host primary-2
      - !host primary-3
      # - !host primary-4
      # - !host primary-5

    - !grant
      role: !layer
      member: *hosts
  ```

  Reload the policy:

  ```
  (my-user) $ conjur policy load --replace conjur policy/cluster.yaml
  ```
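  As a quick sanity check before loading, confirm that only the hosts you intend to keep are still active in the file; a minimal sketch, again assuming the file is `policy/cluster.yaml`:

  ```sh
  # Should list only primary-1, primary-2, and primary-3
  grep -E '^\s*- !host' policy/cluster.yaml
  ```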
- Remove the existing appliance container on `primary-3`

  ```
  (primary-3) $ docker rm -f dap
  ```
- Create a new appliance container for the Standby

  ```
  (primary-3) $ docker run --name dap -d --restart=always \
      --security-opt seccomp:/path/to/conjur-seccomp.json \
      -p 443:443 -p 5432:5432 -p 1999:1999 \
      conjur-appliance:5.3.0
  ```
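  Before continuing, you can verify that the new container is up, using standard Docker tooling:

  ```sh
  (primary-3) $ docker ps --filter name=dap --format '{{.Names}}: {{.Status}}'
  ```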
- Generate a seed file for the new Standby from the active Master

  ```
  (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-3 primary-1 |\
      ssh my-user@primary-3 docker exec -i dap evoke unpack seed -
  ```
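  If piping between the two hosts over SSH is not possible from your workstation, the seed can be written to an intermediate file instead. A minimal sketch, using the hypothetical filename `standby-seed.tar`:

  ```sh
  # Capture the seed locally, copy it to the Standby, then unpack it inside the container
  ssh my-user@primary-1 docker exec dap evoke seed standby primary-3 primary-1 > standby-seed.tar
  scp standby-seed.tar my-user@primary-3:
  ssh my-user@primary-3 'docker cp standby-seed.tar dap:/tmp/standby-seed.tar && \
      docker exec dap evoke unpack seed /tmp/standby-seed.tar'
  ```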
- Configure the Standby on `primary-3`

  ```
  (primary-3) $ docker exec dap evoke configure standby
  ```
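  At this point you can check what the node reports for its cluster status; a minimal check against the local health endpoint, assuming `jq` is installed on the host:

  ```sh
  (primary-3) $ curl -sk https://localhost/health | jq '.cluster.status'
  ```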
- Ensure the etcd metadata contains only the active cluster nodes

  List the current members from the active master, `primary-1`:

  ```
  (primary-1) $ docker exec dap evoke cluster member list | jq .
  [
    {
      "id": "8173725b4825649",
      "name": "primary-1",
      "peerURLs": [
        "http://primary-1:2380"
      ],
      "clientURLs": [
        "http://localhost:2379"
      ]
    },
    {
      "id": "8c4717f1e3134c85",
      "name": "primary-2",
      "peerURLs": [
        "http://primary-2:2380"
      ],
      "clientURLs": [
        "http://localhost:2379"
      ]
    },
    {
      "id": "95b4109fd0d3d7c3",
      "name": "primary-4",
      "peerURLs": [
        "http://primary-4:2380"
      ],
      "clientURLs": [
        "http://localhost:2379"
      ]
    }
  ]
  ```

  Remove any nodes that are not currently healthy, using the `id`. For example, we would remove id `95b4109fd0d3d7c3` for `primary-4` in the output above:

  ```
  (primary-1) $ docker exec dap evoke cluster member remove 95b4109fd0d3d7c3
  ```
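  If several stale members need to be removed, you can look up each ID by name rather than reading the JSON by eye. A minimal sketch using `jq` (the node name `primary-4` is just the example from the listing above):

  ```sh
  (primary-1) $ stale_id=$(docker exec dap evoke cluster member list \
        | jq -r '.[] | select(.name == "primary-4") | .id')
  (primary-1) $ docker exec dap evoke cluster member remove "$stale_id"
  ```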
- Add `primary-3` to the cluster metadata on the active master, `primary-1`

  ```
  (primary-1) $ docker exec dap evoke cluster member add primary-3
  ```
- Enroll `primary-3` back into the cluster

  ```
  (primary-3) $ docker exec dap evoke cluster enroll \
      --reenroll \
      -n primary-3 \
      -m primary-1 \
      conjur
  ```
- Verify that the cluster is healthy

  From `primary-3` health:

  ```
  (primary-3) $ curl -k https://localhost/health
  ...
  "cluster": {
    ...
    "status": "standing_by",
    ...
  },
  ...
  ```

  From `primary-1` health:

  ```
  (primary-1) $ curl -k https://localhost/health
  ...
  "cluster": {
    "ok": true,
    "status": "running",
    "message": "acting as master"
  },
  ...
  ```

  From etcd health on `primary-1`:

  ```
  (primary-1) $ docker exec dap etcdctl cluster-health
  member 8173725b4825649 is healthy: got healthy result from http://localhost:2379
  member 8c4717f1e3134c85 is healthy: got healthy result from http://localhost:2379
  member 949c1d777035bd43 is healthy: got healthy result from http://localhost:2379
  cluster is healthy
  ```
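  If you prefer a scripted check, the same information can be pulled from the health endpoints with `jq`:

  ```sh
  (primary-3) $ curl -sk https://localhost/health | jq -r '.cluster.status'   # expect "standing_by"
  (primary-1) $ curl -sk https://localhost/health | jq -r '.cluster.message'  # expect "acting as master"
  ```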
The cluster is now a healthy 3 node cluster and is ready to restore the 4th node, `primary-4`.
- Reload cluster policy adding back `primary-4` using `--replace`

  Uncomment the host entry for `primary-4`:

  ```yaml
  # policy/cluster.yaml
  - !policy
    id: cluster/conjur
    body:
    - !layer

    - &hosts
      - !host primary-1
      - !host primary-2
      - !host primary-3
      - !host primary-4
      # - !host primary-5

    - !grant
      role: !layer
      member: *hosts
  ```

  Reload the policy:

  ```
  (my-user) $ conjur policy load --replace conjur policy/cluster.yaml
  ```
- Remove the existing appliance container on `primary-4`

  ```
  (primary-4) $ docker rm -f dap
  ```
- Create a new appliance container for the Standby

  ```
  (primary-4) $ docker run --name dap -d --restart=always \
      --security-opt seccomp:/path/to/conjur-seccomp.json \
      -p 443:443 -p 5432:5432 -p 1999:1999 \
      conjur-appliance:5.3.0
  ```
- Generate a seed file for the new Standby from the active Master

  ```
  (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-4 primary-1 |\
      ssh my-user@primary-4 docker exec -i dap evoke unpack seed -
  ```
- Configure the Standby on `primary-4`

  ```
  (primary-4) $ docker exec dap evoke configure standby
  ```
- Add `primary-4` to the cluster metadata on the active master, `primary-1`

  ```
  (primary-1) $ docker exec dap evoke cluster member add primary-4
  ```
- Enroll `primary-4` back into the cluster

  ```
  (primary-4) $ docker exec dap evoke cluster enroll \
      --reenroll \
      -n primary-4 \
      -m primary-1 \
      conjur
  ```
- Verify that the cluster is healthy

  From `primary-4` health:

  ```
  (primary-4) $ curl -k https://localhost/health
  ...
  "cluster": {
    ...
    "status": "standing_by",
    ...
  },
  ...
  ```

  From `primary-1` health:

  ```
  (primary-1) $ curl -k https://localhost/health
  ...
  "cluster": {
    "ok": true,
    "status": "running",
    "message": "acting as master"
  },
  ...
  ```

  From etcd health on `primary-1`:

  ```
  (primary-1) $ docker exec dap etcdctl cluster-health
  member 8173725b4825649 is healthy: got healthy result from http://localhost:2379
  member 8c4717f1e3134c85 is healthy: got healthy result from http://localhost:2379
  member 949c1d777035bd43 is healthy: got healthy result from http://localhost:2379
  member 95b4109fd0d3d7c3 is healthy: got healthy result from http://localhost:2379
  cluster is healthy
  ```
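  Before moving on to `primary-5`, you can also confirm that etcd now tracks exactly the four expected members; a minimal check using `jq`:

  ```sh
  (primary-1) $ docker exec dap evoke cluster member list | jq -r '.[].name' | sort
  # expect: primary-1 primary-2 primary-3 primary-4
  ```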
The cluster is now a healthy 4 node cluster and is ready to restore the final node, `primary-5`.
This process is identical to restoring `primary-4`.
- Reload cluster policy adding back `primary-5` using `--replace`

  Uncomment the host entry for `primary-5`:

  ```yaml
  # policy/cluster.yaml
  - !policy
    id: cluster/conjur
    body:
    - !layer

    - &hosts
      - !host primary-1
      - !host primary-2
      - !host primary-3
      - !host primary-4
      - !host primary-5

    - !grant
      role: !layer
      member: *hosts
  ```

  Reload the policy:

  ```
  (my-user) $ conjur policy load --replace conjur policy/cluster.yaml
  ```
- Remove the existing appliance container on `primary-5`

  ```
  (primary-5) $ docker rm -f dap
  ```
- Create a new appliance container for the Standby

  ```
  (primary-5) $ docker run --name dap -d --restart=always \
      --security-opt seccomp:/path/to/conjur-seccomp.json \
      -p 443:443 -p 5432:5432 -p 1999:1999 \
      conjur-appliance:5.3.0
  ```
- Generate a seed file for the new Standby from the active Master

  ```
  (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-5 primary-1 |\
      ssh my-user@primary-5 docker exec -i dap evoke unpack seed -
  ```
- Configure the Standby on `primary-5`

  ```
  (primary-5) $ docker exec dap evoke configure standby
  ```
- Add `primary-5` to the cluster metadata on the active master, `primary-1`

  ```
  (primary-1) $ docker exec dap evoke cluster member add primary-5
  ```
- Enroll `primary-5` back into the cluster

  ```
  (primary-5) $ docker exec dap evoke cluster enroll \
      --reenroll \
      -n primary-5 \
      -m primary-1 \
      conjur
  ```
- Verify that the cluster is healthy

  From `primary-5` health:

  ```
  (primary-5) $ curl -k https://localhost/health
  ...
  "cluster": {
    ...
    "status": "standing_by",
    ...
  },
  ...
  ```

  From `primary-1` health:

  ```
  (primary-1) $ curl -k https://localhost/health
  ...
  "cluster": {
    "ok": true,
    "status": "running",
    "message": "acting as master"
  },
  ...
  ```

  From etcd health on `primary-1`:

  ```
  (primary-1) $ docker exec dap etcdctl cluster-health
  member 8173725b4825649 is healthy: got healthy result from http://localhost:2379
  member 8c4717f1e3134c85 is healthy: got healthy result from http://localhost:2379
  member 949c1d777035bd43 is healthy: got healthy result from http://localhost:2379
  member 95b4109fd0d3d7c3 is healthy: got healthy result from http://localhost:2379
  member b729447ffbc4d179 is healthy: got healthy result from http://localhost:2379
  cluster is healthy
  ```
The cluster is now a healthy 5 node cluster again.
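As a final sweep, you can confirm every node's role in one pass; a minimal sketch, assuming the hostnames resolve from your workstation and `jq` is installed:

```sh
# Expect one "running" master (primary-1) and four "standing_by" Standbys
for node in primary-1 primary-2 primary-3 primary-4 primary-5; do
  printf '%s: %s\n' "$node" "$(curl -sk "https://${node}/health" | jq -r '.cluster.status')"
done
```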