DAP Auto-Failover Cluster Recovery

Auto-failover Cluster Recovery Procedure

Overview

This document describes a procedure for recovering an auto-failover cluster that has more than one failed node.

When multiple nodes need to be re-added to a cluster, the etcd state must be carefully managed to maintain member and quorum consistency throughout.

The two inputs that need to be managed throughout the recovery process are listed below, and a quick way to compare them is sketched at the end of this overview:

  • Expected cluster members

    This is the list of hosts defined under the cluster policy (typically conjur/cluster/conjur). During recovery, this list must contain only one non-running node at a time.

    If this list contains more than one failed node, attempting to re-enroll a node will result in an error such as:

    C | etcdmain: error validating peerURLs {...}: member count is unequal
    

    This error indicates that the expected cluster nodes (the list from the policy) do not match the nodes actually present (i.e. the remaining active nodes plus the one being re-enrolled).

  • Current cluster members

    This is the list of cluster members in the etcd metadata, returned by the command evoke cluster member list. Similar to the issue above, attempting to add more than one restored Standby node at a time causes the evoke cluster member add ... command to fail with error: 500 Internal Server Error, and the log will contain messages like:

    W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
    W | etcdserver: not enough started members, rejecting member add {ID:f069f0abc99c2e98 RaftAttributes:{PeerURLs:[http://dr-xa-primary-3.itd.conjur.net:2380]} Attributes:{Name: ClientURLs:[]}}
    E | etcdserver/api/v2http: error adding member f069f0abc99c2e98 (etcdserver: re-configuration failed due to not enough started members)
    E | etcdserver/api/etcdhttp: etcdserver: re-configuration failed due to not enough started members
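
Before starting the recovery, it helps to compare these two lists side by side. The sketch below is one way to do that, assuming the cluster policy file is available locally as policy/cluster.yaml and that jq is installed; the file path and filters are illustrative and may need adjusting for your environment:

    # Hosts currently declared (not commented out) in the cluster policy file
    (my-user) $ awk '/!host/ && !/#/ {print $3}' policy/cluster.yaml | sort

    # Members currently registered in the etcd metadata
    (primary-1) $ docker exec dap evoke cluster member list | jq -r '.[].name' | sort

Any name that appears in one list but not the other points to the node that still needs to be reconciled.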
    

Recovery Procedure Steps

Environment

Assume a 5-node auto-failover cluster with the following state (a quick way to confirm which standbys are unhealthy follows the list):

  • primary-1: Active Master
  • primary-2: Healthy Standby
  • primary-3: Unhealthy Standby
  • primary-4: Unhealthy Standby
  • primary-5: Unhealthy Standby
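
Which standbys are unhealthy can be confirmed up front using the same checks as the verification steps below: a failed member will not report healthy in the etcd summary on the active master, and its own health endpoint will typically be unreachable or report a failing status.

    (primary-1) $ docker exec dap etcdctl cluster-health
    (primary-3) $ curl -k https://localhost/health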

Procedure

Recover primary-3

  1. Reload cluster policy removing primary-4 and primary-5 using --replace

    If the original cluster policy is:

    - !policy
      id: cluster/conjur
      body:
        - !layer
        - &hosts
          - !host primary-1
          - !host primary-2
          - !host primary-3
          - !host primary-4
          - !host primary-5
    
        - !grant
          role: !layer
          member: *hosts

    Then comment out the hosts for the nodes that will be recovered later (primary-4 and primary-5):

    # policy/cluster.yaml
    - !policy
      id: cluster/conjur
      body:
        - !layer
        - &hosts
          - !host primary-1
          - !host primary-2
          - !host primary-3
          # - !host primary-4
          # - !host primary-5
    
        - !grant
          role: !layer
          member: *hosts

    Reload the policy:

    (my-user) $ conjur policy load --replace conjur policy/cluster.yaml
    
  2. Remove the existing appliance container on primary-3

    (primary-3) $ docker rm -f dap
    
  3. Create a new appliance container for the Standby

    (primary-3) $ docker run --name dap -d --restart=always --security-opt seccomp:/path/to/conjur-seccomp.json -p 443:443 -p 5432:5432 -p 1999:1999 conjur-appliance:5.3.0
    
  4. Generate a seed file for the new Standby from the active Master

    (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-3 primary-1 |\
        ssh my-user@primary-3 docker exec -i dap evoke unpack seed -
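
    If piping directly between the two hosts over SSH is not possible in your environment, the same seed can be staged through an intermediate file instead (a sketch; the file name is illustrative):

    (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-3 primary-1 > standby-primary-3.tar
    (my-user) $ ssh my-user@primary-3 docker exec -i dap evoke unpack seed - < standby-primary-3.tar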
    
  5. Configure the Standby on primary-3

    (primary-3) $ docker exec dap evoke configure standby
    
  6. Ensure the etcd metadata contains only the active cluster nodes

    List the current members from the active master, primary-1:

    (primary-1) $ docker exec dap evoke cluster member list | jq .
    [
      {
        "id": "8173725b4825649",
        "name": "primary-1",
        "peerURLs": [
          "http://primary-1:2380"
        ],
        "clientURLs": [
          "http://localhost:2379"
        ]
      },
      {
        "id": "8c4717f1e3134c85",
        "name": "primary-2",
        "peerURLs": [
          "http://primary-2:2380"
        ],
        "clientURLs": [
          "http://localhost:2379"
        ]
      },
      {
        "id": "95b4109fd0d3d7c3",
        "name": "primary-4",
        "peerURLs": [
          "http://primary-4:2380"
        ],
        "clientURLs": [
          "http://localhost:2379"
        ]
      }
    ]
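
    To surface the candidates for removal without reading the JSON by eye, the listing can also be filtered with jq (a sketch, assuming primary-1 and primary-2 are the only healthy members at this point; adjust the names for your cluster):

    (primary-1) $ docker exec dap evoke cluster member list \
        | jq -r '.[] | select(.name != "primary-1" and .name != "primary-2") | "\(.id)  \(.name)"'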
    

    Remove any node that is not currently healthy, using its id from the listing. For example, in the output above we would remove id 95b4109fd0d3d7c3 for primary-4:

    (primary-1) $ docker exec dap evoke cluster member remove 95b4109fd0d3d7c3
    
  7. Add primary-3 to the cluster metadata on the active master, primary-1

    (primary-1) $ docker exec dap evoke cluster member add primary-3
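
    Optionally, confirm that the member list now includes primary-3 before enrolling it:

    (primary-1) $ docker exec dap evoke cluster member list | jq -r '.[].name'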
    
  8. Enroll primary-3 back into the cluster

    (primary-3) $ docker exec dap evoke cluster enroll \
      --reenroll \
      -n primary-3 \
      -m primary-1 \
      conjur
    
  9. Verify that the cluster is healthy

    From primary-3 health:

    (primary-3) $ curl -k https://localhost/health
    ...
     "cluster": {
        ...
        "status": "standing_by",
        ...
      },
    ...
    

    From primary-1 health:

    (primary-1) $ curl -k https://localhost/health
    ...
      "cluster": {
        "ok": true,
        "status": "running",
        "message": "acting as master"
      },
    ...
    

    From etcd health on primary-1:

    (primary-1) $ docker exec dap etcdctl cluster-health
    member 8173725b4825649 is healthy: got healthy result from http://localhost:2379
    member 8c4717f1e3134c85 is healthy: got healthy result from http://localhost:2379
    member 949c1d777035bd43 is healthy: got healthy result from http://localhost:2379
    cluster is healthy
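
    If jq is available, the cluster block can also be pulled straight out of the health response, which is convenient when repeating these checks on each node (a sketch):

    (primary-1) $ curl -sk https://localhost/health | jq '.cluster'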
    

The cluster is now a healthy 3-node cluster and is ready for the fourth node, primary-4, to be restored.

Recover primary-4

  1. Reload cluster policy adding back primary-4 using --replace

    Uncomment the host entry for primary-4:

    # policy/cluster.yaml
    - !policy
      id: cluster/conjur
      body:
        - !layer
        - &hosts
          - !host primary-1
          - !host primary-2
          - !host primary-3
          - !host primary-4
          # - !host primary-5
    
        - !grant
          role: !layer
          member: *hosts

    Reload the policy:

    (my-user) $ conjur policy load --replace conjur policy/cluster.yaml
    
  2. Remove the existing appliance container on primary-4

    (primary-4) $ docker rm -f dap
    
  3. Create a new appliance container for the Standby

    (primary-4) $ docker run --name dap -d --restart=always --security-opt seccomp:/path/to/conjur-seccomp.json -p 443:443 -p 5432:5432 -p 1999:1999 conjur-appliance:5.3.0
    
  4. Generate a seed file for the new Standby from the active Master

    (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-4 primary-1 |\
        ssh my-user@primary-4 docker exec -i dap evoke unpack seed -
    
  5. Configure the Standby on primary-4

    (primary-4) $ docker exec dap evoke configure standby
    
  6. Add primary-4 to the cluster metadata on the active master, primary-1

    (primary-1) $ docker exec dap evoke cluster member add primary-4
    
  7. Enroll primary-4 back into the cluster

    (primary-4) $ docker exec dap evoke cluster enroll \
      --reenroll \
      -n primary-4 \
      -m primary-1 \
      conjur
    
  8. Verify that the cluster is healthy

    From primary-4 health:

    (primary-4) $ curl -k https://localhost/health
    ...
     "cluster": {
        ...
        "status": "standing_by",
        ...
      },
    ...
    

    From primary-1 health:

    (primary-1) $ curl -k https://localhost/health
    ...
      "cluster": {
        "ok": true,
        "status": "running",
        "message": "acting as master"
      },
    ...
    

    From etcd health on primary-1:

    (primary-1) $ docker exec dap etcdctl cluster-health
    member 8173725b4825649 is healthy: got healthy result from http://localhost:2379
    member 8c4717f1e3134c85 is healthy: got healthy result from http://localhost:2379
    member 949c1d777035bd43 is healthy: got healthy result from http://localhost:2379
    member 95b4109fd0d3d7c3 is healthy: got healthy result from http://localhost:2379
    cluster is healthy
    

The cluster is now a healthy 4-node cluster and is ready for the final node, primary-5, to be restored.

Recover primary-5

This process is identical to restoring primary-4.

  1. Reload cluster policy adding back primary-5 using --replace

    Uncomment the host entry for primary-5:

    # policy/cluster.yaml
    - !policy
      id: cluster/conjur
      body:
        - !layer
        - &hosts
          - !host primary-1
          - !host primary-2
          - !host primary-3
          - !host primary-4
          - !host primary-5
    
        - !grant
          role: !layer
          member: *hosts

    Reload the policy:

    (my-user) $ conjur policy load --replace conjur policy/cluster.yaml
    
  2. Remove the existing appliance container on primary-5

    (primary-5) $ docker rm -f dap
    
  3. Create a new appliance container for the Standby

    (primary-5) $ docker run --name dap -d --restart=always --security-opt seccomp:/path/to/conjur-seccomp.json -p 443:443 -p 5432:5432 -p 1999:1999 conjur-appliance:5.3.0
    
  4. Generate a seed file for the new Standby from the active Master

    (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-5 primary-1 |\
        ssh my-user@primary-5 docker exec -i dap evoke unpack seed -
    
  5. Configure the Standby on primary-5

    (primary-5) $ docker exec dap evoke configure standby
    
  6. Add primary-5 to the cluster metadata on the active master, primary-1

    (primary-1) $ docker exec dap evoke cluster member add primary-5
    
  7. Enroll primary-5 back into the cluster

    (primary-5) $ docker exec dap evoke cluster enroll \
      --reenroll \
      -n primary-5 \
      -m primary-1 \
      conjur
    
  8. Verify that the cluster is healthy

    From primary-5 health:

    (primary-5) $ curl -k https://localhost/health
    ...
     "cluster": {
        ...
        "status": "standing_by",
        ...
      },
    ...
    

    From primary-1 health:

    (primary-1) $ curl -k https://localhost/health
    ...
      "cluster": {
        "ok": true,
        "status": "running",
        "message": "acting as master"
      },
    ...
    

    From etcd health on primary-1:

    (primary-1) $ docker exec dap etcdctl cluster-health
    member 8173725b4825649 is healthy: got healthy result from http://localhost:2379
    member 8c4717f1e3134c85 is healthy: got healthy result from http://localhost:2379
    member 949c1d777035bd43 is healthy: got healthy result from http://localhost:2379
    member 95b4109fd0d3d7c3 is healthy: got healthy result from http://localhost:2379
    member b729447ffbc4d179 is healthy: got healthy result from http://localhost:2379
    cluster is healthy
    

The cluster is now a healthy 5-node cluster again.
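
As a final sweep, the cluster status of every node can be checked in one pass. The loop below is a sketch that assumes each node's health endpoint on port 443 is reachable by hostname from the workstation; if it is not, run the same curl on each node over SSH as in the steps above.

    (my-user) $ for node in primary-1 primary-2 primary-3 primary-4 primary-5; do
          echo "${node}: $(curl -sk https://${node}/health | jq -r '.cluster.status')"
      done

Each standby should report standing_by and the master should report running, matching the health output shown in the verification steps above.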
