DAP Auto-Failover Cluster Recovery

Auto-failover Cluster Recovery Procedure

Overview

This document describes a procedure for recovering an auto-failover cluster that has more than one failed node.

When multiple nodes need to be re-added to a cluster, the etcd state must be carefully managed to maintain member and quorum consistency throughout.

The two inputs that need to be managed throughout the recovery process are listed below, and a quick way to compare them is sketched at the end of this overview:

  • Expected cluster members

    This is the list of hosts defined under the cluster policy (typically conjur/cluster/conjur). During recovery, this list must contain only one non-running node at a time.

    If this list contains more than one failed node, attempting to re-enroll a node will result in an error such as:

    C | etcdmain: error validating peerURLs {...}: member count is unequal
    

    This error indicates that the expected cluster nodes (the list from the policy) do not match the nodes actually present (i.e. the remaining active nodes plus the one being re-enrolled).

  • Current cluster members

    This is the list of cluster members in the etcd metadata, returned by the command evoke cluster member list. Similar to the issue above, attempting to add more than one restored Standby node at a time causes the evoke cluster member add ... command to fail with error: 500 Internal Server Error, and the log will contain messages like:

    W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
    W | etcdserver: not enough started members, rejecting member add {ID:f069f0abc99c2e98 RaftAttributes:{PeerURLs:[http://dr-xa-primary-3.itd.conjur.net:2380]} Attributes:{Name: ClientURLs:[]}}
    E | etcdserver/api/v2http: error adding member f069f0abc99c2e98 (etcdserver: re-configuration failed due to not enough started members)
    E | etcdserver/api/etcdhttp: etcdserver: re-configuration failed due to not enough started members
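
Before starting the recovery, it helps to compare these two lists side by side. The sketch below is one way to do that, assuming the cluster policy file is available locally as policy/cluster.yaml and that jq is installed; the file path and filters are illustrative and may need adjusting for your environment:

    # Hosts currently declared (not commented out) in the cluster policy file
    (my-user) $ awk '/!host/ && !/#/ {print $3}' policy/cluster.yaml | sort

    # Members currently registered in the etcd metadata
    (primary-1) $ docker exec dap evoke cluster member list | jq -r '.[].name' | sort

Any name that appears in one list but not the other points to the node that still needs to be reconciled.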
    

Recovery Procedure Steps

Environment

Assume a 5-node auto-failover cluster with the following state (a quick way to confirm which standbys are unhealthy follows the list):

  • primary-1: Active Master
  • primary-2: Healthy Standby
  • primary-3: Unhealthy Standby
  • primary-4: Unhealthy Standby
  • primary-5: Unhealthy Standby
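
Which standbys are unhealthy can be confirmed up front using the same checks as the verification steps below: a failed member will not report healthy in the etcd summary on the active master, and its own health endpoint will typically be unreachable or report a failing status.

    (primary-1) $ docker exec dap etcdctl cluster-health
    (primary-3) $ curl -k https://localhost/health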

Procedure

Recover primary-3

  1. Reload cluster policy removing primary-4 and primary-5 using --replace

    If the original cluster policy is:

    - !policy
      id: cluster/conjur
      body:
        - !layer
        - &hosts
          - !host primary-1
          - !host primary-2
          - !host primary-3
          - !host primary-4
          - !host primary-5
    
        - !grant
          role: !layer
          member: *hosts

    Then comment out the hosts for the nodes that will be recovered later (primary-4 and primary-5):

    # policy/cluster.yaml
    - !policy
      id: cluster/conjur
      body:
        - !layer
        - &hosts
          - !host primary-1
          - !host primary-2
          - !host primary-3
          # - !host primary-4
          # - !host primary-5
    
        - !grant
          role: !layer
          member: *hosts

    Reload the policy:

    (my-user) $ conjur policy load --replace conjur policy/cluster.yaml
    
  2. Remove the existing appliance container on primary-3

    (primary-3) $ docker rm -f dap
    
  3. Create a new appliance container for the Standby

    (primary-3) $ docker run --name dap -d --restart=always --security-opt seccomp:/path/to/conjur-seccomp.json -p 443:443 -p 5432:5432 -p 1999:1999 conjur-appliance:5.3.0
    
  4. Generate a seed file for the new Standby from the active Master

    (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-3 primary-1 |\
        ssh my-user@primary-3 docker exec -i dap evoke unpack seed -
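
    If piping directly between the two hosts over SSH is not possible in your environment, the same seed can be staged through an intermediate file instead (a sketch; the file name is illustrative):

    (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-3 primary-1 > standby-primary-3.tar
    (my-user) $ ssh my-user@primary-3 docker exec -i dap evoke unpack seed - < standby-primary-3.tar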
    
  5. Configure the Standby on primary-3

    (primary-3) $ docker exec dap evoke configure standby
    
  6. Ensure the etcd metadata contains only the active cluster nodes

    List the current members from the active master, primary-1:

    (primary-1) $ docker exec dap evoke cluster member list | jq .
    [
      {
        "id": "8173725b4825649",
        "name": "primary-1",
        "peerURLs": [
          "http://primary-1:2380"
        ],
        "clientURLs": [
          "http://localhost:2379"
        ]
      },
      {
        "id": "8c4717f1e3134c85",
        "name": "primary-2",
        "peerURLs": [
          "http://primary-2:2380"
        ],
        "clientURLs": [
          "http://localhost:2379"
        ]
      },
      {
        "id": "95b4109fd0d3d7c3",
        "name": "primary-4",
        "peerURLs": [
          "http://primary-4:2380"
        ],
        "clientURLs": [
          "http://localhost:2379"
        ]
      }
    ]
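
    To surface the candidates for removal without reading the JSON by eye, the listing can also be filtered with jq (a sketch, assuming primary-1 and primary-2 are the only healthy members at this point; adjust the names for your cluster):

    (primary-1) $ docker exec dap evoke cluster member list \
        | jq -r '.[] | select(.name != "primary-1" and .name != "primary-2") | "\(.id)  \(.name)"'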
    

    Remove any node that is not currently healthy, using its id from the listing. For example, in the output above we would remove id 95b4109fd0d3d7c3 for primary-4:

    (primary-1) $ docker exec dap evoke cluster member remove 95b4109fd0d3d7c3
    
  7. Add primary-3 to the cluster metadata on the active master, primary-1

    (primary-1) $ docker exec dap evoke cluster member add primary-3
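
    Optionally, confirm that the member list now includes primary-3 before enrolling it:

    (primary-1) $ docker exec dap evoke cluster member list | jq -r '.[].name'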
    
  8. Enroll primary-3 back into the cluster

    (primary-3) $ docker exec dap evoke cluster enroll \
      --reenroll \
      -n primary-3 \
      -m primary-1 \
      conjur
    
  9. Verify that the cluster is healthy

    From primary-3 health:

    (primary-3) $ curl -k https://localhost/health
    ...
     "cluster": {
        ...
        "status": "standing_by",
        ...
      },
    ...
    

    From primary-1 health:

    (primary-1) $ curl -k https://localhost/health
    ...
      "cluster": {
        "ok": true,
        "status": "running",
        "message": "acting as master"
      },
    ...
    

    From etcd health on primary-1:

    (primary-1) $ docker exec dap etcdctl cluster-health
    member 8173725b4825649 is healthy: got healthy result from http://localhost:2379
    member 8c4717f1e3134c85 is healthy: got healthy result from http://localhost:2379
    member 949c1d777035bd43 is healthy: got healthy result from http://localhost:2379
    cluster is healthy
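
    If jq is available, the cluster block can also be pulled straight out of the health response, which is convenient when repeating these checks on each node (a sketch):

    (primary-1) $ curl -sk https://localhost/health | jq '.cluster'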
    

The cluster is now a healthy 3-node cluster and is ready for the fourth node, primary-4, to be restored.

Recover primary-4

  1. Reload cluster policy adding back primary-4 using --replace

    Uncomment the host entry for primary-4:

    # policy/cluster.yaml
    - !policy
      id: cluster/conjur
      body:
        - !layer
        - &hosts
          - !host primary-1
          - !host primary-2
          - !host primary-3
          - !host primary-4
          # - !host primary-5
    
        - !grant
          role: !layer
          member: *hosts

    Reload the policy:

    (my-user) $ conjur policy load --replace conjur policy/cluster.yaml
    
  2. Remove the existing appliance container on primary-4

    (primary-4) $ docker rm -f dap
    
  3. Create a new appliance container for the Standby

    (primary-4) $ docker run --name dap -d --restart=always --security-opt seccomp:/path/to/conjur-seccomp.json -p 443:443 -p 5432:5432 -p 1999:1999 conjur-appliance:5.3.0
    
  4. Generate a seed file for the new Standby from the active Master

    (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-4 primary-1 |\
        ssh my-user@primary-4 docker exec -i dap evoke unpack seed -
    
  5. Configure the Standby on primary-4

    (primary-4) $ docker exec dap evoke configure standby
    
  6. Add primary-4 to the cluster metadata on the active master, primary-1

    (primary-1) $ docker exec dap evoke cluster member add primary-4
    
  7. Enroll primary-4 back into the cluster

    (primary-4) $ docker exec dap evoke cluster enroll \
      --reenroll \
      -n primary-4 \
      -m primary-1 \
      conjur
    
  8. Verify that the cluster is healthy

    From primary-4 health:

    (primary-4) $ curl -k https://localhost/health
    ...
     "cluster": {
        ...
        "status": "standing_by",
        ...
      },
    ...
    

    From primary-1 health:

    (primary-1) $ curl -k https://localhost/health
    ...
      "cluster": {
        "ok": true,
        "status": "running",
        "message": "acting as master"
      },
    ...
    

    From etcd health on primary-1:

    (primary-1) $ docker exec dap etcdctl cluster-health
    member 8173725b4825649 is healthy: got healthy result from http://localhost:2379
    member 8c4717f1e3134c85 is healthy: got healthy result from http://localhost:2379
    member 949c1d777035bd43 is healthy: got healthy result from http://localhost:2379
    member 95b4109fd0d3d7c3 is healthy: got healthy result from http://localhost:2379
    cluster is healthy
    

The cluster is now a healthy 4-node cluster and is ready for the final node, primary-5, to be restored.

Recover primary-5

This process is identical to restoring primary-4.

  1. Reload cluster policy adding back primary-5 using --replace

    Uncomment the host entry for primary-5:

    # policy/cluster.yaml
    - !policy
      id: cluster/conjur
      body:
        - !layer
        - &hosts
          - !host primary-1
          - !host primary-2
          - !host primary-3
          - !host primary-4
          - !host primary-5
    
        - !grant
          role: !layer
          member: *hosts

    Reload the policy:

    (my-user) $ conjur policy load --replace conjur policy/cluster.yaml
    
  2. Remove the existing appliance container on primary-5

    (primary-5) $ docker rm -f dap
    
  3. Create a new appliance container for the Standby

    (primary-5) $ docker run --name dap -d --restart=always --security-opt seccomp:/path/to/conjur-seccomp.json -p 443:443 -p 5432:5432 -p 1999:1999 conjur-appliance:5.3.0
    
  4. Generate a seed file for the new Standby from the active Master

    (my-user) $ ssh my-user@primary-1 docker exec dap evoke seed standby primary-5 primary-1 |\
        ssh my-user@primary-5 docker exec -i dap evoke unpack seed -
    
  5. Configure the Standby on primary-5

    (primary-5) $ docker exec dap evoke configure standby
    
  6. Add primary-5 to the cluster metadata on the active master, primary-1

    (primary-1) $ docker exec dap evoke cluster member add primary-5
    
  7. Enroll primary-5 back into the cluster

    (primary-5) $ docker exec dap evoke cluster enroll \
      --reenroll \
      -n primary-5 \
      -m primary-1 \
      conjur
    
  8. Verify that the cluster is healthy

    From primary-5 health:

    (primary-5) $ curl -k https://localhost/health
    ...
     "cluster": {
        ...
        "status": "standing_by",
        ...
      },
    ...
    

    From primary-1 health:

    (primary-1) $ curl -k https://localhost/health
    ...
      "cluster": {
        "ok": true,
        "status": "running",
        "message": "acting as master"
      },
    ...
    

    From etcd health on primary-1:

    (primary-1) $ docker exec dap etcdctl cluster-health
    member 8173725b4825649 is healthy: got healthy result from http://localhost:2379
    member 8c4717f1e3134c85 is healthy: got healthy result from http://localhost:2379
    member 949c1d777035bd43 is healthy: got healthy result from http://localhost:2379
    member 95b4109fd0d3d7c3 is healthy: got healthy result from http://localhost:2379
    member b729447ffbc4d179 is healthy: got healthy result from http://localhost:2379
    cluster is healthy
    

The cluster is now a healthy 5-node cluster again.
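
As a final sweep, the cluster status of every node can be checked in one pass. The loop below is a sketch that assumes each node's health endpoint on port 443 is reachable by hostname from the workstation; if it is not, run the same curl on each node over SSH as in the steps above.

    (my-user) $ for node in primary-1 primary-2 primary-3 primary-4 primary-5; do
          echo "${node}: $(curl -sk https://${node}/health | jq -r '.cluster.status')"
      done

Each standby should report standing_by and the master should report running, matching the health output shown in the verification steps above.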
