Running a playbook to create an EC2 host, I find that there is a race condition that can prevent Ansible from connecting to the new host.

Neither a wait_for task nor a shell: echo host_is_up task (with a retry / do..until loop) is enough to avoid the race condition, and Ansible still fails to SSH in.

Running the play shows that the "Wait for SSH" task completes successfully, yet the initial connection for the "Wait up to 10mins for potential cloud-init sshd race condition" task fails outright, without retrying as I had expected. (The until/retries loop never gets a chance to run: the failure happens at connection time, so Ansible marks the host unreachable instead of treating it as a retryable task failure.)

$ export AWS_ACCESS_KEY_ID=ABCDEFGHIJKLMNOPQRST
$ export AWS_ACCESS_KEY=$AWS_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY=1234567890abcdefghijklmnopqrstuvwxyzABCD
$ export AWS_SECRET_KEY=$AWS_SECRET_ACCESS_KEY
$ export AWS_DEFAULT_REGION=us-east-1

$ ansible-playbook -i hosts create-ec2-host.yml -e 'instance_count=1 use_public_ip=true' -vvvv
PLAY [Create a Host in us-east-1] ***************************************

TASK [Launch instances] ********************************************************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:7
ESTABLISH LOCAL CONNECTION FOR USER: testuser
localhost EXEC ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749067.13-189564441516888 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749067.13-189564441516888 )" )
localhost PUT /tmp/tmpUquQnz TO /Users/testuser/.ansible/tmp/ansible-tmp-1453749067.13-189564441516888/ec2
localhost EXEC LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /Users/testuser/.ansible/tmp/ansible-tmp-1453749067.13-189564441516888/ec2; rm -rf "/Users/testuser/.ansible/tmp/ansible-tmp-1453749067.13-189564441516888/" > /dev/null 2>&1
changed: [127.0.0.1 -> localhost] => {"changed": true, "instance_ids": ["i-12345678"], "instances": [{"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}], "invocation": {"module_args": {"assign_public_ip": true, "count": "1", "group_id": "sg-12345678", "image": "ami-61bbf104", "instance_profile_name": "test-instance-profile", "instance_tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "instance_type": "m4.large", "key_name": "test_key", "region": "us-east-1", "volumes": [{"device_name": "/dev/sdc", "device_type": "ephemeral", "ephemeral": "ephemeral0", "volume_size": 16}], "vpc_subnet_id": "subnet-12345678", "wait": true, "wait_timeout": 600, "zone": "us-east-1a"}, "module_name": "ec2"}, "tagged_instances": []}

TASK [Create an inventory group with the new nodes Private IP] *****************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:37
skipping: [127.0.0.1] => (item={u'kernel': None, u'root_device_type': u'ebs', u'private_dns_name': u'ip-10-12-34-56.ec2.internal', u'public_ip': u'54.12.34.56', u'private_ip': u'10.12.34.56', u'id': u'i-12345678', u'ebs_optimized': False, u'state': u'running', u'virtualization_type': u'hvm', u'root_device_name': u'/dev/sda1', u'ramdisk': None, u'block_device_mapping': {u'/dev/sda1': {u'status': u'attached', u'delete_on_termination': False, u'volume_id': u'vol-12345678'}}, u'key_name': u'test_key', u'image_id': u'ami-61bbf104', u'tenancy': u'default', u'groups': {u'sg-12345678': u'VPC-Private-SG'}, u'public_dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'state_code': 16, u'tags': {u'Name': u'test-instance-init', u'clusterid': u'test_instance', u'location': u'us-east-1', u'environment': u'test'}, u'placement': u'us-east-1a', u'ami_launch_index': u'0', u'dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'region': u'us-east-1', u'launch_time': u'2016-01-25T19:11:09.000Z', u'instance_type': u'm4.large', u'architecture': u'x86_64', u'hypervisor': u'xen'})  => {"changed": false, "item": {"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}, "skip_reason": "Conditional check failed", "skipped": true}

TASK [Create an inventory group with the new nodes Public IP] ******************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:47
creating host via 'add_host': hostname=54.12.34.56
changed: [127.0.0.1] => (item={u'kernel': None, u'root_device_type': u'ebs', u'private_dns_name': u'ip-10-12-34-56.ec2.internal', u'public_ip': u'54.12.34.56', u'private_ip': u'10.12.34.56', u'id': u'i-12345678', u'ebs_optimized': False, u'state': u'running', u'virtualization_type': u'hvm', u'root_device_name': u'/dev/sda1', u'ramdisk': None, u'block_device_mapping': {u'/dev/sda1': {u'status': u'attached', u'delete_on_termination': False, u'volume_id': u'vol-12345678'}}, u'key_name': u'test_key', u'image_id': u'ami-61bbf104', u'tenancy': u'default', u'groups': {u'sg-12345678': u'VPC-Private-SG'}, u'public_dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'state_code': 16, u'tags': {u'Name': u'test-instance-init', u'clusterid': u'test_instance', u'location': u'us-east-1', u'environment': u'test'}, u'placement': u'us-east-1a', u'ami_launch_index': u'0', u'dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'region': u'us-east-1', u'launch_time': u'2016-01-25T19:11:09.000Z', u'instance_type': u'm4.large', u'architecture': u'x86_64', u'hypervisor': u'xen'}) => {"add_host": {"groups": ["new_docker_hosts"], "host_name": "54.12.34.56", "host_vars": {}}, "changed": true, "invocation": {"module_args": {"groups": "new_docker_hosts", "name": "54.12.34.56"}, "module_name": "add_host"}, "item": {"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}}

TASK [Wait for SSH to come up (Public IP)] *************************************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:57
ESTABLISH LOCAL CONNECTION FOR USER: testuser
localhost EXEC ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749101.88-35398939720783 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749101.88-35398939720783 )" )
localhost PUT /tmp/tmphd3pDi TO /Users/testuser/.ansible/tmp/ansible-tmp-1453749101.88-35398939720783/wait_for
localhost EXEC LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /Users/testuser/.ansible/tmp/ansible-tmp-1453749101.88-35398939720783/wait_for; rm -rf "/Users/testuser/.ansible/tmp/ansible-tmp-1453749101.88-35398939720783/" > /dev/null 2>&1
ok: [127.0.0.1 -> localhost] => (item={u'kernel': None, u'root_device_type': u'ebs', u'private_dns_name': u'ip-10-12-34-56.ec2.internal', u'public_ip': u'54.12.34.56', u'private_ip': u'10.12.34.56', u'id': u'i-12345678', u'ebs_optimized': False, u'state': u'running', u'virtualization_type': u'hvm', u'root_device_name': u'/dev/sda1', u'ramdisk': None, u'block_device_mapping': {u'/dev/sda1': {u'status': u'attached', u'delete_on_termination': False, u'volume_id': u'vol-12345678'}}, u'key_name': u'test_key', u'image_id': u'ami-61bbf104', u'tenancy': u'default', u'groups': {u'sg-12345678': u'VPC-Private-SG'}, u'public_dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'state_code': 16, u'tags': {u'Name': u'test-instance-init', u'clusterid': u'test_instance', u'location': u'us-east-1', u'environment': u'test'}, u'placement': u'us-east-1a', u'ami_launch_index': u'0', u'dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'region': u'us-east-1', u'launch_time': u'2016-01-25T19:11:09.000Z', u'instance_type': u'm4.large', u'architecture': u'x86_64', u'hypervisor': u'xen'}) => {"changed": false, "elapsed": 110, "invocation": {"module_args": {"delay": "30", "host": "54.12.34.56", "port": "22", "state": "started", "timeout": "320"}, "module_name": "wait_for"}, "item": {"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}, "path": null, "port": 22, "search_regex": null, "state": "started"}

TASK [Wait for SSH to come up (Private IP)] ************************************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:64
skipping: [127.0.0.1] => (item={u'kernel': None, u'root_device_type': u'ebs', u'private_dns_name': u'ip-10-12-34-56.ec2.internal', u'public_ip': u'54.12.34.56', u'private_ip': u'10.12.34.56', u'id': u'i-12345678', u'ebs_optimized': False, u'state': u'running', u'virtualization_type': u'hvm', u'root_device_name': u'/dev/sda1', u'ramdisk': None, u'block_device_mapping': {u'/dev/sda1': {u'status': u'attached', u'delete_on_termination': False, u'volume_id': u'vol-12345678'}}, u'key_name': u'test_key', u'image_id': u'ami-61bbf104', u'tenancy': u'default', u'groups': {u'sg-12345678': u'VPC-Private-SG'}, u'public_dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'state_code': 16, u'tags': {u'Name': u'test-instance-init', u'clusterid': u'test_instance', u'location': u'us-east-1', u'environment': u'test'}, u'placement': u'us-east-1a', u'ami_launch_index': u'0', u'dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'region': u'us-east-1', u'launch_time': u'2016-01-25T19:11:09.000Z', u'instance_type': u'm4.large', u'architecture': u'x86_64', u'hypervisor': u'xen'})  => {"changed": false, "item": {"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}, "skip_reason": "Conditional check failed", "skipped": true}

TASK [Tag workers] *************************************************************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:71
ESTABLISH LOCAL CONNECTION FOR USER: testuser
localhost EXEC ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749213.13-77467831849315 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749213.13-77467831849315 )" )
localhost PUT /tmp/tmpAbyyFo TO /Users/testuser/.ansible/tmp/ansible-tmp-1453749213.13-77467831849315/ec2_tag
localhost EXEC LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /Users/testuser/.ansible/tmp/ansible-tmp-1453749213.13-77467831849315/ec2_tag; rm -rf "/Users/testuser/.ansible/tmp/ansible-tmp-1453749213.13-77467831849315/" > /dev/null 2>&1
changed: [127.0.0.1 -> localhost] => (item={u'kernel': None, u'root_device_type': u'ebs', u'private_dns_name': u'ip-10-12-34-56.ec2.internal', u'public_ip': u'54.12.34.56', u'private_ip': u'10.12.34.56', u'id': u'i-12345678', u'ebs_optimized': False, u'state': u'running', u'virtualization_type': u'hvm', u'root_device_name': u'/dev/sda1', u'ramdisk': None, u'block_device_mapping': {u'/dev/sda1': {u'status': u'attached', u'delete_on_termination': False, u'volume_id': u'vol-12345678'}}, u'key_name': u'test_key', u'image_id': u'ami-61bbf104', u'tenancy': u'default', u'groups': {u'sg-12345678': u'VPC-Private-SG'}, u'public_dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'state_code': 16, u'tags': {u'Name': u'test-instance-init', u'clusterid': u'test_instance', u'location': u'us-east-1', u'environment': u'test'}, u'placement': u'us-east-1a', u'ami_launch_index': u'0', u'dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'region': u'us-east-1', u'launch_time': u'2016-01-25T19:11:09.000Z', u'instance_type': u'm4.large', u'architecture': u'x86_64', u'hypervisor': u'xen'}) => {"changed": true, "invocation": {"module_args": {"region": "us-east-1", "resource": "i-12345678", "state": "present", "tags": {"Name": "test_instance-0"}}, "module_name": "ec2_tag"}, "item": {"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}, "msg": "Tags {'Name': 'test_instance-0'} created for resource i-12345678."}

PLAY [Bootstrap new nodes for ansible] *****************************************

TASK [Wait up to 10mins for potential cloud-init sshd race condition] **********
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:92
<54.12.34.56> ESTABLISH SSH CONNECTION FOR USER: centos
<54.12.34.56> SSH: EXEC ssh -C -vvv -o ForwardAgent=yes -o ControlPath=/tmp/ssh-%r@%h:%p -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=22 -o 'IdentityFile="/Users/testuser/.ssh/test_key.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=centos -o ConnectTimeout=60 -tt 54.12.34.56 ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749214.91-182879016608139 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749214.91-182879016608139 )" )
fatal: [54.12.34.56]: UNREACHABLE! => {"changed": false, "msg": "ERROR! SSH encountered an unknown error. The output was:\nOpenSSH_6.6.1, OpenSSL 1.0.1e-fips 11 Feb 2013\r\ndebug1: Reading configuration data /Users/testuser/.ssh/config\r\ndebug1: /Users/testuser/.ssh/config line 378: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 56: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug1: Control socket \"/tmp/[email protected]:22\" does not exist\r\ndebug2: ssh_connect: needpriv 0\r\ndebug1: Connecting to 54.12.34.56 [54.12.34.56] port 22.\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug1: fd 3 clearing O_NONBLOCK\r\ndebug1: Connection established.\r\ndebug3: timeout: 59950 ms remain after connect\r\ndebug3: Incorrect RSA1 identifier\r\ndebug3: Could not load \"/Users/testuser/.ssh/test_key.pem\" as a RSA1 public key\r\ndebug1: identity file /Users/testuser/.ssh/test_key.pem type -1\r\ndebug1: identity file /Users/testuser/.ssh/test_key.pem-cert type -1\r\ndebug1: Enabling compatibility mode for protocol 2.0\r\ndebug1: Local version string SSH-2.0-OpenSSH_6.6.1\r\ndebug1: Remote protocol version 2.0, remote software version OpenSSH_6.6.1\r\ndebug1: match: OpenSSH_6.6.1 pat OpenSSH_6.6.1* compat 0x04000000\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug3: load_hostkeys: loading entries for host \"54.12.34.56\" from file \"/Users/testuser/.ssh/known_hosts\"\r\ndebug3: load_hostkeys: loaded 0 keys\r\ndebug1: SSH2_MSG_KEXINIT sent\r\ndebug1: SSH2_MSG_KEXINIT received\r\ndebug2: kex_parse_kexinit: [email protected],ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1\r\ndebug2: kex_parse_kexinit: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519,ssh-rsa,ssh-dss\r\ndebug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,[email protected],[email protected],[email protected],aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]\r\ndebug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,[email protected],[email protected],[email protected],aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]\r\ndebug2: kex_parse_kexinit: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],hmac-md5,hmac-sha1,[email protected],[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96\r\ndebug2: kex_parse_kexinit: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],hmac-md5,hmac-sha1,[email protected],[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96\r\ndebug2: kex_parse_kexinit: [email protected],zlib,none\r\ndebug2: kex_parse_kexinit: [email protected],zlib,none\r\ndebug2: kex_parse_kexinit: \r\ndebug2: kex_parse_kexinit: \r\ndebug2: kex_parse_kexinit: first_kex_follows 0 \r\ndebug2: kex_parse_kexinit: reserved 0 \r\ndebug2: kex_parse_kexinit: [email 
protected],ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1\r\ndebug2: kex_parse_kexinit: ssh-rsa,ecdsa-sha2-nistp256,ssh-ed25519\r\ndebug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,[email protected],[email protected],[email protected],aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]\r\ndebug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,[email protected],[email protected],[email protected],aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]\r\ndebug2: kex_parse_kexinit: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],hmac-md5,hmac-sha1,[email protected],[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96\r\ndebug2: kex_parse_kexinit: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],hmac-md5,hmac-sha1,[email protected],[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96\r\ndebug2: kex_parse_kexinit: none,[email protected]\r\ndebug2: kex_parse_kexinit: none,[email protected]\r\ndebug2: kex_parse_kexinit: \r\ndebug2: kex_parse_kexinit: \r\ndebug2: kex_parse_kexinit: first_kex_follows 0 \r\ndebug2: kex_parse_kexinit: reserved 0 \r\ndebug2: mac_setup: setup [email protected]\r\ndebug1: kex: server->client aes128-ctr [email protected] [email protected]\r\ndebug2: mac_setup: setup [email protected]\r\ndebug1: kex: client->server aes128-ctr [email protected] [email protected]\r\ndebug1: kex: [email protected] need=16 dh_need=16\r\ndebug1: kex: [email protected] need=16 dh_need=16\r\ndebug1: sending SSH2_MSG_KEX_ECDH_INIT\r\ndebug1: expecting SSH2_MSG_KEX_ECDH_REPLY\r\ndebug1: Server host key: ECDSA ff:ee:dd:cc:bb:aa:99:88:77:66:55:44:33:22:11:00\r\ndebug3: load_hostkeys: loading entries for host \"54.12.34.56\" from file \"/Users/testuser/.ssh/known_hosts\"\r\ndebug3: load_hostkeys: loaded 0 keys\r\nWarning: Permanently added '54.12.34.56' (ECDSA) to the list of known hosts.\r\ndebug1: ssh_ecdsa_verify: signature correct\r\ndebug2: kex_derive_keys\r\ndebug2: set_newkeys: mode 1\r\ndebug1: SSH2_MSG_NEWKEYS sent\r\ndebug1: expecting SSH2_MSG_NEWKEYS\r\ndebug2: set_newkeys: mode 0\r\ndebug1: SSH2_MSG_NEWKEYS received\r\ndebug1: SSH2_MSG_SERVICE_REQUEST sent\r\ndebug2: service_accept: ssh-userauth\r\ndebug1: SSH2_MSG_SERVICE_ACCEPT received\r\ndebug2: key: /Users/testuser/.ssh/test_key.pem ((nil)), explicit\r\ndebug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password\r\ndebug3: start over, passed a different list publickey,gssapi-keyex,gssapi-with-mic,password\r\ndebug3: preferred gssapi-with-mic,gssapi-keyex,hostbased,publickey\r\ndebug3: authmethod_lookup gssapi-with-mic\r\ndebug3: remaining preferred: gssapi-keyex,hostbased,publickey\r\ndebug3: authmethod_is_enabled gssapi-with-mic\r\ndebug1: Next authentication method: gssapi-with-mic\r\ndebug1: Unspecified GSS failure.  Minor code may provide more information\nNo Kerberos credentials available\n\r\ndebug1: Unspecified GSS failure.  
Minor code may provide more information\nNo Kerberos credentials available\n\r\ndebug1: Unspecified GSS failure.  Minor code may provide more information\n\n\r\ndebug1: Unspecified GSS failure.  Minor code may provide more information\nNo Kerberos credentials available\n\r\ndebug2: we did not send a packet, disable method\r\ndebug3: authmethod_lookup gssapi-keyex\r\ndebug3: remaining preferred: hostbased,publickey\r\ndebug3: authmethod_is_enabled gssapi-keyex\r\ndebug1: Next authentication method: gssapi-keyex\r\ndebug1: No valid Key exchange context\r\ndebug2: we did not send a packet, disable method\r\ndebug3: authmethod_lookup publickey\r\ndebug3: remaining preferred: ,publickey\r\ndebug3: authmethod_is_enabled publickey\r\ndebug1: Next authentication method: publickey\r\ndebug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password\r\ndebug1: Trying private key: /Users/testuser/.ssh/test_key.pem\r\ndebug1: key_parse_private2: missing begin marker\r\ndebug1: read PEM private key done: type RSA\r\ndebug3: sign_and_send_pubkey: RSA 12:34:56:78:90:ab:cd:ef:fe:dc:ba:09:87:65:43:21\r\ndebug2: we sent a publickey packet, wait for reply\r\ndebug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password\r\ndebug2: we did not send a packet, disable method\r\ndebug1: No more authentication methods to try.\r\nPermission denied (publickey,gssapi-keyex,gssapi-with-mic,password).\r\n", "unreachable": true}

PLAY [Cleanup staging keys for AMI user] ***************************************

PLAY RECAP *********************************************************************
127.0.0.1                  : ok=4    changed=3    unreachable=0    failed=0
54.12.34.56                : ok=0    changed=0    unreachable=1    failed=0

After the playbook run fails, I copy the exact same SSH command, wait a couple of minutes, and retry; this time it works:

$ ssh -C -vvv -o ForwardAgent=yes -o ControlPath=/tmp/ssh-%r@%h:%p -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=22 -o 'IdentityFile="/Users/testuser/.ssh/test_key.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=centos -o ConnectTimeout=60 -tt 54.12.34.56
[...SNIP...]
[...LOTS of SSH -vvv output...]
[...SNIP...]
Last failed login: Mon Jan 25 18:36:45 UTC 2016 from 123.456.789.1 on ssh:notty
There was 1 failed login attempt since the last successful login.
[centos@ip-10-12-34-56 ~]$

/var/log/secure shows that the initial login user centos did not exist when Ansible first tried to log in at 18:36:45:

Jan 25 18:36:14 localhost sshd[797]: Server listening on 0.0.0.0 port 22.
Jan 25 18:36:14 localhost sshd[797]: Server listening on :: port 22.
Jan 25 18:36:14 localhost sshd[1358]: Did not receive identification string from 123.456.789.1
Jan 25 18:36:45 localhost sshd[3437]: Invalid user centos from 123.456.789.1
Jan 25 18:36:45 localhost sshd[3437]: input_userauth_request: invalid user centos [preauth]
Jan 25 18:36:46 localhost sshd[3437]: Connection closed by 123.456.789.1 [preauth]
Jan 25 18:39:04 localhost useradd[10533]: new group: name=centos, GID=1000
Jan 25 18:39:04 localhost useradd[10533]: new user: name=centos, UID=1000, GID=1000, home=/home/centos, shell=/bin/bash
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to group 'adm'
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to group 'wheel'
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to group 'systemd-journal'
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to shadow group 'adm'
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to shadow group 'wheel'
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to shadow group 'systemd-journal'
Jan 25 18:39:05 localhost sshd[797]: Received signal 15; terminating.
Jan 25 18:39:05 localhost sshd[10558]: Server listening on 0.0.0.0 port 22.
Jan 25 18:39:05 localhost sshd[10558]: Server listening on :: port 22.
Jan 25 18:41:29 localhost sshd[10578]: Accepted publickey for centos from 123.456.789.1 port 30159 ssh2: RSA 12:34:56:78:90:ab:cd:ef:fe:dc:ba:09:87:65:43:21
Jan 25 18:41:29 localhost sshd[10578]: pam_unix(sshd:session): session opened for user centos by (uid=0)
Jan 25 18:41:53 localhost sudo: centos : TTY=pts/0 ; PWD=/home/centos ; USER=root ; COMMAND=/bin/less /var/log/cloud-init.log
Jan 25 18:41:58 localhost sudo: centos : TTY=pts/0 ; PWD=/home/centos ; USER=root ; COMMAND=/bin/less /var/log/cloud-init-output.log
Jan 25 18:42:13 localhost sudo: centos : TTY=pts/0 ; PWD=/home/centos ; USER=root ; COMMAND=/bin/less /var/log/boot.log
Jan 25 18:42:19 localhost sudo: centos : TTY=pts/0 ; PWD=/home/centos ; USER=root ; COMMAND=/bin/less /var/log/secure
Jan 25 18:43:30 localhost sudo: centos : TTY=pts/0 ; PWD=/home/centos ; USER=root ; COMMAND=/bin/less /var/log/secure

/var/log/cloud-init-output.log shows that cloud-init takes a couple of minutes to run:

[centos@ip-10-12-34-56 ~]$ cat /var/log/cloud-init-output.log
Cloud-init v. 0.7.5 running 'init-local' at Mon, 25 Jan 2016 18:35:49 +0000. Up 26.32 seconds.
Cloud-init v. 0.7.5 running 'init' at Mon, 25 Jan 2016 18:36:14 +0000. Up 51.35 seconds.
ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
ci-info: +--------+------+----------------+---------------+-------------------+
ci-info: | Device |  Up  |    Address     |      Mask     |     Hw-Address    |
ci-info: +--------+------+----------------+---------------+-------------------+
ci-info: |  lo:   | True |   127.0.0.1    |   255.0.0.0   |         .         |
ci-info: | eth0:  | True |   10.12.34.56  | 255.255.224.0 | 0a:ab:cd:ef:de:00 |
ci-info: +--------+------+----------------+---------------+-------------------+
ci-info: ++++++++++++++++++++++++++++++++Route info+++++++++++++++++++++++++++++++++
ci-info: +-------+--------------+--------------+---------------+-----------+-------+
ci-info: | Route | Destination  |   Gateway    |    Genmask    | Interface | Flags |
ci-info: +-------+--------------+--------------+---------------+-----------+-------+
ci-info: |   0   |   0.0.0.0    | 10.251.128.1 |    0.0.0.0    |    eth0   |   UG  |
ci-info: |   1   | 10.12.128.0  |   0.0.0.0    | 255.255.224.0 |    eth0   |   U   |
ci-info: +-------+--------------+--------------+---------------+-----------+-------+
Cloud-init v. 0.7.5 running 'modules:config' at Mon, 25 Jan 2016 18:39:05 +0000. Up 222.36 seconds.
Cloud-init v. 0.7.5 running 'modules:final' at Mon, 25 Jan 2016 18:39:05 +0000. Up 222.80 seconds.
Cloud-init v. 0.7.5 finished at Mon, 25 Jan 2016 18:39:05 +0000. Datasource DataSourceEc2.  Up 222.93 seconds

So from the above log output, it appears that the initial login user centos is created at 18:39:04, and cloud-init finishes at 18:39:05. That is roughly 3 minutes (2:51, to be exact) after sshd first reports that it is listening on port 22. So the wait_for task and the retried test echo shell task are not enough to avoid this race condition: port 22 is open long before the login user exists.
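One workaround is to make the wait itself perform a real SSH login from the control host, since authentication cannot succeed until cloud-init has created the user. Below is a minimal sketch (not part of the original playbook; the ssh_probe name and the exact ssh flags are illustrative), reusing the image_user and ansible_ssh_private_key_file variables from the files that follow:

- name: Wait until cloud-init has created the image_user (real SSH login probe)
  ## BatchMode=yes makes ssh exit nonzero on 'Permission denied' instead of
  ## prompting for a password, so the until loop can keep retrying.
  local_action: >
    command ssh -o BatchMode=yes -o StrictHostKeyChecking=no -o ConnectTimeout=10
    -i {{ ansible_ssh_private_key_file }}
    {{ image_user }}@{{ item.public_ip }} 'echo host_is_up'
  register: ssh_probe
  until: ssh_probe.rc == 0 and 'host_is_up' in ssh_probe.stdout
  retries: 60
  delay: 10
  with_items: ec2.instances

Because this runs as a normal task on the control host, until/retries applies to the ssh exit code rather than to Ansible's own connection attempt, so a refused login is retried instead of marking the host unreachable. The original reproduction files follow.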

create-ec2-host.yml:

- name: Create a Host in {{ location|default('us-east-1') }}
  hosts: localhost
  gather_facts: false
  vars_files:
    - "us-east-1.yml"
  tasks:
    - name: Launch instances
      local_action:
        module: ec2
        count: '{{ instance_count }}'
        key_name: '{{ keypair }}'
        instance_type: "{{ instance_type | default('c3.large') }}"
        image: '{{ base_image }}'
        wait: yes
        wait_timeout: 600
        region: '{{ region }}'
        zone: '{{ zone }}'
        vpc_subnet_id: '{{ subnet_id_private }}'
        group_id: '{{ security_group_private }}'
        assign_public_ip: yes
        instance_profile_name: '{{ instance_profile_name }}'
        instance_tags:
          Name: '{{ clusterid }}-init'
          clusterid: '{{ clusterid }}'
          environment: '{{ environment }}'
          location: '{{ location }}'
        volumes:
          - device_name: /dev/sdc
            device_type: ephemeral
            ephemeral: ephemeral0
            volume_size: 16
      register: ec2
      tags: launch_instance

    - name: Create an inventory group with the new nodes Private IP
      add_host: >
        name={{ item.private_ip }}
        groups=new_hosts
      with_items: ec2.instances
      when: use_public_ip is not defined or not use_public_ip
      tags:
        - launch_instance
        - new_hosts

    - name: Create an inventory group with the new nodes Public IP
      add_host: >
        name={{ item.public_ip }}
        groups=new_hosts
      with_items: ec2.instances
      when: use_public_ip is defined and use_public_ip
      tags:
        - launch_instance
        - new_hosts

    - name: Wait for SSH to come up (Public IP)
      local_action: wait_for host={{ item.public_ip }} port=22 delay=30 timeout=320 state=started
      with_items: ec2.instances
      when: use_public_ip is defined and use_public_ip
      tags:
        - wait_for_ssh

    - name: Wait for SSH to come up (Private IP)
      local_action: wait_for host={{ item.private_ip }} port=22 delay=30 timeout=320 state=started
      with_items: ec2.instances
      when: use_public_ip is not defined or not use_public_ip
      tags:
        - wait_for_ssh

    - name: Tag workers
      local_action: ec2_tag resource={{ item.id }} region={{ region }} state=present
      with_items: ec2.instances
      args:
        tags:
          Name: '{{ clusterid }}-{{ item["ami_launch_index"] }}'
      tags:
        - tag_new_hosts

- name: Bootstrap new nodes for ansible
  hosts: new_hosts
  vars_files:
    - "us-east-1.yml"
  remote_user: "{{ image_user }}"
  vars:
    ansible_user: "{{ image_user }}"
  gather_facts: no
  sudo: yes
  pre_tasks:
    ## Because cloud-init can take a while to run, and it is what creates the
    ## image_user, we must wait for that user to exist before we can SSH in
    ## successfully. (Note: wait_for only checks whether the port is listening.)
    - name: Wait up to 10mins for potential cloud-init sshd race condition
      shell: echo host_is_up
      register: sshd_is_up
      until: sshd_is_up.rc == 0 and 'host_is_up' in sshd_is_up.stdout
      retries: 60
      delay: 10
      tags:
        - wait_for_ssh
        - sshd_race_condition_wait
        - ansible_client
  tasks:
    - name: Now do other stuff to bootstrap the node
      shell: echo 'do some stuff'
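
Even once a login succeeds, user-data scripts (run in cloud-init's modules:final stage) can still be executing. If the bootstrap play also depends on user-data having completed, a pre_task along these lines could poll cloud-init's completion marker. This is a sketch under the assumption that the image's cloud-init touches /var/lib/cloud/instance/boot-finished when it finishes (the cloud_init_done name is illustrative):

    - name: Wait for cloud-init to write its boot-finished marker
      ## raw avoids depending on anything that user-data might still be installing
      raw: test -f /var/lib/cloud/instance/boot-finished
      register: cloud_init_done
      until: cloud_init_done.rc == 0
      retries: 60
      delay: 10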
hosts (inventory file):

[localhost]
127.0.0.1 ansible_connection=local
region: "us-east-1"
zone: "us-east-1a"
base_image: "ami-61bbf104" ## CentOS 7 (x86_64) with Updates HVM (7 2015-09-28, released 10/13/2015)
image_user: centos ## Default user for AMI
ansible_client_user: ansible
## If you are trying to run this race condition reproduction playbook...
## Set these variables appropriately for the objects that exist in your AWS account (subnets, keypairs, security groups, etc...)
subnet_id_public: "subnet-87654321"
subnet_id_private: "subnet-12345678"
keypair: "test_key"
ansible_ssh_private_key_file: '~/.ssh/test_key.pem'
security_group_public: "sg-87654321"
security_group_private: "sg-12345678"
instance_type: m4.large
instance_profile_name: "test-instance-profile"
# Tags for the instance
environment: test
clusterid: test_instance
location: "{{ region }}"

ashern commented May 24, 2016

Thank you for documenting this. I've been trying to track down the cause of intermittent SSH failures in this exact scenario: provision EC2, wait for SSH, then run commands. It will occasionally fail, likely due to this race condition. Very helpful!
