Running the a playbook to create an EC2 host, I find that there is a race condition for Ansible to be able to connect to the host.
Both wait_for
and shell: echo host_is_up
(with retry / do..until loop) tasks are not enough to avoid hitting the race condition where Ansible fails to SSH.
Running the play shows that the "Wait for SSH" task completes successfully, and the initial connect for the "Wait up to 10mins for potential cloud-init sshd race condition" task fails completely without retrying as I had expected:
$ export AWS_ACCESS_KEY_ID=ABCDEFGHIJKLMNOPQRST
$ export AWS_ACCESS_KEY=$AWS_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY=1234567890abcdefghijklmnopqrstuvwxyzABCD
$ export AWS_SECRET_KEY=$AWS_SECRET_ACCESS_KEY
$ export AWS_DEFAULT_REGION=us-east-1
$ ansible-playbook -i hosts create-ec2-host.yml -e 'instance_count=1 use_public_ip=true' -vvvv
PLAY [Create a Host in us-east-1] ***************************************
TASK [Launch instances] ********************************************************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:7
ESTABLISH LOCAL CONNECTION FOR USER: testuser
localhost EXEC ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749067.13-189564441516888 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749067.13-189564441516888 )" )
localhost PUT /tmp/tmpUquQnz TO /Users/testuser/.ansible/tmp/ansible-tmp-1453749067.13-189564441516888/ec2
localhost EXEC LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /Users/testuser/.ansible/tmp/ansible-tmp-1453749067.13-189564441516888/ec2; rm -rf "/Users/testuser/.ansible/tmp/ansible-tmp-1453749067.13-189564441516888/" > /dev/null 2>&1
changed: [127.0.0.1 -> localhost] => {"changed": true, "instance_ids": ["i-12345678"], "instances": [{"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}], "invocation": {"module_args": {"assign_public_ip": true, "count": "1", "group_id": "sg-12345678", "image": "ami-61bbf104", "instance_profile_name": "test-instance-profile", "instance_tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "instance_type": "m4.large", "key_name": "test_key", "region": "us-east-1", "volumes": [{"device_name": "/dev/sdc", "device_type": "ephemeral", "ephemeral": "ephemeral0", "volume_size": 16}], "vpc_subnet_id": "subnet-12345678", "wait": true, "wait_timeout": 600, "zone": "us-east-1a"}, "module_name": "ec2"}, "tagged_instances": []}
TASK [Create an inventory group with the new nodes Private IP] *****************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:37
skipping: [127.0.0.1] => (item={u'kernel': None, u'root_device_type': u'ebs', u'private_dns_name': u'ip-10-12-34-56.ec2.internal', u'public_ip': u'54.12.34.56', u'private_ip': u'10.12.34.56', u'id': u'i-12345678', u'ebs_optimized': False, u'state': u'running', u'virtualization_type': u'hvm', u'root_device_name': u'/dev/sda1', u'ramdisk': None, u'block_device_mapping': {u'/dev/sda1': {u'status': u'attached', u'delete_on_termination': False, u'volume_id': u'vol-12345678'}}, u'key_name': u'test_key', u'image_id': u'ami-61bbf104', u'tenancy': u'default', u'groups': {u'sg-12345678': u'VPC-Private-SG'}, u'public_dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'state_code': 16, u'tags': {u'Name': u'test-instance-init', u'clusterid': u'test_instance', u'location': u'us-east-1', u'environment': u'test'}, u'placement': u'us-east-1a', u'ami_launch_index': u'0', u'dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'region': u'us-east-1', u'launch_time': u'2016-01-25T19:11:09.000Z', u'instance_type': u'm4.large', u'architecture': u'x86_64', u'hypervisor': u'xen'}) => {"changed": false, "item": {"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}, "skip_reason": "Conditional check failed", "skipped": true}
TASK [Create an inventory group with the new nodes Public IP] ******************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:47
creating host via 'add_host': hostname=54.12.34.56
changed: [127.0.0.1] => (item={u'kernel': None, u'root_device_type': u'ebs', u'private_dns_name': u'ip-10-12-34-56.ec2.internal', u'public_ip': u'54.12.34.56', u'private_ip': u'10.12.34.56', u'id': u'i-12345678', u'ebs_optimized': False, u'state': u'running', u'virtualization_type': u'hvm', u'root_device_name': u'/dev/sda1', u'ramdisk': None, u'block_device_mapping': {u'/dev/sda1': {u'status': u'attached', u'delete_on_termination': False, u'volume_id': u'vol-12345678'}}, u'key_name': u'test_key', u'image_id': u'ami-61bbf104', u'tenancy': u'default', u'groups': {u'sg-12345678': u'VPC-Private-SG'}, u'public_dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'state_code': 16, u'tags': {u'Name': u'test-instance-init', u'clusterid': u'test_instance', u'location': u'us-east-1', u'environment': u'test'}, u'placement': u'us-east-1a', u'ami_launch_index': u'0', u'dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'region': u'us-east-1', u'launch_time': u'2016-01-25T19:11:09.000Z', u'instance_type': u'm4.large', u'architecture': u'x86_64', u'hypervisor': u'xen'}) => {"add_host": {"groups": ["new_docker_hosts"], "host_name": "54.12.34.56", "host_vars": {}}, "changed": true, "invocation": {"module_args": {"groups": "new_docker_hosts", "name": "54.12.34.56"}, "module_name": "add_host"}, "item": {"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}}
TASK [Wait for SSH to come up (Public IP)] *************************************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:57
ESTABLISH LOCAL CONNECTION FOR USER: testuser
localhost EXEC ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749101.88-35398939720783 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749101.88-35398939720783 )" )
localhost PUT /tmp/tmphd3pDi TO /Users/testuser/.ansible/tmp/ansible-tmp-1453749101.88-35398939720783/wait_for
localhost EXEC LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /Users/testuser/.ansible/tmp/ansible-tmp-1453749101.88-35398939720783/wait_for; rm -rf "/Users/testuser/.ansible/tmp/ansible-tmp-1453749101.88-35398939720783/" > /dev/null 2>&1
ok: [127.0.0.1 -> localhost] => (item={u'kernel': None, u'root_device_type': u'ebs', u'private_dns_name': u'ip-10-12-34-56.ec2.internal', u'public_ip': u'54.12.34.56', u'private_ip': u'10.12.34.56', u'id': u'i-12345678', u'ebs_optimized': False, u'state': u'running', u'virtualization_type': u'hvm', u'root_device_name': u'/dev/sda1', u'ramdisk': None, u'block_device_mapping': {u'/dev/sda1': {u'status': u'attached', u'delete_on_termination': False, u'volume_id': u'vol-12345678'}}, u'key_name': u'test_key', u'image_id': u'ami-61bbf104', u'tenancy': u'default', u'groups': {u'sg-12345678': u'VPC-Private-SG'}, u'public_dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'state_code': 16, u'tags': {u'Name': u'test-instance-init', u'clusterid': u'test_instance', u'location': u'us-east-1', u'environment': u'test'}, u'placement': u'us-east-1a', u'ami_launch_index': u'0', u'dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'region': u'us-east-1', u'launch_time': u'2016-01-25T19:11:09.000Z', u'instance_type': u'm4.large', u'architecture': u'x86_64', u'hypervisor': u'xen'}) => {"changed": false, "elapsed": 110, "invocation": {"module_args": {"delay": "30", "host": "54.12.34.56", "port": "22", "state": "started", "timeout": "320"}, "module_name": "wait_for"}, "item": {"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}, "path": null, "port": 22, "search_regex": null, "state": "started"}
TASK [Wait for SSH to come up (Private IP)] ************************************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:64
skipping: [127.0.0.1] => (item={u'kernel': None, u'root_device_type': u'ebs', u'private_dns_name': u'ip-10-12-34-56.ec2.internal', u'public_ip': u'54.12.34.56', u'private_ip': u'10.12.34.56', u'id': u'i-12345678', u'ebs_optimized': False, u'state': u'running', u'virtualization_type': u'hvm', u'root_device_name': u'/dev/sda1', u'ramdisk': None, u'block_device_mapping': {u'/dev/sda1': {u'status': u'attached', u'delete_on_termination': False, u'volume_id': u'vol-12345678'}}, u'key_name': u'test_key', u'image_id': u'ami-61bbf104', u'tenancy': u'default', u'groups': {u'sg-12345678': u'VPC-Private-SG'}, u'public_dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'state_code': 16, u'tags': {u'Name': u'test-instance-init', u'clusterid': u'test_instance', u'location': u'us-east-1', u'environment': u'test'}, u'placement': u'us-east-1a', u'ami_launch_index': u'0', u'dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'region': u'us-east-1', u'launch_time': u'2016-01-25T19:11:09.000Z', u'instance_type': u'm4.large', u'architecture': u'x86_64', u'hypervisor': u'xen'}) => {"changed": false, "item": {"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}, "skip_reason": "Conditional check failed", "skipped": true}
TASK [Tag workers] *************************************************************
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:71
ESTABLISH LOCAL CONNECTION FOR USER: testuser
localhost EXEC ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749213.13-77467831849315 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749213.13-77467831849315 )" )
localhost PUT /tmp/tmpAbyyFo TO /Users/testuser/.ansible/tmp/ansible-tmp-1453749213.13-77467831849315/ec2_tag
localhost EXEC LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /Users/testuser/.ansible/tmp/ansible-tmp-1453749213.13-77467831849315/ec2_tag; rm -rf "/Users/testuser/.ansible/tmp/ansible-tmp-1453749213.13-77467831849315/" > /dev/null 2>&1
changed: [127.0.0.1 -> localhost] => (item={u'kernel': None, u'root_device_type': u'ebs', u'private_dns_name': u'ip-10-12-34-56.ec2.internal', u'public_ip': u'54.12.34.56', u'private_ip': u'10.12.34.56', u'id': u'i-12345678', u'ebs_optimized': False, u'state': u'running', u'virtualization_type': u'hvm', u'root_device_name': u'/dev/sda1', u'ramdisk': None, u'block_device_mapping': {u'/dev/sda1': {u'status': u'attached', u'delete_on_termination': False, u'volume_id': u'vol-12345678'}}, u'key_name': u'test_key', u'image_id': u'ami-61bbf104', u'tenancy': u'default', u'groups': {u'sg-12345678': u'VPC-Private-SG'}, u'public_dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'state_code': 16, u'tags': {u'Name': u'test-instance-init', u'clusterid': u'test_instance', u'location': u'us-east-1', u'environment': u'test'}, u'placement': u'us-east-1a', u'ami_launch_index': u'0', u'dns_name': u'ec2-54-12-34-56.compute-1.amazonaws.com', u'region': u'us-east-1', u'launch_time': u'2016-01-25T19:11:09.000Z', u'instance_type': u'm4.large', u'architecture': u'x86_64', u'hypervisor': u'xen'}) => {"changed": true, "invocation": {"module_args": {"region": "us-east-1", "resource": "i-12345678", "state": "present", "tags": {"Name": "test_instance-0"}}, "module_name": "ec2_tag"}, "item": {"ami_launch_index": "0", "architecture": "x86_64", "block_device_mapping": {"/dev/sda1": {"delete_on_termination": false, "status": "attached", "volume_id": "vol-12345678"}}, "dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "ebs_optimized": false, "groups": {"sg-12345678": "VPC-Private-SG"}, "hypervisor": "xen", "id": "i-12345678", "image_id": "ami-61bbf104", "instance_type": "m4.large", "kernel": null, "key_name": "test_key", "launch_time": "2016-01-25T19:11:09.000Z", "placement": "us-east-1a", "private_dns_name": "ip-10-12-34-56.ec2.internal", "private_ip": "10.12.34.56", "public_dns_name": "ec2-54-12-34-56.compute-1.amazonaws.com", "public_ip": "54.12.34.56", "ramdisk": null, "region": "us-east-1", "root_device_name": "/dev/sda1", "root_device_type": "ebs", "state": "running", "state_code": 16, "tags": {"Name": "test-instance-init", "clusterid": "test_instance", "location": "us-east-1", "environment": "test"}, "tenancy": "default", "virtualization_type": "hvm"}, "msg": "Tags {'Name': 'test_instance-0'} created for resource i-12345678."}
PLAY [Bootstrap new nodes for ansible] *****************************************
TASK [Wait up to 10mins for potential cloud-init sshd race condition] **********
task path: /Users/testuser/src/ansible-playbooks/playbooks/gist-ansible-race-condition/create-ec2-host.yml:92
<54.12.34.56> ESTABLISH SSH CONNECTION FOR USER: centos
<54.12.34.56> SSH: EXEC ssh -C -vvv -o ForwardAgent=yes -o ControlPath=/tmp/ssh-%r@%h:%p -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=22 -o 'IdentityFile="/Users/testuser/.ssh/test_key.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=centos -o ConnectTimeout=60 -tt 54.12.34.56 ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749214.91-182879016608139 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1453749214.91-182879016608139 )" )
fatal: [54.12.34.56]: UNREACHABLE! => {"changed": false, "msg": "ERROR! SSH encountered an unknown error. The output was:\nOpenSSH_6.6.1, OpenSSL 1.0.1e-fips 11 Feb 2013\r\ndebug1: Reading configuration data /Users/testuser/.ssh/config\r\ndebug1: /Users/testuser/.ssh/config line 378: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 56: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug1: Control socket \"/tmp/[email protected]:22\" does not exist\r\ndebug2: ssh_connect: needpriv 0\r\ndebug1: Connecting to 54.12.34.56 [54.12.34.56] port 22.\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug1: fd 3 clearing O_NONBLOCK\r\ndebug1: Connection established.\r\ndebug3: timeout: 59950 ms remain after connect\r\ndebug3: Incorrect RSA1 identifier\r\ndebug3: Could not load \"/Users/testuser/.ssh/test_key.pem\" as a RSA1 public key\r\ndebug1: identity file /Users/testuser/.ssh/test_key.pem type -1\r\ndebug1: identity file /Users/testuser/.ssh/test_key.pem-cert type -1\r\ndebug1: Enabling compatibility mode for protocol 2.0\r\ndebug1: Local version string SSH-2.0-OpenSSH_6.6.1\r\ndebug1: Remote protocol version 2.0, remote software version OpenSSH_6.6.1\r\ndebug1: match: OpenSSH_6.6.1 pat OpenSSH_6.6.1* compat 0x04000000\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug3: load_hostkeys: loading entries for host \"54.12.34.56\" from file \"/Users/testuser/.ssh/known_hosts\"\r\ndebug3: load_hostkeys: loaded 0 keys\r\ndebug1: SSH2_MSG_KEXINIT sent\r\ndebug1: SSH2_MSG_KEXINIT received\r\ndebug2: kex_parse_kexinit: [email protected],ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1\r\ndebug2: kex_parse_kexinit: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519,ssh-rsa,ssh-dss\r\ndebug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,[email protected],[email protected],[email protected],aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]\r\ndebug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,[email protected],[email protected],[email protected],aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]\r\ndebug2: kex_parse_kexinit: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],hmac-md5,hmac-sha1,[email protected],[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96\r\ndebug2: kex_parse_kexinit: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],hmac-md5,hmac-sha1,[email protected],[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96\r\ndebug2: kex_parse_kexinit: [email protected],zlib,none\r\ndebug2: kex_parse_kexinit: [email protected],zlib,none\r\ndebug2: kex_parse_kexinit: \r\ndebug2: kex_parse_kexinit: \r\ndebug2: kex_parse_kexinit: first_kex_follows 0 \r\ndebug2: kex_parse_kexinit: reserved 0 \r\ndebug2: kex_parse_kexinit: [email protected],ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1\r\ndebug2: kex_parse_kexinit: ssh-rsa,ecdsa-sha2-nistp256,ssh-ed25519\r\ndebug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,[email protected],[email protected],[email protected],aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]\r\ndebug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,[email protected],[email protected],[email protected],aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,[email protected]\r\ndebug2: kex_parse_kexinit: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],hmac-md5,hmac-sha1,[email protected],[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96\r\ndebug2: kex_parse_kexinit: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],hmac-md5,hmac-sha1,[email protected],[email protected],hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,[email protected],hmac-sha1-96,hmac-md5-96\r\ndebug2: kex_parse_kexinit: none,[email protected]\r\ndebug2: kex_parse_kexinit: none,[email protected]\r\ndebug2: kex_parse_kexinit: \r\ndebug2: kex_parse_kexinit: \r\ndebug2: kex_parse_kexinit: first_kex_follows 0 \r\ndebug2: kex_parse_kexinit: reserved 0 \r\ndebug2: mac_setup: setup [email protected]\r\ndebug1: kex: server->client aes128-ctr [email protected] [email protected]\r\ndebug2: mac_setup: setup [email protected]\r\ndebug1: kex: client->server aes128-ctr [email protected] [email protected]\r\ndebug1: kex: [email protected] need=16 dh_need=16\r\ndebug1: kex: [email protected] need=16 dh_need=16\r\ndebug1: sending SSH2_MSG_KEX_ECDH_INIT\r\ndebug1: expecting SSH2_MSG_KEX_ECDH_REPLY\r\ndebug1: Server host key: ECDSA ff:ee:dd:cc:bb:aa:99:88:77:66:55:44:33:22:11:00\r\ndebug3: load_hostkeys: loading entries for host \"54.12.34.56\" from file \"/Users/testuser/.ssh/known_hosts\"\r\ndebug3: load_hostkeys: loaded 0 keys\r\nWarning: Permanently added '54.12.34.56' (ECDSA) to the list of known hosts.\r\ndebug1: ssh_ecdsa_verify: signature correct\r\ndebug2: kex_derive_keys\r\ndebug2: set_newkeys: mode 1\r\ndebug1: SSH2_MSG_NEWKEYS sent\r\ndebug1: expecting SSH2_MSG_NEWKEYS\r\ndebug2: set_newkeys: mode 0\r\ndebug1: SSH2_MSG_NEWKEYS received\r\ndebug1: SSH2_MSG_SERVICE_REQUEST sent\r\ndebug2: service_accept: ssh-userauth\r\ndebug1: SSH2_MSG_SERVICE_ACCEPT received\r\ndebug2: key: /Users/testuser/.ssh/test_key.pem ((nil)), explicit\r\ndebug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password\r\ndebug3: start over, passed a different list publickey,gssapi-keyex,gssapi-with-mic,password\r\ndebug3: preferred gssapi-with-mic,gssapi-keyex,hostbased,publickey\r\ndebug3: authmethod_lookup gssapi-with-mic\r\ndebug3: remaining preferred: gssapi-keyex,hostbased,publickey\r\ndebug3: authmethod_is_enabled gssapi-with-mic\r\ndebug1: Next authentication method: gssapi-with-mic\r\ndebug1: Unspecified GSS failure. Minor code may provide more information\nNo Kerberos credentials available\n\r\ndebug1: Unspecified GSS failure. Minor code may provide more information\nNo Kerberos credentials available\n\r\ndebug1: Unspecified GSS failure. Minor code may provide more information\n\n\r\ndebug1: Unspecified GSS failure. Minor code may provide more information\nNo Kerberos credentials available\n\r\ndebug2: we did not send a packet, disable method\r\ndebug3: authmethod_lookup gssapi-keyex\r\ndebug3: remaining preferred: hostbased,publickey\r\ndebug3: authmethod_is_enabled gssapi-keyex\r\ndebug1: Next authentication method: gssapi-keyex\r\ndebug1: No valid Key exchange context\r\ndebug2: we did not send a packet, disable method\r\ndebug3: authmethod_lookup publickey\r\ndebug3: remaining preferred: ,publickey\r\ndebug3: authmethod_is_enabled publickey\r\ndebug1: Next authentication method: publickey\r\ndebug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password\r\ndebug1: Trying private key: /Users/testuser/.ssh/test_key.pem\r\ndebug1: key_parse_private2: missing begin marker\r\ndebug1: read PEM private key done: type RSA\r\ndebug3: sign_and_send_pubkey: RSA 12:34:56:78:90:ab:cd:ef:fe:dc:ba:09:87:65:43:21\r\ndebug2: we sent a publickey packet, wait for reply\r\ndebug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password\r\ndebug2: we did not send a packet, disable method\r\ndebug1: No more authentication methods to try.\r\nPermission denied (publickey,gssapi-keyex,gssapi-with-mic,password).\r\n", "unreachable": true}
PLAY [Cleanup staging keys for AMI user] ***************************************
PLAY RECAP *********************************************************************
127.0.0.1 : ok=4 changed=3 unreachable=0 failed=0
54.12.34.56 : ok=0 changed=0 unreachable=1 failed=0
After the playbook runs, I copy the same exact SSH command and retry after waiting a couple minutes, it works:
$ ssh -C -vvv -o ForwardAgent=yes -o ControlPath=/tmp/ssh-%r@%h:%p -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o Port=22 -o 'IdentityFile="/Users/testuser/.ssh/test_key.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=centos -o ConnectTimeout=60 -tt 54.12.34.56
[...SNIP...]
[...LOTS of SSH -vvv output...]
[...SNIP...]
Last failed login: Mon Jan 25 18:36:45 UTC 2016 from 123.456.789.1 on ssh:notty
There was 1 failed login attempt since the last successful login.
[centos@ip-10-12-34-56 ~]$
/var/log/secure
shows that the initial login user centos
did not exist when Ansible first tried to login at 18:36:45
:
Jan 25 18:36:14 localhost sshd[797]: Server listening on 0.0.0.0 port 22.
Jan 25 18:36:14 localhost sshd[797]: Server listening on :: port 22.
Jan 25 18:36:14 localhost sshd[1358]: Did not receive identification string from 123.456.789.1
Jan 25 18:36:45 localhost sshd[3437]: Invalid user centos from 123.456.789.1
Jan 25 18:36:45 localhost sshd[3437]: input_userauth_request: invalid user centos [preauth]
Jan 25 18:36:46 localhost sshd[3437]: Connection closed by 123.456.789.1 [preauth]
Jan 25 18:39:04 localhost useradd[10533]: new group: name=centos, GID=1000
Jan 25 18:39:04 localhost useradd[10533]: new user: name=centos, UID=1000, GID=1000, home=/home/centos, shell=/bin/bash
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to group 'adm'
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to group 'wheel'
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to group 'systemd-journal'
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to shadow group 'adm'
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to shadow group 'wheel'
Jan 25 18:39:04 localhost useradd[10533]: add 'centos' to shadow group 'systemd-journal'
Jan 25 18:39:05 localhost sshd[797]: Received signal 15; terminating.
Jan 25 18:39:05 localhost sshd[10558]: Server listening on 0.0.0.0 port 22.
Jan 25 18:39:05 localhost sshd[10558]: Server listening on :: port 22.
Jan 25 18:41:29 localhost sshd[10578]: Accepted publickey for centos from 123.456.789.1 port 30159 ssh2: RSA 12:34:56:78:90:ab:cd:ef:fe:dc:ba:09:87:65:43:21
Jan 25 18:41:29 localhost sshd[10578]: pam_unix(sshd:session): session opened for user centos by (uid=0)
Jan 25 18:41:53 localhost sudo: centos : TTY=pts/0 ; PWD=/home/centos ; USER=root ; COMMAND=/bin/less /var/log/cloud-init.log
Jan 25 18:41:58 localhost sudo: centos : TTY=pts/0 ; PWD=/home/centos ; USER=root ; COMMAND=/bin/less /var/log/cloud-init-output.log
Jan 25 18:42:13 localhost sudo: centos : TTY=pts/0 ; PWD=/home/centos ; USER=root ; COMMAND=/bin/less /var/log/boot.log
Jan 25 18:42:19 localhost sudo: centos : TTY=pts/0 ; PWD=/home/centos ; USER=root ; COMMAND=/bin/less /var/log/secure
Jan 25 18:43:30 localhost sudo: centos : TTY=pts/0 ; PWD=/home/centos ; USER=root ; COMMAND=/bin/less /var/log/secure
/var/log/cloud-init-output.log
shows that cloud-init takes a couple minutes to run:
[centos@ip-10-12-34-56 ~]$ cat /var/log/cloud-init-output.log
Cloud-init v. 0.7.5 running 'init-local' at Mon, 25 Jan 2016 18:35:49 +0000. Up 26.32 seconds.
Cloud-init v. 0.7.5 running 'init' at Mon, 25 Jan 2016 18:36:14 +0000. Up 51.35 seconds.
ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
ci-info: +--------+------+----------------+---------------+-------------------+
ci-info: | Device | Up | Address | Mask | Hw-Address |
ci-info: +--------+------+----------------+---------------+-------------------+
ci-info: | lo: | True | 127.0.0.1 | 255.0.0.0 | . |
ci-info: | eth0: | True | 10.12.34.56 | 255.255.224.0 | 0a:ab:cd:ef:de:00 |
ci-info: +--------+------+----------------+---------------+-------------------+
ci-info: ++++++++++++++++++++++++++++++++Route info+++++++++++++++++++++++++++++++++
ci-info: +-------+--------------+--------------+---------------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+--------------+--------------+---------------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 10.251.128.1 | 0.0.0.0 | eth0 | UG |
ci-info: | 1 | 10.12.128.0 | 0.0.0.0 | 255.255.224.0 | eth0 | U |
ci-info: +-------+--------------+--------------+---------------+-----------+-------+
Cloud-init v. 0.7.5 running 'modules:config' at Mon, 25 Jan 2016 18:39:05 +0000. Up 222.36 seconds.
Cloud-init v. 0.7.5 running 'modules:final' at Mon, 25 Jan 2016 18:39:05 +0000. Up 222.80 seconds.
Cloud-init v. 0.7.5 finished at Mon, 25 Jan 2016 18:39:05 +0000. Datasource DataSourceEc2. Up 222.93 seconds
So from the above log output, it appears that the initial login user centos
is created at: 18:39:04
, and cloud-init
finishes at 18:39:05
. This is about ~3 mins (02:51
to be exact) after SSHd reports that it is listening on port 22. So the wait_for
task and test echo
shell command task with retry is not enough to avoid this race condition.
Thank you for documenting this. I've been trying to track down the cause of intermittent SSH failures in this exact scenario. Provision EC2, wait for SSH, then run commands - will occasionally fail, likely due to this race condition. Very helpful!