Hi. Welcome to this exploration of real Ansible projects, aimed at closing (or at least shortening) the gap between training knowledge and the practical experience needed for production-grade projects.
Training often focuses deeply on specific topics such as modules, filters, and inventories, which is highly valuable, but it misses the best practices and the big-picture knowledge that bring everything together.
So my favorite practice is to explore real projects, clone them in my local environment, and play with them until I become comfortable with the ideas and the practices. In this post, I'll share some insights I got from observing Kubespray, an Ansible tool to deploy and maintain Kubernetes clusters.
Intermediate knowledge of, or experience with, Ansible is recommended to follow along.
Also on Medium: https://medium.com/@danielnegreirosb/learning-production-ansible-from-real-projects-c6aeb13c104e
Organization is crucial for Ansible automation on multiple levels. We need to consider repository structure, roles or collections structure, credentials management, and inventories.
Understanding how Ansible collections and roles are structured is the first step to adopting best practices in development.
One aspect that I found particularly confusing initially was the distinction between playbooks and plays.
Each "section" of the code below is an Ansible Play. The file playbook.yml that contains these plays is a playbook.
---
- name: Play 1 - Invoke Role 1
  hosts: hosts
  roles:
    - role1

- name: Play 2 - Invoke Role 2
  hosts: hosts, hostsB
  roles:
    - role2
Using Kubespray as an example, its main playbooks contain simple logic but hold many different plays depending on the use case, distinguished by tags. The complexity lives within the roles used by the plays.
- name: Prepare nodes for upgrade
  hosts: k8s_cluster:etcd:calico_rr
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray-defaults }
    - { role: kubernetes/preinstall, tags: preinstall }
    - { role: download, tags: download, when: "not skip_downloads" }

- name: Upgrade container engine on non-cluster nodes
  hosts: etcd:calico_rr:!k8s_cluster
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  serial: "{{ serial | default('20%') }}"
  roles:
    - { role: kubespray-defaults }
    - { role: container-engine, tags: "container-engine", when: deploy_container_engine }
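Because plays and roles carry tags, you can limit a run to just one part of the workflow. As a hypothetical invocation (adjust the playbook and inventory paths to your own setup):
ansible-playbook -i inventory/mycluster/hosts.yaml upgrade-cluster.yml --tags preinstall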
- Plays are particularly useful when you have different sets of roles targeting different sets of hosts.
- We could place all roles under a single play if the targets are the same, but this might reduce cohesion. For example, if you wanted to remove an application XYZ, you would need to locate all the relevant roles instead of just a single play.
Understand the structure of roles and collections and use their native hierarchy to store files such as tasks, templates, vars, modules, and filters appropriately; a sketch of that hierarchy follows below.
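For reference, this is roughly the skeleton that ansible-galaxy init generates for a role. The library/ and filter_plugins/ directories are optional additions for role-local modules and filters (Kubespray keeps its shared plugins at the repository level instead), and the role name here is just a placeholder:
myrole/
├── defaults/
│   └── main.yml        # user-facing variables with sane default values
├── files/
├── filter_plugins/     # optional: custom filters shipped with the role
├── handlers/
│   └── main.yml
├── library/            # optional: custom modules shipped with the role
├── meta/
│   └── main.yml
├── tasks/
│   └── main.yml
├── templates/
├── tests/
└── vars/
    └── main.yml        # developer-side helper variables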
Maintaining a clear separation between the intent of automation and its implementation is fundamental for effective Ansible practices.
The imperative part of the automation lives in the tasks/handlers files. It contains the instructions needed for the automation to execute properly, and it makes use of what is declared in the variables section.
The declarative part should be used to centralize, in an easy-to-understand format, what is expected from the automation.
But what's the difference between vars and defaults?
I see defaults/ as a set of values that are closely related to the end user and can serve as user inputs. The user may choose to override some of them, or none at all.
I see vars/ as a helper for the developer, holding things like intermediate variables or interpolations. One good use case is avoiding hard-coded or repeated values across different tasks files.
The defaults file below shows configuration preferences from the user's perspective.
# roles/container-engine/containerd/defaults/main.yml
containerd_storage_dir: "/var/lib/containerd"
containerd_state_dir: "/run/containerd"
containerd_systemd_dir: "/etc/systemd/system/containerd.service.d"
# The default value is not -999 here because containerd's oom_score_adj has been
# set to the -999 even if containerd_oom_score is 0.
# Ref: https://github.com/kubernetes-sigs/kubespray/pull/9275#issuecomment-1246499242
containerd_oom_score: 0
containerd_default_runtime: "runc"
containerd_snapshotter: "overlayfs"
# ....
The vars file below, in contrast, holds a helper variable that is used to achieve the user's goal.
# roles/container-engine/containerd/vars/debian.yml
containerd_repo_info:
  repos:
    - >
      deb {{ containerd_debian_repo_base_url }}
      {{ ansible_distribution_release | lower }}
      {{ containerd_debian_repo_component }}
We should not need to look at any tasks files to understand what the automation is doing; declarative files like vars and defaults should provide that information. Tasks files should be consulted to understand or implement how that "what" is carried out.
The code should be simple to change and to read.
Let's compare two code snippets to observe the difference in readability between them. One offers clarity and ease of modification.
# playbooks/upgrade_cluster.yml
- name: Upgrade container engine on non-cluster nodes
  hosts: etcd:calico_rr:!k8s_cluster
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  serial: "{{ serial | default('20%') }}"
  roles:
    - { role: kubespray-defaults }
    - { role: container-engine, tags: "container-engine", when: deploy_container_engine }
# fake code
- name: Upgrade container engine on non-cluster nodes
  hosts: etcd:calico_rr:!k8s_cluster
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  serial: "{{ serial | default('20%') }}"
  tasks:
    - name: Include kubespray-defaults role
      include_role:
        name: kubespray-defaults
    - name: Include container-engine role
      include_role:
        name: container-engine
      tags: container-engine
      when: deploy_container_engine
It is a lot easier to understand the goal of the play by looking at the first snippet than at the second.
When all the conditions use AND, we can write them as a list.
# roles/bootstrap-os/tasks/main.yml
- name: Assign inventory name to unconfigured hostnames (non-CoreOS, non-Flatcar, Suse and ClearLinux, non-Fedora)
  hostname:
    name: "{{ inventory_hostname }}"
  when:
    - override_system_hostname
    - ansible_os_family not in ['Suse', 'Flatcar', 'Flatcar Container Linux by Kinvolk', 'ClearLinux']
    - not ansible_distribution == "Fedora"
    - not is_fedora_coreos
But what happens when there is an OR conditional in some of them?
We could use this syntax
# roles/bootstrap-os/tasks/main.yml
- name: Assign inventory name to unconfigured hostnames (CoreOS, Flatcar, Suse, ClearLinux and Fedora only)
  command: "hostnamectl set-hostname {{ inventory_hostname }}"
  register: hostname_changed
  become: true
  changed_when: false
  when: >
    override_system_hostname
    and (ansible_os_family in ['Suse', 'Flatcar', 'Flatcar Container Linux by Kinvolk', 'ClearLinux']
         or is_fedora_coreos
         or ansible_distribution == "Fedora")
We can use > or | in "when conditionals" to put the conditions vertically. This avoids long lines that require scrolling.
When matching strings, it is often better to use "in list" instead of multiple equality checks.
# Use
when: ansible_os_family in ['Suse', 'Flatcar', 'Flatcar Container Linux by Kinvolk', 'ClearLinux']
# Over:
when: ansible_os_family == "Suse" or ansible_os_family == "Flatcar" or ....
Remember that the idea here is not to enforce these two techniques, but to always keep in mind practices that will lead to better readability and maintenance in the code.
Sometimes we encounter deterministic tasks like creating a local file, while other tasks depend on external factors such as networking or unpredictable completion times.
We must ensure that such circumstances do not break our progress. While the procedure for handling unreliable tasks is simple, it's crucial not to overlook it and risk inefficiency in our playbooks.
An example from Kubespray demonstrates a good combination of features for reliability, utilizing until, retry, and delay.
# roles/remove-node/pre-remove/tasks/main.yml
- name: Remove-node | Drain node except daemonsets resource
  command: >-
    {{ kubectl }} drain
    --force
    --ignore-daemonsets
    --grace-period {{ drain_grace_period }}
    --timeout {{ drain_timeout }}
    --delete-emptydir-data {{ kube_override_hostname | default(inventory_hostname) }}
  when:
    - groups['kube_control_plane'] | length > 0
    # ignore servers that are not nodes
    - kube_override_hostname | default(inventory_hostname) in nodes.stdout_lines
  register: result
  failed_when: result.rc != 0 and not allow_ungraceful_removal
  delegate_to: "{{ groups['kube_control_plane'] | first }}"
  until: result.rc == 0 or allow_ungraceful_removal
  retries: "{{ drain_retries }}"
  delay: "{{ drain_retry_delay_seconds }}"
We retry the task up to retries: "{{ drain_retries }}" times, waiting delay: "{{ drain_retry_delay_seconds }}" seconds between attempts, until the condition until: result.rc == 0 or allow_ungraceful_removal is satisfied.
When errors occur, the automation should gracefully handle errors by providing appropriate error messages and implementing effective error-handling mechanisms.
Example: if starting the docker service fails, a block/rescue construct reports a message and removes the generated configuration.
# extra_playbooks/roles/container-engine/docker/tasks/main.yml
- name: Ensure docker started, remove our config if docker start failed and try again
  block:
    - name: Ensure service is started if docker packages are already present
      service:
        name: docker
        state: started
      when: docker_task_result is not changed
  rescue:
    - debug:  # noqa name[missing]
        msg: "Docker start failed. Try to remove our config"
    - name: Remove kubespray generated config
      file:
        path: "{{ item }}"
        state: absent
      with_items:
        - /etc/systemd/system/docker.service.d/http-proxy.conf
        - /etc/systemd/system/docker.service.d/docker-options.conf
        - /etc/systemd/system/docker.service.d/docker-dns.conf
        - /etc/systemd/system/docker.service.d/docker-orphan-cleanup.conf
      notify: Restart docker
Our automation must remain robust, ensuring that any failures occur due to external factors rather than issues originating from within the automation itself.
The ansible.cfg file allows for various performance optimizations. Let's examine the ansible.cfg configuration from Kubespray, focusing on the [ssh_connection] section:
[ssh_connection]
pipelining=True
ssh_args = -o ControlMaster=auto -o ControlPersist=30m -o ConnectionAttempts=100 -o UserKnownHostsFile=/dev/null
Pipelining: Setting pipelining=True enhances performance by reducing the number of network operations needed to transfer Ansible modules to hosts. However, for pipelining to work effectively, requiretty must be disabled on the remote hosts in the /etc/sudoers file. If requiretty is enabled, there may be issues with sudo privilege escalation, which is why pipelining is disabled by default.
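If your hosts do have requiretty enabled, one way to handle it is to drop that setting before turning pipelining on. A minimal sketch, assuming sudoers is managed directly in /etc/sudoers and that this one task can still escalate without pipelining:
- name: Disable requiretty so SSH pipelining can be used  # sketch, adapt to your distro/sudoers layout
  ansible.builtin.lineinfile:
    path: /etc/sudoers
    regexp: '^Defaults\s+requiretty'
    line: 'Defaults !requiretty'
    validate: 'visudo -cf %s'
  become: true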
SSH Arguments: The ssh_args parameter enables SSH multiplexing, which creates a master connection that can be reused for new connections. This avoids the overhead of establishing new SSH connections each time, resulting in reduced latency and improved efficiency.
- ControlMaster=auto: Enables connection sharing.
- ControlPersist=30m: Keeps the master connection open for 30 minutes, allowing subsequent connections to reuse this connection.
- ConnectionAttempts=100: Sets the maximum number of connection attempts to 100, which can improve reliability in environments with unstable connections.
- UserKnownHostsFile=/dev/null: Disables SSH host key checking, which can be useful in certain automated or temporary environments.
Although not utilized in Kubespray, it's worth mentioning the use of asynchronous tasks in other contexts. When executing a long-running task, we might want to continue performing unrelated tasks simultaneously. This is where async tasks come in handy, allowing us to run them asynchronously.
- async: 3600 - Specifies the maximum time (in seconds) the task is allowed to run.
- poll: 0 - Defines the interval (in seconds) at which Ansible checks the task's status. Setting it to 0 means Ansible won't wait for the task to complete.
It's important to note that some long-running tasks might be prerequisites for subsequent tasks. In such cases, async might not be appropriate. However, when tasks can run concurrently, such as downloading a large file while configuring other files, async is highly beneficial.
Here's an example:
- name: Start backup script in the background
  command: /usr/local/bin/backup_script.sh
  async: 3600  # Maximum runtime of 1 hour
  poll: 0      # Don't wait for the task to complete
  register: backup_job

- name: Task that does not depend on the previous async task
  ansible.builtin.shell:
    cmd: "send notification, download something ..."

# We block playbook execution here until the previous async task has finished.
- name: Check status of backup script
  async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 30  # check up to 30 times, every 60 seconds (see delay below)
  delay: 60

- name: Task that depends on the previous async task
  ansible.builtin.shell:
    cmd: "backup completed, doing something"
Implementing the above techniques will enhance the performance and resiliency of your automation, making it more suitable for production environments.
Establishing guidelines and best practices is crucial to avoid loss of productivity, bugs, and lack of readability. However, even with all the guidelines in place and commit reviews, bad code can sometimes go through. Therefore, it's essential to automate what's possible, and Kubespray uses rules for linting and pre-commit checks that we can learn from.
- Linting will verify the code for issues such as indentation problems, deprecated modules, bad styling, and many others.
- Pre-commit will enforce these checks when developers attempt to commit.
Ideally, we should have linting and other checks running both at the pre-commit stage and in the continuous integration (CI) process to ensure that linting hasn't been locally suppressed by the developer.
Referencing Kubespray, let's focus on the main linting configurations:
- Used to define the rule validations to skip (skip_list) and the paths to exclude from linting (exclude_paths). Some rule violations are tolerated even though Ansible Lint would flag them by default.
# .ansible-lint
# I removed comments and blank lines to keep this short.
---
parseable: true
skip_list:
  - 'role-name'
  - 'var-naming'
  - 'fqcn-builtins'
  - 'name[template]'
  - 'no-changed-when'
  - 'run-once[task]'
exclude_paths:
  - tests/files/custom_cni/cilium.yaml
  - venv
- If we don't want to skip a rule globally but want to skip it for certain files, we can use this file to specify the files and the rules that should be skipped.
# .ansible-lint-ignore
# This file contains ignores rule violations for ansible-lint
inventory/sample/group_vars/k8s_cluster/k8s-cluster.yml jinja[spacing]
roles/kubernetes/control-plane/defaults/main/kube-proxy.yml jinja[spacing]
roles/kubernetes/control-plane/defaults/main/main.yml jinja[spacing]
# ...
- Defines linting rules for YAML styling. This ensures consistency throughout the project. Useful paths to ignore are venv/** and **/molecule/**.
# .yamllint
---
extends: default
ignore: |
  .git/
  # Generated file
  tests/files/custom_cni/cilium.yaml
rules:
  braces:
    min-spaces-inside: 0
    max-spaces-inside: 1
  brackets:
    min-spaces-inside: 0
    max-spaces-inside: 1
  indentation:
    spaces: 2
    indent-sequences: consistent
  line-length: disable
  new-line-at-end-of-file: disable
  truthy: disable
The linters themselves are invoked with:
ansible-lint -v
yamllint --strict .
This ensures that the linters are executed every time there is a local commit:
# .pre-commit-config.yaml
# I kept only yamllint and ansible-lint for simpler visualization.
---
repos:
  - repo: https://github.com/adrienverge/yamllint.git
    rev: v1.27.1
    hooks:
      - id: yamllint
        args: [--strict]
  - repo: local
    hooks:
      - id: ansible-lint
        name: ansible-lint
        entry: ansible-lint -v
        language: python
        pass_filenames: false
        additional_dependencies:
          - .[community]
Kubespray uses GitLab CI, but the concept applies to any CI provider:
# .gitlab-ci/lint.yml
# I kept only yamllint and ansible-lint for simpler visualization.
yamllint:
  extends: .job
  stage: unit-tests
  tags: [light]
  variables:
    LANG: C.UTF-8
  script:
    - yamllint --strict .
  except: ['triggers', 'master']

ansible-lint:
  extends: .job
  stage: unit-tests
  tags: [light]
  script:
    - ansible-lint -v
  except: ['triggers', 'master']
With these tools and configurations properly set up, developers can ensure code quality even before committing, and the project will benefit from a global validation with the CI setting.
Although creating automated testing may initially feel like a loss of productivity, as projects grow more complex, it becomes an important asset. The moment we hesitate to make a code change is likely the moment we realize the importance of automated testing.
Kubespray utilizes Ansible Molecule in some of its roles. We won't deep dive into the details of Ansible Molecule here, but if this tool is unfamiliar to you, the official Molecule documentation is a good starting point.
- molecule.yml: Holds essential setup configurations like inventory and drivers for setting up targets, along with general settings.
- prepare.yml: Executes pre-operations on the environment, such as installing prerequisites.
- converge.yml: Initiates the execution to be tested, typically triggering the desired role (see the sketch after this list).
- verify.yml: Contains customized tasks for asserting the correctness of the execution.
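As a rough, illustrative sketch (not Kubespray's actual file), a converge.yml usually does little more than apply the role under test; the role name below is hypothetical:
---
- name: Converge
  hosts: all
  become: true
  tasks:
    - name: Apply the role under test
      ansible.builtin.include_role:
        name: my_role  # hypothetical role name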
An example of Kubespray molecule usage can be found in: roles/container-engine/containerd/molecule/
.
└── default
    ├── converge.yml
    ├── molecule.yml
    ├── prepare.yml
    └── tests
        └── test_default.py
One interesting aspect is that Kubespray relies on the testinfra verifier instead of the default ansible verifier, which uses verify.yml.
# roles/container-engine/containerd/molecule/default/molecule.yml
# driver: vagrant, provider: libvirt
# platforms: ubuntu, debian, ...
verifier:
  name: testinfra
This means that Ansible Molecule will not trigger verify.yml to validate the execution and test the infra, but it will use the Python code with testinfra packages to test it.
Let's examine part of this Python code from Kubespray example: /tests/test_default.py
import os

import pytest
import testinfra.utils.ansible_runner

testinfra_hosts = testinfra.utils.ansible_runner.AnsibleRunner(
    os.environ['MOLECULE_INVENTORY_FILE']).get_hosts('all')


def test_service(host):
    svc = host.service("containerd")
    assert svc.is_running
    assert svc.is_enabled


def test_version(host):
    crictl = "/usr/local/bin/crictl"
    path = "unix:///var/run/containerd/containerd.sock"
    with host.sudo():
        cmd = host.command(crictl + " --runtime-endpoint " + path + " version")
    assert cmd.rc == 0
    assert "RuntimeName: containerd" in cmd.stdout


@pytest.mark.parametrize('image, dest', [
    ('quay.io/kubespray/hello-world:latest', '/tmp/hello-world.tar')
])
def test_image_pull_save_load(host, image, dest):
    ...
The package can read the inventory and provide a simple API for validating the infrastructure.
Looking at the code, we notice how easy it is to test if a service is enabled and running:
svc = host.service("containerd")
assert svc.is_running
assert svc.is_enabled
and if a given command was executed successfully:
with host.sudo():
    cmd = host.command(crictl + " --runtime-endpoint " + path + " version")
assert cmd.rc == 0
assert "RuntimeName: containerd" in cmd.stdout
Because it's integrated with pytest, we can use pytest features like parametrized tests and even export results to OpenTelemetry if needed.
So it's a good alternative to writing these same kinds of tests using Ansible Tasks modules in the verify.yml playbook.
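For comparison, here is a minimal sketch (hypothetical, not taken from Kubespray) of the same service check written as plain Ansible tasks for the ansible verifier's verify.yml:
- name: Verify
  hosts: all
  tasks:
    - name: Collect service facts
      ansible.builtin.service_facts:

    - name: Assert containerd is running and enabled
      ansible.builtin.assert:
        that:
          - ansible_facts.services['containerd.service'].state == 'running'
          - ansible_facts.services['containerd.service'].status == 'enabled'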
It's important to note that Ansible Molecule is removing its built-in support for testinfra (ansible/molecule#3920), which means testinfra will no longer integrate seamlessly with Molecule.
For those who still want to use it, we can fall back to the Molecule ansible verifier.
# molecule.yml
# ...
verifier:
  name: ansible
Create verify.yml, set the host to localhost, and trigger pytest from the verifier playbook; the testinfra Python code will then be executed.
# verify.yml
- name: Test infra
  hosts: localhost
  tasks:
    - name: Test infra
      ansible.builtin.command:
        cmd: pytest
        chdir: ./tests/
Ansible provides a comprehensive range of collections and modules that serve virtually all IT infrastructure purposes. However, there are instances where we may not find the exact module we need from trusted sources. When we start developing custom tasks that become overly complex and difficult to read, maintain, and test, it might be time to create a customized module.
Creating a custom module not only enhances efficiency but also increases reusability. By examining existing modules, such as the one found in Kubespray's plugins folder:
.
└── modules
    └── kube.py
We can gain insights into creating a well-designed module. For full reference, see plugins/modules/kube.py in the Kubespray repo.
In this section, we'll explore how to define the arguments needed for our module.
Some key considerations include utilizing features like aliases, default values, lists of choices, and making certain arguments mutually exclusive:
def main():
    module = AnsibleModule(
        argument_spec=dict(
            name=dict(),
            filename=dict(type='list', aliases=['files', 'file', 'filenames']),
            namespace=dict(),
            resource=dict(),
            label=dict(),
            server=dict(),
            kubeconfig=dict(),
            kubectl=dict(),
            force=dict(default=False, type='bool'),
            wait=dict(default=False, type='bool'),
            all=dict(default=False, type='bool'),
            log_level=dict(default=0, type='int'),
            state=dict(default='present', choices=['present', 'absent', 'latest', 'reloaded', 'stopped', 'exists']),
            recursive=dict(default=False, type='bool'),
        ),
        mutually_exclusive=[['filename', 'list']]
    )
Based on the state passed by the user, the module calls the appropriate function. If no matching state is found, it exits with a proper error message inside the else block:
if state == 'present':
    result = manager.create(check=False)
elif state == 'absent':
    result = manager.delete()
else:
    module.fail_json(msg='Unrecognized state %s.' % state)

module.exit_json(changed=changed,
                 msg='success: %s' % (' '.join(result)))
Inside the state function, it builds the command to be executed by appending the arguments based on the input received from the user using the module:
def create(self, check=True, force=True):
    if check and self.exists():
        return []

    cmd = ['apply']

    if force:
        cmd.append('--force')
    if self.wait:
        cmd.append('--wait')
    if self.recursive:
        cmd.append('--recursive={}'.format(self.recursive))

    if not self.filename:
        self.module.fail_json(msg='filename required to create')
    cmd.append('--filename=' + ','.join(self.filename))

    return self._execute(cmd)
Finally, it calls the execute command that runs the command. If the command fails to run or the result code is different from 0, it returns an error message. Otherwise, it returns the output from the command:
def _execute(self, cmd):
    args = self.base_cmd + cmd
    try:
        rc, out, err = self.module.run_command(args)
        if rc != 0:
            self.module.fail_json(
                msg='error running kubectl (%s) command (rc=%d), out=\'%s\', err=\'%s\'' % (' '.join(args), rc, out, err))
    except Exception as exc:
        self.module.fail_json(
            msg='error running kubectl (%s) command: %s' % (' '.join(args), str(exc)))
    return out.splitlines()
Documentation for the module should also be provided, explaining the intent, the arguments, and providing examples of how to use it.
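As an illustration (not Kubespray's actual documentation block), module docs are conventionally embedded as YAML inside DOCUMENTATION and EXAMPLES strings at the top of the module, along these lines:
module: kube
short_description: Manage Kubernetes resources with kubectl
options:
  name:
    description: Name of the resource being managed.
  filename:
    description: Path(s) to files containing the resource definitions.
    aliases: [files, file, filenames]
  state:
    description: Desired state of the resource.
    default: present
    choices: [present, absent, latest, reloaded, stopped, exists]
With that in place, ansible-doc can render the documentation for users of the module.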
For this module, we could use it like this:
- name: test nginx is present
  kube: name=nginx resource=rc state=present
In summary, diving into real-world projects like Kubespray offers invaluable insights. This hands-on approach not only enhances our technical skills but also prepares us to handle the complexities of production environments with greater confidence.
There are many Ansible projects, such as AWX, or collections that we could use as references.
I hope these insights help you increase your confidence and improve your automations. If you liked this content, please leave a comment or a like so I know to bring more content like this.