Hi. Welcome to this exploration of real Ansible projects, aimed at closing (or at least shortening) the gap between training knowledge and the practical experience needed for production-grade projects.
Training often focuses deeply on specific topics such as modules, filters, and inventories, which is highly valuable, but it misses the best practices and the big-picture knowledge that bring everything together.
So my favorite practice is to explore real projects, clone them in my local environment, and play with them until I become comfortable with the ideas and the practices. In this post, I'll share some insights I got from observing Kubespray, an Ansible tool to deploy and maintain Kubernetes clusters.
Intermediate knowledge of, or experience with, Ansible is recommended to follow along.
Also on Medium: https://medium.com/@danielnegreirosb/learning-production-ansible-from-real-projects-c6aeb13c104e
Organization is crucial for Ansible automation on multiple levels. We need to consider repository structure, roles or collections structure, credentials management, and inventories.
Understanding how Ansible collections and roles are structured is the first step to adopting best practices in development.
One aspect that I found particularly confusing initially was the distinction between playbooks and plays.
Each "section" of the code below is an Ansible Play. The file playbook.yml that contains these plays is a playbook.
---
- name: Play 1 - Invoke Role 1
  hosts: hosts
  roles:
    - role1

- name: Play 2 - Invoke Role 2
  hosts: hosts, hostsB
  roles:
    - role2
Using Kubespray as an example, its main playbooks contain simple logic but hold many different plays depending on the use case, distinguished by tags. The complexity lives within the roles used by the plays.
- name: Prepare nodes for upgrade
  hosts: k8s_cluster:etcd:calico_rr
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  roles:
    - { role: kubespray-defaults }
    - { role: kubernetes/preinstall, tags: preinstall }
    - { role: download, tags: download, when: "not skip_downloads" }

- name: Upgrade container engine on non-cluster nodes
  hosts: etcd:calico_rr:!k8s_cluster
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  serial: "{{ serial | default('20%') }}"
  roles:
    - { role: kubespray-defaults }
    - { role: container-engine, tags: "container-engine", when: deploy_container_engine }
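Because plays and roles carry tags, you can limit a run to just one part of the workflow. As a hypothetical invocation (adjust the playbook and inventory paths to your own setup):
ansible-playbook -i inventory/mycluster/hosts.yaml upgrade-cluster.yml --tags preinstall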
- Plays are particularly useful when you have different sets of roles targeting different sets of hosts.
- We could place all roles under a single play if the targets are the same, but this might reduce cohesion. For example, if you wanted to remove an application XYZ, you would need to locate all the relevant roles instead of just a single play.
Understand the structure of roles and collections and use their native hierarchy to store files such as tasks, templates, vars, modules, and filters appropriately; a sketch of that hierarchy follows below.
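For reference, this is roughly the skeleton that ansible-galaxy init generates for a role. The library/ and filter_plugins/ directories are optional additions for role-local modules and filters (Kubespray keeps its shared plugins at the repository level instead), and the role name here is just a placeholder:
myrole/
├── defaults/
│   └── main.yml        # user-facing variables with sane default values
├── files/
├── filter_plugins/     # optional: custom filters shipped with the role
├── handlers/
│   └── main.yml
├── library/            # optional: custom modules shipped with the role
├── meta/
│   └── main.yml
├── tasks/
│   └── main.yml
├── templates/
├── tests/
└── vars/
    └── main.yml        # developer-side helper variables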
Maintaining a clear separation between the intent of automation and its implementation is fundamental for effective Ansible practices.
The imperative part of the automation lives in the tasks/handlers files. It contains the instructions needed for the automation to execute properly, and it makes use of what is declared in the variables section.
The declarative part should be used to centralize, in an easy-to-understand format, what is expected from the automation.
But what's the difference between vars and defaults?
I see defaults/ as a set of values that are closely related to the end user and can serve as user inputs. The user may choose to override some of them, or none at all.
I see vars/ as a helper for the developer, holding things like intermediate variables or interpolations. One good use case is avoiding hard-coded or repeated values across different tasks files.
The defaults file below shows configuration preferences from the user's perspective.
# roles/container-engine/containerd/defaults/main.yml
containerd_storage_dir: "/var/lib/containerd"
containerd_state_dir: "/run/containerd"
containerd_systemd_dir: "/etc/systemd/system/containerd.service.d"
# The default value is not -999 here because containerd's oom_score_adj has been
# set to the -999 even if containerd_oom_score is 0.
# Ref: https://github.com/kubernetes-sigs/kubespray/pull/9275#issuecomment-1246499242
containerd_oom_score: 0
containerd_default_runtime: "runc"
containerd_snapshotter: "overlayfs"
# ....
The vars file below, in contrast, holds a helper variable that is used to achieve the user's goal.
# roles/container-engine/containerd/vars/debian.yml
containerd_repo_info:
  repos:
    - >
      deb {{ containerd_debian_repo_base_url }}
      {{ ansible_distribution_release | lower }}
      {{ containerd_debian_repo_component }}
We should not need to look at any tasks files to understand what the automation is doing; declarative files like vars and defaults should provide that information. Tasks files should be consulted to understand or implement how that "what" is carried out.
The code should be simple to change and to read.
Let's compare two code snippets to observe the difference in readability between them. One offers clarity and ease of modification.
# playbooks/upgrade_cluster.yml
- name: Upgrade container engine on non-cluster nodes
  hosts: etcd:calico_rr:!k8s_cluster
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  serial: "{{ serial | default('20%') }}"
  roles:
    - { role: kubespray-defaults }
    - { role: container-engine, tags: "container-engine", when: deploy_container_engine }
# fake code
- name: Upgrade container engine on non-cluster nodes
  hosts: etcd:calico_rr:!k8s_cluster
  gather_facts: False
  any_errors_fatal: "{{ any_errors_fatal | default(true) }}"
  environment: "{{ proxy_disable_env }}"
  serial: "{{ serial | default('20%') }}"
  tasks:
    - name: Include kubespray-defaults role
      include_role:
        name: kubespray-defaults
    - name: Include container-engine role
      include_role:
        name: container-engine
      tags: container-engine
      when: deploy_container_engine
It is a lot easier to understand the goal of the play by looking at the first snippet than at the second.
When all the conditions use AND, we can write them as a list.
# roles/bootstrap-os/tasks/main.yml
- name: Assign inventory name to unconfigured hostnames (non-CoreOS, non-Flatcar, Suse and ClearLinux, non-Fedora)
  hostname:
    name: "{{ inventory_hostname }}"
  when:
    - override_system_hostname
    - ansible_os_family not in ['Suse', 'Flatcar', 'Flatcar Container Linux by Kinvolk', 'ClearLinux']
    - not ansible_distribution == "Fedora"
    - not is_fedora_coreos
But what happens when there is an OR conditional in some of them?
We could use this syntax
# roles/bootstrap-os/tasks/main.yml
- name: Assign inventory name to unconfigured hostnames (CoreOS, Flatcar, Suse, ClearLinux and Fedora only)
  command: "hostnamectl set-hostname {{ inventory_hostname }}"
  register: hostname_changed
  become: true
  changed_when: false
  when: >
    override_system_hostname
    and (ansible_os_family in ['Suse', 'Flatcar', 'Flatcar Container Linux by Kinvolk', 'ClearLinux']
         or is_fedora_coreos
         or ansible_distribution == "Fedora")
We can use > or | in "when conditionals" to put the conditions vertically. This avoids long lines that require scrolling.
When matching strings, it is often better to use "in list" instead of multiple equality checks.
# Use
when: ansible_os_family in ['Suse', 'Flatcar', 'Flatcar Container Linux by Kinvolk', 'ClearLinux']
# Over:
when: ansible_os_family == "Suse" or ansible_os_family == "Flatcar" or ....
Remember that the idea here is not to enforce these two techniques, but to always keep in mind practices that will lead to better readability and maintenance in the code.
Sometimes we encounter deterministic tasks like creating a local file, while other tasks depend on external factors such as networking or unpredictable completion times.
We must ensure that such circumstances do not break our progress. While the procedure for handling unreliable tasks is simple, it's crucial not to overlook it and risk inefficiency in our playbooks.
An example from Kubespray demonstrates a good combination of features for reliability, utilizing until, retry, and delay.
# roles/remove-node/pre-remove/tasks/main.yml
- name: Remove-node | Drain node except daemonsets resource
  command: >-
    {{ kubectl }} drain
    --force
    --ignore-daemonsets
    --grace-period {{ drain_grace_period }}
    --timeout {{ drain_timeout }}
    --delete-emptydir-data {{ kube_override_hostname | default(inventory_hostname) }}
  when:
    - groups['kube_control_plane'] | length > 0
    # ignore servers that are not nodes
    - kube_override_hostname | default(inventory_hostname) in nodes.stdout_lines
  register: result
  failed_when: result.rc != 0 and not allow_ungraceful_removal
  delegate_to: "{{ groups['kube_control_plane'] | first }}"
  until: result.rc == 0 or allow_ungraceful_removal
  retries: "{{ drain_retries }}"
  delay: "{{ drain_retry_delay_seconds }}"
We retry the task up to retries: "{{ drain_retries }}" times, waiting delay: "{{ drain_retry_delay_seconds }}" seconds between attempts, until the condition until: result.rc == 0 or allow_ungraceful_removal is satisfied.
When errors occur, the automation should gracefully handle errors by providing appropriate error messages and implementing effective error-handling mechanisms.
Example: if starting the docker service fails, a block/rescue construct reports a message and removes the generated configuration.
# extra_playbooks/roles/container-engine/docker/tasks/main.yml
- name: Ensure docker started, remove our config if docker start failed and try again
  block:
    - name: Ensure service is started if docker packages are already present
      service:
        name: docker
        state: started
      when: docker_task_result is not changed
  rescue:
    - debug:  # noqa name[missing]
        msg: "Docker start failed. Try to remove our config"
    - name: Remove kubespray generated config
      file:
        path: "{{ item }}"
        state: absent
      with_items:
        - /etc/systemd/system/docker.service.d/http-proxy.conf
        - /etc/systemd/system/docker.service.d/docker-options.conf
        - /etc/systemd/system/docker.service.d/docker-dns.conf
        - /etc/systemd/system/docker.service.d/docker-orphan-cleanup.conf
      notify: Restart docker
Our automation must remain robust, ensuring that any failures occur due to external factors rather than issues originating from within the automation itself.
The ansible.cfg file allows for various performance optimizations. Let's examine the ansible.cfg configuration from Kubespray, focusing on the [ssh_connection] section:
[ssh_connection]
pipelining=True
ssh_args = -o ControlMaster=auto -o ControlPersist=30m -o ConnectionAttempts=100 -o UserKnownHostsFile=/dev/null
Pipelining: Setting pipelining=True enhances performance by reducing the number of network operations needed to transfer Ansible modules to hosts. However, for pipelining to work effectively, requiretty must be disabled on the remote hosts in the /etc/sudoers file. If requiretty is enabled, there may be issues with sudo privilege escalation, which is why pipelining is disabled by default.
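If your hosts do have requiretty enabled, one way to handle it is to drop that setting before turning pipelining on. A minimal sketch, assuming sudoers is managed directly in /etc/sudoers and that this one task can still escalate without pipelining:
- name: Disable requiretty so SSH pipelining can be used  # sketch, adapt to your distro/sudoers layout
  ansible.builtin.lineinfile:
    path: /etc/sudoers
    regexp: '^Defaults\s+requiretty'
    line: 'Defaults !requiretty'
    validate: 'visudo -cf %s'
  become: true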
SSH Arguments: The ssh_args parameter enables SSH multiplexing, which creates a master connection that can be reused for new connections. This avoids the overhead of establishing new SSH connections each time, resulting in reduced latency and improved efficiency.
- ControlMaster=auto: Enables connection sharing.
- ControlPersist=30m: Keeps the master connection open for 30 minutes, allowing subsequent connections to reuse this connection.
- ConnectionAttempts=100: Sets the maximum number of connection attempts to 100, which can improve reliability in environments with unstable connections.
- UserKnownHostsFile=/dev/null: Disables SSH host key checking, which can be useful in certain automated or temporary environments.
Although not utilized in Kubespray, it's worth mentioning the use of asynchronous tasks in other contexts. When executing a long-running task, we might want to continue performing unrelated tasks simultaneously. This is where async tasks come in handy, allowing us to run them asynchronously.
- async: 3600 - Specifies the maximum time (in seconds) the task is allowed to run.
- poll: 0 - Defines the interval (in seconds) at which Ansible checks the task's status. Setting it to 0 means Ansible won't wait for the task to complete.
It's important to note that some long-running tasks might be prerequisites for subsequent tasks. In such cases, async might not be appropriate. However, when tasks can run concurrently, such as downloading a large file while configuring other files, async is highly beneficial.
Here's an example:
- name: Start backup script in the background
  command: /usr/local/bin/backup_script.sh
  async: 3600  # Maximum runtime of 1 hour
  poll: 0      # Don't wait for the task to complete
  register: backup_job

- name: Task that does not depend on the previous async task
  ansible.builtin.shell:
    cmd: "send notification, download something ..."

# We block playbook execution here until the previous async task has finished.
- name: Check status of backup script
  async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 30  # check up to 30 times, every 60 seconds (see delay below)
  delay: 60

- name: Task that depends on the previous async task
  ansible.builtin.shell:
    cmd: "backup completed, doing something"
Implementing the above techniques will enhance the performance and resiliency of your automation, making it more suitable for production environments.
Establishing guidelines and best practices is crucial to avoid loss of productivity, bugs, and lack of readability. However, even with all the guidelines in place and commit reviews, bad code can sometimes go through. Therefore, it's essential to automate what's possible, and Kubespray uses rules for linting and pre-commit checks that we can learn from.
- Linting will verify the code for issues such as indentation problems, deprecated modules, bad styling, and many others.
- Pre-commit will enforce these checks when developers attempt to commit.
Ideally, we should have linting and other checks running both at the pre-commit stage and in the continuous integration (CI) process to ensure that linting hasn't been locally suppressed by the developer.
Referencing Kubespray, let's focus on the main linting configurations:
- Used to define the rule validations to skip (skip_list) and the paths to exclude from linting (exclude_paths). Some rule violations are tolerated even though Ansible Lint would flag them by default.
# .ansible-lint
# I removed comments and blank lines to keep this short.
---
parseable: true
skip_list:
  - 'role-name'
  - 'var-naming'
  - 'fqcn-builtins'
  - 'name[template]'
  - 'no-changed-when'
  - 'run-once[task]'
exclude_paths:
  - tests/files/custom_cni/cilium.yaml
  - venv
- If we don't want to skip a rule globally but want to skip it for certain files, we can use this file to specify the files and the rules that should be skipped.
# .ansible-lint-ignore
# This file contains ignores rule violations for ansible-lint
inventory/sample/group_vars/k8s_cluster/k8s-cluster.yml jinja[spacing]
roles/kubernetes/control-plane/defaults/main/kube-proxy.yml jinja[spacing]
roles/kubernetes/control-plane/defaults/main/main.yml jinja[spacing]
# ...
- Defines linting rules for YAML styling. This ensures consistency throughout the project. Useful paths to ignore are venv/** and **/molecule/**.
# .yamllint
---
extends: default
ignore: |
  .git/
  # Generated file
  tests/files/custom_cni/cilium.yaml
rules:
  braces:
    min-spaces-inside: 0
    max-spaces-inside: 1
  brackets:
    min-spaces-inside: 0
    max-spaces-inside: 1
  indentation:
    spaces: 2
    indent-sequences: consistent
  line-length: disable
  new-line-at-end-of-file: disable
  truthy: disable
The linters themselves are invoked with:
ansible-lint -v
yamllint --strict .
This ensures that the linters are executed every time there is a local commit:
# .pre-commit-config.yaml
# I kept only yamllint and ansible-lint for simpler visualization.
---
repos:
  - repo: https://github.com/adrienverge/yamllint.git
    rev: v1.27.1
    hooks:
      - id: yamllint
        args: [--strict]
  - repo: local
    hooks:
      - id: ansible-lint
        name: ansible-lint
        entry: ansible-lint -v
        language: python
        pass_filenames: false
        additional_dependencies:
          - .[community]
Kubespray uses GitLab CI, but the concept applies to any CI provider:
# .gitlab-ci/lint.yml
# I kept only yamllint and ansible-lint for simpler visualization.
yamllint:
  extends: .job
  stage: unit-tests
  tags: [light]
  variables:
    LANG: C.UTF-8
  script:
    - yamllint --strict .
  except: ['triggers', 'master']

ansible-lint:
  extends: .job
  stage: unit-tests
  tags: [light]
  script:
    - ansible-lint -v
  except: ['triggers', 'master']
With these tools and configurations properly set up, developers can ensure code quality even before committing, and the project will benefit from a global validation with the CI setting.
Although creating automated testing may initially feel like a loss of productivity, as projects grow more complex, it becomes an important asset. The moment we hesitate to make a code change is likely the moment we realize the importance of automated testing.
Kubespray utilizes Ansible Molecule in some of its roles. We won't deep dive into the details of Ansible Molecule here, but if this tool is unfamiliar to you, the official Molecule documentation is a good starting point.
- molecule.yml: Holds essential setup configurations like inventory and drivers for setting up targets, along with general settings.
- prepare.yml: Executes pre-operations on the environment, such as installing prerequisites.
- converge.yml: Initiates the execution to be tested, typically triggering the desired role (see the sketch after this list).
- verify.yml: Contains customized tasks for asserting the correctness of the execution.
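As a rough, illustrative sketch (not Kubespray's actual file), a converge.yml usually does little more than apply the role under test; the role name below is hypothetical:
---
- name: Converge
  hosts: all
  become: true
  tasks:
    - name: Apply the role under test
      ansible.builtin.include_role:
        name: my_role  # hypothetical role name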
An example of Kubespray molecule usage can be found in: roles/container-engine/containerd/molecule/
.
└── default
    ├── converge.yml
    ├── molecule.yml
    ├── prepare.yml
    └── tests
        └── test_default.py
One interesting aspect is that Kubespray relies on the testinfra verifier instead of the default ansible verifier, which uses verify.yml.
# roles/container-engine/containerd/molecule/default/molecule.yml
# driver: vagrant, provider: libvirt
# platforms: ubuntu, debian, ...
verifier:
  name: testinfra
This means that Ansible Molecule will not trigger verify.yml to validate the execution and test the infra, but it will use the Python code with testinfra packages to test it.
Let's examine part of this Python code from Kubespray example: /tests/test_default.py
import os

import pytest
import testinfra.utils.ansible_runner

testinfra_hosts = testinfra.utils.ansible_runner.AnsibleRunner(
    os.environ['MOLECULE_INVENTORY_FILE']).get_hosts('all')


def test_service(host):
    svc = host.service("containerd")
    assert svc.is_running
    assert svc.is_enabled


def test_version(host):
    crictl = "/usr/local/bin/crictl"
    path = "unix:///var/run/containerd/containerd.sock"
    with host.sudo():
        cmd = host.command(crictl + " --runtime-endpoint " + path + " version")
    assert cmd.rc == 0
    assert "RuntimeName: containerd" in cmd.stdout


@pytest.mark.parametrize('image, dest', [
    ('quay.io/kubespray/hello-world:latest', '/tmp/hello-world.tar')
])
def test_image_pull_save_load(host, image, dest):
    ...
The package can read the inventory and provide a simple API for validating the infrastructure.
Looking at the code, we notice how easy it is to test if a service is enabled and running:
svc = host.service("containerd")
assert svc.is_running
assert svc.is_enabled
and if a given command was executed successfully:
with host.sudo():
    cmd = host.command(crictl + " --runtime-endpoint " + path + " version")
assert cmd.rc == 0
assert "RuntimeName: containerd" in cmd.stdout
Because it's integrated with pytest, we can use pytest features like parametrized tests and even export results to OpenTelemetry if needed.
So it's a good alternative to writing these same kinds of tests using Ansible Tasks modules in the verify.yml playbook.
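For comparison, here is a minimal sketch (hypothetical, not taken from Kubespray) of the same service check written as plain Ansible tasks for the ansible verifier's verify.yml:
- name: Verify
  hosts: all
  tasks:
    - name: Collect service facts
      ansible.builtin.service_facts:

    - name: Assert containerd is running and enabled
      ansible.builtin.assert:
        that:
          - ansible_facts.services['containerd.service'].state == 'running'
          - ansible_facts.services['containerd.service'].status == 'enabled'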
It's important to note that Ansible Molecule is removing its built-in support for testinfra (ansible/molecule#3920), which means testinfra will no longer integrate seamlessly with Molecule.
For those who still want to use it, we can fall back to the Molecule ansible verifier.
# molecule.yml
# ...
verifier:
  name: ansible
Create verify.yml, set the host to localhost, and trigger pytest from the verifier playbook; the testinfra Python code will then be executed.
# verify.yml
- name: Test infra
  hosts: localhost
  tasks:
    - name: Test infra
      ansible.builtin.command:
        cmd: pytest
        chdir: ./tests/
Ansible provides a comprehensive range of collections and modules that serve virtually all IT infrastructure purposes. However, there are instances where we may not find the exact module we need from trusted sources. When we start developing custom tasks that become overly complex and difficult to read, maintain, and test, it might be time to create a customized module.
Creating a custom module not only enhances efficiency but also increases reusability. By examining existing modules, such as the one found in Kubespray's plugins folder:
.
└── modules
    └── kube.py
We can gain insights into creating a well-designed module. For full reference, see plugins/modules/kube.py in the Kubespray repo.
In this section, we'll explore how to define the arguments needed for our module.
Some key considerations include utilizing features like aliases, default values, lists of choices, and making certain arguments mutually exclusive:
def main():
    module = AnsibleModule(
        argument_spec=dict(
            name=dict(),
            filename=dict(type='list', aliases=['files', 'file', 'filenames']),
            namespace=dict(),
            resource=dict(),
            label=dict(),
            server=dict(),
            kubeconfig=dict(),
            kubectl=dict(),
            force=dict(default=False, type='bool'),
            wait=dict(default=False, type='bool'),
            all=dict(default=False, type='bool'),
            log_level=dict(default=0, type='int'),
            state=dict(default='present', choices=['present', 'absent', 'latest', 'reloaded', 'stopped', 'exists']),
            recursive=dict(default=False, type='bool'),
        ),
        mutually_exclusive=[['filename', 'list']]
    )
Based on the state passed by the user, the module calls the appropriate function. If no matching state is found, it exits with a proper error message inside the else block:
if state == 'present':
    result = manager.create(check=False)
elif state == 'absent':
    result = manager.delete()
else:
    module.fail_json(msg='Unrecognized state %s.' % state)

module.exit_json(changed=changed,
                 msg='success: %s' % (' '.join(result)))
Inside the state function, it builds the command to be executed by appending the arguments based on the input received from the user using the module:
def create(self, check=True, force=True):
    if check and self.exists():
        return []

    cmd = ['apply']

    if force:
        cmd.append('--force')
    if self.wait:
        cmd.append('--wait')
    if self.recursive:
        cmd.append('--recursive={}'.format(self.recursive))

    if not self.filename:
        self.module.fail_json(msg='filename required to create')
    cmd.append('--filename=' + ','.join(self.filename))

    return self._execute(cmd)
Finally, it calls the execute command that runs the command. If the command fails to run or the result code is different from 0, it returns an error message. Otherwise, it returns the output from the command:
def _execute(self, cmd):
    args = self.base_cmd + cmd
    try:
        rc, out, err = self.module.run_command(args)
        if rc != 0:
            self.module.fail_json(
                msg='error running kubectl (%s) command (rc=%d), out=\'%s\', err=\'%s\'' % (' '.join(args), rc, out, err))
    except Exception as exc:
        self.module.fail_json(
            msg='error running kubectl (%s) command: %s' % (' '.join(args), str(exc)))
    return out.splitlines()
Documentation for the module should also be provided, explaining the intent, the arguments, and providing examples of how to use it.
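As an illustration (not Kubespray's actual documentation block), module docs are conventionally embedded as YAML inside DOCUMENTATION and EXAMPLES strings at the top of the module, along these lines:
module: kube
short_description: Manage Kubernetes resources with kubectl
options:
  name:
    description: Name of the resource being managed.
  filename:
    description: Path(s) to files containing the resource definitions.
    aliases: [files, file, filenames]
  state:
    description: Desired state of the resource.
    default: present
    choices: [present, absent, latest, reloaded, stopped, exists]
With that in place, ansible-doc can render the documentation for users of the module.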
For this module, we could use it like this:
- name: test nginx is present
  kube: name=nginx resource=rc state=present
In summary, diving into real-world projects like Kubespray offers invaluable insights. This hands-on approach not only enhances our technical skills but also prepares us to handle the complexities of production environments with greater confidence.
There are many Ansible projects, such as AWX, or collections that we could use as references.
I hope these insights help you increase your confidence and improve your automations. If you liked this content, please leave a comment or a like so I know to bring more content like this.