
My Homelab Setup

or: How to Overengineer a Homelab

My first real dive into homelabbing began in 2021 when I discovered virtualization on servers. I was still new to Linux and Bash, and I often messed up systems - sometimes on the very day I’d installed them.

Honestly, discovering virtualization was a huge relief!

After choosing Proxmox VE (PVE) as my virtualization platform, restoring a system became as simple as a single mouse click.

Of course, this brought new challenges...

Current State

Today, my entire setup is managed via Ansible roles. While writing these roles, I followed several key principles:

  • One size fits all

    There is a single playbook to configure any server. All details about the desired state of the server are expected to come from the inventory, rather than from the admin's choice of playbook.

  • Updates included

    What is the total cost of introducing a new application? The usual answer covers only the effort of setting it up, leaving regular patching aside. Here, patching is built into the roles, and no binaries are vendored into the repository.

  • Idempotency

    This principle is inherent to the toolset used: I always define a desired state, and the Ansible code then tells me whether the server was already in that state or whether it had to make changes. As a consequence, using the shell module isn’t entirely forbidden, but it requires a way of determining whether the script actually changed something, usually based on its exit code (a minimal sketch follows after this list).

  • Single Layer of Abstraction

    The Single Layer of Abstraction Principle (SLAP) applies here: Each layer of abstraction for an application server is separated into a role. Tasks that are necessary but not true abstraction layers within the configuration set are inherently not roles. As a result, each server has only one main role in the inventory, which may reference other roles as dependencies. Example: Virtual machines need to be created, but there is no create_vm role to encapsulate VM creation logic since this does not add an abstraction layer to server configuration.
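To illustrate the idempotency point: a shell task can be made to report changes honestly by mapping the script's exit codes. This is only a minimal sketch with an invented script name, not a task from my actual roles:

```yaml
- name: Ensure the application schema is migrated (illustrative script name)
  ansible.builtin.shell: /usr/local/bin/migrate-schema.sh
  register: migrate_result
  changed_when: migrate_result.rc == 2          # the script exits 2 when it changed something
  failed_when: migrate_result.rc not in [0, 2]  # anything else counts as a real failure
```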

But let's dive straight into it!

Roles Overview

My roles make extensive use of dependencies, so I found the easiest way to document this is with a graph:

graph TD

alertmanager ---> debian
alertmanager ---> nginx
autoup ---> debian
autoup ---> nginx
debian ---> baseline
etckeeper ---> debian
grafana ---> debian
grafana ---> nginx
keepalived ---> debian
netbox ---> nixos
nextcloud ---> debian
nextcloud ---> nginx
nextcloud ---> php
nextcloud ---> redis
nextcloud ---> postgres
nixos ---> baseline
nginx ---> debian
pbs ---> debian
php ---> debian
pihole ---> debian
pihole ---> keepalived
pihole ---> unbound
postgres --> debian
prometheus ---> debian
prometheus ---> nginx
prometheus ---> alertmanager
pve ---> debian
pve ---> pveceph
pveceph ---> debian
redis ---> debian
semaphore ---> debian
semaphore ---> nginx
semaphore ---> postgres
unbound ---> debian
vaultwarden ---> nixos

What’s the advantage?
Using NetBox as an inventory source, I can assign a single role to any server - say, nextcloud - and automatically handle all its dependencies, like the database, PHP installation, and reverse proxy. Package managers work similarly; you get the idea.
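In Ansible terms, this is driven by role dependencies. As a sketch (the actual layout in my repository may differ slightly), the nextcloud role could pull in its building blocks via meta/main.yml:

```yaml
# roles/nextcloud/meta/main.yml (sketch based on the dependency graph above)
dependencies:
  - role: debian
  - role: nginx
  - role: php
  - role: redis
  - role: postgres
```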

Configuring a Server

The playbooks/generic_playbook.yml can handle any server configuration action. However, you can adjust the playbook’s behavior with a few variables:

| Variable | Possible Values | Default | Description |
| --- | --- | --- | --- |
| `target` | `<hostname>`, `<inventory group>` | `all` | Specifies the server or group targeted by the Ansible run. |
| `deploy_guest` | `false`, `"true"`, `"redeploy"`, `"purge"`, `"backup"`, `"only"` | `false` | `false`: Ignores `playbooks/tasks/manage_guest.yml` (for non-virtual servers).<br>`"true"`: Ensures the guest is deployed with the config from the inventory (CPU, RAM, IP, etc.).<br>`"redeploy"`: Backs up the guest to PBS, removes it from PVE, then redeploys as `"true"`.<br>`"purge"`: Backs up the guest to PBS, removes it from PVE (backup retained).<br>`"backup"`: Backs up the guest to PBS, then ends the playbook.<br>`"only"`: Acts as `"true"`, then ends the playbook. |
| `upgrade` | `false`, `"true"` | `false` | `false`: Runs regular configuration management.<br>`"true"`: Runs configuration management and updates the server (like a full `apt upgrade`, but for everything installed). |
| `prohibit_restore` | `false`, `"true"` | `false` | `false`: Restores the guest if a role fails while `upgrade == true`; otherwise the run simply fails.<br>`"true"`: Suppresses guest restoration (useful for debugging or reproducing errors). |
| `serial` | `false`, `"true"` | `false` | Enables optional serial execution for groups of servers. |
| `include_role` | `<role name>` | unset | Overrides the `device_role` field from NetBox (for testing purposes). |
| `quick` | `false`, `"true"` | `false` | `false`: Runs the standard configuration.<br>`"true"`: Skips time-consuming install steps where possible (use cautiously). |
| `only_role` | `<role name>` | unset | Runs only the given role, skipping roles pulled in as dependencies (for development purposes). |

Let's take a broader look at the workflow in a few example scenarios:

$ ansible-playbook playbooks/generic_playbook.yml -e target=test-vm

This execution assumes that the server already exists with the right properties set.

graph LR

Start ---> assertions[Basic Assertions]
assertions ---> roles[Apply Roles]
roles ---> End
$ ansible-playbook playbooks/generic_playbook.yml -e target=test-vm -e deploy_guest=true
graph LR

Start ---> assertions[Basic Assertions]
assertions ---> create
subgraph Manage Guest
create[Create Guest] --> configure[Configure VM Properties]
end
configure ---> roles["Apply Roles"]
roles ---> End
$ ansible-playbook playbooks/generic_playbook.yml -e target=test-vm -e deploy_guest=purge
graph LR

Start ---> assertions[Basic Assertions]
assertions ---> backup
subgraph Manage Guest
backup[Backup Guest] --> delete[Delete Guest]
end
delete ---> End
$ ansible-playbook playbooks/generic_playbook.yml -e target=test-vm -e deploy_guest=redeploy
graph LR

Start ---> assertions[Basic Assertions]
assertions ---> backup
subgraph Manage Guest
backup[Backup Guest] --> delete[Delete Guest] --> create[Create Guest] --> configure[Configure VM Properties]
end
configure ---> roles["Apply Roles"]
roles ---> End
$ ansible-playbook playbooks/generic_playbook.yml -e target=test-vm -e deploy_guest=true -e upgrade=true
graph LR

Start ---> assertions[Basic Assertions]
assertions ---> backup[Backup to PBS]
subgraph Manage Guest
backup[Backup Guest] --x delete[Delete Guest] --x create[Create Guest] --> configure[Configure VM Properties]
backup --> create
end
configure ---> roles["Apply Roles (This Includes Upgrades)"]
roles ---> End

or, if a role fails while upgrading:

graph LR

Start ---> assertions[Basic Assertions]
assertions ---> backup[Backup to PBS]
subgraph Manage Guest
backup[Backup Guest] --x delete[Delete Guest] --x create[Create Guest] --> configure[Configure VM Properties]
backup --> create
end
configure ---> roles["Apply Roles (This Includes Upgrades)"]
roles ---x End
roles -.-> restore["Restore Backup from the Beginning"] --> End
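The "Create Guest" step in the diagrams above boils down to a call against the Proxmox API. As a hedged illustration (this is not my actual manage_guest.yml; the variable names are placeholders), such a task could use the community.general.proxmox_kvm module:

```yaml
# Sketch: create a VM on PVE; credentials and sizing would come from the inventory
- name: Create the guest on PVE (illustrative)
  community.general.proxmox_kvm:
    api_host: "{{ pve_api_host }}"
    api_user: "{{ pve_api_user }}"
    api_token_id: "{{ pve_api_token_id }}"
    api_token_secret: "{{ pve_api_token_secret }}"
    node: "{{ pve_node }}"
    name: "{{ inventory_hostname }}"
    cores: "{{ vcpus }}"
    memory: "{{ memory_mb }}"
    net:
      net0: "virtio,bridge=vmbr0"
    state: present
  delegate_to: localhost
```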

My Ansible Inventory

As mentioned above, one of the primary goals is to strictly separate the data defining how something is configured from the data describing which systems exist and what runs on them.
In the context of Ansible, this results in a strict separation between my Ansible roles and my inventory.

Many organizations achieve this separation by setting up CMDBs or similar systems. Following this approach, I decided to use NetBox, a software initially developed by DigitalOcean to serve as their IP address management (IPAM), data center and infrastructure management (DCIM), and configuration management database (CMDB) solution.

When deciding to use NetBox, I considered the following key advantages:

  • Ansible Compatibility:

    NetBox offers an Ansible inventory plugin that fetches the inventory directly from the NetBox API. The required configuration is relatively simple (a minimal sketch follows after this list).

  • Feature Completeness:

    I have been able to add everything I need in my inventory as an entry in NetBox. The schema is (at least for my use case) highly feature-complete and enables extensive usage, including grouping, adding hostvars and groupvars, and more.

  • Widespread Usage

    In the IT industry, NetBox is not a new product and is already widely adopted by many enterprises. Its popularity has grown particularly in Europe since the introduction of the NIS 2 directive, which effectively requires enterprises to maintain an asset inventory such as a CMDB.

  • Automation

    The level of automation enabled by NetBox largely depends on how it is utilized and is potentially limitless. While I am still in the early stages of leveraging its automation capabilities, I already find it incredibly convenient that changes made in my inventory are automatically reflected on my servers without requiring me to manually trigger any jobs.
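For reference, a minimal configuration of the netbox.netbox.nb_inventory plugin looks roughly like this (a sketch; my actual grouping and compose options are more elaborate):

```yaml
# netbox_inventory.yml (minimal sketch)
plugin: netbox.netbox.nb_inventory
api_endpoint: https://netbox.example.org
token: "{{ lookup('env', 'NETBOX_TOKEN') }}"
group_by:
  - device_roles
interfaces: false
```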

An example of how I set up my NetBox instance before having a fully developed inventory can be found here.

Semaphore

Ansible Semaphore operates as a blend of tools like Rundeck and AWX.

I chose this project for several reasons:

  • Simplicity: Installing this software does not require an OCI image or a Helm chart (although such options exist). A straightforward installation via apt is equally possible.
  • API: The project provides a comprehensive REST API, which I leverage for triggering jobs via webhooks, such as from NetBox.
  • Accessibility: A turnkey image is available for experimenting with Ansible Semaphore. I extensively used this option before deciding it was worth creating a dedicated Ansible role.

Dataflow (Ansible)

Since the entire setup is managed by Ansible, there is only one primary flow of data:

  • My Ansible Semaphore loads the inventory by scraping the NetBox API.
  • The loaded configuration is then applied to all Ansible-managed systems in my homelab (currently, this includes nearly all of them).
graph LR

inventory(Ansible Inventory)

semaphore[Ansible Semaphore]

subgraph Other managed Systems
vm1[VM 1]
vm2[VM 2]
lxc1(LXC 1)
lxc2(LXC 2)
pve[PVE]
other[...]
end

inventory ==triggers ansible run on inventory changes==> semaphore
semaphore -.scrapes inventory.-> inventory
semaphore --applies ansible config--> inventory
semaphore ---> vm1
semaphore ---> vm2
semaphore --applies ansible config--> lxc1
semaphore ---> lxc2
semaphore ---> pve
semaphore ---> other

Prometheus Configuration

This section won't delve into what Prometheus is, what it does, or why it operates the way it does.

Instead, I'd like to highlight that I use Prometheus to monitor my entire setup. Before integrating Ansible, this involved a significant amount of manual work, as I had to add each server to the Prometheus configuration manually. However, with Ansible and the Jinja2 templating language, it became relatively straightforward to configure Prometheus to scrape the appropriate servers dynamically.

My first iteration of the Prometheus role included a fairly basic template definition for the Prometheus configuration. If you take a quick look at it, you’ll notice that the template contains a specific paragraph for each exporter installed by other roles.

This design had a major drawback: adding the nginx-prometheus exporter to the nginx role required a corresponding change in the Prometheus role. Similarly, adding the pve-exporter to the pve role also necessitated a change in the Prometheus role. Each new exporter caused a similar ripple effect, indicating a clear code smell: a violation of the open-closed principle.

To align with this principle, the ideal prometheus.yml template should contain only a generic definition of how scrape jobs are structured. Specific details, such as the exporter ports, should not reside in the template or the Prometheus role itself. Instead, these details should be encapsulated within the role responsible for installing the exporter, and then imported or integrated dynamically.

Unfortunately, Ansible does not natively support this kind of modular configuration management (at least not to my knowledge).

I devised the following concept to improve the modularity and code quality of the Prometheus role:

  • Roles as hierarchical layers: Roles are treated as hierarchical layers of abstraction, where each role can have other roles as dependencies.
  • Dependency map: This hierarchy can be represented as a map, with each role being a key that maps to all roles it depends on.
  • Reversed dependency map: The dependency map can be reversed, such that each role maps to a list of all roles that depend on it.
  • Recursive resolution: The reversed dependency map can be resolved recursively, enabling each role to map not only to directly dependent ones but also to indirect dependencies.
  • Exporter configuration variables: Roles that configure a Prometheus exporter on a server can include the variables prometheus_role_exporter_path_<exporter_name> and prometheus_role_exporter_port_<exporter_name> in their defaults/main.yml, vars/main.yml, or group_vars/<role_name>.yml files (example).
  • Automation via custom module: A custom module, prometheus_dependency_map_info, is used to process these rules and generate a map like the following:
"dependency_map": {
    "alertmanager": {
        "dependent_roles": [
            "alertmanager",
            "prometheus"
        ],
        "exporter_path": false,
        "exporter_port": 9093
    },
    "autoup": {
        "dependent_roles": [
            "autoup"
        ],
        "exporter_path": false,
        "exporter_port": false
    },
    "nextcloud": {
        "dependent_roles": [
            "nextcloud"
        ],
        "exporter_path": false,
        "exporter_port": 9205
    },
    "nginx": {
        "dependent_roles": [
            "autoup",
            "nextcloud",
            "vaultwarden",
            "alertmanager",
            "prometheus",
            "semaphore",
            "grafana",
            "nginx",
            "netbox"
        ],
        "exporter_path": false,
        "exporter_port": 9113
    },
    "pve": {
        "dependent_roles": [
            "pve"
        ],
        "exporter_path": "/pve",
        "exporter_port": 9221
    },
    ...
}

Let me explain this "mysterious" map of roles and their Prometheus exporters:

| Role Name | Comment |
| --- | --- |
| `autoup` | The role itself does not directly add a Prometheus exporter to its server. Instead, it indirectly adds the exporter installed by the `nginx` role, which is one of its dependencies. It is listed here because it is part of the repository, but no `exporter_port` or `exporter_path` is defined. The only role that depends on `autoup` is `autoup` itself, and it is not a dependency of any other role. |
| `nextcloud` | This role adds its own Prometheus exporter to the server, exposed at `http://<servername>:9205/metrics`. It is listed here because its `defaults/main.yml` defines the variable `prometheus_role_exporter_port_nextcloud: 9205`. `nextcloud` is a high-level role that is not currently a dependency of any other role. |
| `alertmanager` | This role adds its Prometheus exporter to the server, exposed at `http://<servername>:9093/metrics`. It is listed here for the variable `prometheus_role_exporter_port_alertmanager: 9093` defined in its `defaults/main.yml`. The role itself is included as a dependency on servers configured with the `prometheus` role. |
| `pve` | This role installs the PVE exporter, which listens on `http://<servername>:9221/pve`. It defines both `prometheus_role_exporter_port_pve: 9221` and `prometheus_role_exporter_path_pve: /pve` in its `defaults/main.yml`. `pve` is currently a high-level role that is not included as a dependency in any other roles. |
| `nginx` | This role installs the nginx exporter, which listens on `http://<servername>:9113/metrics`. It defines `prometheus_role_exporter_port_nginx: 9113` in its `defaults/main.yml`. `nginx` is a lower-level, generic role included by several other roles, so the set of servers to scrape nginx metrics from corresponds to the union of the inventory groups identified by the `dependent_roles` list. |

The final version of the prometheus.yml template leverages this dataset, combined with the inventory groups, to determine the hosts to scrape for each Prometheus exporter.
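A hedged sketch of what such a generic template can look like, assuming the custom module's result has been registered and exposed as dependency_map (simplified: empty groups and duplicate hosts are not handled here, and the real template differs in detail):

```yaml
# templates/prometheus.yml.j2 (simplified sketch)
scrape_configs:
{% for role, cfg in dependency_map.items() %}
{% if cfg.exporter_port %}
  - job_name: "{{ role }}"
    metrics_path: "{{ cfg.exporter_path if cfg.exporter_path else '/metrics' }}"
    static_configs:
      - targets:
{% for dependent_role in cfg.dependent_roles %}
{% for host in groups[dependent_role] | default([]) %}
          - "{{ host }}:{{ cfg.exporter_port }}"
{% endfor %}
{% endfor %}
{% endif %}
{% endfor %}
```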

Notes
  • Adding exporters from unrelated roles (i.e., roles that neither directly nor indirectly include the foreign role) can be done as shown in this example. However, this approach should be used with caution to avoid unintended consequences.

  • When reviewing the prometheus.yml template, you may have noticed that the vanilla prometheus-node-exporter is handled differently. This is because I decided to scrape LXC metrics using a separate exporter job. The rationale is that the resource-related metrics provided by the default prometheus-node-exporter are tied to the underlying Proxmox (PVE) host, making them unusable for my purposes. However, other metrics, such as pending upgrades, remain highly relevant, so the scrape job for LXCs is still essential.

With all of this being implemented, any changes to another server in the inventory, such as adding a new role, necessitate a redeployment of the Prometheus configuration. To address this, I created a dedicated Ansible job specifically for this purpose. This job runs every time a change is made to the inventory, although a more granular approach could achieve the same effect.

Custom Prometheus Metrics

By default, the prometheus-node-exporter exports any *.prom file located in /var/lib/prometheus/node-exporter as a custom metric. This allows for tracking custom metrics, such as the timestamp of the last system update, via Prometheus. The greatest advantage of this feature, however, is realized in the implementation of autoup.
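As a sketch of how a role can drop such a metric (the metric name here is made up for illustration), an Ansible task simply writes the file into the textfile collector directory:

```yaml
- name: Record the timestamp of the last successful upgrade (illustrative metric)
  ansible.builtin.copy:
    dest: /var/lib/prometheus/node-exporter/last_upgrade.prom
    content: |
      # HELP custom_last_upgrade_timestamp_seconds Unix time of the last successful upgrade run
      # TYPE custom_last_upgrade_timestamp_seconds gauge
      custom_last_upgrade_timestamp_seconds {{ ansible_date_time.epoch }}
    owner: root
    group: root
    mode: "0644"
```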

Upgrade Jobs

As mentioned in the section above, the only difference between a configuration management job and an upgrade job is the additional -e upgrade=true parameter when executing generic_playbook.yml. As a result, upgrade jobs:

run the full configuration management logic

None of the "usual" tasks should be skipped because the job is flagged as an upgrade job. However, this does not imply that configuration management can be casually executed as an upgrade job, as these jobs expect a machine that is already set up.

apply system updates / system upgrades via the distribution's package manager

Typically, this corresponds to commands such as:

sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"

Essentially, nothing overly complex or unique.

reboot the system to ensure kernel upgrades are applied instantly too

Some individuals strive to avoid clean reboots at all costs, utilizing technologies like live-patching of the Linux kernel. However, establishing a well-tested, planned, and stable routine of reboots often yields the same benefits as live-patching, with one key advantage: even those who generally avoid reboots will eventually have to perform one. On that day, I prefer to remain unfazed, as reboots are already a routine operation.
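The reboot itself is unspectacular in Ansible; a sketch (the timeout is a placeholder):

```yaml
- name: Reboot to activate the new kernel (sketch)
  ansible.builtin.reboot:
    reboot_timeout: 600
  when: upgrade | default(false) | bool
```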

update any software initially deployed via ansible but without using a package manager

A good example of this process can be found in scenarios like the installation of Prometheus Alertmanager from GitHub as a binary source.
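The pattern behind it, sketched here with placeholder paths and URLs (not my actual Alertmanager tasks; linking the binary and restarting the service are omitted): fetch the release archive only when the installed version differs from the desired one, so the task stays idempotent and an upgrade is just a version bump.

```yaml
- name: Check the installed Alertmanager version (illustrative)
  ansible.builtin.command: /usr/local/bin/alertmanager --version
  register: am_version
  changed_when: false
  failed_when: false

- name: Download and unpack the desired release (placeholder URL and path)
  ansible.builtin.unarchive:
    src: "https://github.com/prometheus/alertmanager/releases/download/v{{ alertmanager_version }}/alertmanager-{{ alertmanager_version }}.linux-amd64.tar.gz"
    dest: /opt
    remote_src: true
  when: alertmanager_version not in (am_version.stdout | default('') + am_version.stderr | default(''))
```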

take the server out of clusters to ensure everything works seamlessly

See the example for keepalived and for pve. The case of PVE is a bit special, however, since there is neither a real way nor a need to temporarily and gracefully remove nodes from the cluster. I'll explain more about this in the paragraph about the PVE role.

are supposed to fail in all circumstances that can reasonably be caused by the upgrade job

In such cases, the buggy state is openly communicated by the failed job, prompting manual intervention and debugging by me. This approach avoids "hidden bugs" lingering within my systems. It includes testing the servers for running services, though admittedly, my testing in this area is currently insufficient.
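A simple form of such a check, sketched with service_facts and an example service name:

```yaml
- name: Collect service states
  ansible.builtin.service_facts:

- name: Assert that the application service survived the upgrade (example service name)
  ansible.builtin.assert:
    that:
      - ansible_facts.services['nginx.service'].state == 'running'
    fail_msg: "nginx is not running after the upgrade"
```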

are expected to be non-destructive

This requirement probably needs some more explanation:

What are destructive actions in this context?

There are a few primary states in which a server might find itself during its lifecycle:

graph LR

created_inventory[Guest created in the inventory]
desired_state[Guest fully configured, serving requests]
deleted[Guest deleted]
failed{{FAILED: Guest configured with errors, needs human intervention}}

created_inventory --NetBox triggers generic_playbook.yml in Semaphore, guest gets installed and configured--> desired_state
desired_state --Guest fulfilled its purpose and gets deleted--> deleted
desired_state ==Upgrade job runs successfully==> desired_state
desired_state ==Upgrade job fails while changing things==> failed

The failed state is the one to avoid. Why? Upgrades are supposed to happen often but should not strike fear into the admin's heart. If there is a system to upgrade, the main question should be, "When can I afford a 10-minute downtime?" rather than, "What do I do in case it fails, and how much time should I reserve for that beforehand?"

To address this, backups are made of all guests before upgrading them and get restored in case anything fails during the procedure. This, of course, still requires debugging the system afterward but prevents extended downtime in the meantime.
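Conceptually this maps onto Ansible's block/rescue error handling. A stripped-down sketch (the included task files are placeholders for the actual PBS backup and restore logic):

```yaml
- name: Upgrade with automatic rollback (sketch)
  block:
    - name: Back up the guest to PBS before touching it
      ansible.builtin.include_tasks: backup_guest.yml   # placeholder task file

    - name: Apply roles, including upgrades
      ansible.builtin.include_role:
        name: "{{ device_role }}"
  rescue:
    - name: Restore the backup taken at the beginning
      ansible.builtin.include_tasks: restore_guest.yml  # placeholder task file
      when: not (prohibit_restore | default(false) | bool)

    - name: Still mark the run as failed so the problem stays visible
      ansible.builtin.fail:
        msg: "Upgrade failed; the guest was restored from the pre-upgrade backup"
```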

With all of these factors applied, a run of the Ansible-based upgrade job is already disproportionately more powerful than simply installing unattended-upgrades on each server. It provides better insight into the processes occurring, offers the admin greater control, and is more robust in handling complex procedures. This significant advantage serves as the foundation for leveraging the next layer of complexity integrated into the setup.
