
My Homelab Setup

or: How to Overengineer a Homelab

My first real dive into homelabbing began in 2021 when I discovered virtualization on servers. I was still new to Linux and Bash, and I often messed up systems - sometimes on the very day I’d installed them.

Honestly, discovering virtualization was a huge relief!

After choosing Proxmox VE (PVE) as my virtualization platform, restoring a system became as simple as a single mouse click.

Of course, this brought new challenges...

Current State

Today, my entire setup is managed via Ansible roles. While writing these roles, I followed several key principles:

  • One size fits all

    There is a single playbook to configure any server. All details about the desired state of the server are expected to come from the inventory, rather than from the admin's choice of playbook.

  • Updates included

    What is the total cost of introducing a new application? The usual answer covers only the effort of setting it up, leaving regular patching aside. Here, patching is built into the roles, and no binaries are vendored into the repository.

  • Idempotency

    This principle is inherent to the toolset used: I always define a desired state, and the Ansible code then tells me whether the server was already in that state or whether it had to make changes. As a consequence, using the shell module isn’t entirely forbidden, but it requires a way of determining whether the script actually changed something, usually based on its exit code (a minimal sketch follows after this list).

  • Single Layer of Abstraction

    The Single Layer of Abstraction Principle (SLAP) applies here: Each layer of abstraction for an application server is separated into a role. Tasks that are necessary but not true abstraction layers within the configuration set are inherently not roles. As a result, each server has only one main role in the inventory, which may reference other roles as dependencies. Example: Virtual machines need to be created, but there is no create_vm role to encapsulate VM creation logic since this does not add an abstraction layer to server configuration.
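To illustrate the idempotency point: a shell task can be made to report changes honestly by mapping the script's exit codes. This is only a minimal sketch with an invented script name, not a task from my actual roles:

```yaml
- name: Ensure the application schema is migrated (illustrative script name)
  ansible.builtin.shell: /usr/local/bin/migrate-schema.sh
  register: migrate_result
  changed_when: migrate_result.rc == 2          # the script exits 2 when it changed something
  failed_when: migrate_result.rc not in [0, 2]  # anything else counts as a real failure
```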

But let's dive straight into it!

Roles Overview

My roles make extensive use of dependencies, so I found the easiest way to document this is with a graph:

graph TD

alertmanager ---> debian
alertmanager ---> nginx
autoup ---> debian
autoup ---> nginx
debian ---> baseline
etckeeper ---> debian
grafana ---> debian
grafana ---> nginx
keepalived ---> debian
netbox ---> nixos
nextcloud ---> debian
nextcloud ---> nginx
nextcloud ---> php
nextcloud ---> redis
nextcloud ---> postgres
nixos ---> baseline
nginx ---> debian
pbs ---> debian
php ---> debian
pihole ---> debian
pihole ---> keepalived
pihole ---> unbound
postgres --> debian
prometheus ---> debian
prometheus ---> nginx
prometheus ---> alertmanager
pve ---> debian
pve ---> pveceph
pveceph ---> debian
redis ---> debian
semaphore ---> debian
semaphore ---> nginx
semaphore ---> postgres
unbound ---> debian
vaultwarden ---> nixos

What’s the advantage?
Using NetBox as an inventory source, I can assign a single role to any server - say, nextcloud - and automatically handle all its dependencies, like the database, PHP installation, and reverse proxy. Package managers work similarly; you get the idea.
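In Ansible terms, this is driven by role dependencies. As a sketch (the actual layout in my repository may differ slightly), the nextcloud role could pull in its building blocks via meta/main.yml:

```yaml
# roles/nextcloud/meta/main.yml (sketch based on the dependency graph above)
dependencies:
  - role: debian
  - role: nginx
  - role: php
  - role: redis
  - role: postgres
```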

Configuring a Server

The playbooks/generic_playbook.yml can handle any server configuration action. However, you can adjust the playbook’s behavior with a few variables:

| Variable | Possible Values | Default | Description |
| --- | --- | --- | --- |
| `target` | `<hostname>`, `<inventory group>` | `all` | Specifies the server or group targeted by the Ansible run. |
| `deploy_guest` | `false`, `"true"`, `"redeploy"`, `"purge"`, `"backup"`, `"only"` | `false` | `false`: Ignores `playbooks/tasks/manage_guest.yml` (for non-virtual servers).<br>`"true"`: Ensures the guest is deployed with the config from the inventory (CPU, RAM, IP, etc.).<br>`"redeploy"`: Backs up the guest to PBS, removes it from PVE, then redeploys as `"true"`.<br>`"purge"`: Backs up the guest to PBS, removes it from PVE (backup retained).<br>`"backup"`: Backs up the guest to PBS, then ends the playbook.<br>`"only"`: Acts as `"true"`, then ends the playbook. |
| `upgrade` | `false`, `"true"` | `false` | `false`: Runs regular configuration management.<br>`"true"`: Runs configuration management and updates the server (like a full `apt upgrade`, but for everything installed). |
| `prohibit_restore` | `false`, `"true"` | `false` | `false`: Restores the guest if a role fails while `upgrade == true`; otherwise the run simply fails.<br>`"true"`: Suppresses guest restoration (useful for debugging or reproducing errors). |
| `serial` | `false`, `"true"` | `false` | Enables optional serial execution for groups of servers. |
| `include_role` | `<role name>` | unset | Overrides the `device_role` field from NetBox (for testing purposes). |
| `quick` | `false`, `"true"` | `false` | `false`: Runs the standard configuration.<br>`"true"`: Skips time-consuming install steps where possible (use cautiously). |
| `only_role` | `<role name>` | unset | Runs only the given role, skipping roles pulled in as dependencies (for development purposes). |

Let's take a broader look at the workflow in a few example scenarios:

$ ansible-playbook playbooks/generic_playbook.yml -e target=test-vm

This execution assumes that the server already exists with the right properties set.

graph LR

Start ---> assertions[Basic Assertions]
assertions ---> roles[Apply Roles]
roles ---> End
$ ansible-playbook playbooks/generic_playbook.yml -e target=test-vm -e deploy_guest=true
graph LR

Start ---> assertions[Basic Assertions]
assertions ---> create
subgraph Manage Guest
create[Create Guest] --> configure[Configure VM Properties]
end
configure ---> roles["Apply Roles"]
roles ---> End
$ ansible-playbook playbooks/generic_playbook.yml -e target=test-vm -e deploy_guest=purge
graph LR

Start ---> assertions[Basic Assertions]
assertions ---> backup
subgraph Manage Guest
backup[Backup Guest] --> delete[Delete Guest]
end
delete ---> End
$ ansible-playbook playbooks/generic_playbook.yml -e target=test-vm -e deploy_guest=redeploy
graph LR

Start ---> assertions[Basic Assertions]
assertions ---> backup
subgraph Manage Guest
backup[Backup Guest] --> delete[Delete Guest] --> create[Create Guest] --> configure[Configure VM Properties]
end
configure ---> roles["Apply Roles"]
roles ---> End
$ ansible-playbook playbooks/generic_playbook.yml -e target=test-vm -e deploy_guest=true -e upgrade=true
graph LR

Start ---> assertions[Basic Assertions]
assertions ---> backup[Backup to PBS]
subgraph Manage Guest
backup[Backup Guest] --x delete[Delete Guest] --x create[Create Guest] --> configure[Configure VM Properties]
backup --> create
end
configure ---> roles["Apply Roles (This Includes Upgrades)"]
roles ---> End

or, if a role fails while upgrading:

graph LR

Start ---> assertions[Basic Assertions]
assertions ---> backup[Backup to PBS]
subgraph Manage Guest
backup[Backup Guest] --x delete[Delete Guest] --x create[Create Guest] --> configure[Configure VM Properties]
backup --> create
end
configure ---> roles["Apply Roles (This Includes Upgrades)"]
roles ---x End
roles -.-> restore["Restore Backup from the Beginning"] --> End
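The "Create Guest" step in the diagrams above boils down to a call against the Proxmox API. As a hedged illustration (this is not my actual manage_guest.yml; the variable names are placeholders), such a task could use the community.general.proxmox_kvm module:

```yaml
# Sketch: create a VM on PVE; credentials and sizing would come from the inventory
- name: Create the guest on PVE (illustrative)
  community.general.proxmox_kvm:
    api_host: "{{ pve_api_host }}"
    api_user: "{{ pve_api_user }}"
    api_token_id: "{{ pve_api_token_id }}"
    api_token_secret: "{{ pve_api_token_secret }}"
    node: "{{ pve_node }}"
    name: "{{ inventory_hostname }}"
    cores: "{{ vcpus }}"
    memory: "{{ memory_mb }}"
    net:
      net0: "virtio,bridge=vmbr0"
    state: present
  delegate_to: localhost
```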

My Ansible Inventory

As mentioned above, one of the primary goals is to strictly separate the data defining how something is configured from the data describing which systems exist and what runs on them.
In the context of Ansible, this results in a strict separation between my Ansible roles and my inventory.

Many organizations achieve this separation by setting up CMDBs or similar systems. Following this approach, I decided to use NetBox, a software initially developed by DigitalOcean to serve as their IP address management (IPAM), data center and infrastructure management (DCIM), and configuration management database (CMDB) solution.

When deciding to use NetBox, I considered the following key advantages:

  • Ansible Compatibility:

    NetBox offers an Ansible inventory plugin that fetches the inventory directly from the NetBox API. The required configuration is relatively simple (a minimal sketch follows after this list).

  • Feature Completeness:

    I have been able to add everything I need in my inventory as an entry in NetBox. The schema is (at least for my use case) highly feature-complete and enables extensive usage, including grouping, adding hostvars and groupvars, and more.

  • Widespread Usage

    In the IT industry, NetBox is not a new product and is already widely adopted by many enterprises. Its popularity has grown particularly in Europe since the introduction of the NIS 2 directive, which effectively requires enterprises to maintain an asset inventory such as a CMDB.

  • Automation

    The level of automation enabled by NetBox largely depends on how it is utilized and is potentially limitless. While I am still in the early stages of leveraging its automation capabilities, I already find it incredibly convenient that changes made in my inventory are automatically reflected on my servers without requiring me to manually trigger any jobs.
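For reference, a minimal configuration of the netbox.netbox.nb_inventory plugin looks roughly like this (a sketch; my actual grouping and compose options are more elaborate):

```yaml
# netbox_inventory.yml (minimal sketch)
plugin: netbox.netbox.nb_inventory
api_endpoint: https://netbox.example.org
token: "{{ lookup('env', 'NETBOX_TOKEN') }}"
group_by:
  - device_roles
interfaces: false
```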

An example of how I set up my NetBox instance before having a fully developed inventory can be found here.

Semaphore

Ansible Semaphore operates as a blend of tools like Rundeck and AWX.

I chose this project for several reasons:

  • Simplicity: Installing this software does not require an OCI image or a Helm chart (although such options exist). A straightforward installation via apt is equally possible.
  • API: The project provides a comprehensive REST API, which I leverage for triggering jobs via webhooks, such as from NetBox.
  • Accessibility: A turnkey image is available for experimenting with Ansible Semaphore. I extensively used this option before deciding it was worth creating a dedicated Ansible role.

Dataflow (Ansible)

Since the entire setup is managed by Ansible, there is only one primary flow of data:

  • My Ansible Semaphore loads the inventory by scraping the NetBox API.
  • The loaded configuration is then applied to all Ansible-managed systems in my homelab (currently, this includes nearly all of them).
graph LR

inventory(Ansible Inventory)

semaphore[Ansible Semaphore]

subgraph Other managed Systems
vm1[VM 1]
vm2[VM 2]
lxc1(LXC 1)
lxc2(LXC 2)
pve[PVE]
other[...]
end

inventory ==triggers ansible run on inventory changes==> semaphore
semaphore -.scrapes inventory.-> inventory
semaphore --applies ansible config--> inventory
semaphore ---> vm1
semaphore ---> vm2
semaphore --applies ansible config--> lxc1
semaphore ---> lxc2
semaphore ---> pve
semaphore ---> other

Prometheus Configuration

This section won't delve into what Prometheus is, what it does, or why it operates the way it does.

Instead, I'd like to highlight that I use Prometheus to monitor my entire setup. Before integrating Ansible, this involved a significant amount of manual work, as I had to add each server to the Prometheus configuration manually. However, with Ansible and the Jinja2 templating language, it became relatively straightforward to configure Prometheus to scrape the appropriate servers dynamically.

My first iteration of the Prometheus role included a fairly basic template definition for the Prometheus configuration. If you take a quick look at it, you’ll notice that the template contains a specific paragraph for each exporter installed by other roles.

This design had a major drawback: adding the nginx-prometheus exporter to the nginx role required a corresponding change in the Prometheus role. Similarly, adding the pve-exporter to the pve role also necessitated a change in the Prometheus role. Each new exporter caused a similar ripple effect, indicating a clear code smell: a violation of the open-closed principle.

To align with this principle, the ideal prometheus.yml template should contain only a generic definition of how scrape jobs are structured. Specific details, such as the exporter ports, should not reside in the template or the Prometheus role itself. Instead, these details should be encapsulated within the role responsible for installing the exporter, and then imported or integrated dynamically.

Unfortunately, Ansible does not natively support this kind of modular configuration management (at least not to my knowledge).

I devised the following concept to improve the modularity and code quality of the Prometheus role:

  • Roles as hierarchical layers: Roles are treated as hierarchical layers of abstraction, where each role can have other roles as dependencies.
  • Dependency map: This hierarchy can be represented as a map, with each role being a key that maps to all roles it depends on.
  • Reversed dependency map: The dependency map can be reversed, such that each role maps to a list of all roles that depend on it.
  • Recursive resolution: The reversed dependency map can be resolved recursively, enabling each role to map not only to directly dependent ones but also to indirect dependencies.
  • Exporter configuration variables: Roles that configure a Prometheus exporter on a server can include the variables prometheus_role_exporter_path_<exporter_name> and prometheus_role_exporter_port_<exporter_name> in their defaults/main.yml, vars/main.yml, or group_vars/<role_name>.yml files (example).
  • Automation via custom module: A custom module, prometheus_dependency_map_info, is used to process these rules and generate a map like the following:
"dependency_map": {
    "alertmanager": {
        "dependent_roles": [
            "alertmanager",
            "prometheus"
        ],
        "exporter_path": false,
        "exporter_port": 9093
    },
    "autoup": {
        "dependent_roles": [
            "autoup"
        ],
        "exporter_path": false,
        "exporter_port": false
    },
    "nextcloud": {
        "dependent_roles": [
            "nextcloud"
        ],
        "exporter_path": false,
        "exporter_port": 9205
    },
    "nginx": {
        "dependent_roles": [
            "autoup",
            "nextcloud",
            "vaultwarden",
            "alertmanager",
            "prometheus",
            "semaphore",
            "grafana",
            "nginx",
            "netbox"
        ],
        "exporter_path": false,
        "exporter_port": 9113
    },
    "pve": {
        "dependent_roles": [
            "pve"
        ],
        "exporter_path": "/pve",
        "exporter_port": 9221
    },
    ...
}

Let me explain this "mysterious" map of roles and their Prometheus exporters:

| Role Name | Comment |
| --- | --- |
| `autoup` | The role itself does not directly add a Prometheus exporter to its server. Instead, it indirectly adds the exporter installed by the `nginx` role, which is one of its dependencies. It is listed here because it is part of the repository, but no `exporter_port` or `exporter_path` is defined. The only role that depends on `autoup` is `autoup` itself, and it is not a dependency of any other role. |
| `nextcloud` | This role adds its own Prometheus exporter to the server, exposed at `http://<servername>:9205/metrics`. It is listed here because its `defaults/main.yml` defines the variable `prometheus_role_exporter_port_nextcloud: 9205`. `nextcloud` is a high-level role that is not currently a dependency of any other role. |
| `alertmanager` | This role adds its Prometheus exporter to the server, exposed at `http://<servername>:9093/metrics`. It is listed here for the variable `prometheus_role_exporter_port_alertmanager: 9093` defined in its `defaults/main.yml`. The role itself is included as a dependency on servers configured with the `prometheus` role. |
| `pve` | This role installs the PVE exporter, which listens on `http://<servername>:9221/pve`. It defines both `prometheus_role_exporter_port_pve: 9221` and `prometheus_role_exporter_path_pve: /pve` in its `defaults/main.yml`. `pve` is currently a high-level role that is not included as a dependency in any other roles. |
| `nginx` | This role installs the nginx exporter, which listens on `http://<servername>:9113/metrics`. It defines `prometheus_role_exporter_port_nginx: 9113` in its `defaults/main.yml`. `nginx` is a lower-level, generic role included by several other roles, so the set of servers to scrape nginx metrics from corresponds to the union of the inventory groups identified by the `dependent_roles` list. |

The final version of the prometheus.yml template leverages this dataset, combined with the inventory groups, to determine the hosts to scrape for each Prometheus exporter.
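A hedged sketch of what such a generic template can look like, assuming the custom module's result has been registered and exposed as dependency_map (simplified: empty groups and duplicate hosts are not handled here, and the real template differs in detail):

```yaml
# templates/prometheus.yml.j2 (simplified sketch)
scrape_configs:
{% for role, cfg in dependency_map.items() %}
{% if cfg.exporter_port %}
  - job_name: "{{ role }}"
    metrics_path: "{{ cfg.exporter_path if cfg.exporter_path else '/metrics' }}"
    static_configs:
      - targets:
{% for dependent_role in cfg.dependent_roles %}
{% for host in groups[dependent_role] | default([]) %}
          - "{{ host }}:{{ cfg.exporter_port }}"
{% endfor %}
{% endfor %}
{% endif %}
{% endfor %}
```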

Notes
  • Adding exporters from unrelated roles (i.e., roles that neither directly nor indirectly include the foreign role) can be done as shown in this example. However, this approach should be used with caution to avoid unintended consequences.

  • When reviewing the prometheus.yml template, you may have noticed that the vanilla prometheus-node-exporter is handled differently. This is because I decided to scrape LXC metrics using a separate exporter job. The rationale is that the resource-related metrics provided by the default prometheus-node-exporter are tied to the underlying Proxmox (PVE) host, making them unusable for my purposes. However, other metrics, such as pending upgrades, remain highly relevant, so the scrape job for LXCs is still essential.

With all of this being implemented, any changes to another server in the inventory, such as adding a new role, necessitate a redeployment of the Prometheus configuration. To address this, I created a dedicated Ansible job specifically for this purpose. This job runs every time a change is made to the inventory, although a more granular approach could achieve the same effect.

Custom Prometheus Metrics

By default, the prometheus-node-exporter exports any *.prom file located in /var/lib/prometheus/node-exporter as a custom metric. This allows for tracking custom metrics, such as the timestamp of the last system update, via Prometheus. The greatest advantage of this feature, however, is realized in the implementation of autoup.
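As a sketch of how a role can drop such a metric (the metric name here is made up for illustration), an Ansible task simply writes the file into the textfile collector directory:

```yaml
- name: Record the timestamp of the last successful upgrade (illustrative metric)
  ansible.builtin.copy:
    dest: /var/lib/prometheus/node-exporter/last_upgrade.prom
    content: |
      # HELP custom_last_upgrade_timestamp_seconds Unix time of the last successful upgrade run
      # TYPE custom_last_upgrade_timestamp_seconds gauge
      custom_last_upgrade_timestamp_seconds {{ ansible_date_time.epoch }}
    owner: root
    group: root
    mode: "0644"
```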

Upgrade Jobs

As mentioned in the section above, the only difference between a configuration management job and an upgrade job is the additional -e upgrade=true parameter when executing generic_playbook.yml. As a result, upgrade jobs:

run the full configuration management logic

None of the "usual" tasks should be skipped because the job is flagged as an upgrade job. However, this does not imply that configuration management can be casually executed as an upgrade job, as these jobs expect a machine that is already set up.

apply system updates / system upgrades via the distribution's package manager

Typically, this corresponds to commands such as:

sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold"

Essentially, nothing overly complex or unique.

reboot the system to ensure kernel upgrades are applied instantly too

Some individuals strive to avoid clean reboots at all costs, utilizing technologies like live-patching of the Linux kernel. However, establishing a well-tested, planned, and stable routine of reboots often yields the same benefits as live-patching, with one key advantage: even those who generally avoid reboots will eventually have to perform one. On that day, I prefer to remain unfazed, as reboots are already a routine operation.
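The reboot itself is unspectacular in Ansible; a sketch (the timeout is a placeholder):

```yaml
- name: Reboot to activate the new kernel (sketch)
  ansible.builtin.reboot:
    reboot_timeout: 600
  when: upgrade | default(false) | bool
```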

update any software initially deployed via ansible but without using a package manager

A good example of this process can be found in scenarios like the installation of Prometheus Alertmanager from GitHub as a binary source.
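The pattern behind it, sketched here with placeholder paths and URLs (not my actual Alertmanager tasks; linking the binary and restarting the service are omitted): fetch the release archive only when the installed version differs from the desired one, so the task stays idempotent and an upgrade is just a version bump.

```yaml
- name: Check the installed Alertmanager version (illustrative)
  ansible.builtin.command: /usr/local/bin/alertmanager --version
  register: am_version
  changed_when: false
  failed_when: false

- name: Download and unpack the desired release (placeholder URL and path)
  ansible.builtin.unarchive:
    src: "https://github.com/prometheus/alertmanager/releases/download/v{{ alertmanager_version }}/alertmanager-{{ alertmanager_version }}.linux-amd64.tar.gz"
    dest: /opt
    remote_src: true
  when: alertmanager_version not in (am_version.stdout | default('') + am_version.stderr | default(''))
```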

take the server out of clusters to ensure everything works seamlessly

See the example for keepalived and for pve. The case of PVE is a bit special, however, since there is neither a real way nor a need to temporarily and gracefully remove nodes from the cluster. I'll explain more about this in the paragraph about the PVE role.

are supposed to fail in all circumstances that can reasonably be caused by the upgrade job

In such cases, the buggy state is openly communicated by the failed job, prompting manual intervention and debugging by me. This approach avoids "hidden bugs" lingering within my systems. It includes testing the servers for running services, though admittedly, my testing in this area is currently insufficient.
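A simple form of such a check, sketched with service_facts and an example service name:

```yaml
- name: Collect service states
  ansible.builtin.service_facts:

- name: Assert that the application service survived the upgrade (example service name)
  ansible.builtin.assert:
    that:
      - ansible_facts.services['nginx.service'].state == 'running'
    fail_msg: "nginx is not running after the upgrade"
```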

are expected to be non-destructive

This requirement probably needs some more explanation:

What are destructive actions in this context?

There are a few primary states in which a server might find itself during its lifecycle:

graph LR

created_inventory[Guest created in the inventory]
desired_state[Guest fully configured, serving requests]
deleted[Guest deleted]
failed{{FAILED: Guest configured with errors, needs human intervention}}

created_inventory --NetBox triggers generic_playbook.yml in Semaphore, guest gets installed and configured--> desired_state
desired_state --Guest fulfilled its purpose and gets deleted--> deleted
desired_state ==Upgrade job runs successfully==> desired_state
desired_state ==Upgrade job fails while changing things==> failed

The failed state is the one to avoid. Why? Upgrades are supposed to happen often but should not strike fear into the admin's heart. If there is a system to upgrade, the main question should be, "When can I afford a 10-minute downtime?" rather than, "What do I do in case it fails, and how much time should I reserve for that beforehand?"

To address this, backups are made of all guests before upgrading them and get restored in case anything fails during the procedure. This, of course, still requires debugging the system afterward but prevents extended downtime in the meantime.
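Conceptually this maps onto Ansible's block/rescue error handling. A stripped-down sketch (the included task files are placeholders for the actual PBS backup and restore logic):

```yaml
- name: Upgrade with automatic rollback (sketch)
  block:
    - name: Back up the guest to PBS before touching it
      ansible.builtin.include_tasks: backup_guest.yml   # placeholder task file

    - name: Apply roles, including upgrades
      ansible.builtin.include_role:
        name: "{{ device_role }}"
  rescue:
    - name: Restore the backup taken at the beginning
      ansible.builtin.include_tasks: restore_guest.yml  # placeholder task file
      when: not (prohibit_restore | default(false) | bool)

    - name: Still mark the run as failed so the problem stays visible
      ansible.builtin.fail:
        msg: "Upgrade failed; the guest was restored from the pre-upgrade backup"
```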

With all of these factors applied, a run of the Ansible-based upgrade job is already disproportionately more powerful than simply installing unattended-upgrades on each server. It provides better insight into the processes occurring, offers the admin greater control, and is more robust in handling complex procedures. This significant advantage serves as the foundation for leveraging the next layer of complexity integrated into the setup.
