Drallas/Docker Swarm in LXC Containers.md

Last active April 21, 2025 08:05


Docker Swarm in LXC Containers

Part of collection: Hyper-converged Homelab with Proxmox

Update 09-10-2024: I have moved on to a setup with Docker LXC on Proxmox, with bind-mounts (without VirtioFS) as described in Gist Docker Swarm in Vm's with CephFS.

Update 07-03-2025: The Dockge LXC does this even better, eliminating most issues with the [Docker LXC](https://community-scripts.github.io/ProxmoxVE/scripts?id=docker)!

After struggling for some days, and since I really needed this to work (ignoring the "it can't be done" vibe everywhere), I managed to get Docker to work reliably in privileged Debian 12 LXC containers on Proxmox 8.

(Unfortunately, I couldn't get anything to work in unprivileged LXC Containers)

There are NO modifications required on the Proxmox host or in the /etc/pve/lxc/xxx.conf file; everything is done on the Docker Swarm host. So the only obvious candidate that could break this setup is a future Docker Engine update!

Host Setup

My hosts are Debian 12 LXC containers, installed via tteck's Proxmox VE Helper Scripts.

Install the LXC via the Proxmox VE Helper Script

bash -c "$(wget -qLO - https://github.com/tteck/Proxmox/raw/main/ct/debian.sh)"

Backing filesystems

docker info shows I'm using overlay2, which is the recommended storage driver for Debian. This storage driver requires XFS or ext4 as the backing filesystem.

docker info | grep -A 7 "Storage Driver:"

 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd

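To check this from a script rather than by eye, the driver line can be pulled out of the docker info output. This is only a minimal sketch: the function names `storage_driver_from_info` and `uses_overlay2` are made up for this example, and the parsing assumes the text layout shown above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Docker): read `docker info` text on stdin
# and print the value of the "Storage Driver:" line.
storage_driver_from_info() {
    # Split on ": " and print the second field of the first matching line.
    awk -F': *' '/^[[:space:]]*Storage Driver:/ { print $2; exit }'
}

# Hypothetical helper: succeed only if the driver reported on stdin is overlay2.
uses_overlay2() {
    [ "$(storage_driver_from_info)" = "overlay2" ]
}
```

It could be used as `docker info | uses_overlay2 && echo "overlay2 in use"`.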
Set userns-remap

As Neuer_User pointed out, running the Docker containers unprivileged on a privileged LXC seems the best compromise to run the containers in a relatively secure way.

To do so, add a daemon.json on the Docker Servers that are part of the Swarm.

mkdir -p /etc/docker
nano /etc/docker/daemon.json

{
  "userns-remap": "root"
}

Then reboot the Docker host.

(With this setting, Docker keeps its data in /var/lib/docker/0.0/ instead of directly below /var/lib/docker/, so existing workloads disappear. Hence, this step belongs before the Docker installation!)
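For anyone automating this step instead of using nano, the same file can be written from a script. The sketch below is an assumption-laden example, not part of the original setup: `write_userns_remap` is a made-up name, and the refusal to overwrite an existing daemon.json is my own precaution.

```shell
#!/bin/bash
# Hypothetical helper: write the userns-remap setting into <dir>/daemon.json.
# The directory defaults to /etc/docker; pass another one to try it out safely.
write_userns_remap() {
    local dir="${1:-/etc/docker}"
    mkdir -p "$dir" || return 1
    # Refuse to overwrite an existing file, which may hold other settings.
    if [ -e "$dir/daemon.json" ]; then
        echo "refusing to overwrite $dir/daemon.json" >&2
        return 1
    fi
    cat > "$dir/daemon.json" <<'EOF'
{
  "userns-remap": "root"
}
EOF
}
```

Calling `write_userns_remap` without arguments targets /etc/docker; calling it with a scratch directory first is a harmless way to inspect the result.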

Install Docker

The get-docker.sh script is the most convenient way to quickly install the latest Docker-CE release!

curl -fsSL https://get.docker.com -o get-docker.sh
chmod +x get-docker.sh
./get-docker.sh

Create / Join the Docker Swarm

Without this step, the next step(s) fail!

# Manager Node
docker swarm init

# Add Node
docker swarm join --token <some-very-long-token> <manager-ip>:2377

# Display Join token again
docker swarm join-token worker
docker swarm join-token manager
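Since the following steps depend on the node being part of a Swarm, a script can confirm that before continuing. A minimal sketch, assuming the `docker` CLI is on the PATH; the function name `swarm_is_active` is hypothetical, while `.Swarm.LocalNodeState` is a field of the docker info output.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if this node is an active Swarm member.
swarm_is_active() {
    local state
    # Prints e.g. "active" or "inactive"; fail if the docker call itself fails.
    state=$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null) || return 1
    [ "$state" = "active" ]
}
```

For example: `swarm_is_active || echo "run docker swarm init or docker swarm join first"`.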

Add ipv4 for Ingress_sbox

For Docker in LXC to work, the only thing needed is to execute:

nsenter --net=/run/docker/netns/ingress_sbox sysctl -w net.ipv4.ip_forward=1

on the Docker Swarm Servers

Make it permanent

This doesn't survive reboots, so I created a oneshot systemd service for it, to make sure the setting is applied after each reboot.

Create a Bash Script

First, we need a Bash script to be executed by the service.

nano /usr/local/bin/ipforward.sh

#!/bin/bash
nsenter --net=/run/docker/netns/ingress_sbox sysctl -w net.ipv4.ip_forward=1

Make it executable

chmod +x /usr/local/bin/ipforward.sh

Create a Systemd Service

This service is of type oneshot. During startup it waits for docker.service to be started, and then another 10 seconds for run-docker-netns-ingress_sbox.mount to be loaded. Only after that can net.ipv4.ip_forward=1 be applied.

nano /etc/systemd/system/ingress-sbox-ipforward.service

[Unit]
Description = Set net.ipv4.ip_forward for ingress_sbox namespace
After = docker.service
Wants = docker.service

[Service]
Type = oneshot
RemainAfterExit = yes
ExecStartPre = /bin/sleep 10
ExecStart = /usr/local/bin/ipforward.sh

[Install]
WantedBy = multi-user.target
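The fixed 10-second sleep is a guess at how long the namespace takes to appear. One possible alternative, which I have not tested in this setup, is to poll for /run/docker/netns/ingress_sbox inside the script and give up after a timeout. The helper name `wait_for_path` and the 30-second default are assumptions made for this sketch.

```shell
#!/bin/bash
# Hypothetical helper: wait until a path exists, checking once per second.
# Usage: wait_for_path <path> [timeout_seconds]; returns 1 on timeout.
wait_for_path() {
    local path="$1" timeout="${2:-30}" waited=0
    while [ ! -e "$path" ]; do
        # Stop waiting once the timeout has been reached.
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

In ipforward.sh this could look like `wait_for_path /run/docker/netns/ingress_sbox 30 && nsenter --net=/run/docker/netns/ingress_sbox sysctl -w net.ipv4.ip_forward=1`, so the script fails visibly in systemctl status when the namespace never shows up.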

Start the service and check if it's healthy

systemctl daemon-reload
systemctl enable ingress-sbox-ipforward.service
systemctl start ingress-sbox-ipforward.service
systemctl status ingress-sbox-ipforward.service

Final Checks

Without ipv4.ip_forward set to 1, the Ingress Networking to the Docker Swarm is not active. So it's important to verify if the value is applied successfully.

Manual check if ipv4.ip_forward is set to 1

systemctl status ingress-sbox-ipforward.service | grep ipforward.sh

# Or in a script via:

current_value=$(nsenter --net=/run/docker/netns/ingress_sbox sysctl -n net.ipv4.ip_forward)
echo "$current_value"

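For monitoring or a health check, the same lookup can be wrapped so that it returns an exit code instead of printing the value. A minimal sketch; `ingress_forwarding_enabled` is a made-up name, and the nsenter call is the one used above.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if ip_forward is 1 inside the
# ingress_sbox network namespace.
ingress_forwarding_enabled() {
    local value
    # Fail if the namespace is missing or nsenter/sysctl returns an error.
    value=$(nsenter --net=/run/docker/netns/ingress_sbox sysctl -n net.ipv4.ip_forward 2>/dev/null) || return 1
    [ "$value" = "1" ]
}
```

For example: `ingress_forwarding_enabled || echo "ingress forwarding is NOT enabled" >&2`.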
(Now, Docker in LXC seems to behave as Docker in a VM.)

Issues

Service in docker-compose resolved wrong ip

To fix this, add a hostname entry for each Swarm service. To make it more logical, I also prefix the service names with service_.

services:
  service_nginx: # Prefix service_
    image: nginx
    hostname: nginx


00Asgaroth00 commented May 17, 2024

Hi, yes the cephfs is mounted across all swarm nodes, the mount is defined in lxc conf file as follows:

mp0: /mnt/pve/cephfs/swarm_data,mp=/data,shared=1

Where the pve mount for cephfs is /mnt/pve/cephfs, the "swarm_data" is a directory under that mount point on the pve host itself.

I can see the data on all lxc nodes and I can "cat" text files on all nodes and the data appears correct.
The portainer.db file is a boltdb data file so i cannot easily see its data to see where it is going wrong :/

Author

Drallas commented May 17, 2024

Not sure, I'm not using this anymore. I will see if I have a backup I can restore to test.

00Asgaroth00 commented May 17, 2024 • edited

Did you end up moving over to VMs with the virtiofs option (I see the heading there on your main page)? I may switch to that if it has fewer headaches than running swarm in lxc. I find creating lxc containers much simpler than creating qemu vms (I currently use ansible to automate the lot for lxc); if I switch to vms I'll need to hook in packer to create a template first before cloning the vms. Anyhow, I might switch to vms and look into virtiofs for the cephfs shared filesystem; this is where I hoped the bind mounts for lxc would have sufficed...

no need to do a restore, thanks for commenting though!

EDIT:

For reference, this is the error message I get from the portainer app when the app fails over to another node while testing:

[root@swarm-manager-01 portainer]# docker logs abfb8f866522
2024/05/17 04:05PM INF github.com/portainer/portainer/api/cmd/portainer/main.go:369 > encryption key file not present | filename=portainer
2024/05/17 04:05PM INF github.com/portainer/portainer/api/cmd/portainer/main.go:392 > proceeding without encryption key |
2024/05/17 04:05PM INF github.com/portainer/portainer/api/database/boltdb/db.go:125 > loading PortainerDB | filename=portainer.db
2024/05/17 04:05PM FTL github.com/portainer/portainer/api/cmd/portainer/main.go:98 > failed opening store | error="invalid database"

Author

Drallas commented May 17, 2024 • edited

I did run into this with Portainer and it might happen on VM’s too, need to check my documentation for details.

Do other containers persist data correctly when you move them over to another node?

00Asgaroth00 commented May 17, 2024 • edited

I've not tested any other container to be honest, i might try something small like adguard or something like that just to see if it exhibits the same issue.

EDIT:

Just tried it out with adguardhome, and it looks like I have the same issue there as well. First start is okay, but as soon as I drain the node to force a failover, the container fails to read the data on startup on the new node. I am however able to see the text files' contents on all nodes and I can create new files on each of the nodes.

Adguardhome's specific error message:

[root@swarm-worker-02 ~]# docker logs 591ded60ee2b
2024/05/17 18:21:55.592321 [info] AdGuard Home, version v0.107.48
2024/05/17 18:21:55.593258 [info] tls: using default ciphers
2024/05/17 18:21:55.594867 [info] safesearch default: reset 253 rules
2024/05/17 18:21:55.693092 [info] Initializing auth module: /opt/adguardhome/work/data/sessions.db
2024/05/17 18:21:55.700098 [error] auth: open DB: /opt/adguardhome/work/data/sessions.db: invalid database
2024/05/17 18:21:55.700108 [fatal] initializing auth module failed

Author

Drallas commented May 24, 2024

Perhaps AdGuard isn’t closing / shutting the DB in a clean state..

00Asgaroth00 commented May 24, 2024 • edited

I'm not sure, it seems to happen with both portainer and adguard. Both databases get corrupted when I test a "failover", i.e. drain the node the containers are running on and wait for them to be rescheduled elsewhere.

virtiofsd debug logs don't seem to indicate any issues either :(

jimbothigpen commented May 28, 2024

@00Asgaroth00 : Have you made any progress debugging this issue? I'm bumping up against the same problem. Privileged LXCs 3 node swarm, Portainer works after the service first places and starts the container, but when the service restarts the container on another node, I have the same errors you're seeing. Portainer's /data directory is a bind mount to a cephfs directory, readable and writable by all swarm members.

Author

Drallas commented May 28, 2024 • edited

@jimbothigpen How did you install Portainer and the agent?

I ran into similar issues, but didn’t document it at the time.

All I remember is that following this guide helped me.

00Asgaroth00 commented May 28, 2024

Hi, no, I did not make any progress with this. Using lxc with bind mounts on the cephfs directory results in an invalid database when testing failover; it's as if file state is not sync'd in time before failover completes on the secondary node.

That is exactly the guide I followed to start portainer in both lxc's and vm's.

With vm's and using virtiofs I can actually remove the files and do an ls on the remaining nodes and the files still show up implying that [i|d]node entries are not synced between instances. I'm still trying different parameters on virtiofsd, for example cache=never|none to see if i can force it to re-read directly from file system, but i've had no luck with it so far. at this point i'm starting to consider older tech like glusterfs with gfs2/ocfs2 filesystems for this. although, knowing that cephfs is available is messing with my ocd, i want to use that mount

jimbothigpen commented May 28, 2024

Portainer & portainer agent were both installed using the same stack file you pointed to, just changed the /data volume to a bind mount aimed at the proper directory on the shared cephfs mount.

Frustrating thing is that I know this worked in the recent past. I've had this setup running for a while -- docker swarm in privileged lxc with a cephfs mount for persistent container data. Portainer was happily chugging along for the better part of a year, with dozens of host restarts and service relocations just working as expected. At some point in the last 6 weeks I noticed the portainer service failing (not 100% certain when it stopped working, as my attention has been elsewhere, and hadn't actually tried to log on to the portainer interface for a while). Seems entirely unrelated to the docker or portainer versions (I've tried multiple versions of docker and portainer recently, trying to get it to work as expected).

I also tried removing the mount point from the LXC and installing the ceph client inside the container and mounting via fstab. Same behavior.

Gave up hope this morning. Since I already have an NFS server exporting a couple of the cephfs directories, I just used that -- mounted the same directory via NFS on the docker hosts. Portainer now behaves as expected -- service is able to move to any host w/out complaint.

But yeah -- the added (admittedly minor) complexity of using NFS mounts inside the docker containers instead of a cephfs mount on the LXC makes my eye twitch a bit.

00Asgaroth00 commented Jun 5, 2024

I just mounted the cephfs filesystem using the ceph client within the virtual machine's fstab and portainer/adguard are working properly now. I did not try this in an lxc container though. I had to create a local bridge on each hypervisor and nat out traffic over the point-to-point link to get the virtual machine running on host 1 to communicate with monitors on host 2 and 3, but it is working away nicely now.

tulonbaar commented Jan 6, 2025 • edited

I lost ~3h debugging my network and wasn't able to figure out why no external connection was reaching my swarm, especially since I don't remember ever having to redirect connections locally on other instances.
Using your nsenter method I was finally able to start that swarm and connect it to other environments.

Thanks!

akhdanfadh commented Jan 8, 2025 • edited

Thank you very much for writing this guide! I found it when I tried to use docker secrets in portainer, but it turns out they are only supported in a docker swarm environment.

I am trying to understand the user mapping. I don't think simply creating the daemon.json file is enough. See the Docker docs on namespaces:

Some distributions do not automatically add the new group to the /etc/subuid and /etc/subgid files. If that's the case, you may have to manually edit these files and assign non-overlapping ranges. This step is covered in Prerequisites.

And this is the case for me on a fresh privileged Debian LXC. I don't see any mapping in either /etc/subuid or /etc/subgid after following your whole guide; they are just empty. I think this is crucial if you are concerned about unprivileged/privileged in the first place.

So my questions are:

1. Are you aware of this? If so, is this intended (no mapping in those two files)?
2. Will existing containers break if I add those two files and the mapping inside?
3. This question is specific. Basically, if I want to use GPU transcoding in an unprivileged LXC, I can trick the mapping in the LXC configuration file (discussed in many guides, this for example). Then the question: even if I deploy a privileged LXC and use this namespace mapping for Docker inside it, the GPU can't be used straight away, right?

curtistinkers commented Mar 5, 2025 • edited

This is a great write-up, thank you! I was trying to figure this out for days.

The systemd service has to be restarted if the docker service is restarted. You can use the PartOf directive to trigger that automatically. Additionally, ~~you can removed delay and instead just~~ I made the service file require run-docker-netns-ingress_sbox.mount. I also don't understand why a separate file is needed just to run the required command?

Below is the version of the systemd unit I'm using that I've adapted from yours:

[Unit]
Description = Set net.ipv4.ip_forward for ingress_sbox namespace
After = docker.service run-docker-netns-ingress_sbox.mount
Requires = run-docker-netns-ingress_sbox.mount
PartOf = docker.service run-docker-netns-ingress_sbox.mount

[Service]
Type = oneshot
RemainAfterExit = yes
ExecStartPre = /bin/sleep 10
ExecStart = nsenter --net=/run/docker/netns/ingress_sbox sysctl -w net.ipv4.ip_forward=1

[Install]
WantedBy = multi-user.target

Edit: Strangely, my swarm manager worked without ExecStartPre = /bin/sleep 10 but the worker doesn't.

Author

Drallas commented Mar 7, 2025

Thank you! I'm currently not using Docker in LXC anymore, but it's a great tip for those who will.
