
Docker Swarm in VMs with CephFS

Part of collection: Hyper-converged Homelab with Proxmox

One of the objectives of building my Proxmox HA Cluster was to store persistent Docker volume data inside CephFS folders.

There are many different options to achieve this: Docker Swarm in LXC using bind mounts, or third-party Docker volume plugins that are hard to use and often outdated.

Another option for Docker volumes was running GlusterFS, storing its disks on local NVMe storage and not using CephFS. Although appealing, it adds complexity and unnecessary resource consumption, while I already have a highly available file system (CephFS) running!

Evaluating all the available options, it became clear to me that Docker already has everything on board that I need! With VirtioFS I already mount CephFS volumes in all my Docker Swarm VMs.

mnt_pve_cephfs_docker 9.1T 198G 9.0T 3% /srv/cephfs-mounts/docker
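For reference, the VirtioFS share is mounted inside each VM roughly like this (a minimal sketch, assuming the share is exported to the VM with the tag mnt_pve_cephfs_docker; the actual setup is covered in the linked collection):

mount -t virtiofs mnt_pve_cephfs_docker /srv/cephfs-mounts/docker

Or persistently via /etc/fstab:

mnt_pve_cephfs_docker /srv/cephfs-mounts/docker virtiofs defaults 0 0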

I just needed something to connect 'plain' Docker Volumes to the VirtioFS CephFS mounts on my systems.

Fortunately it's possible, using the driver: local option, to have Docker redirect the data to a CephFS folder when it creates a volume, instead of storing it on the local filesystem. So effectively it's a local volume that points to a VirtioFS mount in the VM, which in turn is connected to CephFS.
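Under the hood this is roughly equivalent to a plain bind mount: the local driver mounts the given device onto the volume's mountpoint inside /var/lib/docker. A hypothetical illustration of what Docker ends up doing for a volume named <volumename>:

mount -o bind /srv/cephfs-mounts/docker/<volumename> /var/lib/docker/volumes/<volumename>/_data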


Preparation

Regardless of whether Docker Swarm or single-host mode is used, it's necessary to manually create the folder for the Docker volume on CephFS.

Create volume folder

mkdir /srv/cephfs-mounts/docker/<volumename>
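It doesn't hurt to verify that this path really sits on the VirtioFS mount and not on the VM's local root filesystem, for example:

df -h /srv/cephfs-mounts/docker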

Docker Swarm Mode

In this setup, the container can freely move to other Swarm nodes. Wherever it 'lands', it either creates a local Docker volume or reuses an existing one pointing to the VirtioFS mount. On every host, the volume has access to the same folder /srv/cephfs-mounts/docker/<volumename>.

version: '3.8'

services:
  web:
    hostname: <hostname>
    image: nginx
    volumes:
      - <volumename>:/var/www/html
volumes:
  <volumename>:
    name: <volumename> # Control the name of the volume
    driver: local
    driver_opts:
      o: bind
      device: /srv/cephfs-mounts/docker/<volumename>
      type: none
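The stack can then be deployed as usual (the stack name is just a placeholder):

docker stack deploy -c docker-compose.yml <stackname>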

Docker Single Host

When using just one Docker host, or for testing purposes, this method is more straightforward.

Create volume

Create the volume beforehand.

docker volume create \
  --driver local \
  --opt type=none \
  --opt o=bind \
  --opt device=/srv/cephfs-mounts/docker/<volumename> \
  <volumename>
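A quick way to confirm the volume really writes to CephFS is a throwaway container (the test file name is arbitrary):

docker run --rm -v <volumename>:/data alpine sh -c 'echo hello > /data/test.txt'
ls -l /srv/cephfs-mounts/docker/<volumename>/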

Docker Compose

Now the volume can be used in Docker Compose via an external reference.

version: '3.8'

services:
  web:
    hostname: <hostname>
    image: nginx
    volumes:
      - <volumename>:/var/www/html
volumes:
  <volumename>:
    external: true
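Bring it up as usual:

docker compose up -d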

Verification

Inspect the volume

root@dswarm01:~# docker volume inspect root_web_data
[
    {
        "CreatedAt": "2023-10-05T11:21:59+02:00",
        "Driver": "local",
        "Labels": {
            "com.docker.compose.project": "root",
            "com.docker.compose.version": "2.21.0",
            "com.docker.compose.volume": "web_data"
        },
        "Mountpoint": "/var/lib/docker/volumes/root_web_data/_data",
        "Name": "root_web_data",
        "Options": {
            "device": "/srv/cephfs-mounts/docker/web_data",
            "o": "bind",
            "type": "none"
        },
        "Scope": "local"
    }
]

Docker Exec

Inside the container, the CephFS mount mnt_pve_cephfs_docker is connected at /var/www/html.
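The shell below was opened with docker exec, using the container ID from docker ps:

docker exec -it 4e4aa5c02bc8 bash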

root@4e4aa5c02bc8:/# df -h
Filesystem             Size  Used Avail Use% Mounted on
overlay                 20G  2.6G   16G  14% /
tmpfs                   64M     0   64M   0% /dev
shm                     64M     0   64M   0% /dev/shm
/dev/sda1               20G  2.6G   16G  14% /etc/hosts
tmpfs                  3.9G     0  3.9G   0% /proc/acpi
tmpfs                  3.9G     0  3.9G   0% /sys/firmware
mnt_pve_cephfs_docker  9.1T  197G  9.0T   3% /var/www/html

Drallas commented Oct 5, 2023

@scyto This Gist is still a bit of a work in progress, but VirtioFS and this approach is the easiest way to consume CephFS storage for persistent Docker volumes. This is the last 'nut I had to crack' before moving the last containers over from my Qnap NAS. 🥲

PS: This seems more stable to me than GlusterFS with outdated Docker plugins ('no pun intended'), and it's completely plugin-free; VirtioFS is also stable if you understand how to use it!


scyto commented Oct 6, 2023

My GlusterFS is rock solid - remember it actually uses whatever Gluster client / kernel driver you have installed; the plugin is just an orchestrator, not what actually talks to the filesystem. That said, if you have CephFS I would expect you to use that!!

And yes, I had assumed a bind mount would be used this way if you had CephFS via VirtioFS - not sure why you are using the long syntax though?

My plan is to try the CephFS plugins first, for no other reason than that they are there and it will be fun to do while I wait for formal VirtioFS support - I am resistant to all that scripting.... and my Gluster was already working, so there is no compelling event for me to move.

If I was doing it from scratch I would do it your way ;-)

I actually still don't understand why you went with erasure-coded pools vs a replicated pool?


Drallas commented Oct 6, 2023

I'm using the 'long syntax' YAML because that way all the logic to run a container is inside the docker-compose.yml files.

The Docker Swarm host will be pinned to a Proxmox host and won't migrate; only the Docker containers will move.

After I create the folder for a Docker stack container on CephFS, the YAML does the rest, without needing any further intervention.

Pools

Regarding 'replicated' pools: my systems have three SSDs each:

  • WD-Blue NVMe drives
    These are local drives without Ceph pools, holding Proxmox and pinned VMs like the Docker Swarm nodes. They don't need to migrate, because the Docker containers move!

  • Samsung 980 Pro NVMe drives
    These have a single replicated Ceph pool holding my VMs and LXCs. This is set up for speed and redundancy. Each Proxmox host has a full copy of all VMs and LXCs!

  • WD-Green SSDs
    These have various erasure-coded pools, set up to maximize the amount of usable storage. I have multiple backups (SSD, Backblaze) of this data, so it's no issue if I lose it! (A rough sketch of both pool types is below.)
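For illustration, creating the two pool types looks roughly like this (pool names, PG counts and the k/m values are made-up examples, not my actual settings):

# Replicated pool: 3 full copies of each object, one per host
ceph osd pool create vm-pool 128 128 replicated
ceph osd pool set vm-pool size 3

# Erasure-coded pool: k=2 data + m=1 parity chunks per object
ceph osd erasure-code-profile set ec-21 k=2 m=1 crush-failure-domain=host
ceph osd pool create bulk-pool 128 128 erasure ec-21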

