Skip to content

Instantly share code, notes, and snippets.

@scyto
Last active April 3, 2025 19:30
Show Gist options
  • Save scyto/8c652f3eab61ed1fa2f980d02a484c35 to your computer and use it in GitHub Desktop.
Save scyto/8c652f3eab61ed1fa2f980d02a484c35 to your computer and use it in GitHub Desktop.
setting up the ceph cluster

CEPH HA Setup

Note this should only be done once you are sure you have reliable TB mesh network.

this is because proxmox UI seems fragile wrt to changing underlying network after configuration of ceph.

All installation done via command line due to gui not understanding the mesh network

This setup doesn't attempt to seperate the ceph public network and ceph cluster network (not same as proxmox clutser network), The goal is to get an easy working setup.

this gist is part of this series

Ceph Initial Install & monitor creation

  1. On all nodes execute the command pveceph install --repository no-subscription accept all the packages and install
  2. On node 1 execute the command pveceph init --network 10.0.0.81/24
  3. On node 1 execute the command pveceph mon create --mon-address 10.0.0.81
  4. On node 2 execute the command pveceph mon create --mon-address 10.0.0.82
  5. On node 3 execute the command pveceph mon create --mon-address 10.0.0.83

Now if you access the gui Datacenter > pve1 > ceph > monitor you should have 3 running monitors (ignore any errors on the root ceph UI leaf for now).

If so you can proceed to next step. If not you probably have something wrong in your network, check all settings.

Add Addtional managers

  1. On any node go to Datacenter > nodename > ceph > monitor and click create manager in the manager section.
  2. Selecty an node that doesn't have a manager from the drop dwon and click create 3 repeat step 2 as needed If this fails it probably means your networking is not working

Add OSDs

  1. On any node go to Datacenter > nodename > ceph > OSD
  2. click create OSDselect all the defaults (again this for a simple setup)
  3. repeat untill you have 3 nodes like this (note it can take 30 seconds for a new OSD to go green) image

If you find there are no availale disks when you try to add it probably means your dedicated nvme/ssd has some other filesystem or old osd on it. To wipe the disk use the following UI. Becareful not to wipe your OS disk. image

Create Pool

  1. On any node go to Datacenter > nodename > ceph > pools and click create
  2. name the volume, e.g. vm-disks and leave defaults as is and click create

Configure HA

  1. On any node go to Datacenter > options
  2. Set Cluster Resource Scheduling to ha-rebalance-on-start=1 (this will rebalance nodes as needed)
  3. Set HA Settings to shutdown_policy=migrate (this will migrate VMs and CTs if you gracefully shutdown a node).
  4. Set migration settings leave as default (seperate gist will talk about seperating migration network later)
@taslabs-net
Copy link

@mrkhachaturov just the standard 3. Ordered a hub to move it to the 6 nodes I have.

@cjboyce
Copy link

cjboyce commented Mar 17, 2025

Impressive docs. May I ask about OSDs... the docs say a minimum of 6 OSDs (something like, 12 recommended) for ceph. It looks like you're getting by with 3 total (one per node). My 3 nodes have only two SATA bays and one NVMe slot and I can't decide what SSDs to buy & dedicate to OS, VMs, etc. I assumed ceph was out for me. But your VMs perform well under 3 OSDs? Thanks!

@turdf
Copy link

turdf commented Apr 3, 2025

@scyto Thank you so much for this guide. I have my MS-01s set up in a three node cluster w/ thunderbolt private ring network. I can't, for the life of me, get ceph to actually use the private thunderbolt mesh though... it always breaks.

The default ceph.conf appears as follows:

[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.1.1.11/24
        fsid = 43d49bb4-1abe-4479-9bbd-a647e6f3ef4b
        mon_allow_pool_delete = true
        mon_host = 10.1.1.11 10.1.1.12 10.1.1.13
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.1.1.11/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.pve01]
        public_addr = 10.1.1.11

[mon.pve02]
        public_addr = 10.1.1.12

[mon.pve03]
        public_addr = 10.1.1.13

In this configuration, everything "works," but I assume ceph is passing traffic over the public nework as there is nothing in the configuration file to reference the private network. https://imgur.com/a/9EjdOTa

The private ring network does function, and proxmox already has it set for migration purposes. Each host is addressed as so:

PVE01
private address: fc00::81/128
public address: 10.1.1.11

  • THUNDERBOLT PORTS
    left = 0000:00:0d.3
    right = 0000:00:0d.2

PVE02
private address fc00::82/128
public address 10.1.1.12

  • THUNDERBOLT PORTS
    left = 0000:00:0d.3
    right = 0000:00:0d.2

PVE03
private address: fc00::83/128
public address 10.1.1.13
THUNDERBOLT PORTS
left = 0000:00:0d.3
right = 0000:00:0d.2

Iperf3 between pve01 and pve02 demonstrates that the private ring network is active and addresses properly: https://imgur.com/a/19hLcNb

My novice gut tells me that, if I make the following modifications to the config file, the private network will be used.

[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = fc00::/128 #I've also tried /64
        fsid = 43d49bb4-1abe-4479-9bbd-a647e6f3ef4b
        mon_allow_pool_delete = true
        mon_host = 10.1.1.11 10.1.1.12 10.1.1.13
        ms_bind_ipv4 = true
        ms_bind_ipv6 = true
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.1.1.11/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.pve01]
        public_addr = 10.1.1.11
        cluster_addr = fc00::81

[mon.pve02]
        public_addr = 10.1.1.12
        cluster_addr = fc00::82

[mon.pve03]
        public_addr = 10.1.1.13
        cluster_addr = fc00::83

This, however, results in unknown status of PGs (and storage capacity going from 5.xx TiB to 0). My hair is starting to come out trying to troubleshoot this, could you potentially offer some advice?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment