
@scyto
Last active February 28, 2025 06:31
setting up the ceph cluster

CEPH HA Setup

Note: this should only be done once you are sure you have a reliable Thunderbolt (TB) mesh network.

This is because the Proxmox UI seems fragile with respect to changing the underlying network after Ceph has been configured.

All installation is done via the command line, because the GUI does not understand the mesh network.

This setup doesn't attempt to separate the Ceph public network and the Ceph cluster network (not the same as the Proxmox cluster network); the goal is an easy working setup.

This gist is part of this series.

Ceph Initial Install & monitor creation

  1. On all nodes, execute pveceph install --repository no-subscription and accept all the packages for installation.
  2. On node 1, execute pveceph init --network 10.0.0.81/24
  3. On node 1, execute pveceph mon create --mon-address 10.0.0.81
  4. On node 2, execute pveceph mon create --mon-address 10.0.0.82
  5. On node 3, execute pveceph mon create --mon-address 10.0.0.83
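For reference, the steps above as a single session (the 10.0.0.8x addresses are the mesh IPs used throughout this series; substitute your own):

```shell
# On every node: install the Ceph packages, accepting all prompts
pveceph install --repository no-subscription

# On node 1: initialise Ceph on the mesh network and create the first monitor
pveceph init --network 10.0.0.81/24
pveceph mon create --mon-address 10.0.0.81

# On node 2:
pveceph mon create --mon-address 10.0.0.82

# On node 3:
pveceph mon create --mon-address 10.0.0.83
```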

Now if you access the GUI at Datacenter > pve1 > Ceph > Monitor you should see 3 running monitors (ignore any errors on the root Ceph UI leaf for now).

If so, you can proceed to the next step. If not, you probably have something wrong in your network; check all settings.

Add Additional Managers

  1. On any node go to Datacenter > nodename > Ceph > Monitor and click Create in the Manager section.
  2. Select a node that doesn't yet have a manager from the drop-down and click Create.
  3. Repeat step 2 as needed. If this fails, it probably means your networking is not working.

Add OSDs

  1. On any node go to Datacenter > nodename > Ceph > OSD.
  2. Click Create OSD and select all the defaults (again, this is for a simple setup).
  3. Repeat until you have 3 nodes like this (note it can take ~30 seconds for a new OSD to go green). image
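If you prefer the shell, a sketch of the equivalent, assuming the dedicated disk is /dev/nvme0n1 (confirm the device with lsblk first):

```shell
# Identify the spare NVMe/SSD before touching anything
lsblk

# Create an OSD on the dedicated disk with default settings (run on each node)
pveceph osd create /dev/nvme0n1
```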

If you find there are no available disks when you try to add one, it probably means your dedicated NVMe/SSD has some other filesystem or an old OSD on it. To wipe the disk, use the following UI. Be careful not to wipe your OS disk. image
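The wipe can also be done from the shell; /dev/nvme0n1 is an assumed device name here, and this command is destructive, so double-check the device first:

```shell
# Remove any old filesystem or OSD remnants so the disk shows as available
ceph-volume lvm zap /dev/nvme0n1 --destroy
```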

Create Pool

  1. On any node go to Datacenter > nodename > Ceph > Pools and click Create.
  2. Name the pool, e.g. vm-disks, leave the defaults as-is, and click Create.
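A CLI sketch of the same step; pveceph pool create is the command-line counterpart of the UI, and --add_storages also registers the pool as a Proxmox storage, as the UI does by default:

```shell
# Create the RBD pool with default size/min_size and register it as storage
pveceph pool create vm-disks --add_storages
```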

Configure HA

  1. On any node go to Datacenter > Options.
  2. Set Cluster Resource Scheduling to ha-rebalance-on-start=1 (this will rebalance nodes as needed).
  3. Set HA Settings to shutdown_policy=migrate (this will migrate VMs and CTs when you gracefully shut down a node).
  4. Leave Migration Settings at the default (a separate gist will cover separating the migration network later).
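If you prefer editing files, these options land in /etc/pve/datacenter.cfg; a fragment reflecting the settings above, as I understand the format (verify against your own file before relying on it):

```
crs: ha-rebalance-on-start=1
ha: shutdown_policy=migrate
```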
@IndianaJoe1216

I am looking for guidance on configuring Thunderbolt networking to access a Ceph cluster from a virtual machine. My goal is to utilize the Ceph Container Storage Interface (CSI) in a Kubernetes environment running on my Proxmox cluster.
In the Thunderbolt networking configuration, we are defining IP addresses on the loopback interfaces for en05 and en06. Should I create a bridge on these interfaces and attach it to the virtual machine?
Any recommendations would be greatly appreciated.

I am in the same situation; I'd like to allow some of my VMs to access CephFS for persistent storage in a K8s cluster. I found some threads about doing this, but they all seem to use https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server#RSTP_Loop_Setup I'd prefer not to migrate from FRR to RSTP if possible, so has someone found a way of doing that with FRR? Or a way to migrate from FRR to RSTP without breaking the cluster, Ceph...? Any hint would be appreciated :)

I am in the exact same boat and would like some assistance here as well!

@yet-an-other

Yeah, same here.
I have the same problem. I tried multiple approaches, and one actually works, but I'm not sure it's the best solution. But it works. :)

Disclaimer: Network engineering is not my area of expertise, so everything I'm writing here might be completely off-base.
Note: My setup is a bit different - I have two nodes and one QDevice. This shouldn't affect the networking, but it's worth mentioning, just in case.

At first, my uneducated guess was that if we're using FRR to set up a mesh between nodes, then we need to use FRR on the VM to include it in the mesh. However, as mentioned, I'm not a network engineer and have zero knowledge of how FRR actually works, so I gave up as soon as my first straightforward and naive attempt failed.

The second approach was to use VXLAN. VXLAN allows you to create a virtual network within a cluster, and it uses a mesh network to establish communication between VMs on different nodes. The nodes themselves can also be part of the network. It's quite easy to install, and from the VM or node perspective, it looks like just another bridge or NIC. The network itself works perfectly (though the throughput dropped to 10Gb on average from 26Gb, and lots of retries appeared, but I didn't dig deeper). Unfortunately, I was unable to use this network as a public network for Ceph and enforce monitors to listen on IPs from it. As soon as I changed the config, the cluster went down.

The final attempt, which actually works but looks ugly, uses Samba or NFS. The trick is simple: on every node, you create a virtual bridge that isn't attached to any interface, with the same static IP. No gateway. Afterward, add this bridge to every VM that needs access to shared disk, and give it a static IP from the same network. This creates a virtual network within one node. Then just create an SMB/NFS server on the node and mount the share on VMs. Since the bridge configuration is similar on every node, nothing will change for the VM during migration, and it should continue to work.
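The bridge described above can be sketched as an /etc/network/interfaces fragment; vmbr99 and the 10.99.99.0/24 range are made-up names for illustration, and the same stanza (with the same IP) would go on every node:

```
auto vmbr99
iface vmbr99 inet static
        address 10.99.99.1/24
        bridge-ports none
        bridge-stp off
        bridge-fd 0
# Node-local only: no physical ports, no gateway; because the stanza is
# identical on every node, VMs keep working after migration.
```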
Performance is fine for my scenarios - I have stable 1GB write and 2-2.5GB read speeds.
There are a few issues with this approach though. There's a slight delay with disk access during migration as the connection to the share is interrupted, which may cause issues if the VM migrates during active I/O. Also, some applications that are sensitive to disk type might not work with shared disk, such as PostgreSQL.

Bottom line - it works for me for now, but I'd be happy if someone could help me fix this in a more proper way.

@IndianaJoe1216

> Yeah, same here. I have the same problem. I tried multiple approaches, and one actually works, but I'm not sure it's the best solution. [...] Bottom line - it works for me for now, but I'd be happy if someone could help me fix this in a more proper way.

Thanks for following up! Unfortunately I don't think this will work for me. How are you mounting the Ceph pool as an NFS share? Are you able to do that if Samba/NFS server is configured on the PVE Nodes?

@mrkhachaturov

mrkhachaturov commented Jan 3, 2025


Ceph allows the addition of multiple public networks.

In my Proxmox cluster, I have six Minisforum MS-01 machines. Each MS-01 is equipped with two SFP+ NICs and two 2.5 Gb NICs.

The SFP+ adapters are configured in an 802.3ad bond with VLAN support.

I have set up VLAN 10 for virtual machines, using the subnet 10.10.0.0/24.

Additionally, I have included this as an extra public network for Ceph.

root@pve01:~# ceph mon stat
e8: 6 mons at {pve01=[v2:10.0.0.81:3300/0,v1:10.0.0.81:6789/0,v2:10.10.0.146:3300/0,v1:10.10.0.146:6789/0],pve02=[v2:10.0.0.82:3300/0,v1:10.0.0.82:6789/0,v2:10.10.0.147:3300/0,v1:10.10.0.147:6789/0],pve03=[v2:10.0.0.83:3300/0,v1:10.0.0.83:6789/0,v2:10.10.0.150:3300/0,v1:10.10.0.150:6789/0],pve04=[v2:10.0.0.84:3300/0,v1:10.0.0.84:6789/0,v2:10.10.0.153:3300/0,v1:10.10.0.153:6789/0],pve05=[v2:10.0.0.85:3300/0,v1:10.0.0.85:6789/0,v2:10.10.0.154:3300/0,v1:10.10.0.154:6789/0],pve06=[v2:10.0.0.86:3300/0,v1:10.0.0.86:6789/0,v2:10.10.0.155:3300/0,v1:10.10.0.155:6789/0]} removed_ranks: {} disallowed_leaders: {}, election epoch 156752, leader 0 pve02, quorum 0,1,2,3,4,5 pve02,pve03,pve04,pve01,pve05,pve06

I plan to test this configuration with Ceph CSI. If everything works as expected, I will share a detailed guide on how to configure it.

@IndianaJoe1216

IndianaJoe1216 commented Jan 3, 2025

> Ceph allows the addition of multiple public networks. [...] I plan to test this configuration with Ceph CSI. If everything works as expected, I will share a detailed guide on how to configure it.

Please do! I only have 3 MS-01's but this should work perfectly for me.

@yet-an-other

yet-an-other commented Jan 3, 2025

> Thanks for following up! Unfortunately I don't think this will work for me. How are you mounting the Ceph pool as an NFS share? Are you able to do that if Samba/NFS server is configured on the PVE Nodes?

Correct, but you have to mount not a Ceph pool but CephFS. You have to create a CephFS volume (node > Ceph > CephFS), and then mount it from /mnt/pve/cephfs.
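A sketch of that approach, assuming the CephFS volume was created via the GUI so Proxmox mounts it at /mnt/pve/cephfs on each node; the NFS export, network range, and VM mountpoint here are illustrative assumptions, not values from this thread:

```shell
# On the PVE node: export the CephFS mountpoint over NFS
apt install nfs-kernel-server
echo '/mnt/pve/cephfs 10.99.99.0/24(rw,no_root_squash,no_subtree_check)' >> /etc/exports
exportfs -ra

# Inside the VM: mount the exported share
mount -t nfs 10.99.99.1:/mnt/pve/cephfs /mnt/shared
```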

@mrkhachaturov

> Ceph allows the addition of multiple public networks. [...] I plan to test this configuration with Ceph CSI. If everything works as expected, I will share a detailed guide on how to configure it.

> Please do! I only have 3 MS-01's but this should work perfectly for me.

I have set up three public networks for Ceph: Thunderbolt Mesh, VLAN 60, and VLAN 80. I recently installed the Ceph Dashboard to verify the connectivity of the monitors across these networks, and I'm pleased to report that everything is functioning smoothly.

CleanShot 2025-01-05 at 04 29 11@2x
CleanShot 2025-01-05 at 04 30 15@2x

root@pve01:/etc/ceph# ceph mon stat
e18: 6 mons at {pve01=[v2:10.0.0.81:3300/0,v1:10.0.0.81:6789/0,v2:10.1.60.1:3300/0,v1:10.1.60.1:6789/0,v2:10.1.80.1:3300/0,v1:10.1.80.1:6789/0],pve02=[v2:10.0.0.82:3300/0,v1:10.0.0.82:6789/0,v2:10.1.60.2:3300/0,v1:10.1.60.2:6789/0,v2:10.1.80.2:3300/0,v1:10.1.80.2:6789/0],pve03=[v2:10.0.0.83:3300/0,v1:10.0.0.83:6789/0,v2:10.1.60.3:3300/0,v1:10.1.60.3:6789/0,v2:10.1.80.3:3300/0,v1:10.1.80.3:6789/0],pve04=[v2:10.0.0.84:3300/0,v1:10.0.0.84:6789/0,v2:10.1.60.4:3300/0,v1:10.1.60.4:6789/0,v2:10.1.80.4:3300/0,v1:10.1.80.4:6789/0],pve05=[v2:10.0.0.85:3300/0,v1:10.0.0.85:6789/0,v2:10.1.60.5:3300/0,v1:10.1.60.5:6789/0,v2:10.1.80.5:3300/0,v1:10.1.80.5:6789/0],pve06=[v2:10.0.0.86:3300/0,v1:10.0.0.86:6789/0,v2:10.1.60.6:3300/0,v1:10.1.60.6:6789/0,v2:10.1.80.6:3300/0,v1:10.1.80.6:6789/0]} removed_ranks: {} disallowed_leaders: {}, election epoch 126, leader 0 pve04, quorum 0,1,2,3,4,5 pve04,pve01,pve02,pve03,pve05,pve06

@flx-666

flx-666 commented Jan 5, 2025

@mrkhachaturov, how did you set up your VLANs to achieve this?
Did you use SDN? In a specific zone? Or just using vmbr.60 locally?
When I create a VLAN in SDN, I can specify the tag, but not an IP range, so where did you specify these address ranges?
Also, how did you add these networks to the Ceph public network? Did you manually edit the /etc/pve/ceph.conf file?

Some details on how to achieve that would be very appreciated, since I am not the only one trying to achieve this ... (and some details on the dashboard install as well - did you install on all your nodes?)

Thanks in advance for the help

@mrkhachaturov

mrkhachaturov commented Jan 5, 2025

> @mrkhachaturov, how did you set up your VLANs to achieve this? [...] Thanks in advance for the help

I will share the guide shortly.

The machines are connected to a MikroTik switch, which manages all connections and VLANs.

In brief: After executing the command pveceph init --network 10.0.0.81/24, you need to edit the ceph.conf file to add any additional public networks.

[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.0.0.81/24
        fsid = 5576dc38-3708-4536-8dad-bf709a212bcc
        mon_allow_pool_delete = true
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.0.0.81/24, 10.1.60.1/24, 10.1.80.1/24

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring
	

When you create monitors, the system will automatically incorporate the host IP addresses from the defined subnets.

I have created a ceph monitor on pve01 through the Proxmox GUI, and here are the results:

root@pve01:/etc/ceph# ceph mon stat
e1: 1 mons at {pve01=[v2:10.0.0.81:3300/0,v1:10.0.0.81:6789/0,v2:10.1.60.1:3300/0,v1:10.1.60.1:6789/0,v2:10.1.80.1:3300/0,v1:10.1.80.1:6789/0]} removed_ranks: {} disallowed_leaders: {}, election epoch 3, leader 0 pve01, quorum 0 pve01

ceph.conf

root@pve01:/etc/ceph# cat ceph.conf
[global]
	auth_client_required = cephx
	auth_cluster_required = cephx
	auth_service_required = cephx
	cluster_network = 10.0.0.81/24
	fsid = 9bcee10a-e2fa-45cf-8308-1e834bc24881
	mon_allow_pool_delete = true
	mon_host =  10.0.0.81 10.1.60.1 10.1.80.1
	ms_bind_ipv4 = true
	ms_bind_ipv6 = false
	osd_pool_default_min_size = 2
	osd_pool_default_size = 3
	public_network = 10.0.0.81/24, 10.1.60.1/24, 10.1.80.1/24

[client]
	keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
	keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.pve01]
	public_addr = 10.0.0.81

I am currently experiencing an issue while creating CephFS and am working to resolve it.
The CephFS volume is created, but the kernel is not mounting it.

P.S.
CephFS cannot be mounted via the kernel driver when multiple public networks are defined.

At the moment only mounting via FUSE is working.

There is also a problem when creating a VM and selecting a Ceph pool for the TPM state disk.
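For anyone comparing the two mount paths mentioned above, a sketch of both commands; the monitor address, mountpoint, and credential paths are placeholders, not values from this thread:

```shell
# Kernel driver mount (fails here when multiple public networks are defined)
mount -t ceph 10.0.0.81:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret

# FUSE client (currently the only one working in this setup)
apt install ceph-fuse
ceph-fuse -n client.admin /mnt/cephfs
```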

@IndianaJoe1216

@mrkhachaturov Creating the monitors worked perfectly for me and I am seeing the same as you. Looking forward to your full guide. I want to get distributed storage via ceph up and running on my docker nodes.

@mrkhachaturov

mrkhachaturov commented Jan 9, 2025

@IndianaJoe1216 check this guide

With 6 nodes I think I will use the Thunderbolt network only for migration, and maybe as the Ceph cluster network.
For the Ceph public network I think it is better to use a 10G interface.

@IndianaJoe1216

@mrkhachaturov reviewing this now. I am doing the same: Thunderbolt network only for the Ceph backend, and the public network on the 10G interface, because that is essentially what the VMs will have access to.

@taslabs-net

taslabs-net commented Feb 28, 2025

After many nights, at least 4, I have this working with 10GbE SFP+ for my public network and TB4 for my Ceph cluster network. I got into it, blacked out, and now here I am. It comes up after reboot. I feel like I'm late to this party.

Screenshot 2025-02-27 at 23 59 59

I promise I'm being serious, but is this good? Or should I be able to move faster? Or am I reaching the limits of my drives?
