
@scyto
Last active February 26, 2025 12:57
setting up the ceph cluster

CEPH HA Setup

Note: this should only be done once you are sure you have a reliable TB mesh network.

This is because the Proxmox UI seems fragile with respect to changing the underlying network after Ceph has been configured.

All installation is done via the command line because the GUI does not understand the mesh network.

This setup doesn't attempt to separate the Ceph public network and Ceph cluster network (not the same as the Proxmox cluster network); the goal is to get an easy working setup.

This gist is part of this series.

Ceph Initial Install & monitor creation

  1. On all nodes execute the command pveceph install --repository no-subscription and accept all the packages when prompted to install them.
  2. On node 1 execute the command pveceph init --network 10.0.0.81/24
  3. On node 1 execute the command pveceph mon create --mon-address 10.0.0.81
  4. On node 2 execute the command pveceph mon create --mon-address 10.0.0.82
  5. On node 3 execute the command pveceph mon create --mon-address 10.0.0.83
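
For convenience, here is the same sequence as a shell sketch (the 10.0.0.8x addresses follow this gist's example addressing - substitute your own mesh IPs):

# on every node: install the ceph packages
pveceph install --repository no-subscription

# on node 1 only: initialise ceph on the mesh network and create the first monitor
pveceph init --network 10.0.0.81/24
pveceph mon create --mon-address 10.0.0.81

# on node 2
pveceph mon create --mon-address 10.0.0.82

# on node 3
pveceph mon create --mon-address 10.0.0.83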

Now if you access the gui Datacenter > pve1 > ceph > monitor you should have 3 running monitors (ignore any errors on the root ceph UI leaf for now).

If so, you can proceed to the next step. If not, you probably have something wrong in your network; check all settings.
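
You can also sanity-check the monitors from any node's shell; a quick sketch (output formats vary slightly between Ceph releases):

# overall cluster health, including how many monitors are in quorum
ceph -s

# just the monitor map and quorum members
ceph mon stat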

Add Additional Managers

  1. On any node go to Datacenter > nodename > ceph > monitor and click create manager in the manager section.
  2. Select a node that doesn't have a manager from the drop-down and click create.
  3. Repeat step 2 as needed. If this fails it probably means your networking is not working.
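
If the GUI misbehaves, the managers can also be created from the shell; a sketch - run it on each node that should host a manager:

# creates a ceph manager daemon on the node you run this on
pveceph mgr create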

Add OSDs

  1. On any node go to Datacenter > nodename > ceph > OSD
  2. Click create OSD and select all the defaults (again, this is for a simple setup).
  3. Repeat until you have an OSD on all 3 nodes (note it can take ~30 seconds for a new OSD to go green).
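
The equivalent from the shell, if you prefer it, is roughly the following; the device path is an assumption - check yours with lsblk first:

# confirm which device is the spare nvme/ssd
lsblk

# create an OSD on the dedicated disk with default settings
pveceph osd create /dev/nvme0n1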

If you find there are no available disks when you try to add one, it probably means your dedicated NVMe/SSD has some other filesystem or an old OSD on it. To wipe the disk use the following UI. Be careful not to wipe your OS disk.
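
If you would rather wipe from the shell, something like this works (destructive - the device path is only an example, double-check it first):

# remove old ceph/LVM metadata from the disk
ceph-volume lvm zap /dev/nvme0n1 --destroy

# or simply clear all filesystem signatures
wipefs --all /dev/nvme0n1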

Create Pool

  1. On any node go to Datacenter > nodename > ceph > pools and click create
  2. Name the pool, e.g. vm-disks, leave the defaults as-is, and click create.
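
Roughly the same thing from the shell (a sketch; --add_storages also registers the new pool as PVE storage so it shows up for VM disks):

# create the pool with default size/min_size and add it as PVE storage
pveceph pool create vm-disks --add_storages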

Configure HA

  1. On any node go to Datacenter > options
  2. Set Cluster Resource Scheduling to ha-rebalance-on-start=1 (this will rebalance HA services across nodes as needed).
  3. Set HA Settings to shutdown_policy=migrate (this will migrate VMs and CTs if you gracefully shut down a node).
  4. Leave the migration settings at their defaults (a separate gist will cover separating the migration network later).
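
For reference, after these changes /etc/pve/datacenter.cfg ends up containing entries along these lines (a sketch - your migration network line will differ):

crs: ha-rebalance-on-start=1
ha: shutdown_policy=migrate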
@nicedevil007

Ok, I was able to test this, but still the same issue :( Maybe it is related to the other problem on the other gist, because after adding one monitor it loses the IP routes after some time :(

@nicedevil007

Command to deploy the Reef version in the shell:

pveceph install --repository no-subscription --version reef

@nicedevil007

The fix for the clock skew was to check the NTP settings and make sure timesync is working ;)
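
A quick way to check this on each node (a sketch; recent PVE releases use chrony for time sync):

# is the clock synchronised?
timedatectl status

# what does chrony think of its time sources?
chronyc sources -v
chronyc tracking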

@zombiehoffa

my theories are:

1. the traffic is getting routed onto the LAN and back into the mesh

2. you found some weird and wonderful new bug

3. you have some sort of lower level TB problem as IIRC you have USB4 not TB4 - so maybe some other bug i won't hit?

if i get time i will test the scenario again. i originally did it when i wasn't using fabricd but OSPF; i did test this exact scenario when i was filing the bugs with intel and don't recall seeing any iperf3 drop off like this.

Did you get a chance to test the traversing a node scenario?

Thanks.

@jacoburgin

Just noticed, is that your pikvm in the pi rack?

@Kirkland-gh

> I was migrating because I added the fourth node so it was rebalancing. I don't think it's an frr thing anymore, I can recreate with iperf3 all I have to do is have the path go through another node and I get 1-5MB/sec instead of 12 gbit/sec. It's really, really weird. I was expecting potentially a 50% performance drop, not nearly entire performance drop just by transiting a node. it happens across ip4 and ip6. Direct connections 12 gbit/sec transit through one or more nodes to get to the node (I disconnected the ring to test it out) and it's 1-5 MB/sec. (it's weird I thought it would drop even more with 2 nodes in between but it basically didn't).

I see the same behavior. 3 11th gen Intel NUCs using TB3. Traversing another node to get to my destination takes me from 19 Gbps to 0.5-5 Mbps, tested with iperf. Did you manage to work around this?
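
For anyone wanting to reproduce the comparison, a direct-vs-transit iperf3 test looks roughly like this (addresses assumed from the gist's 10.0.0.8x scheme; start the server on the target node first):

# on the target node
iperf3 -s

# on the source node: a directly connected neighbour, then a node that is
# only reachable by transiting the middle node (e.g. with one cable pulled)
iperf3 -c 10.0.0.82 -t 30
iperf3 -c 10.0.0.83 -t 30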

@zombiehoffa

> I was migrating because I added the fourth node so it was rebalancing. I don't think it's an frr thing anymore, I can recreate with iperf3 all I have to do is have the path go through another node and I get 1-5MB/sec instead of 12 gbit/sec. It's really, really weird. I was expecting potentially a 50% performance drop, not nearly entire performance drop just by transiting a node. it happens across ip4 and ip6. Direct connections 12 gbit/sec transit through one or more nodes to get to the node (I disconnected the ring to test it out) and it's 1-5 MB/sec. (it's weird I thought it would drop even more with 2 nodes in between but it basically didn't).
>
> I see the same behavior. 3 11th gen Intel NUCs using TB3. Traversing another node to get to my destination takes me from 19 Gbps to 0.5-5 Mbps, tested with iperf. Did you manage to work around this?

Nope. No solution yet. I am eyeing the ms01 instead, as it has dual 10 gig, which should be fine for my purposes. Pretty sad about this because if it worked it would be awesome.

@lettucebuns

I'm wondering if anyone else is having similar issues. I'm able to get through setup without issue, communication works over IPv4/IPv6, but as soon as I add an ISO to the CephFS disk or migrate a VM to the vm-disks Ceph storage, the nodes go offline. Usually, the node where the upload or migration started from stays online, but isn't able to get the status of Ceph components. The hosts cannot ping each other and I cannot ping them from my management workstation. I've wiped the cluster twice and configured it again, the 3rd time as IPv6 but the same issue occurred all 3 builds. I'm using 3 Intel NUCs 12 gen.

I reviewed logs using journalctl -xe but I couldn't find anything that pointed to what the issue could be. If anyone has any suggestions for logs to review I'm happy to do so.

It did look like the line to restart the frr.service did not work for me:

Jan 28 15:10:17 LAB-PX-01 /usr/sbin/ifup[715]: error: /etc/network/interfaces: line41: error processing line 'post-up /usr/bin/systemctl restart frr.service'
Jan 28 15:10:17 LAB-PX-01 /usr/sbin/ifup[715]: >>> Full logs available in: /var/log/ifupdown2/network_config_ifupdown2_43_Jan-28-2024_15:10:16.989774 <<<

My experience was if the Ceph cluster was configured using IPv4, then I needed to manually restart the frr service post-reboot. The 3rd time I configured the Ceph cluster to use IPv6 and it would come back up without needing to restart the frr service.

After seeing this entry, I did try setting the IOPS to 310000 and then 10000 but neither change made a difference:

1706474281.7610672 osd.0 (osd.0) 1 : cluster 3 OSD bench result of 106755.004772 IOPS exceeded the threshold limit of 80000.000000 IOPS for osd.0. IOPS capacity is unchanged at 21500.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].
1706474281.7613761 osd.1 (osd.1) 1 : cluster 3 OSD bench result of 115067.749781 IOPS exceeded the threshold limit of 80000.000000 IOPS for osd.1. IOPS capacity is unchanged at 21500.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].
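
(For reference, the override that log message suggests can be applied per OSD with ceph config - a sketch only; the IOPS value here is an example and should come from an fio measurement:)

# override the assumed IOPS capacity for an SSD-backed OSD
ceph config set osd.0 osd_mclock_max_capacity_iops_ssd 10000

# confirm it took effect
ceph config get osd.0 osd_mclock_max_capacity_iops_ssd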

not sure if this is at all useful:

Jan 28 13:51:59 LAB-PX-02 ceph-mon[1131]: 2024-01-28T13:51:59.263-0500 7f71b77cd6c0 -1 mon.LAB-PX-02@2(probing) e3 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 26 bytes epoch 0)
Jan 28 13:52:04 LAB-PX-02 ceph-mon[1131]: 2024-01-28T13:52:04.266-0500 7f71b77cd6c0 -1 mon.LAB-PX-02@2(probing) e3 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 26 bytes epoch 0)
Jan 28 13:52:04 LAB-PX-02 kernel: libceph: mon1 (1)10.0.0.82:6789 socket closed (con state OPEN)
Jan 28 13:52:08 LAB-PX-02 fabricd[817]: [NBV6R-CM3PT] OpenFabric: Needed to resync LSPDB using CSNP!
Jan 28 13:52:09 LAB-PX-02 ceph-mon[1131]: 2024-01-28T13:52:09.266-0500 7f71b77cd6c0 -1 mon.LAB-PX-02@2(probing) e3 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 26 bytes epoch 0)
Jan 28 13:52:09 LAB-PX-02 kernel: libceph: mon1 (1)10.0.0.82:6789 socket closed (con state OPEN)
Jan 28 13:52:14 LAB-PX-02 ceph-mon[1131]: 2024-01-28T13:52:14.266-0500 7f71b77cd6c0 -1 mon.LAB-PX-02@2(probing) e3 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 26 bytes epoch 0)
Jan 28 13:52:14 LAB-PX-02 kernel: libceph: mon1 (1)10.0.0.82:6789 socket closed (con state OPEN)

Thanks for reading - let me know if you have any tips!

@jacoburgin

I am having something similar. I have 3 NUC12's and it had been working fine until recently; then out of the blue one node is shown as disconnected. I can't ping it and my KVM shows the video output frozen and it won't accept input....

Did you use ceph reef or Quincy? I'm wanting to cross that variable off the list as I never had an issue with Quincy.

@lettucebuns

I've deployed the cluster using both versions - the issue existed for both.

@jacoburgin

Hmmmm. Mine has been fine. Perhaps do not run apt update && apt upgrade after the initial install, in case something new is breaking it from a fresh ISO install.

@jacoburgin

Just restarted all 3 machines. As you say, as soon as I upload an ISO the other two machines crash completely. Will reinstall all 3 later today from the same ISO that has been working, but will not upgrade it, and test....

I'm getting good at reinstalling this!

@lettucebuns

Let me know how it goes! If it works for you maybe I'll consider wiping a 3rd time...

@jacoburgin

jacoburgin commented Jan 30, 2024

I think I'm up to at least 10 wipes 😭😂 keep breaking it on my own 😂

@jacoburgin

> Let me know how it goes! If it works for you maybe I'll consider wiping a 3rd time...

Well I have learnt a lot more about removing cephfs...

But nothing has fixed the random node freezing and subsequently disconnecting.

I fresh installed with the 8.1-1 ISO. Ran apt update only to refresh the package list, or lldp won't install (maybe that was a mistake?).

I have tried with no CephFS for ISOs/templates and used an NFS share instead.

This worked the longest but shortly after the nodes froze...

I'm off to bed but tomorrow I'll try the 8.0-2 ISO, then maybe a kernel update on top if it is stable.

But something has completely broken it for us NUC12 users...

To me though it has to be some sort of driver issue, maybe for the CPU, as at least in my case when the node "disconnects" in the web UI, the machine has actually locked up/frozen (I can see this through my KVM) and has to be hard reset.

@jacoburgin

jacoburgin commented Feb 7, 2024

> Let me know how it goes! If it works for you maybe I'll consider wiping a 3rd time...

Some success: updating the microcode has made migrating a Windows VM possible. No crashes there. But uploading an ISO to CephFS still locked two nodes, which had to be hard reset.

Others are experiencing similar issues after an update. Just not sure what broke it all.

https://www.reddit.com/r/Proxmox/s/pDMvr9WKA8

@jacoburgin

So I have reinstalled the 3 NUC12's to 7.5, zero issues as expected with Scyto's gist. Upgraded to PVE8 and kernel 6.5 and everything is broken.

Downgraded the kernel to 6.2.16-20 (which includes Scyto's TB fix) and have had zero issues so far! I can live migrate again and upload ISOs. No other "fixes" applied, just a change in kernel.

@lettucebuns

@zombiehoffa

The later kernels reverted the fix???

@jacoburgin

> The later kernels reverted the fix???

No, Scyto's thunderbolt fix is applied from 6.2.16-14 onwards.

@DarkPhyber-hg

I just got my MS-01's. I followed the guide and I've re-installed 3 or 4 times now. When using 10GbE for my Ceph network, everything works fine. When using Thunderbolt I keep getting random lockups on any node whenever the Ceph storage pool is under load. I am on kernel 6.5.13-1-pve and PVE 8.1.4.

I wonder if something broke in the later kernel?

@DarkPhyber-hg

DarkPhyber-hg commented Feb 23, 2024

Following up on what i've done so far. I've reinstalled proxmox quite a few times. I couldn't go 3 minutes into restoring a VM from PBS without at least 1 node locking up hard.

In an attempt to isolate the issue, i only used 2 nodes, i was still having the exact same issue. I'm using 2/2 replication and in corosync.conf i gave one node 2 votes.

I decided to eliminate open fabric, so i am just using standard IP'ing assigned to en05 with 2 hosts. I also used reef instead of quincy, so I changed 2 variables. It's been working perfectly for like 8 hours so i think this is a success.

My next test that i'm gonna start working on now, will be to add openfabric to the working configuration. If this doesn't work then there's some kind of issue with TB, openfabric, ceph, PVE 8.1.4, and kernel 6.5, and if it does work then the issue is likely with quincy and the combination of variables on kernel 6.5
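
For anyone copying this two-node test, the extra vote is a per-node setting in /etc/pve/corosync.conf, roughly like this (node names and addresses are placeholders, and remember to bump config_version when editing):

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 2
    ring0_addr: 192.168.1.81
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.82
  }
}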

@DarkPhyber-hg

ok, going to reef did the trick, no more lockups even with openfabric

@jacoburgin

I had to lock the kernel to 6.2 to get stability on my nuc 12's

@DarkPhyber-hg

I forgot that i had commented out the MTU of 65520, so it was defaulting to 1500, when i put it back to 65520 i got an instant lock up! I'm playing around with various mtu sizes right now. What's strange is that with an extended iperf3 test i got no lockups with the higher mtu value.

@DarkPhyber-hg

DarkPhyber-hg commented Feb 25, 2024

ok, I've been playing around with various MTU sizes; there's no perceivable difference on my hardware in iperf3 speeds for an MTU between 1500 and 34,000. I always wind up with an iperf3 test of around 22-23 Gbps. Going to 35,000 I get lockups with Ceph.

Using the Ceph benchmark tool rados, on a write test, is a good way to stress test and see if I will get a lockup without having to use a real-world load. Additionally, I consistently get the best write throughput and IOPS performance with an MTU of 1500 on my current hardware. I am using consumer WD SN850X M.2 drives until I get some enterprise ones, so this could have an impact as well.
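
The stress test referred to here is along these lines (a sketch; the pool name vm-disks is assumed from earlier in the gist):

# 60-second write benchmark against the pool, keeping the objects for a read test
rados bench -p vm-disks 60 write --no-cleanup

# sequential read benchmark, then remove the benchmark objects
rados bench -p vm-disks 60 seq
rados -p vm-disks cleanup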

I have some Samsung PM9A3 u.2 drives on the way, along with some PM983 m.2 drives. Once i get those i'll do another round of testing and hopefully put this stack into production to replace my r730.

@djs42012

Thank you for the gist! It's worked fantastically for me so far.

Pardon me if this is thick, but I have two questions before I proceed with this step (setting up ceph and HA).

As some others have reported, I cannot ping the mesh network's IPv4 addresses unless I systemctl restart frr on each node after startup. That said, am I correctly interpreting your earlier guidance

> i strongly recommend you consider only using IPv6... either use IPv4 or IPv6 addresses for all the monitors

if I just replace the IPv4 addresses with their IPv6 counterparts in the instructions in this part of the gist?

Additionally, I would very much like to add a fourth node to the cluster to serve as a dedicated router/reverse proxy/networking tools stack.

Is this possible, and does it pose any issues if so? I have never worked with HA and am a little stuck on what to make of the settings we put in for the migration network, and what bearing they would have on a potential fourth node.

Currently I am using migration: insecure,network=fc00::81/125 in my datacenter.cfg and everything is working as expected.

I did see one previous poster mention a fourth node but could not gather whether any special configuration changes are required to add one to this setup.

Thank you!

@scyto
Author

scyto commented Nov 20, 2024

@djs42012 there are definitely weird issues on many machines with timing that stop IPv4 fully coming up in some scenarios and sometimes stop the thunderbolt. Folks have found a variety of workarounds (documented in the comment history). I only put fixes in my main gist that a) I implemented myself and b) I think can work for all scenarios. As I don't have any of those issues I can't do a repro and figure out the root cause to get bugs filed with the Proxmox team. It may be as simple as changing some service ordering and start-ups, or it may be as complex as needing a fix in the upstream kernel - we just don't know.

There really is no need to run IPv4 on the mesh network; it can all be configured for IPv6 and seems to work more reliably (I think the IPv4 issue is related to the kernel routing module and timing at startup). As such I have contemplated removing all the IPv4 stuff from the gist; I only did both for my own playing. All my Ceph is configured with IPv6 only, like this.

Ceph Config

[global]
	auth_client_required = cephx
	auth_cluster_required = cephx
	auth_service_required = cephx
	cluster_network = fc00::/64
	fsid = 5e55fd50-d135-413d-bffe-9d0fae0ef5fa
	mon_allow_pool_delete = true
	mon_host = fc00::83 fc00::82 fc00::81
	ms_bind_ipv4 = false
	ms_bind_ipv6 = true
	osd_pool_default_min_size = 2
	osd_pool_default_size = 3
	public_network = fc00::/64

[client]
	keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
	keyring = /etc/pve/ceph/$cluster.$name.keyring

[mds]
	keyring = /var/lib/ceph/mds/ceph-$id/keyring

[mds.pve1]
	host = pve1
	mds_standby_for_name = pve

[mds.pve1-1]
	host = pve1
	mds_standby_for_name = pve

[mds.pve2]
	host = pve2
	mds_standby_for_name = pve

[mds.pve2-1]
	host = pve2
	mds_standby_for_name = pve

[mds.pve3]
	host = pve3
	mds_standby_for_name = pve

[mds.pve3-1]
	host = pve3
	mds_standby_for_name = pve

[mon.pve1-IPv6]
	public_addr = fc00::81

[mon.pve2-IPv6]
	public_addr = fc00::82

[mon.pve3-IPv6]
	public_addr = fc00::83

and PVE cluster config

root@pve1:/etc/pve# cat datacenter.cfg 

crs: ha-rebalance-on-start=1
email_from: [email protected]
keyboard: en-us
# migration: insecure,network=10.0.0.80/29
migration: insecure,network=fc00::81/64
notify: target-fencing=send-alerts-to-alex,target-package-updates=send-alerts-to-alex,target-replication=send-alerts-to-alex

Hope that helps you make your decision on which approach you want.

@djs42012

Thank you @scyto , that does help. As for adding a fourth node to the cluster, do you foresee any issues there?

@scyto
Author

scyto commented Nov 20, 2024

@djs42012 not at all, just remember that for cross-node traffic it might now be a 2-hop process, so traffic may have to pass through one node to get to another; I don't know what that means for performance. But the routing will work - it's no different on a three-node setup to pulling one of the cables: in that scenario the two nodes at the ends of the chain have to pass traffic through the one in the middle.

@djs42012

@scyto Great, thank you! I wasn't sure if the migration: insecure,network=fc00::81/64 entry would somehow lock out the fourth node since, to my understanding, that references the thunderbolt network to which it has no access.
