@scyto
Last active October 9, 2024 02:53
setting up the ceph cluster

CEPH HA Setup

Note: this should only be done once you are sure you have a reliable TB mesh network.

This is because the Proxmox UI seems fragile with respect to changing the underlying network after Ceph has been configured.

All installation is done via the command line because the GUI does not understand the mesh network.

This setup doesn't attempt to separate the Ceph public network and the Ceph cluster network (not the same as the Proxmox cluster network). The goal is to get an easy working setup.

this gist is part of this series

Ceph Initial Install & monitor creation

  1. On all nodes execute the command pveceph install --repository no-subscription, then accept all the packages and install
  2. On node 1 execute the command pveceph init --network 10.0.0.81/24
  3. On node 1 execute the command pveceph mon create --mon-address 10.0.0.81
  4. On node 2 execute the command pveceph mon create --mon-address 10.0.0.82
  5. On node 3 execute the command pveceph mon create --mon-address 10.0.0.83
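The five steps above can be sketched as one script that is run on each node in turn, node 1 first. This is a minimal sketch and not part of the original steps: the NODE and DRY_RUN variables and the run helper are hypothetical, and it assumes node N uses the address 10.0.0.8N as in the steps above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of steps 1-5. NODE, DRY_RUN and run() are hypothetical helpers
# for this example; the pveceph commands are the ones from the steps above.
NODE="${NODE:-1}"        # which node this is: 1, 2 or 3
DRY_RUN="${DRY_RUN:-1}"  # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

ceph_bootstrap() {
    node="$1"
    # Step 1: install the Ceph packages (asks for confirmation when executed).
    run pveceph install --repository no-subscription
    # Step 2: only node 1 initialises the Ceph config.
    if [ "$node" = "1" ]; then
        run pveceph init --network 10.0.0.81/24
    fi
    # Steps 3-5: one monitor per node, bound to that node's mesh address.
    run pveceph mon create --mon-address "10.0.0.8${node}"
}

ceph_bootstrap "$NODE"
```

Running it with NODE=2 prints the install and monitor commands for node 2 only; setting DRY_RUN=0 executes them instead of printing.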

Now if you go to Datacenter > pve1 > ceph > monitor in the GUI you should have 3 running monitors (ignore any errors on the root ceph UI leaf for now).

If so, you can proceed to the next step. If not, you probably have something wrong in your network; check all settings.

Add Additional managers

  1. On any node go to Datacenter > nodename > ceph > monitor and click create manager in the manager section.
  2. Select a node that doesn't have a manager from the drop-down and click create
  3. Repeat step 2 as needed

If this fails it probably means your networking is not working.
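A manager can also be created from the shell of the node that should host it. This is a minimal sketch, assuming pveceph mgr create does the same job as the create manager button; the function name mgr_commands is hypothetical and it only prints the commands.

```shell
#!/bin/sh
# Prints the commands to add a manager on the current node and then check it.
# mgr_commands is a hypothetical name; use its output on a node without a manager.
mgr_commands() {
    echo "pveceph mgr create"   # create a manager on this node
    echo "ceph -s"              # the mgr line of the status shows the managers
}

mgr_commands
```

Piping the output to sh (mgr_commands | sh) would run the commands for real.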

Add OSDs

  1. On any node go to Datacenter > nodename > ceph > OSD
  2. Click create OSD and select all the defaults (again, this is for a simple setup)
  3. Repeat until all 3 nodes have an OSD (note it can take 30 seconds for a new OSD to go green)
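The same OSD can probably be created from the node's shell. This is a sketch, assuming the data disk is /dev/nvme0n1 (a hypothetical device name, check yours with lsblk first) and that pveceph osd create with no extra options is close enough to the GUI defaults. The function name osd_create_cmd is hypothetical and it only prints the command.

```shell
#!/bin/sh
# Prints the command that creates one OSD on the given disk.
# osd_create_cmd and the default device are hypothetical; adjust to your hardware.
osd_create_cmd() {
    dev="${1:-/dev/nvme0n1}"
    case "$dev" in
        /dev/*) echo "pveceph osd create $dev" ;;
        *) echo "expected a /dev/... path, got: $dev" >&2; return 1 ;;
    esac
}

osd_create_cmd /dev/nvme0n1
```

Once the command has been run for real, ceph osd tree should list the new OSD as up when it has gone green.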

If you find there are no available disks when you try to add an OSD, it probably means your dedicated nvme/ssd has some other filesystem or an old OSD on it. To wipe the disk, use the wipe disk UI under the node's Disks view. Be careful not to wipe your OS disk.
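The wipe can also be done from the shell. A hedged sketch: ceph-volume lvm zap with --destroy removes old OSD LVM volumes and partition data from a disk. The function name wipe_cmd and both device paths are hypothetical, and the guard only compares against an OS disk that you name yourself, so double-check with lsblk before running anything it prints.

```shell
#!/bin/sh
# Prints the wipe command for a data disk, refusing the disk named as the OS disk.
# wipe_cmd, /dev/nvme0n1 and /dev/sda are hypothetical example names.
wipe_cmd() {
    target="$1"
    os_disk="$2"
    if [ -z "$target" ] || [ -z "$os_disk" ]; then
        echo "usage: wipe_cmd <disk-to-wipe> <os-disk>" >&2
        return 2
    fi
    if [ "$target" = "$os_disk" ]; then
        echo "refusing to wipe the OS disk: $target" >&2
        return 1
    fi
    echo "ceph-volume lvm zap $target --destroy"
}

wipe_cmd /dev/nvme0n1 /dev/sda
```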

Create Pool

  1. On any node go to Datacenter > nodename > ceph > pools and click create
  2. Name the pool, e.g. vm-disks, leave the defaults as they are and click create
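A possible command line equivalent, as a sketch only: pveceph pool create with --add_storages 1 creates the pool and registers it as VM/CT storage, which is assumed here to match leaving the GUI defaults alone. The function name pool_create_cmd is hypothetical and it only prints the command.

```shell
#!/bin/sh
# Prints the command that creates a pool and registers it as PVE storage.
# pool_create_cmd is a hypothetical name; vm-disks is the example name from step 2.
pool_create_cmd() {
    name="${1:-vm-disks}"
    echo "pveceph pool create $name --add_storages 1"
}

pool_create_cmd vm-disks
```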

Configure HA

  1. On any node go to Datacenter > options
  2. Set Cluster Resource Scheduling to ha-rebalance-on-start=1 (this will rebalance nodes as needed).
  3. Set HA Settings to shutdown_policy=migrate (this will migrate VMs and CTs if you gracefully shut down a node).
  4. Leave the migration settings as default (a separate gist will talk about separating the migration network later).
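Steps 2 and 3 can likely be applied in one go through the cluster options API. This is a sketch under the assumption that pvesh set /cluster/options accepts the crs and ha properties in the same key=value form shown in the steps above; the function name ha_options_cmd is hypothetical and it only prints the command.

```shell
#!/bin/sh
# Prints a pvesh call that sets both datacenter options from steps 2 and 3.
# ha_options_cmd is a hypothetical name for this example.
ha_options_cmd() {
    echo "pvesh set /cluster/options --crs ha-rebalance-on-start=1 --ha shutdown_policy=migrate"
}

ha_options_cmd
```

After running the printed command, pvesh get /cluster/options should show both values, and Datacenter > options in the GUI should match.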
@jacoburgin

jacoburgin commented Jan 30, 2024

I think I'm up to at least 10 wipes 😭😂 keep breaking it on my own 😂

@jacoburgin

Let me know how it goes! If it works for you maybe I'll consider wiping a 3rd time...

Well I have learnt a lot more about removing cephfs...

But nothing has fixed the random node freezing and subsequently disconnecting.

I fresh installed with the 8.1-1 iso. Ran apt update only to get the package list or lldp won't install (maybe that was a mistake)?

I have tried with no cephfs for ISOs-Templates and used a NFS share instead.

This worked the longest but shortly after nodes froze...

I'm off to bed but tomorrow I'll try the 8.0-2 iso, then maybe a kernel update on top if it is stable.

But something has completely broken it for us NUC12 users...

To me, though, it has to be some sort of driver issue, maybe for the CPU, as at least in my case when the node "disconnects" in the webui, the machine has actually locked up/frozen (I can see this through my KVM) and has to be hard reset.

@jacoburgin

jacoburgin commented Feb 7, 2024

Let me know how it goes! If it works for you maybe I'll consider wiping a 3rd time...

Some success: updating the microcode has made migrating a windows VM possible. No crashes there. But uploading an iso to a cephfs still locked two nodes, which had to be hard reset.

Others are experiencing similar after AN update. Just not sure what broke it all

https://www.reddit.com/r/Proxmox/s/pDMvr9WKA8

@jacoburgin

So I have reinstalled the 3 nuc12's to 7.5, zero issues as expected with Scyto's gist. Upgraded to PVE8 and kernel 6.5 and everything is broken.

Downgraded the kernel to 6.2.16.20 (which includes Scyto's TB fix) and have had zero issues so far! I can live migrate again and upload ISOs. No other "fixes" applied, just a change in kernel.

@lettucebuns

@zombiehoffa

The later kernels reverted the fix???

@jacoburgin

The later kernels reverted the fix???

No, Scyto's thunderbolt fix is applied from 6.2.16-14 onwards.

@DarkPhyber-hg

i just got my ms-01's, i followed the guide and i've re-installed 3 or 4 times now. When using 10gbe for my ceph network, everything works fine. When using thunderbolt i keep getting random lockups on any node whenever the ceph storage pool is under load. I am on kernel 6.5.13-1-pve and pve 8.1.4.

I wonder if something broke in the later kernel?

@DarkPhyber-hg

DarkPhyber-hg commented Feb 23, 2024

Following up on what i've done so far. I've reinstalled proxmox quite a few times. I couldn't go 3 minutes into restoring a VM from PBS without at least 1 node locking up hard.

In an attempt to isolate the issue, i only used 2 nodes, i was still having the exact same issue. I'm using 2/2 replication and in corosync.conf i gave one node 2 votes.

I decided to eliminate open fabric, so i am just using standard IP'ing assigned to en05 with 2 hosts. I also used reef instead of quincy, so I changed 2 variables. It's been working perfectly for like 8 hours so i think this is a success.

My next test that i'm gonna start working on now, will be to add openfabric to the working configuration. If this doesn't work then there's some kind of issue with TB, openfabric, ceph, PVE 8.1.4, and kernel 6.5, and if it does work then the issue is likely with quincy and the combination of variables on kernel 6.5

@DarkPhyber-hg

ok, going to reef did the trick, no more lockups even with openfabric

@jacoburgin

I had to lock the kernel to 6.2 to get stability on my nuc 12's
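For anyone wanting to try the same, kernel pinning on PVE can be sketched with proxmox-boot-tool. The version string 6.2.16-20-pve is an assumption based on the 6.2.16.20 kernel mentioned earlier in this thread, so use whatever the list command actually shows. The function name kernel_pin_cmds is hypothetical and it only prints the commands.

```shell
#!/bin/sh
# Prints the commands to list installed kernels and pin one of them.
# kernel_pin_cmds is a hypothetical name; the default version is an assumption.
kernel_pin_cmds() {
    version="${1:-6.2.16-20-pve}"
    echo "proxmox-boot-tool kernel list"
    echo "proxmox-boot-tool kernel pin $version"
}

kernel_pin_cmds
```

A reboot is needed for the pinned kernel to take effect, and proxmox-boot-tool kernel unpin removes the pin again.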

@DarkPhyber-hg

I forgot that i had commented out the MTU of 65520, so it was defaulting to 1500, when i put it back to 65520 i got an instant lock up! I'm playing around with various mtu sizes right now. What's strange is that with an extended iperf3 test i got no lockups with the higher mtu value.

@DarkPhyber-hg

DarkPhyber-hg commented Feb 25, 2024

ok, i've been playing around with various mtu sizes, there's no perceivable difference on my hardware in iperf3 speeds for an mtu between 1500 and 34,000. I always wind up with an iperf3 test of around 22-23gbps. Going to 35,000 i get lockups with ceph.
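An MTU sweep like this could be scripted roughly as follows. This is only a sketch: the function name mtu_sweep_cmds, the peer address 10.0.0.82 and the list of MTU values are hypothetical choices, and en05 is the interface name mentioned earlier in this thread. It prints the commands rather than running them.

```shell
#!/bin/sh
# Prints an MTU sweep: set the MTU on the thunderbolt interface, then run iperf3.
# mtu_sweep_cmds, the peer address and the MTU list are hypothetical choices.
mtu_sweep_cmds() {
    if [ "$#" -lt 3 ]; then
        echo "usage: mtu_sweep_cmds <iface> <peer> <mtu>..." >&2
        return 2
    fi
    iface="$1"
    peer="$2"
    shift 2
    for mtu in "$@"; do
        echo "ip link set dev $iface mtu $mtu"   # not persistent across reboots
        echo "iperf3 -c $peer -t 30"             # needs iperf3 -s on the peer
    done
}

mtu_sweep_cmds en05 10.0.0.82 1500 9000 34000
```

The peer has to be set to the same MTU for each round, and since iperf3 alone did not trigger the lockups here, a ceph write load is the more telling test.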

Using the ceph benchmark tool rados, on a write test, is a good way to stress test and see if i will get a lockup without having to use real world load. Additionally, i consistently get the best write throughput and iop performance with an mtu of 1500 with my current hardware. I am using consumer wd sn850x m.2 drives, until i get some enterprise ones, so this could have an impact on this as well.
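A write test of that kind might look like the sketch below. The pool name vm-disks is the example name from the gist and the 60 second duration is an arbitrary choice; the function name rados_stress_cmd is hypothetical and it only prints the command.

```shell
#!/bin/sh
# Prints a rados write benchmark command for the given pool and duration.
# rados_stress_cmd is a hypothetical name; pool and seconds are example values.
rados_stress_cmd() {
    pool="${1:-vm-disks}"
    seconds="${2:-60}"
    echo "rados bench -p $pool $seconds write"
}

rados_stress_cmd vm-disks 60
```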

I have some Samsung PM9A3 u.2 drives on the way, along with some PM983 m.2 drives. Once i get those i'll do another round of testing and hopefully put this stack into production to replace my r730.
