Note: this should only be done once you are sure you have a reliable TB mesh network. This is because the Proxmox UI seems fragile with respect to changing the underlying network after Ceph has been configured.

All installation is done via the command line because the GUI does not understand the mesh network.

This setup doesn't attempt to separate the Ceph public network and Ceph cluster network (not the same as the Proxmox cluster network); the goal is an easy working setup.

This gist is part of this series.
- On all nodes execute the command:

  ```
  pveceph install --repository no-subscription
  ```

  Accept all the packages and install.
- On node 1 execute the command:

  ```
  pveceph init --network 10.0.0.81/24
  ```
- On node 1 execute the command:

  ```
  pveceph mon create --mon-address 10.0.0.81
  ```
- On node 2 execute the command:

  ```
  pveceph mon create --mon-address 10.0.0.82
  ```
- On node 3 execute the command:

  ```
  pveceph mon create --mon-address 10.0.0.83
  ```
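As a sketch of how to double-check from the CLI (run on any Proxmox node; these tools are installed alongside `pveceph`):

```shell
# Overall cluster health; should show 3 monitors in quorum.
ceph -s

# Compact monitor/quorum summary.
ceph mon stat
```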
Now, if you access the GUI at Datacenter > pve1 > Ceph > Monitor,
you should see 3 running monitors (ignore any errors on the root Ceph UI leaf for now).
If so, you can proceed to the next step. If not, something is probably wrong with your network; check all settings.
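If the monitors don't come up, a quick mesh sanity check from node 1 might look like the following (assuming the 10.0.0.8x addresses used above, and an FRR/OpenFabric mesh as set up earlier in this series):

```shell
# Can node 1 reach the other nodes over their mesh addresses?
ping -c 2 10.0.0.82
ping -c 2 10.0.0.83

# If the mesh is routed with FRR/OpenFabric, inspect the learned topology.
vtysh -c "show openfabric topology"
```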
- On any node go to Datacenter > nodename > Ceph > Monitor and click Create in the Manager section.
- Select a node that doesn't yet have a manager from the drop-down and click Create.
- Repeat the previous step as needed.

If this fails, it probably means your networking is not working.
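If you prefer the command line, the same managers can be created with `pveceph` (run on each node that should host a manager):

```shell
# Create a manager daemon on the local node.
pveceph mgr create
```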
- On any node go to Datacenter > nodename > Ceph > OSD.
- Click Create OSD and select all the defaults (again, this is for a simple setup).
- Repeat until all 3 nodes have an OSD (note it can take 30 seconds for a new OSD to go green).

If you find there are no available disks when you try to add an OSD, it probably means your dedicated NVMe/SSD has some other filesystem or an old OSD on it. To wipe the disk, use the following UI. Be careful not to wipe your OS disk.
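The disk can also be wiped and the OSD created from the CLI. The device name below is only an example; verify yours with `lsblk` first, and be very careful not to target the OS disk:

```shell
# Identify the right disk first; /dev/nvme0n1 is an example only.
lsblk

# Destroy any old filesystem signatures / leftover Ceph LVM data on it.
ceph-volume lvm zap /dev/nvme0n1 --destroy

# Create the OSD on the now-empty disk.
pveceph osd create /dev/nvme0n1
```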
- On any node go to Datacenter > nodename > Ceph > Pools and click Create.
- Name the volume, e.g. vm-disks, leave the defaults as is, and click Create.
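The pool can likewise be created from the CLI; `--add_storages` also registers it as a Proxmox storage, matching what the GUI does by default:

```shell
pveceph pool create vm-disks --add_storages
```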
- On any node go to Datacenter > Options.
- Set Cluster Resource Scheduling to ha-rebalance-on-start=1 (this will rebalance nodes as needed).
- Set HA Settings to shutdown_policy=migrate (this will migrate VMs and CTs if you gracefully shut down a node).
- Leave Migration Settings at the default (a separate gist will cover separating the migration network later).
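These datacenter options are stored in `/etc/pve/datacenter.cfg`; after the two changes above, the relevant lines should look roughly like this (a sketch for orientation, not something to copy blindly over your existing file):

```
crs: ha-rebalance-on-start=1
ha: shutdown_policy=migrate
```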
Yeah, same here.
I have the same problem. I tried multiple approaches, and one actually works, but I'm not sure it's the best solution. But it works. :)
Disclaimer: Network engineering is not my area of expertise, so everything I'm writing here might be completely off-base.
Note: My setup is a bit different - I have two nodes and one QDevice. This shouldn't affect the networking, but it's worth mentioning, just in case.
At first, my uneducated guess was that if we're using FRR to set up a mesh between nodes, then we need to use FRR on the VM to include it in the mesh. However, as mentioned, I'm not a network engineer and have zero knowledge of how FRR actually works, so I gave up as soon as my first straightforward and naive attempt failed.
The second approach was to use VXLAN. VXLAN allows you to create a virtual network within a cluster, and it uses the mesh network to establish communication between VMs on different nodes. The nodes themselves can also be part of the network. It's quite easy to install, and from the VM or node perspective it looks like just another bridge or NIC. The network itself works perfectly (though throughput dropped from around 26Gb/s to around 10Gb/s on average, and lots of retries appeared, but I didn't dig deeper). Unfortunately, I was unable to use this network as the Ceph public network and force the monitors to listen on IPs from it. As soon as I changed the config, the cluster went down.
The final attempt, which actually works but looks ugly, uses Samba or NFS. The trick is simple: on every node, you create a virtual bridge that isn't attached to any interface, with the same static IP. No gateway. Afterward, add this bridge to every VM that needs access to shared disk, and give it a static IP from the same network. This creates a virtual network within one node. Then just create an SMB/NFS server on the node and mount the share on VMs. Since the bridge configuration is similar on every node, nothing will change for the VM during migration, and it should continue to work.
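The node-local bridge in that last approach can be sketched in `/etc/network/interfaces` like this (vmbr9 and the 10.99.99.0/24 range are made-up examples; pick names and a subnet unused in your setup):

```
auto vmbr9
iface vmbr9 inet static
        address 10.99.99.1/24
        bridge-ports none
        bridge-stp off
        bridge-fd 0
```

With the same stanza on every node, a VM attached to vmbr9 with, say, 10.99.99.10/24 can always reach the local node's SMB/NFS share at 10.99.99.1, before and after migration.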
Performance is fine for my scenarios - I get stable 1GB/s write and 2-2.5GB/s read speeds.
There are a few issues with this approach though. There's a slight delay with disk access during migration as the connection to the share is interrupted, which may cause issues if the VM migrates during active I/O. Also, some applications that are sensitive to disk type might not work with shared disk, such as PostgreSQL.
Bottom line - it works for me for now, but I'd be happy if someone could help me fix this in a more proper way.