
THIS GIST IS NOW DEPRECATED. A NEW ONE IS AVAILABLE HERE. I WON'T BE UPDATING THIS ONE OR REPLYING TO COMMENTS ON THIS ONE (COMMENTS ARE NOW DISABLED).

Enable Dual Stack (IPv4 and IPv6) OpenFabric Routing

This gist is part of this series.

This assumes you are running Proxmox 8.2 and that the line source /etc/network/interfaces.d/* is at the end of the interfaces file (this is added automatically to both new and upgraded installations of Proxmox 8.2).
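For reference, the tail of /etc/network/interfaces should already contain exactly that line (shown here for clarity; nothing else needs to change at this point):

source /etc/network/interfaces.d/*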

This changes the previous file design. Thanks to @NRGNet for the suggestion to move the Thunderbolt settings to a file in /etc/network/interfaces.d; it makes the system much more reliable in general and more maintainable, especially for folks using IPv4 on the private cluster network (I still recommend the IPv6 FC00 network you will see in these docs).

This will result in an IPv4 and IPv6 routable mesh network that can survive any one node failure or any one cable failure. All the steps in this section must be performed on each node.

NOTES on Dual Stack

I have included this for completeness, but I only run the FC00:: IPv6 network, as Ceph does not support dual stack; I strongly recommend you consider using only IPv6. For Ceph in particular, do not dual stack: use either IPv4 or IPv6 addresses for all the monitors, MDS, and daemons. Despite the docs implying it is OK, my finding on Quincy is that it is funky....
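For reference, IPv6-only binding is what the two standard ms_bind options in ceph.conf control; a minimal sketch for illustration (not a step in this guide):

[global]
ms_bind_ipv6 = true
ms_bind_ipv4 = false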

Defining the Thunderbolt network

Create a new file with nano /etc/network/interfaces.d/thunderbolt and populate it with the following. Remember that X should match your node number, so for example 1, 2, or 3.

auto lo:0
iface lo:0 inet static
        address 10.0.0.8X/32
        
auto lo:6
iface lo:6 inet static
        address fc00::8X/128
        
allow-hotplug en05
iface en05 inet manual
        mtu 65520

allow-hotplug en06
iface en06 inet manual
        mtu 65520

Save the file and repeat on each node.
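As a worked example, on node 1 the finished file would read:

auto lo:0
iface lo:0 inet static
        address 10.0.0.81/32

auto lo:6
iface lo:6 inet static
        address fc00::81/128

allow-hotplug en05
iface en05 inet manual
        mtu 65520

allow-hotplug en06
iface en06 inet manual
        mtu 65520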

Enable IPv4 and IPv6 forwarding

  1. use nano /etc/sysctl.conf to open the file
  2. uncomment #net.ipv6.conf.all.forwarding=1 (remove the # symbol)
  3. uncomment #net.ipv4.ip_forward=1 (remove the # symbol)
  4. save the file
  5. issue reboot now for a complete reboot
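After the reboot, you can confirm both settings took effect with a quick check (optional, not one of the steps above):

sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding
# expected output:
# net.ipv4.ip_forward = 1
# net.ipv6.conf.all.forwarding = 1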

FRR Setup

Install FRR

Install Free Range Routing (FRR) with apt install frr

Enable the fabricd daemon

  1. edit the frr daemons file (nano /etc/frr/daemons) to change fabricd=no to fabricd=yes
  2. save the file
  3. restart the service with systemctl restart frr
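To confirm the fabricd daemon actually came up after the restart, an optional sanity check:

systemctl status frr
pgrep -l fabricd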

Mitigate FRR Timing Issues at Boot

Add post-up command to /etc/network/interfaces

sudo nano /etc/network/interfaces

Add post-up /usr/bin/systemctl restart frr.service as the last line in the file (this should go after the line that starts with source)

NOTE for Minisforum MS-01 users

Make the post-up line above read post-up sleep 5 && /usr/bin/systemctl restart frr.service instead. This has been verified to be required due to timing issues seen on those units; the exact cause is unknown, and it may be needed on other hardware too.
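Either way, the last two lines of /etc/network/interfaces should end up looking something like this (MS-01 variant shown):

source /etc/network/interfaces.d/*
post-up sleep 5 && /usr/bin/systemctl restart frr.service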

Configure OpenFabric (perform on all nodes)

  1. enter the FRR shell with vtysh
  2. optionally show the current config with show running-config
  3. enter the configure mode with configure
  4. Apply the configuration below (it is possible to cut and paste this into the shell instead of typing it manually; you may need to press return to set the last !. Also check there were no errors in response to the pasted text).

Note: the X should be the number of the node you are working on - for example, net 49.0000.0000.0001.00, 49.0000.0000.0002.00, or 49.0000.0000.0003.00.

ip forwarding
ipv6 forwarding
!
interface en05
ip router openfabric 1
ipv6 router openfabric 1
exit
!
interface en06
ip router openfabric 1
ipv6 router openfabric 1
exit
!
interface lo
ip router openfabric 1
ipv6 router openfabric 1
openfabric passive
exit
!
router openfabric 1
net 49.0000.0000.000X.00
exit
!

  5. you may need to press return after the last ! to get to a new line - if so, do this

  6. exit the configure mode with the command end

  7. save the config with write memory

  8. confirm the config applied correctly with show running-config - note the order of the items will be different from how you entered them, and that's OK. (If you made a mistake, I found the easiest way to fix it was to edit /etc/frr/frr.conf - but be careful if you do that.)

  9. use the command exit to leave setup

  10. repeat steps 1 to 9 on the other two nodes

  11. once you have configured all 3 nodes, issue the command vtysh -c "show openfabric topology" - if you did everything right you will see:

Area 1:
IS-IS paths to level-2 routers that speak IP
Vertex               Type         Metric Next-Hop             Interface Parent
pve1                                                                  
10.0.0.81/32         IP internal  0                                     pve1(4)
pve2                 TE-IS        10     pve2                 en06      pve1(4)
pve3                 TE-IS        10     pve3                 en05      pve1(4)
10.0.0.82/32         IP TE        20     pve2                 en06      pve2(4)
10.0.0.83/32         IP TE        20     pve3                 en05      pve3(4)

IS-IS paths to level-2 routers that speak IPv6
Vertex               Type         Metric Next-Hop             Interface Parent
pve1                                                                  
fc00::81/128         IP6 internal 0                                     pve1(4)
pve2                 TE-IS        10     pve2                 en06      pve1(4)
pve3                 TE-IS        10     pve3                 en05      pve1(4)
fc00::82/128         IP6 internal 20     pve2                 en06      pve2(4)
fc00::83/128         IP6 internal 20     pve3                 en05      pve3(4)

IS-IS paths to level-2 routers with hop-by-hop metric
Vertex               Type         Metric Next-Hop             Interface Parent

Now you should be in a place to ping each node from every node across the Thunderbolt mesh using IPv4 or IPv6 as you see fit.
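For example, from pve1:

ping 10.0.0.82
ping fc00::82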

THIS GIST IS NOW DEPRECATED. A NEW ONE IS AVAILABLE HERE. I WON'T BE UPDATING THIS ONE OR REPLYING TO COMMENTS ON THIS ONE (COMMENTS ARE NOW DISABLED).

scyto commented Apr 20, 2025

@corvy more thoughts

  1. I want the frr service to start even if en05 and en06 are not up, because the new SDN functionality of Proxmox uses frr - blocking all other SDN functionality while waiting for en05 and en06 to come up seems like a bad long-term approach.
  2. I see there is a difference between requires/bindsto and wants - as such I think revising this so Ceph doesn't start until frr is up and either en05 or en06 is up makes sense. I agree it shouldn't be bindsto, as that would stop ceph/frr I think - so I concur your wants and after is the right approach - I am just thinking it is Ceph that should be dependent on that. I think moving the IPv4 and IPv6 addresses from interfaces into the frr config might really help here (it works, but as I never really had many failures, it is really hard for me to test).

Let me know what you think; I am going to try and add your wants/after to the Ceph service and see what happens....

scyto commented Apr 20, 2025

@corvy this path /etc/systemd/system/frr.service.d/ does not exist on my Proxmox - did you create the frr.service.d sub-directory? Can I do the same for Ceph in some way? Or am I OK continuing to edit the ceph.target file instead?

corvy commented Apr 20, 2025

Yes I did. I think you can do this for any systemd service.
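(For readers following along: corvy's actual drop-in is not reproduced in this thread. A sketch of the kind of file being discussed, with illustrative directive values, would be:

# /etc/systemd/system/frr.service.d/dependencies.conf
[Unit]
Wants=sys-subsystem-net-devices-en05.device sys-subsystem-net-devices-en06.device
After=sys-subsystem-net-devices-en05.device sys-subsystem-net-devices-en06.device

followed by systemctl daemon-reload to pick it up.)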

scyto commented Apr 20, 2025

@corvy

I am just wondering why you are not revisiting getting IPv4 to work?

Well, 'life': a) I didn't need IPv4 to work properly, as I dropped it 12 months ago when I was using IPv6; b) brain surgery means I haven't had the energy for any of this for 6+ months; and c) what energy I did have went into testing TrueNAS on a ZimaCube Pro and then deciding to build a rackmount EPYC 9115 NAS (still not in production lol)

corvy commented Apr 20, 2025

On your other questions, I think we just need to do some testing. The important thing is that frr requires the IP stack to be up before starting the first time. After this it seems more robust to things restarting or changing. No expert here - just my experience.
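(One standard way to express that ordering in systemd - a sketch only, not something posted in this thread - is a drop-in that makes frr wait for the network to come up:

# /etc/systemd/system/frr.service.d/network-online.conf
[Unit]
Wants=network-online.target
After=network-online.target
)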

scyto commented Apr 20, 2025

Yes I did. I think you can do this for any systemd service.

thanks for helping me understand systemd - something I have avoided until now :-)

Well, I like that - better than modifying files Proxmox may overwrite on upgrades..... I assume the dir name will be ceph.service.d - I am a little concerned that might not work, as Proxmox seems to define each of the Ceph sub-services specifically....

root@pve1:/etc/systemd/system# find / -name ceph*service
/var/lib/systemd/deb-systemd-helper-enabled/ceph.target.wants/ceph-crash.service
/usr/lib/systemd/system/ceph-crash.service
/usr/lib/systemd/system/ceph-mds@.service
/usr/lib/systemd/system/ceph-mgr@.service
/usr/lib/systemd/system/ceph-mon@.service
/usr/lib/systemd/system/ceph-osd@.service
/usr/lib/systemd/system/ceph-radosgw@.service
/usr/lib/systemd/system/ceph-volume@.service

so there is no single 'ceph.service', and there are multiple named instances of the services running

ceph-crash.service            loaded active running Ceph crash dump collector
ceph-mds@<name>.service       loaded active running Ceph metadata server daemon
ceph-mds@<name>.service       loaded active running Ceph metadata server daemon
ceph-mgr@<name>.service       loaded active running Ceph cluster manager daemon
ceph-mon@<name>.service       loaded active running Ceph cluster monitor daemon
ceph-osd@0.service            loaded active running Ceph object storage daemon osd.0
ceph-osd@1.service            loaded active running Ceph object storage daemon osd.1

So I am stumped on how to create a dependencies.conf file for these that would work easily across installs with different numbers of services and names....

..sometime later...

OK, copilot says I should create these; I guess I could create one file and symlink all of these........

/etc/systemd/system/ceph-mds@.service.d/dependencies.conf
/etc/systemd/system/ceph-mgr@.service.d/dependencies.conf
/etc/systemd/system/ceph-mon@.service.d/dependencies.conf
/etc/systemd/system/ceph-osd@.service.d/dependencies.conf
/etc/systemd/system/ceph-volume@.service.d/dependencies.conf

..some more time later....

I think copilot helped me yet again; I did the following:

# Create directories for drop-in configuration files
sudo mkdir -p /etc/systemd/system/ceph-mds@.service.d
sudo mkdir -p /etc/systemd/system/ceph-mgr@.service.d
sudo mkdir -p /etc/systemd/system/ceph-mon@.service.d
sudo mkdir -p /etc/systemd/system/ceph-osd@.service.d
sudo mkdir -p /etc/systemd/system/ceph-volume@.service.d

# Create a single dependencies configuration file with the updated content
echo -e "[Unit]\nWants=frr.service sys-subsystem-net-devices-en05.device sys-subsystem-net-devices-en06.device\nAfter=frr.service sys-subsystem-net-devices-en05.device sys-subsystem-net-devices-en06.device" | sudo tee /etc/systemd/ceph-dependencies.conf

# Create symlinks for the drop-in configuration files
sudo ln -s /etc/systemd/ceph-dependencies.conf /etc/systemd/system/ceph-mds@.service.d/dependencies.conf
sudo ln -s /etc/systemd/ceph-dependencies.conf /etc/systemd/system/ceph-mgr@.service.d/dependencies.conf
sudo ln -s /etc/systemd/ceph-dependencies.conf /etc/systemd/system/ceph-mon@.service.d/dependencies.conf
sudo ln -s /etc/systemd/ceph-dependencies.conf /etc/systemd/system/ceph-osd@.service.d/dependencies.conf
sudo ln -s /etc/systemd/ceph-dependencies.conf /etc/systemd/system/ceph-volume@.service.d/dependencies.conf

Then I did systemctl daemon-reload - it seems to have worked.

ceph-osd@0.service
● ├─-.mount
● ├─frr.service
● ├─sys-subsystem-net-devices-en05.device
● ├─sys-subsystem-net-devices-en06.device

I will keep this on one node for a while and see what it's like over reboots etc.

--edit-- Apr 20th 6:52pm PDT

OK, so this does seem to help (or at least do no harm; still not sure if it is essential)

scyto commented Apr 20, 2025

On your other questions, I think we just need to do some testing. The important thing is that frr requires the IP stack to be up before starting the first time. After this it seems more robust to things restarting or changing. No expert here - just my experience.

oh I agree in principle, I am just trying to understand the exact sequencing and why it varies from machine to machine - without knowing that, the fix is a little hard... also the interfaces coming up doesn't mean IP is up.....

My issue is I can't test for the failures y'all see - they just don't happen on my machine in general. If I run IPv4, yes, eventually it has issues, but that happens inconsistently - when I test bouncing 3 nodes in a row, IPv4 generally comes up every time....

scyto commented Apr 20, 2025

@corvy what kernel are you on?

pve-n05.sh + pve-06.sh (amend the IF="en0x" part for each script)

Were you seeing an issue where both Thunderbolt ports don't come up? .... I just started seeing this.....

...some time later...

I enabled your scripts and added -v - I see it - wow, I literally never had this issue before; I guess Mika and crew improved the Thunderbolt code to bring up the interfaces faster in the kernel! Thanks, will add this to the main gist.
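(corvy's scripts themselves are not reproduced in this thread; a hypothetical sketch of the kind of interface-up helper being discussed - the retry loop and logger tag are illustrative, not corvy's code:

#!/bin/bash
# bring up one thunderbolt interface, retrying until the link accepts the change
IF="en05"   # amend to en06 in the second copy of the script
for i in 1 2 3 4 5; do
    if ip link set "$IF" up; then
        logger -t thunderbolt "$IF is up"
        break
    fi
    sleep 1
done
)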

corvy commented Apr 20, 2025

Yes, I got problems getting both interfaces up early on. The only way I got them up was by running the script I made. I think I also had this issue before going to 6.11, but I honestly don't remember.

I am on 6.11.11-2-pve currently.

scyto commented Apr 21, 2025

@corvy thanks, your script is fabulous. Literally the only thing I am going to do is amend it to log to the system log by default (next week), as using the logger command is my new fave toy.

The biggest thing I need to solve is that pvestatd tries to start VMs before CephFS is fully up, and I stored a VM hook script in a shared snippets location.

this is supposed to be the solution

but it doesn't work

Anyhoo, my new version of the thunderbolt setup gist and the dual fabric gist are done, modulo any typos - that's how my system is configured tho, and it seems to be working.

0xD4 commented Apr 21, 2025

Thank you @scyto for the great work and effort you put into this project. ❤️

I spent hours and days trying to build a Thunderbolt network with my three MS-01s - without success. There are various approaches and solutions from different people in the comments, but nothing worked reliably for me. Either the routing did not work, the services did not start reliably, or there was unexplained instability, making it completely unusable. I reinstalled my PVE cluster several times.

I was about to give up on Thunderbolt and switch to 10G Ethernet. Now it seems that I finally found a stable and reliable IPv4-only variant for the MS-01. In the end, @corvy's solution led to the desired result! The network seems rock solid. Thank you for sharing this with us!
