@scyto
Last active May 28, 2025 21:29

New version of my mesh network using openfabric

Enable Dual Stack (IPv4 and IPv6) OpenFabric Routing

Version 2.5 (2025.04.27)

This gist is part of this series.

This assumes you are running Proxmox 8.4 and that the line source /etc/network/interfaces.d/* is at the end of the interfaces file (this is automatically added to both new and upgraded installations of Proxmox 8.2).

This changes the previous file design, thanks to @NRGNet and @tisayama, to make the system much more reliable in general and more maintainable, especially for folks using IPv4 on the private cluster network (I still recommend the IPv6 FC00 network you will see in these docs).

Notable changes from the original version here:

  • move IP address configuration from interfaces.d/thunderbolt to the frr configuration (I reverted this on 2025.04.27 and improved the settings in interfaces.d/thunderbolt based on recommendations from ChatGPT, to solve issues I hit in my routed network setup, coming soon)
  • new approach to remove the dependency on post-up, with new scripts in if-up.d that log to the system log
  • reminder to copy frr.conf > frr.conf.local to prevent breakage if you enable Proxmox SDN
  • dependent on the changes to the udev link scripts here

This will result in an IPv4 and IPv6 routable mesh network that can survive any one node failure or any one cable failure. All the steps in this section must be performed on each node.

**Notes on Dual Stack**

Having spent 3 days hammering my network and playing with various different routed topologies, my current opinion is:

  • I still prefer IPv6 for my mesh, but if you set up for IPv4 it should now be fine; my gists will continue to assume you used IPv6 for Ceph
  • I have no opinion on Squid and dual stack yet - should be doable... we will see
  • if you use ONLY IPv6, for the love-of-god(tm) make sure that ms_bind_ipv4 = false is set in ceph.conf or really bad things will eventually happen (a sketch of the relevant ceph.conf lines follows this list)
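
For reference, this is roughly what the relevant ceph.conf lines would look like on an IPv6-only cluster (a minimal sketch, not from the original gist; adjust to your own [global] section):

# /etc/ceph/ceph.conf (excerpt) - force Ceph daemons to bind IPv6 only
[global]
    ms_bind_ipv4 = false
    ms_bind_ipv6 = true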

Defining the thunderbolt network

This was revised on 2025.04.27 to move loopback IP addressing back from frr.conf to here (along with some reliability changes recommended by ChatGPT). Having the loopback IPs in frr.conf was a stupid idea, as they should be up irrespective of the state of the mesh so that Ceph processes can start and bind to them.

Create a new file using nano /etc/network/interfaces.d/thunderbolt and populate with the following

# Thunderbolt interfaces for pve1 (Node 81)

auto en05
iface en05 inet6 static
    pre-up ip link set $IFACE up
    mtu 65520

auto en06
iface en06 inet6 static
    pre-up ip link set $IFACE up
    mtu 65520

# Loopback for Ceph MON
auto lo
iface lo inet loopback
    up ip -6 addr add fc00::81/128 dev lo
    up ip addr add 10.0.0.81/32 dev lo

Notes:

  • defining the loopback IPs in the interfaces file is more reliable than in frr.conf; the IP addresses will always be available for the mon, mgr, and mds processes of Ceph to bind to, irrespective of frr service status
  • the MTUs are super important; without them BGP and OpenFabric seem to have node-to-node negotiation issues
  • the pre-up and up directives were recommended by ChatGPT to ensure the interfaces are up before applying the IP address and MTU - this should make things more reliable
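
A quick sanity check after applying the file (my suggestion, assuming node 1's addresses from the example above):

# apply the interfaces file and confirm the addresses/MTU took effect
ifreload -a
ip addr show lo     # expect 10.0.0.81/32 and fc00::81/128 on node 1
ip link show en05   # expect mtu 65520 (link shows UP once a cable is connected)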

Enable IPv4 and IPv6 forwarding

  1. use nano /etc/sysctl.conf to open the file
  2. uncomment #net.ipv6.conf.all.forwarding=1 (remove the # symbol)
  3. uncomment #net.ipv4.ip_forward=1 (remove the # symbol)
  4. save the file
  5. issue reboot now for a complete reboot
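
For reference, the two uncommented lines in /etc/sysctl.conf should end up looking like the below; if you want to avoid the reboot, sysctl -p is the standard way to load them immediately (the reboot is still the safest option):

net.ipv6.conf.all.forwarding=1
net.ipv4.ip_forward=1

# optional: apply without a reboot
sysctl -p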

FRR Setup

Install & enable FRR (not needed on Proxmox 8.4+)

  1. Install Free Range Routing (FRR) apt install frr
  2. Enable frr systemctl enable frr

Enable the fabricd daemon

  1. edit the frr daemons file (nano /etc/frr/daemons) to change fabricd=no to fabricd=yes
  2. save the file
  3. restart the service with systemctl restart frr
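
If you prefer to script that edit, a minimal sketch of the same change (a sed over /etc/frr/daemons, then the restart):

sed -i 's/^fabricd=no/fabricd=yes/' /etc/frr/daemons
grep '^fabricd=' /etc/frr/daemons   # should now print fabricd=yes
systemctl restart frr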

Mitigate FRR Timing Issues (I need someone with an MS-01 to confirm if this helps solve their IPv4 issues)

Create a script that is automatically run when en05/en06 are brought up, to restart frr.

notes

  • this should make IPv4 more stable for all users (I ended up seeing IPv4 issues too, just less commonly than MS-01 users)
  • I found the changes I introduced in the 2.5 version of this gist make this less needed; occasionally ifreload / ifupdown2 may cause enough changes that frr gets restarted too often and the service will need to be unblocked with systemctl (see the sketch after the script below)
  1. create a new file with nano /etc/network/if-up.d/en0x
  2. add to file the following
#!/bin/bash
# note the logger entries log to the system journal in the pve UI etc

INTERFACE=$IFACE

if [ "$INTERFACE" = "en05" ] || [ "$INTERFACE" = "en06" ]; then
    logger "Checking if frr.service is running for $INTERFACE"
    
    if ! systemctl is-active --quiet frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] frr.service not running. Starting service."
        if systemctl start frr.service; then
            logger -t SCYTO "   [SCYTO SCRIPT ] Successfully started frr.service"
        else
            logger -t SCYTO "   [SCYTO SCRIPT ] Failed to start frr.service"
        fi
        exit 0
    fi

    logger "Attempting to reload frr.service for $INTERFACE"
    if systemctl reload frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] Successfully reloaded frr.service for $INTERFACE"
    else
        logger -t SCYTO "   [SCYTO SCRIPT ] Failed to reload frr.service for $INTERFACE"
    fi
fi
  1. make it executable with chmod +x /etc/network/if-up.d/en0x
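
To confirm the hook fired, and to recover if frr ever hits systemd's start rate limit (the lockout mentioned in the notes above), a couple of standard systemd/journal commands (my suggestion, not part of the original script):

journalctl -t SCYTO -b        # show the SCYTO-tagged log lines from this boot
systemctl status frr          # check whether the service is active or start-limited
systemctl reset-failed frr    # clear a start-rate-limit lockout...
systemctl start frr           # ...then start the service again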

Mitigate issues caused by things that reset the loopback

Create a script that is automatically run when lo is reprocessed by ifreload, ifupdown2, pve set, etc.

  1. create a new file with nano /etc/network/if-up.d/lo
  2. add to file the following
#!/bin/bash

INTERFACE=$IFACE

if [ "$INTERFACE" = "lo" ]  ; then
    logger "Attempting to restart frr.service for $INTERFACE"
    if systemctl restart frr.service; then
        logger -t SCYTO "   [SCYTO SCRIPT ] Successfully restarted frr.service for $INTERFACE"
    else
        logger -t SCYTO "   [SCYTO SCRIPT ] Failed to restart frr.service for $INTERFACE"
    fi
fi

make it executable with chmod +x /etc/network/if-up.d/lo
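
If you want to test the hook without waiting for lo to be reprocessed, one way (an assumption on my part, relying on the script reading $IFACE from the environment) is to invoke it by hand and check the journal:

IFACE=lo /etc/network/if-up.d/lo   # simulate ifupdown2 bringing lo up
journalctl -t SCYTO -n 5           # expect the "restart frr.service for lo" messages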

Configure OpenFabric (perform on all nodes)

**Note:** if (and only if) you have already configured SDN, you should make these settings in /etc/frr/frr.conf.local and reapply your SDN configuration to have SDN propagate them into frr.conf (you can also make the edits to both files if you prefer). If you make these edits only in frr.conf with SDN active and then reapply SDN, it will lose these settings.

  1. enter the FRR shell with vtysh
  2. optionally show the current config with show running-config
  3. enter the configure mode with configure
  4. Apply the below configuration (it is possible to cut and paste this into the shell instead of typing it manually; you may need to press return to set the last !. Also check there were no errors in response to the pasted text).

Note: the x in the net statement below should be the number of the node you are working on. For example, node 1 would use 1 in place of x.

ip forwarding
ipv6 forwarding

interface en05
 ip router openfabric 1
 ipv6 router openfabric 1
 openfabric hello-interval 1
 openfabric hello-multiplier 3
 openfabric csnp-interval 5
 openfabric psnp-interval 2
exit

interface en06
 ip router openfabric 1
 ipv6 router openfabric 1
 openfabric hello-interval 1
 openfabric hello-multiplier 3
 openfabric csnp-interval 5
 openfabric psnp-interval 2
exit

interface lo
 ip router openfabric 1
 ipv6 router openfabric 1
 openfabric hello-interval 1
 openfabric hello-multiplier 3
 openfabric csnp-interval 5
 openfabric psnp-interval 2
 openfabric passive
exit

router openfabric 1
net 49.0000.0000.000x.00
lsp-gen-interval 5
exit
!
exit

  1. you may need to press return after the last exit to get to a new line - if so do this
  2. save the config with write memory
  3. confirm the configuration applied correctly with show running-config - note the order of the items will be different to how you entered them and that's ok. (If you made a mistake, I found the easiest way to fix it was to edit /etc/frr/frr.conf - but be careful if you do that.)
  4. use the command exit to leave setup
  5. repeat these steps on the other two nodes
  6. once you have configured all 3 nodes, issue the command vtysh -c "show openfabric topology"; if you did everything right you will see something like the following (note it may take 45 seconds for all routes to show if you just restarted frr for any reason):
Area 1:
IS-IS paths to level-2 routers that speak IP
Vertex               Type         Metric Next-Hop             Interface Parent
pve1                                                                  
10.0.0.81/32         IP internal  0                                     pve1(4)
pve2                 TE-IS        10     pve2                 en06      pve1(4)
pve3                 TE-IS        10     pve3                 en05      pve1(4)
10.0.0.82/32         IP TE        20     pve2                 en06      pve2(4)
10.0.0.83/32         IP TE        20     pve3                 en05      pve3(4)

IS-IS paths to level-2 routers that speak IPv6
Vertex               Type         Metric Next-Hop             Interface Parent
pve1                                                                  
fc00::81/128         IP6 internal 0                                     pve1(4)
pve2                 TE-IS        10     pve2                 en06      pve1(4)
pve3                 TE-IS        10     pve3                 en05      pve1(4)
fc00::82/128         IP6 internal 20     pve2                 en06      pve2(4)
fc00::83/128         IP6 internal 20     pve3                 en05      pve3(4)

IS-IS paths to level-2 routers with hop-by-hop metric
Vertex               Type         Metric Next-Hop             Interface Parent

Now you should be in a place to ping each node from every node across the thunderbolt mesh using IPv4 or IPv6 as you see fit.
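
For example, from pve1 (loopback addresses as used in this gist; adjust for your own numbering):

ping -c 3 10.0.0.82   # IPv4 loopback of pve2
ping -c 3 fc00::82    # IPv6 loopback of pve2
ping -c 3 10.0.0.83   # IPv4 loopback of pve3
ping -c 3 fc00::83    # IPv6 loopback of pve3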

IMPORTANT - you need to do this to stop SDN breaking you in the future

If all is working, issue cp /etc/frr/frr.conf /etc/frr/frr.conf.local. This is because when enabling Proxmox SDN, Proxmox will overwrite frr.conf - however it will read the .local file and apply that.

**Note:** if you already have SDN configured, do not do the step above, as you will mess up both your SDN and this OpenFabric topology (see the note at the start of the FRR instructions).

Based on this response https://forum.proxmox.com/threads/relationship-of-frr-conf-and-frr-conf-local.165465/ : if you have SDN, all local (non-SDN) configuration changes should be made in .local; this is read the next time SDN apply is used. Do not copy frr.conf to frr.conf.local after doing anything with SDN, or when you tear down SDN the settings will not be removed from frr.conf.
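
A quick way to see whether your .local file and what SDN last wrote to frr.conf have drifted apart (my suggestion, not part of the original workflow):

diff -u /etc/frr/frr.conf.local /etc/frr/frr.conf   # the SDN-managed frr.conf should contain your .local settings plus SDN's own additions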

@christensenjairus

I saw your post in the Proxmox forum. I think I'm trying to do the same thing as you. I need frr for my local mesh network (100gbe) but SDN blows away the file. I also get strange functionality when using simple routing instead of frr, so I'm interested to see what the answer is there.

@ronindesign commented Apr 25, 2025

good to hear it's working, did you use my old gist, this new gist or forge your own path (i am asking to know if the instructions work above, along with the changes in the thunderbolt gist)

I followed all of your most recent gists (including this gist @ v2.1), including edits made over the last few days. I thought it would be a helpful opportunity to provide feedback with a fresh cluster deployment using your latest instructions.

All has worked 100%, no issues, multiple reboots, TB network survives every time, without errors; haven't tried unplugging/replugging much at all yet. I doubt it matters, but I am using BIOS v1.26 (latest) on the MS-01s, due to Proxmox instability on previous BIOS versions (e.g. kernel panics, etc.)

Anyways, don't want to add any further noise -- clearly bigger fish to fry with SDN it looks like! Just wanted to give a 👍for latest revisions of the guide. Thanks again so much, coming back to this a year or more later and it's such a helpful resource, appreciate all your time and energy on it (hope the surgery went ok!)

@scyto (Author) commented Apr 25, 2025

I thought it would be a helpful opportunity to provide feedback with a fresh cluster deployment using your latest instructions.

thanks, i really appreciate that, glad to hear it worked!

Eek on the BIOS issues, i hadn't heard about that. Add as much noise as you want, i do :-) (and yes my surgery went well, thanks for asking)

@scyto (Author) commented Apr 25, 2025

I saw your post in the Proxmox forum. I think I'm trying to do the same thing as you. I need frr for my local mesh network (100gbe) but SDN blows away the file. I also get strange functionality when using simple routing instead of frr, so I'm interested to see what the answer is there.

yeah, i searched for frr.conf.local in the forum and realized i couldn't find a good description of how it is used, i also found that the SDN left networking.service in weird invalid states until a reboot - i will repeat my SDN tests if i get time (though this weekend is a new server rack so that will take most of my time!)

@scyto (Author) commented Apr 25, 2025

@ALL i changed the guidance on copying frr.conf after SDN has been configured - if you copy frr.conf to frr.conf.local after configuring SDN then SDN won't tear down the settings as it thinks they are local and not SDN settings, and this means SDN settings remain in your frr.conf when they shouldn't

@scyto (Author) commented Apr 25, 2025

@folks using these settings

 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2
  • how long does it take from doing an frr restart till you see all 3 routes doing vtysh -c "sh open topo"?
  • have you had any issues with flapping routes - where the route changes constantly (this could cause variable ping times for example or even dropped packets as the routing changes)?

my testing shows it doesn't make convergence of routes faster at frr service start - seems to always take 45 seconds+

hmm well this is interesting https://chatgpt.com/share/680bcb97-3598-800d-9c54-22f27173f658

@scyto (Author) commented Apr 25, 2025

i think the 3 settings above are basically irrelevant on startup, i don't think they harm, i don't know what benefit they are giving - like the vtysh line that is also irrelevant (and i notice that SDN adds it too).

try adding the 3 spf and 1 lsp settings below to your router section - for me the routes converge almost instantly compared to >45 seconds before on frr start.... this would mean ceph has the chance to come up 45 seconds faster.....

--edit= those 3 spf settings caused crashes as they were not supposed to be in the router section, thanks chatgpt

i have this configured on all 3 nodes, if no one experiences issues i will add these 3 new settings to the gist

(these settings may not be a good thing where there is a large routed network, but fine for homelabs / esp isolated mesh)

example of what my node 3 looks like:

!
router openfabric 1
 net 49.0000.0000.0003.00
 lsp-gen-interval 5
exit
!

it might also be good to move to point-to-point links rather than broadcast; then csnp and hello timings are basically irrelevant, might test that over the weekend

@xenpie commented Apr 25, 2025

 openfabric csnp-interval 2
 openfabric hello-interval 1
 openfabric hello-multiplier 2

I have been using these for a while, since I saw them in the SDN forum tutorial and they didn't seem to cause any harm so I just kept them in.

* how long does it take from doing an frr restart till you see all 3 routes doing `vtysh -c "sh open topo"`?

I just checked on my system, after restarting the frr service it takes less than 5 seconds before I see all the routes. Tried it multiple times on all nodes, always with the same result.

* have you had any issues with flapping routes - where the route changes constantly (this could cause variable ping times for example or even dropped packets as the routing changes)?

I'd say no but then again not sure if I would notice it with my current use case. I just ran a quick ping test for 10 minutes and it looks good to me.

--- 10.0.0.82 ping statistics ---
574 packets transmitted, 574 received, 0% packet loss, time 586720ms
rtt min/avg/max/mdev = 0.038/0.145/0.358/0.059 ms
root@pve1:~#

--- 10.0.0.83 ping statistics ---
570 packets transmitted, 570 received, 0% packet loss, time 582585ms
rtt min/avg/max/mdev = 0.044/0.131/0.316/0.050 ms
root@pve2:~#

--- 10.0.0.81 ping statistics ---
567 packets transmitted, 567 received, 0% packet loss, time 579606ms
rtt min/avg/max/mdev = 0.045/0.139/0.345/0.054 ms
root@pve3:~#

@scyto (Author) commented Apr 25, 2025

I just checked on my system, after restarting the frr service it takes less than 5 seconds before I see all the routes.

thanks, interesting, those made no difference to the route convergence time on startup for me, agree they are harmless in a small isolated mesh

@scyto (Author) commented Apr 26, 2025

@ALL i edited the settings under the router section - don't use the spf settings i had there earlier; remove them immediately if you implemented them or things will get very wonky

@scyto (Author) commented Apr 27, 2025

so i have spent the day with chatgpt and ceph - trying several new topologies, hilariously most didn't work, but i understand why and chatgpt moved me on - until we got right back to basically the design in this gist with a few key differences - i am not ready to post that, but as part of this i needed to move my cluster from having /128s on each node to having /64 addresses (part of a plan to try different routing options, as it really looks like thunderbolt ports cannot be bridged!)

Anyhoo. this is the migration plan i did, chatgpt made the document content and markdown for me too based on the hours of conversations i had....

https://gist.github.com/scyto/64e79a694b286d3b70f8b3663d19eb76

not linking to this in my gists, but thought folks might be interested, i can share the chatgpt logs of how i got here, but it's long and starts with a broadcast storm issue (after trying a bridging solution to allow VMs to bridge to the thunderbolt network) and is several hours of troubleshooting very very broken ceph clusters, times when i ignored its instructions, etc. if anyone thinks that would be interesting i can link to that too

this is an FYI as i just thought it was incredibly interesting how chatgpt let me try many different mesh network configurations, gave me wrong answers sometimes, but ultimately helped me in the back and forth

-edit-

shit i asked it to summarize what the setup was when we started before migration and how to make it,

and it gave me this! straight away https://gist.github.com/scyto/bdd5381fe9170ec10009cddf8687446b - not sure why it insists this is IS-IS when it's openfabric, but whatever, i can edit that, the rest is right

--edit2--
so now i am using it for options on how to connect VMs to the ceph mesh, it remembered from hours ago that bridging doesn't work with thunderbolt (at least it doesn't for me and that's what i told it)

now it offers to summarize what to do AND because i have twice asked for gist.md format asks me if i want it in that, i am beyond impressed

@scyto (Author) commented Apr 27, 2025

image

it is a bit too fucking chipper mind you

@scyto (Author) commented Apr 27, 2025

i have been doing this nearly 10 hrs straight....

i now have a fully routed mesh network - VMs can access the ceph mesh network, anything anywhere on my lan can access the mesh network - i have tested with ssh and ping, ceph next..... going to bed now.... oh and so far i see no evidence i need the frr restart scripts either.... but no promises.... but it now seems to all work as it should.... will publish a v3 setup in the next few days.... no complex SDN stuff needed....

@ronindesign

Success -- very nice! Can't wait to see the results, well done! Will be great to be able to bridge for VM access.

@eidgenosse

A first rough test with an MS-01 shows that the reboot problem is fixed with the script. Thanks a lot for that.

@scyto (Author) commented Apr 28, 2025

@ALL i modified this gist to move the loopback IP addressing back from frr.conf into the thunderbolt interfaces file; this is to ensure the loopback addresses remain present no matter what frr and thunderbolt are doing - this will solve a bunch of failure edge cases. Sorry i ever thought it was good to put them in frr.conf.

@scyto (Author) commented Apr 28, 2025

Success -- very nice! Can't wait to see the results, well done! Will be great to be able to bridge for VM access.

you are not going to like how complex it is.... i literally couldn't have done this without chatGPT i hit sooooo many issues - the root cause seems to be TB interfaces advertise to the kernel they are only point to point links

i am hoping someone can show me how this could have been done more simply... note i went for gold and got routing working for every client on my LAN to reach ceph too.... if you don't want that you can ignore the bgp stuff, i may refactor this first draft to reflect that before posting the link in the gist TOC

https://gist.github.com/scyto/dbbe5483f2779228ff743c5f333effe0

and two failed attempts at getting chatgpt to refactor it for me - i think i broke its processing, it just kept losing info and context

https://gist.github.com/scyto/a02bbcf947f4a18773c30fa3d12bf495
https://gist.github.com/scyto/935b6d214ee6d87741fb5e9646e98161

i will get to cleaning all this up over the next week, thought i would share these as a giggle, let me be clear: the two links above are chatgpt mangles of the initial link where i asked it to refactor it...

@scyto (Author) commented Apr 28, 2025

A first rough test with an MS-01 shows that the reboot problem is fixed with the script. Thanks a lot for that.

thanks to the folks who suggested the original and edits, i just chucked it into chatGPT and got it to improve it slightly! I like xenpie's consolidated one - just need to test it, we also probably need to account for service lockout (when it restarts too many times too quickly and disables itself - i hit that a few times...)

@scyto (Author) commented Apr 29, 2025

@ALL - ok here is the first version of how to give VMs access to the mesh

https://gist.github.com/scyto/dbbe5483f2779228ff743c5f333effe0

@scyto (Author) commented Apr 29, 2025

@ALL and now how to access the mesh from any LAN client

https://gist.github.com/scyto/c0df83c269c5f5c192cb8a08a0d4a559

@ronindesign

Fantastic! Will test asap. I've got another 3x MS-01 cluster to deploy and will wait until I've reviewed the above before working on that deployment so I can again use it as a test in deploying the most recent changes.

Appreciate all your work on this, really amazing!

@scyto (Author) commented Apr 30, 2025

Appreciate all your work on this, really amazing!

thanks i appreciate that, let me know how it goes and I can update / modify

for example, last night i found that linux versions which ship ifreload packaging instead of ifupdown2 have issues if two interfaces are configured (like in the VM example) - very weird

@scyto (Author) commented May 2, 2025

I made a script to help me troubleshoot - it attempts to show me all the client connections / connections to MONs, MDSs and what VMs are mapped to OSDs with what client IP (aka proxmox host IP).

Idea is to make sure that connections are not leaking on to my LAN now that i have full mesh routing.

https://github.com/scyto/ceph-connections

Is this something folks might find useful? (it probably needs a lot more work to make it portable - like a different, better way to derive client names - currently it cheats by using the MDS connections as a reference table...)

@scyto (Author) commented May 2, 2025

I also made a script that wraps FIO to make benchmarking easier and prevent me trashing a disk with FIO accidentally :-)

https://github.com/scyto/fio-test-script

let me know if it's something interesting / i should keep working on.

@scyto (Author) commented May 2, 2025

and lastly this is my first draft of how to mount ceph across the network into a VM with routed network, or any device on LAN if you have implemented that gist

https://gist.github.com/scyto/61b38c47cb2c79db279ee1cbb6f31772

personally based on benchmarks i will stay with virtioFS, i guess i should write up my approach / need hookscripts to make sure VM only starts if the cephFS mount is there

@zejar commented May 5, 2025

Great write-up! I do have a question about the experience of others regarding MTU size between the nodes.
From my experience utilising an MTU of 65520 results in rather unstable iperf3 performance (haven't gotten to setting up Ceph yet) over thunderbolt between the nodes (3x MS-01). Even an MTU of 9000 isn't as stable as an MTU of 1500.

IOMMU has been enabled in the Grub config and the thunderbolt affinity script has been used to select the P-cores for the processing of traffic over the thunderbolt interfaces.

Below some iperf3 results (IPv4 and IPv6 are similar, using point-to-point addresses instead of loopback addresses to rule out as much as possible):

MTU 1500 (upload & download)
Connecting to host fd00::2, port 5201
[  5] local fd00::1 port 48008 connected to fd00::2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.37 GBytes  20.3 Gbits/sec  305    945 KBytes
[  5]   1.00-2.00   sec  2.46 GBytes  21.1 Gbits/sec  728    883 KBytes
[  5]   2.00-3.00   sec  2.47 GBytes  21.2 Gbits/sec  315   1.15 MBytes
[  5]   3.00-4.00   sec  2.45 GBytes  21.0 Gbits/sec  495   1.10 MBytes
[  5]   4.00-5.00   sec  2.50 GBytes  21.5 Gbits/sec  364   1.14 MBytes
[  5]   5.00-6.00   sec  2.42 GBytes  20.8 Gbits/sec  495    866 KBytes
[  5]   6.00-7.00   sec  2.44 GBytes  21.0 Gbits/sec  360   1.09 MBytes
[  5]   7.00-8.00   sec  2.43 GBytes  20.9 Gbits/sec  495    890 KBytes
[  5]   8.00-9.00   sec  2.45 GBytes  21.0 Gbits/sec  405   1.07 MBytes
[  5]   9.00-10.00  sec  2.44 GBytes  21.0 Gbits/sec  405   1.23 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  24.4 GBytes  21.0 Gbits/sec  4367             sender
[  5]   0.00-10.00  sec  24.4 GBytes  21.0 Gbits/sec                  receiver

iperf Done.


Connecting to host fd00::2, port 5201
Reverse mode, remote host fd00::2 is sending
[  5] local fd00::1 port 49726 connected to fd00::2 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  2.89 GBytes  24.8 Gbits/sec
[  5]   1.00-2.00   sec  2.83 GBytes  24.3 Gbits/sec
[  5]   2.00-3.00   sec  2.73 GBytes  23.5 Gbits/sec
[  5]   3.00-4.00   sec  2.76 GBytes  23.8 Gbits/sec
[  5]   4.00-5.00   sec  2.80 GBytes  24.0 Gbits/sec
[  5]   5.00-6.00   sec  2.77 GBytes  23.8 Gbits/sec
[  5]   6.00-7.00   sec  2.73 GBytes  23.4 Gbits/sec
[  5]   7.00-8.00   sec  2.75 GBytes  23.6 Gbits/sec
[  5]   8.00-9.00   sec  2.70 GBytes  23.2 Gbits/sec
[  5]   9.00-10.00  sec  2.72 GBytes  23.4 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  27.7 GBytes  23.8 Gbits/sec  5370             sender
[  5]   0.00-10.00  sec  27.7 GBytes  23.8 Gbits/sec                  receiver

iperf Done.
MTU 9000 (upload & download)
Connecting to host fd00::2, port 5201
[  5] local fd00::1 port 52748 connected to fd00::2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.45 GBytes  21.0 Gbits/sec  3577   1003 KBytes
[  5]   1.00-2.00   sec  1.90 GBytes  16.3 Gbits/sec  2321   1.12 MBytes
[  5]   2.00-3.00   sec  1.43 GBytes  12.3 Gbits/sec  1700    968 KBytes
[  5]   3.00-4.00   sec  1.88 GBytes  16.1 Gbits/sec  2575   1.01 MBytes
[  5]   4.00-5.00   sec  2.36 GBytes  20.3 Gbits/sec  3282   1.04 MBytes
[  5]   5.00-6.00   sec  2.34 GBytes  20.1 Gbits/sec  3125    994 KBytes
[  5]   6.00-7.00   sec  2.31 GBytes  19.9 Gbits/sec  2463   1.16 MBytes
[  5]   7.00-8.00   sec  2.36 GBytes  20.3 Gbits/sec  3084   1020 KBytes
[  5]   8.00-9.00   sec  2.27 GBytes  19.5 Gbits/sec  2386    619 KBytes
[  5]   9.00-10.00  sec  1.89 GBytes  16.2 Gbits/sec  2545    872 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  21.2 GBytes  18.2 Gbits/sec  27058             sender
[  5]   0.00-10.00  sec  21.2 GBytes  18.2 Gbits/sec                  receiver

iperf Done.


Connecting to host fd00::2, port 5201
Reverse mode, remote host fd00::2 is sending
[  5] local fd00::1 port 38058 connected to fd00::2 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  2.19 GBytes  18.8 Gbits/sec
[  5]   1.00-2.00   sec  2.70 GBytes  23.2 Gbits/sec
[  5]   2.00-3.00   sec  2.65 GBytes  22.8 Gbits/sec
[  5]   3.00-4.00   sec  2.15 GBytes  18.5 Gbits/sec
[  5]   4.00-5.00   sec  2.13 GBytes  18.3 Gbits/sec
[  5]   5.00-6.00   sec  2.09 GBytes  18.0 Gbits/sec
[  5]   6.00-7.00   sec  2.14 GBytes  18.4 Gbits/sec
[  5]   7.00-8.00   sec  1.56 GBytes  13.4 Gbits/sec
[  5]   8.00-9.00   sec  2.64 GBytes  22.7 Gbits/sec
[  5]   9.00-10.00  sec  2.67 GBytes  22.9 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  22.9 GBytes  19.7 Gbits/sec  23803             sender
[  5]   0.00-10.00  sec  22.9 GBytes  19.7 Gbits/sec                  receiver

iperf Done.
MTU 65520 (upload & download)
Connecting to host fd00::2, port 5201
[  5] local fd00::1 port 35406 connected to fd00::2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.58 GBytes  13.5 Gbits/sec  735   63.9 KBytes
[  5]   1.00-2.00   sec  1.75 GBytes  15.0 Gbits/sec  837   1.50 MBytes
[  5]   2.00-3.00   sec  2.28 GBytes  19.6 Gbits/sec  955   1.19 MBytes
[  5]   3.00-4.00   sec  1.90 GBytes  16.3 Gbits/sec  784   2.18 MBytes
[  5]   4.00-5.00   sec  2.21 GBytes  19.0 Gbits/sec  839    831 KBytes
[  5]   5.00-6.00   sec  1.51 GBytes  13.0 Gbits/sec  680   2.18 MBytes
[  5]   6.00-7.00   sec  2.20 GBytes  18.9 Gbits/sec  920   2.37 MBytes
[  5]   7.00-8.00   sec  2.22 GBytes  19.1 Gbits/sec  897   1.19 MBytes
[  5]   8.00-9.00   sec   909 MBytes  7.62 Gbits/sec  403   2.75 MBytes
[  5]   9.00-10.00  sec  1.94 GBytes  16.7 Gbits/sec  1048   1.12 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  18.5 GBytes  15.9 Gbits/sec  8098             sender
[  5]   0.00-10.00  sec  18.5 GBytes  15.9 Gbits/sec                  receiver

iperf Done.

Connecting to host fd00::2, port 5201
Reverse mode, remote host fd00::2 is sending
[  5] local fd00::1 port 50834 connected to fd00::2 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  2.70 GBytes  23.2 Gbits/sec
[  5]   1.00-2.00   sec  2.05 GBytes  17.6 Gbits/sec
[  5]   2.00-3.00   sec  2.57 GBytes  22.1 Gbits/sec
[  5]   3.00-4.00   sec  2.06 GBytes  17.7 Gbits/sec
[  5]   4.00-5.00   sec  2.12 GBytes  18.2 Gbits/sec
[  5]   5.00-6.00   sec  2.16 GBytes  18.6 Gbits/sec
[  5]   6.00-7.00   sec  2.51 GBytes  21.6 Gbits/sec
[  5]   7.00-8.00   sec  2.11 GBytes  18.1 Gbits/sec
[  5]   8.00-9.00   sec  2.13 GBytes  18.3 Gbits/sec
[  5]   9.00-10.00  sec  1.59 GBytes  13.6 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  22.0 GBytes  18.9 Gbits/sec  8358             sender
[  5]   0.00-10.00  sec  22.0 GBytes  18.9 Gbits/sec                  receiver

iperf Done.

What is the experience (and results) of others?

@DarkPhyber-hg

@zejar which cpu do you have on your MS-01s? I've got 3x 13900h and i've been trying to get my retries down to near zero. I've made a lot of progress, but my initial results were nowhere near that bad, and mine are really only bad in bi-directional tests.

@zejar commented May 19, 2025

@DarkPhyber-hg Hmm that is strange, I also have the i9-13900H in my three MS-01's.
Which microcode are you using on your MS-01's? I am running "microcode : 0x4124" (grep microcode /proc/cpuinfo | uniq) and BIOS version 1.26.
Also, which Thunderbolt cables are you using? I'm using the Cable Matters TB4 cables (80cm).

@DarkPhyber-hg commented May 19, 2025

@zejar to answer your questions:
on all 3 nodes i'm running microcode : 0x4124.
I'm running firmware 1.27 on all 3 nodes.
I'm currently running the opt-in kernel 6.14.0-2-pve
I'm using 30cm OWC cables, they're a little tight but i wanted to have as short cables as possible.

you can see everything i did on another page in this gist. I spammed like 5 posts in a row. https://gist.github.com/scyto/67fdc9a517faefa68f730f82d7fa3570?permalink_comment_id=5579176#gistcomment-5579176

If i have time before i leave for vacation i'm going to turn off the traffic shaping and try different MTU sizes.

@Randymartin1991 commented May 28, 2025

I got everything working and the ping as well; however, I do not see the loopback interfaces in the GUI, therefore I cannot use it as a Ceph cluster network or do anything with it. I am doing only an IPv4 version, but this should not be an issue.

I run proxmox 8.4 and I have created the new thunderbolt file: /etc/network/interfaces.d/thunderbolt
With the content:
auto en05
iface en05 inet static
pre-up ip link set $IFACE up
mtu 65520

auto en06
iface en06 inet static
pre-up ip link set $IFACE up
mtu 65520

#Loopback for Ceph MON
auto lo
iface lo inet loopback
up ip addr add 10.10.10.1/32 dev lo

I do have the interfaces en05 and en06 in the GUI but not the lo.
Here the fabric:

IS-IS paths to level-2 routers that speak IP
Vertex Type Metric Next-Hop Interface Parent

node1
10.10.10.1/32 IP internal 0 node1(4)
node2 TE-IS 10 node2 en05 node1(4)
node3 TE-IS 10 node3 en06 node1(4)
10.10.10.2/32 IP TE 20 node2 en05 node2(4)
10.10.10.3/32 IP TE 20 node3 en06 node3(4)

IS-IS paths to level-2 routers with hop-by-hop metric
Vertex Type Metric Next-Hop Interface Parent

What am I missing?
