Thunderbolt Networking Setup

Thunderbolt Networking

This gist is part of a series.

You will need Proxmox kernel 6.2.16-14-pve or higher.

Load Kernel Modules

  • add the thunderbolt and thunderbolt-net kernel modules (this must be done on all nodes - yes, i know it can sometimes work without them, but thunderbolt-net has some interesting behaviour, so do as i say and add both ;-)
    1. nano /etc/modules and add the modules at the bottom of the file, one on each line (see the sketch below)
    2. save using ctrl-x, then y, then enter
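For reference, a minimal sketch of what the end of /etc/modules should contain after this step (any existing entries above these stay untouched):

# /etc/modules: kernel modules to load at boot time, one per line
thunderbolt
thunderbolt-net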

Prepare /etc/network/interfaces

Doing this means we don't have to give each Thunderbolt interface a manual IPv6 address and that these addresses stay constant no matter what. Add the following to each node using nano /etc/network/interfaces.

If you see any sections called thunderbolt0 or thunderbolt1, delete them at this point.

Create entries to prepopulate the GUI with a reminder

Doing this means we don't have to give each Thunderbolt interface a manual IPv6 or IPv4 address and that these addresses stay constant no matter what.

Add the following to each node using nano /etc/network/interfaces; this reminds you not to edit en05 and en06 in the GUI.

This fragment should go between the existing auto lo section and the adapter sections.

iface en05 inet manual
#do not edit in GUI

iface en06 inet manual
#do not edit in GUI

If you see any thunderbolt sections, delete them from the file before you save it.

DO NOT DELETE the `source /etc/network/interfaces.d/*` line - it will always exist on the latest versions and should be the last or next-to-last line in the interfaces file.
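For orientation only, a hedged sketch of how the top of /etc/network/interfaces might look after this edit (your existing adapter and bridge stanzas stay exactly as they are; the placeholder comment stands in for them):

auto lo
iface lo inet loopback

iface en05 inet manual
#do not edit in GUI

iface en06 inet manual
#do not edit in GUI

# ...your existing adapter and vmbr bridge sections remain here unchanged...

source /etc/network/interfaces.d/*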

Rename Thunderbolt Connections

This is needed as Proxmox doesn't recognize the default Thunderbolt interface names. There are various methods to do this; this method was selected after trial and error because:

  • the thunderboltX naming is not fixed to a port (it seems to be based on the sequence in which you plug the cables in)
  • the MAC address of the interfaces changes with most cable insertion and removal events
  1. use the udevadm monitor command to find your device IDs when you insert and remove each TB4 cable (see the sketch after this list). Yes, you can use other ways to do this; i recommend this one as it is a great way to understand what udev does - the command proved more useful to me than syslog or lspci for troubleshooting Thunderbolt issues and behaviours. In my case my two PCI paths are 0000:00:0d.2 and 0000:00:0d.3; if you bought the same hardware this will be the same on all 3 units. Don't assume your PCI device paths will be the same as mine.

  2. create a link file using nano /etc/systemd/network/00-thunderbolt0.link and enter the following content:

[Match]
Path=pci-0000:00:0d.2
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en05
  3. create a second link file using nano /etc/systemd/network/00-thunderbolt1.link and enter the following content:
[Match]
Path=pci-0000:00:0d.3
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en06
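As promised above, a sketch of the udevadm monitor step used to find the PCI paths; the 0000:00:0d.x paths are just the values from my units:

# run this, then unplug and replug each TB4 cable in turn;
# the PCI path (e.g. .../0000:00:0d.2/...) appears in the DEVPATH of the add/remove events
udevadm monitor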

Set Interfaces to UP on reboots and cable insertions

This section ensures that the interfaces will be brought up at boot or on cable insertion with whatever settings are in /etc/network/interfaces. This shouldn't need to be done; it seems like a bug in the way Thunderbolt networking is handled (i assume this is Debian-wide but haven't checked).

Huge thanks to @corvy for figuring out a script that should make this much, much more reliable for most.

  1. create a udev rule to detect cable insertion using nano /etc/udev/rules.d/10-tb-en.rules with the following content:
ACTION=="move", SUBSYSTEM=="net", KERNEL=="en05", RUN+="/usr/local/bin/pve-en05.sh"
ACTION=="move", SUBSYSTEM=="net", KERNEL=="en06", RUN+="/usr/local/bin/pve-en06.sh"
  2. save the file

  3. create the first script referenced above using nano /usr/local/bin/pve-en05.sh with the following content:

#!/bin/bash

LOGFILE="/tmp/udev-debug.log"
VERBOSE="" # Set this to "-v" for verbose logging
IF="en05"

echo "$(date): pve-$IF.sh triggered by udev" >> "$LOGFILE"

# If multiple interfaces go up at the same time, 
# retry 10 times and break the retry when successful
for i in {1..10}; do
    echo "$(date): Attempt $i to bring up $IF" >> "$LOGFILE"
    /usr/sbin/ifup $VERBOSE $IF >> "$LOGFILE" 2>&1 && {
        echo "$(date): Successfully brought up $IF on attempt $i" >> "$LOGFILE"
        break
    }
  
    echo "$(date): Attempt $i failed, retrying in 3 seconds..." >> "$LOGFILE"
    sleep 3
done

save the file and then

  4. create the second script referenced above using nano /usr/local/bin/pve-en06.sh with the following content:
#!/bin/bash

LOGFILE="/tmp/udev-debug.log"
VERBOSE="" # Set this to "-v" for verbose logging
IF="en06"

echo "$(date): pve-$IF.sh triggered by udev" >> "$LOGFILE"

# If multiple interfaces go up at the same time, 
# retry 10 times and break the retry when successful
for i in {1..10}; do
    echo "$(date): Attempt $i to bring up $IF" >> "$LOGFILE"
    /usr/sbin/ifup $VERBOSE $IF >> "$LOGFILE" 2>&1 && {
        echo "$(date): Successfully brought up $IF on attempt $i" >> "$LOGFILE"
        break
    }
  
    echo "$(date): Attempt $i failed, retrying in 3 seconds..." >> "$LOGFILE"
    sleep 3
done

and save the file

  5. make both scripts executable with chmod +x /usr/local/bin/*.sh
  6. run update-initramfs -u -k all to propagate the new link files into the initramfs
  7. reboot (restarting networking, init 1 and init 3 are not good enough, so reboot)
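A quick post-reboot sanity check, assuming the steps above were applied as written (the log path comes from the scripts above):

# the renamed interfaces should appear as en05/en06 once the cables are connected
ip -br link show en05
ip -br link show en06

# the udev-triggered scripts log their bring-up attempts here
cat /tmp/udev-debug.log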

Enabling IP Connectivity

proceed to the next gist

Slow Thunderbolt Performance? Too Many Retries? No traffic? Try this!

verify neighbors can see each other (connectivity troubleshooting)

Install LLDP - this is a great way to see which nodes can see each other.

  • install lldpctl with apt install lldpd on all 3 nodes
  • execute lldpctl - you should see neighbor info
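The same two steps as a copy-paste block:

# on all 3 nodes
apt install lldpd

# then on each node - the neighbors seen on en05/en06 should be listed
lldpctl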

make sure iommu is enabled (speed troubleshooting)

if you are having speed issues, make sure the following is set on the kernel command line in the /etc/default/grub file: intel_iommu=on iommu=pt. Once set, be sure to run update-grub and reboot.

everyone's grub command line is different; this is mine because i also have i915 virtualization. If you get this wrong you can break your machine; if you are not doing that, you don't need any i915 entries.

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt" (note: if you have more things in your cmd line DO NOT REMOVE them, just add the two IOMMU options; it doesn't matter where).
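Put together, the sequence looks like this (a sketch; your existing GRUB_CMDLINE_LINUX_DEFAULT contents will differ):

nano /etc/default/grub     # add intel_iommu=on iommu=pt to GRUB_CMDLINE_LINUX_DEFAULT
update-grub
reboot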

Pinning the Thunderbolt Driver (speed and retries troubleshooting)

identify your P and E cores by running the following

cat /sys/devices/cpu_core/cpus && cat /sys/devices/cpu_atom/cpus

you should get two lines on an Intel system with P and E cores: the first line should be your P cores, the second line your E cores

for example on mine:

root@pve1:/etc/pve# cat /sys/devices/cpu_core/cpus && cat /sys/devices/cpu_atom/cpus
0-7
8-15

create a script to apply affinity settings every time a Thunderbolt interface comes up

  1. make a file at /etc/network/if-up.d/thunderbolt-affinity
  2. add the following to it - make sure to replace echo X-Y with whatever the output above reported as your performance cores, e.g. echo 0-7
#!/bin/bash

# Check if the interface is either en05 or en06
if [ "$IFACE" = "en05" ] || [ "$IFACE" = "en06" ]; then
    # Set Thunderbolt affinity to P-cores
    grep thunderbolt /proc/interrupts | cut -d ":" -f1 | xargs -I {} sh -c 'echo X-Y | tee "/proc/irq/{}/smp_affinity_list"'
fi
  3. save the file - done
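To confirm the affinity actually took effect after an interface comes up, the same pipeline can be used read-only (a sketch; IRQ numbers are machine-specific):

# each thunderbolt IRQ should report your P-core range, e.g. 0-7
grep thunderbolt /proc/interrupts | cut -d ":" -f1 | xargs -I {} cat "/proc/irq/{}/smp_affinity_list"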

Extra Debugging for Thunderbolt

dynamic kernel tracing - adds more info to dmesg without overwhelming dmesg

I have only tried this on 6.8 kernels, so YMMV. If you want more TB messages in dmesg to see why a connection might be failing, here is how to turn on dynamic tracing.

For boot time you will need to add it to the kernel command line by adding thunderbolt.dyndbg=+p to your /etc/default/grub file, running update-grub and rebooting.

To expand the example above:

`GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt thunderbolt.dyndbg=+p"`  

Don't forget to run update-grub after saving the change to the grub file.

For runtime debug you can run the following command (it will revert on the next boot, so it can't be used to capture what happens at boot time).

`echo -n 'module thunderbolt =p' > /sys/kernel/debug/dynamic_debug/control`
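To check whether the Thunderbolt debug statements are actually enabled (a sketch; it reads the same control file the command above writes to):

# enabled call sites show flags like "=p" instead of "=_"
grep thunderbolt /sys/kernel/debug/dynamic_debug/control | head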

install tbtools

these tools can be used to inspect your Thunderbolt system. Note that they rely on Rust being installed; you must use the rustup script below and not install Rust via the package manager at this time (9/15/24).

apt install pkg-config libudev-dev git curl
curl https://sh.rustup.rs -sSf | sh
git clone https://github.com/intel/tbtools
restart your ssh session
cd tbtools
cargo install --path .
@DarkPhyber-hg

DarkPhyber-hg commented May 13, 2025

I've been doing some troubleshooting the last couple of days on my cluster due to some recent stability issues. Basically I'm just tracking down every message in the logs that looks out of place in hopes I figure something out.

I think I was the first person on the proxmox forums who discovered, at least on the minisforum ms-01, that you need to pin the thunderbolt irq's to the P-cores. Some folks were asking about smp_affinity vs smp_affinity_list: they do the same thing, just in a different way. smp_affinity is a bitmask to determine which cores to use; I prefer smp_affinity_list because it is human readable.

I have a 3 node ms-01 13900h cluster (PVE2, PVE3, PVE4 - there is no PVE1, it's retired). I've noticed some errors in my ceph log that seem to occur alongside dropped packets. While trying to figure this out, I discovered one of my nodes has worse tb-network performance and more retries than the others with iperf3. The only thing different about this node is it has a pcie hba; otherwise the hardware is identical.

I'm not seeing excessively high cpu utilization in top, nor excessively high software interrupts. ksoftirqd processes are not using excessive cpu. I still can't figure this out.

However I had a thought that might be helpful for others, especially others with slower CPUs: hyper-threading. Each thunderbolt link basically has 2 IRQs that it hits hard, one for transmit and one for receive. So since we have 2 thunderbolt links, we have 4 IRQs that are really driving interrupts. With hyperthreading the OS displays 2 logical cores that are really the same physical CPU. It's possible the OS could assign both interrupts to the same physical core, which might result in a little bit of a performance penalty. For example, on my box core 0 & 1 are the same physical core; if they're both slammed with IRQ requests I theorize that we might see worse performance. So I looked at /proc/interrupts and determined which IRQs I need to focus on. For me that wound up being TB1: 133/134 and TB2: 250/251, where 133 is send or receive and 134 is likely the opposite. And I assigned each interrupt to a different physical core manually. I looked at "core id" in /proc/cpuinfo to identify different physical cores, just to verify.

echo 0,1 > /proc/irq/133/smp_affinity_list  #this is physical core 0
echo 2,3 > /proc/irq/134/smp_affinity_list  #this is physical core 1
echo 4,5 > /proc/irq/250/smp_affinity_list #this is physical core 3
echo 6,7 > /proc/irq/251/smp_affinity_list  # this is physical core 4

Now I also tried only picking even/odd cores, for example echo "0,2,4,6,8,10" across all of these, but I ran into a situation where two IRQs were using the same physical core, causing a performance penalty.

Not ideal since I'm telling the OS I'm smarter than it (and trust me, I'm not), but it seems to have slightly helped performance on my slow node; retries seem to have been cut in half. The interrupt numbers on the slow node were different than on the other 2 nodes; this is due to the pcie card taking up an interrupt. On the other 2 nodes it doesn't seem to have made much of a difference - my retries were pretty low anyway, so I'm not surprised.

EDIT:

This oddly only seemed to help one of my MS-01s, PVE4. It actually hurt performance on my other 2 ms-01s; they performed best when TX/RX were on the same physical core, so for example when irq 133 was on logical core 0 and irq 134 was on logical core 1. It's possible there's something else going on in my configuration with PVE4 that's causing this.
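For anyone wanting to reproduce the mapping described above, a minimal sketch (the IRQ numbers 133/134/250/251 are specific to that machine and only examples):

# list the Thunderbolt IRQ numbers on this node
grep thunderbolt /proc/interrupts

# show which logical CPUs are hyper-thread siblings (same "core id" = same physical core)
grep -E '^(processor|core id)' /proc/cpuinfo | paste - -

# then pin each IRQ to its own physical core, e.g.
echo 0,1 > /proc/irq/133/smp_affinity_list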

@DarkPhyber-hg

Posting another update on my issue tracking down less-than-perfect thunderbolt performance. Also sharing a little more information; I'd like to know if anyone else is seeing these ceph messages. They show up in the ceph osd logs and in the journal logs.

What has sent me down the path of investigating TB-net performance is that whichever node is running my microsoft exchange server VM seems to be locking up. This server has the highest amount of iops - a lot of small read/write operations to the database. My theory is that ceph is glitching out due to dropped packets: the dropped packets are causing ceph to somehow lose communication with the rest of the cluster, or to pause long enough to piss off the VM, causing the VM to lock up and eventually the host to lock up. I could be completely wrong; it's just my current theory until I disprove it.

You may note the "errors" - I hate to call them errors since they just seem informational - seem to happen on the hour. And for the most part they do, but not always. The ones that happen on the hour were easy enough to track down: I run proxmox backup every hour on my nodes; backups are sent to PBS over my 10gig network, not over thunderbolt. So my thought process is backups cause high io, disk activity, etc. Could a spike in activity be causing the dominoes to start falling - dropped packets, ceph glitches, vm hangs, then the host goes down? I can't replicate these messages every time I run a backup, but if I run my backups manually a few times it happens, and while watching ifconfig and the interface statistics I can see that every time there are aio_submit retry messages, the thunderbolt interface(s) increment the values for rx dropped, rx error, and rx frame. These values also increment when I saturate the interface with iperf3 and get retries, which makes sense - probably losing interrupt requests, thus losing data/packets. Which has put me down the path of trying to improve the quality of the thunderbolt network.

Here are the ceph messages; I'd be curious to see if anyone else is experiencing them:

May 14 02:00:12 pve3 ceph-osd[2440]: 2025-05-14T02:00:12.450-0400 7702695c26c0 -1 bdev(0x5b127bcb7000 /var/lib/ceph/osd/ceph-2/block) aio_submit retries 3
May 14 02:00:12 pve3 ceph-osd[2440]: 2025-05-14T02:00:12.454-0400 77026adc56c0 -1 bdev(0x5b127bcb7000 /var/lib/ceph/osd/ceph-2/block) aio_submit retries 5
May 14 02:00:12 pve3 ceph-osd[2440]: 2025-05-14T02:00:12.458-0400 770267dbf6c0 -1 bdev(0x5b127bcb7000 /var/lib/ceph/osd/ceph-2/block) aio_submit retries 6
May 14 02:00:12 pve3 ceph-osd[2440]: 2025-05-14T02:00:12.458-0400 77026e5cc6c0 -1 bdev(0x5b127bcb7000 /var/lib/ceph/osd/ceph-2/block) aio_submit retries 6
May 14 02:00:12 pve3 ceph-osd[2440]: 2025-05-14T02:00:12.459-0400 77026a5c46c0 -1 bdev(0x5b127bcb7000 /var/lib/ceph/osd/ceph-2/block) aio_submit retries 6
May 14 02:00:12 pve3 ceph-osd[2440]: 2025-05-14T02:00:12.459-0400 770266dbd6c0 -1 bdev(0x5b127bcb7000 /var/lib/ceph/osd/ceph-2/block) aio_submit retries 6
May 14 04:00:13 pve3 ceph-osd[2423]: 2025-05-14T04:00:13.641-0400 7955f34276c0 -1 bdev(0x63529cf35400 /var/lib/ceph/osd/ceph-4/block) aio_submit retries 2
May 14 05:00:13 pve3 ceph-osd[2423]: 2025-05-14T05:00:13.699-0400 7955ef41f6c0 -1 bdev(0x63529cf35400 /var/lib/ceph/osd/ceph-4/block) aio_submit retries 1
May 14 05:00:13 pve3 ceph-osd[2423]: 2025-05-14T05:00:13.700-0400 7955f1c246c0 -1 bdev(0x63529cf35400 /var/lib/ceph/osd/ceph-4/block) aio_submit retries 2
May 14 11:00:14 pve3 ceph-osd[2423]: 2025-05-14T11:00:14.625-0400 7955f34276c0 -1 bdev(0x63529cf35400 /var/lib/ceph/osd/ceph-4/block) aio_submit retries 2
May 14 11:00:14 pve3 ceph-osd[2423]: 2025-05-14T11:00:14.625-0400 7955f1c246c0 -1 bdev(0x63529cf35400 /var/lib/ceph/osd/ceph-4/block) aio_submit retries 2

I made a few more changes after my last post. Last night I upgraded the kernel to the 6.14 pve opt-in kernel on all 3 nodes. My thunderbolt networking performance seems to have improved a bit more.

Using a 10 sec bidir iperf3 between a fast node and the "slow" node I was seeing about 19/22 gbps with around 2200/880 retries. Now with the updated kernel and pinning irq by physical core instead of logical core, I'm seeing about 24/22gbps with about 700/860 retries. In a single direction iperf maxes out at 26gbps and has minimal retries. I didn't do a meticulous job tracking performance/load through the various changes I've made; it's possible only 1 of these made a measurable impact and the other is just in my head. I'd be curious if anyone else were to try these changes.

@DarkPhyber-hg
Copy link

DarkPhyber-hg commented May 14, 2025

Sharing a shower-thought. I have not tested this yet, but will once I get a stable system. I was previously running the powersave governor on all cores; until I get a stable system I have all cores set to performance. I have seen more drops in thunderbolt-networking with powersave, but felt it was an acceptable tradeoff. I know others have had similar findings. Since I am considering pinning each IRQ to a specific core, I wonder if we can mix governors: assign thunderbolt to specific cores, run those cores with the performance governor, and set all the other cores to powersave. Maybe something someone with a stable system wants to try.

I was lazy and just used chatgpt to write the scripts

identify available cpu governors:
for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors; do
  cpu=$(basename $(dirname $(dirname $file)))
  echo -n "$cpu: "
  cat "$file"
done


cpu0: performance powersave
cpu10: performance powersave
cpu11: performance powersave
cpu12: performance powersave
cpu13: performance powersave
cpu14: performance powersave
cpu15: performance powersave
cpu16: performance powersave
cpu17: performance powersave
cpu18: performance powersave
cpu19: performance powersave
cpu1: performance powersave
cpu2: performance powersave
cpu3: performance powersave
cpu4: performance powersave
cpu5: performance powersave
cpu6: performance powersave
cpu7: performance powersave
cpu8: performance powersave
cpu9: performance powersave

Verify currently active cpu governors:
for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do   cpu=$(basename $(dirname $(dirname $file)));   echo -n "$cpu: ";   cat "$file"; done

cpu0: performance
cpu10: performance
cpu11: performance
cpu12: performance
cpu13: performance
cpu14: performance
cpu15: performance
cpu16: performance
cpu17: performance
cpu18: performance
cpu19: performance
cpu1: performance
cpu2: performance
cpu3: performance
cpu4: performance
cpu5: performance
cpu6: performance
cpu7: performance
cpu8: performance
cpu9: performance


then just change the value of /sys/devices/system/cpu/cpu<X>/cpufreq/scaling_governor to whichever available governor you want to use for each core <X> independently. I have no idea how the system would behave if a hyperthreading core had 1 logical core set to performance and the other set to powersave; I imagine strange things would occur.

example might be something like this - note that you can't redirect a single echo into a glob of multiple files, so brace expansion plus tee is the working form:

echo "performance" | tee /sys/devices/system/cpu/cpu{0..7}/cpufreq/scaling_governor
echo "powersave" | tee /sys/devices/system/cpu/cpu{8..19}/cpufreq/scaling_governor

EDIT: I tested this yesterday. It made virtually no difference in power usage (my pdu measures power draw per outlet) on Proxmox opt-in kernel 6.14.0-2 with all aspm disabled, but it did still cause a significant increase in dropped packets. I don't feel like it's worth experimenting with any further.

@DarkPhyber-hg

DarkPhyber-hg commented May 17, 2025

Spamming another update. While I haven't had a lockup of my vm since the last changes I made, I'm still looking to improve the retries and these ceph aio_submit retry messages in my system log.

I've been doing a bunch of testing since I still think it stems from packet loss. I've found no appreciable difference messing with kernel-level settings for tcp window size, net.core.rmem_max and wmem_max, and a few other kernel-level settings; in fact I often made things worse. I also tried disabling offloading on the thunderbolt interfaces and it made performance worse, but I didn't methodically try different offloading combinations - I have seen improvements on physical nics from disabling only specific offloading parameters.

At this point I'm thinking either there's some kind of issue with flow control not working right, or the thunderbolt controller just can't keep up and is dropping/corrupting data when it's loaded bidirectionally. Why do I say corrupting? Because looking at interface stats using ip link I'm also seeing crc errors. I see crc errors on all of my nodes; I'm using certified owc tb4 cables and I even tried an expensive active apple thunderbolt 4 cable, which rules out bad cables.

root@pve2:~# ip -s -s link show en05
10: en05: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc cake state UP mode DEFAULT group default qlen 1000
    link/ether 02:ea:fb:06:bf:19 brd ff:ff:ff:ff:ff:ff
    RX:    bytes  packets errors dropped  missed   mcast           
    215534087648 67161931    514       0     480       0 
    RX errors:     length    crc   frame    fifo overrun
                        2     32       0       0       0 
    TX:    bytes  packets errors dropped carrier collsns           
    300539920795 16346041      0       0       0       0 
    TX errors:    aborted   fifo  window heartbt transns
                        0      0       0       0       2 

I decided to mess around with the queueing discipline (qdisc) first, thinking it might be a flow control issue. On my machine the thunderbolt interfaces default to a qdisc of pfifo_fast; in my testing this has the highest retries. I found arguably fewer retries with pfifo, enough of an improvement with fq to say it's not within the range of error, and a significant improvement with fq_codel. I found on average a 60-70% reduction in retries with fq_codel, and with "iperf3 --bidir" a bidirectional 25-26gbps. I was still getting some packet drops on the interfaces, but as long as the application layer wasn't getting pissed off I'm not sure I care that much.

I wanted to take it a step further, since with ceph both en05 and en06 can be loaded concurrently. So I ran 2 x iperf3 servers on different ports on one node, say PVE3. Then I ran bidirectional iperf3s from both PVE2 and PVE4 at the same time to PVE3, the idea being to load both thunderbolt ports on PVE3 and see what happens. I immediately saw significantly reduced performance and increased retries. When I was running the iperf3 from a single machine I was seeing 25-26gbps both ways, but when both machines were hitting PVE3, throughput dropped off and was kind of asymmetric - for example I saw something like 14gbps/18gbps. I verified this with different nodes running the server each time.

Now I remember a post earlier from @razqqm using tc qdiscs to rate limit, so I tried a few rate-limiting qdiscs. I tried cake and tbf w/ fq_codel; I didn't try htb as @razqqm used. Cake is much easier to configure, but I thought maybe tbf w/ fq_codel might perform better, since fq_codel performed better on its own. I experimented with both of them at different bandwidth limits; on my 13900h's, 15gbps seemed to be about the sweet spot when loading en05 and en06 at the same time while seeing minimal retries. I didn't see any significant difference in performance between the two of them, so I implemented cake. I still get retries and packet loss if both interfaces are loaded, but significantly less. Also significantly less packet loss and fewer crc errors in production.

In production I'm still getting some of the ceph aio_submit retry messages in my system logs; however, both are significantly reduced. I'm hopeful I can resolve these damn lockups, especially since I'm going on vacation in a week. I'm still trying to isolate a few more possible causes, but I'm hopeful others may find my multi-post novel here helpful.
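For anyone who wants to try the same experiments at runtime before persisting anything, a sketch (qdisc changes made this way do not survive a reboot):

# see what qdisc an interface currently uses
tc qdisc show dev en05

# try fq_codel, or cake with a bandwidth cap
tc qdisc replace dev en05 root fq_codel
tc qdisc replace dev en05 root cake bandwidth 15gbit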

To set the qdisc on boot, create a file in /etc/network/if-up.d/; I called mine set-qdisc.

vi /etc/network/if-up.d/set-qdisc

#!/bin/bash

# Check if the interface is either en05 or en06
if [ "$IFACE" = "en05" ] || [ "$IFACE" = "en06" ]; then
   tc qdisc del dev "$IFACE" root
   tc qdisc replace dev "$IFACE" root cake bandwidth 15gbit
fi
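One assumption worth making explicit: like the pve-en05/en06 scripts earlier in the gist, an if-up.d script only runs if it is executable, so presumably:

chmod +x /etc/network/if-up.d/set-qdisc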
