This gist is part of a series.
- add the thunderbolt and thunderbolt-net kernel modules (this must be done on all nodes - yes, I know it can sometimes work without them, but the thunderbolt-net one has interesting behaviour, so do as I say and add both ;-)

  nano /etc/modules

  Add the modules at the bottom of the file, one on each line, then save (Ctrl+X, then Y, then Enter).
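Once saved, the end of /etc/modules should simply contain the two module names, one per line:

```
thunderbolt
thunderbolt-net
```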
Doing this means we don't have to give each thunderbolt interface a manual IPv6 or IPv4 address, and that these addresses stay constant no matter what.

Add the following to each node using nano /etc/network/interfaces. This fragment should go between the existing auto lo section and the adapter sections. The comments are there to remind you not to edit en05 and en06 in the GUI.
iface en05 inet manual
#do not edit in GUI

iface en06 inet manual
#do not edit in GUI
If you see any thunderbolt0 or thunderbolt1 sections, delete them from the file before you save it.

**DO NOT DELETE** the `source /etc/network/interfaces.d/*` line - it will always exist on the latest versions and should be the last or next-to-last line in the /etc/network/interfaces file.
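For orientation, here is a minimal sketch of how the relevant parts of /etc/network/interfaces end up looking (your NIC and vmbr0 sections will differ and stay untouched):

```
auto lo
iface lo inet loopback

iface en05 inet manual
#do not edit in GUI

iface en06 inet manual
#do not edit in GUI

# ... your existing NIC / vmbr0 sections stay here unchanged ...

source /etc/network/interfaces.d/*
```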
Next, the thunderbolt interfaces need to be renamed to en05 and en06. This is needed as Proxmox doesn't recognize the thunderbolt interface names. There are various methods to do this; this method was selected after trial and error because:

- the thunderboltX naming is not fixed to a port (it seems to be based on the sequence in which you plug the cables in)
- the MAC address of the interfaces changes with most cable insertion and removal events
- use the `udevadm monitor` command to find your device IDs when you insert and remove each TB4 cable (see the illustrative example below). Yes, you can use other ways to do this; I recommend this one as it is a great way to understand what udev does - the command proved more useful to me than the syslog or the `lspci` command for troubleshooting thunderbolt issues and behaviours. In my case my two PCI paths are `0000:00:0d.2` and `0000:00:0d.3`; if you bought the same hardware this will be the same on all 3 units. Don't assume your PCI device paths will be the same as mine.
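To illustrate what to watch for (the timestamp and exact device path below are made up for the example), the kernel uevents printed by udevadm monitor contain the PCI path you need near the start of the device path:

```
udevadm monitor
# example of the kind of line that appears when a TB4 cable is inserted (illustrative only):
# KERNEL[1234.567890] add   /devices/pci0000:00/0000:00:0d.2/domain0/0-1 (thunderbolt)
```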
- create a link file using `nano /etc/systemd/network/00-thunderbolt0.link` and enter the following content:
[Match]
Path=pci-0000:00:0d.2
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en05
- create a second link file using `nano /etc/systemd/network/00-thunderbolt1.link` and enter the following content:
[Match]
Path=pci-0000:00:0d.3
Driver=thunderbolt-net
[Link]
MACAddressPolicy=none
Name=en06
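If you want to sanity-check the link files before rebooting, systemd's net_setup_link builtin will tell you which .link file matches an interface and what name it would assign. Run it against whatever the interface is currently called (e.g. thunderbolt0), with the cable connected:

```
udevadm test-builtin net_setup_link /sys/class/net/thunderbolt0
```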
This section ensures that the interfaces will be brought up at boot or on cable insertion with whatever settings are in /etc/network/interfaces - this shouldn't need to be done; it seems like a bug in the way thunderbolt networking is handled (I assume this is Debian-wide but haven't checked).

Huge thanks to @corvy for figuring out a script that should make this much, much more reliable for most.
- create a udev rule to detect cable insertion using `nano /etc/udev/rules.d/10-tb-en.rules` with the following content:
ACTION=="move", SUBSYSTEM=="net", KERNEL=="en05", RUN+="/usr/local/bin/pve-en05.sh"
ACTION=="move", SUBSYSTEM=="net", KERNEL=="en06", RUN+="/usr/local/bin/pve-en06.sh"
- save the file
- create the first script referenced above using `nano /usr/local/bin/pve-en05.sh` with the following content:
#!/bin/bash
LOGFILE="/tmp/udev-debug.log"
VERBOSE="" # Set this to "-v" for verbose logging
IF="en05"
echo "$(date): pve-$IF.sh triggered by udev" >> "$LOGFILE"
# If multiple interfaces go up at the same time,
# retry 10 times and break the retry when successful
for i in {1..10}; do
echo "$(date): Attempt $i to bring up $IF" >> "$LOGFILE"
/usr/sbin/ifup $VERBOSE $IF >> "$LOGFILE" 2>&1 && {
echo "$(date): Successfully brought up $IF on attempt $i" >> "$LOGFILE"
break
}
echo "$(date): Attempt $i failed, retrying in 3 seconds..." >> "$LOGFILE"
sleep 3
done
- save the file
- create the second script referenced above using `nano /usr/local/bin/pve-en06.sh` with the following content:
#!/bin/bash
LOGFILE="/tmp/udev-debug.log"
VERBOSE="" # Set this to "-v" for verbose logging
IF="en06"
echo "$(date): pve-$IF.sh triggered by udev" >> "$LOGFILE"
# If multiple interfaces go up at the same time,
# retry 10 times and break the retry when successful
for i in {1..10}; do
echo "$(date): Attempt $i to bring up $IF" >> "$LOGFILE"
/usr/sbin/ifup $VERBOSE $IF >> "$LOGFILE" 2>&1 && {
echo "$(date): Successfully brought up $IF on attempt $i" >> "$LOGFILE"
break
}
echo "$(date): Attempt $i failed, retrying in 3 seconds..." >> "$LOGFILE"
sleep 3
done
- save the file
- make both scripts executable with `chmod +x /usr/local/bin/*.sh`
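A quick sanity check at this point: confirm the executable bit really got set; the shared log file both scripts write to is also the place to look later if an interface doesn't come up:

```
ls -l /usr/local/bin/pve-en05.sh /usr/local/bin/pve-en06.sh   # should show -rwxr-xr-x
tail -f /tmp/udev-debug.log   # watch the scripts fire on a cable re-plug (after the reboot below)
```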
- run `update-initramfs -u -k all` to propagate the new link files into the initramfs
- reboot (restarting networking, init 1 and init 3 are not good enough, so reboot)
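After the reboot, a quick way to confirm the renaming worked (assuming both TB4 cables are connected):

```
# both interfaces should exist under the names from the .link files
ip -br link show en05
ip -br link show en06
```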
## Install LLDP - this is great to see which nodes can see which.
- install lldpd (which provides lldpctl) with `apt install lldpd` on all 3 nodes
- execute `lldpctl`; you should see neighbour info
If you are having speed issues, make sure the following is set on the kernel command line in the /etc/default/grub file: `intel_iommu=on iommu=pt`. Once set, be sure to run `update-grub` and reboot.
Everyone's grub command line is different - this one is mine. I also have i915 virtualization on this machine; if you get that wrong you can break your machine, and if you are not doing i915 virtualization you don't need any i915 entries.
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
(Note: if you have more things in your cmd line, DO NOT REMOVE them - just add the two intel ones; it doesn't matter where.)
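After running update-grub and rebooting, you can confirm the flags actually made it onto the running kernel command line:

```
cat /proc/cmdline   # should now include intel_iommu=on iommu=pt
```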
Run `cat /sys/devices/cpu_core/cpus && cat /sys/devices/cpu_atom/cpus` to find your cores. You should get two lines on an Intel system with P and E cores: the first line is your P-cores, the second line is your E-cores.
for example on mine:
root@pve1:/etc/pve# cat /sys/devices/cpu_core/cpus && cat /sys/devices/cpu_atom/cpus
0-7
8-15
- make a file at `/etc/network/if-up.d/thunderbolt-affinity`
- add the following to it - make sure to replace `echo X-Y` with whatever the output above told you your performance cores are, e.g. `echo 0-7`
#!/bin/bash
# Check if the interface is either en05 or en06
if [ "$IFACE" = "en05" ] || [ "$IFACE" = "en06" ]; then
    # Set the affinity of all thunderbolt IRQs to the P-cores
    grep thunderbolt /proc/interrupts | cut -d ":" -f1 | xargs -I {} sh -c 'echo X-Y | tee "/proc/irq/{}/smp_affinity_list"'
fi
- save the file - done
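One extra check worth doing: ifupdown runs the scripts in /etc/network/if-up.d/ via run-parts, which silently skips files that aren't executable, so make sure the script is:

```
chmod +x /etc/network/if-up.d/thunderbolt-affinity
```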
I have only tried this on 6.8 kernels, so YMMV. If you want more TB messages in dmesg to see why a connection might be failing, here is how to turn on dynamic tracing.

For boot time you will need to add it to the kernel command line by adding `thunderbolt.dyndbg=+p` to your /etc/default/grub file, running `update-grub` and rebooting.
To expand the example above:
`GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt thunderbolt.dyndbg=+p"`
Don't forget to run `update-grub` after saving the change to the grub file.
For runtime debug you can run the following command (it will revert on next boot, so this can't be used to capture what happens at boot time).
`echo -n 'module thunderbolt =p' > /sys/kernel/debug/dynamic_debug/control`
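Once enabled (either way), the extra thunderbolt messages show up in the kernel log; something like this is an easy way to watch them live:

```
dmesg -w | grep -i thunderbolt
```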
These tools can be used to inspect your thunderbolt system. Note that they rely on Rust being installed; you must use the rustup script below and not install Rust via the package manager at this time (9/15/24).
apt install pkg-config libudev-dev git curl
curl https://sh.rustup.rs -sSf | sh
git clone https://github.com/intel/tbtools
restart your ssh session
cd tbtools
cargo install --path .
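Assuming the cargo install succeeds, the binaries land in ~/.cargo/bin (hence the ssh session restart, which picks up the PATH change rustup made). From memory the repo provides tools such as tblist for listing connected devices - check the intel/tbtools README for the authoritative list:

```
source "$HOME/.cargo/env"   # or just restart your session, as above
tblist                      # tool name per the tbtools README; lists attached Thunderbolt/USB4 devices
```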
I've been doing some troubleshooting the last couple of days on my cluster due to some recent stability issues. Basically I'm just tracking down every message in the logs that looks out of place in the hope that I figure something out.
I think I was the first person on the Proxmox forums who discovered, at least on the Minisforum MS-01, that you need to pin the thunderbolt IRQs to the P-cores. Some folks were asking about smp_affinity vs smp_affinity_list - they do the same thing, just in a different way. smp_affinity is a bitmask that determines which cores to use; I prefer smp_affinity_list due to it being human-readable.
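To make the difference concrete (illustrative values only - IRQ 133 is just borrowed from the example further down), pinning an IRQ to cores 0-3 can be written either way:

```
# human-readable core list
echo 0-3 > /proc/irq/133/smp_affinity_list
# equivalent hex bitmask (binary 1111 = CPUs 0-3)
echo f > /proc/irq/133/smp_affinity
```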
I have a 3-node MS-01 13900H cluster (PVE2, PVE3, PVE4 - there is no PVE1, it's retired). I've noticed some errors in my Ceph log that seem to occur alongside dropped packets. While trying to figure this out, I discovered that one of my nodes has worse TB-network performance and more retries than the others with iperf3. The only thing different about this node is that it has a PCIe HBA; otherwise the hardware is identical.
I'm not seeing excessively high CPU utilization in top, nor excessively high software interrupts. The ksoftirqd processes are not using excessive CPU. I still can't figure this out.
However, I had a thought that might be helpful for others, especially others with slower CPUs: hyper-threading. Each thunderbolt link basically has 2 IRQs that it hits hard, one for transmit and one for receive. Since we have 2 thunderbolt links, that's 4 IRQs that are really taking interrupts. With hyper-threading the OS displays 2 logical cores that are really the same physical CPU. It's possible the OS could assign both interrupts to the same physical core, which might result in a little bit of a performance penalty. For example, on my box cores 0 & 1 are the same physical core; if they're both slammed with IRQ requests, I theorize that we might see worse performance. So I looked at /proc/interrupts and determined which IRQs I need to focus on. For me that wound up being TB1: 133/134 and TB2: 250/251, where 133 is send or receive and 134 is likely the opposite. Then I assigned each interrupt to a different physical core manually. I looked at the "core id" field in /proc/cpuinfo to identify the different physical cores, just to verify.
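A sketch of what that manual assignment looks like (the IRQ numbers are the 133/134 and 250/251 mentioned above and the core choices are only an example - read your own /proc/interrupts and /proc/cpuinfo first):

```
grep thunderbolt /proc/interrupts   # find the thunderbolt IRQ numbers on this node
grep 'core id' /proc/cpuinfo        # map logical CPUs to physical cores

# pin each thunderbolt IRQ to a different physical core (example values)
echo 0 > /proc/irq/133/smp_affinity_list
echo 2 > /proc/irq/134/smp_affinity_list
echo 4 > /proc/irq/250/smp_affinity_list
echo 6 > /proc/irq/251/smp_affinity_list
```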
I also tried only picking even/odd cores, for example `echo "0,2,4,6,8,10"` across all of these, but I ran into a situation where two IRQs were using the same physical core, causing a performance penalty.
Not ideal, since I'm telling the OS I'm smarter than it (and trust me, I'm not), but it seems to have slightly helped performance on my slow node - retries seem to have been cut in half. The interrupt numbers on the slow node were different than on the other 2 nodes; this is due to the PCIe card taking up an interrupt. On the other 2 nodes it doesn't seem to have made much of a difference; my retries were pretty low anyway, so I'm not surprised.
EDIT:
This oddly only seemed to help one of my MS-01s, PVE4. It actually hurt performance on my other 2 MS-01s; they performed best when TX/RX were on the same physical core - for example, when IRQ 133 was on logical core 0 and IRQ 134 was on logical core 1. It's possible there's something else going on in my configuration with PVE4 that's causing this.