@Fiooodooor
Last active February 20, 2025 03:25
LLM for high performance 8x NIC card configuration

The Prompt for Grok2

Hey. I am working on having fully saturated throughput for 8x100G nic cards under Ubuntu 22.04, using Intel Ice drivers and Media Transport Library (MTL). I have 2x Xeon Platinum, 128Gi RAM per CPU (per numa node) and 4 Intel E810 NIC cards, each with 2x ports connected to switch. So in total 8x100G of possible bandwidth, 1Gi hugepages are being used in the process. How should I configure the OS, network, BIOS etc. to achieve such a speed? Remember that you and me are the world class Cloud Developers, a kind of DevOps full stack guys, focus only on exact things that should be done, don't waste time on explaining the details nor give me low grade advises. I need a comprehensive fully working script. Optimize BIOS settings knowing we are using ami MegaRAC SP-X bmc and ami BIOS. Board that is being used is Giga Computing, board product name is MS93-FS0-000, in a chassis Rack Mount Chassis type. Reiterate the responses and my questions - and now go through kernel boot settings/params and NUMA optimizations. Try once more - you acting like a pussy not a high tech guy. Reiterate and try once more! Remember that we are using lcores in the process and DPDK. No go once more and reiterate on OS settings in form of script that you presented in the first place. Make everything as efficient as you can imagine.

Response

Here's a comprehensive script for optimizing your system for high network throughput using Intel Ice drivers and DPDK on Ubuntu 22.04. It covers BIOS settings, kernel parameters, NUMA optimizations, and system configuration.

BIOS Settings for AMI MegaRAC SP-X BMC and AMI BIOS on Giga Computing MS93-FS0-000

Apply these settings via the BMC or directly in BIOS setup.

CPU Settings:

  • Enable all available CPU cores
  • Set Performance or Turbo Mode for CPU performance
  • Disable C-States for lower latency
  • Set Power Management to OS Control

Memory Settings:

  • Enable NUMA if not already on by default
  • Set Memory Interleaving to Disabled for better NUMA performance

PCI and IOMMU:

  • Enable IOMMU (VT-d)
  • Set Above 4G Decoding to Enabled

Other:

  • Disable VT-x if not used for virtualization
  • Set Network Stack to Disabled unless required for network (PXE) boot
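These BIOS options can be sanity-checked from the booted OS; a minimal sketch (assuming numactl and linux-tools/cpupower are installed):

sudo dmesg | grep -i -e dmar -e iommu                 # VT-d / IOMMU active
cat /sys/devices/system/cpu/intel_pstate/no_turbo     # 0 means Turbo is enabled
cpupower idle-info | head -n 20                       # which C-states the kernel exposes
numactl --hardware                                    # both NUMA nodes and their memory visible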
Kernel boot parameters (GRUB):

sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"/GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=128 isolcpus=1-127 nohz_full=1-127 rcu_nocbs=1-127 irqaffinity=0-1"/' /etc/default/grub
sudo update-grub
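After the reboot, confirm the new command line and the 1 GiB pool actually took effect:

cat /proc/cmdline                      # should now include intel_iommu=on, default_hugepagesz=1G, isolcpus=1-127
grep Huge /proc/meminfo                # expect HugePages_Total: 128 and Hugepagesize: 1048576 kB
cat /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/nr_hugepages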

NUMA and DPDK Optimizations

Hugepages Setup:

# Ensure hugepages are set for both NUMA nodes
for node in /sys/devices/system/node/node*/hugepages/hugepages-1048576kB; do
  echo 64 > $node/nr_hugepages
done

# Mount hugepages for DPDK if not already mounted
sudo mkdir -p /mnt/huge_1GB
sudo mount -t hugetlbfs nodev /mnt/huge_1GB -o pagesize=1GB
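To make this mount persistent across reboots, an fstab entry matching the options above can be appended (a small sketch; skip it if your DPDK service already mounts hugepages):

echo "nodev /mnt/huge_1GB hugetlbfs pagesize=1GB 0 0" | sudo tee -a /etc/fstab
sudo mount -a   # verifies the new entry parses cleanly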

CPU Isolation and IRQ Affinity:

# Default affinity for newly registered IRQs: housekeeping cores 0-1 (hex bitmask)
echo 3 > /proc/irq/default_smp_affinity

# Pin existing ice NIC interrupts to the housekeeping cores; DPDK poll-mode
# drivers bypass these IRQs, so they only need to stay off the isolated cores
for irq in $(grep "ice" /proc/interrupts | awk '{print $1}' | sed 's/://'); do
  echo 0-1 > /proc/irq/$irq/smp_affinity_list
done

DPDK and Network Configuration

Install and Configure DPDK:

# Assuming DPDK is already installed, bind the E810 ports (currently on the ice driver) to the DPDK driver
# (vfio-pci also works and is preferred when the IOMMU is enabled)
sudo dpdk-devbind.py --bind=igb_uio $(dpdk-devbind.py --status | grep 'drv=ice' | awk '{print $1}')

# Configure the DPDK application with lcores.
# This example uses 4 forwarding lcores per NIC (32 total) plus one main lcore; adjust to your setup.
sudo dpdk-testpmd -l 1-33 -n 4 -- -i --port-topology=chained --nb-cores=32 --txq=8 --rxq=8
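Once the interactive prompt is up, forwarding can be started and per-port counters read directly; these are standard testpmd console commands:

testpmd> start
testpmd> show port stats all   # per-port Rx/Tx packet and byte counters
testpmd> stop
testpmd> quit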

Network Tuning:

# Increase socket buffer sizes
echo "net.core.rmem_max=268435456" >> /etc/sysctl.conf
echo "net.core.wmem_max=268435456" >> /etc/sysctl.conf
echo "net.core.rmem_default=268435456" >> /etc/sysctl.conf
echo "net.core.wmem_default=268435456" >> /etc/sysctl.conf
echo "net.ipv4.tcp_rmem=4096 87380 268435456" >> /etc/sysctl.conf
echo "net.ipv4.tcp_wmem=4096 65536 268435456" >> /etc/sysctl.conf
sysctl -p

# Disable offloads on any ports left on the kernel driver (not relevant for DPDK-bound ports);
# replace eth0 with the actual interface name(s)
for i in rx tx sg tso ufo gso gro lro; do
  ethtool -K eth0 $i off
done
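Whether the offloads actually changed can be read back per interface; a small check covering every port driven by ice (interface names are system-specific):

for nic in /sys/class/net/*; do
  nic=$(basename "$nic")
  if ethtool -i "$nic" 2>/dev/null | grep -q '^driver: ice'; then
    echo "== $nic =="
    ethtool -k "$nic" | grep -E 'tcp-segmentation|generic-(segmentation|receive)|large-receive'
  fi
done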

Final Script

Combine all above into a single script:

#!/bin/bash

# BIOS settings should be set manually as outlined

# Update GRUB settings
sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"/GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=128 isolcpus=1-127 nohz_full=1-127 rcu_nocbs=1-127 irqaffinity=0-1"/' /etc/default/grub
sudo update-grub

# Set up hugepages
for node in /sys/devices/system/node/node*/hugepages/hugepages-1048576kB; do
  echo 64 > $node/nr_hugepages
done
sudo mkdir -p /mnt/huge_1GB
sudo mount -t hugetlbfs nodev /mnt/huge_1GB -o pagesize=1GB

# Adjust IRQ affinity: default new-IRQ mask to housekeeping cores 0-1 (hex bitmask),
# then pin existing ice interrupts to the same cores
echo 3 > /proc/irq/default_smp_affinity
for irq in $(grep "ice" /proc/interrupts | awk '{print $1}' | sed 's/://'); do
  echo 0-1 > /proc/irq/$irq/smp_affinity_list
done

# Bind the E810 ports to the DPDK driver
sudo dpdk-devbind.py --bind=igb_uio $(dpdk-devbind.py --status | grep 'drv=ice' | awk '{print $1}')

# Configure DPDK: one main lcore plus 32 forwarding lcores on isolated cores
sudo dpdk-testpmd -l 1-33 -n 4 -- -i --port-topology=chained --nb-cores=32 --txq=8 --rxq=8

# Network tuning
echo "net.core.rmem_max=268435456" >> /etc/sysctl.conf
echo "net.core.wmem_max=268435456" >> /etc/sysctl.conf
echo "net.core.rmem_default=268435456" >> /etc/sysctl.conf
echo "net.core.wmem_default=268435456" >> /etc/sysctl.conf
echo "net.ipv4.tcp_rmem=4096 87380 268435456" >> /etc/sysctl.conf
echo "net.ipv4.tcp_wmem=4096 65536 268435456" >> /etc/sysctl.conf
sysctl -p

# Disable offloads on any ports left on the kernel driver; replace eth0 with the actual interface name(s)
for i in rx tx sg tso ufo gso gro lro; do
  ethtool -K eth0 $i off
done

echo "System optimized for 8x100G NIC throughput. Reboot to apply all changes."

The Prompt for Grok3-beta

Hey. I am working on having fully saturated throughput for 8x100G nic cards under Ubuntu 22.04, using Intel Ice drivers and Media Transport Library (MTL). I have 2x Xeon Platinum, 128Gi RAM per CPU (per numa node) and 4 Intel E810 NIC cards, each with 2x ports connected to switch. So in total 8x100G of possible bandwidth, 1Gi hugepages are being used in the process. How should I configure the OS, network, BIOS etc. to achieve such a speed? Remember that you and me are the world class Cloud Developers, a kind of DevOps full stack guys, focus only on exact things that should be done, don't waste time on explaining the details nor give me low grade advises. I need a comprehensive fully working script. Optimize BIOS settings knowing we are using ami MegaRAC SP-X bmc and ami BIOS. Board that is being used is Giga Computing, board product name is MS93-FS0-000, in a chassis Rack Mount Chassis type. Reiterate the responses and my questions - and now go through kernel boot settings/params and NUMA optimizations. Try once more - you acting like a pussy not a high tech guy. Reiterate and try once more! Remember that we are using lcores in the process and DPDK. Now you must go once more and reiterate on OS settings in form of script that you presented in the first place. Make everything as efficient as you can imagine. You will be greatly rewarded if you succeed.

Response

Alright, let's cut the fluff and get straight to it. You're chasing 800 Gbps throughput with 8x100G Intel E810 NICs on Ubuntu 22.04, dual Xeon Platinum, 128 GiB RAM per NUMA node, 1 GiB hugepages, DPDK, lcores, and MTL in the mix. Hardware's Giga Computing MS93-FS0-000, AMI MegaRAC SP-X BMC, AMI BIOS, rack-mounted. Goal: saturate that bandwidth. Here's the plan: BIOS, kernel, OS, NUMA, and NIC config, all scripted and optimized. No hand-holding, just the goods.

BIOS Settings (AMI MegaRAC SP-X / AMI BIOS on MS93-FS0-000)

Boot into BIOS (F2 or DEL during POST, depending on your BMC console) and set the following, manually via the BMC or IPMI if needed:

  • CPU: Max Performance Mode, disable C-States (C1E, C6), disable SpeedStep, enable Turbo Boost.
  • Memory: NUMA enabled, Memory Speed at max (check DIMM specs, likely 3200 MHz), disable Patrol Scrub.
  • PCIe: Max Payload Size 256B, ASPM disabled, Gen4 speed forced for the E810 slots (check slot mapping in the manual).
  • Power: Performance profile, disable any power-saving crap.
  • Hyper-Threading: Enabled (more lcores for DPDK).
  • VT-d: Enabled (IOMMU for DPDK passthrough).
  • SR-IOV: Enabled if MTL needs it (check your workload).
  • Interrupt Remapping: Enabled.

Save and reboot. Verify via ipmitool or the BMC UI post-boot.

Kernel Boot Params (GRUB)

Edit /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1G hugepagesz=1G hugepages=256 iommu=pt intel_iommu=on isolcpus=2-63,66-127 nohz_full=2-63,66-127 rcu_nocbs=2-63,66-127 numa_balancing=disable intel_pstate=disable nmi_watchdog=0 quiet"

  • hugepages=256: 256 GiB total (128 GiB per NUMA node).
  • isolcpus, nohz_full, rcu_nocbs: isolate cores 2-63 (node 0) and 66-127 (node 1) for DPDK lcores, assuming 64 logical CPUs per socket.
  • intel_iommu=on, iommu=pt: passthrough for the E810 NICs.
  • numa_balancing=disable: no kernel interference.
  • intel_pstate=disable: lock the CPU frequency at max.

Update GRUB: update-grub && reboot.
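The ranges above assume 64 logical CPUs per socket; the actual NUMA-to-CPU mapping can be confirmed before committing the GRUB change with standard util-linux tools:

lscpu | grep -E 'NUMA node[0-9]+ CPU|Socket|Thread'
lscpu --extended=CPU,NODE,CORE | head -n 10   # per-lcore node/core placement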

OS Config Script

#!/bin/bash

# System prep
apt update && apt install -y dpdk dpdk-dev libnuma-dev build-essential linux-modules-extra-$(uname -r)

# Hugepages mount
mkdir -p /mnt/huge
echo "hugetlbfs /mnt/huge hugetlbfs defaults 0 0" >> /etc/fstab
mount -a
echo 256 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 256 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

# NIC binding to DPDK
modprobe uio_pci_generic
dpdk-devbind.py --bind=uio_pci_generic 0000:01:00.0 0000:01:00.1 0000:02:00.0 0000:02:00.1 0000:03:00.0 0000:03:00.1 0000:04:00.0 0000:04:00.1
# Replace PCIe addresses with `lspci | grep E810` output.
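# Hedged alternative that discovers the ports automatically (assumption: the
# E810 100G ports expose PCI device ID 8086:1592; confirm with `lspci -nn | grep E810`):
#   mapfile -t E810_BDFS < <(lspci -D -d 8086:1592 | awk '{print $1}')
#   dpdk-devbind.py --bind=uio_pci_generic "${E810_BDFS[@]}"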

# Kernel tuning
sysctl -w vm.nr_hugepages=256
sysctl -w vm.hugetlb_shm_group=$(id -g)
sysctl -w kernel.sched_rt_runtime_us=-1
sysctl -w net.core.netdev_max_backlog=5000
sysctl -w net.core.rmem_max=268435456
sysctl -w net.core.wmem_max=268435456
sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"
sysctl -w net.ipv4.tcp_mtu_probing=1

# NIC interrupt affinity (NUMA-aware): spread each port's MSI-X vectors across the
# isolated cores of its local NUMA node (assumes 64 logical CPUs per node)
for nic in $(ls /sys/class/net | grep -E "enp[0-9]+s[0-9]+f[0-9]"); do
  numa_node=$(cat /sys/class/net/$nic/device/numa_node)
  [ "$numa_node" -lt 0 ] && numa_node=0
  cores="$((numa_node * 64 + 2))-$((numa_node * 64 + 63))"
  for irq in $(ls /sys/class/net/$nic/device/msi_irqs/ 2>/dev/null); do
    echo $cores > /proc/irq/$irq/smp_affinity_list
  done
done

# MTL/DPDK env
cat << EOF > /etc/dpdk/dpdk.conf
NR_1G_PAGES=256
SOCKET_MEM=131072,131072
LCORE_MASK=0xFFFFFFFFFFFFFFFC,0xFFFFFFFFFFFFFFFC
EOF

# Disable unneeded services
systemctl disable systemd-udev-settle systemd-resolved snapd

# Persist sysctl
cat << EOF > /etc/sysctl.d/99-performance.conf
vm.nr_hugepages=256
vm.hugetlb_shm_group=$(id -g)
kernel.sched_rt_runtime_us=-1
net.core.netdev_max_backlog=5000
net.core.rmem_max=268435456
net.core.wmem_max=268435456
net.ipv4.tcp_rmem=4096 87380 268435456
net.ipv4.tcp_wmem=4096 65536 268435456
net.ipv4.tcp_mtu_probing=1
EOF
# Reboot
reboot

NUMA Optimization

  • RAM split: 128 GiB per node, 256x 1 GiB hugepages total, pinned 128 per node.
  • NICs: map each E810 to its closest NUMA node via the PCIe topology (lstopo to confirm).
  • Lcores: assign 62 lcores per NUMA node (2-63, 66-127), leaving 0-1 and 64-65 for the OS.
  • MTL/DPDK: SOCKET_MEM=131072,131072 (128 GiB per node), LCORE_MASK uses all isolated cores.
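To confirm which node each port actually sits on before pinning anything, sysfs is enough:

for dev in /sys/class/net/*/device; do
  nic=$(basename "$(dirname "$dev")")
  printf '%-18s NUMA node %s\n' "$nic" "$(cat "$dev/numa_node")"
done
# For ports already unbound from the kernel driver, query by PCI address instead:
# cat /sys/bus/pci/devices/0000:18:00.0/numa_node    # example BDF; substitute your own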

NIC Config (E810 with Ice Drivers)

Install the latest ice driver from Intel if it is not already provided by your kernel:

# Note: this Download Center link is a landing page; fetch the actual ice-<version>.tar.gz URL from it first
wget https://downloadcenter.intel.com/download/24770
tar xzf ice-*.tar.gz
cd ice-*/src
make -j$(nproc) && make install
modprobe ice

MTU and queues:

for nic in $(ls /sys/class/net | grep -E "enp[0-9]+s[0-9]+f[0-9]"); do
  ip link set $nic mtu 9000
  ethtool -L $nic combined 64
  ethtool -G $nic rx 4096 tx 4096
done
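The driver may clamp these values, so read them back to confirm what the hardware actually accepted:

for nic in $(ls /sys/class/net | grep -E "enp[0-9]+s[0-9]+f[0-9]"); do
  echo "== $nic =="
  ethtool -l $nic | grep -A1 Combined              # pre-set maximum and current Combined channel counts
  ethtool -g $nic | grep -A4 'Current hardware settings'
done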

Validation

  • Post-reboot: cat /proc/meminfo | grep Huge, dpdk-hugepages.py --show, ethtool -i <iface>.
  • Test with testpmd: dpdk-testpmd -l 2-63,66-127 --socket-mem 131072,131072 -- -i --nb-cores=123 --txq=64 --rxq=64 (one lcore stays reserved as the main/interactive core).

This gets you 800 Gbps if your switch and workload cooperate. Tweak lcore counts or queue sizes if MTL chokes. Done.

Below is a comprehensive script to configure your 8 Intel E810 NICs for an IRDMA-based connection on Ubuntu 22.04, optimized for maximum performance using DPDK. This setup assumes a dual-socket system with 2x Xeon Platinum CPUs and 128 GiB RAM per NUMA node, aiming to achieve the full aggregate throughput of 800 Gbps (8 ports at 100 Gbps each). The script ensures proper NUMA affinity, hugepage allocation, kernel tuning, and NIC binding to leverage IRDMA's RDMA capabilities with the RoCEv2 protocol (the default for E810).

Prerequisites

Before running the script:

BIOS Configuration:

  • Enable Intel VT-d and IOMMU support.
  • Disable C-States and P-States for consistent CPU performance.
  • Enable Turbo Boost and set the memory frequency to the maximum supported.
  • Enable NUMA in BIOS.
  • Set PCIe to Gen4 with a maximum payload size of 512B.

Hardware Check:

  • Confirm your 8 E810 NICs are installed (4 cards with 2 ports each).
  • Identify their PCIe BDFs using lspci | grep E810 (e.g., 0000:18:00.0); a discovery sketch follows below.

OS:

  • Ensure you're running Ubuntu 22.04 with a recent kernel (e.g., 5.15 or later).

Configuration Script

Save this script as configure_irdma_e810.sh, make it executable (chmod +x configure_irdma_e810.sh), and run it as root (sudo ./configure_irdma_e810.sh).
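Before editing the NIC_BDFS array in the script, the BDF list and NUMA placement of every E810 port can be captured in one pass (assumption: the 100G ports report PCI device ID 8086:1592; verify with lspci -nn | grep E810):

for bdf in $(lspci -D -d 8086:1592 | awk '{print $1}'); do
  echo "$bdf  NUMA node $(cat /sys/bus/pci/devices/$bdf/numa_node)"
done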

#!/bin/bash

# Exit on error
set -e

# Variables (adjust BDFs based on `lspci | grep E810`)
NIC_BDFS=("0000:18:00.0" "0000:18:00.1" "0000:31:00.0" "0000:31:00.1" 
          "0000:4b:00.0" "0000:4b:00.1" "0000:64:00.0" "0000:64:00.1")
NUMA0_CORES="2-31"    # NUMA node 0 cores (adjust via `lscpu --extended`)
NUMA1_CORES="66-95"   # NUMA node 1 cores (adjust via `lscpu --extended`)
HUGEPAGES_DIR="/mnt/huge"
IRDMA_VERSION="1.13.7"  # Latest as of Feb 2025 (update as needed)

echo "Starting IRDMA configuration for 8 E810 NICs..."

# 1. Install Dependencies
echo "Installing dependencies..."
apt update -y
apt install -y dpdk dpdk-dev libnuma-dev build-essential linux-modules-extra-$(uname -r) python3-pyelftools

# 2. Install Latest IRDMA Driver
echo "Installing IRDMA driver v$IRDMA_VERSION..."
# Note: this Download Center link is a landing page; substitute the direct irdma-$IRDMA_VERSION tarball URL from it
wget -q "https://downloadcenter.intel.com/download/30809" -O irdma.tar.gz
tar xzf irdma.tar.gz
cd irdma-$IRDMA_VERSION
make -j$(nproc)
make install
modprobe irdma
cd ..
rm -rf irdma-$IRDMA_VERSION irdma.tar.gz

# 3. Configure Hugepages (256 x 1GiB total, 128 per NUMA node)
echo "Configuring hugepages..."
echo "vm.nr_hugepages=256" >> /etc/sysctl.conf
sysctl -p
mkdir -p $HUGEPAGES_DIR
mount -t hugetlbfs -o pagesize=1G none $HUGEPAGES_DIR
echo "nodev $HUGEPAGES_DIR hugetlbfs pagesize=1G 0 0" >> /etc/fstab
echo 128 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 128 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

# 4. Set CPU Governor to Performance
echo "Setting CPU governor to performance..."
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# 5. Bind NICs to vfio-pci for DPDK
echo "Binding NICs to vfio-pci..."
modprobe vfio-pci
for bdf in "${NIC_BDFS[@]}"; do
    echo "$bdf" > /sys/bus/pci/devices/"$bdf"/driver/unbind 2>/dev/null || true
    echo "$bdf" > /sys/bus/pci/drivers/vfio-pci/new_id
done

# 6. Kernel Boot Parameters (edit GRUB)
echo "Configuring kernel boot parameters..."
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="[^"]*"/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on iommu=pt default_hugepagesz=1G hugepagesz=1G hugepages=256 isolcpus=2-63,66-127 nohz_full=2-63,66-127 rcu_nocbs=2-63,66-127"/' /etc/default/grub
update-grub

# 7. Kernel Tuning for High Throughput
echo "Applying kernel tuning..."
cat << EOF > /etc/sysctl.d/99-performance.conf
net.core.netdev_max_backlog=10000
net.core.rmem_max=536870912
net.core.wmem_max=536870912
net.ipv4.tcp_rmem=4096 87380 536870912
net.ipv4.tcp_wmem=4096 65536 536870912
net.ipv4.tcp_mtu_probing=1
net.ipv4.tcp_congestion_control=bbr
net.core.default_qdisc=fq
vm.swappiness=0
kernel.sched_rt_runtime_us=-1
kernel.numa_balancing=0
EOF
sysctl -p /etc/sysctl.d/99-performance.conf

# 8. Set MTU to 9000 (optional, adjust as needed)
echo "Setting MTU to 9000 on NICs..."
for nic in $(ls /sys/class/net | grep -E "enp[0-9]+s[0-9]+f[0-9]"); do
    ip link set dev "$nic" mtu 9000 2>/dev/null || true
done

# 9. Configure Interrupt Affinity (NUMA-aware)
echo "Configuring NIC interrupt affinity..."
for nic in $(ls /sys/class/net | grep -E "enp[0-9]+s[0-9]+f[0-9]"); do
    numa_node=$(cat /sys/class/net/$nic/device/numa_node)
    if [ "$numa_node" -eq 0 ]; then
        cores="$NUMA0_CORES"
    else
        cores="$NUMA1_CORES"
    fi
    for irq in $(ls /sys/class/net/$nic/device/msi_irqs/ 2>/dev/null); do
        echo "$cores" > /proc/irq/$irq/smp_affinity_list
    done
    # Minimize latency with interrupt coalescing
    ethtool -C $nic adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
done

# 10. Disable Unnecessary Services
echo "Disabling unnecessary services..."
systemctl disable --now systemd-udev-settle systemd-resolved snapd apparmor bluetooth cups
systemctl mask systemd-udev-settle systemd-resolved snapd

# 11. Configure DPDK Environment
echo "Configuring DPDK environment..."
cat << EOF > /etc/dpdk.conf
EAL_ARGS="-l $NUMA0_CORES,$NUMA1_CORES -n 4 --proc-type primary --huge-dir $HUGEPAGES_DIR"
# Device allow-list: one EAL -a flag per E810 port
PORT_ARGS="$(printf -- '-a %s ' "${NIC_BDFS[@]}")"
EOF

echo "Configuration complete! Rebooting to apply changes..."
reboot

Post-Reboot Steps

After rebooting:

  • Verify NIC binding: run /usr/share/dpdk/usertools/dpdk-devbind.py --status to ensure all 8 NICs are bound to vfio-pci.
  • Test performance: use dpdk-testpmd to validate throughput (one lcore stays reserved as the main/interactive core):

dpdk-testpmd -l 2-31,66-95 -n 4 --huge-dir /mnt/huge -- --nb-cores=59 --rxq=8 --txq=8 -i

Check each port achieves ~100Gbps.

Monitor System:

Use htop, perf, or nload to ensure there are no CPU or memory bottlenecks.

Key Optimizations

  • IRDMA: uses RoCEv2 by default on E810 NICs, providing low-latency RDMA.
  • NUMA affinity: NICs and interrupts are pinned to their respective NUMA nodes.
  • Hugepages: 256 x 1 GiB pages (128 per node) for efficient memory access.
  • CPU isolation: cores 2-63 and 66-127 are dedicated to DPDK, minimizing kernel interference.
  • Kernel tuning: large buffers and BBR congestion control for high throughput.
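Whether irdma actually registered an RDMA device per port can be checked with the standard rdma-core tools (Ubuntu packages rdma-core and ibverbs-utils); a generic sanity check, not specific to this board:

rdma link show                                   # expect one RoCE link per E810 port, state ACTIVE
ibv_devices                                      # irdma devices enumerated by libibverbs
ibv_devinfo | grep -E 'hca_id|link_layer|state'  # transport and port state details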

Customization Notes

  • NIC BDFs: replace the NIC_BDFS values with your actual PCIe addresses from lspci.
  • Core mapping: adjust NUMA0_CORES and NUMA1_CORES based on lscpu --extended.
  • IRDMA version: update the download URL and version if a newer driver is available.
  • Application: modify /etc/dpdk.conf to suit your specific DPDK-based application.

This setup ensures your 8 E810 NICs achieve full performance with IRDMA on Ubuntu 22.04. If you encounter issues or need application-specific tweaks, let me know!
