@jamesob
Created April 11, 2026 17:42
#!/usr/bin/env bash
set -euo pipefail
die() { echo "error: $*" >&2; exit 1; }
# --- Check prerequisites ---
command -v nvcc >/dev/null 2>&1 || {
    if [[ -x /usr/local/cuda/bin/nvcc ]]; then
        export PATH=/usr/local/cuda/bin:$PATH
    else
        die "nvcc not found. Install the CUDA toolkit first."
    fi
}
command -v git >/dev/null 2>&1 || die "git not found. Install with: sudo apt install git"
nvidia-smi >/dev/null 2>&1 || die "nvidia-smi failed. Check your NVIDIA driver installation."
# --- Detect GPU compute capability ---
echo "Detecting GPU architecture..."
ARCH=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -1 | tr -d '.')
if [[ -z "$ARCH" ]]; then
    die "Could not detect GPU compute capability"
fi
echo " Compute capability: $(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -1) (sm_${ARCH})"
GPU_COUNT=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
echo " GPUs detected: $GPU_COUNT"
if [[ "$GPU_COUNT" -lt 2 ]]; then
echo "warning: only $GPU_COUNT GPU detected. P2P test needs at least 2."
fi
# --- Clone cuda-samples if needed ---
WORKDIR="${TMPDIR:-/tmp}/cuda-p2p-test"
SAMPLE_DIR="$WORKDIR/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest"
if [[ ! -d "$WORKDIR/cuda-samples" ]]; then
    echo ""
    echo "Cloning cuda-samples..."
    mkdir -p "$WORKDIR"
    git clone --depth 1 https://github.com/NVIDIA/cuda-samples.git "$WORKDIR/cuda-samples"
else
    echo "cuda-samples already cloned at $WORKDIR/cuda-samples"
fi
# --- Build ---
echo ""
echo "Building p2pBandwidthLatencyTest (sm_${ARCH})..."
cd "$SAMPLE_DIR"
nvcc -arch=sm_${ARCH} -o p2pBandwidthLatencyTest p2pBandwidthLatencyTest.cu -I../../../Common
echo "Build OK."
# --- Run ---
echo ""
echo "=== P2P Bandwidth & Latency Test ==="
echo ""
./p2pBandwidthLatencyTest
# --- Show topology for context ---
echo ""
echo "=== GPU Topology ==="
nvidia-smi topo -m
# --- Show link status ---
echo ""
echo "=== PCIe Link Status ==="
for gpu in $(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader); do
    bdf=$(echo "$gpu" | sed 's/00000000://' | tr '[:upper:]' '[:lower:]')
    echo -n "  $gpu: "
    lspci -vvv -s "$bdf" 2>/dev/null | grep "LnkSta:" | head -1 | sed 's/.*LnkSta://' | xargs
done
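
For reference, the peer-access check that p2pBandwidthLatencyTest performs boils down to the standard CUDA runtime peer API. Below is a minimal standalone sketch, not part of the gist script: the file name check_p2p.cu is illustrative, and it uses only documented runtime calls (cudaGetDeviceCount, cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess). Build it the same way as the sample, e.g. nvcc -arch=sm_120 -o check_p2p check_p2p.cu.

// check_p2p.cu - minimal sketch of the peer-access query the sample performs.
// Illustrative only; not taken from the cuda-samples source.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n < 2) {
        printf("need at least 2 CUDA devices\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("Device=%d %s Access Peer Device=%d\n", i, can ? "CAN" : "CANNOT", j);
            if (can) {
                // Enabling peer access is what lets copies between the two
                // devices take the direct P2P path instead of staging through host memory.
                cudaSetDevice(i);
                cudaDeviceEnablePeerAccess(j, 0);
            }
        }
    }
    return 0;
}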

jamesob commented Apr 11, 2026

james@mickey:~$ ./cuda-speed.sh
Detecting GPU architecture...
  Compute capability: 12.0 (sm_120)
  GPUs detected: 2

Cloning cuda-samples...
Cloning into '/tmp/cuda-p2p-test/cuda-samples'...
remote: Enumerating objects: 2115, done.
remote: Counting objects: 100% (2115/2115), done.
remote: Compressing objects: 100% (1403/1403), done.
remote: Total 2115 (delta 1102), reused 1189 (delta 692), pack-reused 0 (from 0)
Receiving objects: 100% (2115/2115), 90.77 MiB | 10.99 MiB/s, done.
Resolving deltas: 100% (1102/1102), done.

Building p2pBandwidthLatencyTest (sm_120)...
Build OK.

=== P2P Bandwidth & Latency Test ===

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX PRO 6000 Blackwell Workstation Edition, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX PRO 6000 Blackwell Workstation Edition, pciBusID: 84, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1506.21  12.19
     1  12.18 1531.86
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 1508.16  27.58
     1  27.47 1530.36
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1483.81  11.75
     1  11.76 1500.92
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 1483.11  50.80
     1  50.59 1502.36
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.03  14.52
     1  14.48   1.05

   CPU     0      1
     0   2.67   8.35
     1   8.41   2.67
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.00   0.38
     1   0.38   1.03

   CPU     0      1
     0   2.72   2.16
     1   2.18   2.74

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

=== GPU Topology ===
	GPU0	GPU1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PIX	0-31	0		N/A
GPU1	PIX	 X 	0-31	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

=== PCIe Link Status ===
  00000000:83:00.0:
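
(The LnkSta line above came back empty: reading PCIe capability registers with lspci typically requires root, so the unprivileged grep in the script finds nothing. A hedged follow-up, not part of the original script, using the bus IDs reported for the two GPUs above:)

# Re-query PCIe link status with elevated privileges.
for bdf in 83:00.0 84:00.0; do
    echo -n "  $bdf: "
    sudo lspci -vvv -s "$bdf" | grep -m1 'LnkSta:' | sed 's/.*LnkSta://' | xargs
done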
