# AI Interconnects Study

## Comparison of Interconnect Technologies: Hardware, Latency, Lanes, Overhead, Throughput, Vendor, and OS Support

| Interconnect | Hardware Support | Latency | Number of Lanes | Protocol Overhead | True Unidirectional Throughput per Lane | True Unidirectional Throughput (Overall) | True Bidirectional Throughput | Vendor Support | OS Support |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVLink | NVIDIA GPUs (Pascal, Volta, Hopper, Blackwell), NVIDIA CPUs (e.g., Grace) | Sub-µs (0.1–0.5 µs, intra-node P2P) | 72 (18 ports × 4 lanes, NVLink 5.0) | ~2.5–3.5% (128b/130b, headers) | 6.09 GB/s (48.72 Gbps) | 438.75 GB/s | 877.5 GB/s | NVIDIA | Linux (primary), Windows (limited via WSL) |
| InfiniBand | CPUs, NVIDIA/AMD/Intel GPUs (via GPUDirect RDMA), NICs (e.g., Mellanox) | ~1–5 µs (point-to-point) | 32 (8 ports × 4 lanes, NDR) | ~8–13% (64b/66b, headers, CRC) | 11.25 GB/s (90 Gbps) | 360 GB/s | 720 GB/s | NVIDIA (Mellanox), Intel, Broadcom | Linux, Windows (partial), macOS (limited) |
| PCIe | CPUs, GPUs (NVIDIA, AMD, Intel), accelerators, NICs | ~0.5–2 µs (intra-node, PCIe 4.0/5.0) | 16 (x16, PCIe 4.0) | ~3.5–6.5% (128b/130b, TLPs) | 1.915 GB/s (15.32 Gbps) | 30.64 GB/s | 61.28 GB/s | Broad (Intel, AMD, NVIDIA, etc.) | Linux, Windows, macOS |
| Ethernet | CPUs, GPUs (via TCP/IP or RoCE), NICs | ~10–100 µs (network-dependent) | 4 (100GbE), 8 (400GbE) | ~13–18% (64b/66b, TCP/IP headers) | 2.69 GB/s (100GbE), 5.375 GB/s (400GbE) | 10.75 GB/s (100GbE), 43 GB/s (400GbE) | 21.5 GB/s (100GbE), 86 GB/s (400GbE) | Broad (Intel, Broadcom, Cisco, etc.) | Linux, Windows, macOS |
| Infinity Fabric | AMD GPUs (e.g., MI300X), AMD CPUs (EPYC, Ryzen) | Sub-µs (0.1–0.5 µs, intra-node) | ~24 (estimated, proprietary) | ~2–5% (proprietary, coherency) | 4.67 GB/s (37.36 Gbps) | 112.13 GB/s | 224.26 GB/s | AMD | Linux (primary), Windows (limited) |
| RoCE | CPUs, GPUs (via GPUDirect RDMA), NICs (e.g., Mellanox) | ~1–10 µs | 4 (100GbE), 8 (200GbE) | ~8–13% (64b/66b, RDMA headers) | 2.81 GB/s (22.48 Gbps) | 11.25 GB/s (100GbE), 22.5 GB/s (200GbE) | 22.5 GB/s (100GbE), 45 GB/s (200GbE) | NVIDIA (Mellanox), Broadcom, Intel | Linux, Windows (partial), macOS (limited) |
| xGMI | AMD GPUs (e.g., MI250, MI300), AMD CPUs | Sub-µs (0.1–0.5 µs, intra-node) | ~24 (estimated, proprietary) | ~2–5% (proprietary, coherency) | 4.06 GB/s (32.48 Gbps) | 97.5 GB/s | 195 GB/s | AMD | Linux (primary, via ROCm) |
| Shared Memory | CPUs, GPUs (within same node, e.g., NUMA) | <1 µs (fastest for intra-node) | N/A (memory bus) | ~1–5% (cache coherency, bus) | N/A | 243.75 GB/s | 487.5 GB/s | Broad (Intel, AMD, NVIDIA) | Linux, Windows, macOS |
| CXL (Current) | CPUs (Intel Xeon, AMD EPYC, NVIDIA Grace), GPUs (Intel, AMD, emerging NVIDIA), accelerators, memory | ~0.1–0.2 µs (intra-node), 1–10 µs (inter-node) | 16 (PCIe 5.0/6.0 x16) | ~3.5–6.5% (128b/130b, headers) | 7.66 GB/s (61.28 Gbps, PCIe 6.0) | 61.28 GB/s (PCIe 5.0 x16), 122.56 GB/s (PCIe 6.0 x16) | 122.56 GB/s (PCIe 5.0 x16), 245.12 GB/s (PCIe 6.0 x16) | CXL Consortium (Intel, AMD, NVIDIA, Arm, Broadcom, Google, Meta, Microsoft) | Linux, Windows (partial), macOS (limited) |
| CXL (Current, Scaled to 64 Lanes) | CPUs (Intel Xeon, AMD EPYC, NVIDIA Grace), GPUs (Intel, AMD, NVIDIA), accelerators, memory | ~0.1–0.2 µs (intra-node), 1–5 µs (inter-node) | 64 (4 × PCIe 6.0 x16) | ~3.5–6.5% (128b/130b, headers) | 7.66 GB/s (61.28 Gbps) | 490.24 GB/s | 980.48 GB/s | CXL Consortium (Intel, AMD, NVIDIA, Arm, Broadcom, Google, Meta, Microsoft) | Linux, Windows (partial), macOS (limited) |
| CXL (Hypothetical 256 GB/s) | CPUs (Intel Xeon, AMD EPYC, NVIDIA Grace), GPUs (Intel, AMD, NVIDIA), accelerators, memory | ~0.05–0.1 µs (intra-node), 0.5–5 µs (inter-node) | 16 (PCIe 7.0 x16) | ~3.5–6.5% (128b/130b, headers) | 15.32 GB/s (122.56 Gbps) | 245.12 GB/s | 490.24 GB/s | CXL Consortium (Intel, AMD, NVIDIA, Arm, Broadcom, Google, Meta, Microsoft) | Linux, Windows (partial), macOS (limited) |
| CXL (Hypothetical 512 GB/s) | CPUs (Intel Xeon, AMD EPYC, NVIDIA Grace), GPUs (Intel, AMD, NVIDIA), accelerators, memory | ~0.025–0.05 µs (intra-node), 0.25–2.5 µs (inter-node) | 32 (PCIe 7.0 x32) or 16 (PCIe 8.0 x16) | ~3.5–6.5% (128b/130b, headers) | 15.32 GB/s (PCIe 7.0 x32), 30.64 GB/s (PCIe 8.0 x16) | 490.24 GB/s | 980.48 GB/s | CXL Consortium (Intel, AMD, NVIDIA, Arm, Broadcom, Google, Meta, Microsoft) | Linux, Windows (partial), macOS (limited) |
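
The "true throughput" columns above follow a common pattern: raw line rate, discounted first by the line encoding (128b/130b or 64b/66b) and then by header/protocol overhead. Below is a minimal sketch of that arithmetic in Python; the ~2.75% and ~7.2% header factors are back-computed from the table's figures rather than taken from vendor documentation.

```python
# A minimal sketch of the effective-throughput arithmetic behind the table.
# Figures are illustrative; exact header overheads vary by implementation.

def effective_gbps(raw_gtps: float, encoded_bits: int, total_bits: int,
                   header_overhead: float) -> float:
    """Per-lane effective throughput in Gbps after line encoding and headers."""
    return raw_gtps * (encoded_bits / total_bits) * (1.0 - header_overhead)

def overall_gbs(per_lane_gbps: float, lanes: int) -> float:
    """Aggregate unidirectional throughput in GB/s across all lanes."""
    return per_lane_gbps * lanes / 8.0

# PCIe 4.0 x16: 16 GT/s per lane, 128b/130b encoding, ~2.75% header/TLP overhead
pcie4_lane = effective_gbps(16.0, 128, 130, 0.0275)            # ~15.32 Gbps
print(f"PCIe 4.0 per lane: {pcie4_lane:.2f} Gbps "
      f"({pcie4_lane / 8:.3f} GB/s)")                          # ~1.915 GB/s
print(f"PCIe 4.0 x16 overall: {overall_gbs(pcie4_lane, 16):.2f} GB/s")  # ~30.64

# InfiniBand NDR: 100 Gbps per lane, 64b/66b encoding, ~7.2% header/CRC overhead
ndr_lane = effective_gbps(100.0, 64, 66, 0.072)                # ~90 Gbps
print(f"NDR per lane: {ndr_lane:.1f} Gbps ({ndr_lane / 8:.2f} GB/s)")   # ~11.25 GB/s
```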

## Comparison of Communication Libraries: Hardware, Latency, Vendor, and OS Support

| Technology | Hardware Support | Latency | Vendor Support | OS Support |
| --- | --- | --- | --- | --- |
| PCCL | Platform-agnostic, TCP/IP (CPUs, GPUs over Ethernet), no CXL support | Moderate (~5× NCCL latency in latency-bound regimes: 16–32 MB messages, 1024–2048 processes) | Prime Intellect (open-source, MIT license) | Linux |
| NCCL | NVIDIA GPUs, NVLink, PCIe, InfiniBand, RoCE, Ethernet, no CXL support | Very low (up to 7.6× latency improvement with the LL algorithm for small messages) | NVIDIA | Linux, Windows (partial via WSL/specific builds) |
| RCCL | AMD GPUs (e.g., MI300X), Infinity Fabric, PCIe, InfiniBand, CXL (emerging, via AMD EPYC) | Low, less optimized than NCCL for small messages | AMD | Linux |
| Gloo | CPUs, NVIDIA GPUs, InfiniBand, RoCE, GPUDirect RDMA, no CXL support | Low in CPU setups (~36% lower than NCCL in single-container tests) | Meta AI (open-source, PyTorch) | Linux, Windows (partial), macOS (limited, CPU-only) |
| NVSHMEM | NVIDIA GPUs, NVLink, PCIe, InfiniBand, RoCE, no CXL support | Very low (in-kernel, GPU-initiated communication) | NVIDIA | Linux |
| oneCCL | Intel GPUs (e.g., Ponte Vecchio), CPUs, PCIe, InfiniBand, CXL (native support) | Moderate, less optimized than NCCL | Intel | Linux, Windows (partial via oneAPI) |
| MSCCL++ | NVIDIA GPUs (CUDA), AMD GPUs (HIP), NVLink, xGMI, InfiniBand, CXL (emerging, via AMD) | Low (in-kernel, comparable to NVSHMEM) | Microsoft (open-source) | Linux, Windows (experimental via CUDA/HIP) |
| HiCCL | NVIDIA, AMD, Intel GPUs, various interconnects, CXL (potential support) | Low, optimized for hierarchical topologies | Academic/experimental (open-source) | Linux |
| UCC | NVIDIA GPUs, AMD GPUs, CPUs, InfiniBand, RoCE, CXL (via UCX) | Varies by backend (e.g., NCCL, RCCL) | UCF Consortium (open-source) | Linux, Windows (limited) |
| UCX | CPUs, GPUs, InfiniBand, RoCE, shared memory, CXL (emerging support) | Low (point-to-point), higher for collectives | UCF Consortium (open-source) | Linux, Windows (partial), macOS (limited) |
| Libfabric | InfiniBand, RoCE, Ethernet, shared memory, CXL (emerging support) | Low (point-to-point), less optimized for GPU collectives | OFIWG (open-source) | Linux, Windows (partial), macOS (limited) |
| MPI | CPUs, GPUs (GPU-aware via OpenMPI/MPICH), InfiniBand, RoCE, Ethernet, CXL (emerging support) | Moderate (10–100 µs for small messages) | Multiple (MPICH, OpenMPI, vendor-specific) | Linux, Windows, macOS (limited) |
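
In practice, several of these libraries surface to applications through a single frontend: PyTorch's `torch.distributed` dispatches to NCCL (or RCCL in ROCm builds, also exposed as `"nccl"`) and Gloo behind one collective API. A minimal all-reduce sketch follows; the script name and launch line are illustrative, and single-node launch via `torchrun` is assumed.

```python
# Minimal all-reduce across ranks; launch with e.g.:
#   torchrun --nproc_per_node=2 allreduce_demo.py
import os

import torch
import torch.distributed as dist

def main() -> None:
    # Pick the GPU collective backend when CUDA/ROCm is available, else Gloo (CPU).
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # reads RANK/WORLD_SIZE from env

    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if backend == "nccl" else "cpu")
    if backend == "nccl":
        torch.cuda.set_device(device)

    # Each rank contributes its rank id; the all-reduce sums across ranks,
    # so every rank ends up with 0 + 1 + ... + (world_size - 1) in each slot.
    t = torch.full((4,), float(rank), device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The same script exercises either row of the table: on a GPU node it runs over NCCL/RCCL (and whatever interconnect they select, e.g., NVLink or InfiniBand), while on a CPU-only machine it falls back to Gloo over TCP.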