# AI Interconnects Study

## Comparison of Interconnect Technologies: Hardware, Latency, Lanes, Overhead, Throughput, Vendor, and OS Support

| Interconnect | Hardware Support | Latency | Number of Lanes | Protocol Overhead | True Unidirectional Throughput per Lane | True Unidirectional Throughput (Overall) | True Bidirectional Throughput | Vendor Support | OS Support |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVLink | NVIDIA GPUs (Pascal, Volta, Hopper, Blackwell), NVIDIA CPUs (e.g., Grace) | Sub-µs (0.1–0.5 µs, intra-node P2P) | 72 (18 ports × 4 lanes, NVLink 5.0) | ~2.5–3.5% (128b/130b, headers) | 6.09 GB/s (48.72 Gbps) | 438.75 GB/s | 877.5 GB/s | NVIDIA | Linux (primary), Windows (limited via WSL) |
| InfiniBand | CPUs, NVIDIA/AMD/Intel GPUs (via GPUDirect RDMA), NICs (e.g., Mellanox) | ~1–5 µs (point-to-point) | 32 (8 ports × 4 lanes, NDR) | ~8–13% (64b/66b, headers, CRC) | 11.25 GB/s (90 Gbps) | 360 GB/s | 720 GB/s | NVIDIA (Mellanox), Intel, Broadcom | Linux, Windows (partial), macOS (limited) |
| PCIe | CPUs, GPUs (NVIDIA, AMD, Intel), accelerators, NICs | ~0.5–2 µs (intra-node, PCIe 4.0/5.0) | 16 (x16, PCIe 4.0) | ~3.5–6.5% (128b/130b, TLPs) | 1.915 GB/s (15.32 Gbps) | 30.64 GB/s | 61.28 GB/s | Broad (Intel, AMD, NVIDIA, etc.) | Linux, Windows, macOS |
| Ethernet | CPUs, GPUs (via TCP/IP or RoCE), NICs | ~10–100 µs (network-dependent) | 4 (100GbE), 8 (400GbE) | ~13–18% (64b/66b, TCP/IP headers) | 2.69 GB/s (100GbE), 5.375 GB/s (400GbE) | 10.75 GB/s (100GbE), 43 GB/s (400GbE) | 21.5 GB/s (100GbE), 86 GB/s (400GbE) | Broad (Intel, Broadcom, Cisco, etc.) | Linux, Windows, macOS |
| Infinity Fabric | AMD GPUs (e.g., MI300X), AMD CPUs (EPYC, Ryzen) | Sub-µs (0.1–0.5 µs, intra-node) | ~24 (estimated, proprietary) | ~2–5% (proprietary, coherency) | 4.67 GB/s (37.36 Gbps) | 112.13 GB/s | 224.26 GB/s | AMD | Linux (primary), Windows (limited) |
| RoCE | CPUs, GPUs (via GPUDirect RDMA), NICs (e.g., Mellanox) | ~1–10 µs | 4 (100GbE), 8 (200GbE) | ~8–13% (64b/66b, RDMA headers) | 2.81 GB/s (22.48 Gbps) | 11.25 GB/s (100GbE), 22.5 GB/s (200GbE) | 22.5 GB/s (100GbE), 45 GB/s (200GbE) | NVIDIA (Mellanox), Broadcom, Intel | Linux, Windows (partial), macOS (limited) |
| xGMI | AMD GPUs (e.g., MI250, MI300), AMD CPUs | Sub-µs (0.1–0.5 µs, intra-node) | ~24 (estimated, proprietary) | ~2–5% (proprietary, coherency) | 4.06 GB/s (32.48 Gbps) | 97.5 GB/s | 195 GB/s | AMD | Linux (primary, via ROCm) |
| Shared Memory | CPUs, GPUs (within same node, e.g., NUMA) | <1 µs (fastest for intra-node) | N/A (memory bus) | ~1–5% (cache coherency, bus) | N/A | 243.75 GB/s | 487.5 GB/s | Broad (Intel, AMD, NVIDIA) | Linux, Windows, macOS |
| CXL (Current) | CPUs (Intel Xeon, AMD EPYC, NVIDIA Grace), GPUs (Intel, AMD, emerging NVIDIA), accelerators, memory | ~0.1–0.2 µs (intra-node), 1–10 µs (inter-node) | 16 (PCIe 5.0/6.0 x16) | ~3.5–6.5% (128b/130b, headers) | 7.66 GB/s (61.28 Gbps, PCIe 6.0) | 61.28 GB/s (PCIe 5.0 x16), 122.56 GB/s (PCIe 6.0 x16) | 122.56 GB/s (PCIe 5.0 x16), 245.12 GB/s (PCIe 6.0 x16) | CXL Consortium (Intel, AMD, NVIDIA, Arm, Broadcom, Google, Meta, Microsoft) | Linux, Windows (partial), macOS (limited) |
| CXL (Current, Scaled to 64 Lanes) | CPUs (Intel Xeon, AMD EPYC, NVIDIA Grace), GPUs (Intel, AMD, NVIDIA), accelerators, memory | ~0.1–0.2 µs (intra-node), 1–5 µs (inter-node) | 64 (4 × PCIe 6.0 x16) | ~3.5–6.5% (128b/130b, headers) | 7.66 GB/s (61.28 Gbps) | 490.24 GB/s | 980.48 GB/s | CXL Consortium (Intel, AMD, NVIDIA, Arm, Broadcom, Google, Meta, Microsoft) | Linux, Windows (partial), macOS (limited) |
| CXL (Hypothetical 256 GB/s) | CPUs (Intel Xeon, AMD EPYC, NVIDIA Grace), GPUs (Intel, AMD, NVIDIA), accelerators, memory | ~0.05–0.1 µs (intra-node), 0.5–5 µs (inter-node) | 16 (PCIe 7.0 x16) | ~3.5–6.5% (128b/130b, headers) | 15.32 GB/s (122.56 Gbps) | 245.12 GB/s | 490.24 GB/s | CXL Consortium (Intel, AMD, NVIDIA, Arm, Broadcom, Google, Meta, Microsoft) | Linux, Windows (partial), macOS (limited) |
| CXL (Hypothetical 512 GB/s) | CPUs (Intel Xeon, AMD EPYC, NVIDIA Grace), GPUs (Intel, AMD, NVIDIA), accelerators, memory | ~0.025–0.05 µs (intra-node), 0.25–2.5 µs (inter-node) | 32 (PCIe 7.0 x32) or 16 (PCIe 8.0 x16) | ~3.5–6.5% (128b/130b, headers) | 15.32 GB/s (PCIe 7.0 x32), 30.64 GB/s (PCIe 8.0 x16) | 490.24 GB/s | 980.48 GB/s | CXL Consortium (Intel, AMD, NVIDIA, Arm, Broadcom, Google, Meta, Microsoft) | Linux, Windows (partial), macOS (limited) |
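
The "true throughput" columns above follow a common pattern: raw line rate, discounted first by the line encoding (128b/130b or 64b/66b) and then by header/protocol overhead. Below is a minimal sketch of that arithmetic in Python; the ~2.75% and ~7.2% header factors are back-computed from the table's figures rather than taken from vendor documentation.

```python
# A minimal sketch of the effective-throughput arithmetic behind the table.
# Figures are illustrative; exact header overheads vary by implementation.

def effective_gbps(raw_gtps: float, encoded_bits: int, total_bits: int,
                   header_overhead: float) -> float:
    """Per-lane effective throughput in Gbps after line encoding and headers."""
    return raw_gtps * (encoded_bits / total_bits) * (1.0 - header_overhead)

def overall_gbs(per_lane_gbps: float, lanes: int) -> float:
    """Aggregate unidirectional throughput in GB/s across all lanes."""
    return per_lane_gbps * lanes / 8.0

# PCIe 4.0 x16: 16 GT/s per lane, 128b/130b encoding, ~2.75% header/TLP overhead
pcie4_lane = effective_gbps(16.0, 128, 130, 0.0275)            # ~15.32 Gbps
print(f"PCIe 4.0 per lane: {pcie4_lane:.2f} Gbps "
      f"({pcie4_lane / 8:.3f} GB/s)")                          # ~1.915 GB/s
print(f"PCIe 4.0 x16 overall: {overall_gbs(pcie4_lane, 16):.2f} GB/s")  # ~30.64

# InfiniBand NDR: 100 Gbps per lane, 64b/66b encoding, ~7.2% header/CRC overhead
ndr_lane = effective_gbps(100.0, 64, 66, 0.072)                # ~90 Gbps
print(f"NDR per lane: {ndr_lane:.1f} Gbps ({ndr_lane / 8:.2f} GB/s)")   # ~11.25 GB/s
```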

## Comparison of Communication Libraries: Hardware, Latency, Vendor, and OS Support

| Technology | Hardware Support | Latency | Vendor Support | OS Support |
| --- | --- | --- | --- | --- |
| PCCL | Platform-agnostic, TCP/IP (CPUs, GPUs over Ethernet), no CXL support | Moderate (~5× NCCL latency in latency-bound regimes: 16–32 MB messages, 1024–2048 processes) | Prime Intellect (open-source, MIT license) | Linux |
| NCCL | NVIDIA GPUs, NVLink, PCIe, InfiniBand, RoCE, Ethernet, no CXL support | Very low (up to 7.6× latency improvement with the LL algorithm for small messages) | NVIDIA | Linux, Windows (partial via WSL/specific builds) |
| RCCL | AMD GPUs (e.g., MI300X), Infinity Fabric, PCIe, InfiniBand, CXL (emerging, via AMD EPYC) | Low, less optimized than NCCL for small messages | AMD | Linux |
| Gloo | CPUs, NVIDIA GPUs, InfiniBand, RoCE, GPUDirect RDMA, no CXL support | Low in CPU setups (~36% lower than NCCL in single-container tests) | Meta AI (open-source, PyTorch) | Linux, Windows (partial), macOS (limited, CPU-only) |
| NVSHMEM | NVIDIA GPUs, NVLink, PCIe, InfiniBand, RoCE, no CXL support | Very low (in-kernel, GPU-initiated communication) | NVIDIA | Linux |
| oneCCL | Intel GPUs (e.g., Ponte Vecchio), CPUs, PCIe, InfiniBand, CXL (native support) | Moderate, less optimized than NCCL | Intel | Linux, Windows (partial via oneAPI) |
| MSCCL++ | NVIDIA GPUs (CUDA), AMD GPUs (HIP), NVLink, xGMI, InfiniBand, CXL (emerging, via AMD) | Low (in-kernel, comparable to NVSHMEM) | Microsoft (open-source) | Linux, Windows (experimental via CUDA/HIP) |
| HiCCL | NVIDIA, AMD, Intel GPUs, various interconnects, CXL (potential support) | Low, optimized for hierarchical topologies | Academic/experimental (open-source) | Linux |
| UCC | NVIDIA GPUs, AMD GPUs, CPUs, InfiniBand, RoCE, CXL (via UCX) | Varies by backend (e.g., NCCL, RCCL) | UCF Consortium (open-source) | Linux, Windows (limited) |
| UCX | CPUs, GPUs, InfiniBand, RoCE, shared memory, CXL (emerging support) | Low (point-to-point), higher for collectives | UCF Consortium (open-source) | Linux, Windows (partial), macOS (limited) |
| Libfabric | InfiniBand, RoCE, Ethernet, shared memory, CXL (emerging support) | Low (point-to-point), less optimized for GPU collectives | OFIWG (open-source) | Linux, Windows (partial), macOS (limited) |
| MPI | CPUs, GPUs (GPU-aware via OpenMPI/MPICH), InfiniBand, RoCE, Ethernet, CXL (emerging support) | Moderate (10–100 µs for small messages) | Multiple (MPICH, OpenMPI, vendor-specific) | Linux, Windows, macOS (limited) |
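
In practice, several of these libraries surface to applications through a single frontend: PyTorch's `torch.distributed` dispatches to NCCL (or RCCL in ROCm builds, also exposed as `"nccl"`) and Gloo behind one collective API. A minimal all-reduce sketch follows; the script name and launch line are illustrative, and single-node launch via `torchrun` is assumed.

```python
# Minimal all-reduce across ranks; launch with e.g.:
#   torchrun --nproc_per_node=2 allreduce_demo.py
import os

import torch
import torch.distributed as dist

def main() -> None:
    # Pick the GPU collective backend when CUDA/ROCm is available, else Gloo (CPU).
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)  # reads RANK/WORLD_SIZE from env

    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if backend == "nccl" else "cpu")
    if backend == "nccl":
        torch.cuda.set_device(device)

    # Each rank contributes its rank id; the all-reduce sums across ranks,
    # so every rank ends up with 0 + 1 + ... + (world_size - 1) in each slot.
    t = torch.full((4,), float(rank), device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The same script exercises either row of the table: on a GPU node it runs over NCCL/RCCL (and whatever interconnect they select, e.g., NVLink or InfiniBand), while on a CPU-only machine it falls back to Gloo over TCP.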