Install NCCL on a Cluster

NCCL needs to be installed on every node of the cluster, under the /opt/nccl directory. To do this we'll create an install script and then run it on all nodes using the srun command.

Create a script named install-nccl.sh with the contents below, then make it executable with chmod +x install-nccl.sh:

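#!/bin/bash
# stop at the first failing command instead of building in the wrong directory
set -e

# install nccl
cd /opt
sudo git clone -b v2.18.5-1 https://github.com/NVIDIA/nccl.git nccl
cd nccl
sudo make -j src.build CUDA_HOME=/usr/local/cuda

# install nccl-tests
cd /opt
# name the target directory explicitly: without it, $_ expands to the URL and cd fails
sudo git clone https://github.com/NVIDIA/nccl-tests.git nccl-tests
cd nccl-tests
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
# the checkout is owned by root, so build with sudo and pass LD_LIBRARY_PATH through explicitly
sudo env "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" make MPI=1 MPI_HOME=/opt/amazon/openmpi NCCL_HOME=/opt/nccl/build CUDA_HOME=/usr/local/cuda
# on the Deep Learning AMI, use these make arguments instead:
# make MPI=1 MPI_HOME=/opt/amazon/openmpi NCCL_HOME=/usr/local/cuda-12.2 CUDA_HOME=/usr/local/cuda-12.2/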
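
Note that git clone refuses to clone into a directory that already exists, so running the script a second time (for example after a build failed on one node) hits an error at the clone step. A minimal sketch of a guard follows; clone_if_missing is a hypothetical helper name and is not part of the original script:

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original script):
# clone a repository only when the target directory is not already a checkout.
clone_if_missing() {
    local repo="$1" dest="$2" branch="${3:-}"
    if [ -d "$dest/.git" ]; then
        echo "skip: $dest already exists"
        return 0
    fi
    if [ -n "$branch" ]; then
        sudo git clone -b "$branch" "$repo" "$dest"
    else
        sudo git clone "$repo" "$dest"
    fi
}
```

In the script, the two clone lines could then be written as clone_if_missing https://github.com/NVIDIA/nccl.git /opt/nccl v2.18.5-1 and clone_if_missing https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests.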

Run the install script on all nodes (4 in this example) like so:

srun -N 4 ./install-nccl.sh
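
Once srun returns, it can be worth confirming that each node actually has the build outputs. The sketch below assumes the NCCL build leaves libnccl.so under build/lib and the nccl-tests build leaves all_reduce_perf under build/. The helper name check_nccl_build is hypothetical, and the default paths assume the /opt layout used above:

```shell
#!/bin/bash
# Hypothetical helper: report whether the expected build outputs exist.
# Arguments (both optional): NCCL checkout directory, nccl-tests checkout directory.
check_nccl_build() {
    local nccl_dir="${1:-/opt/nccl}"
    local tests_dir="${2:-/opt/nccl-tests}"
    local status=0
    local f
    for f in "$nccl_dir/build/lib/libnccl.so" "$tests_dir/build/all_reduce_perf"; do
        if [ -e "$f" ]; then
            echo "$(hostname): OK      $f"
        else
            echo "$(hostname): MISSING $f"
            status=1
        fi
    done
    # non-zero exit status if anything was missing, so srun reports the failure
    return "$status"
}
```

Saved in a script that ends with a call to check_nccl_build (say check-nccl.sh, a hypothetical name), this can be run the same way as the installer: srun -N 4 ./check-nccl.sh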

Check the installed version:

wget https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/4.validation_and_observability/efa-versions.py
srun python3 efa-versions.py
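
As a cross-check that does not depend on the Python script, the version of the locally built NCCL can also be read from its generated header. This is only a sketch: it assumes the build writes build/include/nccl.h with the usual NCCL_MAJOR, NCCL_MINOR and NCCL_PATCH defines, and nccl_header_version is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: print MAJOR.MINOR.PATCH as found in an nccl.h header.
# The default path assumes the /opt/nccl build location used above.
nccl_header_version() {
    local header="${1:-/opt/nccl/build/include/nccl.h}"
    awk '
        $1 == "#define" && $2 == "NCCL_MAJOR" { major = $3 }
        $1 == "#define" && $2 == "NCCL_MINOR" { minor = $3 }
        $1 == "#define" && $2 == "NCCL_PATCH" { patch = $3 }
        END {
            # fail if any of the three defines was not found
            if (major == "" || minor == "" || patch == "") exit 1
            print major "." minor "." patch
        }
    ' "$header"
}
```

For the v2.18.5-1 tag cloned above, the expectation would be that this prints 2.18.5 on every node.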
