To install NCCL on the cluster, we need to install it on every node under the /opt/nccl
directory. To do this, we'll create a script and then run it on all nodes using the srun
command.
- Create a script named install-nccl.sh with the following contents, then make it executable with chmod +x install-nccl.sh:
#!/bin/bash
# install nccl
cd /opt
sudo git clone -b v2.18.5-1 https://github.com/NVIDIA/nccl.git nccl && cd $_
sudo make -j src.build CUDA_HOME=/usr/local/cuda
# install nccl-tests
cd /opt
# name the target directory explicitly so that $_ expands to it rather than to the URL
sudo git clone https://github.com/NVIDIA/nccl-tests.git nccl-tests && cd $_
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
make MPI=1 MPI_HOME=/opt/amazon/openmpi NCCL_HOME=/opt/nccl/build CUDA_HOME=/usr/local/cuda
# on the Deep Learning AMI, use this instead:
# make MPI=1 MPI_HOME=/opt/amazon/openmpi NCCL_HOME=/usr/local/cuda-12.2 CUDA_HOME=/usr/local/cuda-12.2/
- Run the install script on all nodes (4 in this example) like so:
srun -N 4 ./install-nccl.sh
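If the command is launched from inside an existing allocation, srun may inherit a task count that differs from the node count. A minimal sketch, assuming you want exactly one install per node: the helper name install_cmd is hypothetical, while -N and --ntasks-per-node are standard srun options. It only prints the command so it can be reviewed before running.

```shell
#!/bin/bash
# install_cmd NODES SCRIPT (hypothetical helper, not part of the original steps):
# print the srun command that runs SCRIPT exactly once on each of NODES nodes.
install_cmd() {
  local nodes="${1:-4}"                    # node count, defaults to 4 as above
  local script="${2:-./install-nccl.sh}"   # script created in the previous step
  printf 'srun -N %s --ntasks-per-node=1 %s\n' "$nodes" "$script"
}

# Print the command for review; run it once it looks right.
install_cmd 4 ./install-nccl.sh
```

Adjust the node count to match the size of your cluster.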
- Check the installed version:
wget https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/4.validation_and_observability/efa-versions.py
srun python3 efa-versions.py
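Besides checking versions, it can help to confirm that the build outputs actually exist on every node. The following is a sketch under stated assumptions, not part of the original steps: the function name check_nccl_build is hypothetical, and the default paths assume the install script above was used unchanged (NCCL built in /opt/nccl/build, nccl-tests in /opt/nccl-tests/build).

```shell
#!/bin/bash
# check_nccl_build NCCL_BUILD_DIR TESTS_BUILD_DIR (hypothetical helper)
# Returns 0 when the expected build outputs are present, 1 otherwise.
check_nccl_build() {
  local nccl_build="${1:-/opt/nccl/build}"          # assumed NCCL build dir
  local tests_build="${2:-/opt/nccl-tests/build}"   # assumed nccl-tests build dir
  local status=0
  local f
  for f in "${nccl_build}/lib/libnccl.so" \
           "${nccl_build}/include/nccl.h" \
           "${tests_build}/all_reduce_perf"; do
    if [ -e "$f" ]; then
      echo "ok:      $f"
    else
      echo "missing: $f"
      status=1
    fi
  done
  return "$status"
}
```

To use it, save the function in a script (for example a hypothetical check-nccl-build.sh) that ends with a call to check_nccl_build, and run that script on all nodes with srun in the same way as the install script.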