@sean-smith
Last active April 22, 2024 23:20
Install NCCL

Install NCCL on a Cluster

NCCL has to be built on every node of the cluster, under /opt/nccl. To do this, we'll create an install script and run it on all nodes with the srun command.

  1. Create a script install-nccl.sh with the contents below, then make it executable with chmod +x install-nccl.sh:
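#!/bin/bash
# Build NCCL and nccl-tests under /opt on this node.
set -e  # stop at the first failed command

# install nccl (build output goes to /opt/nccl/build)
cd /opt
sudo git clone -b v2.18.5-1 https://github.com/NVIDIA/nccl.git nccl
cd nccl
sudo make -j src.build CUDA_HOME=/usr/local/cuda

# install nccl-tests
cd /opt
sudo git clone https://github.com/NVIDIA/nccl-tests.git nccl-tests
cd nccl-tests  # "cd $_" would expand to the clone URL here, not the directory
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
# the clone is owned by root, so build with sudo; sudo usually resets the
# environment, so LD_LIBRARY_PATH is passed through explicitly
sudo env LD_LIBRARY_PATH="$LD_LIBRARY_PATH" make MPI=1 MPI_HOME=/opt/amazon/openmpi NCCL_HOME=/opt/nccl/build CUDA_HOME=/usr/local/cuda
# on the deep learning ami this is:
# make MPI=1 MPI_HOME=/opt/amazon/openmpi NCCL_HOME=/usr/local/cuda-12.2 CUDA_HOME=/usr/local/cuda-12.2/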
  2. Run the install script on all nodes (4 in this example) with srun:
srun -N 4 ./install-nccl.sh
  3. Check the installed version:
wget https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/4.validation_and_observability/efa-versions.py
srun python3 efa-versions.py
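
As an extra sanity check, the build output itself can be inspected on each node. The sketch below is a hypothetical helper, not part of NCCL or Slurm: check_nccl_build is a name made up for illustration. It assumes that the build places libnccl.so and nccl.h under /opt/nccl/build, and that nccl.h defines NCCL_MAJOR, NCCL_MINOR and NCCL_PATCH as plain integers.

```shell
#!/bin/bash
# Hypothetical helper (not part of NCCL or Slurm): report the NCCL version
# found in a build tree, or fail if the build artifacts are missing.
check_nccl_build() {
    local build_dir="${1:-/opt/nccl/build}"   # default matches the install script
    local header="${build_dir}/include/nccl.h"
    local lib="${build_dir}/lib/libnccl.so"

    if [ ! -e "$lib" ] || [ ! -e "$header" ]; then
        echo "$(uname -n): NCCL build not found in ${build_dir}" >&2
        return 1
    fi

    # Assumption: the header carries "#define NCCL_MAJOR <n>" style lines.
    local major minor patch
    major=$(awk '$1 == "#define" && $2 == "NCCL_MAJOR" {print $3}' "$header")
    minor=$(awk '$1 == "#define" && $2 == "NCCL_MINOR" {print $3}' "$header")
    patch=$(awk '$1 == "#define" && $2 == "NCCL_PATCH" {print $3}' "$header")
    echo "$(uname -n): NCCL ${major}.${minor}.${patch}"
}
```

Saved as a script that ends with a call such as check_nccl_build "$@", this could be run across the nodes with srun -N 4 in the same way as the install script; a node with a missing build prints an error and returns a non-zero status.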