- Change into the shared directory (/fsx in this setup)
- Create a script install-nccl-aws-ofi.sh to install AWS OFI NCCL:
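#!/bin/bash
set -e  # stop at the first failing step so a broken build is never installed
# Download and unpack the v1.8.1-aws release
wget https://github.com/aws/aws-ofi-nccl/releases/download/v1.8.1-aws/aws-ofi-nccl-1.8.1-aws.tar.gz
tar -xzf aws-ofi-nccl-1.8.1-aws.tar.gz
cd aws-ofi-nccl-1.8.1-aws
# Configure against the EFA libfabric, Open MPI, and CUDA installations
./autogen.sh
./configure --enable-platform-aws \
    --with-libfabric=/opt/amazon/efa \
    --with-mpi=/opt/amazon/openmpi \
    --with-cuda=/usr/local/cuda \
    --prefix=/opt/aws-ofi-nccl
make
sudo make install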
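Before running the script, it may help to confirm that the directories passed to ./configure exist on the node. The following is a minimal sketch; require_dirs is a hypothetical helper name, not part of AWS OFI NCCL, and the paths are simply the ones used in the configure line above.

```shell
#!/bin/bash
# require_dirs DIR...: hypothetical helper that names each missing directory
# on stderr and returns non-zero if at least one is missing.
require_dirs() {
    local missing=0 dir
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the paths from the configure step:
#   require_dirs /opt/amazon/efa /opt/amazon/openmpi /usr/local/cuda || exit 1
```

If any directory is reported missing, the matching --with-... option would need to point elsewhere before the build can succeed.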
- Execute:
bash install-nccl-aws-ofi.sh
- Check the installed version:
$ strings /opt/aws-ofi-nccl/lib/libnccl-net.so | grep Initializing
NET/OFI Initializing aws-ofi-nccl GitHub-dev
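If only the version token is needed, for example to compare it in a script, the banner line can be reduced to the word that follows "aws-ofi-nccl". This is a sketch under the assumption that the banner keeps the format shown above; plugin_version is a hypothetical helper name.

```shell
#!/bin/bash
# plugin_version: hypothetical helper. Reads `strings` output on stdin and
# prints the first token that follows "Initializing aws-ofi-nccl".
# Prints nothing if no such line is present.
plugin_version() {
    sed -n 's/.*Initializing aws-ofi-nccl \([^ ]*\).*/\1/p' | head -n 1
}

# Example call with the library path from the step above:
#   strings /opt/aws-ofi-nccl/lib/libnccl-net.so | plugin_version
```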
- Install on all compute nodes (4 in this example, set with -N) from the shared build directory:
cd /fsx/aws-ofi-nccl-1.8.1-aws
srun -N 4 sudo make install
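To confirm that the install reached every node, one option is to run a file check through the same launcher. The sketch below assumes that srun returns a non-zero exit code when any task fails; check_installed is a hypothetical helper name, and the library path is the one used in the version check.

```shell
#!/bin/bash
# check_installed LIB [LAUNCHER...]: hypothetical helper that succeeds only if
# LIB exists everywhere the launcher runs. With no launcher given, it checks
# the local node only.
check_installed() {
    local lib=$1
    shift
    "$@" test -f "$lib"
}

# Example call across the 4 compute nodes:
#   check_installed /opt/aws-ofi-nccl/lib/libnccl-net.so srun -N 4 \
#       && echo "plugin present on all nodes"
```

Passing the launcher as arguments keeps the same check usable on a single node, where it runs without srun.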