Check the GPU with `lspci`, e.g. `lspci | grep -i nvidia`.
The `cuda-toolkit` package and PyTorch with CUDA support require around 16 GB of disk space to install.
Refer to https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
Before you begin:

- Check the supported OS and versions. We will use Ubuntu 20.04.
- Check the supported C/C++ compiler version. We will use GCC 9.4.
- Ensure kernel headers and kernel development packages are installed with the EXACT VERSION matching `uname -r`. If you perform a system update which changes the version of the Linux kernel being used, make sure to rerun the commands below so that you have the correct kernel headers and kernel development packages installed. Otherwise, the CUDA driver will fail to work with the new kernel. The package name for headers is like `linux-headers-5.15.0-1042`, while the package name for kernel development files is like `linux-image-5.15.0-1042`. Note the prefixes `linux-headers-` and `linux-image-`; the remaining part is usually determined by `uname -r`.
- `cuda-toolkit` requires 6+ GB of disk space. Ensure you have enough free disk space (including /tmp).
Then install the CUDA Toolkit on Ubuntu 20.04.
Refer to https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu
- `sudo apt-get install linux-headers-$(uname -r)`
- Install the keyring:

  ```
  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
  sudo dpkg -i cuda-keyring_1.1-1_all.deb
  ```

- Install the meta package `cuda-toolkit`:

  ```
  sudo apt-get update
  sudo apt-get install cuda-toolkit
  ```
When `cuda-toolkit` is installed, you can check the version of the installed CUDA with `nvcc --version`. The version info is required to select a proper version of PyTorch.
Refer to https://pytorch.org/get-started/locally/
The correct install command depends on the OS type and the CUDA version. In our case the OS is Linux and the CUDA version is 12.x, so we choose

```
pip3 install torch torchvision torchaudio
```
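After the installation, a quick sanity check like the following (a minimal sketch; the printed values depend on your driver and the wheel you installed) confirms that PyTorch was built with CUDA and can see the GPUs:

```python
import torch

# Report the PyTorch build and whether CUDA devices are visible.
print("PyTorch version:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
    print("GPU 0:", torch.cuda.get_device_name(0))
```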
Basically, you need to
- Wrap your model in a DDP model.
- Split your training dataset based on rank and world size at runtime so that each training process works on one subset.
- Run your training script with a DDP command line.
DDP takes care of coordinating the training processes and synchronizing gradients among them across nodes.
See https://github.com/coin8086/ml-lab/tree/main/src/pytorch_ddp for sample code.
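As an illustration, a minimal sketch of such a training script might look like the following. This is not the sample repository's code; the toy model, dataset, and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torch.distributed.run (torchrun) launches one copy of this script per process
    # and sets RANK, WORLD_SIZE and LOCAL_RANK for each of them.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A toy model wrapped in DDP; replace with your own model.
    model = torch.nn.Linear(10, 1).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # DistributedSampler splits the dataset by rank and world size,
    # so each process trains on its own subset.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

When launched with `torch.distributed.run` (shown later), each process trains on its own data shard and DDP averages the gradients during `backward()`.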
Refer to
- https://pytorch.org/tutorials/beginner/dist_overview.html
- https://pytorch.org/docs/stable/notes/ddp.html
- https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
Before you can run a training job, you need a shared directory that can be accessed by all the compute nodes. The directory is used for training code and data (both input data set and output trained model).
You can set up an SMB shared directory on the head node and then mount it on each compute node with CIFS, like this:
- On the head node, make a directory `app` under `%CCP_DATA%\SpoolDir`, which is already shared as `CcpSpoolDir` by HPC Pack by default.
- On a compute node, mount the `app` directory like

  ```
  sudo mkdir /app
  sudo mount -t cifs //<your head node name>/CcpSpoolDir/app /app -o vers=2.1,domain=<hpc admin domain>,username=<hpc admin>,password=<your password>,dir_mode=0777,file_mode=0777
  ```
  NOTE:
  - The `password` option can be omitted in an interactive shell. You will be prompted for it in that case.
  - The `dir_mode` and `file_mode` are set to 0777 so that any Linux user can read and write the share. More restrictive permissions are possible, but more complicated to configure.

- Optionally, make the mount permanent by adding a line in `/etc/fstab` like

  ```
  //<your head node name>/CcpSpoolDir/app /app cifs vers=2.1,domain=<hpc admin domain>,username=<hpc admin>,password=<your password>,dir_mode=0777,file_mode=0777 0 2
  ```

  Here the `password` is required.
From the sample code, download the following files into the shared directory `%CCP_DATA%\SpoolDir\app`:
- neural_network.py
- operations.py
- run_ddp.py
Then create a job with Node as the resource unit. The job's task command lines are all the same, like

```
python3 -m torch.distributed.run --nnodes=<the number of compute nodes> --nproc_per_node=<the processes on each node> --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=<a node name>:29400 /app/run_ddp.py
```
Note:
- `nnodes` specifies the number of compute nodes for your training job.
- `nproc_per_node` specifies the number of processes on each compute node. It cannot exceed the number of GPUs on a node; that is, one GPU can have at most one process.
- `rdzv_endpoint` specifies the name and port of a node that acts as the rendezvous. Any node in the training job will work.
- `/app/run_ddp.py` is the path to your training code file. Remember that `/app` is a shared directory on the head node.
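For reference, each process launched by `torch.distributed.run` receives its rank information through environment variables, which the training script can read before initializing the process group. The snippet below is a minimal sketch, not necessarily how the sample `run_ddp.py` does it:

```python
import os
import torch
import torch.distributed as dist

# torch.distributed.run exports these variables for every process it launches.
rank = int(os.environ["RANK"])              # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node; use it to pick a GPU
world_size = int(os.environ["WORLD_SIZE"])  # nnodes * nproc_per_node

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")     # uses the rendezvous info provided by the launcher
print(f"rank {rank}/{world_size}, local rank {local_rank}")
```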
Refer to
- https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#initialize-ddp-with-torch-distributed-run-torchrun
- https://pytorch.org/docs/stable/elastic/run.html
- When using "Run Command", check "Run as local system account NT AUTHORITY/SYSTEM" to run as root on a Linux node.
- When setting up the environment with "Run Command", `nvcc --version` failed after `cuda-toolkit` was installed. The error was

  ```
  IaaSCN116 -> Failed
  ---------------------------------------------------------------------------------------------------
  /tmp/nodemanager_task_374_0.SlfRQe/cmd.sh: line 3: nvcc: command not found
  Task failed during execution with exit code . Please check task's output for error details.
  ```

  However, the same command succeeded in another SSH shell. It seems the "Run Command" shell doesn't have the proper PATH, which can be seen with

  ```
  echo "$(IFS=: ; for p in $PATH; do echo "$p"; done)"
  ```

  It can be corrected by running

  ```
  bash -ic "nvcc --version"
  ```

  in "Run Command": `-i` forces an interactive shell, which then reads `/etc/bash.bashrc`, where the correct PATH for CUDA is set up.