Check the GPU with `lspci`, e.g. `lspci | grep -i nvidia`.
The `cuda-toolkit` package and PyTorch with CUDA support require around 16 GB of disk space to install.
Refer to https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
Before you begin:

- Check the supported OS and versions. We will use Ubuntu 20.04.
- Check the supported C/C++ compiler version. We will use GCC 9.4.
- Ensure kernel headers and kernel development packages are installed with the EXACT VERSION matching `uname -r`. If you perform a system update which changes the version of the Linux kernel being used, make sure to rerun the commands below so that you have the correct kernel headers and kernel development packages installed. Otherwise, the CUDA driver will fail to work with the new kernel. The package name for headers is like `linux-headers-5.15.0-1042`, while the package name for kernel development files is like `linux-image-5.15.0-1042`. Note the prefixes `linux-headers-` and `linux-image-`; the remaining part is usually determined by `uname -r`.
- `cuda-toolkit` requires 6+ GB of disk space. Ensure you have enough free disk space (including /tmp).
Then install the CUDA Toolkit on Ubuntu 20.04.
Refer to https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#ubuntu
- `sudo apt-get install linux-headers-$(uname -r)`
- Install the keyring:

  ```
  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
  sudo dpkg -i cuda-keyring_1.1-1_all.deb
  ```

- Install the meta package `cuda-toolkit`:

  ```
  sudo apt-get update
  sudo apt-get install cuda-toolkit
  ```
When `cuda-toolkit` is installed, you can check the version of the installed CUDA with `nvcc --version`. The version info is required to select a proper version of PyTorch.
Refer to https://pytorch.org/get-started/locally/
The correct install command depends on the OS type and the CUDA version. In our case the OS is Linux and the CUDA version is 12.x, so we choose

```
pip3 install torch torchvision torchaudio
```
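After the installation, a quick sanity check like the following (a minimal sketch; the printed values depend on your driver and the wheel you installed) confirms that PyTorch was built with CUDA and can see the GPUs:

```python
import torch

# Report the PyTorch build and whether CUDA devices are visible.
print("PyTorch version:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
    print("GPU 0:", torch.cuda.get_device_name(0))
```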
Basically, you need to
- Wrap your model in a DDP model.
- Split your training dataset based on rank and world size at runtime so that each training process works on one subset.
- Run your training script with a DDP command line.
DDP takes care of coordinating the training processes and synchronizing gradients among them across nodes.
See https://github.com/coin8086/ml-lab/tree/main/src/pytorch_ddp for sample code.
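As an illustration, a minimal sketch of such a training script might look like the following. This is not the sample repository's code; the toy model, dataset, and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torch.distributed.run (torchrun) launches one copy of this script per process
    # and sets RANK, WORLD_SIZE and LOCAL_RANK for each of them.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A toy model wrapped in DDP; replace with your own model.
    model = torch.nn.Linear(10, 1).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # DistributedSampler splits the dataset by rank and world size,
    # so each process trains on its own subset.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

When launched with `torch.distributed.run` (shown later), each process trains on its own data shard and DDP averages the gradients during `backward()`.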
Refer to
- https://pytorch.org/tutorials/beginner/dist_overview.html
- https://pytorch.org/docs/stable/notes/ddp.html
- https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
Before you can run a training job, you need a shared directory that can be accessed by all the compute nodes. The directory is used for training code and data (both input data set and output trained model).
You can set up an SMB shared directory on the head node and then mount it on each compute node with CIFS, like this:
- On the head node, make a directory `app` under `%CCP_DATA%\SpoolDir`, which is already shared as `CcpSpoolDir` by HPC Pack by default.
- On a compute node, mount the `app` directory like

  ```
  sudo mkdir /app
  sudo mount -t cifs //<your head node name>/CcpSpoolDir/app /app -o vers=2.1,domain=<hpc admin domain>,username=<hpc admin>,password=<your password>,dir_mode=0777,file_mode=0777
  ```
  NOTE:
  - The `password` option can be omitted in an interactive shell. You will be prompted for it in that case.
  - The `dir_mode` and `file_mode` are set to 0777 so that any Linux user can read and write the share. More restrictive permissions are possible, but more complicated to configure.

- Optionally, make the mount permanent by adding a line in `/etc/fstab` like

  ```
  //<your head node name>/CcpSpoolDir/app /app cifs vers=2.1,domain=<hpc admin domain>,username=<hpc admin>,password=<your password>,dir_mode=0777,file_mode=0777 0 2
  ```

  Here the `password` is required.
From the sample code, download the following files into the shared directory `%CCP_DATA%\SpoolDir\app`:
- neural_network.py
- operations.py
- run_ddp.py
Then create a job with Node as the resource unit. The job's task command lines are all the same, like

```
python3 -m torch.distributed.run --nnodes=<the number of compute nodes> --nproc_per_node=<the processes on each node> --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=<a node name>:29400 /app/run_ddp.py
```
Note:
- `nnodes` specifies the number of compute nodes for your training job.
- `nproc_per_node` specifies the number of processes on each compute node. It cannot exceed the number of GPUs on a node; that is, one GPU can have at most one process.
- `rdzv_endpoint` specifies the name and port of a node that acts as the rendezvous. Any node in the training job will work.
- `/app/run_ddp.py` is the path to your training code file. Remember that `/app` is a shared directory on the head node.
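For reference, each process launched by `torch.distributed.run` receives its rank information through environment variables, which the training script can read before initializing the process group. The snippet below is a minimal sketch, not necessarily how the sample `run_ddp.py` does it:

```python
import os
import torch
import torch.distributed as dist

# torch.distributed.run exports these variables for every process it launches.
rank = int(os.environ["RANK"])              # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node; use it to pick a GPU
world_size = int(os.environ["WORLD_SIZE"])  # nnodes * nproc_per_node

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")     # uses the rendezvous info provided by the launcher
print(f"rank {rank}/{world_size}, local rank {local_rank}")
```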
Refer to
- https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#initialize-ddp-with-torch-distributed-run-torchrun
- https://pytorch.org/docs/stable/elastic/run.html
- When using "Run Command", check "Run as local system account NT AUTHORITY/SYSTEM" to run as root on a Linux node.
- When setting up the environment with "Run Command", `nvcc --version` failed after `cuda-toolkit` was installed. The error was

  ```
  IaaSCN116 -> Failed
  ---------------------------------------------------------------------------------------------------
  /tmp/nodemanager_task_374_0.SlfRQe/cmd.sh: line 3: nvcc: command not found
  Task failed during execution with exit code . Please check task's output for error details.
  ```

  However, the same command succeeded in another SSH shell. It seems the "Run Command" shell doesn't have the proper PATH, which can be seen with

  ```
  echo "$(IFS=: ; for p in $PATH; do echo "$p"; done)"
  ```

  It can be corrected by running

  ```
  bash -ic "nvcc --version"
  ```

  in "Run Command": `-i` forces an interactive shell, which then reads `/etc/bash.bashrc`, where the correct PATH for CUDA is set up.