Here's my alias in .bashrc for getting a gpu-dev instance based on https://support.pawsey.org.au/documentation/display/US/Setonix+GPU+Partition+Quick+Start
alias getgpunode='salloc -p gpu-dev --nodes=1 --gpus-per-node=1 --account=${PAWSEY_PROJECT}-gpu'
First, to make a fresh environment:
mamba create -p `pwd`/transformers transformers python=3.10
Install Torch with the closest available ROCm version: there are no PyTorch wheels for ROCm 5.4.3 (the current 'new' version on Pawsey) or for 5.2.3 (the default), so I used the rocm5.4.2 build. I also set pip's cache dir to somewhere on /scratch.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2 --cache-dir `pwd`/pipcache
Test whether torch can see the GPU:
srun python -c 'import torch; print(torch.cuda.is_available())'
Should print True!
Install the 'right' Tensorflow:
pip install tensorflow-rocm
Seems to work? Now install transformers:
pip install transformers
Change transformers' cache to somewhere on /scratch:
export TRANSFORMERS_CACHE=`pwd`/tf_cache
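The same thing can be done from inside Python if you'd rather not rely on the shell environment (a sketch; the `tf_cache` path matches the export above, and it must be set before `transformers` is imported, since the cache location is read at import time):

```python
import os
import pathlib

# Equivalent to the `export TRANSFORMERS_CACHE=...` above, done in Python.
# Assumption: the script is launched from a directory on /scratch.
cache_dir = pathlib.Path.cwd() / "tf_cache"
os.environ["TRANSFORMERS_CACHE"] = str(cache_dir)
print(os.environ["TRANSFORMERS_CACHE"])
```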
I also had to upgrade accelerate:
pip install -U accelerate
Then run some code!
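As a minimal smoke test, something like the following should work (a sketch assuming `transformers` is installed; the default model for the "sentiment-analysis" task gets downloaded into the cache configured above, so this also needs network access from the node):

```python
# Minimal smoke test of the transformers install. The try/except lets the
# script degrade gracefully on a node without the library or without network.
try:
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")  # downloads the default model
    result = classifier("The ROCm install finally works!")
    print(result)  # a list like [{'label': ..., 'score': ...}]
except Exception as exc:  # no transformers / no network on this node
    result = None
    print(f"smoke test skipped: {exc}")
```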
I got an error like
python3.10/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: roctracer_next_record, version ROCTRACER_4.1
This turned out to be a typo in my module load rocm/5.4.3 line, which silently loaded an old ROCm. Loading the correct rocm/5.4.3 solved it.
I also got an error like
Device-side assertion `t >= 0 && t < n_classes' failed.
My class labels did not start at 0: they accidentally started at the taxonomy ID (some large number), but the loss function expects labels in the range 0 to n_classes - 1. Replacing each taxonomy ID with a counter from 0 to len(tax_ids) - 1 fixed this.
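The remapping amounts to something like this (a sketch; the `tax_ids` values here are made-up examples, not from my actual dataset):

```python
# Map arbitrary taxonomy IDs to contiguous class labels 0..n_classes-1,
# which is what the loss's device-side assertion (t >= 0 && t < n_classes)
# requires.
tax_ids = [9606, 10090, 7227]  # example NCBI taxonomy IDs (human, mouse, fly)

# Sort for a deterministic mapping, then enumerate to get 0-based labels.
id_to_label = {tax_id: i for i, tax_id in enumerate(sorted(set(tax_ids)))}

labels = [id_to_label[t] for t in tax_ids]
print(labels)  # → [1, 2, 0]
```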