Here's my alias in .bashrc for getting a gpu-dev instance based on https://support.pawsey.org.au/documentation/display/US/Setonix+GPU+Partition+Quick+Start
alias getgpunode='salloc -p gpu-dev --nodes=1 --gpus-per-node=1 --account=${PAWSEY_PROJECT}-gpu'
First, to make a fresh environment:
mamba create -p `pwd`/transformers transformers python=3.10
Install Torch with the closest available ROCm version: there are no PyTorch wheels for ROCm 5.4.3 (the current 'new' version on Pawsey) or for 5.2.3 (the default), so I used the rocm5.4.2 build. I also set pip's cache dir to somewhere on /scratch.
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2 --cache-dir `pwd`/pipcache
Test whether torch can see the GPU:
srun python -c 'import torch; print(torch.cuda.is_available())'
Should print True!
Install the 'right' Tensorflow:
pip install tensorflow-rocm
Seems to work? Now install transformers:
pip install transformers
Change transformers' cache to somewhere on /scratch:
export TRANSFORMERS_CACHE=`pwd`/tf_cache
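The same thing can be done from inside Python if you'd rather not rely on the shell environment (a sketch; the `tf_cache` path matches the export above, and it must be set before `transformers` is imported, since the cache location is read at import time):

```python
import os
import pathlib

# Equivalent to the `export TRANSFORMERS_CACHE=...` above, done in Python.
# Assumption: the script is launched from a directory on /scratch.
cache_dir = pathlib.Path.cwd() / "tf_cache"
os.environ["TRANSFORMERS_CACHE"] = str(cache_dir)
print(os.environ["TRANSFORMERS_CACHE"])
```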
I also had to upgrade accelerate:
pip install -U accelerate
Then run some code!
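As a minimal smoke test, something like the following should work (a sketch assuming `transformers` is installed; the default model for the "sentiment-analysis" task gets downloaded into the cache configured above, so this also needs network access from the node):

```python
# Minimal smoke test of the transformers install. The try/except lets the
# script degrade gracefully on a node without the library or without network.
try:
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")  # downloads the default model
    result = classifier("The ROCm install finally works!")
    print(result)  # a list like [{'label': ..., 'score': ...}]
except Exception as exc:  # no transformers / no network on this node
    result = None
    print(f"smoke test skipped: {exc}")
```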
I got an error like
python3.10/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: roctracer_next_record, version ROCTRACER_4.1
This turned out to be a typo in my module load rocm/5.4.3 line, which silently loaded an old ROCm. Loading the correct rocm/5.4.3 solved it.
I also got an error like
Device-side assertion `t >= 0 && t < n_classes' failed.
My class labels did not start at 0: they accidentally started at the taxonomy ID (some large number), but the loss function expects labels in the range 0 to n_classes - 1. Replacing each taxonomy ID with a counter from 0 to len(tax_ids) - 1 fixed this.
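The remapping amounts to something like this (a sketch; the `tax_ids` values here are made-up examples, not from my actual dataset):

```python
# Map arbitrary taxonomy IDs to contiguous class labels 0..n_classes-1,
# which is what the loss's device-side assertion (t >= 0 && t < n_classes)
# requires.
tax_ids = [9606, 10090, 7227]  # example NCBI taxonomy IDs (human, mouse, fly)

# Sort for a deterministic mapping, then enumerate to get 0-based labels.
id_to_label = {tax_id: i for i, tax_id in enumerate(sorted(set(tax_ids)))}

labels = [id_to_label[t] for t in tax_ids]
print(labels)  # → [1, 2, 0]
```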