Check the resources available on Swing: NVIDIA A100 GPUs, 8 per node. Requesting 1 GPU allocates 1/8 of a node.
MACE requires PyTorch 2, which needs CUDA 11.8 or 11.7. Check required CUDA versions here
Create a Conda environment with Python 3.10
conda create -n "CUDA-torch-base" python=3.10.0
Activate the environment once it is created
conda activate CUDA-torch-base
Read the PyTorch installation instructions on their official website. For me, this was the command:
conda install pytorch=2.2.0 torchvision=0.17.0 torchaudio=2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
Before confirming the installation of Torch, Torchvision, and Torchaudio, check in the package plan that conda prints that it is installing the CUDA version and not the CPU version. For example, a CPU build of PyTorch will be listed as pytorch/linux-64::pytorch-2.0.1-py3.10_cpu_0 while the CUDA build will be listed as pytorch/linux-64::pytorch-2.2.0-py3.10_cuda11.8_cudnn8.7.0_0. For PyTorch 1.11.0 (for Allegro), I had success with CUDA 11.3.
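The build-string check above can also be done programmatically. The following is a minimal sketch, assuming the spec has the `channel/subdir::name-version-build` shape shown in the two examples; `classify_build` is a hypothetical helper name, not part of conda or PyTorch.

```python
import re


def classify_build(spec: str) -> dict:
    """Classify a conda package spec as a CPU or CUDA build.

    Assumes the form printed in conda's package plan, e.g.
    'pytorch/linux-64::pytorch-2.2.0-py3.10_cuda11.8_cudnn8.7.0_0'.
    """
    # Drop the 'channel/subdir::' prefix if it is present.
    package = spec.split("::")[-1]
    # The package part is 'name-version-build'; split from the right so
    # that hyphens inside the package name are preserved.
    name, version, build = package.rsplit("-", 2)

    cuda = re.search(r"cuda(\d+(?:\.\d+)*)", build)
    if cuda:
        kind = "cuda"
    elif "cpu" in build:
        kind = "cpu"
    else:
        kind = "unknown"

    return {
        "name": name,
        "version": version,
        "build": build,
        "kind": kind,
        "cuda_version": cuda.group(1) if cuda else None,
    }


if __name__ == "__main__":
    for spec in (
        "pytorch/linux-64::pytorch-2.0.1-py3.10_cpu_0",
        "pytorch/linux-64::pytorch-2.2.0-py3.10_cuda11.8_cudnn8.7.0_0",
    ):
        print(classify_build(spec))
```

If `kind` comes back as `cpu` or `unknown` for the pytorch package, it is safer to abort the install and revisit the channels and the `pytorch-cuda` pin.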
Check whether the CUDA-enabled torch is installed correctly. Request a GPU (don't forget to ssh into the compute node once the resource is granted). In Python, run:
import torch
assert torch.cuda.is_available()
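When the assertion fails, it helps to know why. The sketch below is one possible diagnostic, not an official PyTorch tool: `cuda_report` is a hypothetical helper that distinguishes a missing torch install, a CPU-only build, and a CUDA build that cannot see a GPU (for example, when run on a login node).

```python
import importlib.util


def cuda_report() -> dict:
    """Collect basic facts about the torch install and visible GPUs."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "built_with_cuda": None,   # CUDA version torch was built against
        "cuda_available": False,
        "devices": [],
    }
    # Avoid an ImportError if torch is missing from this environment.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

A `built_with_cuda` of `None` points to a CPU-only build, while a CUDA version with `cuda_available: False` usually means the script is not running on a node with an allocated GPU.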
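If all is well, install MACE following their installation guide. First clone the base environment created above, so the working CUDA/PyTorch setup stays untouched:
conda create --name NNFF-MACE --clone CUDA-torch-base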
conda activate NNFF-MACE
pip install mace-torch
To learn about MACE, follow this tutorial at https://github.com/ilyes319/mace-tutorials/blob/main/mace-users/MACE_users.ipynb. As a test run, download solvent_test.xyz and solvent_train.xyz from the repo, then run this command on a compute node ($DATA is the directory containing the two files):
mace_run_train \
--name="model" \
--train_file="$DATA/solvent_train.xyz" \
--valid_fraction=0.05 \
--test_file="$DATA/solvent_test.xyz" \
--E0s="isolated" \
--energy_key="energy" \
--forces_key="forces" \
--model="MACE" \
--num_interactions=2 \
--max_ell=2 \
--hidden_irreps="16x0e" \
--num_cutoff_basis=5 \
--correlation=2 \
--r_max=3.0 \
--batch_size=5 \
--valid_batch_size=5 \
--eval_interval=1 \
--max_num_epochs=50 \
--start_swa=15 \
--swa_energy_weight=1000 \
--ema \
--ema_decay=0.99 \
--amsgrad \
--error_table="PerAtomRMSE" \
--default_dtype="float32" \
--swa \
--device=cuda \
--seed=1234
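For repeated runs it can be convenient to assemble the command above from a dictionary rather than editing a long shell line. This is a minimal sketch: `build_mace_command` is a hypothetical helper, the `data_dir` path is a placeholder, and the `r_max` values in the loop are illustrative only. It uses only a subset of the flags from the command above and prints the commands instead of running them.

```python
import shlex


def build_mace_command(options: dict) -> list:
    """Turn an options mapping into an argument list for mace_run_train.

    True becomes a bare flag (--ema), False/None is skipped,
    anything else becomes --key=value.
    """
    argv = ["mace_run_train"]
    for key, value in options.items():
        if value is True:
            argv.append(f"--{key}")
        elif value is False or value is None:
            continue
        else:
            argv.append(f"--{key}={value}")
    return argv


if __name__ == "__main__":
    data_dir = "/path/to/data"  # placeholder: directory with the .xyz files
    base = {
        "train_file": f"{data_dir}/solvent_train.xyz",
        "valid_fraction": 0.05,
        "test_file": f"{data_dir}/solvent_test.xyz",
        "E0s": "isolated",
        "model": "MACE",
        "hidden_irreps": "16x0e",
        "batch_size": 5,
        "max_num_epochs": 50,
        "ema": True,
        "ema_decay": 0.99,
        "device": "cuda",
        "seed": 1234,
    }
    # Illustrative sweep over the cutoff radius; the values are assumptions.
    for r_max in (3.0, 4.0):
        options = {"name": f"model_rmax{r_max}", **base, "r_max": r_max}
        print(shlex.join(build_mace_command(options)))
```

The resulting list can be passed to `subprocess.run` on a compute node; printing it first makes it easy to compare against the reference command.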
Instructions for LAMMPS with MACE here
Instructions to compile Polaris with Kokkos here