Check the resources available on Swing: NVIDIA A100 GPUs (8 GPUs per node); 1/8 of a node is allocated when requesting 1 GPU.
MACE requires PyTorch 2, which needs CUDA 11.8 or 11.7. Check the required CUDA versions here.
Create a Conda environment with Python 3.10
conda create -n "CUDA-torch-base" python=3.10.0
Activate the environment once it is created
conda activate CUDA-torch-base
Read the PyTorch installation instructions on the official website. For me this is the command:
conda install pytorch=2.2.0 torchvision=0.17.0 torchaudio=2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
Before torch, torchvision, and torchaudio are installed, check that conda is pulling the CUDA build and not the CPU build. For example, a CPU build of PyTorch will appear as pytorch/linux-64::pytorch-2.0.1-py3.10_cpu_0, while the CUDA build will say pytorch/linux-64::pytorch-2.2.0-py3.10_cuda11.8_cudnn8.7.0_0. For PyTorch 1.11.0 (for Allegro), I had success with CUDA 11.3.
Check whether CUDA-enabled torch is installed correctly. Request a GPU (don't forget to ssh into the compute node once the resource is granted), then in Python run:
import torch
assert torch.cuda.is_available()
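For a slightly more informative check, the short sketch below (same environment assumed) also reports the CUDA version torch was built against and the name of the visible GPU:
import torch

# Fail early if the CPU-only build was installed by mistake
assert torch.cuda.is_available()

# CUDA toolkit the torch build was compiled against (should match 11.8 here)
print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)

# Name of the first visible GPU (an NVIDIA A100 on Swing)
print("device:", torch.cuda.get_device_name(0))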
If all is well, install MACE following their installation guide
conda create --name NNFF-MACE --clone CUDA-torch-base
conda activate NNFF-MACE
pip install mace-torch
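As a quick sanity check that MACE landed in the cloned environment, the following sketch imports the package and queries its version through the package metadata (assuming the distribution name mace-torch, as in the pip command above):
from importlib.metadata import version

import mace  # fails if the package did not install correctly

print("mace-torch version:", version("mace-torch"))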
To learn about MACE, follow the tutorial at https://github.com/ilyes319/mace-tutorials/blob/main/mace-users/MACE_users.ipynb. As a test run, download solvent_test.xyz and solvent_train.xyz from that repo, then run this command on a compute node:
mace_run_train \
--name="model" \
--train_file="$DATA/solvent_train.xyz" \
--valid_fraction=0.05 \
--test_file="$DATA/solvent_test.xyz" \
--E0s="isolated" \
--energy_key="energy" \
--forces_key="forces" \
--model="MACE" \
--num_interactions=2 \
--max_ell=2 \
--hidden_irreps="16x0e" \
--num_cutoff_basis=5 \
--correlation=2 \
--r_max=3.0 \
--batch_size=5 \
--valid_batch_size=5 \
--eval_interval=1 \
--max_num_epochs=50 \
--start_swa=15 \
--swa_energy_weight=1000 \
--ema \
--ema_decay=0.99 \
--amsgrad \
--error_table="PerAtomRMSE" \
--default_dtype="float32" \
--swa \
--device=cuda \
--seed=1234
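Once training finishes, the fitted model can be used as an ASE calculator. The sketch below is only an illustration: it assumes the run wrote model.model (from --name="model") to the working directory and that your MACE version exposes MACECalculator with a model_paths argument (older releases use model_path), so check mace.calculators for your install.
from ase.io import read
from mace.calculators import MACECalculator

# Load the trained potential; path assumed from --name="model" above
calc = MACECalculator(model_paths="model.model", device="cuda", default_dtype="float32")

# Evaluate energy and forces on one structure from the test set
atoms = read("solvent_test.xyz", index=0)
atoms.calc = calc
print("energy (eV):", atoms.get_potential_energy())
print("forces (eV/A):", atoms.get_forces())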
Wandb can be used to log the validation loss and the errors in energies and forces. Install wandb using
pip install wandb
Do not forget to log in
wandb login
To tell MACE to use Wandb, add the following lines to the mace_run_train options:
--wandb \
--wandb_project="$YOUR_WANDB_PROJECT_NAME" \
--wandb_entity="$YOUR_WANDB_USERNAME"
Currently only a limited set of properties is logged. Check what is logged here:
if log_wandb:
    wandb_log_dict = {
        "epoch": epoch,
        "valid_loss": valid_loss,
        "valid_rmse_e_per_atom": eval_metrics["rmse_e_per_atom"],
        "valid_rmse_f": eval_metrics["rmse_f"],
    }
    wandb.log(wandb_log_dict)
To log more properties, modify wandb_log_dict at this validation stage or at the training stage. I will open a ticket on MACE and then add a Git branch later.
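For example, the dictionary could be extended as sketched below; the extra keys are assumptions (they presume eval_metrics also carries MAE entries such as mae_e_per_atom and mae_f), so verify which keys your MACE version actually produces before relying on them:
if log_wandb:
    wandb_log_dict = {
        "epoch": epoch,
        "valid_loss": valid_loss,
        "valid_rmse_e_per_atom": eval_metrics["rmse_e_per_atom"],
        "valid_rmse_f": eval_metrics["rmse_f"],
        # assumed extra metrics; check that these keys exist in eval_metrics
        "valid_mae_e_per_atom": eval_metrics["mae_e_per_atom"],
        "valid_mae_f": eval_metrics["mae_f"],
    }
    wandb.log(wandb_log_dict)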