Check the resources available on Swing: NVIDIA A100 GPUs (8 GPUs per node); 1/8 of a node is allocated when requesting 1 GPU.
MACE requires PyTorch 2, which needs CUDA 11.8 or 11.7. Check the required CUDA versions here.
Create a Conda environment with Python 3.10
conda create -n "CUDA-torch-base" python=3.10.0
Activate the environment once it is created
conda activate CUDA-torch-base
Read the PyTorch installation instructions on the official website. For me this is the command:
conda install pytorch=2.2.0 torchvision=0.17.0 torchaudio=2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
Before torch, torchvision, and torchaudio are installed, check that conda is pulling the CUDA build and not the CPU build. For example, a CPU build of PyTorch will appear as pytorch/linux-64::pytorch-2.0.1-py3.10_cpu_0, while the CUDA build will say pytorch/linux-64::pytorch-2.2.0-py3.10_cuda11.8_cudnn8.7.0_0. For PyTorch 1.11.0 (for Allegro), I had success with CUDA 11.3.
Check whether CUDA-enabled torch is installed correctly. Request a GPU (don't forget to ssh into the compute node once the resource is granted), then in Python run:
import torch
assert torch.cuda.is_available()
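For a slightly more informative check, the short sketch below (same environment assumed) also reports the CUDA version torch was built against and the name of the visible GPU:
import torch

# Fail early if the CPU-only build was installed by mistake
assert torch.cuda.is_available()

# CUDA toolkit the torch build was compiled against (should match 11.8 here)
print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)

# Name of the first visible GPU (an NVIDIA A100 on Swing)
print("device:", torch.cuda.get_device_name(0))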
If all is well, install MACE following their installation guide
conda create --name NNFF-MACE --clone CUDA-torch-base
conda activate NNFF-MACE
pip install mace-torch
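As a quick sanity check that MACE landed in the cloned environment, the following sketch imports the package and queries its version through the package metadata (assuming the distribution name mace-torch, as in the pip command above):
from importlib.metadata import version

import mace  # fails if the package did not install correctly

print("mace-torch version:", version("mace-torch"))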
To learn about MACE, follow the tutorial at https://github.com/ilyes319/mace-tutorials/blob/main/mace-users/MACE_users.ipynb. As a test run, download solvent_test.xyz and solvent_train.xyz from that repo, then run this command on a compute node:
mace_run_train \
--name="model" \
--train_file="$DATA/solvent_train.xyz" \
--valid_fraction=0.05 \
--test_file="$DATA/solvent_test.xyz" \
--E0s="isolated" \
--energy_key="energy" \
--forces_key="forces" \
--model="MACE" \
--num_interactions=2 \
--max_ell=2 \
--hidden_irreps="16x0e" \
--num_cutoff_basis=5 \
--correlation=2 \
--r_max=3.0 \
--batch_size=5 \
--valid_batch_size=5 \
--eval_interval=1 \
--max_num_epochs=50 \
--start_swa=15 \
--swa_energy_weight=1000 \
--ema \
--ema_decay=0.99 \
--amsgrad \
--error_table="PerAtomRMSE" \
--default_dtype="float32" \
--swa \
--device=cuda \
--seed=1234
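Once training finishes, the fitted model can be used as an ASE calculator. The sketch below is only an illustration: it assumes the run wrote model.model (from --name="model") to the working directory and that your MACE version exposes MACECalculator with a model_paths argument (older releases use model_path), so check mace.calculators for your install.
from ase.io import read
from mace.calculators import MACECalculator

# Load the trained potential; path assumed from --name="model" above
calc = MACECalculator(model_paths="model.model", device="cuda", default_dtype="float32")

# Evaluate energy and forces on one structure from the test set
atoms = read("solvent_test.xyz", index=0)
atoms.calc = calc
print("energy (eV):", atoms.get_potential_energy())
print("forces (eV/A):", atoms.get_forces())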
Wandb can be used to log the validation loss and the errors in energies and forces. Install wandb using
pip install wandb
Do not forget to log in
wandb login
To tell MACE to use Wandb, add the following lines to the mace_run_train options:
--wandb \
--wandb_project="$YOUR_WANDB_PROJECT_NAME" \
--wandb_entity="$YOUR_WANDB_USERNAME"
Currently only a limited set of properties is logged. Check what is logged here:
if log_wandb:
    wandb_log_dict = {
        "epoch": epoch,
        "valid_loss": valid_loss,
        "valid_rmse_e_per_atom": eval_metrics["rmse_e_per_atom"],
        "valid_rmse_f": eval_metrics["rmse_f"],
    }
    wandb.log(wandb_log_dict)
To log more properties, modify wandb_log_dict at this validation stage or at the training stage. I will open a ticket on MACE and then add a Git branch later.
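For example, the dictionary could be extended as sketched below; the extra keys are assumptions (they presume eval_metrics also carries MAE entries such as mae_e_per_atom and mae_f), so verify which keys your MACE version actually produces before relying on them:
if log_wandb:
    wandb_log_dict = {
        "epoch": epoch,
        "valid_loss": valid_loss,
        "valid_rmse_e_per_atom": eval_metrics["rmse_e_per_atom"],
        "valid_rmse_f": eval_metrics["rmse_f"],
        # assumed extra metrics; check that these keys exist in eval_metrics
        "valid_mae_e_per_atom": eval_metrics["mae_e_per_atom"],
        "valid_mae_f": eval_metrics["mae_f"],
    }
    wandb.log(wandb_log_dict)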