Quantum-Accelerators/electrai#100 Add Modal GPU CI workflow
Add Modal GPU infrastructure for CI, benchmarking, and training experiments.
- GPU e2e test on Modal L4, parallel to the existing EC2-based `gpu-e2e.yml`
- Faster cold start (~30 s vs ~3-5 min), simpler setup (no runner registration, no OIDC)
- Produces identical results to EC2 L4 (`val_loss=0.364269`)
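For orientation, the GHA side of the Modal e2e job can be as small as the sketch below. The step layout, action versions, and Python version are assumptions, not the repo's exact workflow — only the `modal run modal/ci.py` invocation and the `MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET` repo secrets come from this PR.

```yaml
# Sketch of .github/workflows/gpu-e2e-modal.yml (steps assumed, not exact)
name: gpu-e2e-modal
on: [pull_request]
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install modal
      - name: Run e2e on Modal L4
        env:
          MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
          MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
        run: modal run modal/ci.py
```

Because the GPU work happens inside Modal, the runner itself stays a plain `ubuntu-latest` box — no self-hosted runner registration or OIDC role assumption.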
- Full training entrypoint for experiments on Modal GPUs (L4/A100/H100)
- Uses the `electrai-data` Volume with `dataset_4` (2,885 samples, ~205 GiB)
- Configurable model size, epochs, learning rate, WandB logging
- Checkpoint persistence via the `electrai-checkpoints` Volume
- Replaces Lambda Labs for experiments (better GPU availability)
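The entrypoint above follows the standard Modal app shape; this is a hypothetical sketch, not the repo's actual `modal/train.py` — the app name, mount paths, and `train()` signature are assumptions, while the Volume/secret names, GPU options, and `retries=0` come from this PR.

```python
# Hypothetical sketch of modal/train.py (names/paths assumed)
import modal

app = modal.App("electrai-train")
data_vol = modal.Volume.from_name("electrai-data")
ckpt_vol = modal.Volume.from_name("electrai-checkpoints")

@app.function(
    gpu="L4",                          # or "A100" / "H100"
    retries=0,                         # no crash loops while iterating
    volumes={"/data": data_vol, "/checkpoints": ckpt_vol},
    secrets=[modal.Secret.from_name("wandb-credentials")],
)
def train(epochs: int, lr: float):
    ...  # load dataset_4 from /data, train, checkpoint to /checkpoints

@app.local_entrypoint()
def main(epochs: int = 2, lr: float = 1e-3):
    train.remote(epochs=epochs, lr=lr)
```

Attaching both Volumes to the same function keeps data reads and checkpoint writes on Modal's network storage, so nothing needs to round-trip through the local machine.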
- Mirrors `gpu-benchmark.yml` but runs on Modal
- Configurable sample count (subsample from 2,885 or use all)
- `modal run modal/benchmark.py --gpu A100 --samples 50 --epochs 5`
- Validated via a push-triggered run on the `modal-benchmark` branch (50 samples, 5 epochs, L4)
- Note: `gpu-benchmark-modal.yml` uses `workflow_dispatch` only, so it won't be dispatchable until this PR merges to `main`
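The configurable sample count can be pictured as a deterministic subsample of the 2,885-sample dataset; the function name and seed below are assumptions for illustration, not the benchmark's actual code.

```python
# Illustrative --samples handling (names/seed are assumptions): pick n of
# the 2,885 dataset_4 samples reproducibly, or fall back to all of them.
import random

def subsample(paths, n=None, seed=0):
    """Return n paths chosen deterministically; n=None means use all."""
    if n is None or n >= len(paths):
        return list(paths)
    return random.Random(seed).sample(list(paths), n)

all_paths = [f"sample_{i:04d}" for i in range(2885)]
bench_paths = subsample(all_paths, n=50)  # matches --samples 50
```

A fixed seed keeps repeated benchmark runs comparable, since every run measures the same 50-sample subset.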
- `modal/populate_volume.py`: S3 → Modal Volume sync
- Data provenance: Globus (Della) → S3 (`s3://openathena/electrai/mp/chg_datasets/dataset_4/`) → Modal Volume
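At its core the sync step is a set difference between the S3 listing and what is already on the Volume. The helper below is an illustrative assumption — the real script presumably lists objects via boto3 and uploads through the Modal Volume API.

```python
# Illustrative sync planning (helper name is an assumption): upload only
# the S3 keys that are not yet present on the Modal Volume.
def keys_to_sync(s3_keys, volume_keys):
    """Return S3 keys missing from the Volume, preserving listing order."""
    present = set(volume_keys)
    return [k for k in s3_keys if k not in present]

s3 = ["dataset_4/a.npz", "dataset_4/b.npz", "dataset_4/c.npz"]
vol = ["dataset_4/a.npz"]
todo = keys_to_sync(s3, vol)  # only b and c still need uploading
```

Skipping already-present keys makes the sync resumable, which matters at ~205 GiB and 5,771 files.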
- Dependencies read from `pyproject.toml` via `pip_install_from_pyproject` (no duplication)
- `retries=0` to prevent crash loops during iteration
- `MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET` — repo secrets (set)
- `wandb-credentials` — Modal secret with `WANDB_API_KEY` (set)
- `aws-credentials` — Modal secret for `populate_volume.py` (set, uses SSO session token)
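For reference, Modal secrets like these can be created from the CLI; the AWS key names below are assumptions about the credential shape (only `WANDB_API_KEY` and the secret names appear in this PR), and values are elided.

```shell
# Create the Modal secrets referenced above (values elided)
modal secret create wandb-credentials WANDB_API_KEY=<key>
modal secret create aws-credentials \
  AWS_ACCESS_KEY_ID=<id> AWS_SECRET_ACCESS_KEY=<secret> AWS_SESSION_TOKEN=<sso-token>
```

Because the AWS secret holds an SSO session token, it expires and will need re-creating before future `populate_volume.py` runs.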
- `modal run modal/ci.py` — val_loss matches `linux-gpu` expected values
- `modal run modal/train.py --epochs 2` — trains on Volume data with WandB
- GHA `gpu-e2e-modal.yml` triggers on PR, passes
- `modal/populate_volume.py` — 5,771 files synced to Volume
- `modal run modal/benchmark.py` — 50 samples, 5 epochs, L4, green
- GHA `gpu-benchmark-modal.yml` — push-triggered iteration, green
- Betsy validates training workflow as Lambda Labs replacement