Quantum-Accelerators/electrai#100 Add Modal GPU CI workflow
Add Modal GPU infrastructure for CI, benchmarking, and training experiments.
- GPU e2e test on Modal L4, parallel to the existing EC2-based `gpu-e2e.yml`
- Faster cold start (~30 s vs ~3-5 min), simpler setup (no runner registration, no OIDC)
- Produces identical results to EC2 L4 (`val_loss=0.364269`)
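For orientation, the GHA side of the Modal e2e job can be as small as the sketch below. The step layout, action versions, and Python version are assumptions, not the repo's exact workflow — only the `modal run modal/ci.py` invocation and the `MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET` repo secrets come from this PR.

```yaml
# Sketch of .github/workflows/gpu-e2e-modal.yml (steps assumed, not exact)
name: gpu-e2e-modal
on: [pull_request]
jobs:
  e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install modal
      - name: Run e2e on Modal L4
        env:
          MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
          MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
        run: modal run modal/ci.py
```

Because the GPU work happens inside Modal, the runner itself stays a plain `ubuntu-latest` box — no self-hosted runner registration or OIDC role assumption.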
- Full training entrypoint for experiments on Modal GPUs (L4/A100/H100)
- Uses the `electrai-data` Volume with `dataset_4` (2,885 samples, ~205 GiB)
- Configurable model size, epochs, learning rate, WandB logging
- Checkpoint persistence via the `electrai-checkpoints` Volume
- Replaces Lambda Labs for experiments (better GPU availability)
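The entrypoint above follows the standard Modal app shape; this is a hypothetical sketch, not the repo's actual `modal/train.py` — the app name, mount paths, and `train()` signature are assumptions, while the Volume/secret names, GPU options, and `retries=0` come from this PR.

```python
# Hypothetical sketch of modal/train.py (names/paths assumed)
import modal

app = modal.App("electrai-train")
data_vol = modal.Volume.from_name("electrai-data")
ckpt_vol = modal.Volume.from_name("electrai-checkpoints")

@app.function(
    gpu="L4",                          # or "A100" / "H100"
    retries=0,                         # no crash loops while iterating
    volumes={"/data": data_vol, "/checkpoints": ckpt_vol},
    secrets=[modal.Secret.from_name("wandb-credentials")],
)
def train(epochs: int, lr: float):
    ...  # load dataset_4 from /data, train, checkpoint to /checkpoints

@app.local_entrypoint()
def main(epochs: int = 2, lr: float = 1e-3):
    train.remote(epochs=epochs, lr=lr)
```

Attaching both Volumes to the same function keeps data reads and checkpoint writes on Modal's network storage, so nothing needs to round-trip through the local machine.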
- Mirrors `gpu-benchmark.yml` but runs on Modal
- Configurable sample count (subsample from 2,885 or use all)
- `modal run modal/benchmark.py --gpu A100 --samples 50 --epochs 5`
- Validated via a push-triggered run on the `modal-benchmark` branch (50 samples, 5 epochs, L4)
- Note: `gpu-benchmark-modal.yml` uses `workflow_dispatch` only, so it won't be dispatchable until this PR merges to `main`
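The configurable sample count can be pictured as a deterministic subsample of the 2,885-sample dataset; the function name and seed below are assumptions for illustration, not the benchmark's actual code.

```python
# Illustrative --samples handling (names/seed are assumptions): pick n of
# the 2,885 dataset_4 samples reproducibly, or fall back to all of them.
import random

def subsample(paths, n=None, seed=0):
    """Return n paths chosen deterministically; n=None means use all."""
    if n is None or n >= len(paths):
        return list(paths)
    return random.Random(seed).sample(list(paths), n)

all_paths = [f"sample_{i:04d}" for i in range(2885)]
bench_paths = subsample(all_paths, n=50)  # matches --samples 50
```

A fixed seed keeps repeated benchmark runs comparable, since every run measures the same 50-sample subset.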
- `modal/populate_volume.py`: S3 → Modal Volume sync
- Data provenance: Globus (Della) → S3 (`s3://openathena/electrai/mp/chg_datasets/dataset_4/`) → Modal Volume
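At its core the sync step is a set difference between the S3 listing and what is already on the Volume. The helper below is an illustrative assumption — the real script presumably lists objects via boto3 and uploads through the Modal Volume API.

```python
# Illustrative sync planning (helper name is an assumption): upload only
# the S3 keys that are not yet present on the Modal Volume.
def keys_to_sync(s3_keys, volume_keys):
    """Return S3 keys missing from the Volume, preserving listing order."""
    present = set(volume_keys)
    return [k for k in s3_keys if k not in present]

s3 = ["dataset_4/a.npz", "dataset_4/b.npz", "dataset_4/c.npz"]
vol = ["dataset_4/a.npz"]
todo = keys_to_sync(s3, vol)  # only b and c still need uploading
```

Skipping already-present keys makes the sync resumable, which matters at ~205 GiB and 5,771 files.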
- Dependencies read from `pyproject.toml` via `pip_install_from_pyproject` (no duplication)
- `retries=0` to prevent crash loops during iteration
- `MODAL_TOKEN_ID`/`MODAL_TOKEN_SECRET` — repo secrets (set)
- `wandb-credentials` — Modal secret with `WANDB_API_KEY` (set)
- `aws-credentials` — Modal secret for `populate_volume.py` (set, uses SSO session token)
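For reference, Modal secrets like these can be created from the CLI; the AWS key names below are assumptions about the credential shape (only `WANDB_API_KEY` and the secret names appear in this PR), and values are elided.

```shell
# Create the Modal secrets referenced above (values elided)
modal secret create wandb-credentials WANDB_API_KEY=<key>
modal secret create aws-credentials \
  AWS_ACCESS_KEY_ID=<id> AWS_SECRET_ACCESS_KEY=<secret> AWS_SESSION_TOKEN=<sso-token>
```

Because the AWS secret holds an SSO session token, it expires and will need re-creating before future `populate_volume.py` runs.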
- `modal run modal/ci.py` — val_loss matches `linux-gpu` expected values
- `modal run modal/train.py --epochs 2` — trains on Volume data with WandB
- GHA `gpu-e2e-modal.yml` triggers on PR, passes
- `modal/populate_volume.py` — 5,771 files synced to Volume
- `modal run modal/benchmark.py` — 50 samples, 5 epochs, L4, green
- GHA `gpu-benchmark-modal.yml` — push-triggered iteration, green
- Betsy validates training workflow as Lambda Labs replacement