@ryan-williams
Last active March 25, 2026 20:45
Quantum-Accelerators/electrai#100 Add Modal GPU CI workflow

Summary

Add Modal GPU infrastructure for CI, benchmarking, and training experiments.

CI (modal/ci.py + .github/workflows/gpu-e2e-modal.yml)

  • GPU e2e test on Modal L4, parallel to existing EC2-based gpu-e2e.yml
  • Faster cold start (~30s vs ~3-5min on EC2) and simpler setup (no runner registration, no OIDC)
  • Produces identical results to EC2 L4 (val_loss=0.364269)
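
The parity claim above amounts to a tolerance check against the EC2 L4 baseline. A minimal sketch of such a check, assuming the baseline value quoted above; the tolerance and function name are illustrative, not necessarily what modal/ci.py asserts:

```python
# EXPECTED_VAL_LOSS is the EC2 L4 baseline from this PR; ATOL is a
# hypothetical tolerance, not the value (if any) used in modal/ci.py.
EXPECTED_VAL_LOSS = 0.364269
ATOL = 1e-6

def check_parity(val_loss: float,
                 expected: float = EXPECTED_VAL_LOSS,
                 atol: float = ATOL) -> bool:
    """Return True when a Modal run reproduces the EC2 baseline within tolerance."""
    return abs(val_loss - expected) <= atol

assert check_parity(0.364269)      # identical result passes
assert not check_parity(0.3643)    # a drifted result fails
```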

Training (modal/train.py)

  • Full training entrypoint for experiments on Modal GPUs (L4/A100/H100)
  • Uses electrai-data Volume with dataset_4 (2,885 samples, ~205 GiB)
  • Configurable model size, epochs, learning rate, WandB logging
  • Checkpoint persistence via electrai-checkpoints Volume
  • Replaces Lambda Labs for experiments (better GPU availability)

Benchmark (modal/benchmark.py + .github/workflows/gpu-benchmark-modal.yml)

  • Mirrors gpu-benchmark.yml but runs on Modal
  • Configurable sample count (subsample from 2,885 or use all)
  • modal run modal/benchmark.py --gpu A100 --samples 50 --epochs 5
  • Validated via push-triggered run on modal-benchmark branch (50 samples, 5 epochs, L4)
  • Note: gpu-benchmark-modal.yml uses workflow_dispatch only, so it can't be triggered manually until this PR merges to main (GitHub only exposes workflow_dispatch for workflows on the default branch)

Data pipeline

  • modal/populate_volume.py: S3 → Modal Volume sync
  • Data provenance: Globus (Della) → S3 (s3://openathena/electrai/mp/chg_datasets/dataset_4/) → Modal Volume
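
An S3 → Volume sync of this kind is typically made idempotent by copying only what's missing or stale. A minimal sketch of that planning step under the assumption of a size-based comparison (the real populate_volume.py may compare checksums or copy unconditionally):

```python
def plan_sync(s3_objects: dict[str, int],
              volume_files: dict[str, int]) -> list[str]:
    """Return keys to copy: absent from the Volume, or present with a
    size mismatch. Re-running the sync then copies nothing new."""
    return sorted(
        key for key, size in s3_objects.items()
        if volume_files.get(key) != size
    )

s3 = {"dataset_4/a.npz": 100, "dataset_4/b.npz": 200, "dataset_4/c.npz": 300}
vol = {"dataset_4/a.npz": 100, "dataset_4/b.npz": 150}  # b is stale, c missing
assert plan_sync(s3, vol) == ["dataset_4/b.npz", "dataset_4/c.npz"]
assert plan_sync(s3, s3) == []  # second pass is a no-op
```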

Image construction

  • Dependencies read from pyproject.toml via pip_install_from_pyproject, so the dependency list is not duplicated in the image definition
  • retries=0 so failures surface immediately instead of crash-looping during iteration

Secrets required

  • MODAL_TOKEN_ID / MODAL_TOKEN_SECRET — repo secrets (set)
  • wandb-credentials — Modal secret with WANDB_API_KEY (set)
  • aws-credentials — Modal secret for populate_volume.py (set, uses SSO session token)
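
Modal secrets surface inside the container as environment variables, so a cheap preflight check can fail fast before a long run hits an auth error. A sketch, assuming standard variable names; WANDB_API_KEY is named in this PR, while the AWS variable names are the conventional SSO-session trio and are an assumption about what aws-credentials contains:

```python
import os
from collections.abc import Mapping

# Env vars each Modal secret is expected to inject. WANDB_API_KEY comes
# from this PR; the AWS names are assumed (standard SSO session vars).
REQUIRED = {
    "wandb-credentials": ["WANDB_API_KEY"],
    "aws-credentials": [
        "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN",
    ],
}

def missing_vars(secret: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Variables a secret should have injected but which are unset or empty."""
    return [v for v in REQUIRED[secret] if not env.get(v)]

assert missing_vars("wandb-credentials", {"WANDB_API_KEY": "x"}) == []
assert missing_vars("wandb-credentials", {}) == ["WANDB_API_KEY"]
```

Note that SSO session tokens expire, so the aws-credentials secret needs refreshing after each new SSO login.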

Test plan

  • modal run modal/ci.py — val_loss matches linux-gpu expected values
  • modal run modal/train.py --epochs 2 — trains on Volume data with WandB
  • GHA gpu-e2e-modal.yml triggers on PR, passes
  • modal/populate_volume.py — 5,771 files synced to Volume
  • modal run modal/benchmark.py — 50 samples, 5 epochs, L4, green
  • GHA gpu-benchmark-modal.yml — push-triggered iteration, green
  • Betsy validates training workflow as Lambda Labs replacement