@Lyken17
Last active April 9, 2026 18:22
slurm_skills.md

Env setup

For a Python project, first follow the README to set up the environment. If the README does not specify one, use uv to create a local environment for development. If the project requires a specific CUDA version, or depends on libraries such as verl, sglang, or vllm, use conda instead.
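The choice above can be sketched as a small decision script. This is illustrative only: `NEEDS_CUDA` is an assumed flag you set yourself after reading the project's README, and the script echoes the suggested commands rather than running them.

```shell
# Illustrative sketch: choose an env manager for a Python project.
# NEEDS_CUDA is an assumed flag, set after reading the README.
NEEDS_CUDA=0   # set to 1 for a pinned CUDA version or verl/sglang/vllm deps

if [ "$NEEDS_CUDA" -eq 1 ]; then
  echo "conda create -n dev python=3.10 && conda activate dev"
else
  echo "uv venv && source .venv/bin/activate"
fi
```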

SLURM Cluster Setup

Detect whether you are running on a SLURM cluster

which srun
which squeue
which sinfo

If all of srun, squeue, and sinfo are installed, you are running on a SLURM cluster. Otherwise you can ignore the SLURM-related instructions below.
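The three `which` checks can be combined into a single test. A minimal sketch; `have_all` is a helper name introduced here, not part of SLURM:

```shell
# Check that every command given as an argument is on PATH.
have_all() {
  for c in "$@"; do
    command -v "$c" >/dev/null 2>&1 || return 1
  done
}

if have_all srun squeue sinfo; then
  echo "SLURM cluster detected"
else
  echo "not a SLURM cluster; skip the SLURM-specific steps"
fi
```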

Determine login-node vs. worker-node

Check whether the environment has SLURM_JOB_ID set or nvidia-smi is available. If neither is present, you are on a login node; otherwise you are on a worker node.
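The login/worker test can be sketched as follows. The two checks are passed in as arguments so the logic is readable (and testable) without a cluster; `node_type` is a name introduced here:

```shell
# Illustrative classifier for login vs. worker nodes.
node_type() {
  # $1: value of SLURM_JOB_ID (may be empty); $2: 1 if nvidia-smi exists
  if [ -n "$1" ] || [ "$2" -eq 1 ]; then
    echo "worker node"
  else
    echo "login node"
  fi
}

# On a real machine, call it with the live environment:
if command -v nvidia-smi >/dev/null 2>&1; then HAVE_SMI=1; else HAVE_SMI=0; fi
node_type "${SLURM_JOB_ID:-}" "$HAVE_SMI"
```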

Allocate a GPU node for GPU-related tasks

The login node can handle most tasks, but if a program (*.py, *.sh, or commands like nvidia-smi) requires GPU access, you must allocate a GPU node. Wrap the command with eai-run, as shown in the examples below. If eai-run is not installed, fall back to raw srun:

srun --account <slurm account> --partition <slurm partitions> --job-name <slurm account>:dev/eai-test --nodes 1 --gpus-per-node 8 --time 4:00:00 --exclusive --pty bash
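The eai-run-or-srun choice can be sketched as a small wrapper. This is a sketch only: `build_cmd` is a name introduced here, the first argument stands in for "is eai-run installed" so both branches are visible, and the srun flags are a simplified subset of the full command above (account and partition omitted):

```shell
# Sketch: build the GPU-wrapped command line for a given job name and command.
build_cmd() {
  have_eai="$1"; name="$2"; shift 2
  if [ "$have_eai" -eq 1 ]; then
    echo "eai-run -i -J ralph/$name --pty $*"
  else
    echo "srun --nodes 1 --gpus-per-node 8 --time 4:00:00 --pty $*"
  fi
}

build_cmd 1 gpu-check nvidia-smi   # eai-run available
build_cmd 0 gpu-check nvidia-smi   # fallback to raw srun
```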

Also note that all SLURM GPU jobs have a maximum running time of 4 hours. Therefore:

  1. For training-related jobs, you must implement a checkpointing mechanism for weights, optimizer states, and logs.
  2. If training is not finished when the job ends, launch a new job to continue from the latest checkpoint.
# Python scripts
python a.py            -> eai-run -i -J ralph/{a-suitable-job-name} --pty python a.py

# uv-managed scripts
uv run a.py            -> eai-run -i -J ralph/{a-suitable-job-name} --pty uv run a.py

# GPU diagnostics
nvidia-smi             -> eai-run -i -J ralph/{a-suitable-job-name} --pty nvidia-smi

# Shell scripts requiring GPUs
bash run.sh            -> eai-run -i -J ralph/{a-suitable-job-name} --pty bash run.sh
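Given the 4-hour cap above, the checkpoint-and-resubmit pattern can be simulated end to end. In this sketch, each `run_job` call stands in for one time-limited SLURM job, and a step-counter file stands in for real weight/optimizer checkpoints; the names (`ckpt/step`, `train.sh`) are illustrative:

```shell
# Simulated checkpoint/resume loop: keep resubmitting until training is done.
TOTAL_STEPS=7
STEP_FILE=ckpt/step
mkdir -p ckpt
[ -f "$STEP_FILE" ] || echo 0 > "$STEP_FILE"

run_job() {
  # Stands in for one 4-hour job: advance at most 3 steps, checkpointing each.
  i=0
  while [ "$i" -lt 3 ] && [ "$(cat "$STEP_FILE")" -lt "$TOTAL_STEPS" ]; do
    echo $(( $(cat "$STEP_FILE") + 1 )) > "$STEP_FILE"   # "checkpoint"
    i=$((i + 1))
  done
}

until [ "$(cat "$STEP_FILE")" -ge "$TOTAL_STEPS" ]; do
  run_job   # in practice: eai-run/srun ... bash train.sh
done
echo "finished at step $(cat "$STEP_FILE")"
```

The key property is that progress lives in the checkpoint, not in the job: any job can die at the time limit and the next one resumes from the last saved step.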

Check queue status and running time

After submitting a job to SLURM, check its status with squeue:

Sat Mar 28 11:40:04 2026
     ACCOUNT          JOBID    PARTITION     USER    STATE     TIME TIME_LIMI NODES NAME {NODELIST(REASON) START_TIME}
 nvr_elm_llm        8947377 interactive,  ligengz  PENDING     0:00   4:00:00     1 nvr_elm_llm:dev/eai-test {(QOSMaxJobsPerUserLimit) 2026-03-28T15:39:39}

The last timestamp is the expected launch time. Check the job status:

  • Every 1 minute if the expected launch time is within 5 minutes
  • Every 5 minutes otherwise
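The polling cadence above can be sketched as a small helper (`poll_interval` is a name introduced here) that maps minutes-until-launch to a sleep interval in seconds:

```shell
# Sketch of the polling cadence for squeue checks.
poll_interval() {
  # $1: minutes remaining until the expected launch time
  if [ "$1" -le 5 ]; then
    echo 60    # launch within 5 minutes: check every minute
  else
    echo 300   # otherwise: check every 5 minutes
  fi
}

poll_interval 3
poll_interval 30
```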

When running jobs with a time limit (e.g., 5 minutes), ensure the job gets enough runtime after allocation: the time count starts once SLURM allocates the resources, not at submission.
