For a Python project, first follow the README to set up the environment. If the README does not specify a method, use uv to create a local env for development. If the project requires a specific CUDA version, or depends on libraries such as verl, sglang, or vllm, use conda instead.
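As a rough heuristic, the uv-vs-conda choice can be sketched as a scan of the dependency files (requirements.txt and pyproject.toml are assumed file names; adjust to the project's actual layout):

```shell
# Heuristic sketch: prefer conda when CUDA-sensitive deps are present,
# otherwise default to uv. The file names checked here are assumptions.
if grep -Eq 'verl|sglang|vllm|cuda' requirements.txt pyproject.toml 2>/dev/null; then
  env_tool=conda   # pinned CUDA / verl / sglang / vllm -> conda env
else
  env_tool=uv      # plain Python project -> uv-managed local venv
fi
echo "env tool: $env_tool"
```

This only decides which tool to reach for; the actual env creation still follows the project's README.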
To check whether you are on a SLURM cluster:

```shell
which srun
which squeue
which sinfo
```

If srun, squeue, and sinfo are all installed, you are running on a SLURM cluster. Otherwise, ignore the SLURM-related instructions below.
Check whether SLURM_JOB_ID is set or nvidia-smi is available. If neither is present, you are on a login node; otherwise you are on a worker node.
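The check above can be sketched in a few lines of shell:

```shell
# Detect login node vs worker node: workers either run inside a SLURM job
# (SLURM_JOB_ID is set) or have nvidia-smi on the PATH.
if [ -n "${SLURM_JOB_ID:-}" ] || command -v nvidia-smi >/dev/null 2>&1; then
  node_type=worker
else
  node_type=login
fi
echo "node type: $node_type"
```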
The login node can handle most tasks, but if a program (*.py, *.sh, or a command like nvidia-smi) requires GPU access, you must allocate a GPU node. Wrap the command with eai-run; if eai-run is not installed, fall back to raw srun:

```shell
srun --account <slurm account> --partition <slurm partition> \
     --job-name <slurm account>:dev/eai-test \
     --nodes 1 --gpus-per-node 8 --time 4:00:00 --exclusive --pty bash
```
Also note that all SLURM GPU jobs have a maximum running time of 4 hours. Therefore:
- for training-related jobs, you must support/implement a checkpointing mechanism for weights, optimizer states, and logs;
- if training is not finished when the job ends, launch a new job to continue.
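A minimal resume-aware launch wrapper might look like the following sketch. The script name train.py, its --resume-from flag, and the DONE marker file are assumed conventions for illustration, not a real interface:

```shell
# Sketch: run training so it can be resumed across 4-hour SLURM jobs.
# train.py is expected to periodically save weights/optimizer/logs into
# $CKPT_DIR and to create $CKPT_DIR/DONE when training completes.
CKPT_DIR=${CKPT_DIR:-checkpoints}
mkdir -p "$CKPT_DIR"
if [ -f "$CKPT_DIR/DONE" ]; then
  echo "training already finished"
else
  python train.py --resume-from "$CKPT_DIR" || true
  # If the 4-hour limit killed the job before DONE appeared, resubmit.
  [ -f "$CKPT_DIR/DONE" ] || echo "not finished; launch a new job to continue"
fi
```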
```shell
# Python scripts
python a.py  -> eai-run -i -J ralph/{a-suitable-job-name} --pty python a.py
# uv-managed scripts
uv run a.py  -> eai-run -i -J ralph/{a-suitable-job-name} --pty uv run a.py
# GPU diagnostics
nvidia-smi   -> eai-run -i -J ralph/{a-suitable-job-name} --pty nvidia-smi
# Shell scripts requiring GPUs
bash run.sh  -> eai-run -i -J ralph/{a-suitable-job-name} --pty bash run.sh
```

After submitting a job to SLURM, check its status with squeue:
```
Sat Mar 28 11:40:04 2026
ACCOUNT      JOBID    PARTITION    USER     STATE    TIME  TIME_LIMI  NODES  NAME                      NODELIST(REASON)          START_TIME
nvr_elm_llm  8947377  interactive  ligengz  PENDING  0:00  4:00:00    1      nvr_elm_llm:dev/eai-test  (QOSMaxJobsPerUserLimit)  2026-03-28T15:39:39
```
The last timestamp is the expected launch time. Check the job status:
- Every 1 minute if the expected launch time is within 5 minutes
- Every 5 minutes otherwise
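The polling policy above can be expressed as a small helper; the surrounding loop and the JOB_ID it would query are left to the caller:

```shell
# Return how long to sleep between squeue checks, given the number of
# seconds until the job's expected launch time.
poll_interval() {
  if [ "$1" -le 300 ]; then
    echo 60    # launch expected within 5 minutes: check every minute
  else
    echo 300   # otherwise: check every 5 minutes
  fi
}
# Example usage: sleep "$(poll_interval 120)" between squeue -j "$JOB_ID" calls.
```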
When running jobs with a time limit (e.g., 5 minutes), ensure the job gets enough runtime after allocation. The time limit starts counting once SLURM allocates the resources, not from submission.