- Introduction
- Launching Jobs
- Job Submission Methods
- Slurm Flags
- Monitoring Jobs
- Updating Jobs
- Canceling Jobs
- Examining Jobs
- Why is my job PENDING?
Slurm can launch jobs in three ways:
- `srun --pty` — interactive shell session
- `srun` — run a single command
- `sbatch` — submit a full script (batch)
Once a job is launched, check the queue:
squeue
Important: After submitting, always verify your job with `squeue`. Errors (e.g., invalid QoS, bad constraints) will keep a job pending indefinitely.
Note: Some fields (like `START_TIME`) are computed once per minute and may be blank right after submission.
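For example, a quick post-submission check that lists only your own jobs; with the default `squeue` output, the `ST` column shows the job state and, for pending jobs, the last column shows the scheduler's reason:

```bash
# List only your jobs: ST shows the state (PD = pending, R = running),
# and NODELIST(REASON) shows why a pending job has not started yet.
squeue -u $USER
```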
These are the flags you will specify most often:
- `--gres=gpu:N` — number of GPUs (prefer this over `--gpus=N` unless you intend multi-node GPU placement)
- `--time=[days-]hh:mm:ss` — wallclock time (days optional), e.g. `1-12:00:00` for 36h
- `--qos=<name>` — Quality of Service to choose limits/priority
- `-w <node>` — run on a specific node (useful if files aren’t on a shared FS yet)
name | priority | max jobs | max cpus/gpus | max time |
---|---|---|---|---|
cpu | 10 | 4 | 32 / 0 GPUs | ∞ |
gpu-debug | 20 | 1 | — / 8 GPUs | 01:00:00 |
gpu-short | 10 | 4 | — / 4 GPUs | 04:00:00 |
gpu-medium | 5 | 1 | — / 4 GPUs | 2-00:00:00 |
gpu-long | 2 | 2 | — / 2 GPUs | 7-00:00:00 |
gpu-h100 | 10 | 2 | — / 4 GPUs | 2-00:00:00 |
gpu-h200 | 10 | 2 | — / 4 GPUs | 4-00:00:00 |
gpu-hero | 100 | 3 | — / 3 GPUs | ∞ |
- Note: `gpu-hero` is for urgent deadlines; ask an admin for temporary access.
- Note 2: `gpu-h100` runs on Dionysus; `gpu-h200` runs on Hades.
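If you are unsure which QoS names your account can actually use, you can usually query the accounting database directly (a sketch, assuming Slurm accounting is enabled on this cluster):

```bash
# Show the QoS values associated with your user and account
sacctmgr show associations user=$USER format=User,Account,QOS
```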
Allocate one GPU on `poseidon` for 1 hour:
srun --time=01:00:00 --gres=gpu:1 --qos=gpu-debug -w poseidon python3 your_script.py
Use `tmux` so the job persists after you disconnect:
# start tmux (on artemis, for example)
tmux
# inside tmux
cd myproject
source myenv/bin/activate
srun --time=04:00:00 --gres=gpu:1 --qos=gpu-long -w artemis python3 your_script.py
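You can then detach from tmux and come back later; the `srun` job keeps running inside the session:

```bash
# Detach: press CTRL-b, then d (the job keeps running inside tmux)
# Reattach later from a new SSH session:
tmux attach
# If you have several sessions, list them and attach by name/index:
tmux ls
tmux attach -t 0
```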
Get an interactive shell with your requested resources:
srun --time=01:00:00 --gres=gpu:1 --qos=gpu-debug -w artemis --pty bash
Then run `python3 your_script.py` inside the shell. Exit with `CTRL-D` or `exit`.
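Inside the interactive shell you can confirm what was actually allocated; a minimal check, assuming NVIDIA GPUs with drivers installed on the node:

```bash
# GPUs visible to your job (Slurm typically restricts this to what you requested)
nvidia-smi
echo $CUDA_VISIBLE_DEVICES
# CPU and memory granted to the job
scontrol show job $SLURM_JOB_ID | grep -E "NumCPUs|TRES"
```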
Create `test.sbatch`:
#!/bin/bash
#SBATCH --job-name=my_script
#SBATCH --output="job.%x.%j.out"
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1
#SBATCH --qos=gpu-debug
python3 your_script.py
Submit and check:
sbatch -w artemis ./test.sbatch
squeue
Output will be in `job.<name>.<jobid>.out`.
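If your script needs a virtual environment or extra resources, the same pattern extends naturally; a sketch, assuming a venv at `~/myproject/myenv` (adjust paths, QoS, and limits to your setup):

```bash
#!/bin/bash
#SBATCH --job-name=my_script
#SBATCH --output="job.%x.%j.out"
#SBATCH --time=04:00:00
#SBATCH --qos=gpu-short
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Set up the environment on the compute node, then run
cd ~/myproject
source myenv/bin/activate
python3 your_script.py
```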
Request resources interactively:
srun --gres=gpu:1 --qos=gpu-long -w artemis --pty bash
Activate your environment, then:
jupyter notebook --port <PORT> --no-browser
In VS Code, select Existing Jupyter Server and paste the URL.
Remote tip: if connecting over SSH, forward the port:
ssh -L <PORT>:localhost:<PORT> user@artemis
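If you lose the notebook URL (including its token), you can list the running servers from the same shell; depending on your Jupyter version, one of these will show it:

```bash
# Classic notebook
jupyter notebook list
# JupyterLab / newer releases
jupyter server list
```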
Commonly useful flags:
Flag | Info |
---|---|
`--job-name=<name>` | Name of the job. |
`--output=<file>` | Output file; supports placeholders like `%x` (job name), `%j` (job ID). See sbatch docs. |
`--time=<time>` | `[days-]hours:minutes:seconds`, e.g. `1-12:00:00`. |
`--mem=<size[KMGT]>` | Memory per node, e.g. `--mem=4G`. |
`--cpus-per-task=<n>` | CPUs per task for multithreaded apps. |
`--gres=gpu:<N>` | Number/type of GPUs, e.g. `--gres=gpu:2` or `--gres=gpu:h100:2`. |
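Putting a few of these together, a hypothetical `srun` request for one GPU plus some CPU cores and memory might look like this (values are only examples):

```bash
# 2 h wallclock, 1 GPU, 4 CPU cores, 8 GB RAM
srun --time=02:00:00 --qos=gpu-short --gres=gpu:1 --cpus-per-task=4 --mem=8G python3 your_script.py
```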
Slurm exports variables like `$SLURM_JOB_NAME` and `$SLURM_JOB_ID`.
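These variables are handy inside your script, e.g. for logging or naming output directories (a minimal sketch):

```bash
# Print job metadata; SLURM_JOB_NODELIST holds the allocated node(s)
echo "Job ${SLURM_JOB_NAME} (${SLURM_JOB_ID}) on ${SLURM_JOB_NODELIST}"
mkdir -p "results/${SLURM_JOB_ID}"
```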
squeue [options]
`START_TIME` (for pending jobs) shows when Slurm expects to start the job (it may start earlier).
Prettier `squeue`:
echo 'export SQUEUE_FORMAT="%.7i %9P %35j %.8u %.2t %.12M %.12L %.5C %.7m %.4D %R"' >> ~/.bashrc
source ~/.bashrc
Change settings of pending jobs:
scontrol update JobId=<jobid> SETTING=VALUE [...]
Discover fields first:
scontrol show job <jobid>
Example:
scontrol update JobId=<jobid> TimeLimit=03:00:00
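Other fields shown by `scontrol show job` can be updated the same way while the job is pending, e.g. switching to a QoS you have access to:

```bash
# Move a pending job to another QoS (subject to your access and its limits)
scontrol update JobId=<jobid> QOS=gpu-short
```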
Cancel specific jobs:
scancel -u <your_username> <job_id> [...]
Cancel all your jobs (no prompt):
squeue -h -u <your_username> -o %i | xargs scancel
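You can also cancel by state, which is useful when you only want to clear out jobs that have not started yet:

```bash
# Cancel only your PENDING jobs; running jobs keep going
scancel -u <your_username> --state=PENDING
```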
More detailed history/status:
sacct
sacct --jobs=<jobid>
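`sacct` accepts a `--format` list if you want specific fields (see `sacct --helpformat` for the full set); for example:

```bash
# Runtime, peak memory, and exit code for one job
sacct --jobs=<jobid> --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode
```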
Common reasons (see also Slurm docs):
Reason Code | Explanation |
---|---|
Priority | Higher priority jobs ahead; yours will run eventually. |
Dependency | Waiting for dependent job(s) to complete. |
Resources | Waiting for resources (GPUs/memory/nodes). |
InvalidAccount | Bad account setting; cancel and resubmit with correct one. |
InvalidQoS | Bad QoS; cancel and resubmit. |
QOSMaxGRESPerUser | You exceeded per-user GPU quota for the chosen QoS. |
PartitionMaxJobsLimit | Partition max jobs reached. |
AssociationMaxJobsLimit | Association max jobs reached. |
JobLaunchFailure | Launch failed (bad path, FS issue, etc.). |
NonZeroExitCode | Job exited with non-zero status. |
SystemFailure | Slurm/FS/network failure. |
TimeLimit | Job hit its time limit. |
WaitingForScheduling | Reason not yet set; scheduler deciding. |
BadConstraints | Constraints cannot be satisfied. |
Sometimes you’ll see: “Nodes required for job are DOWN, DRAINED…” → often equivalent to waiting on Priority/Resources or misconfiguration.
More details:
- Reasons: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES
- Resource limits: https://slurm.schedmd.com/resource_limits.html
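To see which reason code applies to your own job, you can query it directly:

```bash
# Reason column for one job (%r), or the full record with the Reason= field
squeue -j <jobid> -o "%i %t %r"
scontrol show job <jobid> | grep -i reason
```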
Sort pending jobs by priority:
squeue --sort=-p,i --states=PD
Top entries will launch first.
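To understand where a pending job sits in that ordering, `sprio` breaks its priority value into factors (assuming the cluster uses the multifactor priority plugin):

```bash
# Per-factor priority breakdown (age, fairshare, QoS, ...)
sprio -j <jobid>
```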