- Introduction
- Launching Jobs
- Job Submission Methods
- Slurm Flags
- Monitoring Jobs
- Updating Jobs
- Canceling Jobs
- Examining Jobs
- Why is my job PENDING?
Slurm can launch jobs in three ways:
- `srun --pty` — interactive shell session
- `srun` — run a single command
- `sbatch` — submit a full script (batch)
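For orientation, here is the same toy workload launched each way; the QoS, GPU, and script values are just the placeholders used throughout this guide, and each method is shown in detail below.

```bash
srun --qos=gpu-debug --gres=gpu:1 python3 your_script.py   # single command
srun --qos=gpu-debug --gres=gpu:1 --pty bash               # interactive shell
sbatch ./test.sbatch                                       # batch script (see below)
```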
Once a job is launched, check the queue:
```bash
squeue
```

Important: After submitting, always verify your job with `squeue`. Errors (e.g., invalid QoS, bad constraints) will keep a job pending indefinitely.
Note: Some fields (like `START_TIME`) are computed once per minute and may be blank right after submission.
You will specify these flags frequently:

- `--gres=gpu:N` — number of GPUs (prefer this over `--gpus=N` unless you intend multi-node GPU placement)
- `--time=[days-]hh:mm:ss` — wallclock time limit (days optional), e.g. `1-12:00:00` for 36 hours
- `--qos=<name>` — Quality of Service, which determines limits and priority
- `-w <node>` — run on a specific node (useful if files aren’t on a shared FS yet)
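For example, a 36-hour single-GPU run pinned to a specific node might combine these flags as follows (the node and script names are placeholders; pick a QoS whose limits fit, per the table below):

```bash
srun --time=1-12:00:00 --gres=gpu:1 --qos=gpu-medium -w artemis python3 your_script.py
```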
| name | priority | max jobs | max CPUs / GPUs | max time |
|---|---|---|---|---|
| cpu | 10 | 4 | 32 / 0 | ∞ |
| gpu-debug | 20 | 1 | — / 8 | 01:00:00 |
| gpu-short | 10 | 4 | — / 4 | 04:00:00 |
| gpu-medium | 5 | 1 | — / 4 | 2-00:00:00 |
| gpu-long | 2 | 2 | — / 2 | 7-00:00:00 |
| gpu-h100 | 10 | 2 | — / 4 | 2-00:00:00 |
| gpu-h200 | 10 | 2 | — / 4 | 4-00:00:00 |
| gpu-hero | 100 | 3 | — / 3 | ∞ |
- Note: `gpu-hero` is for urgent deadlines; ask an admin for temporary access.
- Note 2: `gpu-h100` runs on Dionysus; `gpu-h200` runs on Hades.
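If you want to double-check the limits actually configured on the cluster (the table above may lag behind changes), Slurm's accounting tool can list them. This is a sketch assuming `sacctmgr` is readable by regular users here; the field names follow the sacctmgr man page and may need adjusting for your version:

```bash
# Show each QoS with its priority, wall-clock cap, and per-user job/GPU limits
sacctmgr show qos format=Name,Priority,MaxWall,MaxJobsPU,MaxTRESPU
```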
Allocate one GPU on poseidon for 1 hour:
```bash
srun --time=01:00:00 --gres=gpu:1 --qos=gpu-debug -w poseidon python3 your_script.py
```

Use tmux so the job persists after you disconnect:
```bash
# start tmux (on artemis, for example)
tmux
# inside tmux
cd myproject
source myenv/bin/activate
srun --time=04:00:00 --gres=gpu:1 --qos=gpu-long -w artemis python3 your_script.py
```

Get an interactive shell with your requested resources:
```bash
srun --time=01:00:00 --gres=gpu:1 --qos=gpu-debug -w artemis --pty bash
```

Then run `python3 your_script.py` inside the shell. Exit with CTRL-D or `exit`.
Create `test.sbatch`:
```bash
#!/bin/bash
#SBATCH --job-name=my_script
#SBATCH --output="job.%x.%j.out"
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1
#SBATCH --qos=gpu-debug

python3 your_script.py
```

Submit and check:
```bash
sbatch -w artemis ./test.sbatch
squeue
```

Output will be in `job.<name>.<jobid>.out`.
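To follow the job's output while it runs (the file name matches the `--output` pattern above; `<jobid>` is whatever `squeue` reports):

```bash
tail -f job.my_script.<jobid>.out
```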
Request resources interactively:
```bash
srun --gres=gpu:1 --qos=gpu-long -w artemis --pty bash
```

Activate your environment, then:
```bash
jupyter notebook --port <PORT> --no-browser
```

In VS Code, select Existing Jupyter Server and paste the URL.
Remote tip: if connecting over SSH, forward the port:
```bash
ssh -L <PORT>:localhost:<PORT> user@artemis
```

Commonly useful flags:
| Flag | Info |
|---|---|
| `--job-name=<name>` | Name of the job. |
| `--output=<file>` | Output file; supports placeholders like `%x` (job name), `%j` (job ID). See sbatch docs. |
| `--time=<time>` | `[days-]hours:minutes:seconds`, e.g. `1-12:00:00`. |
| `--mem=<size[KMGT]>` | Memory per node, e.g. `--mem=4G`. |
| `--cpus-per-task=<n>` | CPUs per task for multithreaded apps. |
| `--gres=gpu:<N>` | Number/type of GPUs, e.g. `--gres=gpu:2` or `--gres=gpu:h100:2`. |
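As an illustration, a CPU-only run that needs more memory and threads might combine several of these flags (the values here are arbitrary examples):

```bash
srun --qos=cpu --cpus-per-task=8 --mem=16G --time=02:00:00 python3 your_script.py
```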
Slurm exports variables like `$SLURM_JOB_NAME` and `$SLURM_JOB_ID`.
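A minimal batch-script sketch that uses these variables (the job name, QoS, and time limit are arbitrary placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --output="job.%x.%j.out"
#SBATCH --time=00:05:00
#SBATCH --qos=cpu

# Both variables are set by Slurm inside the job's environment
echo "Running ${SLURM_JOB_NAME} as job ${SLURM_JOB_ID} on $(hostname)"
```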
```bash
squeue [options]
```

`START_TIME` (for pending jobs) shows when Slurm expects to start the job (may start earlier).
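To list only your pending jobs together with their expected start times:

```bash
squeue --start -u <your_username>
```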
Prettier squeue:
```bash
echo 'export SQUEUE_FORMAT="%.7i %9P %35j %.8u %.2t %.12M %.12L %.5C %.7m %.4D %R"' >> ~/.bashrc
source ~/.bashrc
```

Change settings of pending jobs:
```bash
scontrol update job <jobid> SETTING=VALUE [...]
```

Discover fields first:
```bash
scontrol show job <jobid>
```

Example:
```bash
scontrol update job <jobid> TimeLimit=03:00:00
```

Cancel specific jobs:
```bash
scancel -u <your_username> <job_id> [...]
```

Cancel all your jobs (no prompt):
```bash
squeue -h -u <your_username> -o %i | xargs scancel
```

More detailed history/status:
```bash
sacct
sacct --jobs=<jobid>
```

Common reasons (see also Slurm docs):
| Reason Code | Explanation |
|---|---|
| Priority | Higher priority jobs ahead; yours will run eventually. |
| Dependency | Waiting for dependent job(s) to complete. |
| Resources | Waiting for resources (GPUs/memory/nodes). |
| InvalidAccount | Bad account setting; cancel and resubmit with correct one. |
| InvalidQoS | Bad QoS; cancel and resubmit. |
| QOSMaxGRESPerUser | You exceeded per-user GPU quota for the chosen QoS. |
| PartitionMaxJobsLimit | Partition max jobs reached. |
| AssociationMaxJobsLimit | Association max jobs reached. |
| JobLaunchFailure | Launch failed (bad path, FS issue, etc.). |
| NonZeroExitCode | Job exited with non-zero status. |
| SystemFailure | Slurm/FS/network failure. |
| TimeLimit | Job hit its time limit. |
| WaitingForScheduling | Reason not yet set; scheduler deciding. |
| BadConstraints | Constraints cannot be satisfied. |
Sometimes you’ll see “Nodes required for job are DOWN, DRAINED…”; this is often equivalent to waiting on Priority/Resources, or it points to a misconfigured request.
More details:
- Reasons: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES
- Resource limits: https://slurm.schedmd.com/resource_limits.html
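To check the reason reported for your own jobs directly (format codes `%T` = state and `%r` = reason, per the squeue man page):

```bash
squeue -u <your_username> -o "%.10i %.12T %r"
# or, for a single job:
scontrol show job <jobid> | grep -i reason
```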
Sort pending jobs by priority:
```bash
squeue --sort=-p,i --states=PD
```

Top entries will launch first.
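If you also want to see how those priorities break down (age, fair-share, QoS weight, etc.), `sprio` reports the individual components for pending jobs, assuming the multifactor priority plugin is enabled on this cluster:

```bash
sprio -l   # long format: one line per pending job with each priority component
```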