Slurm User Quickstart: Launching & Managing Jobs

Table of Contents

  1. Introduction
  2. Launching Jobs
  3. Job Submission Methods
  4. Slurm Flags
  5. Monitoring Jobs
  6. Updating Jobs
  7. Canceling Jobs
  8. Examining Jobs
  9. Why is my job PENDING?

Introduction

Methods to Launch Jobs

Slurm can launch jobs in three ways:

  • srun --pty — interactive shell session
  • srun — run a single command
  • sbatch — submit a full script (batch)

Checking Job Status

Once a job is launched, check the queue:

squeue

Important: After submitting, always verify your job with squeue. Errors (e.g., invalid QoS, bad constraints) will keep a job pending indefinitely.

Note: Some fields (like START_TIME) are computed once per minute and may be blank right after submission.
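
To narrow the list to your own jobs (assuming $USER holds your cluster username; recent Slurm releases also accept --me):

# only your jobs
squeue -u $USER

# on recent Slurm versions, equivalently:
squeue --me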


Launching Jobs

Important Parameters

Specify these frequently:

  • --gres=gpu:N — number of GPUs (prefer this over --gpus=N unless you intend multi-node GPU placement)
  • --time=[days-]hh:mm:ss — wallclock time (days optional), e.g. 1-12:00:00 for 36h
  • --qos=<name> — Quality of Service to choose limits/priority
  • -w <node> — run on a specific node (useful if files aren’t on a shared FS yet)

Quality of Service (QoS) Options

name        priority  max jobs  max cpus / gpus  max time
cpu         10        4         32 / 0 GPUs      —
gpu-debug   20        1         — / 8 GPUs       01:00:00
gpu-short   10        4         — / 4 GPUs       04:00:00
gpu-medium  5         1         — / 4 GPUs       2-00:00:00
gpu-long    2         2         — / 2 GPUs       7-00:00:00
gpu-h100    10        2         — / 4 GPUs       2-00:00:00
gpu-h200    10        2         — / 4 GPUs       4-00:00:00
gpu-hero    100       3         — / 3 GPUs       —
  • Note: gpu-hero is for urgent deadlines; ask an admin for temporary access.
  • Note 2: gpu-h100 runs on Dionysus; gpu-h200 runs on Hades.
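
If sacctmgr is readable by regular users on this cluster, you can check the live QoS limits yourself (the fields are standard sacctmgr columns):

sacctmgr show qos format=Name,Priority,MaxWall,MaxJobsPU,MaxTRESPU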

Job Submission Methods

Running a simple script with srun

Allocate one GPU on poseidon for 1 hour:

srun --time=01:00:00 --gres=gpu:1 --qos=gpu-debug -w poseidon python3 your_script.py

Using tmux sessions to leave jobs running

Use tmux so the job persists after you disconnect:

# start tmux (on artemis, for example)
tmux

# inside tmux
cd myproject
source myenv/bin/activate
srun --time=04:00:00 --gres=gpu:1 --qos=gpu-long -w artemis python3 your_script.py
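
To disconnect without killing the job, detach from tmux instead of closing the terminal:

# press CTRL-b, then d, to detach (the srun job keeps running)

# later, back on artemis:
tmux ls        # list sessions
tmux attach    # reattach to the most recent session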

Debugging with interactive jobs using srun --pty

Get an interactive shell with your requested resources:

srun --time=01:00:00 --gres=gpu:1 --qos=gpu-debug -w artemis --pty bash

Then run python3 your_script.py inside the shell. Exit with CTRL-D or exit.

Running complex scripts with sbatch (advanced)

Create test.sbatch:

#!/bin/bash
#SBATCH --job-name=my_script
#SBATCH --output="job.%x.%j.out"
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1
#SBATCH --qos=gpu-debug

python3 your_script.py

Submit and check:

sbatch -w artemis ./test.sbatch
squeue

Output will be in job.<name>.<jobid>.out.
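
To follow the output while the job runs (substitute the job ID reported by sbatch or squeue):

tail -f job.my_script.<jobid>.out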

Using Jupyter Notebooks with VS Code

Request resources interactively:

srun --gres=gpu:1 --qos=gpu-long -w artemis --pty bash

Activate your environment, then:

jupyter notebook --port <PORT> --no-browser

In VS Code, select Existing Jupyter Server and paste the URL.

Remote tip: if connecting over SSH, forward the port:

ssh -L <PORT>:localhost:<PORT> user@artemis

Slurm Flags

Commonly useful flags:

Flag Info
--job-name=<name> Name of the job.
--output=<file> Output file; supports placeholders like %x (job name), %j (job ID). See sbatch docs.
--time=<time> [days-]hours:minutes:seconds, e.g. 1-12:00:00.
--mem=<size[KMGT]> Memory per node, e.g. --mem=4G.
--cpus-per-task=<n> CPUs per task for multithreaded apps.
--gres=gpu:<N> Number/type of GPUs, e.g. --gres=gpu:2 or --gres=gpu:h100:2.

Slurm exports variables like $SLURM_JOB_NAME and $SLURM_JOB_ID.
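
A minimal sketch of using these variables inside an sbatch script (the job name and time limit here are arbitrary):

#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --output="job.%x.%j.out"
#SBATCH --time=00:05:00

# these variables are set by Slurm for the running job
echo "Job $SLURM_JOB_NAME ($SLURM_JOB_ID) is running on $SLURM_JOB_NODELIST"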


Monitoring Jobs

squeue [options]

START_TIME (for pending jobs) shows when Slurm expects to start the job (may start earlier).
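
To list expected start times for pending jobs explicitly:

squeue --start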

Prettier squeue:

echo 'export SQUEUE_FORMAT="%.7i %9P %35j %.8u %.2t %.12M %.12L %.5C %.7m %.4D %R"' >> ~/.bashrc
source ~/.bashrc

Updating Jobs

Change settings of pending jobs:

scontrol update job <jobid> SETTING=VALUE [...]

Discover fields first:

scontrol show job <jobid>

Example:

scontrol update job <jobid> TimeLimit=03:00:00

Canceling Jobs

Cancel specific jobs:

scancel -u <your_username> <job_id> [...]

Cancel all your jobs (no prompt):

scancel -u <your_username>
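
To cancel only your pending jobs and leave running ones untouched, use scancel's state filter:

scancel -u <your_username> --state=PENDING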

Examining Jobs

More detailed history/status:

sacct
sacct --jobs=<jobid>
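
A more informative view of a single job (the fields are standard sacct columns):

sacct --jobs=<jobid> --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode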

Why is my job PENDING?

Common reasons (see also Slurm docs):

Reason Code Explanation
Priority Higher priority jobs ahead; yours will run eventually.
Dependency Waiting for dependent job(s) to complete.
Resources Waiting for resources (GPUs/memory/nodes).
InvalidAccount Bad account setting; cancel and resubmit with correct one.
InvalidQoS Bad QoS; cancel and resubmit.
QOSMaxGRESPerUser You exceeded per-user GPU quota for the chosen QoS.
PartitionMaxJobsLimit Partition max jobs reached.
AssociationMaxJobsLimit Association max jobs reached.
JobLaunchFailure Launch failed (bad path, FS issue, etc.).
NonZeroExitCode Job exited with non-zero status.
SystemFailure Slurm/FS/network failure.
TimeLimit Job hit its time limit.
WaitingForScheduling Reason not yet set; scheduler deciding.
BadConstraints Constraints cannot be satisfied.

Sometimes you’ll see: “Nodes required for job are DOWN, DRAINED…” → often equivalent to waiting on Priority/Resources or misconfiguration.
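
To check the reason code for a specific job (%r is squeue's reason field):

squeue -j <jobid> -o "%i %T %r"

# or, with full job details:
scontrol show job <jobid> | grep -i reason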


Which job will run next?

Sort pending jobs by priority:

squeue --sort=-p,i --states=PD

Top entries will launch first.
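
If the cluster uses Slurm's multifactor priority plugin, sprio breaks each pending job's priority into its components (age, fairshare, QoS, and so on):

sprio -l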
