Slurm User Quickstart: Launching & Managing Jobs

Table of Contents

  1. Introduction
  2. Launching Jobs
  3. Job Submission Methods
  4. Slurm Flags
  5. Monitoring Jobs
  6. Updating Jobs
  7. Canceling Jobs
  8. Examining Jobs
  9. Why is my job PENDING?

Introduction

Methods to Launch Jobs

Slurm can launch jobs in three ways:

  • srun --pty — interactive shell session
  • srun — run a single command
  • sbatch — submit a full script (batch)

Checking Job Status

Once a job is launched, check the queue:

squeue

Important: After submitting, always verify your job with squeue. Errors (e.g., invalid QoS, bad constraints) will keep a job pending indefinitely.

Note: Some fields (like START_TIME) are computed once per minute and may be blank right after submission.
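
To narrow the list to your own jobs (assuming $USER holds your cluster username; recent Slurm releases also accept --me):

# only your jobs
squeue -u $USER

# on recent Slurm versions, equivalently:
squeue --me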


Launching Jobs

Important Parameters

Specify these frequently:

  • --gres=gpu:N — number of GPUs (prefer this over --gpus=N unless you intend multi-node GPU placement)
  • --time=[days-]hh:mm:ss — wallclock time (days optional), e.g. 1-12:00:00 for 36h
  • --qos=<name> — Quality of Service to choose limits/priority
  • -w <node> — run on a specific node (useful if files aren’t on a shared FS yet)

Quality of Service (QoS) Options

name        priority  max jobs  max cpus / gpus  max time
cpu         10        4         32 / 0 GPUs      —
gpu-debug   20        1         — / 8 GPUs       01:00:00
gpu-short   10        4         — / 4 GPUs       04:00:00
gpu-medium  5         1         — / 4 GPUs       2-00:00:00
gpu-long    2         2         — / 2 GPUs       7-00:00:00
gpu-h100    10        2         — / 4 GPUs       2-00:00:00
gpu-h200    10        2         — / 4 GPUs       4-00:00:00
gpu-hero    100       3         — / 3 GPUs       —
  • Note: gpu-hero is for urgent deadlines; ask an admin for temporary access.
  • Note 2: gpu-h100 runs on Dionysus; gpu-h200 runs on Hades.
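
If sacctmgr is readable by regular users on this cluster, you can check the live QoS limits yourself (the fields are standard sacctmgr columns):

sacctmgr show qos format=Name,Priority,MaxWall,MaxJobsPU,MaxTRESPU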

Job Submission Methods

Running a simple script with srun

Allocate one GPU on poseidon for 1 hour:

srun --time=01:00:00 --gres=gpu:1 --qos=gpu-debug -w poseidon python3 your_script.py

Using tmux sessions to leave jobs running

Use tmux so the job persists after you disconnect:

# start tmux (on artemis, for example)
tmux

# inside tmux
cd myproject
source myenv/bin/activate
srun --time=04:00:00 --gres=gpu:1 --qos=gpu-long -w artemis python3 your_script.py
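
To disconnect without killing the job, detach from tmux instead of closing the terminal:

# press CTRL-b, then d, to detach (the srun job keeps running)

# later, back on artemis:
tmux ls        # list sessions
tmux attach    # reattach to the most recent session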

Debugging with interactive jobs using srun --pty

Get an interactive shell with your requested resources:

srun --time=01:00:00 --gres=gpu:1 --qos=gpu-debug -w artemis --pty bash

Then run python3 your_script.py inside the shell. Exit with CTRL-D or exit.

Running complex scripts with sbatch (advanced)

Create test.sbatch:

#!/bin/bash
#SBATCH --job-name=my_script
#SBATCH --output="job.%x.%j.out"
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1
#SBATCH --qos=gpu-debug

python3 your_script.py

Submit and check:

sbatch -w artemis ./test.sbatch
squeue

Output will be in job.<name>.<jobid>.out.
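
To follow the output while the job runs (substitute the job ID reported by sbatch or squeue):

tail -f job.my_script.<jobid>.out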

Using Jupyter Notebooks with VS Code

Request resources interactively:

srun --gres=gpu:1 --qos=gpu-long -w artemis --pty bash

Activate your environment, then:

jupyter notebook --port <PORT> --no-browser

In VS Code, select Existing Jupyter Server and paste the URL.

Remote tip: if connecting over SSH, forward the port:

ssh -L <PORT>:localhost:<PORT> user@artemis

Slurm Flags

Commonly useful flags:

Flag Info
--job-name=<name> Name of the job.
--output=<file> Output file; supports placeholders like %x (job name), %j (job ID). See sbatch docs.
--time=<time> [days-]hours:minutes:seconds, e.g. 1-12:00:00.
--mem=<size[KMGT]> Memory per node, e.g. --mem=4G.
--cpus-per-task=<n> CPUs per task for multithreaded apps.
--gres=gpu:<N> Number/type of GPUs, e.g. --gres=gpu:2 or --gres=gpu:h100:2.

Slurm exports variables like $SLURM_JOB_NAME and $SLURM_JOB_ID.
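
A minimal sketch of using these variables inside an sbatch script (the job name and time limit here are arbitrary):

#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --output="job.%x.%j.out"
#SBATCH --time=00:05:00

# these variables are set by Slurm for the running job
echo "Job $SLURM_JOB_NAME ($SLURM_JOB_ID) is running on $SLURM_JOB_NODELIST"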


Monitoring Jobs

squeue [options]

START_TIME (for pending jobs) shows when Slurm expects to start the job (may start earlier).
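
To list expected start times for pending jobs explicitly:

squeue --start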

Prettier squeue:

echo 'export SQUEUE_FORMAT="%.7i %9P %35j %.8u %.2t %.12M %.12L %.5C %.7m %.4D %R"' >> ~/.bashrc
source ~/.bashrc

Updating Jobs

Change settings of pending jobs:

scontrol update job <jobid> SETTING=VALUE [...]

Discover fields first:

scontrol show job <jobid>

Example:

scontrol update job <jobid> TimeLimit=03:00:00

Canceling Jobs

Cancel specific jobs:

scancel -u <your_username> <job_id> [...]

Cancel all your jobs (no prompt):

scancel -u <your_username>
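
To cancel only your pending jobs and leave running ones untouched, use scancel's state filter:

scancel -u <your_username> --state=PENDING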

Examining Jobs

More detailed history/status:

sacct
sacct --jobs=<jobid>
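
A more informative view of a single job (the fields are standard sacct columns):

sacct --jobs=<jobid> --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode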

Why is my job PENDING?

Common reasons (see also Slurm docs):

Reason Code Explanation
Priority Higher priority jobs ahead; yours will run eventually.
Dependency Waiting for dependent job(s) to complete.
Resources Waiting for resources (GPUs/memory/nodes).
InvalidAccount Bad account setting; cancel and resubmit with correct one.
InvalidQoS Bad QoS; cancel and resubmit.
QOSMaxGRESPerUser You exceeded per-user GPU quota for the chosen QoS.
PartitionMaxJobsLimit Partition max jobs reached.
AssociationMaxJobsLimit Association max jobs reached.
JobLaunchFailure Launch failed (bad path, FS issue, etc.).
NonZeroExitCode Job exited with non-zero status.
SystemFailure Slurm/FS/network failure.
TimeLimit Job hit its time limit.
WaitingForScheduling Reason not yet set; scheduler deciding.
BadConstraints Constraints cannot be satisfied.

Sometimes you’ll see: “Nodes required for job are DOWN, DRAINED…” → often equivalent to waiting on Priority/Resources or misconfiguration.
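
To check the reason code for a specific job (%r is squeue's reason field):

squeue -j <jobid> -o "%i %T %r"

# or, with full job details:
scontrol show job <jobid> | grep -i reason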


Which job will run next?

Sort pending jobs by priority:

squeue --sort=-p,i --states=PD

Top entries will launch first.
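
If the cluster uses Slurm's multifactor priority plugin, sprio breaks each pending job's priority into its components (age, fairshare, QoS, and so on):

sprio -l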
