Slurm Admin Notes: Cluster Management & QoS Setup

Prereq: Slurm is installed and slurmdbd is configured/running. Controller and nodes share the same slurm.conf (including AccountingStorageType=accounting_storage/slurmdbd) and Munge works across nodes.
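
For reference, the accounting-related lines in /etc/slurm/slurm.conf usually look roughly like this (a minimal sketch; the storage host is a placeholder, and QoS limits are only enforced if AccountingStorageEnforce includes qos):

ClusterName=sardine-cluster
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=<slurmdbd-hostname>
AccountingStorageEnforce=associations,limits,qos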

If Slurm isn’t installed yet, follow: https://mtreviso.github.io/blog/slurm.html


Check or Create the Cluster (sardine-cluster)

Show clusters:

sudo sacctmgr show clusters

Create if missing:

sudo sacctmgr add cluster sardine-cluster

Note: Changes require a healthy slurmdbd and a matching ClusterName in /etc/slurm/slurm.conf.
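
A quick way to confirm which cluster name the controller is actually using:

sudo scontrol show config | grep -i clustername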


Add the sardine Account

List:

sudo sacctmgr show account

Create if missing:

sudo sacctmgr add account sardine Description="SARDINE" Organization=sardine
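
To see the resulting hierarchy (associations are what tie users and accounts to the cluster), list the association tree:

sudo sacctmgr show assoc tree format=Cluster,Account,User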

QoS (Quality of Service)

QoS controls job limits and priority. Higher numeric priority runs sooner (e.g., priority=100 > priority=10).

List:

sudo sacctmgr show qos

Delete (example):

sudo sacctmgr delete qos gpu-debug

Add (example policy set—tune to taste):

sudo sacctmgr add qos cpu        set priority=10  MaxJobsPerUser=4 MaxTRESPerUser=cpu=32,mem=128G,gres/gpu=0
sudo sacctmgr add qos gpu-debug  set priority=20  MaxJobsPerUser=1 MaxTRESPerUser=gres/gpu=8  MaxWallDurationPerJob=01:00:00
sudo sacctmgr add qos gpu-short  set priority=10  MaxJobsPerUser=4 MaxTRESPerUser=gres/gpu=4  MaxWallDurationPerJob=04:00:00
sudo sacctmgr add qos gpu-medium set priority=5   MaxJobsPerUser=1 MaxTRESPerUser=gres/gpu=4  MaxWallDurationPerJob=2-00:00:00
sudo sacctmgr add qos gpu-long   set priority=2   MaxJobsPerUser=2 MaxTRESPerUser=gres/gpu=2  MaxWallDurationPerJob=7-00:00:00
sudo sacctmgr add qos gpu-h100   set priority=10  MaxJobsPerUser=2 MaxTRESPerUser=gres/gpu=4  MaxWallDurationPerJob=2-00:00:00
sudo sacctmgr add qos gpu-h200   set priority=10  MaxJobsPerUser=2 MaxTRESPerUser=gres/gpu=4  MaxWallDurationPerJob=4-00:00:00
sudo sacctmgr add qos gpu-hero   set priority=100 MaxJobsPerUser=3 MaxTRESPerUser=gres/gpu=3
  • priority — higher means earlier dispatch (subject to other factors)
  • MaxJobsPerUser — limit concurrent jobs per user in that QoS
  • MaxTRESPerUser — cap total resources (e.g., gres/gpu=4)
  • MaxWallDurationPerJob — per-job wallclock limit
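
After adding or tweaking a QoS, a formatted listing makes it easy to double-check the limits (the fields below are standard sacctmgr format names):

sudo sacctmgr show qos format=Name,Priority,MaxJobsPU,MaxTRESPU,MaxWall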

Modify QoS:

sudo sacctmgr update qos gpu-debug set priority=20

Unset a value:

sudo sacctmgr update qos gpu-debug set priority=-1
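
The same -1 convention clears other limits as well, for example the wallclock cap:

sudo sacctmgr update qos gpu-debug set MaxWallDurationPerJob=-1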

Users

List:

sudo sacctmgr show user -s

Add a user with allowed QoS:

sudo sacctmgr create user --immediate name=mtreviso account=sardine QOS=gpu-debug,gpu-short,gpu-medium,gpu-long

--immediate (equivalent to -i, used below) skips the interactive confirmation prompt. Omit it if you prefer to review changes before they are committed.
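
To confirm the user ended up with the intended account and QoS list:

sudo sacctmgr show assoc where user=mtreviso format=User,Account,QOS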

Modify:

sudo sacctmgr -i modify user where name=mtreviso set QOS=gpu-debug,gpu-short,gpu-medium,gpu-long
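
sacctmgr also accepts += and -= on the QOS field, which is handy for granting or revoking a single QoS without retyping the whole list:

sudo sacctmgr -i modify user where name=mtreviso set QOS+=gpu-hero
sudo sacctmgr -i modify user where name=mtreviso set QOS-=gpu-hero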

Delete:

sudo sacctmgr delete user mtreviso
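
On the user side, jobs pick a QoS at submission time. A minimal sbatch sketch (job name, GPU count, and time limit are placeholders; adjust to the QoS limits above):

#!/bin/bash
#SBATCH --job-name=qos-test
#SBATCH --qos=gpu-short
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
nvidia-smi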

Troubleshooting

See reasons for drained nodes:

sudo sinfo -R

Draining due to a memory mismatch (the reason typically shows as "Low RealMemory"):

  • Ensure the node hardware lines in /etc/slurm/slurm.conf match the hardware reported by sudo slurmd -C and the real memory shown by free -m.
  • Update slurm.conf on all nodes (restart the daemons or run scontrol reconfigure so the change takes effect), then clear the drain state:

sudo scontrol update NodeName=<nodename> State=RESUME
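
To compare what the node detects against what the controller has configured (node name is a placeholder):

sudo slurmd -C
sudo scontrol show node <nodename> | grep -E 'RealMemory|CfgTRES'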

Service restart order on the controller:

sudo systemctl restart slurmdbd
sudo systemctl restart slurmctld
sudo systemctl restart slurmd
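
A few quick sanity checks after restarting (these only confirm the daemons are up and registered):

sudo scontrol ping
sudo sinfo
sudo sacctmgr show cluster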

Check logs:

  • /var/log/slurm/slurmdbd.log
  • /var/log/slurm/slurmctld.log
  • /var/log/slurm/slurmd.log
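
Since the daemons are managed by systemd here, journalctl is often quicker than tailing the files directly:

sudo journalctl -u slurmctld -f
sudo journalctl -u slurmdbd -f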

Helpful References

  • Slurm installation walkthrough: https://mtreviso.github.io/blog/slurm.html
  • sacctmgr manual: https://slurm.schedmd.com/sacctmgr.html
  • Slurm QoS documentation: https://slurm.schedmd.com/qos.html