Prereq: Slurm is installed and slurmdbd is configured and running. The controller and nodes share the same slurm.conf (including AccountingStorageType=accounting_storage/slurmdbd), and Munge works across nodes.
If Slurm isn’t installed yet, follow: https://mtreviso.github.io/blog/slurm.html
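To sanity-check these prerequisites before touching accounting, a quick pass like the following can help (assuming standard systemd unit names for slurmdbd and munge):
sudo systemctl status slurmdbd munge
scontrol show config | grep -E 'ClusterName|AccountingStorageType'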
Show clusters:
sudo sacctmgr show clusters
Create if missing:
sudo sacctmgr add cluster sardine-cluster
Note: Changes require a healthy slurmdbd and a matching ClusterName in /etc/slurm/slurm.conf.
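To confirm the names line up, compare the registered cluster with the config:
grep -i ClusterName /etc/slurm/slurm.conf
sudo sacctmgr show clusters format=Cluster,ControlHost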
List:
sudo sacctmgr show account
Create if missing:
sudo sacctmgr add account sardine Description="SARDINE" Organization=sardine
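To verify the account was stored, a formatted listing keeps the output compact (a sketch; field names follow the sacctmgr account listing):
sudo sacctmgr show account name=sardine format=Account,Description,Organization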
QoS controls job limits and priority. Higher numeric priority runs sooner (e.g., priority=100 > priority=10).
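A formatted listing makes the relative priorities and limits easy to compare (column names may vary slightly across Slurm versions):
sudo sacctmgr show qos format=Name,Priority,MaxJobsPU,MaxTRESPU,MaxWall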
List:
sudo sacctmgr show qos
Delete (example):
sudo sacctmgr delete qos gpu-debug
Add (example policy set—tune to taste):
sudo sacctmgr add qos cpu set priority=10 MaxJobsPerUser=4 MaxTRESPerUser=cpu=32,mem=128G,gres/gpu=0
sudo sacctmgr add qos gpu-debug set priority=20 MaxJobsPerUser=1 MaxTRESPerUser=gres/gpu=8 MaxWallDurationPerJob=01:00:00
sudo sacctmgr add qos gpu-short set priority=10 MaxJobsPerUser=4 MaxTRESPerUser=gres/gpu=4 MaxWallDurationPerJob=04:00:00
sudo sacctmgr add qos gpu-medium set priority=5 MaxJobsPerUser=1 MaxTRESPerUser=gres/gpu=4 MaxWallDurationPerJob=2-00:00:00
sudo sacctmgr add qos gpu-long set priority=2 MaxJobsPerUser=2 MaxTRESPerUser=gres/gpu=2 MaxWallDurationPerJob=7-00:00:00
sudo sacctmgr add qos gpu-h100 set priority=10 MaxJobsPerUser=2 MaxTRESPerUser=gres/gpu=4 MaxWallDurationPerJob=2-00:00:00
sudo sacctmgr add qos gpu-h200 set priority=10 MaxJobsPerUser=2 MaxTRESPerUser=gres/gpu=4 MaxWallDurationPerJob=4-00:00:00
sudo sacctmgr add qos gpu-hero set priority=100 MaxJobsPerUser=3 MaxTRESPerUser=gres/gpu=3
- priority: higher means earlier dispatch (subject to other factors)
- MaxJobsPerUser: limit on concurrent jobs per user in that QoS
- MaxTRESPerUser: cap on total resources per user (e.g., gres/gpu=4)
- MaxWallDurationPerJob: per-job wallclock limit
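Once a QoS exists and has been granted to a user (see the user section below), jobs request it at submission time. A minimal sketch, assuming the gpu-short policy above and a placeholder job script:
sbatch --qos=gpu-short --gres=gpu:1 --time=02:00:00 job.sh   # job.sh is a placeholder script name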
Modify QoS:
sudo sacctmgr update qos gpu-debug set priority=20
Unset a value:
sudo sacctmgr update qos gpu-debug set priority=-1
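Other limits can be cleared the same way; setting a value to -1 removes it. A sketch, assuming the gpu-debug QoS above:
sudo sacctmgr update qos gpu-debug set MaxTRESPerUser=gres/gpu=-1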
List:
sudo sacctmgr show user -s
Add a user with allowed QoS:
sudo sacctmgr create user --immediate name=mtreviso account=sardine QOS=gpu-debug,gpu-short,gpu-medium,gpu-long
--immediate (short form -i, as used below) skips the interactive confirmation. Omit it if you prefer to review changes.
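To confirm which account and QoS list the user ended up with:
sudo sacctmgr show assoc user=mtreviso format=Cluster,Account,User,QOS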
Modify:
sudo sacctmgr -i modify user where name=mtreviso set QOS=gpu-debug,gpu-short,gpu-medium,gpu-long
Delete:
sudo sacctmgr delete user mtreviso
See reasons for drained nodes:
sudo sinfo -R
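For more detail on a single node, including its State and Reason string:
sudo scontrol show node <nodename> | grep -iE 'state|reason'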
Draining due to memory mismatch:
- Ensure the node hardware lines in /etc/slurm/slurm.conf match the output of sudo slurmd -C and the real memory reported by free -m.
- Update slurm.conf on all nodes, then:
sudo scontrol update NodeName=<nodename> State=RESUME
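A quick way to spot the mismatch is to compare what slurmd detects with the configured node line, run on the affected node:
sudo slurmd -C                            # prints a NodeName=... line with detected CPUs and RealMemory
grep '<nodename>' /etc/slurm/slurm.conf   # the configured node definition
free -m                                   # RealMemory in slurm.conf should not exceed this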
Restart services in this order (slurmdbd and slurmctld on the controller, slurmd on each compute node):
sudo systemctl restart slurmdbd
sudo systemctl restart slurmctld
sudo systemctl restart slurmd
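A quick health check afterwards:
scontrol ping                   # slurmctld responding?
sinfo                           # node states look sane?
sudo sacctmgr show clusters     # slurmdbd reachable?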
Check logs:
/var/log/slurm/slurmdbd.log
/var/log/slurm/slurmctld.log
/var/log/slurm/slurmd.log
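Tailing the relevant log while restarting a service usually pinpoints the failure, for example:
sudo tail -f /var/log/slurm/slurmctld.log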
- Priority multifactor: https://slurm.schedmd.com/priority_multifactor.html
- squeue reason codes: https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES
- Resource limits: https://slurm.schedmd.com/resource_limits.html
- Handy scripts repo: https://github.com/cdt-data-science/cluster-scripts