
tl;dr

Gather Cluster Information

# show a summary of partitions and node states
$ sinfo

# show detailed, node-oriented information
$ sinfo -Nl
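
sinfo also accepts a custom output format; the format string below is just one example (%n hostname, %t state, %c CPUs, %m memory in MB):

# show hostname, state, CPU count, and memory for each node
$ sinfo -N -o "%n %t %c %m"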

Update node states

# drain nodes for maintenance, e.g. nodes=worker[01-02],worker08
$ scontrol update NodeName=${nodes} State=DRAIN Reason="Maintenance"

# resume nodes
$ scontrol update NodeName=worker[01-02] State=RESUME
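
To confirm which nodes are down or drained and why, sinfo can list the recorded reasons:

# list down/drained nodes together with their Reason
$ sinfo -R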

Run jobs

# run a job
$ srun -N1 hostname

# run jobs on specific nodes
$ srun --nodelist=compute-[0-5] hostname

# run jobs on a specific partition
$ srun -p ${PARTITION} --nodelist=compute-[0-5] hostname

# run a job via srun on 2 nodes (using dd to simulate a CPU-intensive job)
$ srun -N2 dd if=/dev/zero of=/dev/null

# run a job with a time constraint. Accepted formats:
# - minutes
# - minutes:seconds
# - hours:minutes:seconds
# - days-hours
# - days-hours:minutes
# - days-hours:minutes:seconds
$ srun -N2 --time=01:00 dd if=/dev/zero of=/dev/null

# log in to a node (interactive shell)
$ srun -N 1 --pty /bin/bash
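
As a sketch, common resource options can be combined in a single srun invocation; the values below are arbitrary examples:

# request 1 node, 4 tasks, 2 CPUs per task, 4 GB per node, and a 10-minute limit
$ srun -N1 -n4 --cpus-per-task=2 --mem=4G --time=10:00 hostname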

Reservation

# reserve nodes for a user to test
$ scontrol create reservation ReservationName=maintain \
  starttime=now duration=120 user=root flags=maint,ignore_jobs nodes=ALL
# jobs must specify the reservation; otherwise, they will not run
$ srun --reservation=maintain ping 8.8.8.8 > /dev/null 2>&1 &

# show reservations
$ scontrol show res

# delete a reservation
$ scontrol delete ReservationName=maintain
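
A reservation does not have to cover the whole cluster; it can target specific nodes and users. The node list and user name below are placeholders:

# reserve two nodes for one user for 60 minutes
$ scontrol create reservation ReservationName=test \
  starttime=now duration=60 user=worker nodes=compute-[0-1]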

sbatch & salloc

salloc sets up a Slurm allocation and starts a shell inside it with the requested resources, such as CPUs or memory. The idea is similar to running python -m venv venv && source venv/bin/activate to prepare an environment before starting a process. For instance,

$ salloc --nodelist=compute-[0-1]
salloc: Granted job allocation 26744
salloc: Waiting for resource configuration
salloc: Nodes compute-[0-1] are ready for job
$ srun hostname
compute-0
compute-1
# running exit or pressing ctrl+d leaves the current salloc environment.

As the example shows, salloc pins subsequent Slurm commands to the allocated nodes, so when srun runs, its tasks land on compute-[0-1]. sbatch works much like salloc, except that the resources are defined inside a shell script.

#!/bin/bash
#SBATCH --nodelist=compute-[0-1]
srun hostname

This script yields the same result as the salloc example. In addition, by adding scontrol show hostnames to the batch script, sbatch can record which nodes the job is running on.

#!/bin/bash

# %x: job name, %j: job id
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.out

# scontrol show hostnames expands SLURM_JOB_NODELIST into one hostname per line
HOSTFILE="hosts_${SLURM_JOB_ID}"
scontrol show hostnames | sort > "$HOSTFILE"

Submitting the script produces a host file listing the nodes allocated to the job.

$ sbatch -N 5 get_host_file.sh       
Submitted batch job 27009
$ cat hosts_27009
compute-53
compute-54
compute-55
compute-56
compute-57
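
For reference, a minimal sbatch template with commonly used directives; the job name, partition, and resource values are placeholders:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=debug
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --output=logs/%x_%j.out

srun hostname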

Cancel jobs

# cancel a job
$ scancel "${jobid}"

# cancel a job and disable warnings
$ scancel -q "${jobid}"

# cancel all jobs that belong to an account
$ scancel --account="${account}"

# cancel all jobs that belong to a partition
$ scancel --partition="${partition}"

# cancel all pending jobs
$ scancel --state="PENDING"

# cancel all running jobs
$ scancel --state="RUNNING"

# cancel all jobs
$ squeue -l | awk '{ print $1 }' | grep '[[:digit:]].*' | xargs scancel

# cancel all jobs (using state option)
$ for s in "RUNNING" "PENDING" "SUSPENDED"; do scancel --state="$s"; done
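
scancel can also filter by user or by job name; ${jobname} below is a placeholder:

# cancel all of your own jobs
$ scancel -u "$USER"

# cancel jobs by name
$ scancel --name="${jobname}"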

Account & User

# create a cluster (the cluster name should be identical to ClusterName in slurm.conf)
$ sacctmgr add cluster clustername

# create an account
$ sacctmgr -i add account worker description="worker account" Organization="your.org"

# create a user with a default account
$ sacctmgr create user name=worker DefaultAccount=default

# create a user and add it to an additional account
$ sacctmgr -i create user "worker" account="worker" adminlevel="None"

# modify user fairshare configuration
$ sacctmgr modify user where name="worker" account="worker" set fairshare=0

# remove a user from an account
$ sacctmgr remove user "worker" where account="worker"

# show all accounts
$ sacctmgr show account

# show all accounts with their associations
$ sacctmgr show account -s
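
To inspect a single user or the full association tree (the user name "worker" follows the examples above):

# show one user together with its associations
$ sacctmgr show user name=worker -s

# show all associations on the cluster
$ sacctmgr show association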
