# show basic cluster information (partitions and node states)
$ sinfo
# display detailed node-oriented information
$ sinfo -Nl
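# show a condensed per-partition summary of node states
$ sinfo -s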
# take nodes down for maintenance, e.g. nodes=worker[01-02],worker08
$ scontrol update NodeName=${nodes} State=DOWN Reason="Maintain"
# resume nodes
$ scontrol update NodeName=worker[01-02] State=Resume
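# verify node states and the reasons nodes are down or drained
$ sinfo -R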
# run a job
$ srun -N1 hostname
# run jobs on specific nodes
$ srun --nodelist=compute-[0-5] hostname
# run jobs on specific partition
$ srun -p ${PARTITION} --nodelist=compute-[0-5] hostname
# run a job via srun on 2 nodes (using dd to simulate a CPU-intensive job)
$ srun -N2 dd if=/dev/zero of=/dev/null
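# check the queue to confirm the job is running (current user's jobs only)
$ squeue -u "$USER"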
# run a job with a time constraint. Accepted formats:
# - minutes
# - minutes:seconds
# - hours:minutes:seconds
# - days-hours
# - days-hours:minutes
# - days-hours:minutes:seconds
$ srun -N2 --time=01:00 dd if=/dev/zero of=/dev/null
# log in to a node (interactive shell)
$ srun -N 1 --pty /bin/bash
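# an interactive shell with explicit resources (the CPU, memory, and time values below are only illustrative)
$ srun -N1 -c4 --mem=8G --time=60 --pty /bin/bash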
# reserve nodes for a user to test
$ scontrol create reservation ReservationName=maintain \
starttime=now duration=120 user=root flags=maint,ignore_jobs nodes=ALL
# jobs must specify the reservation; otherwise, they will not run
$ srun --reservation=maintain ping 8.8.8.8 > /dev/null 2>&1 &
# show reservations
$ scontrol show res
# delete a reservation
$ scontrol delete ReservationName=maintain
salloc is used to set up a Slurm allocation, starting a shell with the requested resources such as CPUs or memory. The idea is loosely akin to running python -m venv venv && source venv/bin/activate to establish an environment before running a process. For instance,
$ salloc --nodelist=compute-[0-1]
salloc: Granted job allocation 26744
salloc: Waiting for resource configuration
salloc: Nodes compute-[0-1] are ready for job
$ srun hostname
compute-0
# running exit (or pressing Ctrl+D) leaves the current salloc environment
As shown above, salloc pins subsequent Slurm jobs to the nodes compute-0 and compute-1, so when srun is executed inside the allocation, its tasks run on compute-[0-1]. sbatch works much like salloc, except that the resources are defined inside a shell script.
#!/bin/bash
#SBATCH --nodelist=compute-[0-1]
srun hostname
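Assuming the script above is saved as, say, sbatch_hostname.sh (the filename is only illustrative), it is submitted with:
$ sbatch sbatch_hostname.sh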
This yields the same outcome as the salloc example above. Additionally, by calling scontrol show hostnames inside the batch script, sbatch can report which nodes the job is running on.
#!/bin/bash
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.out
HOSTFILE="hosts_${SLURM_JOB_ID}"
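# with no argument, "scontrol show hostnames" expands the current job's SLURM_JOB_NODELIST, one hostname per line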
scontrol show hostnames | sort > "$HOSTFILE"
After submitting the script, the generated host file lists the nodes allocated to the job.
$ sbatch -N 5 get_host_file.sh
Submitted batch job 27009
$ cat hosts_27009
compute-53
compute-54
compute-55
compute-56
compute-57
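# the host file can be reused, e.g. as a node list for srun (--nodelist treats a value containing "/" as a filename)
$ srun --nodelist=./hosts_27009 hostname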
# cancel a job
$ scancel "${jobid}"
# cancel a job and disable warnings
$ scancel -q "${jobid}"
# cancel all jobs that belong to an account
$ scancel --account="${account}"
# cancel all jobs that belong to a partition
$ scancel --partition="${partition}"
# cancel all pending jobs
$ scancel --state="PENDING"
# cancel all running jobs
$ scancel --state="RUNNING"
# cancel all jobs
$ squeue -l | awk '{ print $1 }' | grep '[[:digit:]].*' | xargs scancel
# cancel all jobs (using state option)
$ for s in "RUNNING" "PENDING" "SUSPENDED"; do scancel --state="$s"; done
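# cancel all jobs belonging to the current user
$ scancel -u "$USER"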
# create a cluster (the clustername should be identical to ClusterName in slurm.conf)
$ sacctmgr add cluster clustername
# create an account
$ sacctmgr -i add account worker description="worker account" Organization="your.org"
# create a user and set its default account
$ sacctmgr create user name=worker DefaultAccount=default
# create a user and add it to additional accounts
$ sacctmgr -i create user "worker" account="worker" adminlevel="None"
# modify user fairshare configuration
$ sacctmgr modify user where name="worker" account="worker" set fairshare=0
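# inspect the resulting fairshare and usage tree
$ sshare -a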
# remove a user from an account
$ sacctmgr remove user "worker" where account="worker"
# show all accounts
$ sacctmgr show account
# show all accounts with their associations
$ sacctmgr show account -s
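# show all users with their account associations
$ sacctmgr show user -s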