# Slurm to Kubernetes Cheat Sheet
| Slurm Concept | Kubernetes Equivalent | Description |
|---|---|---|
| Cluster | Cluster | Overall compute infrastructure |
| Node | Node | Physical/virtual machine in the cluster |
| Partition | Namespace + Resource Quotas | Logical division of resources (see example below) |
| Account | RBAC Roles and RoleBindings | Access control mechanisms |
| Job | Pod/Job/CronJob | Unit of work to be executed |
| Job Step | Container | Process within a job |
| QOS | PriorityClass | Job priority and resource limits |
| Allocation | Resource Request | Specification of required resources |
| Environment Module | Container Image | Software packaging |
| slurm.conf | ConfigMaps and Secrets | Configuration management |
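The Partition → Namespace + Resource Quotas row is the least literal mapping, so here is a minimal sketch of what it can look like in practice. The namespace name `gpu-partition` and all quota values are illustrative, not defaults:

```yaml
# Sketch: approximating a Slurm "gpu" partition with a Namespace plus a ResourceQuota.
# The namespace name and every quota value below are illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-partition
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-partition-quota
  namespace: gpu-partition
spec:
  hard:
    requests.cpu: "256"            # total CPU requests allowed in the namespace
    requests.memory: 1Ti           # total memory requests
    requests.nvidia.com/gpu: "32"  # total GPUs (extended resource quota)
    pods: "100"                    # cap on concurrently existing pods
```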
## Command Translation Table

### System Information Commands
| Slurm Command | Kubernetes Command | Purpose |
|---|---|---|
| `sinfo` | `kubectl get nodes` | List all nodes |
| `sinfo -N -l` | `kubectl describe nodes` | Detailed node info |
| `sinfo -a` | `kubectl get nodes --show-labels` | Show all partitions / node labels |
| `sinfo -p partition_name` | `kubectl get nodes -l partition=name` | Info for a specific partition/label (see note below) |
| `scontrol show partition` | `kubectl get namespaces` | List partitions/namespaces |
| `scontrol show config` | `kubectl cluster-info` | Cluster configuration |
| `sacctmgr show qos` | `kubectl get priorityclasses` | Show QoS/priority settings |
| `sshare` | `kubectl describe resourcequotas` | Check fair-share info |
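The partition-style queries above assume that nodes carry a `partition` label. Kubernetes does not define one by default, so a site would set this up along the following lines; the node names and the label key/value are assumptions:

```bash
# Label nodes to mimic Slurm partitions; "partition" is a site convention,
# not a built-in Kubernetes label. Node names are placeholders.
kubectl label node worker-01 partition=gpu
kubectl label node worker-02 partition=gpu

# The cheat-sheet query then works as written:
kubectl get nodes -l partition=gpu
```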
### Job Management Commands

| Slurm Command | Kubernetes Command | Purpose |
|---|---|---|
| `sbatch job.sh` | `kubectl apply -f job.yaml` | Submit a job |
| `squeue` | `kubectl get jobs` | List all jobs |
| `squeue -u <user>` | `kubectl get pods --selector=user=<username>` | List a user's jobs (see labeling note below) |
| `squeue -p <partition>` | `kubectl get pods -n <namespace>` | Jobs in a partition/namespace |
| `scancel <jobid>` | `kubectl delete job <jobname>` | Cancel a job |
| `scancel -u <user>` | `kubectl delete pods --selector=user=<username>` | Cancel all of a user's jobs |
| `scontrol show job <jobid>` | `kubectl describe job <jobname>` | Job details |
| `scontrol hold <jobid>` | `kubectl patch job <name> -p '{"spec":{"suspend":true}}'` | Hold/suspend a job |
| `scontrol release <jobid>` | `kubectl patch job <name> -p '{"spec":{"suspend":false}}'` | Release a held job |
| `sacct -j <jobid>` | `kubectl logs <podname>` | Job output |
| `sacct -u <user>` | `kubectl logs --selector=user=<username>` | A user's job logs |
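The `--selector=user=<username>` commands above only behave like `squeue -u` / `scancel -u` if jobs are labeled at submission time, since Kubernetes does not attach a submitting-user label on its own. A minimal sketch of that convention follows; the `user` key, the user name, and the image are placeholders:

```yaml
# Sketch: a Job labeled with the submitting user so that
#   kubectl get pods --selector=user=alice
#   kubectl delete pods --selector=user=alice
# mimic squeue -u / scancel -u. The "user" key is a site convention.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
  labels:
    user: alice            # placeholder user name
spec:
  template:
    metadata:
      labels:
        user: alice        # label the Pods too; pod-level selectors match here
    spec:
      containers:
      - name: main
        image: ubuntu
        command: ["echo", "hello from alice"]
      restartPolicy: Never
```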
### Interactive Sessions

| Slurm Command | Kubernetes Command | Purpose |
|---|---|---|
| `srun --pty bash` | `kubectl run -it --rm debug --image=ubuntu -- bash` | Interactive shell |
| `salloc` | `kubectl apply -f interactive-pod.yaml` | Allocate resources (see the sketch below) |
| `salloc --gres=gpu:1` | `kubectl run -it --rm debug --image=nvidia/cuda --overrides='{"spec":{"containers":[{"name":"debug","resources":{"limits":{"nvidia.com/gpu":1}}}]}}'` | Allocate GPU resources |
| `ssh <node>` | `kubectl debug node/<node-name> -it --image=ubuntu` | Access a node directly |
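The table refers to an `interactive-pod.yaml` without showing it. A minimal sketch, assuming a plain Ubuntu image and modest resource requests, might be:

```yaml
# Sketch of interactive-pod.yaml: a long-running pod to exec into,
# roughly what salloc gives you in Slurm. Image and sizes are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: interactive
spec:
  containers:
  - name: shell
    image: ubuntu
    command: ["sleep", "infinity"]   # keep the pod alive until it is deleted
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "2"
        memory: 4Gi
  restartPolicy: Never
```

Attach with `kubectl exec -it interactive -- bash`, and delete the pod when finished, which is the rough analogue of exiting the `salloc` session.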
## Job Specification Examples

### Slurm Batch Script Example
```bash
#!/bin/bash
#SBATCH --job-name=ml_training
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --mail-type=ALL
#SBATCH --mail-user=[email protected]

module load cuda/11.7
module load python/3.9

cd /path/to/project
python train.py --epochs 100
```
### Equivalent Kubernetes Job YAML
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training
  labels:
    user: username
spec:
  parallelism: 2        # similar to --nodes=2
  completions: 2
  template:
    metadata:
      labels:
        job-name: ml-training
    spec:
      containers:
      - name: training-container
        image: organization/ml-framework:cuda11.7-python3.9
        resources:
          requests:
            memory: "32Gi"
            cpu: "8"            # 4 tasks * 2 cpus-per-task
            nvidia.com/gpu: 4
          limits:
            memory: "32Gi"
            cpu: "8"
            nvidia.com/gpu: 4
        command: ["/bin/bash", "-c"]
        args:
        - cd /workspace && python train.py --epochs 100
        volumeMounts:
        - name: workspace
          mountPath: /workspace
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: project-data-pvc
      restartPolicy: Never
  backoffLimit: 0
  activeDeadlineSeconds: 43200    # 12 hours in seconds
```
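Submitting and monitoring this Job then uses the commands from the translation tables above; the manifest filename below is an assumption:

```bash
# Submit (≈ sbatch job.sh); use whatever filename you saved the manifest under
kubectl apply -f ml-training-job.yaml

# Watch status (≈ squeue)
kubectl get job ml-training
kubectl get pods -l job-name=ml-training

# Stream output (≈ tailing output_%j.log)
kubectl logs -f job/ml-training

# Cancel (≈ scancel)
kubectl delete job ml-training
```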
### Slurm Job Array Example

```bash
#SBATCH --array=1-10

echo "Task ID: $SLURM_ARRAY_TASK_ID"
```
### Equivalent Kubernetes Indexed Job

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: array-job
spec:
  completionMode: Indexed   # exposes the per-pod completion index used below
  parallelism: 10
  completions: 10
  template:
    spec:
      containers:
      - name: array-task
        image: ubuntu
        command: ["/bin/bash", "-c"]
        args:
        - echo "Task ID: $JOB_COMPLETION_INDEX"
        env:
        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
      restartPolicy: Never
```
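One behavioural difference worth noting: Slurm's `--array=1-10` indices run 1 through 10, while the indexed Job's completion indices run 0 through 9. Submission and inspection look like this (the manifest filename is an assumption):

```bash
# Submit the indexed Job (≈ sbatch --array=1-10 job.sh)
kubectl apply -f array-job.yaml

# One pod per completion index
kubectl get pods -l job-name=array-job

# Each pod's log prints its own index (0-9 rather than Slurm's 1-10)
kubectl logs -l job-name=array-job --prefix --max-log-requests=10
```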
### Slurm MPI Job Example

```bash
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=4

mpirun python mpi_script.py
```
### Kubernetes with MPI Operator
```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: mpi-job
spec:
  slotsPerWorker: 4
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi-launcher
            image: mpi-image
            command: ["mpirun", "python", "mpi_script.py"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: mpi-worker
            image: mpi-image
            resources:
              limits:
                cpu: 4
                memory: 8Gi
```
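Assuming the Kubeflow MPI Operator is already installed in the cluster, submission is just another `kubectl apply`. The filename is an assumption, and the launcher pod lookup below relies on the operator's usual `<job-name>-launcher` naming convention:

```bash
# Submit the MPIJob (requires the Kubeflow MPI Operator CRDs/controller)
kubectl apply -f mpi-job.yaml

# Inspect the MPIJob and its launcher/worker pods
kubectl get mpijob mpi-job
kubectl get pods | grep mpi-job

# mpirun output lands in the launcher pod's logs
kubectl logs -f "$(kubectl get pods -o name | grep mpi-job-launcher | head -n 1)"
```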