
Slurm to Kubernetes Cheat Sheet

Conceptual Mapping

| Slurm Concept | Kubernetes Equivalent | Description |
|---|---|---|
| Cluster | Cluster | Overall compute infrastructure |
| Node | Node | Physical/virtual machine in the cluster |
| Partition | Namespace + ResourceQuota | Logical division of resources (see example below) |
| Account | RBAC Roles and RoleBindings | Access control mechanisms |
| Job | Pod/Job/CronJob | Unit of work to be executed |
| Job Step | Container | Process within a job |
| QOS | PriorityClass | Job priority and resource limits |
| Allocation | Resource requests/limits | Specification of required resources |
| Environment Module | Container image | Software packaging |
| slurm.conf | ConfigMaps and Secrets | Configuration management |
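
As a concrete sketch of the Partition and QOS rows above, a "partition" is typically emulated with a Namespace plus a ResourceQuota, and a QOS with a PriorityClass. All names and limits below are illustrative, not required values:

apiVersion: v1
kind: Namespace
metadata:
  name: gpu-partition              # plays the role of a Slurm partition
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-partition-quota
  namespace: gpu-partition
spec:
  hard:
    requests.cpu: "512"            # total CPU the "partition" may request
    requests.memory: 2Ti
    requests.nvidia.com/gpu: "32"
    count/jobs.batch: "200"        # cap on concurrent Job objects
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-qos                   # plays the role of a Slurm QOS
value: 1000
globalDefault: false
description: "High-priority jobs that may preempt lower-priority work"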

Command Translation Table

System Information Commands

| Slurm Command | Kubernetes Command | Purpose |
|---|---|---|
| sinfo | kubectl get nodes | List all nodes |
| sinfo -N -l | kubectl describe nodes | Detailed node info |
| sinfo -a | kubectl get nodes --show-labels | Show all partitions / node labels |
| sinfo -p partition_name | kubectl get nodes -l partition=name | Info for a specific partition/label (see note below) |
| scontrol show partition | kubectl get namespaces | List partitions/namespaces |
| scontrol show config | kubectl cluster-info | Cluster configuration |
| sacctmgr show qos | kubectl get priorityclasses | Show QoS / priority settings |
| sshare | kubectl describe resourcequotas | Check fair-share / quota usage |
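
Kubernetes has no built-in notion of a partition; the partition=name selector above assumes an administrator has labeled the nodes accordingly. The label key and value are arbitrary choices:

# Label nodes once so they can be selected like Slurm partitions
kubectl label node node001 partition=gpu

# Then query them as in the table above
kubectl get nodes -l partition=gpu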

Job Management Commands

| Slurm Command | Kubernetes Command | Purpose |
|---|---|---|
| sbatch job.sh | kubectl apply -f job.yaml | Submit a job |
| squeue | kubectl get jobs | List all jobs |
| squeue -u <user> | kubectl get pods --selector=user=<username> | List a user's jobs (assumes a user label, see below) |
| squeue -p <partition> | kubectl get pods -n <namespace> | Jobs in a partition/namespace |
| scancel <jobid> | kubectl delete job <jobname> | Cancel a job |
| scancel -u <user> | kubectl delete jobs --selector=user=<username> | Cancel all of a user's jobs |
| scontrol show job <jobid> | kubectl describe job <jobname> | Job details |
| scontrol hold <jobid> | kubectl patch job <name> -p '{"spec":{"suspend":true}}' | Hold/suspend a job |
| scontrol release <jobid> | kubectl patch job <name> -p '{"spec":{"suspend":false}}' | Release a held job |
| sacct -j <jobid> | kubectl logs job/<jobname> | Job output |
| sacct -u <user> | kubectl logs --selector=user=<username> | A user's job logs |
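
Kubernetes does not record the submitting user on workloads by itself, so the user=<username> selectors above only work if you attach such a label at submission time. A minimal sketch, where the label key user and the value alice are arbitrary choices:

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
  labels:
    user: alice              # enables: kubectl get jobs --selector=user=alice
spec:
  template:
    metadata:
      labels:
        user: alice          # propagated to pods so the pod-level selectors above work
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: ubuntu
        command: ["echo", "hello"]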

Interactive Sessions

| Slurm Command | Kubernetes Command | Purpose |
|---|---|---|
| srun --pty bash | kubectl run -it --rm debug --image=ubuntu -- bash | Interactive shell |
| salloc | kubectl apply -f interactive-pod.yaml | Allocate resources (see sketch below) |
| salloc --gres=gpu:1 | kubectl run -it --rm debug --image=nvidia/cuda --overrides='{"spec":{"containers":[{"name":"debug","resources":{"limits":{"nvidia.com/gpu":1}}}]}}' | Allocate GPU resources |
| ssh <node> | kubectl debug node/<node-name> -it --image=ubuntu | Access a node directly |
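
The interactive-pod.yaml referenced above is not a standard file; a minimal sketch, with a placeholder image and resource sizes, could be:

apiVersion: v1
kind: Pod
metadata:
  name: interactive
spec:
  restartPolicy: Never
  containers:
  - name: shell
    image: ubuntu
    command: ["sleep", "infinity"]   # keep the pod alive for interactive work
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
      limits:
        cpu: "4"
        memory: 16Gi

Attach with kubectl exec -it interactive -- bash and delete the pod when finished; spec.activeDeadlineSeconds can emulate a Slurm time limit.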

Job Specification Examples

Slurm Batch Script Example

#!/bin/bash
#SBATCH --job-name=ml_training
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email>

module load cuda/11.7
module load python/3.9

cd /path/to/project
python train.py --epochs 100

Equivalent Kubernetes Job YAML

apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training
  labels:
    user: username
spec:
  parallelism: 2  # Similar to --nodes=2, but runs two independent pods; see the MPI section for tightly coupled multi-node jobs
  completions: 2
  template:
    metadata:
      labels:
        job-name: ml-training
    spec:
      containers:
      - name: training-container
        image: organization/ml-framework:cuda11.7-python3.9
        resources:
          requests:
            memory: "32Gi"
            cpu: "8"        # 4 tasks * 2 cpus-per-task
            nvidia.com/gpu: 4
          limits:
            memory: "32Gi"
            cpu: "8"
            nvidia.com/gpu: 4
        command: ["/bin/bash", "-c"]
        args:
        - cd /workspace && python train.py --epochs 100
        volumeMounts:
        - name: workspace
          mountPath: /workspace
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: project-data-pvc
      restartPolicy: Never
  backoffLimit: 0
  activeDeadlineSeconds: 43200  # 12 hours in seconds
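
Submitting and monitoring this job then looks roughly as follows (the manifest filename is whatever you saved it as):

kubectl apply -f ml-training-job.yaml
kubectl get pods -l job-name=ml-training    # the job-name label is added by the Job controller
kubectl logs -f job/ml-training             # stream output, analogous to tailing output_%j.log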

Array Jobs

Slurm Array Job

#SBATCH --array=1-10
echo "Task ID: $SLURM_ARRAY_TASK_ID"

Kubernetes Approach

apiVersion: batch/v1
kind: Job
metadata:
  name: array-job
spec:
  completionMode: Indexed  # gives each pod a unique completion index, the analog of SLURM_ARRAY_TASK_ID
  parallelism: 10
  completions: 10
  template:
    spec:
      containers:
      - name: array-task
        image: ubuntu
        command: ["/bin/bash", "-c"]
        args:
        - echo "Task ID: $JOB_COMPLETION_INDEX"
        # Expose the index via the downward API; recent Kubernetes versions also
        # set JOB_COMPLETION_INDEX automatically for Indexed Jobs.
        env:
        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
      restartPolicy: Never
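
Submission and a quick check of the per-index output might look like this, assuming the manifest is saved as array-job.yaml (note that kubectl logs tails only the last few lines per pod when given a selector):

kubectl apply -f array-job.yaml
kubectl get pods -l job-name=array-job
kubectl logs -l job-name=array-job --prefix    # prefix each line with its pod name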

MPI/Parallel Jobs

Slurm MPI Job

#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=4

mpirun python mpi_script.py

Kubernetes with MPI Operator

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: mpi-job
spec:
  slotsPerWorker: 4
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi-launcher
            image: mpi-image
            command: ["mpirun", "python", "mpi_script.py"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: mpi-worker
            image: mpi-image
            resources:
              limits:
                cpu: 4
                memory: 8Gi
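
MPIJob is not a core Kubernetes resource: this assumes the Kubeflow MPI Operator is installed in the cluster to provide the CRD and to wire up the hostfile and SSH between launcher and workers. Submission then looks like:

kubectl apply -f mpi-job.yaml
kubectl get mpijobs                  # CRD provided by the MPI Operator
kubectl get pods                     # find the launcher pod, then follow its logs
kubectl logs -f <launcher-pod-name>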