
Slurm to Kubernetes Cheat Sheet

Conceptual Mapping

| Slurm Concept | Kubernetes Equivalent | Description |
|---|---|---|
| Cluster | Cluster | Overall compute infrastructure |
| Node | Node | Physical/virtual machine in the cluster |
| Partition | Namespace + ResourceQuota | Logical division of resources (see example below) |
| Account | RBAC Roles and RoleBindings | Access control mechanisms |
| Job | Pod/Job/CronJob | Unit of work to be executed |
| Job Step | Container | Process within a job |
| QOS | PriorityClass | Job priority and resource limits |
| Allocation | Resource requests/limits | Specification of required resources |
| Environment Module | Container image | Software packaging |
| slurm.conf | ConfigMaps and Secrets | Configuration management |
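
As a concrete sketch of the Partition and QOS rows above, a "partition" is typically emulated with a Namespace plus a ResourceQuota, and a QOS with a PriorityClass. All names and limits below are illustrative, not required values:

apiVersion: v1
kind: Namespace
metadata:
  name: gpu-partition              # plays the role of a Slurm partition
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-partition-quota
  namespace: gpu-partition
spec:
  hard:
    requests.cpu: "512"            # total CPU the "partition" may request
    requests.memory: 2Ti
    requests.nvidia.com/gpu: "32"
    count/jobs.batch: "200"        # cap on concurrent Job objects
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-qos                   # plays the role of a Slurm QOS
value: 1000
globalDefault: false
description: "High-priority jobs that may preempt lower-priority work"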

Command Translation Table

System Information Commands

| Slurm Command | Kubernetes Command | Purpose |
|---|---|---|
| sinfo | kubectl get nodes | List all nodes |
| sinfo -N -l | kubectl describe nodes | Detailed node info |
| sinfo -a | kubectl get nodes --show-labels | Show all partitions / node labels |
| sinfo -p partition_name | kubectl get nodes -l partition=name | Info for a specific partition/label (see note below) |
| scontrol show partition | kubectl get namespaces | List partitions/namespaces |
| scontrol show config | kubectl cluster-info | Cluster configuration |
| sacctmgr show qos | kubectl get priorityclasses | Show QoS / priority settings |
| sshare | kubectl describe resourcequotas | Check fair-share / quota usage |
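
Kubernetes has no built-in notion of a partition; the partition=name selector above assumes an administrator has labeled the nodes accordingly. The label key and value are arbitrary choices:

# Label nodes once so they can be selected like Slurm partitions
kubectl label node node001 partition=gpu

# Then query them as in the table above
kubectl get nodes -l partition=gpu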

Job Management Commands

| Slurm Command | Kubernetes Command | Purpose |
|---|---|---|
| sbatch job.sh | kubectl apply -f job.yaml | Submit a job |
| squeue | kubectl get jobs | List all jobs |
| squeue -u <user> | kubectl get pods --selector=user=<username> | List a user's jobs (assumes a user label, see below) |
| squeue -p <partition> | kubectl get pods -n <namespace> | Jobs in a partition/namespace |
| scancel <jobid> | kubectl delete job <jobname> | Cancel a job |
| scancel -u <user> | kubectl delete jobs --selector=user=<username> | Cancel all of a user's jobs |
| scontrol show job <jobid> | kubectl describe job <jobname> | Job details |
| scontrol hold <jobid> | kubectl patch job <name> -p '{"spec":{"suspend":true}}' | Hold/suspend a job |
| scontrol release <jobid> | kubectl patch job <name> -p '{"spec":{"suspend":false}}' | Release a held job |
| sacct -j <jobid> | kubectl logs job/<jobname> | Job output |
| sacct -u <user> | kubectl logs --selector=user=<username> | A user's job logs |
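
Kubernetes does not record the submitting user on workloads by itself, so the user=<username> selectors above only work if you attach such a label at submission time. A minimal sketch, where the label key user and the value alice are arbitrary choices:

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
  labels:
    user: alice              # enables: kubectl get jobs --selector=user=alice
spec:
  template:
    metadata:
      labels:
        user: alice          # propagated to pods so the pod-level selectors above work
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: ubuntu
        command: ["echo", "hello"]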

Interactive Sessions

| Slurm Command | Kubernetes Command | Purpose |
|---|---|---|
| srun --pty bash | kubectl run -it --rm debug --image=ubuntu -- bash | Interactive shell |
| salloc | kubectl apply -f interactive-pod.yaml | Allocate resources (see sketch below) |
| salloc --gres=gpu:1 | kubectl run -it --rm debug --image=nvidia/cuda --overrides='{"spec":{"containers":[{"name":"debug","resources":{"limits":{"nvidia.com/gpu":1}}}]}}' | Allocate GPU resources |
| ssh <node> | kubectl debug node/<node-name> -it --image=ubuntu | Access a node directly |
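
The interactive-pod.yaml referenced above is not a standard file; a minimal sketch, with a placeholder image and resource sizes, could be:

apiVersion: v1
kind: Pod
metadata:
  name: interactive
spec:
  restartPolicy: Never
  containers:
  - name: shell
    image: ubuntu
    command: ["sleep", "infinity"]   # keep the pod alive for interactive work
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
      limits:
        cpu: "4"
        memory: 16Gi

Attach with kubectl exec -it interactive -- bash and delete the pod when finished; spec.activeDeadlineSeconds can emulate a Slurm time limit.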

Job Specification Examples

Slurm Batch Script Example

#!/bin/bash
#SBATCH --job-name=ml_training
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:4
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email>

module load cuda/11.7
module load python/3.9

cd /path/to/project
python train.py --epochs 100

Equivalent Kubernetes Job YAML

apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training
  labels:
    user: username
spec:
  parallelism: 2  # Similar to --nodes=2, but runs two independent pods; see the MPI section for tightly coupled multi-node jobs
  completions: 2
  template:
    metadata:
      labels:
        job-name: ml-training
    spec:
      containers:
      - name: training-container
        image: organization/ml-framework:cuda11.7-python3.9
        resources:
          requests:
            memory: "32Gi"
            cpu: "8"        # 4 tasks * 2 cpus-per-task
            nvidia.com/gpu: 4
          limits:
            memory: "32Gi"
            cpu: "8"
            nvidia.com/gpu: 4
        command: ["/bin/bash", "-c"]
        args:
        - cd /workspace && python train.py --epochs 100
        volumeMounts:
        - name: workspace
          mountPath: /workspace
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: project-data-pvc
      restartPolicy: Never
  backoffLimit: 0
  activeDeadlineSeconds: 43200  # 12 hours in seconds
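
Submitting and monitoring this job then looks roughly as follows (the manifest filename is whatever you saved it as):

kubectl apply -f ml-training-job.yaml
kubectl get pods -l job-name=ml-training    # the job-name label is added by the Job controller
kubectl logs -f job/ml-training             # stream output, analogous to tailing output_%j.log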

Array Jobs

Slurm Array Job

#SBATCH --array=1-10
echo "Task ID: $SLURM_ARRAY_TASK_ID"

Kubernetes Approach

apiVersion: batch/v1
kind: Job
metadata:
  name: array-job
spec:
  completionMode: Indexed  # gives each pod a unique completion index, the analog of SLURM_ARRAY_TASK_ID
  parallelism: 10
  completions: 10
  template:
    spec:
      containers:
      - name: array-task
        image: ubuntu
        command: ["/bin/bash", "-c"]
        args:
        - echo "Task ID: $JOB_COMPLETION_INDEX"
        # Expose the index via the downward API; recent Kubernetes versions also
        # set JOB_COMPLETION_INDEX automatically for Indexed Jobs.
        env:
        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
      restartPolicy: Never
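
Submission and a quick check of the per-index output might look like this, assuming the manifest is saved as array-job.yaml (note that kubectl logs tails only the last few lines per pod when given a selector):

kubectl apply -f array-job.yaml
kubectl get pods -l job-name=array-job
kubectl logs -l job-name=array-job --prefix    # prefix each line with its pod name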

MPI/Parallel Jobs

Slurm MPI Job

#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=4

mpirun python mpi_script.py

Kubernetes with MPI Operator

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: mpi-job
spec:
  slotsPerWorker: 4
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi-launcher
            image: mpi-image
            command: ["mpirun", "python", "mpi_script.py"]
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: mpi-worker
            image: mpi-image
            resources:
              limits:
                cpu: 4
                memory: 8Gi
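
MPIJob is not a core Kubernetes resource: this assumes the Kubeflow MPI Operator is installed in the cluster to provide the CRD and to wire up the hostfile and SSH between launcher and workers. Submission then looks like:

kubectl apply -f mpi-job.yaml
kubectl get mpijobs                  # CRD provided by the MPI Operator
kubectl get pods                     # find the launcher pod, then follow its logs
kubectl logs -f <launcher-pod-name>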