GPU on docker and minikube

https://developer.nvidia.com/deep-learning-performance-training-inference/training

# Additional args -- optional, on case-by-case basis
declare -a CONTAINER_ARGS=(                                                             
  --gpus all
  --ipc=host
  --ulimit memlock=-1
  --ulimit stack=67108864

  --device=/dev/infiniband
  --device=/dev/gdrdrv

  --network=host

  --cap-add=SYS_PTRACE
  --security-opt seccomp=unconfined
  
  -v haha:/haha
)

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm "${CONTAINER_ARGS[@]}" ...

The HAHA series: minikube quickstart

Table of contents:

  1. Pre-requisites
  2. And here we go...
  3. What didn't work to set minikube shm size and ulimit
  4. References

1. Pre-requisites

$ minikube version
minikube version: v1.35.0
commit: dd5d320e41b5451cdf3c01891bc4e13d189586ed-dirty

$ dpkg -l | grep docker
ii  dive                          0.12.0                                  amd64        A tool for exploring layers in a docker image
ii  docker-buildx-plugin          0.22.0-1~ubuntu.22.04~jammy             amd64        Docker Buildx cli plugin.
ii  docker-ce                     5:28.0.4-1~ubuntu.22.04~jammy           amd64        Docker: the open-source application container engine
ii  docker-ce-cli                 5:28.0.4-1~ubuntu.22.04~jammy           amd64        Docker CLI: the open-source application container engine
ii  docker-ce-rootless-extras     5:28.0.4-1~ubuntu.22.04~jammy           amd64        Rootless support for Docker.
ii  docker-compose-plugin         2.34.0-1~ubuntu.22.04~jammy             amd64        Docker Compose (V2) plugin for the Docker CLI.

$ uname -r
5.15.167.4-microsoft-standard-WSL2

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.5 LTS
Release:        22.04
Codename:       jammy

2. And here we go...

2.1. Preparation

minikube addons enable metrics-server
minikube start --driver docker --container-runtime docker --gpus all

minikube addons list

docker pull nvcr.io/nvidia/pytorch:25.03-py3

# Workaround OOM during minikube image load <docker image>.
# https://github.com/kubernetes/minikube/issues/17785#issuecomment-1906422218
#
# NOTE: may take a while. For nemo:25.02.01 (56 GB), the two steps took 5.5 min and
#       9 min. The pytorch container is 24 GB, so expect roughly half those times.
docker save --output /tmp/haha.tar nvcr.io/nvidia/pytorch:25.03-py3
minikube image load /tmp/haha.tar
rm /tmp/haha.tar

Validate images:

$ docker images
REPOSITORY                TAG          IMAGE ID       CREATED        SIZE
nvcr.io/nvidia/pytorch    25.03-py3    f065828cf368   4 weeks ago    24GB

$ minikube image ls
nvcr.io/nvidia/pytorch:25.03-py3
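
Optionally, confirm that the minikube node actually advertises the GPU resource. A hedged check (it assumes the NVIDIA device plugin that minikube deploys for --gpus all is running in kube-system):

# The node should list nvidia.com/gpu under Capacity and Allocatable.
kubectl describe node minikube | grep -i 'nvidia.com/gpu'

# The device plugin pod itself should be Running.
kubectl get pods -n kube-system | grep -i nvidia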

2.2. Baseline with docker run

Let's start with plain docker to collect the baseline behavior. Notice that NVIDIA's AI containers (pytorch, nemo, etc.) may detect an unoptimized configuration and recommend remedies.

## The container warns about the unoptimized runtime settings and recommends remedies.
$ docker run -it --rm --gpus all \
    nvcr.io/nvidia/pytorch:25.03-py3 \
    /bin/bash -c "df -k /dev/shm ; echo ; ulimit -a | egrep '^max locked|^stack'"
...
NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

Filesystem     1K-blocks  Used Available Use% Mounted on
shm                65536     0     65536   0% /dev/shm

max locked memory           (kbytes, -l) unlimited
stack size                  (kbytes, -s) unlimited

## Repeat with recommended flags.
$ docker run -it --rm --gpus all \
    --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/pytorch:25.03-py3 \
    /bin/bash -c "df -k /dev/shm ; echo ; ulimit -a | egrep '^max locked|^stack'"
...
Filesystem     1K-blocks  Used Available Use% Mounted on
none             8088756     0   8088756   0% /dev/shm

max locked memory           (kbytes, -l) unlimited
stack size                  (kbytes, -s) 65536

2.3. Transferring the docker run flags to kubectl

To apply the optimizations to minikube, we need to:

  1. set the default ulimits in dockerd, and
  2. mount a custom shm volume in the pod.

2.3.1. Remove limits in dockerd

Ensure that /etc/docker/daemon.json has a default-ulimits section:

{
    "default-ulimits": {
        "memlock": {
            "Hard": -1,
            "Name": "memlock",
            "Soft": -1
        },
        "stack": {
            "Hard": -1,
            "Name": "stack",
            "Soft": -1
        }
    },
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Then, restart the dockerd service. minikube needs to be restarted too.

sudo systemctl restart docker.service
minikube start --driver docker --container-runtime docker --gpus all
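
To confirm that the new defaults actually reach containers, here is a quick, hedged sanity check on the host dockerd (busybox is just an arbitrary small image; any image works):

# Even without any --ulimit flags, both limits should now report unlimited.
docker run --rm busybox sh -c 'grep -E "locked memory|stack size" /proc/self/limits'

The pod-level check follows in the next section.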

2.3.2. Run pod with custom shm volume

Previously, we configured dockerd to always run containers with unlimited memlock and stack size.

The only thing remaining is to enlarge /dev/shm so that the check done by /opt/nvidia/entrypoint.d/70-shm-check.sh passes.

With kubectl run, we inject the shm volume using --override-type json --overrides='...'.

$ kubectl run nvidia-smi --restart=Never --rm -it \
    --image nvcr.io/nvidia/pytorch:25.03-py3 \
    --image-pull-policy=Never \
    --override-type json \
    --overrides='
[
{"op": "add", "path": "/spec/containers/0/volumeMounts", "value": [{"name": "shmem", "mountPath": "/dev/shm"}]},
{"op": "add", "path": "/spec/volumes", "value": [{"name": "shmem", "emptyDir": {"medium": "Memory"}}]}
]
' \
    -- bash -c "df -k /dev/shm ; echo ; ulimit -a | egrep '^max locked|^stack' ; nvidia-smi"
...
Filesystem     1K-blocks  Used Available Use% Mounted on
tmpfs           16177516     0  16177516   0% /dev/shm

max locked memory           (kbytes, -l) unlimited
stack size                  (kbytes, -s) unlimited
Wed Apr  9 11:40:22 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 572.83         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A1000 6GB Lap...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8              6W /   38W |     887MiB /   6144MiB |     23%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
pod "nvidia-smi" deleted

And voila! The shm size is greater than 64 MB, hence we don't see the SHM warning from /opt/nvidia/entrypoint.d/70-shm-check.sh. Notice also the unlimited memlock and stack size.

For those who prefer kubectl create, here's the /tmp/haha.yaml:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nvidia-smi
  name: nvidia-smi
spec:
  volumes:
  - name: shmem
    emptyDir:
      medium: Memory
  containers:
  - name: nvidia-smi
    args:
    #command:
      - /bin/bash
      - -xc
      - "df -k /dev/shm ; echo ; ulimit -a | egrep '^max locked|^stack' ; nvidia-smi"
    image: nvcr.io/nvidia/pytorch:25.03-py3
    imagePullPolicy: Never
    volumeMounts:
      - name: shmem
        mountPath: /dev/shm
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
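
Note that the memory-backed emptyDir is not explicitly capped in the spec above (it showed up earlier as a ~16 GB tmpfs). If you want to bound how much RAM /dev/shm may consume, Kubernetes supports emptyDir.sizeLimit; a hedged variant of the volume definition, with an assumed 8Gi cap:

  volumes:
  - name: shmem
    emptyDir:
      medium: Memory
      sizeLimit: 8Gi   # assumed value; exceeding it can get the pod evicted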

We can also generate the .yaml file (and edit as needed) as follows:

# Same kubectl run as before, minus --rm -it, plus --dry-run=client -o yaml.
kubectl run nvidia-smi --restart=Never \
  --image-pull-policy=Never \
  --image nvcr.io/nvidia/pytorch:25.03-py3 \
  --dry-run=client -o yaml \
  -- nvidia-smi > /tmp/hehe.yaml
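
The generated file does not contain the shm volume yet, so edit it before creating the pod. One way to script that edit (a sketch; it assumes mikefarah's yq v4 is available):

# Inject the memory-backed emptyDir and its mount into the generated spec.
yq -i '
  .spec.volumes = [{"name": "shmem", "emptyDir": {"medium": "Memory"}}] |
  .spec.containers[0].volumeMounts = [{"name": "shmem", "mountPath": "/dev/shm"}]
' /tmp/hehe.yaml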

Then, create the pod from the .yaml file:

kubectl delete pod nvidia-smi
kubectl create -f /tmp/hehe.yaml
sleep 3
kubectl logs nvidia-smi   # To display output.
kubectl delete pod nvidia-smi

3. What didn't work to set minikube shm size and ulimit

Chronicle of the ineffective methods.

  1. Create /etc/systemd/system/docker.service.d/nvidia-ulimit.conf with this content:

    # Tried whatever suggestions found on the internet.
    [Manager]
    DefaultLimitMEMLOCK=infinity
    DefaultLimitSTACK=infinity
    LimitMEMLOCK=infinity
    LimitSTACK=infinity
    
    [Service]
    LimitMEMLOCK=infinity
    LimitSTACK=infinity
    DefaultLimitMEMLOCK=infinity
    DefaultLimitSTACK=infinity

    Do sudo systemctl daemon-reload ; sudo systemctl restart docker, and systemctl show docker will then display the specified limits as unlimited. However, this has no effect on kubectl, nor on docker run ... without --ulimit.

  2. minikube start ... --docker-opt "ulimit memlock=-1" --docker-opt "ulimit stack=67108864" --docker-opt "shm-size=1g" is also not effective. Pods still run with the default shm size (64 MB), and limited memlock & stack size.

  3. Ineffective settings in the pod .yaml file.

    apiVersion: v1
    kind: Pod
    metadata: ...
    spec:
      ...
      hostIPC: True    # No effect on /dev/shm, unlike docker run --ipc=host

      # ADT version. Also no effect on /dev/shm, even when combined with hostIPC: True
      volumes:
      - name: shmem
        hostPath:
          path: /dev/shm
      containers:
      - name: nvidia-smi
        volumeMounts:    # <== THIS!!
        - name: shmem
          mountPath: /dev/shm
        ...

4. References

  1. awslabs/benchmark-ai#17 (comment) argues that only the /dev/shm size matters and ignores the ulimits.