This process worked for me after five days of hell trying to get everything running. Most of it was pulled from this excellent blog post: https://jacobtomlinson.dev/posts/2022/running-kubeflow-inside-kind-with-gpu-support/
ToDo: Find the patch to kind to allow GPU attachment... (kind of important, I know...)
Create a kind-gpu.yaml file:
# kind-gpu.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: kubeflow-gpu
nodes:
  - role: control-plane
    image: kindest/node:v1.21.2
    gpus: True
    extraPortMappings:
      - containerPort: 31080
        listenAddress: 127.0.0.1
        hostPort: 80
kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    apiServer:
      extraArgs:
        "service-account-issuer": "kubernetes.default.svc"
        "service-account-signing-key-file": "/etc/kubernetes/pki/sa.key"
Fire up the cluster
$ kind create cluster --config kind-gpu.yaml
Creating cluster "kubeflow-gpu" ...
 ✓ Ensuring node image (kindest/node:v1.21.2) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kubeflow-gpu"
You can now use your cluster with:

kubectl cluster-info --context kind-kubeflow-gpu

Have a nice day! 👋
Switch to the proper context
kubectx kind-kubeflow-gpu
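If you don't have kubectx installed, plain kubectl can do the same job. Here is a minimal sketch; ensure_context is a helper name I made up for illustration, and it only switches when the current context differs from the one you ask for.

```shell
# ensure_context NAME: switch kubectl to context NAME unless it is already active.
# (ensure_context is a hypothetical helper, not part of kubectl or kubectx.)
ensure_context() {
  want=$1
  # current-context fails if no context is set; treat that as "needs switching".
  current=$(kubectl config current-context 2>/dev/null)
  if [ "$current" != "$want" ]; then
    kubectl config use-context "$want"
  fi
}

# Example: ensure_context kind-kubeflow-gpu
```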
Next we need to install the NVIDIA GPU Operator via Helm. This adds the device plugin to the cluster so that Kubernetes can detect GPUs and schedule workloads onto them.
We already installed the drivers ourselves, so we tell the operator to skip its own driver install.
helm repo add nvidia https://nvidia.github.io/gpu-operator \
&& helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set driver.enabled=false
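Once the operator is up, it is worth checking that the node actually advertises GPUs to the API before going any further. The sketch below lists nodes with a non-zero nvidia.com/gpu allocatable count. gpu_nodes is a name I picked for illustration, and I am assuming the device plugin exposes GPUs under the usual nvidia.com/gpu resource name. An empty result may just mean the operator pods are still starting.

```shell
# gpu_nodes: print the names of nodes that report at least one allocatable GPU.
# (gpu_nodes is a hypothetical helper; it assumes the nvidia.com/gpu resource name.)
gpu_nodes() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    # Nodes without the resource show "<none>", so only accept plain positive integers.
    awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Example: gpu_nodes
```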
First, clone the Kubeflow manifests provided by the working group:
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.5-branch
Now it is time to run the install!
$ ./hack/setup-kubeflow.sh
Go grab some coffee and watch the pods spin up (roughly 74).
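If you would rather not stare at the pod list, a rough way to track progress is to count the pods that are not yet Running or Completed. This is only a sketch: pending_pods is a made-up helper name, and it relies on STATUS being the fourth column of `kubectl get pods --all-namespaces --no-headers`. Note that Running is not the same as Ready, so treat a zero as a hint rather than proof.

```shell
# pending_pods: count pods in any namespace whose STATUS is neither Running nor Completed.
# (pending_pods is a hypothetical helper for illustration.)
pending_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Example: poll every 10 seconds until nothing is pending.
# while [ "$(pending_pods)" -gt 0 ]; do sleep 10; done
```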
I had to run the setup-kubeflow.sh script three times, and it took roughly 15 minutes on a 14-core machine with 128 GB of RAM.
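Since the script needed several runs, a small retry wrapper saves some babysitting. This is a sketch under one assumption: that re-running the script is safe, which matched my experience but which I have not verified for every manifest. retry is a helper name I made up, and the attempt count and delay in the example are arbitrary.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds or MAX attempts have been used.
# (retry is a hypothetical helper; MAX and DELAY values are up to you.)
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example, from inside the manifests checkout:
# retry 5 30 ./hack/setup-kubeflow.sh
```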