
@ronaldpetty
Created March 18, 2024 12:30

k8sgpt

Per https://k8sgpt.ai/

K8sGPT is a tool for scanning your Kubernetes clusters, diagnosing and triaging issues in simple English. It has SRE experience codified into its analyzers and helps to pull out the most relevant information to enrich it with AI.

K8sGPT Is For…

  • Workload health analysis - Find critical issues with your workloads.
  • Fast triage, AI analysis - Look at your cluster at a glance or use AI to analyze your cluster in depth
  • Humans - Complex signals into easy to understand suggestions
  • Security CVE review - Connect to scanners like Trivy and triage issues

Kubernetes

Orchestration is a key component of cloud native (CN) systems.

What is Kubernetes (k8s)? Per https://kubernetes.io

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.

K8s is a complex system; the next sections cover the tools (e.g. Docker, k3d) we leverage to install a complete K8s system.

Docker

Another critical component of CN systems is the use of containers. Docker remains a viable container runtime.

What is Docker? Per https://docs.docker.com/get-started/overview/

Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure so you can deliver software quickly. With Docker, you can manage your infrastructure in the same ways you manage your applications. By taking advantage of Docker's methodologies for shipping, testing, and deploying code, you can significantly reduce the delay between writing code and running it in production.

~$ curl -sL https://get.docker.com | sh -

...

~$ sudo usermod -aG docker $(whoami)

~$ exit # reload user permissions in terminal

$ ssh -i key.pem ubuntu@<VM_IP> # ssh back to VM
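Alternatively, if you would rather not log out and back in, newgrp starts a subshell that picks up the new group membership:

~$ newgrp docker # same effect as re-login, scoped to this shell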

Prove Docker is up and running (this connects to the daemon, so we know it's listening).

~$ docker system df

TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          0         0         0B        0B
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B
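
As an optional extra check, you can run a throwaway container end to end (this pulls the small hello-world image from Docker Hub):

~$ docker run --rm hello-world

...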

With our container runtime installed, we move on to installing K8s via k3d.

k3d

What is k3d? Per https://k3d.io

k3d is a lightweight wrapper to run k3s (Rancher Lab’s minimal Kubernetes distribution) in docker.

k3d makes it very easy to create single- and multi-node k3s clusters in docker, e.g. for local development on Kubernetes.

Yes, k3d is a wrapper around k3s. So what is k3s?

Per https://github.com/k3s-io/k3s?tab=readme-ov-file#what-is-this

K3s is a fully conformant production-ready Kubernetes distribution with the following changes:

  1. It is packaged as a single binary.
  2. It adds support for sqlite3 as the default storage backend. Etcd3, MySQL, and Postgres are also supported.
  3. It wraps Kubernetes and other components in a single, simple launcher.
  4. It is secure by default with reasonable defaults for lightweight environments.
  5. It has minimal to no OS dependencies (just a sane kernel and cgroup mounts needed).
  6. It eliminates the need to expose a port on Kubernetes worker nodes for the kubelet API by exposing this API to the Kubernetes control plane nodes over a websocket tunnel.

Not to belabor the point, but k3s provides defaults for many useful, if not required, k8s components.

K3s bundles the following technologies together into a single cohesive distribution:

  • Containerd & runc
  • Flannel for CNI
  • CoreDNS
  • Metrics Server
  • Traefik for ingress
  • Klipper-lb as an embedded service load balancer provider
  • Kube-router netpol controller for network policy
  • Helm-controller to allow for CRD-driven deployment of helm manifests
  • Kine as a datastore shim that allows etcd to be replaced with other databases
  • Local-path-provisioner for provisioning volumes using local storage
  • Host utilities such as iptables/nftables, ebtables, ethtool, & socat

We begin by installing k3d (and k3s).

~$ curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash

...

Finally, we can install K8s via k3d.

~$ k3d cluster create "k8sgpt-cluster" --image "rancher/k3s:v1.26.9-k3s1"

INFO[0000] Prep: Network                                
INFO[0000] Created network 'k3d-k8sgpt-cluster'         
INFO[0000] Created image volume k3d-k8sgpt-cluster-images 
INFO[0000] Starting new tools node...                   
INFO[0000] Pulling image 'ghcr.io/k3d-io/k3d-tools:5.6.0' 
INFO[0001] Creating node 'k3d-k8sgpt-cluster-server-0'  
INFO[0001] Starting Node 'k3d-k8sgpt-cluster-tools'     
INFO[0001] Pulling image 'rancher/k3s:v1.26.9-k3s1'     
INFO[0004] Creating LoadBalancer 'k3d-k8sgpt-cluster-serverlb' 
INFO[0005] Pulling image 'ghcr.io/k3d-io/k3d-proxy:5.6.0' 
INFO[0007] Using the k3d-tools node to gather environment information 
INFO[0007] HostIP: using network gateway 172.20.0.1 address 
INFO[0007] Starting cluster 'k8sgpt-cluster'            
INFO[0007] Starting servers...                          
INFO[0007] Starting Node 'k3d-k8sgpt-cluster-server-0'  
INFO[0012] All agents already running.                  
INFO[0012] Starting helpers...                          
INFO[0012] Starting Node 'k3d-k8sgpt-cluster-serverlb'  
INFO[0018] Injecting records for hostAliases (incl. host.k3d.internal) and for 2 network members into CoreDNS configmap... 
INFO[0020] Cluster 'k8sgpt-cluster' created successfully! 
INFO[0020] You can now use it like this:                
kubectl cluster-info
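
As an aside, k3d can also create multi-node clusters; the --agents flag adds worker-node containers. For example (not needed for this walkthrough; the cluster name below is hypothetical):

~$ k3d cluster create demo --agents 2 # example only; we stick with the single-server cluster above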

K8s nodes are normally VMs; k3d, however, allows us to run a node in a container.

~$ docker ps

CONTAINER ID   IMAGE                            COMMAND                  CREATED              STATUS              PORTS                             NAMES
0444a0d5c238   ghcr.io/k3d-io/k3d-proxy:5.6.0   "/bin/sh -c nginx-pr…"   About a minute ago   Up About a minute   80/tcp, 0.0.0.0:45105->6443/tcp   k3d-k8sgpt-cluster-serverlb
b7781ddae510   rancher/k3s:v1.26.9-k3s1         "/bin/k3d-entrypoint…"   About a minute ago   Up About a minute                                     k3d-k8sgpt-cluster-server-0

# ctr is a client for interacting with the containerd daemon directly
# docker, in turn, talks to the docker daemon, which then talks to containerd
~$ sudo ctr -n moby c ls

CONTAINER                                                           IMAGE    RUNTIME                  
ab7a6c33402acef3d9a7c5254d2d3e3d057f8b38dd0135eb2ae64bc8ac9a7e25    -        io.containerd.runc.v2    
ca790d4975bfc82d49545fa4229e9033b1783d64e125e1a8d1df4dd4d244adc1    -        io.containerd.runc.v2

Using the k3d command line, we see similar results.

~$ k3d node list

NAME                          ROLE           CLUSTER          STATUS
k3d-k8sgpt-cluster-server-0   server         k8sgpt-cluster   running
k3d-k8sgpt-cluster-serverlb   loadbalancer   k8sgpt-cluster   running

You can view what is running in the server container (aka the k8s control plane) as follows.

~$ docker top k3d-k8sgpt-cluster-server-0 co pid,command # BSD-style ps options: c = true command name, o = output format

PID                 COMMAND
7453                docker-init
7488                k3d-entrypoint.
7609                containerd
8348                containerd-shim
8361                containerd-shim
8375                containerd-shim
9680                containerd-shim
9785                containerd-shim
7504                k3s
9704                pause
9979                entry
9918                entry
8420                pause
8761                local-path-prov
10101               traefik
9805                pause
8934                metrics-server
8416                pause
8419                pause
8722                coredns

Notice the PIDs: "top" is using the ps on your host system. Compare that to ps inside the container itself (which is BusyBox, so a less functional version of ps).

~$ docker exec k3d-k8sgpt-cluster-server-0 ps | awk '{print $1 " " $3}'

PID COMMAND
1 /sbin/docker-init
7 {k3d-entrypoint.}
23 /bin/k3s
128 containerd
681 /bin/containerd-shim-runc-v2
694 /bin/containerd-shim-runc-v2
708 /bin/containerd-shim-runc-v2
749 /pause
752 /pause
753 /pause
1055 /coredns
1094 local-path-provisioner
1267 /metrics-server
2010 /bin/containerd-shim-runc-v2
2034 /pause
2115 /bin/containerd-shim-runc-v2
2135 /pause
2247 {entry}
2308 {entry}
2429 traefik
3121 ps

On the host, k3s is PID 7504, but in the container it's 23 (a custom view of the /proc filesystem is a key containerization feature).

Note the availability of the metrics-server. This means commands such as kubectl top pods and kubectl top nodes will work. However, we need to install kubectl first!

kubectl

kubectl is the official CLI tool to operate K8s. We installed k8s v1.26.9 via k3d, so we will use the same version of kubectl.

~$ curl -sLO "https://dl.k8s.io/release/v1.26.9/bin/linux/amd64/kubectl" # same version as k3d install of k8s control plane

~$ sudo mv kubectl /usr/bin/

~$ sudo chown $(whoami) /usr/bin/kubectl

~$ chmod u+x /usr/bin/kubectl

Confirm it works and can reach the K8s API server.

~$ kubectl version

...

~$ source <(kubectl completion bash)
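
To make completion persist across sessions, append the same line to your shell rc file:

~$ echo 'source <(kubectl completion bash)' >> ~/.bashrc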

When we created the cluster, k3d also provided the configuration (user, certs, etc.).

# cat ~/.kube/config 
# to see raw file
~$ kubectl config view

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://0.0.0.0:40755
  name: k3d-k8sgpt-cluster
contexts:
- context:
    cluster: k3d-k8sgpt-cluster
    user: admin@k3d-k8sgpt-cluster
  name: k3d-k8sgpt-cluster
current-context: k3d-k8sgpt-cluster
kind: Config
preferences: {}
users:
- name: admin@k3d-k8sgpt-cluster
  user:
    client-certificate-data: DATA+OMITTED
    client-key-data: DATA+OMITTED
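
If you need the unredacted certificate data (e.g. to copy this kubeconfig to another machine), ask kubectl for the raw file:

~$ kubectl config view --raw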

Testing K8s

In this section, we will run a simple test application and confirm all is well with our K8s installation.

We are going to launch a script that runs two commands, echo and tail, confirming K8s API operations along the way.

~$ ps wuxa | grep tail # make sure tail is not running

# nothing

~$ kubectl run testing --image ubuntu:22.04 --command sh -- -c 'echo "hello world"; tail -f /dev/null' # launch script

~$ ps wuxa | grep tail # make sure tail is running

root       12118  0.2  0.0   2892  1664 ?        Ss   23:30   0:00 sh -c echo "hello world"; tail -f /dev/null
root       12130  0.0  0.0   2824  1536 ?        S    23:30   0:00 tail -f /dev/null
ubuntu     12154  0.0  0.0   7008  2304 pts/0    S+   23:30   0:00 grep --color=auto tail

# another way to confirm tail is running
~$ docker top k3d-k8sgpt-cluster-server-0 | grep tail

root                12118               12064               0                   23:30               ?                   00:00:00            sh -c echo "hello world"; tail -f /dev/null
root                12130               12118               0                   23:30               ?                   00:00:00            tail -f /dev/null

~$ kubectl logs testing # confirm echo ran

hello world
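
We can also peek inside the pod's PID namespace; PID 1 there should be our sh process, mirroring the host-vs-container PID discussion from earlier:

~$ kubectl exec testing -- cat /proc/1/comm

sh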

# show debugging information while deleting the pod, this may take a minute (finalizers, close connection, etc.)
~$ kubectl delete pod testing -v=9 

...

I0315 23:32:23.571322   12791 round_trippers.go:466] curl -v -XDELETE  -H "Accept: application/json" -H "Content-Type: application/json" -H "User-Agent: kubectl/v1.26.9 (linux/amd64) kubernetes/d1483fd" 'https://0.0.0.0:40755/api/v1/namespaces/default/pods/testing'

...

~$ kubectl get pod testing # confirm pod is gone

Error from server (NotFound): pods "testing" not found

~$ kubectl get namespaces # see available namespaces, these are logical groups for access and control in k8s

NAME              STATUS   AGE
kube-system       Active   21m
default           Active   21m
kube-public       Active   21m
kube-node-lease   Active   21m

~$ kubectl -n kube-system get pods # show control plane and related pods

NAME                                      READY   STATUS      RESTARTS   AGE
local-path-provisioner-76d776f6f9-zb2nh   1/1     Running     0          21m
coredns-59b4f5bbd5-4hzzl                  1/1     Running     0          21m
helm-install-traefik-crd-4lm64            0/1     Completed   0          21m
svclb-traefik-d88ed97a-cnf9n              2/2     Running     0          21m
helm-install-traefik-lwvmg                0/1     Completed   1          21m
traefik-57c84cf78d-zxqmg                  1/1     Running     0          21m
metrics-server-68cf49699b-tdm8c           1/1     Running     0          21m

Again, we see "metrics-..." is running, so let's use it.

~$ kubectl top pods -A

NAMESPACE     NAME                                      CPU(cores)   MEMORY(bytes)   
kube-system   coredns-59b4f5bbd5-4hzzl                  2m           13Mi            
kube-system   local-path-provisioner-76d776f6f9-zb2nh   1m           7Mi             
kube-system   metrics-server-68cf49699b-tdm8c           6m           17Mi            
kube-system   svclb-traefik-d88ed97a-cnf9n              0m           0Mi             
kube-system   traefik-57c84cf78d-zxqmg                  1m           26Mi           

~$ kubectl top nodes

NAME                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
k3d-k8sgpt-cluster-server-0   61m          0%     602Mi           1%

While kubectl shows a single node, k3d is actually running two.

# default install for k3d is two nodes
~$ k3d node list

NAME                          ROLE           CLUSTER          STATUS
k3d-k8sgpt-cluster-server-0   server         k8sgpt-cluster   running
k3d-k8sgpt-cluster-serverlb   loadbalancer   k8sgpt-cluster   running

The k3d-k8sgpt-cluster-serverlb node's purpose is to route traffic rather than serve it; use ~$ docker top k3d-k8sgpt-cluster-serverlb to view its processes (nginx).

Installing k8sgpt

k8sgpt can be installed via helm.

Helm

Helm is the most popular templating and packaging engine for installing CN projects. k8sgpt and other tools leverage it to install components.

~$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3

~$ chmod u+x get_helm.sh

~$ ./get_helm.sh

...

~$ helm version

...

Installation of K8sGPT Operator

Next we install the k8sgpt operator via helm. This operator manages the connection to the AI provider (e.g. OpenAI, LocalAI, etc.) and k8s itself.

~$ helm repo add k8sgpt https://charts.k8sgpt.ai/

...

~$ helm repo update

Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "k8sgpt" chart repository
Update Complete. ⎈Happy Helming!⎈

~$ helm install release k8sgpt/k8sgpt-operator -n k8sgpt-operator-system --create-namespace

...

The previous command did the actual install of the k8sgpt operator.

We can list releases as follows:

~$ helm list -A

NAME       	NAMESPACE             	REVISION	UPDATED                                	STATUS  	CHART                      	APP VERSION
release    	k8sgpt-operator-system	1       	2024-03-15 23:40:32.186355959 +0000 UTC	deployed	k8sgpt-operator-0.1.1      	0.0.26     
traefik    	kube-system           	1       	2024-03-15 23:12:57.963367114 +0000 UTC	deployed	traefik-21.2.1+up21.2.0    	v2.9.10    
traefik-crd	kube-system           	1       	2024-03-15 23:12:54.104198207 +0000 UTC	deployed	traefik-crd-21.2.1+up21.2.0	v2.9.10

Notice k3s leveraged helm (internally, via its bundled helm-controller) to install components such as traefik.

Finally, notice this operator works with custom resource definitions (user-defined resource types in k8s).

~$ kubectl api-resources | grep k8sgpt

k8sgpts                                        core.k8sgpt.ai/v1alpha1                true         K8sGPT
results                                        core.k8sgpt.ai/v1alpha1                true         Result
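
kubectl explain can describe these new types, assuming the CRD publishes an OpenAPI schema (most modern operators do):

~$ kubectl explain k8sgpts.spec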

If we try to list instances of these resources, we will not find any.

~$ kubectl get k8sgpts.core.k8sgpt.ai -A

No resources found

~$ kubectl get results.core.k8sgpt.ai -A

No resources found

We can see what is running for k8sgpt as follows.

~$ kubectl get ns

NAME                     STATUS   AGE
kube-system              Active   32m
default                  Active   32m
kube-public              Active   32m
kube-node-lease          Active   32m
k8sgpt-operator-system   Active   4m53s

~$ kubectl get all -n k8sgpt-operator-system 

NAME                                                              READY   STATUS    RESTARTS   AGE
pod/release-k8sgpt-operator-controller-manager-7597b58757-dl9gv   2/2     Running   0          5m5s

NAME                                                              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/release-k8sgpt-opera-controller-manager-metrics-service   ClusterIP   10.43.73.183   <none>        8443/TCP   5m5s

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/release-k8sgpt-operator-controller-manager   1/1     1            1           5m5s

NAME                                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/release-k8sgpt-operator-controller-manager-7597b58757   1         1         1       5m5s

Finally, the operator logs.

~$ kubectl -n k8sgpt-operator-system logs release-k8sgpt-operator-controller-manager-7597b58757-dl9gv 

2024-03-15T23:40:37Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
2024-03-15T23:40:37Z	INFO	setup	starting manager
2024-03-15T23:40:37Z	INFO	Starting server	{"kind": "health probe", "addr": "[::]:8081"}
2024-03-15T23:40:37Z	INFO	starting server	{"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0315 23:40:37.524559       1 leaderelection.go:250] attempting to acquire leader lease k8sgpt-operator-system/ea9c19f7.k8sgpt.ai...
I0315 23:40:37.533829       1 leaderelection.go:260] successfully acquired lease k8sgpt-operator-system/ea9c19f7.k8sgpt.ai
2024-03-15T23:40:37Z	DEBUG	events	release-k8sgpt-operator-controller-manager-7597b58757-dl9gv_3f242be8-f527-4ced-a804-4f188c2ee15f became leader	{"type": "Normal", "object": {"kind":"Lease","namespace":"k8sgpt-operator-system","name":"ea9c19f7.k8sgpt.ai","uid":"9025d947-8c8a-4bb4-91b1-8daff736a2b1","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1176"}, "reason": "LeaderElection"}
2024-03-15T23:40:37Z	INFO	Starting EventSource	{"controller": "k8sgpt", "controllerGroup": "core.k8sgpt.ai", "controllerKind": "K8sGPT", "source": "kind source: *v1alpha1.K8sGPT"}
2024-03-15T23:40:37Z	INFO	Starting Controller	{"controller": "k8sgpt", "controllerGroup": "core.k8sgpt.ai", "controllerKind": "K8sGPT"}
2024-03-15T23:40:37Z	INFO	Starting workers	{"controller": "k8sgpt", "controllerGroup": "core.k8sgpt.ai", "controllerKind": "K8sGPT", "worker count": 1}

We can see it's interacting with the K8s API.

In order to use k8sgpt, we need to connect it to an AI backend.

LocalAI

While k8sgpt can support many AI providers, we choose to use a local and free alternative called LocalAI.

What is LocalAI? Per https://github.com/mudler/LocalAI

The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more models architectures. It allows to generate Text, Audio, Video, Images. Also with voice cloning capabilities.

LocalAI is part of a larger project called go-skynet (https://github.com/go-skynet).

A helm chart to install LocalAI is provided there (https://github.com/go-skynet/helm-charts#readme).

We now install LocalAI.

~$ helm repo add go-skynet https://go-skynet.github.io/helm-charts/

"go-skynet" has been added to your repositories

~$ helm repo update

Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "go-skynet" chart repository
...Successfully got an update from the "k8sgpt" chart repository
Update Complete. ⎈Happy Helming!⎈

Due to our setup, we will need to modify our values.yaml to work with k3d.

We start with the default Helm values for LocalAI (found on the website).

~$ vi values.yaml

replicaCount: 1

resources:
  {}
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

# Prompt templates to include
# Note: the keys of this map will be the names of the prompt template files
promptTemplates:
  {}
  # ggml-gpt4all-j.tmpl: |
  #   The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.
  #   ### Prompt:
  #   {{.Input}}
  #   ### Response:

# Models to download at runtime
models:
  # Whether to force download models even if they already exist
  forceDownload: false

  # The list of URLs to download models from
  # Note: the name of the file will be the name of the loaded model
  list:
    - url: "https://gpt4all.io/models/ggml-gpt4all-j.bin"
      # basicAuth: base64EncodedCredentials

  # Persistent storage for models and prompt templates.
  # PVC and HostPath are mutually exclusive. If both are enabled,
  # PVC configuration takes precedence. If neither are enabled, ephemeral
  # storage is used.
  persistence:
    pvc:
      enabled: false 
      size: 6Gi
      accessModes:
        - ReadWriteOnce

      annotations: {}

      # Optional
      storageClass: local-path

    hostPath:
      enabled: false
      path: "/models"

service:
  type: ClusterIP
  port: 80
  annotations: {}
  # If using an AWS load balancer, you'll need to override the default 60s load balancer idle timeout
  # service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "1200"

ingress:
  enabled: false
  className: ""
  annotations:
    {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: chart-example.local
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls: []
  #  - secretName: chart-example-tls
  #    hosts:
  #      - chart-example.local

nodeSelector: {}

tolerations: []

affinity: {}

Now we pull down the Helm chart for LocalAI and make the following modifications (via the sed pipeline below):

  • set the template name
  • configure for k3d's local-path provisioner
  • k3d's local-path provisioner only allows the ReadWriteOnce disk access mode

~$ helm pull go-skynet/local-ai

~$ ls -l local-ai-3.2.0.tgz 

-rw-r--r-- 1 ubuntu ubuntu 5106 Mar 15 23:53 local-ai-3.2.0.tgz

~$ tar xvf local-ai-3.2.0.tgz 
local-ai/Chart.yaml
local-ai/values.yaml
local-ai/templates/_helpers.tpl
local-ai/templates/_pvc.yaml
local-ai/templates/configmap-prompt-templates.yaml
local-ai/templates/deployment.yaml
local-ai/templates/ingress.yaml
local-ai/templates/pvcs.yaml
local-ai/templates/service.yaml

# todo fix for proper helm use
~$ helm template ./local-ai -f values.yaml --name-template mlops --debug | sed -e 's/ReadWriteMany/ReadWriteOnce/g' -e 's/hostPath/local-path/g' - > local-ai.yaml

install.go:218: [debug] Original chart version: ""
install.go:235: [debug] CHART PATH: /home/ubuntu/local-ai

The following command is a prime example of why CNAI (MLOps) tooling continues to develop. Behind the scenes, the Pod image being downloaded is tens of GBs in size.

~$ kubectl apply -f local-ai.yaml 

persistentvolumeclaim/mlops-local-ai-models created
persistentvolumeclaim/mlops-local-ai-output created
service/mlops-local-ai created
deployment.apps/mlops-local-ai created

Now would be a good time to get coffee. You can monitor the Pod status as follows (waiting for Running).

# this can take 30 plus minutes
~$ kubectl get pods
NAME                              READY   STATUS            RESTARTS   AGE
mlops-local-ai-7fd46c4d56-fjwc2   0/1     PodInitializing   0          3m23s

Use describe to see what is going on.

~$ kubectl describe pod mlops-local-ai-7fd46c4d56-fjwc2 | tail

                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  5m14s  default-scheduler  Successfully assigned default/mlops-local-ai-7fd46c4d56-fjwc2 to k3d-k8sgpt-cluster-server-0
  Normal  Pulling    5m14s  kubelet            Pulling image "busybox"
  Normal  Pulled     5m13s  kubelet            Successfully pulled image "busybox" in 767.213093ms (767.226223ms including waiting)
  Normal  Created    5m13s  kubelet            Created container download-model
  Normal  Started    5m13s  kubelet            Started container download-model
  Normal  Pulling    2m49s  kubelet            Pulling image "quay.io/go-skynet/local-ai:latest"
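
You can also watch the model download itself by following the init container's logs (the container name download-model is taken from the events above):

~$ kubectl logs -f mlops-local-ai-7fd46c4d56-fjwc2 -c download-model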

Another way to monitor is to combine time with kubectl's watch flag (-w).

~$ time kubectl get pods mlops-local-ai-7fd46c4d56-fjwc2 -w
NAME                              READY   STATUS            RESTARTS   AGE
mlops-local-ai-7fd46c4d56-fjwc2   0/1     PodInitializing   0          8m18s

# Eventually, it will change from PodInitializing to Running.

mlops-local-ai-7fd46c4d56-fjwc2   1/1     Running           0          8m51s

We saw 'Pulling image "quay.io/go-skynet/local-ai:latest"' above. The node's image list is initially empty, but once the pull is far enough along we should see it here.

~$ docker exec k3d-k8sgpt-cluster-server-0 ctr image ls | grep local-ai

quay.io/go-skynet/local-ai:latest                                                                                  application/vnd.oci.image.index.v1+json                   sha256:7e75efb68e2da5d619648a2e7b163b14a486b24752f5ac312fdc01ae9361401e 15.0 GiB  linux/amd64,unknown/unknown                                                                                                        io.cri-containerd.image=managed                                 
quay.io/go-skynet/local-ai@sha256:7e75efb68e2da5d619648a2e7b163b14a486b24752f5ac312fdc01ae9361401e                 application/vnd.oci.image.index.v1+json                   sha256:7e75efb68e2da5d619648a2e7b163b14a486b24752f5ac312fdc01ae9361401e 15.0 GiB  linux/amd64,unknown/unknown
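
Before wiring up k8sgpt, a quick smoke test of LocalAI's OpenAI-compatible API is worthwhile. The service name and port come from the manifest we applied above; /v1/models is a standard OpenAI-style endpoint that should list our model:

~$ kubectl port-forward svc/mlops-local-ai 8080:80 &

~$ curl -s http://localhost:8080/v1/models # should list ggml-gpt4all-j

~$ kill %1 # stop the port-forward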

We are now ready to use k8sgpt.

k8sgpt Resource Configuration

The k8sgpt operator looks for resources of type K8sGPT. A K8sGPT resource tells the operator which AI backend to use and how to find it once it's there.

~$ vi k8sgpt-localai.yaml 

apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-local-ai
  namespace: default
spec:
  ai:
    enabled: true
    model: ggml-gpt4all-j
    backend: localai
    baseUrl: http://mlops-local-ai.default.svc.cluster.local:8080/v1
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: v0.3.8

Create the K8sGPT resource.

~$ kubectl apply -f k8sgpt-localai.yaml

k8sgpt.core.k8sgpt.ai/k8sgpt-local-ai created
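
Shortly afterwards, the operator should spin up a k8sgpt workload in the same namespace; you can confirm with:

~$ kubectl get deployments -n default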

Does k8sgpt have any advice for us?

~$ kubectl get results.core.k8sgpt.ai 

No resources found in default namespace.

Not yet! Looks like our system is good to go! Let's break it.

~$ kubectl run broken-pod --image=nginx:1.a.b.c

pod/broken-pod created

After several minutes.

~$ kubectl get results.core.k8sgpt.ai
NAME               KIND   BACKEND
defaultbrokenpod   Pod    localai

It looks like we've been given some advice!

~$ kubectl get results defaultbrokenpod -o json

{
    "apiVersion": "core.k8sgpt.ai/v1alpha1",
    "kind": "Result",
    "metadata": {
        "creationTimestamp": "2024-03-16T00:29:28Z",
        "generation": 1,
        "labels": {
            "k8sgpts.k8sgpt.ai/backend": "localai",
            "k8sgpts.k8sgpt.ai/name": "k8sgpt-local-ai",
            "k8sgpts.k8sgpt.ai/namespace": "default"
        },
        "name": "defaultbrokenpod",
        "namespace": "default",
        "resourceVersion": "3847",
        "uid": "29aac81f-fc40-4ead-9226-7597bd7c2ce6"
    },
    "spec": {
        "backend": "localai",
        "details": "",
        "error": [
            {
                "text": "Back-off pulling image \"nginx:1.a.b.c\""
            }
        ],
        "kind": "Pod",
        "name": "default/broken-pod",
        "parentObject": ""
    },
    "status": {
        "lifecycle": "historical"
    }
}

Of note, the error text is pretty much to the point (and the same as the regular k8s error message).

k8sgpt CLI

While we can retrieve results via kubectl, there is also a CLI tool for k8sgpt.

~$ curl -sLO https://github.com/k8sgpt-ai/k8sgpt/releases/download/v0.3.27/k8sgpt_amd64.deb

~$ sudo dpkg -i k8sgpt_amd64.deb

...

We can retrieve the results as follows.

~$ k8sgpt analyze -b localai
AI Provider: AI not used; --explain not set

0 default/broken-pod(broken-pod)
- Error: Back-off pulling image "nginx:1.a.b.c"
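
To have the CLI actually ask LocalAI for an explanation rather than just listing errors, register the backend first and then pass --explain. The flags below follow k8sgpt's documentation at the time of writing, and the baseurl assumes a port-forward like the earlier smoke test, so verify both against your setup:

~$ k8sgpt auth add --backend localai --model ggml-gpt4all-j --baseurl http://localhost:8080/v1

~$ k8sgpt analyze -b localai --explain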

Challenge - Fix It!

  • Remove pod
  • Check that result is gone
  • Relaunch with valid tag
  • Check again, no new error!
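
One possible sequence for the challenge (nginx:1.25 is just an example of a valid tag):

~$ kubectl delete pod broken-pod

~$ kubectl get results.core.k8sgpt.ai # the old result should eventually be reconciled away

~$ kubectl run fixed-pod --image=nginx:1.25

~$ kubectl get results.core.k8sgpt.ai # no new error expected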

Summary

K8sGPT is an example of CNAI technology, one that helps both the operator and the cluster perform better!
