Skip to content

Instantly share code, notes, and snippets.

@zufardhiyaulhaq
Last active April 5, 2026 15:14
Show Gist options
  • Select an option

  • Save zufardhiyaulhaq/4f7edbef05ed831d8b86c8a57913bf09 to your computer and use it in GitHub Desktop.

Select an option

Save zufardhiyaulhaq/4f7edbef05ed831d8b86c8a57913bf09 to your computer and use it in GitHub Desktop.
gemma4 26B A4B Kubernetes manifests with llama.cpp
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache
namespace: llm
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
---
apiVersion: v1
kind: Secret
metadata:
name: hf-secret
namespace: llm
type: Opaque
stringData:
token: hf-xxxx
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: gemma4-26b-a4b
namespace: llm
labels:
app: gemma4-26b-a4b
spec:
replicas: 1
selector:
matchLabels:
app: gemma4-26b-a4b
template:
metadata:
labels:
app: gemma4-26b-a4b
spec:
nodeSelector:
name: gpu-nodepool
tolerations:
- key: nvidia.com/gpu
operator: Equal
value: present
effect: NoSchedule
containers:
- name: llama-cpp
image: ghcr.io/ggml-org/llama.cpp:server-cuda
args:
- --hf-repo
- ggml-org/gemma-4-26B-A4B-it-GGUF
- --hf-file
- gemma-4-26B-A4B-it-Q4_K_M.gguf
- --host
- 0.0.0.0
- --port
- "8080"
- --n-gpu-layers
- "99"
- --ctx-size
- "65536"
- --parallel
- "1"
- --temp
- "0.2"
- --api-key
- your_api_key_here
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-secret
key: token
ports:
- name: http
containerPort: 8080
resources:
limits:
nvidia.com/gpu: "1"
memory: 28Gi
cpu: "8"
requests:
nvidia.com/gpu: "1"
memory: 8Gi
cpu: "4"
volumeMounts:
- name: shm
mountPath: /dev/shm
- name: model-cache
mountPath: /root/.cache
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 120
periodSeconds: 10
failureThreshold: 20
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 180
periodSeconds: 30
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 2Gi
- name: model-cache
persistentVolumeClaim:
claimName: model-cache
---
apiVersion: v1
kind: Service
metadata:
name: gemma4-26b-a4b
namespace: llm
spec:
selector:
app: gemma4-26b-a4b
ports:
- name: http
port: 8080
targetPort: 8080
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment