Skip to content

Instantly share code, notes, and snippets.

@SaschaHeyer
Last active September 2, 2024 12:15
Show Gist options
  • Save SaschaHeyer/0df452ddf243df5c7ef0fc8d760dfc1b to your computer and use it in GitHub Desktop.
Save SaschaHeyer/0df452ddf243df5c7ef0fc8d760dfc1b to your computer and use it in GitHub Desktop.
gemma vllm TPU

gcloud config set project sascha-playground-doit export PROJECT_ID=$(gcloud config get project) export REGION=us-central1 export CLUSTER_NAME=vllm export HF_TOKEN=XXX

gcloud container clusters create-auto ${CLUSTER_NAME}
--project=${PROJECT_ID}
--region=${REGION}
--release-channel=rapid
--cluster-version=1.28

gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${REGION}

kubectl create secret generic hf-secret
--from-literal=hf_api_token=$HF_TOKEN
--dry-run=client -o yaml | kubectl apply -f -

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2b-it
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240220_0936_RC01
        resources:
          requests:
            cpu: "2"
            memory: "7Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
          limits:
            cpu: "2"
            memory: "7Gi"
            ephemeral-storage: "10Gi"
            nvidia.com/gpu: 1
        command: ["python3", "-m", "vllm.entrypoints.api_server"]
        args:
        - --model=$(MODEL_ID)
        - --tensor-parallel-size=1
        env:
        - name: MODEL_ID
          value: google/gemma-2b-it
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
            medium: Memory
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment