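# LeaderWorkerSet for the decode side of a disaggregated llm-d deployment of
# Qwen/Qwen3-30B-A3B with wide expert parallelism (wide-EP): each pod runs a
# routing sidecar plus a vLLM worker hosting DP_SIZE_LOCAL data-parallel ranks.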
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: wide-ep-llm-d-decode
  labels:
    llm-d.ai/inferenceServing: "true"
    llm-d.ai/model: Qwen3-30B-A3B
    llm-d.ai/role: decode
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 1
    workerTemplate:
      metadata:
        labels:
          llm-d.ai/inferenceServing: "true"
          llm-d.ai/model: Qwen3-30B-A3B
          llm-d.ai/role: decode
      spec:
        serviceAccountName: deepseek-r1
        imagePullSecrets:
          - name: rh-ee-ecrncevi-ecrncevi-vllm-pull-pull-secret
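        # Native sidecar pattern: an initContainer with restartPolicy: Always runs
        # for the pod's lifetime. The llm-d routing proxy accepts inference traffic
        # on :8000 and forwards it to the local vLLM worker on :8200 via the nixlv2
        # connector (used for prefill/decode disaggregation).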
        initContainers:
          - name: routing-proxy
            args:
              - --port=8000
              - --vllm-port=8200
              - --connector=nixlv2
              - --zap-log-level=debug
              - --secure-proxy=false
            image: ghcr.io/llm-d/llm-d-routing-sidecar:latest@sha256:e408208da659a8d5b33be94eaaff437c0af8b0ea920a87e023901d66ae62a9bc
            imagePullPolicy: Always
            ports:
              - containerPort: 8000
            resources: {}
            restartPolicy: Always
            securityContext:
              allowPrivilegeEscalation: false
              runAsNonRoot: true
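        # dshm backs /dev/shm for NCCL's shared-memory transport; shared-cuda is an
        # emptyDir mounted over /usr/local/cuda in the worker (presumably populated
        # at runtime rather than baked into the image).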
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi # 2Gi for NCCL 2.27 to be safe
          - name: shared-cuda
            emptyDir: {}
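        # DRA claim for a compute-domain channel (presumably the NVIDIA DRA driver's
        # ComputeDomain for multi-node NVLink); assumes a ResourceClaimTemplate named
        # llm-d-dev-claim already exists in this namespace.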
        resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: llm-d-dev-claim
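        # Main vLLM decode worker: 4 GPUs per pod, one data-parallel rank per GPU
        # (DP_SIZE_LOCAL=4, set in env below).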
        containers:
          - name: vllm-worker
            image: quay.io/rh-ee-ecrncevi/llm-d-dev@sha256:032980d0a779f6e32f9606becf66841a0e46501aaf155b5678ff73c17c381992
            securityContext:
              privileged: true
              runAsGroup: 0
              runAsUser: 0
            imagePullPolicy: Always
            command:
              - /bin/bash
              - -c
            args:
              - |-
                #################
                # RUN vLLM decode worker
                #################
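                # Each pod hosts DP_SIZE_LOCAL data-parallel ranks, so this pod's
                # first global rank is its LWS worker index times DP_SIZE_LOCAL.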
                START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))
                source /opt/vllm/bin/activate
                exec vllm serve \
                  Qwen/Qwen3-30B-A3B \
                  --port 8200 \
                  --disable-uvicorn-access-log \
                  --enable-expert-parallel \
                  --data-parallel-hybrid-lb \
                  --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
                  --data-parallel-size-local $DP_SIZE_LOCAL \
                  --data-parallel-address ${LWS_LEADER_ADDRESS} \
                  --data-parallel-rpc-port 5555 \
                  --data-parallel-start-rank $START_RANK \
                  --trust-remote-code \
                  --kv_transfer_config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
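            # Expert-parallel tuning: DeepEP low-latency all-to-all over NVLink/MNNVL,
            # DeepGEMM MoE kernels, NCCL/NVSHMEM debug logging, and the NIXL KV-transfer
            # side channel bound to the pod IP.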
            env:
              - name: VLLM_DEEPEP_LOW_LATENCY_ALLOW_NVLINK
                value: "1"
              - name: VLLM_DEEPEP_LOW_LATENCY_USE_MNNVL
                value: "1"
              - name: VLLM_FUSED_MOE_CHUNK_SIZE
                value: "1024"
              - name: DP_SIZE_LOCAL
                value: "4"
              - name: TRITON_LIBCUDA_PATH
                value: /usr/lib64
              - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
                value: "1"
              - name: NCCL_DEBUG
                value: INFO
              - name: VLLM_USE_DEEP_GEMM
                value: "1"
              - name: VLLM_ALL2ALL_BACKEND
                value: deepep_low_latency
              - name: NVIDIA_GDRCOPY
                value: enabled
              - name: NVSHMEM_DEBUG
                value: INFO
              - name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
                value: eth0
              - name: VLLM_LOGGING_LEVEL
                value: INFO
              - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
            ports:
              - containerPort: 8200
                name: metrics
                protocol: TCP
            readinessProbe:
              httpGet:
                path: /health
                port: 8200
            resources:
              limits:
                ephemeral-storage: 75Gi
                memory: 512Gi
                nvidia.com/gpu: "4"
              requests:
                cpu: 32
                ephemeral-storage: 75Gi
                memory: 512Gi
                nvidia.com/gpu: "4"
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - name: shared-cuda
                mountPath: /usr/local/cuda
            workingDir: /code
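# To deploy (assumes the LeaderWorkerSet CRD and llm-d components are installed;
# the filename below is illustrative):
#   kubectl apply -f wide-ep-llm-d-decode.yaml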