@elvircrn
Last active January 22, 2026 07:57
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: wide-ep-llm-d-prefill
  labels:
    llm-d.ai/inferenceServing: "true"
    llm-d.ai/model: Qwen3-30B-A3B
    llm-d.ai/role: prefill
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 1
    workerTemplate:
      metadata:
        labels:
          llm-d.ai/inferenceServing: "true"
          llm-d.ai/model: Qwen3-30B-A3B
          llm-d.ai/role: prefill
      spec:
        serviceAccountName: deepseek-r1
        imagePullSecrets:
          - name: rh-ee-ecrncevi-ecrncevi-vllm-pull-pull-secret
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi # 2Gi for NCCL 2.27 to be safe
          - name: shared-cuda
            emptyDir: {}
        resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: llm-d-dev-claim
        containers:
          - name: vllm-worker
            image: quay.io/rh-ee-ecrncevi/llm-d-dev@sha256:032980d0a779f6e32f9606becf66841a0e46501aaf155b5678ff73c17c381992
            securityContext:
              privileged: true
              runAsGroup: 0
              runAsUser: 0
            imagePullPolicy: Always
            command:
              - /bin/bash
              - -c
            args:
              - |-
                #################
                # Run the vLLM prefill worker
                #################
                START_RANK=$(( ${LWS_WORKER_INDEX:-0} * DP_SIZE_LOCAL ))
                source /opt/vllm/bin/activate
                exec vllm serve \
                  Qwen/Qwen3-30B-A3B \
                  --port 8000 \
                  --disable-uvicorn-access-log \
                  --enable-expert-parallel \
                  --data-parallel-hybrid-lb \
                  --data-parallel-size $((LWS_GROUP_SIZE * DP_SIZE_LOCAL)) \
                  --data-parallel-size-local $DP_SIZE_LOCAL \
                  --data-parallel-address ${LWS_LEADER_ADDRESS} \
                  --data-parallel-rpc-port 5555 \
                  --data-parallel-start-rank $START_RANK \
                  --trust-remote-code \
                  --kv_transfer_config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
            env:
              - name: DP_SIZE_LOCAL
                value: "4"
              - name: TRITON_LIBCUDA_PATH
                value: /usr/lib64
              - name: VLLM_DEEPEP_HIGH_THROUGHPUT_FORCE_INTRA_NODE
                value: "1"
              - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS
                value: "1"
              - name: VLLM_USE_DEEP_GEMM
                value: "1"
              - name: VLLM_ALL2ALL_BACKEND
                value: deepep_high_throughput
              - name: NVIDIA_GDRCOPY
                value: enabled
              - name: NVSHMEM_DEBUG
                value: INFO
              - name: VLLM_LOGGING_LEVEL
                value: INFO
              - name: NCCL_DEBUG
                value: INFO
              - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
            ports:
              - containerPort: 8000
                name: metrics
                protocol: TCP
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
            resources:
              limits:
                ephemeral-storage: 75Gi
                memory: 512Gi
                nvidia.com/gpu: "4"
              requests:
                cpu: 32
                ephemeral-storage: 75Gi
                memory: 512Gi
                nvidia.com/gpu: "4"
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - name: shared-cuda
                mountPath: /usr/local/cuda
            workingDir: /code
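The worker command above derives each pod's starting data-parallel rank from its LeaderWorkerSet worker index (`START_RANK = LWS_WORKER_INDEX * DP_SIZE_LOCAL`) and the total data-parallel size from the group size (`LWS_GROUP_SIZE * DP_SIZE_LOCAL`). A minimal sketch of that arithmetic, assuming the usual LWS semantics where worker indices run from 0 to group size minus 1 (`dp_layout` is a hypothetical helper, not part of vLLM or LWS):

```python
# Sketch of the data-parallel rank arithmetic in the worker command above.
# lws_group_size mirrors LWS_GROUP_SIZE; dp_size_local mirrors DP_SIZE_LOCAL.

def dp_layout(lws_group_size: int, dp_size_local: int):
    """Return (--data-parallel-size, per-worker --data-parallel-start-rank)."""
    total = lws_group_size * dp_size_local            # total DP ranks in the group
    starts = [i * dp_size_local for i in range(lws_group_size)]  # START_RANK per worker
    return total, starts

# With this manifest's values (size: 1, DP_SIZE_LOCAL: 4):
print(dp_layout(lws_group_size=1, dp_size_local=4))  # (4, [0])
```

With `leaderWorkerTemplate.size: 1` the leader is the only worker, so all four local DP ranks start at 0; scaling `size` up would give each additional pod a start rank offset by 4.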