Copy-paste ready steps to validate the Direct vLLM provider end-to-end. Every example
uses namespace vllm-test — change it once in Setup and the rest follow.
Field names match controller/api/v1alpha1/modeldeployment_types.go, the sample
controller/config/samples/airunway_v1alpha1_modeldeployment.yaml, and docs/providers/vllm.md.
The provider default image is vllm/vllm-openai:cu130-nightly; the server, Service, and probes
are hard-wired to 0.0.0.0:8000.
Each test below states what it checks, why it matters, the steps, and what a PASS vs FAIL looks like.
# --- Install KAITO & Dynamo as fallback providers ---
cd airunway
make -C providers/kaito setup-kaito
make -C providers/dynamo setup-dynamo
# --- Build, push and deploy AI Runway ---
export REGISTRY="quay.io/surajd"
TAG=$(git describe --tags --always)-$(date +%Y-%m-%d-%H-%M-%S)
export TAG
export PUSH=true
# Controller
make controller-docker-build CONTROLLER_IMG="${REGISTRY}/kubeairunway-controller:${TAG}"
pushd controller && make deploy IMG="${REGISTRY}/kubeairunway-controller:${TAG}" && popd
# vLLM provider
pushd providers/vllm
export IMG="${REGISTRY}/kubeairunway-vllm-provider:${TAG}"
make docker-build && make deploy
popd
# Dynamo provider
pushd providers/dynamo
export IMG="${REGISTRY}/kubeairunway-dynamo-provider:${TAG}"
make docker-build && make deploy
popd
# KAITO provider
pushd providers/kaito
export IMG="${REGISTRY}/kubeairunway-kaito-provider:${TAG}"
make docker-build && make deploy
popd
# Test namespace + (optional) HF token secret for gated models
kubectl create namespace vllm-test
kubectl -n vllm-test create secret generic vllm-hf-token \
  --from-literal=HF_TOKEN="${HF_TOKEN:-replace-me}"

Verify provider self-registration. The vLLM provider is explicit-only, so it registers with no selection rules and must never be auto-selected:

kubectl get inferenceproviderconfig vllm \
  -o jsonpath='{.status.ready}{"\n"}{.status.version}{"\n"}{.spec.selectionRules}{"\n"}'

- PASS: true / vllm-provider:v0.1.0 / empty (no selectionRules).
- FAIL: ready is false, or selectionRules is non-empty (the provider would be eligible for auto-selection, which it should never be).

What it checks: auto-selection never picks vLLM; explicit provider.name: vllm works; and the
chosen provider is immutable once recorded.
Why it matters: vLLM has no selection rules, so a managed provider must win auto-selection.
Once a provider is recorded in status.provider.name, spec.provider.name is immutable — the
admission webhook rejects any in-place change synchronously, so switching providers requires
delete-and-recreate, not a patch.
cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: sel-auto
  namespace: vllm-test
spec:
  model:
    id: Qwen/Qwen2.5-0.5B-Instruct
    source: huggingface
  engine:
    type: vllm
  resources:
    gpu:
      count: 1
EOF
kubectl -n vllm-test get modeldeployment sel-auto -o jsonpath='{.status.provider.name}{"\n"}'

- PASS: a managed provider (dynamo/kuberay), not vllm.
- FAIL: vllm (auto-selection wrongly chose the explicit-only provider).

cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: sel-vllm
  namespace: vllm-test
spec:
  provider:
    name: vllm
  model:
    id: Qwen/Qwen2.5-0.5B-Instruct
    source: huggingface
  engine:
    type: vllm
  resources:
    gpu:
      count: 1
EOF
kubectl -n vllm-test get modeldeployment sel-vllm -o jsonpath='{.status.provider.name}{"\n"}'
kubectl -n vllm-test get modeldeployment sel-vllm -o jsonpath='{.metadata.finalizers}{"\n"}'
kubectl -n vllm-test get deploy,svc sel-vllm

- PASS: provider.name=vllm; finalizer ["airunway.ai/vllm-provider"]; a Deployment and Service named after the MD both exist.
- FAIL: provider not selected, finalizer missing, or Deployment/Service absent.

kubectl -n vllm-test patch modeldeployment sel-auto --type=merge \
-p '{"spec":{"provider":{"name":"vllm"}}}'
kubectl -n vllm-test get modeldeployment sel-auto \
-o jsonpath='{.status.phase}{"\n"}{range .status.conditions[?(@.type=="ProviderSelected")]}{.reason}: {.message}{"\n"}{end}'
kubectl -n vllm-test get modeldeployment sel-auto -o jsonpath='{.status.provider.name}{"\n"}'

- PASS: the patch is denied synchronously by the admission webhook: Error from server (Forbidden): ... spec.provider.name: Invalid value: "vllm": provider.name is immutable (changing it requires delete and recreate). Because the patch was rejected, the live object is unchanged: phase stays Deploying, the ProviderSelected condition still reads AutoSelected: Provider dynamo auto-selected, and status.provider.name is still the original provider (dynamo).
- FAIL: the patch is accepted, or status.provider.name switches to vllm.

kubectl -n vllm-test delete modeldeployment sel-auto
kubectl -n vllm-test delete modeldeployment sel-vllm

What it checks: with no engine.image, the provider resolves the default nightly tag to a
digest and pins it; repeated reconciles never re-resolve.
Why it matters: this is a deliberate reproducibility trade-off (pin the first-resolved digest
forever) rather than tracking the moving cu130-nightly tag. The test confirms the trade-off is
acceptable and that the only refresh path is changing spec.engine.image.
cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: pin-test
  namespace: vllm-test
spec:
  provider:
    name: vllm
  model:
    id: Qwen/Qwen2.5-0.5B-Instruct
    source: huggingface
  engine:
    type: vllm
  resources:
    gpu:
      count: 1
EOF
sleep 10
kubectl -n vllm-test get modeldeployment pin-test -o jsonpath='{.status.image}{"\n"}' | python3 -m json.tool
DIGEST1=$(kubectl -n vllm-test get modeldeployment pin-test -o jsonpath='{.status.image.digest}')
# Force several reconciles and confirm the digest never changes:
for i in 1 2 3; do
kubectl -n vllm-test annotate modeldeployment pin-test airunway.ai/touch="$(date +%s%N)" --overwrite
sleep 35 # > the 30s RequeueInterval
done
DIGEST2=$(kubectl -n vllm-test get modeldeployment pin-test -o jsonpath='{.status.image.digest}')
echo "before=$DIGEST1"; echo "after =$DIGEST2"

- PASS: status.image shows source=nightly, inNightly=true, a resolved digest, and requested=vllm/vllm-openai:cu130-nightly; the before and after digests are equal (pinned, not re-pulled).
- FAIL: the digest changes between reconciles (the moving tag is being re-pulled, contradicting the documented pinning behavior).

What it checks: the default image is a hard gate — if its digest can't be resolved, no Deployment is created. A user-supplied image that fails to resolve is non-fatal by contrast.
Why it matters: the default nightly tag must resolve to a digest before any pod runs (for reproducibility), whereas a user explicitly choosing a tag is trusted to proceed even if resolution fails.
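
The pin-once and hard-gate behavior that this test and the previous one exercise can be summarized in a few lines. The sketch below is Python rather than the provider's own code, and every name in it (ImageStatus, resolve_image, the lookup_digest callback) is hypothetical; only the default image string and the condition reasons are taken from this document.

```python
from dataclasses import dataclass
from typing import Callable, Optional

DEFAULT_IMAGE = "vllm/vllm-openai:cu130-nightly"  # provider default named in this document


@dataclass
class ImageStatus:
    requested: str           # image reference the spec asked for
    resolved: Optional[str]  # digest-pinned reference, or None if unresolved
    reason: str              # ImageResolved / ImageResolutionReused / ImageResolutionFailed
    may_deploy: bool         # False means no Deployment is rendered


def resolve_image(spec_image: Optional[str],
                  previous: Optional[ImageStatus],
                  lookup_digest: Callable[[str], str]) -> ImageStatus:
    """Hypothetical reconcile step: pin once, gate only the default image."""
    requested = spec_image or DEFAULT_IMAGE
    # Reuse the pinned digest while the requested reference is unchanged.
    if previous and previous.requested == requested and previous.resolved:
        return ImageStatus(requested, previous.resolved, "ImageResolutionReused", True)
    try:
        digest = lookup_digest(requested)  # assumed to raise OSError on a registry outage
    except OSError:
        # Default image: hard gate. User-supplied image: keep the tag and proceed.
        return ImageStatus(requested, None, "ImageResolutionFailed", spec_image is not None)
    repo = requested.rsplit(":", 1)[0]  # sketch only: assumes a tag is present
    return ImageStatus(requested, f"{repo}@{digest}", "ImageResolved", True)
```

Passing the previous status back in on each reconcile is what keeps the digest stable; changing spec.engine.image changes requested and so triggers a fresh lookup.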
cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
name: outage-test
namespace: vllm-test
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine: { type: vllm }
  resources: { gpu: { count: 1 } }
EOF
# Image conditions are written by the vLLM PROVIDER controller ~1s after the core controller
# stamps provider.name. Querying immediately races that gap and returns EMPTY. Wait first:
kubectl -n vllm-test wait --for=jsonpath='{.status.image.source}'=nightly \
  modeldeployment/outage-test --timeout=60s

(A) Healthy cluster (registry reachable), the normal expectation:

kubectl -n vllm-test get modeldeployment outage-test \
-o jsonpath='{range .status.conditions[?(@.type=="ImageResolved")]}{.status} {.reason}{"\n"}{end}'
kubectl -n vllm-test get modeldeployment outage-test \
  -o jsonpath='{.status.image.source}={.status.image.resolved}{"\n"}'

- PASS: ImageResolved is True (ImageResolutionReused, or ImageResolved on first resolve); source=nightly with a digest-pinned resolved value.

(B) True outage (optional) — block the provider's egress, then create a fresh default-image MD:
# kubectl -n airunway-system apply -f - <<'NP' # deny-all egress for the provider pod
# apiVersion: networking.k8s.io/v1
# kind: NetworkPolicy
# metadata: { name: block-vllm-egress, namespace: airunway-system }
# spec:
#   podSelector: { matchLabels: { control-plane: vllm-provider } }
#   policyTypes: [Egress]
#   egress: []
# NP

- PASS (during outage): ImageResolved=False (ImageResolutionFailed) and kubectl -n vllm-test get deploy <name> → NotFound (no pod until the digest resolves).
- FAIL: a Deployment is created despite the default image failing to resolve.

Recovery / non-fatal user image — a user-set engine.image that fails to resolve keeps the tag and proceeds:
kubectl -n vllm-test patch modeldeployment outage-test --type=merge \
  -p '{"spec":{"engine":{"image":"vllm/vllm-openai:latest"}}}'

- PASS: the deployment proceeds with the user tag (unlike the default-image hard gate).

kubectl -n vllm-test delete modeldeployment outage-test

What it checks: setting both the legacy spec.image and the preferred spec.engine.image to
different values is rejected.
Why it matters: two competing image sources are ambiguous; the validating webhook should refuse rather than silently pick one.
cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: img-conflict
  namespace: vllm-test
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  image: ubuntu:latest                  # legacy field
  engine:
    type: vllm
    image: vllm/vllm-openai:latest      # preferred field, conflicts with spec.image
  resources: { gpu: { count: 1 } }
EOF
# If the webhook is disabled, the object persists; check it parks at Failed:
kubectl -n vllm-test get modeldeployment img-conflict \
  -o jsonpath='{.status.phase}{"\n"}{.status.image.requested}{"\n"}'

- PASS: the webhook rejects the apply with a message naming both spec.image and spec.engine.image (or, if the webhook is off, reconcile parks at Failed).
- FAIL: the object is accepted and a deployment proceeds with one image silently chosen.

What it checks: host/port are provider-generated (container, Service, and probes hard-wired
to 0.0.0.0:8000), so any attempt to set them — in all four spellings — must be rejected.
Why it matters: a user-set host/port would desync the container from the Service/probes and break
routing. Rejection is asynchronous: the transformer fails the reconcile and sets phase: Failed ~1s
after apply, so each block waits for that.
The expected failure message for every block is ...conflicts with Direct vLLM generated networking....
- PASS (all four): phase: Failed with the networking-conflict message.
- FAIL: the wait times out (not rejected) or the MD renders a Deployment.

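
The four spellings tested below differ only in where and how the flag is written. As a rough illustration of why all four must be caught, here is a minimal Python sketch of a detector; the function name and the RESERVED set are hypothetical stand-ins, not the provider's transformer code.

```python
from typing import Dict, List

RESERVED = {"host", "port"}  # flags the provider generates itself, per this document


def reserved_flag_conflicts(args: Dict[str, str], extra_args: List[str]) -> List[str]:
    """Return the reserved flag names the user tried to set, in any spelling."""
    # Spelling 1: a key in the engine.args map, e.g. {"port": "9000"}.
    found = [key.lstrip("-") for key in args if key.lstrip("-") in RESERVED]
    for token in extra_args:
        if not token.startswith("--"):
            continue  # a bare value such as "9000" that follows "--port"
        # Spellings 2-4: "--host=10.0.0.1", "--port" "9000", and "--port=9000".
        name = token[2:].split("=", 1)[0]
        if name in RESERVED:
            found.append(name)
    return found
```

A non-empty result corresponds to the networking-conflict failure each block waits for.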
cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: reserved-a, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  resources: { gpu: { count: 1 } }
  engine:
    type: vllm
    args:
      port: "9000"
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.phase}'=Failed \
  modeldeployment/reserved-a --timeout=60s
kubectl -n vllm-test get modeldeployment reserved-a -o jsonpath='{.status.phase}: {.status.message}{"\n"}'
kubectl -n vllm-test delete modeldeployment reserved-a --ignore-not-found

cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: reserved-b, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  resources: { gpu: { count: 1 } }
  engine:
    type: vllm
    extraArgs: ["--host=10.0.0.1"]
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.phase}'=Failed \
  modeldeployment/reserved-b --timeout=60s
kubectl -n vllm-test get modeldeployment reserved-b -o jsonpath='{.status.phase}: {.status.message}{"\n"}'
kubectl -n vllm-test delete modeldeployment reserved-b --ignore-not-found

cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: reserved-c, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  resources: { gpu: { count: 1 } }
  engine:
    type: vllm
    extraArgs: ["--port", "9000"]
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.phase}'=Failed \
  modeldeployment/reserved-c --timeout=60s
kubectl -n vllm-test get modeldeployment reserved-c -o jsonpath='{.status.phase}: {.status.message}{"\n"}'
kubectl -n vllm-test delete modeldeployment reserved-c --ignore-not-found

cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: reserved-d, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  resources: { gpu: { count: 1 } }
  engine:
    type: vllm
    extraArgs: ["--port=9000"]
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.phase}'=Failed \
  modeldeployment/reserved-d --timeout=60s
kubectl -n vllm-test get modeldeployment reserved-d -o jsonpath='{.status.phase}: {.status.message}{"\n"}'
kubectl -n vllm-test delete modeldeployment reserved-d --ignore-not-found

What it checks: (6a) an explicit engine.args flag wins over the auto-derived flag, so no
duplicate is rendered; (6b) setting the same key in both args and extraArgs is rejected at
admission.
Why it matters: a duplicate flag would be silently resolved by vLLM's last-wins argparse,
defeating the user's explicit value. The provider both dedups (derived flags suppressed when the user
set the key) and rejects genuine conflicts (same key in two places) synchronously at kubectl time.
6a inspects the rendered Deployment manifest, which is populated regardless of whether the pod can schedule, so it works even when GPU nodes can't satisfy gpu.count: 2. (On single-GPU nodes the pod sits Pending; drop gpu.count to 1 and tensor-parallel-size to "1" if you also want it to run.)
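
To make the two behaviors concrete, the following Python sketch renders a launch argument list from three inputs. It is an illustration under stated assumptions: render_args and flag_name are hypothetical names, the error text only imitates the message quoted later in this test, and suppressing a derived flag when extraArgs sets it is an assumption rather than confirmed provider behavior.

```python
from typing import Dict, List


def flag_name(token: str) -> str:
    """Extract "tensor-parallel-size" from "--tensor-parallel-size=2" or "--tensor-parallel-size"."""
    return token[2:].split("=", 1)[0]


def render_args(derived: Dict[str, str], args: Dict[str, str], extra_args: List[str]) -> List[str]:
    """Hypothetical renderer: explicit flags suppress derived ones; args/extraArgs clashes raise."""
    extra_names = {flag_name(t) for t in extra_args if t.startswith("--")}
    clash = sorted(extra_names & set(args))
    if clash:  # 6b: the same key in two places is a genuine conflict
        raise ValueError(
            f'launch flag "{clash[0]}" is set in both spec.engine.args and spec.engine.extraArgs'
        )
    # 6a: start from derived flags, dropping any the user set (assumption: in either place).
    merged = {k: v for k, v in derived.items() if k not in extra_names}
    merged.update(args)  # an explicit value replaces the derived one instead of duplicating it
    rendered: List[str] = []
    for key, value in merged.items():
        rendered += [f"--{key}", value]
    return rendered + list(extra_args)
```

With derived={"tensor-parallel-size": "2"} and args={"tensor-parallel-size": "4"}, the result contains the flag once, with the explicit value.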
cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: tp-test
  namespace: vllm-test
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine:
    type: vllm
    args:
      tensor-parallel-size: "4"
  resources:
    gpu:
      count: 2
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.image.source}' \
modeldeployment/tp-test --timeout=60s
kubectl -n vllm-test get deploy tp-test \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep -n -A 1 -- '--tensor-parallel-size'

- PASS: exactly one --tensor-parallel-size, value 4 (explicit wins; the derived 2 from gpu.count is suppressed).
- FAIL: two entries, or the derived 2 appears.

kubectl -n vllm-test patch modeldeployment tp-test --type=merge \
-p '{"spec":{"engine":{"extraArgs":["--tensor-parallel-size=2"]}}}'
# Re-check the rendered Deployment — it must still have exactly one flag, value 4:
kubectl -n vllm-test get deploy tp-test \
-o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep -n -A 1 -- '--tensor-parallel-size'
kubectl -n vllm-test delete modeldeployment tp-test --ignore-not-found

- PASS: the patch is denied synchronously with Error from server (Forbidden): ... launch flag "tensor-parallel-size" is set in both spec.engine.args and spec.engine.extraArgs ... and, because the patch was rejected, the live Deployment still shows one flag, value 4.
- FAIL: the patch is accepted, or the Deployment renders a duplicate or the value 2.

What it checks: how the provider classifies the image source in status.image for a launch-style tag.
Why it matters: classification only labels a tag launch when it literally contains the substring
"launch". Real launch tags like vllm/vllm-openai:deepseekv4-cu130 fall through to custom and flip
UnsupportedImage=True, producing misleading provenance for the launch workflow the docs describe.
(Open finding — confirm current behavior; if broadened/fixed, the first case should report launch.)
The four status.image.source values (directVLLMImageSource, evaluated top-to-bottom — first match wins):
| source | Rule (first match wins) | UnsupportedImage |
|---|---|---|
| launch | tag contains the substring launch (case-insensitive), e.g. cu130-launch | False |
| nightly | image is exactly the provider default (vllm/vllm-openai:cu130-nightly), or repo is vllm/vllm-openai and the tag contains nightly | False |
| stable | repo is vllm/vllm-openai and the tag is exactly latest | False |
| custom | anything else (non-official repo, or an official repo with an unrecognized tag) | True (CustomImage) |

Only custom sets UnsupportedImage=True. The bug is the ordering/substring match: a genuine launch
image whose tag doesn't contain the word launch (e.g. deepseekv4-cu130) misses the first rule, isn't
nightly/latest either, and so lands in custom.
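
The ordering can be traced with a small Python model of the table. This is a sketch of the rules as written in this document, not the provider's directVLLMImageSource function; classify_image_source is a hypothetical name, and the parsing ignores digests and registry ports.

```python
DEFAULT_IMAGE = "vllm/vllm-openai:cu130-nightly"
OFFICIAL_REPO = "vllm/vllm-openai"


def classify_image_source(image: str) -> str:
    """Apply the table's rules top to bottom; the first match wins."""
    repo, _, tag = image.partition(":")  # sketch: assumes a plain repo:tag reference
    if "launch" in tag.lower():
        return "launch"
    if image == DEFAULT_IMAGE or (repo == OFFICIAL_REPO and "nightly" in tag):
        return "nightly"
    if repo == OFFICIAL_REPO and tag == "latest":
        return "stable"
    return "custom"  # the only source that sets UnsupportedImage=True


for ref in ("vllm/vllm-openai:cu130-launch", "vllm/vllm-openai:deepseekv4-cu130"):
    print(ref, "->", classify_image_source(ref))
```

The second reference falls through every rule because its tag contains none of launch, nightly, or latest, which is the finding this test confirms.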
cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: launch-test, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: deepseek-ai/DeepSeek-V2-Lite, source: huggingface }
  engine: { type: vllm, image: "vllm/vllm-openai:deepseekv4-cu130" }
  resources: { gpu: { count: 1 } }
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.image.source}' \
modeldeployment/launch-test --timeout=60s
kubectl -n vllm-test get modeldeployment launch-test -o jsonpath='{.status.image.source}{"\n"}'
# Compare against a tag that DOES contain the literal "launch":
kubectl -n vllm-test patch modeldeployment launch-test --type=merge \
-p '{"spec":{"engine":{"image":"vllm/vllm-openai:cu130-launch"}}}'
sleep 2
kubectl -n vllm-test get modeldeployment launch-test -o jsonpath='{.status.image.source}{"\n"}'

- Current (finding present): deepseekv4-cu130 → custom (+ UnsupportedImage=True); cu130-launch → launch.
- Fixed: the launch tag is classified launch regardless of the literal substring.

kubectl -n vllm-test delete modeldeployment launch-test

What it checks: dangerous pod-spec overrides under spec.provider.overrides are denied at
admission, both top-level and nested inside arrays.
Why it matters: the admission webhook blocks a denylist (securityContext, serviceAccountName,
serviceAccount, hostNetwork, hostPID, hostIPC, automountServiceAccountToken, nodeName,
priorityClassName, runtimeClassName), recursing through nested objects/arrays, plus a sizing guard
on replicas/resources. There must be no privilege-escalation path through overrides.
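
The recursion is what closes the nested-array path. A minimal Python sketch of such a walk is shown here; blocked_paths is a hypothetical helper, the denylist is copied from the paragraph above, and the sizing guard on replicas/resources is left out.

```python
from typing import Any, List

DENYLIST = {
    "securityContext", "serviceAccountName", "serviceAccount", "hostNetwork", "hostPID",
    "hostIPC", "automountServiceAccountToken", "nodeName", "priorityClassName",
    "runtimeClassName",
}


def blocked_paths(node: Any, path: str = "overrides") -> List[str]:
    """Walk dicts and lists depth-first and collect the path of every denied key."""
    hits: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key in DENYLIST:
                hits.append(child)  # report the key itself; no need to look inside it
            else:
                hits.extend(blocked_paths(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(blocked_paths(item, f"{path}[{index}]"))
    return hits
```

A check that only inspected the top-level pod spec would miss containers[0].securityContext; descending into lists is what catches it.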
# Top-level securityContext + hostNetwork — must be DENIED:
cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: override-test
  namespace: vllm-test
spec:
  provider:
    name: vllm
    overrides:
      spec:
        template:
          spec:
            securityContext:
              runAsNonRoot: false
            hostNetwork: true
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine: { type: vllm }
  resources: { gpu: { count: 1 } }
EOF
kubectl -n vllm-test get modeldeployment,deploy override-test 2>&1 # expect NotFound for both
# Nested-array path — containers[*].securityContext must ALSO be caught:
cat <<EOF | kubectl apply -f - 2>&1 | tail -1
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: override-nested, namespace: vllm-test }
spec:
  provider:
    name: vllm
    overrides:
      spec:
        template:
          spec:
            containers:
              - name: vllm
                securityContext: { privileged: true }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine: { type: vllm }
  resources: { gpu: { count: 1 } }
EOF

- PASS: both applies fail with Forbidden naming the blocked key(s) (hostNetwork, securityContext); get returns NotFound for both the MD and the Deployment; the nested case reports overriding "securityContext" is not allowed for security reasons.
- FAIL: either object is accepted, or a Deployment is created with the escalated pod spec.

What it checks: if a Deployment with the target name already exists and is not owned by the ModelDeployment, the provider refuses to take it over and surfaces the conflict instead of hijacking it.
Why it matters: the provider must never adopt or overwrite a resource it doesn't own. The discriminator is the Deployment UID — the provider applies by name via server-side apply and never deletes-and-recreates, so a protected foreign object keeps its same UID and original image.
Fixture gotcha (important): the foreign Deployment must have no ownerReferences. An owner ref to a non-existent object is deleted by Kubernetes garbage collection within ~2s, before the provider runs, which masks the guard and makes a fresh, legitimately-owned Deployment look like a hijack. Use an unowned object so it survives GC and the guard is actually exercised.
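
The guard itself amounts to one ownership check before applying. The Python sketch below shows the idea on plain dictionaries shaped like Kubernetes objects; may_manage is a hypothetical name, and matching on the owner reference UID is an assumption about how the provider decides ownership.

```python
from typing import Dict, List, Optional


def may_manage(existing: Optional[Dict], owner_uid: str) -> bool:
    """True when the Deployment is absent or already owned by this ModelDeployment."""
    if existing is None:
        return True  # nothing there yet, so creating it is safe
    refs: List[Dict] = existing.get("metadata", {}).get("ownerReferences") or []
    # A foreign object has no owner reference pointing at this ModelDeployment's UID.
    return any(ref.get("uid") == owner_uid for ref in refs)
```

When this returns False the expected outcome is the ResourceConflict condition checked below, with the existing object left untouched.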
# Pre-create a FOREIGN Deployment (no ownerReferences) with the name the MD will want.
# Capture its UID so we can prove it is untouched.
FOREIGN_UID=$(cat <<'EOF' | kubectl apply -f - -o jsonpath='{.metadata.uid}'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: owned-test
  namespace: vllm-test
spec:
  replicas: 1
  selector: { matchLabels: { app: foreign } }
  template:
    metadata: { labels: { app: foreign } }
    spec: { containers: [ { name: pause, image: registry.k8s.io/pause:3.9 } ] }
EOF
)
echo "FOREIGN_UID=$FOREIGN_UID"
cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: owned-test, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine: { type: vllm }
  resources: { gpu: { count: 1 } }
EOF
sleep 8
kubectl -n vllm-test get modeldeployment owned-test -o jsonpath='{.status.phase}{"\n"}'
kubectl -n vllm-test get modeldeployment owned-test \
-o jsonpath='{range .status.conditions[?(@.type=="ResourceCreated")]}{.status} {.reason}{"\n"}{end}'
kubectl -n vllm-test get modeldeployment owned-test -o jsonpath='{.status.message}{"\n"}'
# Prove the foreign object is the SAME, untouched (same UID, still the pause image):
echo "FINAL_UID=$(kubectl -n vllm-test get deploy owned-test -o jsonpath='{.metadata.uid}')"
kubectl -n vllm-test get deploy owned-test \
-o jsonpath='owners={range .metadata.ownerReferences[*]}[{.kind}/{.name}]{end} image={.spec.template.spec.containers[0].image}{"\n"}'
kubectl -n vllm-test delete modeldeployment,deploy owned-test --ignore-not-found

- PASS: MD phase Failed; ResourceCreated=False ResourceConflict; message ...exists but is not managed by this ModelDeployment (no owner references); the foreign Deployment has FINAL_UID == FOREIGN_UID, no owners, and still registry.k8s.io/pause:3.9.
- FAIL: the foreign Deployment's image/owner changes (hijacked), or FINAL_UID != FOREIGN_UID (silently recreated).

What it checks: when an owned Deployment is stuck Terminating, the ModelDeployment's
FinalizerTimeout (5 min) still fires so the MD doesn't requeue forever.
Why it matters: if the timeout check is gated behind a delete error that never occurs for a stuck-terminating object, the MD can loop indefinitely. (Open finding — confirm whether the timeout eventually forces removal or the MD is stuck.)
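
The suspected failure mode is easiest to see side by side. The two Python functions below are hypothetical reductions of the decision, not the controller's code: one gates the timeout on a delete error, the other looks at elapsed time only.

```python
from datetime import timedelta
from typing import Optional

FINALIZER_TIMEOUT = timedelta(minutes=5)  # value stated in this document


def gated_check(elapsed: timedelta, delete_error: Optional[Exception]) -> bool:
    """Shape of the suspected bug: the timeout is only consulted when delete failed."""
    return delete_error is not None and elapsed >= FINALIZER_TIMEOUT


def ungated_check(elapsed: timedelta) -> bool:
    """Timeout decided on elapsed time alone, so a stuck-terminating child cannot hide it."""
    return elapsed >= FINALIZER_TIMEOUT


# A Deployment blocked by its own finalizer accepts the delete call without error,
# so delete_error stays None while the object never goes away.
print(gated_check(timedelta(minutes=6), None), ungated_check(timedelta(minutes=6)))
```

In the stuck-terminating case the gated variant never fires, which matches the FAIL outcome of this test.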
cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: final-test
  namespace: vllm-test
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine: { type: vllm }
  resources: { gpu: { count: 1 } }
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.provider.name}'=vllm \
modeldeployment/final-test --timeout=60s
# Make the owned Deployment un-deletable, then delete the MD:
kubectl -n vllm-test patch deploy final-test --type=merge \
-p '{"metadata":{"finalizers":["example.com/block-delete"]}}'
kubectl -n vllm-test delete modeldeployment final-test --wait=false
kubectl -n vllm-test get modeldeployment final-test -o jsonpath='{.status.phase}{"\n"}' # Terminating
sleep 330 # > the 5-min FinalizerTimeout
kubectl -n vllm-test get modeldeployment final-test 2>&1
# Cleanup:
kubectl -n vllm-test patch deploy final-test --type=merge -p '{"metadata":{"finalizers":[]}}'

- PASS (fixed): after the timeout the MD is forcibly removed (NotFound).
- FAIL (finding present): the MD is still Terminating / still present after 5+ minutes.

What it checks: (11a) traversal/SSRF-style paths are rejected before any upstream fetch; (11b) the
5 MiB response cap actually bounds a chunked (no Content-Length) body; (11c) trailing-slash recipe
routes resolve correctly.
Why it matters: the recipe client proxies an external origin. Path-traversal must not reach a fetch, and a chunked response must not be fully buffered into memory before the size check (an OOM vector). Trailing-slash handling affects every recipe route.
Runtime gotchas (read before running):
- The backend listens on PORT || 3001, not 3000; hitting 3000 gives a connection refused.
- VLLM_RECIPES_BASE_URL is read once at construction, so you must start the backend pointed at the bad origin; you can't redirect a running one.
- The OOM probe must sample the backend's RSS, not curl's.
- Always kill a stale backend on 3001 first and use a readiness gate, or you'll silently test an old process (a classic "the fix didn't work" false alarm).

# Kill any stale backend on 3001 first, then start fresh and wait until it's listening.
OLD=$(lsof -tiTCP:3001 -sTCP:LISTEN 2>/dev/null); [ -n "$OLD" ] && kill $OLD 2>/dev/null
for i in $(seq 1 10); do lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 || break; sleep 1; done
( cd backend && bun run src/index.ts >/tmp/airunway-backend.log 2>&1 & )
for i in $(seq 1 30); do
lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 && { echo "backend up after ${i}s"; break; }
sleep 1
done
# --path-as-is stops curl from collapsing ../ client-side, so the server sees the raw path:
curl -s --path-as-is -o /dev/null -w '%{http_code}\n' 'http://localhost:3001/api/vllm/recipes/foo/../../etc/passwd'
curl -s -o /dev/null -w '%{http_code}\n' 'http://localhost:3001/api/vllm/recipes/foo/bar%2Fbaz'
curl -s -o /dev/null -w '%{http_code}\n' 'http://localhost:3001/api/vllm/recipes/onlyoneseg'

- PASS: 404 (escaped path), 400 (encoded slash inside a segment), 404 (single segment), all non-2xx; the traversal never reaches an upstream fetch.
- FAIL: any 2xx, or evidence the request reached an upstream fetch.

The cap is implemented in readBoundedBody (backend/src/services/vllmRecipesClient.ts): it streams
the body with getReader() and calls controller.abort() the instant the running byte total exceeds
5 MiB, so a chunked / no-Content-Length reply can't be fully buffered.
Measure it correctly — vary the body size; don't trust a single absolute delta. A cold backend's first request is dominated by Bun JIT/allocator warmup (~60-80 MB), which swamps the response body and makes one cold reading meaningless. The real discriminator is whether memory stays flat as the upstream body grows: a bounded reader aborts at 5 MiB regardless of body size, so an 8 MiB and a 128 MiB body produce the same delta; a buffering reader would scale ~linearly with the body.
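
For readers who want the bounded-read idea in isolation, here is a Python analogue. It is not the TypeScript readBoundedBody implementation; read_bounded and BodyTooLarge are hypothetical names, and raising an exception stands in for controller.abort().

```python
from typing import Iterable, List

MAX_BYTES = 5 * 1024 * 1024  # the 5 MiB cap named in this document


class BodyTooLarge(Exception):
    """Raised as soon as the running total passes the cap."""


def read_bounded(chunks: Iterable[bytes], limit: int = MAX_BYTES) -> bytes:
    """Accumulate chunks, stopping the moment the running byte total exceeds the limit."""
    total = 0
    parts: List[bytes] = []
    for chunk in chunks:
        total += len(chunk)
        if total > limit:
            # Stop pulling from the stream here; the rest of the body is never read.
            raise BodyTooLarge(f"response exceeded {limit} bytes")
        parts.append(chunk)
    return b"".join(parts)
```

Because the check runs per chunk, memory held by the reader is bounded by the limit plus one chunk, whatever the origin tries to send; that is the flat-delta behavior the measurement below looks for.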
# Origin that serves a configurable-size, NO-Content-Length JSON body (size in MiB = last URL segment):
python3 - <<'PY' >/tmp/badorigin.log 2>&1 &
from http.server import BaseHTTPRequestHandler, HTTPServer
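class H(BaseHTTPRequestHandler):
    def do_GET(self):
        # Body size in MiB is the last URL segment; fall back to 8 if it isn't a number.
        try: mib = int(self.path.strip('/').split('/')[-1])
        except Exception: mib = 8
        # No Content-Length header is sent, so the client cannot size the body up front.
        self.send_response(200); self.send_header('Content-Type','application/json'); self.end_headers()
        self.wfile.write(b'{"models":[')
        chunk = b'"xxxxxxx",' * 13107   # 10 bytes x 13107 = ~128 KiB
        try:
            for _ in range(mib*8): self.wfile.write(chunk)   # 8 x 128 KiB = 1 MiB
            self.wfile.write(b'"x"]}')
        except (BrokenPipeError, ConnectionResetError): pass  # client aborted at the cap
    def log_message(self,*a): pass
HTTPServer(('127.0.0.1',9998),H).serve_forever()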
PY
ORIGIN=$!; sleep 1
# Boot the backend AGAINST the bad origin (env read once at construction). Kill any stale backend first.
OLD=$(lsof -tiTCP:3001 -sTCP:LISTEN 2>/dev/null); [ -n "$OLD" ] && kill $OLD 2>/dev/null
for i in $(seq 1 10); do lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 || break; sleep 1; done
( cd backend && VLLM_RECIPES_BASE_URL=http://127.0.0.1:9998 bun run src/index.ts >/tmp/airunway-backend.log 2>&1 & )
for i in $(seq 1 30); do lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 && break; sleep 1; done
BPID=$(lsof -tiTCP:3001 -sTCP:LISTEN -n -P | head -1)
rss(){ ps -o rss= -p "$BPID" | tr -d ' '; }
hit(){ curl -s -o /dev/null -w '%{http_code}' "http://localhost:3001/api/vllm/recipes/foo/$1"; }
echo "warmup http=$(hit 8)"; sleep 1 # absorb one-time JIT/allocator cost
G0=$(rss); echo "baseline RSS (KB): $G0"
for SZ in 8 32 128; do
c=$(hit $SZ); sleep 1; r=$(rss)
printf 'body=%4s MiB http=%s delta_from_baseline=%s KB\n' "$SZ" "$c" "$((r-G0))"
done
kill "$ORIGIN" "$BPID" 2>/dev/null

- PASS (cap works): every request returns http=502, and the delta is flat across 8/32/128 MiB (e.g. ~26 MB each; a 16× larger body uses the same memory). The bounded reader aborts at 5 MiB regardless of how much the origin streams.
- FAIL (finding present): the delta scales with body size (128 MiB body ≈ 16× the 8 MiB delta), meaning the whole body is being buffered before the size check.

Note on a single cold reading: hitting a freshly-started backend once and seeing a ~70-80 MB delta is not evidence of a leak — that's JIT warmup, and it looks the same whether the body is bounded or not. Only the body-size-vs-delta slope on a warm process distinguishes the two. (The earlier "~71 MB proves it's buffered" reading was this warmup artifact.)
# Kill any stale backend on 3001 first, then start fresh and wait until it's listening.
OLD=$(lsof -tiTCP:3001 -sTCP:LISTEN 2>/dev/null); [ -n "$OLD" ] && kill $OLD 2>/dev/null
for i in $(seq 1 10); do lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 || break; sleep 1; done
( cd backend && bun run src/index.ts >/tmp/airunway-backend.log 2>&1 & )
for i in $(seq 1 30); do
lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 && { echo "backend up after ${i}s"; break; }
sleep 1
done
# Backend on 3001 as in 11a (remember the kill-stale-listener step).
curl -s -o /dev/null -w '%{http_code} GET /api/vllm/recipes\n' 'http://localhost:3001/api/vllm/recipes'
curl -s -o /dev/null -w '%{http_code} GET /api/vllm/recipes/ (trailing)\n' 'http://localhost:3001/api/vllm/recipes/'
curl -s -o /dev/null -w '%{http_code} GET .../microsoft/Phi-4-mini-instruct\n' 'http://localhost:3001/api/vllm/recipes/microsoft/Phi-4-mini-instruct'
curl -s -o /dev/null -w '%{http_code} GET .../Phi-4-mini-instruct/ (trailing)\n' 'http://localhost:3001/api/vllm/recipes/microsoft/Phi-4-mini-instruct/'
# Confirm the redirect target and that following it lands on the real route:
curl -s -o /dev/null -w 'status=%{http_code} location=%header{location}\n' 'http://localhost:3001/api/vllm/recipes/'
curl -sL -o /dev/null -w 'final=%{http_code} redirects=%{num_redirects}\n' 'http://localhost:3001/api/vllm/recipes/'

- PASS: no-slash routes match (2xx, or 5xx if upstream is unreachable, but not 404); a trailing-slash GET returns 301 redirecting to the no-slash path; curl -L follows to the real route. (POST .../resolve/ is intentionally left at 404; only GET/HEAD are redirected.)
- FAIL: a trailing-slash GET returns 404 (pre-fix behavior).

Also covered by backend/src/routes/vllmRecipes.test.ts; run cd backend && bun test src/routes/vllmRecipes.test.ts.

What it checks: the create endpoint accepts engineArgs/engineExtraArgs/env as records
and converts them, and writes the recipe provenance annotations the UI "Apply recipe" produces.
Why it matters: this is the exact materialization the UI emits. env must land as the array wire
form (not a map), and the six provenance annotations (airunway.ai/generated-by plus five
airunway.ai/recipe.* keys) must be present.
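
The conversion being tested is small enough to sketch. The Python below is an illustration only: env_to_wire and provenance_annotations are hypothetical names, the annotation keys are the ones this test checks, and skipping blank fields is the trim-and-skip behavior described for the UI variant.

```python
import json
from typing import Dict, List


def env_to_wire(env: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn the request's env record into the name/value array stored in spec.env."""
    return [{"name": name, "value": value} for name, value in env.items()]


def provenance_annotations(prov: Dict[str, object]) -> Dict[str, str]:
    """Build the provenance annotations, omitting any field that is blank after trimming."""
    out = {"airunway.ai/generated-by": "vllm-recipe-resolver"}
    for field in ("source", "id", "strategy", "hardware"):
        value = str(prov.get(field) or "").strip()
        if value:  # trim-and-skip: blank fields produce no annotation
            out[f"airunway.ai/recipe.{field}"] = value
    features = prov.get("features") or []
    if features:
        # Lists are stored as a compact JSON string, e.g. ["tool_calling"].
        out["airunway.ai/recipe.features"] = json.dumps(features, separators=(",", ":"))
    return out
```

Feeding it the recipeProvenance object from the request below yields six keys, matching the PASS list.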
# Backend listens on PORT || 3001 (see step 11).
curl -s -X POST http://localhost:3001/api/deployments \
-H 'Content-Type: application/json' \
-d '{
"name": "recipe-test",
"namespace": "vllm-test",
"modelId": "microsoft/Phi-4-mini-instruct",
"engine": "vllm",
"provider": "vllm",
"mode": "aggregated",
"imageRef": "vllm/vllm-openai:cu130-nightly",
"resources": { "gpu": 1 },
"engineArgs": { "tensor-parallel-size": "1" },
"engineExtraArgs": ["--enable-chunked-prefill"],
"env": { "VLLM_USE_V1": "1", "NCCL_DEBUG": "INFO" },
"recipeProvenance": {
"source": "recipes.vllm.ai", "id": "microsoft/Phi-4-mini-instruct",
"strategy": "latency", "hardware": "a100", "features": ["tool_calling"]
}
}' | python3 -m json.tool
# The resolver/controller write annotations ~1s after create — wait for the provenance marker first:
kubectl -n vllm-test wait \
--for=jsonpath='{.metadata.annotations.airunway\.ai/generated-by}'=vllm-recipe-resolver \
modeldeployment/recipe-test --timeout=30s
# env must be the ARRAY form:
kubectl -n vllm-test get modeldeployment recipe-test -o jsonpath='{.spec.env}{"\n"}'
# Provenance annotations (jq is robust on the JSON blob):
kubectl -n vllm-test get modeldeployment recipe-test -o json \
  | jq '.metadata.annotations | with_entries(select(.key | test("airunway\\.ai/(recipe\\.|generated-by)")))'

- PASS: spec.env is [{"name":"VLLM_USE_V1","value":"1"},{"name":"NCCL_DEBUG","value":"INFO"}] (array, not a map); the jq output shows exactly these six keys:
  - airunway.ai/generated-by=vllm-recipe-resolver
  - airunway.ai/recipe.source=recipes.vllm.ai
  - airunway.ai/recipe.id=microsoft/Phi-4-mini-instruct
  - airunway.ai/recipe.strategy=latency
  - airunway.ai/recipe.hardware=a100
  - airunway.ai/recipe.features=["tool_calling"]
- FAIL: env is stored as a map, or any provenance annotation is missing.

UI variant: open the Deploy page, pick microsoft/Phi-4-mini-instruct, select Direct vLLM, wait for "Official vLLM recipe found", click Apply recipe. The YAML preview should show the eight airunway.ai/recipe.* annotations and no spec.recipe field. Blank out hardware and re-apply → that annotation disappears (trim-and-skip).

What it checks: killing the provider pod leaves managed resources intact and the provider re-reconciles cleanly.
Why it matters: server-side apply is idempotent, so a provider restart must not disrupt running deployments, and the provider must re-register its heartbeat.
# With any MD Running:
kubectl -n airunway-system delete pod -l control-plane=vllm-provider
kubectl -n airunway-system rollout status deploy/airunway-vllm-provider --timeout=120s
kubectl -n vllm-test get deploy,svc pin-test
kubectl get inferenceproviderconfig vllm -o jsonpath='{.status.lastHeartbeat}{"\n"}'

- PASS: the managed Deployment/Service are unchanged; the provider rolls out and re-registers with a fresh lastHeartbeat.
- FAIL: managed resources are recreated/disrupted, or the provider fails to re-register.

kubectl delete namespace vllm-test