@surajssd
Created June 17, 2026 23:57

Manual Testing Plan — Direct vLLM Provider

Copy-paste ready steps to validate the Direct vLLM provider end-to-end. Every example hard-codes the namespace vllm-test; to use a different namespace, replace vllm-test throughout, not only in Setup.

Field names match controller/api/v1alpha1/modeldeployment_types.go, the sample controller/config/samples/airunway_v1alpha1_modeldeployment.yaml, and docs/providers/vllm.md. The provider default image is vllm/vllm-openai:cu130-nightly; the server, Service, and probes are hard-wired to 0.0.0.0:8000.

Each test below states what it checks, why it matters, the steps, and what a PASS vs FAIL looks like.


Setup (once)

# --- Install KAITO & Dynamo as fallback providers ---
cd airunway
make -C providers/kaito setup-kaito
make -C providers/dynamo setup-dynamo

# --- Build, push and deploy AI Runway ---
export REGISTRY="quay.io/surajd"
TAG=$(git describe --tags --always)-$(date +%Y-%m-%d-%H-%M-%S)
export TAG
export PUSH=true

# Controller
make controller-docker-build CONTROLLER_IMG="${REGISTRY}/kubeairunway-controller:${TAG}"
pushd controller && make deploy IMG="${REGISTRY}/kubeairunway-controller:${TAG}" && popd

# vLLM provider
pushd providers/vllm
export IMG="${REGISTRY}/kubeairunway-vllm-provider:${TAG}"
make docker-build && make deploy
popd

# Dynamo provider
pushd providers/dynamo
export IMG="${REGISTRY}/kubeairunway-dynamo-provider:${TAG}"
make docker-build && make deploy
popd

# KAITO provider
pushd providers/kaito
export IMG="${REGISTRY}/kubeairunway-kaito-provider:${TAG}"
make docker-build && make deploy
popd

# Test namespace + (optional) HF token secret for gated models
kubectl create namespace vllm-test
kubectl -n vllm-test create secret generic vllm-hf-token \
  --from-literal=HF_TOKEN="${HF_TOKEN:-replace-me}"

Verify provider self-registration — the vLLM provider is explicit-only, so it registers with no selection rules and must never be auto-selected:

kubectl get inferenceproviderconfig vllm \
  -o jsonpath='{.status.ready}{"\n"}{.status.version}{"\n"}{.spec.selectionRules}{"\n"}'
  • PASS: true / vllm-provider:v0.1.0 / empty (no selectionRules).
  • FAIL: ready is false, or selectionRules is non-empty (the provider would be eligible for auto-selection, which it should never be).

1 — Explicit-only provider selection

What it checks: auto-selection never picks vLLM; explicit provider.name: vllm works; and the chosen provider is immutable once recorded.

Why it matters: vLLM has no selection rules, so a managed provider must win auto-selection. Once a provider is recorded in status.provider.name, spec.provider.name is immutable — the admission webhook rejects any in-place change synchronously, so switching providers requires delete-and-recreate, not a patch.
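The two rules this test exercises can be sketched in a few lines of Python. This is an illustrative model only, not the controller's code: the function names are hypothetical, and "pick the first eligible provider" stands in for whatever rule evaluation the real auto-selection performs.

```python
from typing import Dict, List, Optional


def auto_select(providers: Dict[str, List[str]]) -> Optional[str]:
    """Pick a provider for a ModelDeployment that names none.

    `providers` maps provider name -> its selection rules (placeholder strings here).
    A provider with no selection rules is explicit-only and is never eligible.
    Assumption: the first eligible provider wins; real rule evaluation is richer.
    """
    eligible = [name for name, rules in providers.items() if rules]
    return eligible[0] if eligible else None


def validate_provider_change(recorded: Optional[str], requested: Optional[str]) -> None:
    """Reject an in-place switch once status.provider.name has been recorded.

    recorded  -- status.provider.name (None until a provider is selected)
    requested -- spec.provider.name on the incoming object (None if unset)
    """
    if recorded and requested and requested != recorded:
        raise ValueError(
            f'spec.provider.name: Invalid value: "{requested}": provider.name is immutable'
        )
```

Under this model, 1a corresponds to auto_select skipping vllm, and 1c corresponds to validate_provider_change("dynamo", "vllm") raising.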

1a — auto-selection must NOT pick vllm

cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: sel-auto
  namespace: vllm-test
spec:
  model:
    id: Qwen/Qwen2.5-0.5B-Instruct
    source: huggingface
  engine:
    type: vllm
  resources:
    gpu:
      count: 1
EOF
kubectl -n vllm-test get modeldeployment sel-auto -o jsonpath='{.status.provider.name}{"\n"}'
  • PASS: a managed provider (dynamo/kuberay), not vllm.
  • FAIL: vllm — auto-selection wrongly chose the explicit-only provider.

1b — explicit provider.name: vllm (the supported path)

cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: sel-vllm
  namespace: vllm-test
spec:
  provider:
    name: vllm
  model:
    id: Qwen/Qwen2.5-0.5B-Instruct
    source: huggingface
  engine:
    type: vllm
  resources:
    gpu:
      count: 1
EOF
kubectl -n vllm-test get modeldeployment sel-vllm -o jsonpath='{.status.provider.name}{"\n"}'
kubectl -n vllm-test get modeldeployment sel-vllm -o jsonpath='{.metadata.finalizers}{"\n"}'
kubectl -n vllm-test get deploy,svc sel-vllm
  • PASS: provider.name = vllm; finalizer ["airunway.ai/vllm-provider"]; a Deployment and a Service named after the ModelDeployment (MD) both exist.
  • FAIL: provider not selected, finalizer missing, or Deployment/Service absent.

1c — in-place provider switch is rejected

kubectl -n vllm-test patch modeldeployment sel-auto --type=merge \
  -p '{"spec":{"provider":{"name":"vllm"}}}'
kubectl -n vllm-test get modeldeployment sel-auto \
  -o jsonpath='{.status.phase}{"\n"}{range .status.conditions[?(@.type=="ProviderSelected")]}{.reason}: {.message}{"\n"}{end}'
kubectl -n vllm-test get modeldeployment sel-auto -o jsonpath='{.status.provider.name}{"\n"}'
  • PASS: the patch is denied synchronously by the admission webhook — Error from server (Forbidden): ... spec.provider.name: Invalid value: "vllm": provider.name is immutable (changing it requires delete and recreate). Because the patch was rejected, the live object is unchanged: phase stays Deploying, the ProviderSelected condition still reads AutoSelected: Provider dynamo auto-selected, and status.provider.name is still the original provider (dynamo).
  • FAIL: the patch is accepted, or status.provider.name switches to vllm.

Cleanup

kubectl -n vllm-test delete modeldeployment sel-auto
kubectl -n vllm-test delete modeldeployment sel-vllm

2 — Default image resolution + digest pinning

What it checks: with no engine.image, the provider resolves the default nightly tag to a digest and pins it — repeated reconciles never re-resolve.

Why it matters: this is a deliberate reproducibility trade-off (pin the first-resolved digest forever) rather than tracking the moving cu130-nightly tag. The test confirms the trade-off is acceptable and that the only refresh path is changing spec.engine.image.

cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: pin-test
  namespace: vllm-test
spec:
  provider:
    name: vllm
  model:
    id: Qwen/Qwen2.5-0.5B-Instruct
    source: huggingface
  engine:
    type: vllm
  resources:
    gpu:
      count: 1
EOF
sleep 10
kubectl -n vllm-test get modeldeployment pin-test -o jsonpath='{.status.image}{"\n"}' | python3 -m json.tool
DIGEST1=$(kubectl -n vllm-test get modeldeployment pin-test -o jsonpath='{.status.image.digest}')

# Force several reconciles and confirm the digest never changes:
for i in 1 2 3; do
  kubectl -n vllm-test annotate modeldeployment pin-test airunway.ai/touch="$(date +%s%N)" --overwrite
  sleep 35   # > the 30s RequeueInterval
done
DIGEST2=$(kubectl -n vllm-test get modeldeployment pin-test -o jsonpath='{.status.image.digest}')
echo "before=$DIGEST1"; echo "after =$DIGEST2"
  • PASS: status.image shows source=nightly, inNightly=true, a resolved digest, and requested=vllm/vllm-openai:cu130-nightly; before and after digests are EQUAL (pinned, not re-pulled).
  • FAIL: the digest changes between reconciles (the moving tag is being re-pulled, contradicting the documented pinning behavior).
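The pinning behavior under test amounts to "resolve once, then reuse until the requested reference changes". A minimal Python sketch, assuming a status dictionary with the requested and digest fields shown in status.image and a hypothetical resolver callback that performs the registry lookup:

```python
from typing import Callable, Dict


def resolve_image(status: Dict[str, str], requested: str,
                  resolver: Callable[[str], str]) -> Dict[str, str]:
    """Return the image status for this reconcile.

    The digest is resolved only when nothing is pinned yet or when the
    requested reference changed; otherwise the pinned status is reused,
    so a moving tag is never looked up a second time.
    """
    if status.get("digest") and status.get("requested") == requested:
        return status  # pinned: reuse the first-resolved digest
    return {"requested": requested, "digest": resolver(requested)}
```

With this model, the three forced reconciles in the loop above would call the resolver once in total, and only a change to the requested image would trigger a second lookup.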

3 — Default-image outage gate

What it checks: the default image is a hard gate — if its digest can't be resolved, no Deployment is created. A user-supplied image that fails to resolve is non-fatal by contrast.

Why it matters: the default nightly tag must resolve to a digest before any pod runs (for reproducibility), whereas a user explicitly choosing a tag is trusted to proceed even if resolution fails.

cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: outage-test
  namespace: vllm-test
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine: { type: vllm }
  resources: { gpu: { count: 1 } }
EOF

# Image conditions are written by the vLLM PROVIDER controller ~1s after the core controller
# stamps provider.name. Querying immediately races that gap and returns EMPTY. Wait first:
kubectl -n vllm-test wait --for=jsonpath='{.status.image.source}'=nightly \
  modeldeployment/outage-test --timeout=60s

(A) Healthy cluster (registry reachable) — the normal expectation:

kubectl -n vllm-test get modeldeployment outage-test \
  -o jsonpath='{range .status.conditions[?(@.type=="ImageResolved")]}{.status} {.reason}{"\n"}{end}'
kubectl -n vllm-test get modeldeployment outage-test \
  -o jsonpath='{.status.image.source}={.status.image.resolved}{"\n"}'
  • PASS: ImageResolved is True (ImageResolutionReused, or ImageResolved on first resolve); source=nightly with a digest-pinned resolved value.

(B) True outage (optional) — block the provider's egress, then create a fresh default-image MD:

# kubectl -n airunway-system apply -f - <<'NP'   # deny-all egress for the provider pod
# apiVersion: networking.k8s.io/v1
# kind: NetworkPolicy
# metadata: { name: block-vllm-egress, namespace: airunway-system }
# spec:
#   podSelector: { matchLabels: { control-plane: vllm-provider } }
#   policyTypes: [Egress]
#   egress: []
# NP
  • PASS (during outage): ImageResolved=False (ImageResolutionFailed) and kubectl -n vllm-test get deploy <name> returns NotFound (no pod until the digest resolves).
  • FAIL: a Deployment is created despite the default image failing to resolve.

Recovery / non-fatal user image — a user-set engine.image that fails to resolve keeps the tag and proceeds:

kubectl -n vllm-test patch modeldeployment outage-test --type=merge \
  -p '{"spec":{"engine":{"image":"vllm/vllm-openai:latest"}}}'
  • PASS: the deployment proceeds with the user tag (unlike the default-image hard gate).

Cleanup

kubectl -n vllm-test delete modeldeployment outage-test

4 — Conflicting image fields

What it checks: setting both the legacy spec.image and the preferred spec.engine.image to different values is rejected.

Why it matters: two competing image sources are ambiguous; the validating webhook should refuse rather than silently pick one.

cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: img-conflict
  namespace: vllm-test
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  image: ubuntu:latest                       # legacy field
  engine:
    type: vllm
    image: vllm/vllm-openai:latest           # preferred field — conflicts
  resources: { gpu: { count: 1 } }
EOF
# If the webhook is disabled, the object persists; check it parks at Failed:
kubectl -n vllm-test get modeldeployment img-conflict \
  -o jsonpath='{.status.phase}{"\n"}{.status.image.requested}{"\n"}'
  • PASS: the webhook rejects the apply with a message naming both spec.image and spec.engine.image (or, if the webhook is off, reconcile parks at Failed).
  • FAIL: the object is accepted and a deployment proceeds with one image silently chosen.

5 — Reserved host/port rejection

What it checks: host/port are provider-generated (container, Service, and probes hard-wired to 0.0.0.0:8000), so any attempt to set them — in all four spellings — must be rejected.

Why it matters: a user-set host/port would desync the container from the Service/probes and break routing. Rejection is asynchronous: the transformer fails the reconcile and sets phase: Failed ~1s after apply, so each block waits for that.

The expected failure message for every block is ...conflicts with Direct vLLM generated networking....

  • PASS (all four): phase: Failed with the networking-conflict message.
  • FAIL: the wait times out (not rejected) or the MD renders a Deployment.
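To make the "four spellings" concrete, here is a small Python sketch of a detector that catches all of them. It is a hypothetical model of the check, not the provider's transformer; the function name and the tolerance for a leading "--" on map keys are assumptions.

```python
from typing import Dict, List

RESERVED = {"host", "port"}  # provider-generated, never user-settable


def reserved_flags(args: Dict[str, str], extra_args: List[str]) -> List[str]:
    """Return every reserved key found in engine.args or engine.extraArgs."""
    # Map form (5a): the key itself is the flag name.
    found = [key.lstrip("-") for key in args if key.lstrip("-") in RESERVED]
    for token in extra_args:
        if not token.startswith("--"):
            continue  # a value token such as "9000" in the two-token form (5c)
        # "--port=9000" (5b, 5d) and "--port" (5c) both reduce to "port".
        key = token[2:].split("=", 1)[0]
        if key in RESERVED:
            found.append(key)
    return found
```

Any non-empty result would correspond to the networking-conflict failure that blocks 5a through 5d wait for.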

5a — reserved key in engine.args map form

cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: reserved-a, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  resources: { gpu: { count: 1 } }
  engine:
    type: vllm
    args:
      port: "9000"
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.phase}'=Failed \
  modeldeployment/reserved-a --timeout=60s
kubectl -n vllm-test get modeldeployment reserved-a -o jsonpath='{.status.phase}: {.status.message}{"\n"}'
kubectl -n vllm-test delete modeldeployment reserved-a --ignore-not-found

5b — --host=value in extraArgs

cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: reserved-b, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  resources: { gpu: { count: 1 } }
  engine:
    type: vllm
    extraArgs: ["--host=10.0.0.1"]
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.phase}'=Failed \
  modeldeployment/reserved-b --timeout=60s
kubectl -n vllm-test get modeldeployment reserved-b -o jsonpath='{.status.phase}: {.status.message}{"\n"}'
kubectl -n vllm-test delete modeldeployment reserved-b --ignore-not-found

5c — --port as a two-token flag in extraArgs

cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: reserved-c, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  resources: { gpu: { count: 1 } }
  engine:
    type: vllm
    extraArgs: ["--port", "9000"]
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.phase}'=Failed \
  modeldeployment/reserved-c --timeout=60s
kubectl -n vllm-test get modeldeployment reserved-c -o jsonpath='{.status.phase}: {.status.message}{"\n"}'
kubectl -n vllm-test delete modeldeployment reserved-c --ignore-not-found

5d — --port=value in extraArgs

cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: reserved-d, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  resources: { gpu: { count: 1 } }
  engine:
    type: vllm
    extraArgs: ["--port=9000"]
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.phase}'=Failed \
  modeldeployment/reserved-d --timeout=60s
kubectl -n vllm-test get modeldeployment reserved-d -o jsonpath='{.status.phase}: {.status.message}{"\n"}'
kubectl -n vllm-test delete modeldeployment reserved-d --ignore-not-found

6 — TP dedup + extraArgs conflict rejection

What it checks: (6a) an explicit engine.args flag wins over the auto-derived flag — no duplicate is rendered; (6b) setting the same key in both args and extraArgs is rejected at admission.

Why it matters: a duplicate flag would be silently resolved by vLLM's last-wins argparse, defeating the user's explicit value. The provider both dedups (derived flags suppressed when the user set the key) and rejects genuine conflicts (same key in two places) synchronously at kubectl time.

6a inspects the rendered Deployment manifest, which is populated regardless of whether the pod can schedule — so it works even when GPU nodes can't satisfy gpu.count: 2. (On single-GPU nodes the pod sits Pending; drop gpu.count to 1 and tensor-parallel-size to "1" if you also want it to run.)
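The dedup and conflict rules can be modeled as follows. This is a sketch under stated assumptions rather than the provider's renderer: render_args and flag_keys are hypothetical names, flags are rendered as two tokens (which is what the grep -A 1 below expects), and the derived flag is emitted only when gpu.count is greater than 1.

```python
from typing import Dict, List


def flag_keys(extra_args: List[str]) -> List[str]:
    """Extract flag names from extraArgs in both --k=v and --k v forms."""
    return [t[2:].split("=", 1)[0] for t in extra_args if t.startswith("--")]


def render_args(args: Dict[str, str], extra_args: List[str], gpu_count: int) -> List[str]:
    """Render container args, suppressing derived flags the user already set."""
    clash = sorted(set(args) & set(flag_keys(extra_args)))
    if clash:
        # 6b: the same key in both places is a genuine conflict, so reject.
        raise ValueError(
            f'launch flag "{clash[0]}" is set in both '
            "spec.engine.args and spec.engine.extraArgs"
        )
    user_keys = set(args) | set(flag_keys(extra_args))
    rendered: List[str] = []
    # 6a: derive tensor-parallel-size from gpu.count only if the user did not set it.
    if gpu_count > 1 and "tensor-parallel-size" not in user_keys:
        rendered += ["--tensor-parallel-size", str(gpu_count)]
    for key, value in args.items():
        rendered += [f"--{key}", value]
    return rendered + list(extra_args)
```

Suppressing the derived flag, instead of appending the user's value after it, avoids relying on last-wins argument parsing.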

6a — explicit args wins over the derived flag

cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: tp-test
  namespace: vllm-test
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine:
    type: vllm
    args:
      tensor-parallel-size: "4"
  resources:
    gpu:
      count: 2
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.image.source}' \
  modeldeployment/tp-test --timeout=60s
kubectl -n vllm-test get deploy tp-test \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep -n -A 1 -- '--tensor-parallel-size'
  • PASS: exactly one --tensor-parallel-size, value 4 (explicit wins; the derived 2 from gpu.count is suppressed).
  • FAIL: two entries, or the derived 2 appears.

6b — same key in both args and extraArgs is rejected at admission

kubectl -n vllm-test patch modeldeployment tp-test --type=merge \
  -p '{"spec":{"engine":{"extraArgs":["--tensor-parallel-size=2"]}}}'
# Re-check the rendered Deployment — it must still have exactly one flag, value 4:
kubectl -n vllm-test get deploy tp-test \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep -n -A 1 -- '--tensor-parallel-size'
kubectl -n vllm-test delete modeldeployment tp-test --ignore-not-found
  • PASS: the patch is denied synchronously — Error from server (Forbidden): ... launch flag "tensor-parallel-size" is set in both spec.engine.args and spec.engine.extraArgs ... — and because the patch was rejected the live Deployment still shows one flag, value 4.
  • FAIL: the patch is accepted, or the Deployment renders a duplicate/2 value.

7 — launch-tag classification

What it checks: how the provider classifies the image source in status.image for a launch-style tag.

Why it matters: classification only labels a tag launch when it literally contains the substring "launch". Real launch tags like vllm/vllm-openai:deepseekv4-cu130 fall through to custom and flip UnsupportedImage=True, producing misleading provenance for the launch workflow the docs describe. (Open finding — confirm current behavior; if broadened/fixed, the first case should report launch.)

The four status.image.source values (directVLLMImageSource, evaluated top-to-bottom — first match wins):

| source | Rule (first match wins) | UnsupportedImage |
| --- | --- | --- |
| launch | tag contains the substring launch (case-insensitive), e.g. cu130-launch | False |
| nightly | image is exactly the provider default (vllm/vllm-openai:cu130-nightly), or repo is vllm/vllm-openai and the tag contains nightly | False |
| stable | repo is vllm/vllm-openai and the tag is exactly latest | False |
| custom | anything else (non-official repo, or an official repo with an unrecognized tag) | True (CustomImage) |

Only custom sets UnsupportedImage=True. The bug is the ordering/substring match: a genuine launch image whose tag doesn't contain the word launch (e.g. deepseekv4-cu130) misses the first rule, isn't nightly/latest either, and so lands in custom.
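The four rules translate directly into a first-match-wins function. The Python below is a sketch of the rules as described here, not a port of directVLLMImageSource; the image-splitting helper is an assumption and does no registry-prefix normalization.

```python
from typing import Tuple

DEFAULT_IMAGE = "vllm/vllm-openai:cu130-nightly"
OFFICIAL_REPO = "vllm/vllm-openai"


def split_image(image: str) -> Tuple[str, str]:
    """Split "repo:tag" into (repo, tag); a trailing @digest is ignored."""
    ref = image.split("@", 1)[0]
    slash, colon = ref.rfind("/"), ref.rfind(":")
    if colon > slash:  # a colon before the last slash is a registry port
        return ref[:colon], ref[colon + 1:]
    return ref, ""


def classify(image: str) -> Tuple[str, bool]:
    """Return (source, unsupported_image), evaluating the rules top to bottom."""
    repo, tag = split_image(image)
    if "launch" in tag.lower():
        return "launch", False
    if image == DEFAULT_IMAGE or (repo == OFFICIAL_REPO and "nightly" in tag):
        return "nightly", False
    if repo == OFFICIAL_REPO and tag == "latest":
        return "stable", False
    return "custom", True
```

Running classify on vllm/vllm-openai:deepseekv4-cu130 shows the finding: no rule matches, so the image falls through to custom with UnsupportedImage set.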

cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: launch-test, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: deepseek-ai/DeepSeek-V2-Lite, source: huggingface }
  engine: { type: vllm, image: "vllm/vllm-openai:deepseekv4-cu130" }
  resources: { gpu: { count: 1 } }
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.image.source}' \
  modeldeployment/launch-test --timeout=60s
kubectl -n vllm-test get modeldeployment launch-test -o jsonpath='{.status.image.source}{"\n"}'

# Compare against a tag that DOES contain the literal "launch":
kubectl -n vllm-test patch modeldeployment launch-test --type=merge \
  -p '{"spec":{"engine":{"image":"vllm/vllm-openai:cu130-launch"}}}'
sleep 2
kubectl -n vllm-test get modeldeployment launch-test -o jsonpath='{.status.image.source}{"\n"}'
  • Current (finding present): deepseekv4-cu130 → custom (+ UnsupportedImage=True); cu130-launch → launch.
  • Fixed: the launch tag is classified launch regardless of the literal substring.

Cleanup

kubectl -n vllm-test delete modeldeployment launch-test

8 — Overrides security boundary

What it checks: dangerous pod-spec overrides under spec.provider.overrides are denied at admission — both top-level and nested inside arrays.

Why it matters: the admission webhook blocks a denylist (securityContext, serviceAccountName, serviceAccount, hostNetwork, hostPID, hostIPC, automountServiceAccountToken, nodeName, priorityClassName, runtimeClassName), recursing through nested objects/arrays, plus a sizing guard on replicas/resources. There must be no privilege-escalation path through overrides.

# Top-level securityContext + hostNetwork — must be DENIED:
cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: override-test
  namespace: vllm-test
spec:
  provider:
    name: vllm
    overrides:
      spec:
        template:
          spec:
            securityContext:
              runAsNonRoot: false
            hostNetwork: true
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine: { type: vllm }
  resources: { gpu: { count: 1 } }
EOF
kubectl -n vllm-test get modeldeployment,deploy override-test 2>&1   # expect NotFound for both

# Nested-array path — containers[*].securityContext must ALSO be caught:
cat <<EOF | kubectl apply -f - 2>&1 | tail -1
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: override-nested, namespace: vllm-test }
spec:
  provider:
    name: vllm
    overrides:
      spec:
        template:
          spec:
            containers:
              - name: vllm
                securityContext: { privileged: true }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine: { type: vllm }
  resources: { gpu: { count: 1 } }
EOF
  • PASS: both applies fail with Forbidden naming the blocked key(s) (hostNetwork, securityContext); get returns NotFound for both the MD and the Deployment; the nested case reports overriding "securityContext" is not allowed for security reasons.
  • FAIL: either object is accepted, or a Deployment is created with the escalated pod spec.
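The recursion through nested objects and arrays is the part that is easy to get wrong, so here is a minimal Python sketch of it. The denylist is the one quoted above; the function name and path format are hypothetical, and the sizing guard on replicas/resources is deliberately left out.

```python
from typing import Any, List

DENIED = {
    "securityContext", "serviceAccountName", "serviceAccount", "hostNetwork",
    "hostPID", "hostIPC", "automountServiceAccountToken", "nodeName",
    "priorityClassName", "runtimeClassName",
}


def denied_paths(node: Any, path: str = "overrides") -> List[str]:
    """Return the path of every denylisted key, at any depth."""
    hits: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key in DENIED:
                hits.append(child)  # no need to descend into a denied subtree
            else:
                hits += denied_paths(value, child)
    elif isinstance(node, list):
        # Arrays such as containers[*] must be walked too, or a nested
        # securityContext would slip past a top-level-only check.
        for index, item in enumerate(node):
            hits += denied_paths(item, f"{path}[{index}]")
    return hits
```

A check that inspected only dictionaries would accept the override-nested manifest above, which is why the second apply is part of the test.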

9 — Ownership conflict surfacing

What it checks: if a Deployment with the target name already exists and is not owned by the ModelDeployment, the provider refuses to take it over and surfaces the conflict instead of hijacking it.

Why it matters: the provider must never adopt or overwrite a resource it doesn't own. The discriminator is the Deployment UID — the provider applies by name via server-side apply and never deletes-and-recreates, so a protected foreign object keeps its same UID and original image.

Fixture gotcha (important): the foreign Deployment must have no ownerReferences. A Deployment whose owner reference points at a non-existent object is deleted by Kubernetes garbage collection within ~2s, before the provider runs. That masks the guard and makes the fresh, legitimately owned Deployment that the provider then creates look like a hijack. Use an unowned object so it survives GC and the guard is actually exercised.

# Pre-create a FOREIGN Deployment (no ownerReferences) with the name the MD will want.
# Capture its UID so we can prove it is untouched.
FOREIGN_UID=$(cat <<'EOF' | kubectl apply -f - -o jsonpath='{.metadata.uid}'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: owned-test
  namespace: vllm-test
spec:
  replicas: 1
  selector: { matchLabels: { app: foreign } }
  template:
    metadata: { labels: { app: foreign } }
    spec: { containers: [ { name: pause, image: registry.k8s.io/pause:3.9 } ] }
EOF
)
echo "FOREIGN_UID=$FOREIGN_UID"

cat <<'EOF' | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata: { name: owned-test, namespace: vllm-test }
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine: { type: vllm }
  resources: { gpu: { count: 1 } }
EOF
sleep 8

kubectl -n vllm-test get modeldeployment owned-test -o jsonpath='{.status.phase}{"\n"}'
kubectl -n vllm-test get modeldeployment owned-test \
  -o jsonpath='{range .status.conditions[?(@.type=="ResourceCreated")]}{.status} {.reason}{"\n"}{end}'
kubectl -n vllm-test get modeldeployment owned-test -o jsonpath='{.status.message}{"\n"}'

# Prove the foreign object is the SAME, untouched (same UID, still the pause image):
echo "FINAL_UID=$(kubectl -n vllm-test get deploy owned-test -o jsonpath='{.metadata.uid}')"
kubectl -n vllm-test get deploy owned-test \
  -o jsonpath='owners={range .metadata.ownerReferences[*]}[{.kind}/{.name}]{end} image={.spec.template.spec.containers[0].image}{"\n"}'

kubectl -n vllm-test delete modeldeployment,deploy owned-test --ignore-not-found
  • PASS: MD phase Failed; ResourceCreated = False ResourceConflict; message ...exists but is not managed by this ModelDeployment (no owner references); the foreign Deployment has FINAL_UID == FOREIGN_UID, no owners, and still registry.k8s.io/pause:3.9.
  • FAIL: the foreign Deployment's image/owner changes (hijacked), or FINAL_UID != FOREIGN_UID (silently recreated).
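One plausible shape for the guard is an owner-reference check before apply. The Python below is a sketch only: it assumes the provider decides ownership by matching the ModelDeployment UID against metadata.ownerReferences, which this plan does not confirm, and the function name is hypothetical.

```python
from typing import Any, Dict, Optional


def ownership_conflict(existing: Optional[Dict[str, Any]], md_uid: str) -> Optional[str]:
    """Return a conflict message if `existing` must not be touched, else None.

    existing -- the Deployment already present under the target name, or None
    md_uid   -- metadata.uid of the ModelDeployment being reconciled
    """
    if existing is None:
        return None  # nothing there: safe to create
    owners = existing.get("metadata", {}).get("ownerReferences") or []
    if any(ref.get("uid") == md_uid for ref in owners):
        return None  # already ours: safe to update
    detail = "no owner references" if not owners else "owned by another object"
    return f"exists but is not managed by this ModelDeployment ({detail})"
```

For the unowned fixture in this test, the sketch yields the "no owner references" message and the provider would leave the object alone.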

10 — Finalizer stuck-terminating

What it checks: when an owned Deployment is stuck Terminating, the ModelDeployment's FinalizerTimeout (5 min) still fires so the MD doesn't requeue forever.

Why it matters: if the timeout check is gated behind a delete error that never occurs for a stuck-terminating object, the MD can loop indefinitely. (Open finding — confirm whether the timeout eventually forces removal or the MD is stuck.)

cat <<EOF | kubectl apply -f -
apiVersion: airunway.ai/v1alpha1
kind: ModelDeployment
metadata:
  name: final-test
  namespace: vllm-test
spec:
  provider: { name: vllm }
  model: { id: Qwen/Qwen2.5-0.5B-Instruct, source: huggingface }
  engine: { type: vllm }
  resources: { gpu: { count: 1 } }
EOF
kubectl -n vllm-test wait --for=jsonpath='{.status.provider.name}'=vllm \
  modeldeployment/final-test --timeout=60s

# Make the owned Deployment un-deletable, then delete the MD:
kubectl -n vllm-test patch deploy final-test --type=merge \
  -p '{"metadata":{"finalizers":["example.com/block-delete"]}}'
kubectl -n vllm-test delete modeldeployment final-test --wait=false
kubectl -n vllm-test get modeldeployment final-test -o jsonpath='{.status.phase}{"\n"}'   # Terminating
sleep 330   # > the 5-min FinalizerTimeout
kubectl -n vllm-test get modeldeployment final-test 2>&1

# Cleanup:
kubectl -n vllm-test patch deploy final-test --type=merge -p '{"metadata":{"finalizers":[]}}'
  • PASS (fixed): after the timeout the MD is forcibly removed (NotFound).
  • FAIL (finding present): the MD is still Terminating / still present after 5+ minutes.

11 — Recipe client: SSRF / path-traversal + chunked-OOM

What it checks: (11a) traversal/SSRF-style paths are rejected before any upstream fetch; (11b) the 5 MiB response cap actually bounds a chunked (no Content-Length) body; (11c) trailing-slash recipe routes resolve correctly.

Why it matters: the recipe client proxies an external origin. Path-traversal must not reach a fetch, and a chunked response must not be fully buffered into memory before the size check (an OOM vector). Trailing-slash handling affects every recipe route.

Runtime gotchas (read before running):

  • The backend listens on PORT || 3001, not 3000 — hitting 3000 gives a connection refused.
  • VLLM_RECIPES_BASE_URL is read once at construction — you must start the backend pointed at the bad origin; you can't redirect a running one.
  • The OOM probe must sample the backend's RSS, not curl's.
  • Always kill a stale backend on 3001 first and use a readiness gate, or you'll silently test an old process (a classic "the fix didn't work" false alarm).

11a — SSRF / path-traversal (no upstream needed)

# Kill any stale backend on 3001 first, then start fresh and wait until it's listening.
OLD=$(lsof -tiTCP:3001 -sTCP:LISTEN 2>/dev/null); [ -n "$OLD" ] && kill $OLD 2>/dev/null
for i in $(seq 1 10); do lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 || break; sleep 1; done

( cd backend && bun run src/index.ts >/tmp/airunway-backend.log 2>&1 & )
for i in $(seq 1 30); do
  lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 && { echo "backend up after ${i}s"; break; }
  sleep 1
done

# --path-as-is stops curl from collapsing ../ client-side, so the server sees the raw path:
curl -s --path-as-is -o /dev/null -w '%{http_code}\n' 'http://localhost:3001/api/vllm/recipes/foo/../../etc/passwd'
curl -s              -o /dev/null -w '%{http_code}\n' 'http://localhost:3001/api/vllm/recipes/foo/bar%2Fbaz'
curl -s              -o /dev/null -w '%{http_code}\n' 'http://localhost:3001/api/vllm/recipes/onlyoneseg'
  • PASS: 404 (escaped path), 400 (encoded slash inside a segment), 404 (single segment) — all non-2xx; the traversal never reaches an upstream fetch.
  • FAIL: any 2xx, or evidence the request reached an upstream fetch.

11b — chunked-OOM probe (5 MiB cap must bound a chunked body)

The cap is implemented in readBoundedBody (backend/src/services/vllmRecipesClient.ts): it streams the body with getReader() and calls controller.abort() the instant the running byte total exceeds 5 MiB, so a chunked / no-Content-Length reply can't be fully buffered.
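The same idea, expressed in Python for illustration: read in chunks, keep a running total, and stop as soon as the total passes the cap. This is not a port of readBoundedBody; the names read_bounded and BodyTooLarge and the 64 KiB chunk size are assumptions.

```python
from typing import BinaryIO, List

MAX_BYTES = 5 * 1024 * 1024  # 5 MiB cap, matching the limit described above


class BodyTooLarge(Exception):
    """Raised when the body exceeds the cap; the caller maps this to an error response."""


def read_bounded(stream: BinaryIO, limit: int = MAX_BYTES,
                 chunk_size: int = 64 * 1024) -> bytes:
    """Read a body of unknown length without ever buffering more than the cap."""
    chunks: List[bytes] = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return b"".join(chunks)  # end of stream, still under the cap
        total += len(chunk)
        if total > limit:
            # Stop reading here: the rest of the body is never pulled into memory.
            raise BodyTooLarge(f"body exceeds {limit} bytes")
        chunks.append(chunk)
```

Memory use is bounded by roughly the cap plus one chunk, regardless of how much the origin would have streamed. That is the property the probe below measures from the outside.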

Measure it correctly — vary the body size; don't trust a single absolute delta. A cold backend's first request is dominated by Bun JIT/allocator warmup (~60-80 MB), which swamps the response body and makes one cold reading meaningless. The real discriminator is whether memory stays flat as the upstream body grows: a bounded reader aborts at 5 MiB regardless of body size, so an 8 MiB and a 128 MiB body produce the same delta; a buffering reader would scale ~linearly with the body.

# Origin that serves a configurable-size, NO-Content-Length JSON body (size in MiB = last URL segment):
python3 - <<'PY' >/tmp/badorigin.log 2>&1 &
from http.server import BaseHTTPRequestHandler, HTTPServer
class H(BaseHTTPRequestHandler):
    def do_GET(self):
        try: mib = int(self.path.strip('/').split('/')[-1])
        except Exception: mib = 8
        self.send_response(200); self.send_header('Content-Type','application/json'); self.end_headers()
        self.wfile.write(b'{"models":[')
        chunk = b'"xxxxxxxx",' * 11916   # 11 bytes x 11916 ~= 128 KiB
        try:
            for _ in range(mib*8): self.wfile.write(chunk)
            self.wfile.write(b'"x"]}')
        except BrokenPipeError: pass
    def log_message(self,*a): pass
HTTPServer(('127.0.0.1',9998),H).serve_forever()
PY
ORIGIN=$!; sleep 1

# Boot the backend AGAINST the bad origin (env read once at construction). Kill any stale backend first.
OLD=$(lsof -tiTCP:3001 -sTCP:LISTEN 2>/dev/null); [ -n "$OLD" ] && kill $OLD 2>/dev/null
for i in $(seq 1 10); do lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 || break; sleep 1; done
( cd backend && VLLM_RECIPES_BASE_URL=http://127.0.0.1:9998 bun run src/index.ts >/tmp/airunway-backend.log 2>&1 & )
for i in $(seq 1 30); do lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 && break; sleep 1; done
BPID=$(lsof -tiTCP:3001 -sTCP:LISTEN -n -P | head -1)

rss(){ ps -o rss= -p "$BPID" | tr -d ' '; }
hit(){ curl -s -o /dev/null -w '%{http_code}' "http://localhost:3001/api/vllm/recipes/foo/$1"; }

echo "warmup http=$(hit 8)"; sleep 1            # absorb one-time JIT/allocator cost
G0=$(rss); echo "baseline RSS (KB): $G0"
for SZ in 8 32 128; do
  c=$(hit $SZ); sleep 1; r=$(rss)
  printf 'body=%4s MiB  http=%s  delta_from_baseline=%s KB\n' "$SZ" "$c" "$((r-G0))"
done
kill "$ORIGIN" "$BPID" 2>/dev/null
  • PASS (cap works): every request returns http=502, and the delta is flat across 8/32/128 MiB (e.g. ~26 MB each — a 16× larger body uses the same memory). The bounded reader aborts at 5 MiB regardless of how much the origin streams.
  • FAIL (finding present): the delta scales with body size (128 MiB body ≈ 16× the 8 MiB delta) — the whole body is being buffered before the size check.

Note on a single cold reading: hitting a freshly-started backend once and seeing a ~70-80 MB delta is not evidence of a leak — that's JIT warmup, and it looks the same whether the body is bounded or not. Only the body-size-vs-delta slope on a warm process distinguishes the two. (The earlier "~71 MB proves it's buffered" reading was this warmup artifact.)
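The slope check can be written down so that the verdict does not depend on eyeballing the numbers. The Python sketch below compares the observed RSS growth between the smallest and largest body against the growth a fully buffering reader would show. The function name and the 25% threshold are arbitrary assumptions, not values taken from the project.

```python
from typing import Dict


def looks_bounded(delta_kb_by_mib: Dict[int, int], max_fraction: float = 0.25) -> bool:
    """Decide whether RSS deltas stay flat as the upstream body grows.

    delta_kb_by_mib -- body size in MiB -> RSS delta from baseline in KB,
                       measured on a warm process (needs at least two sizes)
    max_fraction    -- share of the "fully buffered" growth tolerated as noise
    """
    small, large = min(delta_kb_by_mib), max(delta_kb_by_mib)
    growth_kb = delta_kb_by_mib[large] - delta_kb_by_mib[small]
    # A reader that buffers everything would grow by about the size difference.
    buffered_growth_kb = (large - small) * 1024
    return growth_kb < max_fraction * buffered_growth_kb
```

Feed it the three delta_from_baseline values printed by the loop above, keyed by body size.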

11c — trailing slash vs no slash (routing)

# Kill any stale backend on 3001 first, then start fresh and wait until it's listening.
OLD=$(lsof -tiTCP:3001 -sTCP:LISTEN 2>/dev/null); [ -n "$OLD" ] && kill $OLD 2>/dev/null
for i in $(seq 1 10); do lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 || break; sleep 1; done

( cd backend && bun run src/index.ts >/tmp/airunway-backend.log 2>&1 & )
for i in $(seq 1 30); do
  lsof -iTCP:3001 -sTCP:LISTEN -n -P >/dev/null 2>&1 && { echo "backend up after ${i}s"; break; }
  sleep 1
done

# The backend restarted above listens on 3001 (11b killed the previous process).
curl -s -o /dev/null -w '%{http_code}  GET /api/vllm/recipes\n'                                'http://localhost:3001/api/vllm/recipes'
curl -s -o /dev/null -w '%{http_code}  GET /api/vllm/recipes/  (trailing)\n'                   'http://localhost:3001/api/vllm/recipes/'
curl -s -o /dev/null -w '%{http_code}  GET .../microsoft/Phi-4-mini-instruct\n'                'http://localhost:3001/api/vllm/recipes/microsoft/Phi-4-mini-instruct'
curl -s -o /dev/null -w '%{http_code}  GET .../Phi-4-mini-instruct/ (trailing)\n'              'http://localhost:3001/api/vllm/recipes/microsoft/Phi-4-mini-instruct/'

# Confirm the redirect target and that following it lands on the real route:
curl -s  -o /dev/null -w 'status=%{http_code}  location=%header{location}\n' 'http://localhost:3001/api/vllm/recipes/'
curl -sL -o /dev/null -w 'final=%{http_code}  redirects=%{num_redirects}\n'  'http://localhost:3001/api/vllm/recipes/'
  • PASS: no-slash routes match (2xx, or 5xx if upstream is unreachable — not 404); trailing-slash GET returns 301 redirecting to the no-slash path; curl -L follows to the real route. (POST .../resolve/ is intentionally left at 404 — only GET/HEAD are redirected.)
  • FAIL: a trailing-slash GET returns 404 (pre-fix behavior).

Also covered by backend/src/routes/vllmRecipes.test.ts — run cd backend && bun test src/routes/vllmRecipes.test.ts.
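
The PASS criteria for 11c can be applied mechanically to the status codes printed by the curl commands. This is a sketch only: route_verdict and its "plain"/"trailing" labels are hypothetical names introduced here, not something the repository provides.

```shell
# Hypothetical helper, not part of the repo.
# usage: route_verdict KIND HTTP_CODE   (KIND is "plain" or "trailing")
route_verdict() {
  case "$1:$2" in
    plain:404)            echo FAIL ;;  # the route did not match at all
    plain:2??|plain:5??)  echo PASS ;;  # matched; 5xx only means upstream is unreachable
    trailing:301)         echo PASS ;;  # redirected to the no-slash path
    *)                    echo FAIL ;;  # anything else, including trailing:404
  esac
}
```

For example, route_verdict trailing "$(curl -s -o /dev/null -w '%{http_code}' 'http://localhost:3001/api/vllm/recipes/')" should print PASS once the fix is in place.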


12 — Recipe apply, end-to-end via the API

What it checks: the create endpoint accepts engineArgs/engineExtraArgs/env as records and converts them to their wire forms, and that it writes the same recipe provenance annotations the UI's "Apply recipe" action produces.

Why it matters: this is the exact materialization the UI emits. env must land as the array wire form (not a map), and all six provenance annotations (airunway.ai/generated-by plus the five airunway.ai/recipe.* keys) must be present.

# Backend listens on PORT || 3001 (see step 11).
curl -s -X POST http://localhost:3001/api/deployments \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "recipe-test",
    "namespace": "vllm-test",
    "modelId": "microsoft/Phi-4-mini-instruct",
    "engine": "vllm",
    "provider": "vllm",
    "mode": "aggregated",
    "imageRef": "vllm/vllm-openai:cu130-nightly",
    "resources": { "gpu": 1 },
    "engineArgs": { "tensor-parallel-size": "1" },
    "engineExtraArgs": ["--enable-chunked-prefill"],
    "env": { "VLLM_USE_V1": "1", "NCCL_DEBUG": "INFO" },
    "recipeProvenance": {
      "source": "recipes.vllm.ai", "id": "microsoft/Phi-4-mini-instruct",
      "strategy": "latency", "hardware": "a100", "features": ["tool_calling"]
    }
  }' | python3 -m json.tool

# The resolver/controller write annotations ~1s after create — wait for the provenance marker first:
kubectl -n vllm-test wait \
  --for=jsonpath='{.metadata.annotations.airunway\.ai/generated-by}'=vllm-recipe-resolver \
  modeldeployment/recipe-test --timeout=30s

# env must be the ARRAY form:
kubectl -n vllm-test get modeldeployment recipe-test -o jsonpath='{.spec.env}{"\n"}'

# Provenance annotations (jq is robust on the JSON blob):
kubectl -n vllm-test get modeldeployment recipe-test -o json \
  | jq '.metadata.annotations | with_entries(select(.key | test("airunway\\.ai/(recipe\\.|generated-by)")))'
  • PASS: spec.env is [{"name":"VLLM_USE_V1","value":"1"},{"name":"NCCL_DEBUG","value":"INFO"}] (array, not a map); the jq output shows exactly these six keys:
    • airunway.ai/generated-by = vllm-recipe-resolver
    • airunway.ai/recipe.source = recipes.vllm.ai
    • airunway.ai/recipe.id = microsoft/Phi-4-mini-instruct
    • airunway.ai/recipe.strategy = latency
    • airunway.ai/recipe.hardware = a100
    • airunway.ai/recipe.features = ["tool_calling"]
  • FAIL: env is stored as a map, or any provenance annotation is missing.
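
To avoid checking the six keys by eye, the presence check can be scripted. The sketch below assumes a hypothetical helper, missing_provenance_keys, that reads annotation keys one per line on stdin. It checks presence only, not the values, and it does not flag extra keys.

```shell
# Hypothetical helper, not part of the repo: read annotation keys (one per
# line) on stdin and print each expected provenance key that is absent.
missing_provenance_keys() {
  present=$(cat)
  for key in generated-by recipe.source recipe.id recipe.strategy \
             recipe.hardware recipe.features; do
    # -F: fixed string, -x: whole-line match, -q: quiet
    printf '%s\n' "$present" | grep -Fxq "airunway.ai/$key" \
      || printf 'airunway.ai/%s\n' "$key"
  done
}
```

One way to feed it is kubectl -n vllm-test get modeldeployment recipe-test -o json | jq -r '.metadata.annotations | keys[]' | missing_provenance_keys. Empty output means all six keys are present.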

UI variant: open the Deploy page, pick microsoft/Phi-4-mini-instruct, select Direct vLLM, wait for "Official vLLM recipe found", and click Apply recipe. The YAML preview should show the same six provenance annotations listed above (airunway.ai/generated-by plus the five airunway.ai/recipe.* keys) and no spec.recipe field. Blank out hardware and re-apply → the airunway.ai/recipe.hardware annotation disappears (blank values are trimmed and skipped).


13 — Provider crash idempotency

What it checks: killing the provider pod leaves managed resources intact and the provider re-reconciles cleanly.

Why it matters: server-side apply is idempotent, so a provider restart must not disrupt running deployments, and the provider must re-register its heartbeat.

# With any ModelDeployment Running (this example uses pin-test; substitute the name of yours):
kubectl -n airunway-system delete pod -l control-plane=vllm-provider
kubectl -n airunway-system rollout status deploy/airunway-vllm-provider --timeout=120s
kubectl -n vllm-test get deploy,svc pin-test
kubectl get inferenceproviderconfig vllm -o jsonpath='{.status.lastHeartbeat}{"\n"}'
  • PASS: the managed Deployment/Service are unchanged; the provider rolls out and re-registers with a fresh lastHeartbeat.
  • FAIL: managed resources are recreated/disrupted, or the provider fails to re-register.
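
The commands above show the state only after the restart, so "unchanged" is easier to judge with a before/after comparison. The sketch below assumes two hypothetical helpers, snapshot and verify_unchanged. It compares each object's UID and metadata.generation, which detects recreation and spec changes but not pod restarts. Services may print an empty generation.

```shell
# Hypothetical helpers, not part of the repo.
# snapshot NS NAME: one line per managed object with its UID and generation.
snapshot() {
  kubectl -n "$1" get deploy,svc "$2" \
    -o jsonpath='{range .items[*]}{.kind}/{.metadata.name} uid={.metadata.uid} gen={.metadata.generation}{"\n"}{end}'
}

# verify_unchanged NS NAME CMD...: snapshot, run CMD, snapshot again.
# Returns 0 if both snapshots match, 1 if they differ, 2 if a step failed.
verify_unchanged() {
  ns=$1; name=$2; shift 2
  before=$(snapshot "$ns" "$name") || return 2
  "$@" || return 2
  after=$(snapshot "$ns" "$name") || return 2
  [ "$before" = "$after" ]
}
```

For example, wrap the pod deletion and rollout wait in a small function restart_provider and run verify_unchanged vllm-test pin-test restart_provider && echo PASS || echo FAIL.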

14 — Cleanup

kubectl delete namespace vllm-test