What’s happening

  • That log line:

    Overriding ... dispatch key: AutocastCPU ... new kernel: ... ipex-cpu ...
    INFO ... Automatically detected platform cpu.
    

    means IPEX’s autocast kernels have replaced the default ones. With --dtype=float16 on CPU, PyTorch/IPEX either upcasts tensors or falls into slow, non-vectorized code paths, so the server can appear to hang at model load/compile.

  • On small models the server often appears stuck before the first "engine initialized" message, especially when thread counts or CPU affinity are tightly constrained.


Fixes (apply these in order)

  1. Use a CPU-appropriate dtype
  • Prefer --dtype=float32 (works everywhere).
  • If your host CPU supports AVX512-BF16 (e.g., a recent Xeon, including under a WSL2 backend), you can try --dtype=bfloat16 for speed; see the quick check below this list.
  • Do not use float16 on CPU.
  2. Optionally disable IPEX to simplify things
  • Set VLLM_USE_IPEX=0 to run pure PyTorch kernels. This avoids the autocast override entirely.
  • If you keep IPEX, remove the custom thread-binding env vars as well.
  3. Relax thread pinning / binding
  • Drop VLLM_CPU_OMP_THREADS_BIND and the tight thread limits while you test. Start simple:

    • OMP_NUM_THREADS=2, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1 is fine, but remove the 0-1 core binding.
  4. Bump memory a little
  • For safety, give the pod memory: "4Gi" (limit) for the first bring-up, then tune down.
  5. Add verbosity to confirm progress
  • Add --log-level INFO (or DEBUG) to see loader/engine messages.
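
Quick check for the bfloat16 suggestion above: it only pays off if the host CPU actually advertises AVX512-BF16. A minimal check on a Linux host (the flag name below is the standard /proc/cpuinfo flag; adapt for other platforms):

# Prints avx512_bf16 once if the CPU supports it; no output means stay on --dtype=float32.
grep -m1 -o 'avx512_bf16' /proc/cpuinfo
# Or, equivalently:
lscpu | grep -io 'avx512_bf16' | head -n 1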

Minimal local sanity check (outside KServe)

Before fighting Knative probes, try the container by itself to confirm the model path works:

# Make sure /models/model actually contains config.json, model.safetensors, tokenizer files, etc.
docker run --rm -it -p 8000:8000 \
  -e VLLM_TARGET_DEVICE=cpu \
  -e VLLM_USE_IPEX=0 \
  -v /ABS/PATH/TO/MODEL_DIR:/models:ro \
  schoolofdevops/vllm-cpu-nonuma:0.9.1 \
  --model /models/model \
  --host 0.0.0.0 --port 8000 \
  --dtype float32 \
  --max-model-len 1024 \
  --served-model-name smollm2-135m-atharva \
  --disable-frontend-multiprocessing \
  --max-num-seqs 1 \
  --swap-space 0.5 \
  --log-level INFO

Then hit:

curl -s localhost:8000/health
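
A healthy server answers with an empty body and HTTP 200; to make the status code visible:

curl -s -o /dev/null -w '%{http_code}\n' localhost:8000/health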

and one test completion:

curl -s -X POST localhost:8000/v1/completions \
 -H 'content-type: application/json' \
 -d '{"model":"smollm2-135m-atharva","prompt":"Hello","max_tokens":16}'

If this works, move back to KServe.


Drop-in KServe patch

Replace your container section with this (key changes commented):

spec:
  predictor:
    minReplicas: 1
    maxReplicas: 1
    containerConcurrency: 1
    containers:
      - name: vllm
        image: schoolofdevops/vllm-cpu-nonuma:0.9.1
        args:
          - --model=/models/model
          - --host=0.0.0.0
          - --port=8000
          - --max-model-len=1024               # start smaller
          - --served-model-name=smollm2-135m-atharva
          - --dtype=float32                     # <-- critical
          - --disable-frontend-multiprocessing  # good for containerized CPU
          - --max-num-seqs=1
          - --swap-space=0.5
          - --log-level=INFO
        env:
          - name: VLLM_TARGET_DEVICE
            value: "cpu"
          - name: VLLM_USE_IPEX                # <-- disable IPEX to avoid autocast confusion
            value: "0"
          - name: OMP_NUM_THREADS
            value: "2"
          - name: OPENBLAS_NUM_THREADS
            value: "1"
          - name: MKL_NUM_THREADS
            value: "1"
          # Remove VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND for now
        ports:
          - name: http1
            containerPort: 8000
        resources:
          requests:
            cpu: "2"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"  # give a bit more headroom for bring-up
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 10
        volumeMounts:
          - name: model
            mountPath: /models
            readOnly: true
    nodeName: llmops-kind-worker
    volumes:
      - name: model
        image:
          reference: initcron/smollm2-135m-merged:v3
          pullPolicy: IfNotPresent
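
To roll the patched spec out and watch it come up (assuming the InferenceService manifest is saved as isvc.yaml and lives in the atharva-ml namespace used elsewhere in this note; the pod name below is a placeholder):

kubectl -n atharva-ml apply -f isvc.yaml
kubectl -n atharva-ml get inferenceservice -w           # wait for READY to become True
kubectl -n atharva-ml get pods                          # note the predictor pod name
kubectl -n atharva-ml logs <predictor-pod> -c vllm -f   # watch vLLM load the model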

If it still stalls

  • Verify the model path inside the pod:

    kubectl -n atharva-ml exec -it deploy/atharva-vllm-predictor-default-<hash> -- ls -lah /models/model
    kubectl -n atharva-ml exec -it ... -- head -n 20 /models/model/config.json

    If the dir is empty, the image: volume is probably not mounting as expected; see the pod-level checks after this list.

  • Try re-enabling IPEX but with a good dtype:

    • Set VLLM_USE_IPEX=1 and --dtype=bfloat16 only if your CPU supports bf16; otherwise keep float32.
  • Remove thread envs entirely (let libraries decide) to rule out an affinity deadlock:

    • Temporarily drop OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS.
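
If the model directory is empty or the pod never becomes ready, the pod's events usually explain why (image-volume pull failures, OOM kills, failing probes). With placeholder pod names:

kubectl -n atharva-ml get pods
kubectl -n atharva-ml describe pod <predictor-pod> | tail -n 30   # recent events: pulls, probes, OOM
kubectl -n atharva-ml logs <predictor-pod> -c vllm --previous     # logs from a prior crashed attempt, if any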

TL;DR

  • The override warning itself is harmless; the hang comes from combining float16 on CPU with IPEX autocast.
  • Switch to --dtype=float32 (or bfloat16 on bf16-capable CPUs), and either disable IPEX (VLLM_USE_IPEX=0) or keep it with a supported dtype.
  • Loosen thread/affinity settings and confirm the model path actually has files.