That log line:

`Overriding ... dispatch key: AutocastCPU ... new kernel: ... ipex-cpu ... INFO ... Automatically detected platform cpu.`

means IPEX's autocast kernels replaced the default ones. With `--dtype=float16` on CPU, PyTorch/IPEX either upcasts or hits slow/non-vectorized code paths and can "hang" at model load/compile. On small models the server often appears stuck before the first "engine initialized" message, especially if threads/affinity are constrained.
- Use a CPU-appropriate dtype
  - Prefer `--dtype=float32` (works everywhere).
  - If your host CPU supports AVX512-bf16 (e.g., recent Xeon / WSL2 backend), you can try `--dtype=bfloat16` for speed; see the quick check after this list.
  - Do not use `float16` on CPU.
- Optionally disable IPEX to simplify
  - Set `VLLM_USE_IPEX=0` to run pure PyTorch kernels. This avoids the autocast override entirely.
  - If you keep IPEX, also remove exotic thread-binding envs.
- Relax thread pinning / binding
  - Drop `VLLM_CPU_OMP_THREADS_BIND` and the tight thread limits while you test. Start simple: `OMP_NUM_THREADS=2`, `OPENBLAS_NUM_THREADS=1`, `MKL_NUM_THREADS=1` is fine, but remove the `0-1` binding.
- Bump memory a little
  - For safety, give the pod `memory: "4Gi"` (limit) for the first bring-up, then tune down.
- Add verbosity to confirm progress
  - Add `--log-level INFO` (or `DEBUG`) to see loader/engine messages.
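If you're not sure whether the host supports bf16, a quick CPU-flag check settles it. A minimal sketch, assuming a Linux host (or the WSL2 backend) where flags are exposed via `/proc/cpuinfo`:

```bash
# Prints avx512_bf16 once if the CPU supports it; empty output means stick with --dtype=float32.
grep -o 'avx512_bf16' /proc/cpuinfo | sort -u

# lscpu reports the same flag list in a friendlier format.
lscpu | grep -io 'avx512_bf16' | sort -u
```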
Before fighting Knative probes, try the container by itself to confirm the model path works:
```bash
# Make sure /models/model actually contains config.json, model.safetensors, tokenizer files, etc.
docker run --rm -it -p 8000:8000 \
-e VLLM_TARGET_DEVICE=cpu \
-e VLLM_USE_IPEX=0 \
-v /ABS/PATH/TO/MODEL_DIR:/models:ro \
schoolofdevops/vllm-cpu-nonuma:0.9.1 \
--model /models/model \
--host 0.0.0.0 --port 8000 \
--dtype float32 \
--max-model-len 1024 \
--served-model-name smollm2-135m-atharva \
--disable-frontend-multiprocessing \
--max-num-seqs 1 \
--swap-space 0.5 \
  --log-level INFO
```

Then hit:

```bash
curl -s localhost:8000/health
```

and one test completion:

```bash
curl -s -X POST localhost:8000/v1/completions \
  -H 'content-type: application/json' \
  -d '{"model":"smollm2-135m-atharva","prompt":"Hello","max_tokens":16}'
```

If this works, move back to KServe.
Replace your container section with this (key changes commented):
```yaml
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 1
    containerConcurrency: 1
    containers:
      - name: vllm
        image: schoolofdevops/vllm-cpu-nonuma:0.9.1
        args:
          - --model=/models/model
          - --host=0.0.0.0
          - --port=8000
          - --max-model-len=1024                 # start smaller
          - --served-model-name=smollm2-135m-atharva
          - --dtype=float32                      # <-- critical
          - --disable-frontend-multiprocessing   # good for containerized CPU
          - --max-num-seqs=1
          - --swap-space=0.5
          - --log-level=INFO
        env:
          - name: VLLM_TARGET_DEVICE
            value: "cpu"
          - name: VLLM_USE_IPEX                  # <-- disable IPEX to avoid autocast confusion
            value: "0"
          - name: OMP_NUM_THREADS
            value: "2"
          - name: OPENBLAS_NUM_THREADS
            value: "1"
          - name: MKL_NUM_THREADS
            value: "1"
          # Remove VLLM_CPU_KVCACHE_SPACE and VLLM_CPU_OMP_THREADS_BIND for now
        ports:
          - name: http1
            containerPort: 8000
        resources:
          requests:
            cpu: "2"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"                        # give a bit more headroom for bring-up
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 15
          periodSeconds: 10
        volumeMounts:
          - name: model
            mountPath: /models
            readOnly: true
    nodeName: llmops-kind-worker
    volumes:
      - name: model
        image:
          reference: initcron/smollm2-135m-merged:v3
          pullPolicy: IfNotPresent
```
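After applying the updated spec, it helps to watch the InferenceService and its predictor pod come up before re-testing. A sketch, assuming the InferenceService is named `atharva-vllm` in the `atharva-ml` namespace and saved as `atharva-vllm.yaml` (adjust to your actual names):

```bash
# Apply the updated InferenceService and watch it become Ready.
kubectl -n atharva-ml apply -f atharva-vllm.yaml
kubectl -n atharva-ml get inferenceservice atharva-vllm -w

# Watch the predictor pod; readiness flips once /health starts answering.
kubectl -n atharva-ml get pods -l serving.kserve.io/inferenceservice=atharva-vllm -w
```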
- Verify the model path inside the pod:

  ```bash
  kubectl -n atharva-ml exec -it deploy/atharva-vllm-predictor-default-<hash> -- ls -lah /models/model
  kubectl -n atharva-ml exec -it ... -- head -n 20 /models/model/config.json
  ```

  If the dir is empty, your `image:` volume might not be mounting as expected (see the describe/events check after this list).
- Try re-enabling IPEX but with a good dtype:
  - Set `VLLM_USE_IPEX=1` and `--dtype=bfloat16` only if your CPU supports bf16; otherwise keep `float32`.
- Remove thread envs entirely (let libraries decide) to rule out an affinity deadlock:
  - Temporarily drop `OMP_NUM_THREADS`, `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS`.
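If the model directory is empty, the pod description and recent namespace events usually show whether the `image:` volume was pulled and mounted (image volumes also need a fairly recent Kubernetes with the ImageVolume feature and a runtime that supports it, which is worth confirming on a kind cluster). The pod name below is illustrative:

```bash
# Check the Volumes section and Events for image-pull or mount errors (pod name is illustrative).
kubectl -n atharva-ml describe pod atharva-vllm-predictor-default-<hash>

# Recent events in the namespace, newest last.
kubectl -n atharva-ml get events --sort-by=.lastTimestamp | tail -n 20
```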
- The warning is fine; the hang is because `float16` on CPU + IPEX autocast is a bad combo.
- Switch to `--dtype=float32` (or `bfloat16` on bf16-capable CPUs), and either disable IPEX (`VLLM_USE_IPEX=0`) or keep it with a supported dtype.
- Loosen thread/affinity settings and confirm the model path actually has files.