Setup:
- Intel(R) Arc(tm) A380 Graphics (DG2) with 6 GB of VRAM.
Monitor GPU usage:
sudo dnf install intel-gpu-tools nvtop
sudo intel_gpu_top
sudo nvtop
Info: https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md
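The podman commands below pass specific DRI nodes with --device; the card/render numbering (card0/card1, renderD128/renderD129) varies per host, so it can help to list what exists first. A minimal sketch:

```shell
# List the DRM device nodes available on this host; pick the card/renderD
# pair that belongs to the Arc GPU for the --device flags below.
for d in /dev/dri/card* /dev/dri/renderD*; do
  [ -e "$d" ] && echo "$d"
done
echo "scan done"
```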
On Intel, the -intel images can be used for SYCL, or the -vulkan ones for Vulkan. Performance is similar.
podman run --rm --device /dev/dri/card1 --device /dev/dri/renderD128 -p 8080:8080 -v ~/.cache/llama.cpp:/models ghcr.io/ggml-org/llama.cpp:server-intel --model /models/unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-Q3_K_M.gguf -c 32768
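Once the container is up, the server can be smoke-tested via its OpenAI-compatible endpoint. The URL follows from the -p 8080:8080 mapping above; the payload here is only a minimal example (llama.cpp serves whatever model was loaded, so no "model" field is needed):

```shell
# Minimal chat request; validate the JSON locally before sending it.
PAYLOAD='{"messages":[{"role":"user","content":"Say hi"}],"max_tokens":16}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# With the server running, POST it:
# curl -s http://127.0.0.1:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```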
OpenVINO seems much faster, but Qwen3.5 is not available yet, nor does llama.cpp seem to support my GPU via OpenVINO. Unfortunately, using OpenVINO with Qwen2.5-Coder and OpenCode does not give good results (tools are not used).
SOURCE="OpenVINO/Qwen2.5-Coder-7B-Instruct-int4-ov"
NAME="Qwen2.5-Coder-7B-Instruct-int4"
OPTS="--rm --user root -v $PWD/models:/models:z openvino/model_server:latest-gpu --source_model $SOURCE --model_repository_path /models --model_name $NAME --target_device GPU --task text_generation"
podman run $OPTS --pull
podman run $OPTS --port 9001 --rest_port 8001
podman run --rm --network host -e BASE_URL=http://127.0.0.1:8001/v3 -e CUSTOM_MODELS="-all,$NAME" yidadaa/chatgpt-next-web
The context size is critical: opencode needs a context of roughly 11,000 tokens to work, since the tool definitions are passed along in the prompt.
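The OVMS endpoint can be checked the same way. Unlike llama.cpp, OVMS requires the "model" field to match --model_name; the port and name below are taken from the variables above, so this is a sketch to adjust if you changed them:

```shell
# Build a minimal chat request against the OVMS OpenAI-compatible API
# (model name must match the --model_name used when starting the server).
NAME="Qwen2.5-Coder-7B-Instruct-int4"
PAYLOAD=$(printf '{"model":"%s","messages":[{"role":"user","content":"hello"}],"max_tokens":16}' "$NAME")
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# With the server running, POST it:
# curl -s http://127.0.0.1:8001/v3/chat/completions \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```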
~/.config/opencode/opencode.json:
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "openvino": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "openvino server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8001/v3"
      },
      "models": {
        "Qwen2.5-Coder-7B-Instruct-int4": {
          "name": "Qwen2.5-Coder: 7b (local)",
          "limit": {
            "context": 32768,
            "output": 4096
          }
        }
      }
    },
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "Qwen3.5-9B-Q3_K_M": {
          "name": "Qwen3.5: 9B Q3_K_M (local)",
          "limit": {
            "context": 32768,
            "output": 4096
          }
        }
      }
    }
  }
}
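Strict JSON parsers reject trailing commas (easy to leave behind after editing the "options" blocks), so it is worth validating the file before launching opencode. A small check, assuming the config path from above:

```shell
# Parse the opencode config with Python's JSON tool; any syntax error
# (e.g. a trailing comma) makes it exit non-zero.
python3 -m json.tool ~/.config/opencode/opencode.json > /dev/null 2>&1 \
  && echo "config ok" || echo "config invalid or missing"
```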