@frafra
Last active March 29, 2026
Personal notes of a Linux user with an Intel Arc GPU getting started with local generative models

Setup:

  • Intel(R) Arc(tm) A380 Graphics (DG2) with 6 GB of VRAM.

Monitor GPU usage:

sudo dnf install intel-gpu-tools nvtop
sudo intel_gpu_top
sudo nvtop
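The container commands below pass the GPU's DRM device nodes (/dev/dri/card*, /dev/dri/renderD*) to podman via --device. A small sketch to list the nodes present on a machine, assuming the standard /dev/dri layout (paths may differ on multi-GPU systems):

```python
from pathlib import Path

def list_dri_nodes(dri_dir="/dev/dri"):
    """List DRM device nodes (cardN, renderDN) to pass to podman via --device."""
    p = Path(dri_dir)
    if not p.is_dir():
        return []
    return sorted(e.name for e in p.iterdir()
                  if e.name.startswith(("card", "renderD")))

if __name__ == "__main__":
    print(list_dri_nodes())
```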

llama.cpp

Info: https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md

On Intel, the -intel images can be used for SYCL, or the -vulkan ones for Vulkan. Performance is similar.

podman run --rm --device /dev/dri/card1 --device /dev/dri/renderD128 -p 8080:8080 -v ~/.cache/llama.cpp:/models ghcr.io/ggml-org/llama.cpp:server-intel --model /models/unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-Q3_K_M.gguf -c 32768
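Once running, the server exposes an OpenAI-compatible API on port 8080. A minimal request sketch using only the Python standard library (the prompt is a placeholder; the model name matches the GGUF file loaded above):

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt, max_tokens=256):
    """Build an OpenAI-compatible chat completion request for llama.cpp server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    req = build_chat_request("http://127.0.0.1:8080", "Qwen3.5-9B-Q3_K_M", "Say hello")
    # Requires the server container to be running:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
    print(req.full_url)
```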

OpenVINO

OpenVINO seems much faster, but Qwen3.5 is not available yet, and llama.cpp does not seem to support my GPU via its OpenVINO backend. Unfortunately, using OpenVINO with Qwen2.5-Coder and OpenCode does not give good results (tools are not used).

SOURCE="OpenVINO/Qwen2.5-Coder-7B-Instruct-int4-ov"
NAME="Qwen2.5-Coder-7B-Instruct-int4"
OPTS="--rm --user root -v $PWD/models:/models:z openvino/model_server:latest-gpu --source_model $SOURCE --model_repository_path /models --model_name $NAME --target_device GPU --task text_generation"
podman run $OPTS --pull
podman run $OPTS --port 9001 --rest_port 8001
podman run --rm --network host -e BASE_URL=http://127.0.0.1:8001/v3 -e CUSTOM_MODELS="-all,$NAME" yidadaa/chatgpt-next-web

OpenCode

The context size is critical: opencode needs a context of roughly 11,000 tokens just to start, as the tool definitions are passed along with the prompt.

~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "openvino": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "openvino server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8001/v3",
      },
      "models": {
        "Qwen2.5-Coder-7B-Instruct-int4": {
          "name": "Qwen2.5-Coder: 7b (local)",
          "limit": {
            "context": 32768,
            "output": 4096
          }
        }
      }
    },
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
      },
      "models": {
        "Qwen3.5-9B-Q3_K_M": {
          "name": "Qwen3.5: 9B Q3_K_M (local)",
          "limit": {
            "context": 32768,
            "output": 4096
          }
        }
      }
    }
  }
}