Setup:
- Intel(R) Arc(tm) A380 Graphics (DG2) with 6 GB of VRAM.
Monitor GPU usage:
sudo dnf install intel-gpu-tools nvtop
sudo intel_gpu_top
sudo nvtop
Info: https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md
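The podman commands below pass specific DRI nodes with --device; the card/render numbering (card0/card1, renderD128/renderD129) varies per host, so it can help to list what exists first. A minimal sketch:

```shell
# List the DRM device nodes available on this host; pick the card/renderD
# pair that belongs to the Arc GPU for the --device flags below.
for d in /dev/dri/card* /dev/dri/renderD*; do
  [ -e "$d" ] && echo "$d"
done
echo "scan done"
```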
On Intel, the -intel images can be used for SYCL, or the -vulkan ones for Vulkan. Performance is similar.
podman run --rm --device /dev/dri/card1 --device /dev/dri/renderD128 -p 8080:8080 -v ~/.cache/llama.cpp:/models ghcr.io/ggml-org/llama.cpp:server-intel --model /models/unsloth_Qwen3.5-9B-GGUF_Qwen3.5-9B-Q3_K_M.gguf -c 32768
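Once the container is up, the server can be smoke-tested via its OpenAI-compatible endpoint. The URL follows from the -p 8080:8080 mapping above; the payload here is only a minimal example (llama.cpp serves whatever model was loaded, so no "model" field is needed):

```shell
# Minimal chat request; validate the JSON locally before sending it.
PAYLOAD='{"messages":[{"role":"user","content":"Say hi"}],"max_tokens":16}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# With the server running, POST it:
# curl -s http://127.0.0.1:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```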
OpenVINO seems much faster, but Qwen3.5 is not available yet, nor does llama.cpp seem to support my GPU via OpenVINO. Unfortunately, using OpenVINO with Qwen2.5-Coder and OpenCode does not give good results (tools are not used).
SOURCE="OpenVINO/Qwen2.5-Coder-7B-Instruct-int4-ov"
NAME="Qwen2.5-Coder-7B-Instruct-int4"
OPTS="--rm --user root -v $PWD/models:/models:z openvino/model_server:latest-gpu --source_model $SOURCE --model_repository_path /models --model_name $NAME --target_device GPU --task text_generation"
podman run $OPTS --pull
podman run $OPTS --port 9001 --rest_port 8001
podman run --rm --network host -e BASE_URL=http://127.0.0.1:8001/v3 -e CUSTOM_MODELS="-all,$NAME" yidadaa/chatgpt-next-web
The context size is critical: opencode needs a context of roughly 11,000 tokens to work, since the tool definitions are passed along in the prompt.
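The OVMS endpoint can be checked the same way. Unlike llama.cpp, OVMS requires the "model" field to match --model_name; the port and name below are taken from the variables above, so this is a sketch to adjust if you changed them:

```shell
# Build a minimal chat request against the OVMS OpenAI-compatible API
# (model name must match the --model_name used when starting the server).
NAME="Qwen2.5-Coder-7B-Instruct-int4"
PAYLOAD=$(printf '{"model":"%s","messages":[{"role":"user","content":"hello"}],"max_tokens":16}' "$NAME")
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# With the server running, POST it:
# curl -s http://127.0.0.1:8001/v3/chat/completions \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```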
~/.config/opencode/opencode.json:
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "openvino": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "openvino server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8001/v3"
      },
      "models": {
        "Qwen2.5-Coder-7B-Instruct-int4": {
          "name": "Qwen2.5-Coder: 7b (local)",
          "limit": {
            "context": 32768,
            "output": 4096
          }
        }
      }
    },
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "Qwen3.5-9B-Q3_K_M": {
          "name": "Qwen3.5: 9B Q3_K_M (local)",
          "limit": {
            "context": 32768,
            "output": 4096
          }
        }
      }
    }
  }
}
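Strict JSON parsers reject trailing commas (easy to leave behind after editing the "options" blocks), so it is worth validating the file before launching opencode. A small check, assuming the config path from above:

```shell
# Parse the opencode config with Python's JSON tool; any syntax error
# (e.g. a trailing comma) makes it exit non-zero.
python3 -m json.tool ~/.config/opencode/opencode.json > /dev/null 2>&1 \
  && echo "config ok" || echo "config invalid or missing"
```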