Enable larger GTT to fit models into memory.
options ttm pages_limit=31457280
options ttm page_pool_size=15728640
Download and stage gemma4 variants into a local directory for llama-server.
mkdir -p /srv/models/{gemma-4-26B-A4B-it-GGUF,gemma-4-E2B-it-GGUF,gemma-4-31B-it-GGUF}
hf download unsloth/gemma-4-26B-A4B-it-GGUF gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf --local-dir /srv/models/gemma-4-26B-A4B-it-GGUF/
hf download unsloth/gemma-4-26B-A4B-it-GGUF mmproj-BF16.gguf --local-dir /srv/models/gemma-4-26B-A4B-it-GGUF/
hf download unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-UD-Q4_K_XL.gguf --local-dir /srv/models/gemma-4-E2B-it-GGUF/
hf download unsloth/gemma-4-E2B-it-GGUF mmproj-BF16.gguf --local-dir /srv/models/gemma-4-E2B-it-GGUF/
hf download unsloth/gemma-4-31B-it-GGUF gemma-4-31B-it-UD-Q4_K_XL.gguf --local-dir /srv/models/gemma-4-31B-it-GGUF/
hf download unsloth/gemma-4-31B-it-GGUF mmproj-BF16.gguf --local-dir /srv/models/gemma-4-31B-it-GGUF/
Create a model preset file at /srv/models/models.ini like this:
[*]
n-gpu-layers = all
[unsloth/gemma-4-26B-A4B-it-GGUF]
m = /models/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
mmproj = /models/gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf
temp = 1.0
top-p = 0.95
top-k = 64
[unsloth/gemma-4-31B-it-GGUF]
m = /models/gemma-4-31B-it-GGUF/gemma-4-31B-it-UD-Q4_K_XL.gguf
mmproj = /models/gemma-4-31B-it-GGUF/mmproj-BF16.gguf
temp = 1.0
top-p = 0.95
top-k = 64
[unsloth/gemma-4-E2B-it-GGUF]
m = /models/gemma-4-E2B-it-GGUF/gemma-4-E2B-it-UD-Q4_K_XL.gguf
mmproj = /models/gemma-4-E2B-it-GGUF/mmproj-BF16.gguf
temp = 1.0
top-p = 0.95
top-k = 64
Run llama-server using podman. This runs it in router mode so you can utilize all three of the above models on the fly.
podman run --rm \
--device /dev/kfd --device /dev/dri \
-v /srv/models:/models \
--network=host \
--name llama-server \
--group-add keep-groups \
ghcr.io/ggml-org/llama.cpp:server-vulkan \
--models-preset /models/models.ini \
--host 0.0.0.0 --port 8000 \
--reuse-port \
--api-key sk-local-example
Install and launch claude code with the environment configured to point at llama-server.
Once I've done the above, I make a wrapper script and call it something like local-claude.sh and put this in it (replacing 10.0.0.20 with your framework desktop's IP address):
#!/bin/sh
export ANTHROPIC_AUTH_TOKEN="sk-local-example"
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://10.0.0.20:8000
export ANTHROPIC_MODEL=unsloth/gemma-4-26B-A4B-it-GGUF
export ANTHROPIC_DEFAULT_OPUS_MODEL=unsloth/gemma-4-31B-it-GGUF
export ANTHROPIC_DEFAULT_SONNET_MODEL=unsloth/gemma-4-26B-A4B-it-GGUF
export ANTHROPIC_DEFAULT_HAIKU_MODEL=unsloth/gemma-4-E2B-it-GGUF
export CLAUDE_CODE_SUBAGENT_MODEL=unsloth/gemma-4-26B-A4B-it-GGUF
exec claude "$@"
On NixOS, I change the final line like so:
#!/bin/sh
export ANTHROPIC_AUTH_TOKEN="sk-local-example"
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://10.0.0.20:8000
export ANTHROPIC_MODEL=unsloth/gemma-4-26B-A4B-it-GGUF
export ANTHROPIC_DEFAULT_OPUS_MODEL=unsloth/gemma-4-31B-it-GGUF
export ANTHROPIC_DEFAULT_SONNET_MODEL=unsloth/gemma-4-26B-A4B-it-GGUF
export ANTHROPIC_DEFAULT_HAIKU_MODEL=unsloth/gemma-4-E2B-it-GGUF
export CLAUDE_CODE_SUBAGENT_MODEL=unsloth/gemma-4-26B-A4B-it-GGUF
nix run github:sadjow/claude-code-nix -- "$@"
This isn't all I do though. I actually take it a step further. I have wireguard tunnels set up for secure access as well as a public TLS protected endpoint via a reverse proxy. Either one works, but I use both for different situations.
To secure the TLS endpiont, I generate and use a strong token. You can make one like so:
import secrets
print(f"sk-local-{secrets.token_hex(32)}")