
Purpose

Fully local AI agent using Gemma 4 on an Nvidia GPU with 24GB VRAM.

Tools

I'm currently on the latest rolling release of CachyOS using Linux kernel 7.0.5-2-cachyos on a 2025 ASUS ROG Zephyrus G16 with a 16-core Intel Core Ultra 9 285H, 64GB DDR5 RAM, and an Nvidia RTX 5090 Mobile with 24GB VRAM.

Start by installing the system packages we'll need. If you're using another package manager like apt or zypper, google around to find the equivalent packages.

sudo pacman -S --noconfirm --needed git screen curl base-devel cmake cuda nccl openssl pkg-config gcc15 yq
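
Before going further, it's worth confirming that the Nvidia driver and the CUDA toolkit are both visible, since the build below depends on them. nvidia-smi should list the RTX 5090 Mobile, and nvcc (installed under /opt/cuda by the cuda package) should report its version.

nvidia-smi
/opt/cuda/bin/nvcc --version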

Edit your dotfiles to add ~/.local/bin to $PATH if it isn't already there, then exit the terminal.

echo -e '\n# Add ~/.local/bin to PATH if found\nif [ -d "$HOME/.local/bin" ] && [[ ":$PATH:" != *":$HOME/.local/bin:"* ]]; then\n    export PATH="$HOME/.local/bin:$PATH"\nfi' >> ~/.bashrc
exit
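
In the next terminal you open, a quick check (assuming bash) confirms the PATH change took effect; it should print the ~/.local/bin path once.

echo "$PATH" | tr ':' '\n' | grep -x "$HOME/.local/bin"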

Start a new terminal and compile llama-server from source to ensure native support for your GPU(s).

git clone --depth 1 --branch b9101 https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CCACHE=OFF -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-15 -DCMAKE_INSTALL_PREFIX=/usr/local
cmake --build build -j $(nproc) --config Release
cp build/bin/llama-server ~/.local/bin/
cd .. && rm -rf llama.cpp
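
As a quick smoke test of the fresh build, ask the binary for its version; it should report the tag and commit it was built from.

llama-server --version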

Install the goose AI agent.

curl -fsSL https://github.com/aaif-goose/goose/releases/download/stable/download_cli.sh | CONFIGURE=false bash
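
If the installer finished cleanly, the goose CLI should now be on your PATH; checking its version is a quick way to confirm before configuring anything.

goose --version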

Register a custom OpenAI-compatible provider pointing at the local llama-server, and set it as the default.

mkdir -p ~/.config/goose/custom_providers
cat > ~/.config/goose/custom_providers/llama-server.json << 'EOF'
{
  "name": "llama-server",
  "engine": "openai",
  "display_name": "llama-server",
  "base_url": "http://localhost:8080",
  "models": [{"name": "gemma-4", "context_limit": 262144}],
  "requires_auth": false,
  "supports_streaming": true
}
EOF
yq -i -y '.GOOSE_TELEMETRY_ENABLED = true | .GOOSE_PROVIDER = "llama-server" | .GOOSE_MODEL = "gemma-4"' ~/.config/goose/config.yaml
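
To confirm the settings landed, you can read the two keys back with the same jq-style yq used above (a quick sanity check, not required):

yq -y '{GOOSE_PROVIDER, GOOSE_MODEL}' ~/.config/goose/config.yaml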

Deploy

Start a screen session and serve Gemma 4 26B A4B (Instruction Tuned) with 4-bit quantization on localhost:8080, which will consume ~16GB of VRAM. Set -np 1 to process only one query at a time, so that the agent can use the full 262144-token KV cache, which may consume more than the remaining 8GB of VRAM even at 4-bit quantization. A long-running goose session will tell us whether this leads to an OOM crash or to poor performance from swapping into system RAM.

screen
llama-server -dio --no-warmup --spec-default --jinja --reasoning-format deepseek -fa on -ngl all -np 1 -c 262144 -b 2048 -ub 2048 -ctk q4_0 -ctv q4_0 --temp 0.8 --top-p 0.95 --top-k 64 -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M -a gemma-4
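
Once the model finishes loading, llama-server exposes a small HTTP API on that port. From a second terminal (or another screen window) you can confirm it is healthy, check that the gemma-4 alias appears on the OpenAI-compatible models route, and watch VRAM usage while the KV cache grows.

curl -s http://localhost:8080/health
curl -s http://localhost:8080/v1/models
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5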

Press Ctrl+A, then C to create a new screen window. Then ask the goose coding agent to do something useful with the local model.

goose run --no-session --text "Summarize host machine specs. Focus on local inference capabilities."
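
For longer interactive work against the same local model, goose also has an interactive session mode:

goose session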