
Purpose

Fully local AI agent using Gemma 4 on an Nvidia GPU with 24GB VRAM.

Tools

I'm currently on the latest rolling release of CachyOS using Linux kernel 7.0.5-2-cachyos on a 2025 ASUS ROG Zephyrus G16 with a 16-core Intel Core Ultra 9 285H, 64GB DDR5 RAM, and an Nvidia RTX 5090 Mobile with 24GB VRAM.

Start by installing the system packages we'll need. If you're using another package manager like apt or zypper, google around to find the equivalent packages.

sudo pacman -S --noconfirm --needed git screen curl base-devel cmake cuda nccl openssl pkg-config gcc15 yq
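
Before going further, it's worth confirming that the Nvidia driver and the CUDA toolkit are both visible, since the build below depends on them. nvidia-smi should list the RTX 5090 Mobile, and nvcc (installed under /opt/cuda by the cuda package) should report its version.

nvidia-smi
/opt/cuda/bin/nvcc --version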

Edit your dotfiles to add ~/.local/bin to $PATH if it isn't already there, then exit the terminal.

echo -e '\n# Add ~/.local/bin to PATH if found\nif [ -d "$HOME/.local/bin" ] && [[ ":$PATH:" != *":$HOME/.local/bin:"* ]]; then\n    export PATH="$HOME/.local/bin:$PATH"\nfi' >> ~/.bashrc
exit
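
In the next terminal you open, a quick check (assuming bash) confirms the PATH change took effect; it should print the ~/.local/bin path once.

echo "$PATH" | tr ':' '\n' | grep -x "$HOME/.local/bin"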

Start a new terminal and compile llama-server from source to ensure native support for your GPU(s).

git clone --depth 1 --branch b9101 https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CCACHE=OFF -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-15 -DCMAKE_INSTALL_PREFIX=/usr/local
cmake --build build -j $(nproc) --config Release
cp build/bin/llama-server ~/.local/bin/
cd .. && rm -rf llama.cpp
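
As a quick smoke test of the fresh build, ask the binary for its version; it should report the tag and commit it was built from.

llama-server --version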

Install the goose AI agent.

curl -fsSL https://github.com/aaif-goose/goose/releases/download/stable/download_cli.sh | CONFIGURE=false bash
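
If the installer finished cleanly, the goose CLI should now be on your PATH; checking its version is a quick way to confirm before configuring anything.

goose --version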

Register a custom OpenAI-compatible provider pointing at the local llama-server, and set it as the default.

mkdir -p ~/.config/goose/custom_providers
cat > ~/.config/goose/custom_providers/llama-server.json << 'EOF'
{
  "name": "llama-server",
  "engine": "openai",
  "display_name": "llama-server",
  "base_url": "http://localhost:8080",
  "models": [{"name": "gemma-4", "context_limit": 262144}],
  "requires_auth": false,
  "supports_streaming": true
}
EOF
yq -i -y '.GOOSE_TELEMETRY_ENABLED = true | .GOOSE_PROVIDER = "llama-server" | .GOOSE_MODEL = "gemma-4"' ~/.config/goose/config.yaml
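
To confirm the settings landed, you can read the two keys back with the same jq-style yq used above (a quick sanity check, not required):

yq -y '{GOOSE_PROVIDER, GOOSE_MODEL}' ~/.config/goose/config.yaml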

Deploy

Start a screen session and serve Gemma 4 26B A4B (Instruction Tuned) with 4-bit quantization on localhost:8080, which will consume ~16GB of VRAM. Set -np 1 to process only one query at a time, so that the agent can use the full 262144-token KV cache, which may consume more than the remaining 8GB of VRAM even at 4-bit quantization. A long-running goose session will tell us whether this leads to an OOM crash or to poor performance from swapping into system RAM.

screen
llama-server -dio --no-warmup --spec-default --jinja --reasoning-format deepseek -fa on -ngl all -np 1 -c 262144 -b 2048 -ub 2048 -ctk q4_0 -ctv q4_0 --temp 0.8 --top-p 0.95 --top-k 64 -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M -a gemma-4
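
Once the model finishes loading, llama-server exposes a small HTTP API on that port. From a second terminal (or another screen window) you can confirm it is healthy, check that the gemma-4 alias appears on the OpenAI-compatible models route, and watch VRAM usage while the KV cache grows.

curl -s http://localhost:8080/health
curl -s http://localhost:8080/v1/models
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5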

Press Ctrl+A, then C to create a new screen window. Then ask the goose coding agent to do something useful with the local model.

goose run --no-session --text "Summarize host machine specs. Focus on local inference capabilities."
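
For longer interactive work against the same local model, goose also has an interactive session mode:

goose session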