@jodoherty
jodoherty / README.md
Last active May 9, 2026 16:05
llama.cpp server AMD Radeon RX 7900 XTX perfect fit

This llama-server setup is tuned specifically for my AMD Radeon RX 7900 XTX running Gemma 4 26B A4B as quantized by Unsloth.

I've set it up to be stable while keeping as much practical quality as the VRAM limit allows.

This uses 99% of the VRAM on my setup, so there's no headroom left to push it further.

I get between 100 and 120 tokens/second for token generation with a single user.
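For anyone wanting to check a figure like this on their own hardware, here is a minimal sketch of the arithmetic. It assumes the server's completion response carries a `timings` object with `predicted_n` (generated tokens) and `predicted_ms` (generation time in milliseconds); treat those field names as an assumption and confirm them against your llama.cpp build. The helper name and the sample values are hypothetical, not measurements from this setup.

```python
def generation_tokens_per_second(timings: dict) -> float:
    """Compute generation throughput from a llama-server style timings dict.

    Assumes (not verified here) that the dict has:
      predicted_n  - number of tokens generated
      predicted_ms - time spent generating them, in milliseconds
    """
    predicted_n = timings["predicted_n"]
    predicted_ms = timings["predicted_ms"]
    if predicted_ms <= 0:
        raise ValueError("predicted_ms must be positive")
    # Convert milliseconds to seconds, then divide tokens by seconds.
    return predicted_n / (predicted_ms / 1000.0)


# Hypothetical sample values for illustration only.
sample = {"predicted_n": 550, "predicted_ms": 5000.0}
print(f"{generation_tokens_per_second(sample):.1f} tokens/second")
```

This only covers token generation; prompt processing speed is a separate number and would need its own fields.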

jodoherty / v100_benchmarks.md
Last active May 16, 2026 02:45
8xV100 Benchmarks

Benchmarked with llama-benchy:

https://github.com/eugr/llama-benchy

May 14th, 2026

This was done on a Lambda.ai rental, so I didn't want to spend a lot of time testing different prompt sizes and context depths. I just did a basic set of benchmarks with some different quantizations of Gemma 4 26B A4B and Gemma 4 31B.

The machine had 8 V100 GPUs with 16GB of VRAM each:

jodoherty / rtxpro6000_benchmark.md
Last active May 16, 2026 03:26
RTX Pro 6000 Blackwell benchmarks (vLLM)

RTX Pro 6000 Blackwell vLLM Benchmarks

Hardware:

01:00.0 VGA compatible controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Workstation Edition] (rev a1)

Summary