- Date: 2026-03-20
- Upgrade: embeddinggemma-300M (768-dim) -> Qwen3-Embedding-0.6B (1024-dim)
- Host: artbird (RTX 3070, 8GB VRAM, i7-9700)
| Hash | Message |
|---|---|
| 291cf0c8 | feat: add benchmark-search subcommand and upgrade QMD embedding model |
| 4423e081 | fix: restart services when systemd unit files change |
| 5173a9d6 | fix: pass QMD env vars to UI server and its subprocesses |
- Verified GPU health - RTX 3070 healthy, CUDA enabled, all 6 systemd services active
- Created `benchmark-search` subcommand - 13 test queries across 5 categories (keyword, semantic, cross-collection, long-context, structured), YAML output, comparison mode
- Ran "before" benchmark - All 13 queries passed, baseline captured
- Updated systemd unit - Added `QMD_EMBED_MODEL` and `QMD_EXPAND_CONTEXT_SIZE=4096` env vars to qmdctl.service
- Added sandcastles entry - Tracked GGUF model filename as ephemeral
- Updated CLAUDE.md - Added benchmark-search to subcommands list
- Fixed provisioner - Services now restart when their unit files change (previously only stopped services were started)
- Deployed to artbird - Provisioned, restarted MCP server with new env vars
- Re-embedded all documents - Cleared index, ran `qmd update` + `qmd embed` with Qwen3 model
- Fixed dimension mismatch bug - UI server's `server.ts` hardcoded a subprocess ENV that didn't pass through `QMD_*` env vars. Also added env vars to qmdctl-ui systemd unit.
- Ran "after" benchmark and comparison
- Total documents: 25,260
- Total vectors: 144,781 (avg ~5.7 chunks per doc)
- Model file: `Qwen3-Embedding-0.6B-Q8_0.gguf` (639 MB)
- VRAM usage: ~7 GB for model + compute buffers
The embed process required 3 passes to complete all vectors:
| Pass | Vectors embedded | Chunks failed | Duration | Notes |
|---|---|---|---|---|
| 1 | 21,216 | 125,555 | 30 min | MCP server was running, competing for VRAM. CUDA OOM on larger chunks. |
| 2 | 30,304 | 65,266 | 30 min | MCP server stopped first. Processed more chunks. |
| 3 | 93,261 | 0 | ~55 min | All remaining chunks completed. |
Root cause of multi-pass: The RTX 3070 has 8 GB VRAM, and the Qwen3 model uses ~7 GB of it (model weights + compute buffers). With five Chrome instances (5 x 28 MB) and Xorg (128 MB) also on the GPU, and the MCP server competing for VRAM during pass 1, there wasn't enough headroom for the largest chunks. Each pass succeeded on progressively smaller chunks, and the final pass (with the MCP server still stopped) completed everything.
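The headroom problem can be made concrete with a rough budget. This is only a sketch using the approximate figures quoted above. It assumes 1 GB = 1024 MB and leaves out the MCP server's VRAM use, which is not quantified here. The names `VramConsumer` and `headroomMb` are illustrative.

```typescript
// Rough VRAM budget for the RTX 3070 during embedding.
// All figures are the approximate values quoted in the text; 1 GB is taken as 1024 MB.
const MB_PER_GB = 1024;

interface VramConsumer {
  name: string;
  megabytes: number;
}

const totalVramMb = 8 * MB_PER_GB;

const consumers: VramConsumer[] = [
  { name: "Qwen3 model weights + compute buffers", megabytes: 7 * MB_PER_GB },
  { name: "Chrome instances (5 x 28 MB)", megabytes: 5 * 28 },
  { name: "Xorg", megabytes: 128 },
];

// Subtract every known consumer from the card's total VRAM.
function headroomMb(totalMb: number, used: VramConsumer[]): number {
  const usedMb = used.reduce((sum, c) => sum + c.megabytes, 0);
  return totalMb - usedMb;
}

console.log(`Estimated headroom: ${headroomMb(totalVramMb, consumers)} MB`);
```

Under these assumptions well under 1 GB stays free before the MCP server is counted, which is consistent with CUDA OOM errors on the largest chunks.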
Total embedding time: ~2 hours across 3 passes.
- Sustained rate: ~1,000 vectors/min (GPU at 94% utilization, 148W)
- Per-doc rate: ~175 docs/min
- This is reasonable for a 0.6B Q8_0 model through node-llama-cpp on an RTX 3070.
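The per-doc rate follows from the vector rate and the average number of chunks per document. A small sketch of that arithmetic, using only the totals reported above (the function names are illustrative):

```typescript
// Totals reported in this document.
const totalDocuments = 25260;
const totalVectors = 144781;
const sustainedVectorsPerMin = 1000; // approximate sustained rate
const vectorsPerPass = [21216, 30304, 93261];

function averageChunksPerDoc(vectors: number, docs: number): number {
  return vectors / docs;
}

function docsPerMinute(vectorsPerMin: number, chunksPerDoc: number): number {
  return vectorsPerMin / chunksPerDoc;
}

// Sanity check: the three passes together should cover every vector.
const embeddedAcrossPasses = vectorsPerPass.reduce((sum, n) => sum + n, 0);

const chunksPerDoc = averageChunksPerDoc(totalVectors, totalDocuments);
console.log(`chunks per doc: ${chunksPerDoc.toFixed(1)}`);
console.log(`docs per minute: ${docsPerMinute(sustainedVectorsPerMin, chunksPerDoc).toFixed(1)}`);
console.log(`all vectors accounted for: ${embeddedAcrossPasses === totalVectors}`);
```

The result lands within one doc/min of the ~175 docs/min quoted above. The small gap comes from rounding the chunk average to 5.7.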
| Metric | Value |
|---|---|
| Queries improved | 1 |
| Queries regressed | 1 |
| Queries unchanged | 11 |
| Total latency before | 34,297 ms |
| Total latency after | 12,021 ms |
| Latency improvement | 65% lower (~2.9x faster) |
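The latency figure in the table can be reproduced from the two totals. The helper names below are illustrative and are not part of the benchmark tool:

```typescript
// Percentage by which latency dropped, relative to the "before" value.
function percentReduction(beforeMs: number, afterMs: number): number {
  return ((beforeMs - afterMs) / beforeMs) * 100;
}

// How many times faster the "after" run is.
function speedupFactor(beforeMs: number, afterMs: number): number {
  return beforeMs / afterMs;
}

const totalBeforeMs = 34297;
const totalAfterMs = 12021;

console.log(`${Math.round(percentReduction(totalBeforeMs, totalAfterMs))}% lower latency`);
console.log(`${speedupFactor(totalBeforeMs, totalAfterMs).toFixed(1)}x faster overall`);
```

A 65% reduction in total latency and a roughly 2.9x overall speedup are the same measurement. The larger 3.6-4.6x figures apply only to the individual structured queries.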
Keyword queries (3): Identical scores, identical top-5 results, similar latency. These use BM25 (no embedding), so the model change has no effect.
Semantic queries (3): Identical scores and results. The reranker (Qwen3-Reranker, unchanged) dominates scoring here.
Cross-collection queries (2): Identical scores and results.
Long-context queries (2): No change. "CLI architecture" returns 1 result, "daemon architecture" returns 0. These depend more on the reranker and query expansion model than embeddings.
Structured queries (3): This is where the upgrade shines:
- hyde: Telegram bot security - Same top score (0.93), but 3.6x faster (5632ms -> 1570ms). Different top-5 results (1/5 overlap) - the new model finds different relevant docs.
- expand: vector embeddings - Score improved 0.90 -> 0.91, 4.6x faster (12830ms -> 2783ms). Completely different top-5 (0/5 overlap).
- expand: file watching - Score regressed slightly 0.90 -> 0.88, 3.8x faster (11404ms -> 3029ms). Some overlap (2/5).
The upgrade's main impact is on structured queries (hyde/expand), which are 3.6-4.6x faster. This makes sense - the larger context window (32K vs 2K) means the query expansion model can process more context in fewer passes. Score quality is neutral to slightly positive. The basic search path (keyword, semantic) is unaffected since it uses BM25 + reranking rather than embedding similarity.
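The improved/regressed/unchanged counts and the top-5 overlap figures above could be derived with a comparison along these lines. This is a sketch and not the actual `benchmark-search` implementation. The `epsilon` threshold, the type names, and the document ids are all hypothetical.

```typescript
type Verdict = "improved" | "regressed" | "unchanged";

interface QueryRun {
  topScore: number;
  topDocs: string[]; // ranked result ids (hypothetical identifiers)
}

// Classify a query by the change in its top score.
// epsilon is an assumed tolerance so that float noise does not count as a change.
function verdict(before: QueryRun, after: QueryRun, epsilon = 0.005): Verdict {
  const delta = after.topScore - before.topScore;
  if (delta > epsilon) return "improved";
  if (delta < -epsilon) return "regressed";
  return "unchanged";
}

// Count how many of the first k results appear in both runs, ignoring rank order.
function topKOverlap(before: string[], after: string[], k = 5): number {
  const seen = new Set(before.slice(0, k));
  return after.slice(0, k).filter((id) => seen.has(id)).length;
}

// Scores taken from the "expand: file watching" query; the doc ids are made up.
const fileWatchingBefore: QueryRun = {
  topScore: 0.9,
  topDocs: ["doc-a", "doc-b", "doc-c", "doc-d", "doc-e"],
};
const fileWatchingAfter: QueryRun = {
  topScore: 0.88,
  topDocs: ["doc-b", "doc-x", "doc-d", "doc-y", "doc-z"],
};

console.log(
  verdict(fileWatchingBefore, fileWatchingAfter),
  topKOverlap(fileWatchingBefore.topDocs, fileWatchingAfter.topDocs),
); // prints "regressed 2"
```

A low overlap with an unchanged score, as in the hyde query, means the new model ranks different documents as equally relevant. The score alone does not show that, so tracking both is useful.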
Problem: provision-service only started stopped services. If a service was already running with old config, it stayed on the old config until manually restarted.
Fix: Track which units were rewritten and restart those specifically (commit 4423e081).
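The decision logic behind the fix can be sketched as a pure function. This is an illustration of the idea and not the provisioner's actual code. `UnitState`, `planAction`, and the content-comparison approach are assumptions.

```typescript
type UnitAction = "restart" | "start" | "none";

interface UnitState {
  name: string;
  installedContent: string | null; // null when the unit file is not on disk yet
  desiredContent: string;
  active: boolean;
}

// Decide what to do with a unit after its file has been (re)written.
function planAction(unit: UnitState): UnitAction {
  const rewritten = unit.installedContent !== unit.desiredContent;
  if (!unit.active) return "start"; // a stopped unit picks up the new file on start
  if (rewritten) return "restart"; // a running unit must restart to see new config
  return "none"; // running and unchanged: leave it alone
}

const exampleUnits: UnitState[] = [
  { name: "qmdctl.service", installedContent: "old", desiredContent: "new", active: true },
  { name: "unchanged.service", installedContent: "same", desiredContent: "same", active: true },
  { name: "stopped.service", installedContent: "same", desiredContent: "same", active: false },
];

for (const unit of exampleUnits) {
  console.log(`${unit.name}: ${planAction(unit)}`);
}
```

The old behavior corresponds to having only the `!unit.active` branch. A real provisioner would also need `systemctl daemon-reload` after rewriting unit files and before restarting, so that systemd sees the new definitions.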
Problem: server.ts hardcoded a subprocess ENV with only HOME and PATH. The QMD_EMBED_MODEL env var set in the systemd unit was never passed through to qmd query subprocesses, causing 768-dim query embeddings against a 1024-dim index (dimension mismatch error).
Fix: Pass through all QMD_* env vars from process.env to subprocess ENV. Also added the env vars to the qmdctl-ui systemd unit (commit 5173a9d6).
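A minimal sketch of the pass-through, assuming the subprocess ENV is built as a plain object. `buildSubprocessEnv` is a hypothetical helper name and the model value in the example is a placeholder. The actual code in `server.ts` may be structured differently.

```typescript
type Env = Record<string, string | undefined>;

// Build the subprocess environment: the original HOME and PATH entries,
// plus every variable whose name starts with the given prefix.
function buildSubprocessEnv(parentEnv: Env, prefix = "QMD_"): Record<string, string> {
  const env: Record<string, string> = {};
  for (const key of ["HOME", "PATH"]) {
    const value = parentEnv[key];
    if (value !== undefined) env[key] = value;
  }
  for (const [key, value] of Object.entries(parentEnv)) {
    if (key.startsWith(prefix) && value !== undefined) env[key] = value;
  }
  return env;
}

// Example parent environment; the values are placeholders.
const exampleEnv = buildSubprocessEnv({
  HOME: "/home/user",
  PATH: "/usr/bin",
  QMD_EMBED_MODEL: "placeholder-model-name",
  QMD_EXPAND_CONTEXT_SIZE: "4096",
  UNRELATED_SECRET: "not forwarded",
});

console.log(Object.keys(exampleEnv).sort().join(", "));
```

In a Node.js server the result would be passed as the `env` option of `child_process.spawn`, with `process.env` as the parent environment. Filtering by prefix keeps the subprocess environment minimal, and any `QMD_*` variable added to the systemd unit later is forwarded without another code change.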
QMD's MCP server has disposeModelsOnInactivity: true hardcoded with a 5-minute timeout. After 5 minutes of no queries, the embedding model unloads from VRAM. Reloading takes ~15 seconds. Options:
- Upstream QMD change to make this configurable via env var
- Keep-warm cron that sends a search query every few minutes
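For the keep-warm option, the scheduling decision might look like the sketch below. The 5-minute timeout comes from the text above. The safety margin and the `shouldPing` helper are assumptions, and how the warm-up query is actually sent to the MCP server is left out.

```typescript
const IDLE_TIMEOUT_MS = 5 * 60 * 1000; // unload timeout described above
const SAFETY_MARGIN_MS = 60 * 1000; // assumed buffer so the ping lands before the unload

// Return true when a warm-up query should be sent to keep the model resident.
function shouldPing(
  lastQueryAtMs: number,
  nowMs: number,
  idleTimeoutMs: number = IDLE_TIMEOUT_MS,
  safetyMarginMs: number = SAFETY_MARGIN_MS,
): boolean {
  return nowMs - lastQueryAtMs >= idleTimeoutMs - safetyMarginMs;
}

const lastQueryAt = 0;
console.log(shouldPing(lastQueryAt, 3 * 60 * 1000)); // 3 minutes idle: no ping needed yet
console.log(shouldPing(lastQueryAt, 4 * 60 * 1000)); // 4 minutes idle: ping now
```

A cron job or timer running once a minute could call this check and send a cheap search query only when needed. The trade-off is that the model then holds its VRAM permanently, which matters on a card with as little headroom as this one.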
Five browserctl Chrome instances use ~140 MB of RTX 3070 VRAM for compositing. The i7-9700 has an iGPU (UHD 630) but it appears BIOS-disabled. --disable-gpu flag would move Chrome rendering to software. Low priority since 140 MB is small relative to 8 GB.
The "daemon architecture" long-context query returns no results both before and after the upgrade. This may indicate a gap in the indexed content or a reranker threshold issue.