- Date: 2026-03-20
- Duration: ~4 hours (03:00 - 06:55 UTC)
- Host: artbird (RTX 3070, 8GB VRAM, i7-9700)

Upgrade QMD's embedding model from embeddinggemma-300M (2K context, 768-dim) to Qwen3-Embedding-0.6B (32K context, 1024-dim) to improve search quality, and measure impact with a before/after benchmark.
291cf0c8 feat: add benchmark-search subcommand and upgrade QMD embedding model
- Created `apps/qmdctl/qmdctl/run_benchmark_search.py`: 13 test queries across the keyword, semantic, cross-collection, long-context, and structured categories, with YAML output and a comparison mode
- Wired into `cli.py` and `pyproject.toml` as `qmdctl-benchmark-search`
- Updated `UNIT_QMDCTL` systemd unit with `QMD_EMBED_MODEL` and `QMD_EXPAND_CONTEXT_SIZE=4096` env vars
- Added QMD embedding model to `sandcastles.md`
- Updated `CLAUDE.md` with benchmark-search subcommand

4423e081 fix: restart services when systemd unit files change
- Provisioner now restarts running services when their unit file was rewritten (previously only started stopped services)
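
The restart rule from this commit can be sketched as a small decision function. This is a minimal illustration, not the actual code in `run_provision_service.py`: the names `Action`, `plan_service_action`, and `systemctl_commands` are hypothetical, and whether the real provisioner calls `systemctl` with `--user` or as root is not stated here, so the commands are shown without either.

```python
from enum import Enum
from typing import List


class Action(Enum):
    NONE = "none"
    START = "start"
    RESTART = "restart"


def plan_service_action(unit_changed: bool, is_active: bool) -> Action:
    """Decide what to do with a service after its unit file was (re)written."""
    if not is_active:
        # Previous behaviour: stopped services are started.
        return Action.START
    if unit_changed:
        # New behaviour: a running service whose unit file changed is restarted
        # so that it picks up the new environment variables.
        return Action.RESTART
    return Action.NONE


def systemctl_commands(service: str, unit_changed: bool, is_active: bool) -> List[List[str]]:
    """Translate the decision into systemctl invocations (returned, not executed)."""
    commands: List[List[str]] = []
    if unit_changed:
        # systemd must reload unit definitions before a start/restart sees the change.
        commands.append(["systemctl", "daemon-reload"])
    action = plan_service_action(unit_changed, is_active)
    if action is not Action.NONE:
        commands.append(["systemctl", action.value, service])
    return commands
```

Returning the command list instead of executing it keeps the decision logic testable without touching systemd.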
| Time (UTC) | Event |
|---|---|
| 03:02 | Verified GPU health, all 6 services running, embeddinggemma-300M active |
| 03:02 | Ran "before" benchmark — all 13 queries successful |
| 03:06 | Committed and pushed. Syncthing synced to artbird ~45s later |
| 03:06 | Reinstalled qmdctl package, ran provision-service, restarted qmdctl |
| 03:08 | Started qmd embed -f — Qwen3 model downloading (639MB, 7s) |
| 03:09 | CUDA binary rebuild triggered (node-llama-cpp, ~10 min) |
| 03:20 | Embed started — GPU at 94%, 148W, ~1000 vectors/min |
| 03:50 | First OOM crash at 21,216 vectors — CUDA OOM on large chunks |
| 03:55 | Discovered orphan embed processes holding VRAM |
| 04:07 | MCP server holding 7GB VRAM, blocking embed restarts |
| 04:08 | Killed MCP server, freed VRAM, restarted embed |
| 04:10 | Embed running again, steady progress |
| 05:00 | Another OOM crash at 51,200 vectors |
| 05:00 | Restarted embed — continued from where it left off |
| 05:20 | OOM crash at 81,504 vectors, restarted again |
| 05:30 | Set up auto-restart monitoring loop |
| 06:15 | OOM crash at 142,656 vectors, restarted for final batch |
| 06:48 | Embedding complete: 144,781 vectors from 25,260 documents |
| 06:50 | Restarted MCP server |
| 06:54 | Ran "after" benchmark |
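
Orphan processes and the MCP server holding VRAM (03:55 and 04:07 above) are the kind of thing `nvidia-smi`'s per-process query can surface. The sketch below is one way to list VRAM holders, not necessarily how it was done in this session; the helper names `parse_compute_apps` and `list_vram_holders` are hypothetical.

```python
import subprocess
from typing import List, Tuple

# Per-process GPU memory query; used_memory is reported in MiB with "nounits".
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> List[Tuple[int, str, int]]:
    """Parse nvidia-smi CSV rows into (pid, process_name, used_mib), largest first."""
    rows: List[Tuple[int, str, int]] = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue
        pid, mem = parts[0], parts[-1]
        # Skip rows where memory is unavailable (e.g. "[N/A]") or malformed.
        if not pid.isdigit() or not mem.isdigit():
            continue
        name = ",".join(parts[1:-1]).strip()
        rows.append((int(pid), name, int(mem)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def list_vram_holders() -> List[Tuple[int, str, int]]:
    """Run nvidia-smi and return the processes currently holding VRAM."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Running `list_vram_holders()` before restarting an embed would show whether a stale process still occupies most of the 8GB.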
| Metric | Value |
|---|---|
| Documents indexed | 25,260 |
| Vectors embedded | 144,781 (~5.7 chunks/doc avg) |
| Total embed time | ~2.5 hours (across multiple restarts) |
| Throughput | ~1,000 vectors/min at GPU 100% |
| VRAM usage | ~7,100 MiB (model + compute buffers) |
| GPU utilization | 94-100% sustained |
| OOM crashes | 4 (each recovered by restart) |
| Model file size | 639 MB (Q8_0 quantization) |
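
The derived figures in this table follow from the raw counts. A quick check, using only numbers reported above (the throughput is the approximate ~1,000 vectors/min figure):

```python
documents = 25_260
vectors = 144_781
throughput_per_min = 1_000  # approximate, from the table above

chunks_per_doc = vectors / documents
embed_hours = vectors / throughput_per_min / 60

print(f"{chunks_per_doc:.1f} chunks/doc")   # 5.7 chunks/doc
print(f"{embed_hours:.1f} h of pure embed")  # 2.4 h of pure embed
```

The ~2.4 h of pure embedding is consistent with the ~2.5 hours reported. Wall-clock time from embed start (03:20) to completion (06:48) was closer to 3.5 hours because of the crashes and restarts.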
Search queries (BM25, /api/search): Identical results before and after. Expected — the /api/search endpoint uses BM25 text search, not vector embeddings. The embedding model change does not affect these queries.
| Query | Before Score | After Score | Top-5 Overlap |
|---|---|---|---|
| click group commands | 0.88 | 0.88 | 5/5 |
| systemd unit file | 0.93 | 0.93 | 5/5 |
| pydantic model validation | 0.93 | 0.93 | 5/5 |
| auth token handling | 0.94 | 0.94 | 5/5 |
| async error handling | 0.89 | 0.89 | 5/5 |
| systemd deploy workflows | 0.97 | 0.97 | 5/5 |
| YAML config loading | 0.90 | 0.90 | 5/5 |
| notifications | 0.94 | 0.94 | 5/5 |
| CLI architecture | 0.97 | 0.97 | 1/1 |
| daemon architecture | 0.00 | 0.00 | 0/0 |
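
The Top-5 Overlap column can be read as "shared results / results returned". The sketch below is one plausible way to compute it, assuming each benchmark run stores an ordered list of result identifiers per query; the function name `top_k_overlap` is hypothetical and this is not necessarily how `run_benchmark_search.py` does it.

```python
from typing import List


def top_k_overlap(before: List[str], after: List[str], k: int = 5) -> str:
    """Return 'shared/total' for the top-k results of two runs of the same query."""
    top_before = set(before[:k])
    top_after = set(after[:k])
    shared = len(top_before & top_after)
    # The denominator is the number of results actually returned (capped at k),
    # which yields 1/1 for a single hit and 0/0 for an empty result set.
    total = max(len(before[:k]), len(after[:k]))
    return f"{shared}/{total}"
```

Overlap ignores ordering, so a 5/5 alone does not prove identical rankings; the unchanged scores in the table are what indicate that.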
Structured queries (hyde/expand, /api/query): All failed with 502 after the upgrade. Root cause: dimension mismatch (see below).
The Qwen3-Embedding model upgrade is only partially working. Embeddings were stored successfully (1024-dim), but vector search at query time fails with:
Dimension mismatch for query vector for the "embedding" column.
Expected 1024 dimensions but received 768.
Root cause: QMD's store.js hardcodes DEFAULT_EMBED_MODEL = "embeddinggemma" for vector lookups. When the MCP server generates query embeddings, it uses this default model name, which resolves to the old 768-dim embeddinggemma model, not the 1024-dim Qwen3 model that was used for indexing.
The QMD_EMBED_MODEL env var is correctly read by:
- `llm.js`: for downloading and loading the model file
- `embed` CLI command: for indexing
- `formatQueryForEmbedding()` / `formatDocForEmbedding()`: for prompt formatting

But it is NOT read by:
- `store.js` line 121: `searchVector` → always passes `DEFAULT_EMBED_MODEL = "embeddinggemma"`
- `store.js` structured search code → same hardcoded model name
- `index.js` line 121: `createStore()` → passes `DEFAULT_EMBED_MODEL` to `searchVec()`

Impact: /api/search (BM25 text) works fine. /api/query (hybrid with vectors) crashes on every request.
This is a QMD upstream bug. The fix needs store.js and index.js to read process.env.QMD_EMBED_MODEL (or derive the model name from the env var) when performing vector searches, not just at embed time.
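
The actual fix belongs in QMD's JavaScript, but the logic is small enough to sketch in Python, the language of the qmdctl tooling. The sketch shows the two pieces: resolving the model name from `QMD_EMBED_MODEL` with the hardcoded default as a fallback, and a guard that fails early with a clear message when the query vector and the index disagree on dimensions. The helper names `resolve_embed_model` and `check_query_dims` are hypothetical and do not exist in QMD.

```python
import os
from typing import Mapping, Optional, Sequence

DEFAULT_EMBED_MODEL = "embeddinggemma"  # the default named in store.js


def resolve_embed_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer QMD_EMBED_MODEL at query time, falling back to the default."""
    env = os.environ if env is None else env
    return env.get("QMD_EMBED_MODEL") or DEFAULT_EMBED_MODEL


def check_query_dims(query_vector: Sequence[float], index_dims: int) -> None:
    """Raise before the vector search if the query embedding cannot match the index."""
    if len(query_vector) != index_dims:
        raise ValueError(
            f"Dimension mismatch: index expects {index_dims} dimensions "
            f"but the query embedding has {len(query_vector)}"
        )
```

With the env var honoured at query time, a 1024-dim index would receive 1024-dim query vectors and the guard would never fire; without it, the guard would turn the 502 into an actionable error.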
- Dimension mismatch bug: `store.js` must respect `QMD_EMBED_MODEL` for query-time vector lookups, not just embed-time indexing.
- Model keep-alive: QMD's MCP server sets `disposeModelsOnInactivity: true` with a 5-min timeout. Models unload from VRAM after idle. No env var to override.
- Chrome GPU usage: 5-7 browserctl Chrome instances use ~140-200 MiB of RTX 3070 VRAM. artbird's i7-9700 iGPU (UHD 630) is BIOS-disabled. The `--disable-gpu` flag would move rendering to software.
- OOM resilience: `qmd embed` crashes on CUDA OOM for large chunks but doesn't retry. Each restart picks up where it left off (idempotent), but manual restarts are needed. A retry loop or smaller batch size would help.

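
Because restarts are idempotent, the retry loop suggested in the last item can be a thin wrapper around the command. The sketch below assumes that `qmd embed` exits with a non-zero status when it crashes on CUDA OOM, which was not verified here; the function name `run_with_retries` and its parameters are hypothetical.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_with_retries(
    command: Sequence[str],
    max_attempts: int = 10,
    delay_s: float = 15.0,
    run: Optional[Callable[[Sequence[str]], int]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run `command` until it exits 0; return the number of attempts used."""
    if run is None:
        # Default runner: execute the command and report its exit code.
        run = lambda cmd: subprocess.run(list(cmd)).returncode
    for attempt in range(1, max_attempts + 1):
        if run(command) == 0:
            return attempt
        if attempt < max_attempts:
            # Give the crashed process time to exit and release its VRAM.
            sleep(delay_s)
    raise RuntimeError(f"{' '.join(command)} still failing after {max_attempts} attempts")
```

A call such as `run_with_retries(["qmd", "embed"])` would replace the manual restarts. The delay between attempts matters because, as seen during this session, a leftover process holding VRAM makes the next attempt fail immediately.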
apps/qmdctl/qmdctl/run_benchmark_search.py (new)
apps/qmdctl/qmdctl/cli.py (add benchmark-search command)
apps/qmdctl/pyproject.toml (add script entry)
apps/qmdctl/CLAUDE.md (add subcommand docs)
apps/qmdctl/qmdctl/run_provision_service.py (env vars + restart fix)
sandcastles.md (add QMD model entry)
- Before: `/tmp/qmd-benchmarks/qmd-benchmark-before-20260320T030240Z.yaml`
- After: `/tmp/qmd-benchmarks/qmd-benchmark-after-20260320T065406Z.yaml`