@gordonmurray
Created May 27, 2026 19:38
SciFact nprobes sweep on Firn 0.7.1 — quality and latency identical at nprobes=100, 20, 8, 1 (MaxSim scoring dominates IVF probe)

Firn 0.7.1 — SciFact nprobes sweep

Follow-up to the baseline reproduction at https://gist.github.com/gordonmurray/97c10e02081fd2acf50283d5c53347ec.

Same dataset (BEIR SciFact), same model (lightonai/LateOn), same namespace, same IVF_PQ index. The only thing being varied is the query-time nprobes value.

Files

  • firn-beir-scifact-nprobes-sweep.md — the writeup. Results table for nprobes ∈ {100, 20, 8, 1}, plan analysis explaining why nprobes has no effect at this scale, and what this means for the tuning page.

SciFact on Firn 0.7.1 — nprobes sweep

Follow-on to the baseline reproduction at https://gist.github.com/gordonmurray/97c10e02081fd2acf50283d5c53347ec. Question being answered: does dropping nprobes from the configured 100 hurt recall, and how does latency move with it?

Headline: nprobes has no measurable effect on this workload. Quality is identical at every value. Latency is identical at every value. The IVF partition probe is not the bottleneck here — the multivector MaxSim scoring stage is.

Setup

  • Same encoded SciFact embeddings as the baseline run.
  • Same namespace, same IVF_PQ index (num_partitions default sqrt(rows) ≈ 72, num_sub_vectors=64).
  • Firnflow was restarted once before the sweep, so the first measured value (nprobes=100) starts with an empty foyer and an empty handle pool. Later values reuse the warm Lance state from that first run; the latency numbers show this makes no material difference.
  • Search-only re-runs (no re-encode, no re-upsert, no re-index).
  • 300 SciFact test queries, QUERY_CONCURRENCY=32, k=100.
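
For readers who want to reproduce the shape of the measurement, here is a minimal sketch of one sweep point: run every query once at a fixed nprobes with 32 requests in flight, then report QPS and latency percentiles. The `search` callable and its `k` / `nprobes` keyword arguments are hypothetical stand-ins, not the Firn client API, and `fake_search` only sleeps so the sketch runs on its own. The real bench script may measure things differently.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

QUERY_CONCURRENCY = 32  # value from the setup list
K = 100                 # value from the setup list


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 1..100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def run_sweep_point(search, queries, nprobes,
                    concurrency=QUERY_CONCURRENCY, k=K):
    """Run each query once at a fixed nprobes; latencies are in milliseconds."""

    def timed(query):
        start = time.perf_counter()
        search(query, k=k, nprobes=nprobes)  # hypothetical client call
        return (time.perf_counter() - start) * 1000.0

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, queries))
    wall_seconds = time.perf_counter() - wall_start

    return {
        "nprobes": nprobes,
        "qps": len(queries) / wall_seconds,
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
    }


def fake_search(query, k, nprobes):
    """Stand-in for the real search call so the sketch is self-contained."""
    time.sleep(0.001)
    return []


# Same nprobes values as the sweep, against the stub rather than Firn.
results = [run_sweep_point(fake_search, ["q"] * 64, n) for n in (100, 20, 8, 1)]
```

Wall-clock QPS is used here instead of summing per-query latencies, because concurrent queries overlap in time.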

Results

| nprobes | ndcg@10 | ndcg@100 | recall@10 | recall@100 | QPS | p50 | p95 | p99 |
|---|---|---|---|---|---|---|---|---|
| 100 | 0.7575 | 0.7742 | 0.9036 | 0.9767 | 3.1 | 9 887 ms | 13 772 ms | 15 640 ms |
| 20 | 0.7575 | 0.7742 | 0.9036 | 0.9767 | 3.1 | 9 928 ms | 13 099 ms | 14 845 ms |
| 8 | 0.7575 | 0.7742 | 0.9036 | 0.9767 | 3.1 | 9 772 ms | 15 025 ms | 16 665 ms |
| 1 | 0.7575 | 0.7742 | 0.9036 | 0.9767 | 3.1 | 10 096 ms | 13 255 ms | 14 618 ms |

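All four quality metrics in the table (ndcg@10, ndcg@100, recall@10 and recall@100) are bit-for-bit identical across all four values. The latency variation across rows (about 3% at p50, roughly 15% at p95 and p99) is well inside run-to-run noise on the same host.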
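
As a sanity check on the latency columns, the sketch below computes the relative spread of each percentile across the four rows and compares the reported QPS with what Little's law would imply for 32 queries in flight. The numbers are copied from the results table; using p50 as a stand-in for mean latency is an approximation, so the implied QPS is only expected to land near the reported 3.1, not on it.

```python
QUERY_CONCURRENCY = 32

# nprobes -> (p50, p95, p99) in milliseconds, copied from the results table.
LATENCY_MS = {
    100: (9887, 13772, 15640),
    20: (9928, 13099, 14845),
    8: (9772, 15025, 16665),
    1: (10096, 13255, 14618),
}


def relative_spread(values):
    """(max - min) / min, i.e. how far the worst row is from the best."""
    return (max(values) - min(values)) / min(values)


spread = {
    name: relative_spread([row[i] for row in LATENCY_MS.values()])
    for i, name in enumerate(("p50", "p95", "p99"))
}

# Little's law: throughput ~= requests in flight / latency per request.
implied_qps = {
    nprobes: QUERY_CONCURRENCY / (p50 / 1000.0)
    for nprobes, (p50, _, _) in LATENCY_MS.items()
}

for name, value in spread.items():
    print(f"{name} spread across rows: {value:.1%}")
for nprobes, qps in implied_qps.items():
    print(f"nprobes={nprobes}: implied QPS ~ {qps:.2f}")
```

The implied throughput sits between about 3.17 and 3.27 QPS for every row, which is consistent with the reported 3.1 given that mean latency is likely a little above the median.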

Why nprobes does nothing here

Firnflow logs the underlying Lance execution plan for each query. Sample for one query (whitespace inserted for readability):

Projection(Take(CoalesceBatches(SortExec(TopK)(
  MultivectorScoring(
    SortExec(TopK)(ANNSubIndex(ANNIVFPartition)),
    ... × 12-22 sub-vectors ...
  )
))))
output_rows=100
iops=0  requests=0  bytes_read=0  indices_loaded=0  parts_loaded=0
index_comparisons=19_065_888    (range across queries: 19M-22M)

What this says:

  • The IVF index is being used. ANNIVFPartition nodes appear once per query sub-vector, fan into a single MultivectorScoring node, then top-K and projection on top.
  • No IO is happening. iops=0, requests=0, bytes_read=0, indices_loaded=0. The IVF index and the document fragments are already resident in memory from the warmup; no S3, no page-cache miss.
  • Cost is in scoring, not in probing. index_comparisons is ~19-22 million per query, so the 300-query run performs roughly 6 billion comparisons in total. With 32 queries in flight at once, the MaxSim stage saturates the 16-core CPU regardless of how aggressively you prune at the partition level.

So nprobes is asking "look at how many IVF partitions per query sub-vector before scoring", and the answer is "look at all of them if you want, the scoring step is going to dominate either way".
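
To make the scoring cost concrete, here is a minimal pure-Python sketch of textbook MaxSim (late-interaction) scoring: for each query vector, take the best dot product against any vector in the document, then sum those maxima. This illustrates the technique only. It is not Lance's MultivectorScoring implementation, and `count_comparisons` is an illustrative tally that may not match how index_comparisons is counted in the plan output.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vectors, doc_vectors):
    """Sum over query vectors of the best match among the document's vectors."""
    return sum(
        max(dot(q, d) for d in doc_vectors)
        for q in query_vectors
    )


def count_comparisons(query_vectors, docs):
    """Vector-to-vector comparisons needed to score every doc exhaustively."""
    return sum(len(query_vectors) * len(doc) for doc in docs)


# Toy data: a 2-vector query against two 2-vector documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [1.0, 0.0]]

scores = {"doc_a": maxsim(query, doc_a), "doc_b": maxsim(query, doc_b)}
ranked = sorted(scores, key=scores.get, reverse=True)
total = count_comparisons(query, [doc_a, doc_b])
```

The comparison count grows with (query vectors) × (document vectors) × (candidate documents), which is why the work lands in this stage once candidates reach it, whatever happened at the partition level.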

What this means for the tuning page

For SciFact-scale multivector workloads on Firn 0.7.1:

  • nprobes is not a latency knob. Setting it high (the reference config's 100) doesn't cost anything. Setting it low (PLAID's 8, or even 1) doesn't help. Leave it at whatever default feels sensible; the script default of 20 is fine.
  • nprobes is not a recall knob either, at this scale. With ~72 IVF partitions and 12-22 query sub-vectors, even nprobes=1 per sub-vector reaches enough partitions in aggregate that every quality metric matches the nprobes=100 run exactly.
  • The real latency knob is the scoring stage. That's controlled by num_sub_vectors (PQ codebook coarseness — fewer sub-vectors = cheaper but lower-fidelity MaxSim) and by corpus size. A follow-up sweep on num_sub_vectors is the natural next step.
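
A rough way to see the aggregate-coverage argument is to bound how many distinct IVF partitions one query can touch. The sketch below assumes the ~72 partitions and 12-22 query sub-vectors quoted above, and assumes each sub-vector probes nprobes partitions independently. It computes an upper bound only: how much the probed sets actually overlap is not something the plan output shows.

```python
NUM_PARTITIONS = 72  # approximate value from the setup section


def max_partitions_touched(nprobes, query_vectors,
                           num_partitions=NUM_PARTITIONS):
    """Upper bound on distinct partitions probed by one multivector query."""
    per_vector = min(nprobes, num_partitions)  # cannot probe more than exist
    return min(num_partitions, per_vector * query_vectors)


for nprobes in (100, 20, 8, 1):
    low = max_partitions_touched(nprobes, 12)
    high = max_partitions_touched(nprobes, 22)
    print(f"nprobes={nprobes}: at most {low}-{high} of {NUM_PARTITIONS} partitions")
```

Under these assumptions nprobes=8 can already cover all 72 partitions in aggregate, and nprobes=1 can reach up to 12-22 of them (about 17-31% of the index) per query.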

This is a SciFact-only finding. On other BEIR corpora (NFCorpus, FIQA, Quora), especially the much larger ones, the IVF probe stage is probably where the latency story shifts. Worth re-running there to find where nprobes actually starts to matter.

Host

  • 12th Gen Intel i9-12900KF, 16 cores / 24 threads
  • 30 GB RAM
  • MinIO + firnflow + bench all colocated, loopback networking
  • Firn 0.7.1 (ghcr.io/gordonmurray/firnflow:0.7.1)