Same dataset (BEIR SciFact), same model (lightonai/LateOn), same
namespace, same IVF_PQ index. The only thing being varied is the
query-time nprobes value.
Files
firn-beir-scifact-nprobes-sweep.md — the writeup. Results
table for nprobes ∈ {100, 20, 8, 1}, plan analysis explaining
why nprobes has no effect at this scale, and what this means
for the tuning page.
Headline: nprobes has no measurable effect on this workload.
Quality is identical at every value. Latency is identical at every
value. The IVF partition probe is not the bottleneck here — the
multivector MaxSim scoring stage is.
Setup
Same encoded SciFact embeddings as the baseline run.
Same namespace, same IVF_PQ index (num_partitions default
sqrt(rows) ≈ 72, num_sub_vectors=64).
Firnflow was restarted once before the sweep started, so the first
measured value (nprobes=100) hits an empty foyer + empty handle
pool. Subsequent values share that warm Lance state; the latency
numbers show this doesn't materially change anything.
Search-only re-runs (no re-encode, no re-upsert, no re-index).
300 SciFact test queries, QUERY_CONCURRENCY=32, k=100.
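As a quick sanity check on the partition count quoted above, the sqrt(rows) default can be reproduced by hand. This is a minimal sketch, not Firnflow or Lance code: the row count of 5,183 is the published BEIR SciFact corpus size rather than a number taken from this run, and rounding to the nearest integer is an assumption about how the default is derived.

```python
import math

# Assumption: 5,183 is the published BEIR SciFact corpus size,
# not a value read back from the index in this run.
ROWS = 5183


def default_num_partitions(n_rows: int) -> int:
    """sqrt(rows) heuristic; rounding to nearest is an assumption."""
    return round(math.sqrt(n_rows))


num_partitions = default_num_partitions(ROWS)
avg_rows_per_partition = ROWS / num_partitions

print(f"num_partitions ~= {num_partitions}")
print(f"avg rows per partition ~= {avg_rows_per_partition:.1f}")
```

With roughly 72 partitions of roughly 72 rows each, every partition is small, which is worth keeping in mind when reading the results.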
Results
| nprobes | ndcg@10 | ndcg@100 | recall@10 | recall@100 | QPS | p50 (ms) | p95 (ms) | p99 (ms) |
|---|---|---|---|---|---|---|---|---|
| 100 | 0.7575 | 0.7742 | 0.9036 | 0.9767 | 3.1 | 9887 | 13772 | 15640 |
| 20 | 0.7575 | 0.7742 | 0.9036 | 0.9767 | 3.1 | 9928 | 13099 | 14845 |
| 8 | 0.7575 | 0.7742 | 0.9036 | 0.9767 | 3.1 | 9772 | 15025 | 16665 |
| 1 | 0.7575 | 0.7742 | 0.9036 | 0.9767 | 3.1 | 10096 | 13255 | 14618 |
map, recall@10 and recall@100 are bit-for-bit identical across all
four values. The latency variation across rows is well inside
run-to-run noise on the same host.
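The reported QPS and latency figures can be cross-checked against each other with Little's law: with a fixed number of queries in flight, throughput is roughly concurrency divided by mean latency. The sketch below uses the reported p50 as a stand-in for the mean, which is an assumption (the mean is not reported and is probably somewhat higher than the median), so the implied QPS should land slightly above the measured 3.1.

```python
# Values copied from the results: nprobes -> (p50 latency in ms, measured QPS).
QUERY_CONCURRENCY = 32
RESULTS = {
    100: (9887, 3.1),
    20: (9928, 3.1),
    8: (9772, 3.1),
    1: (10096, 3.1),
}


def implied_qps(concurrency: int, latency_ms: float) -> float:
    """Little's law: throughput = in-flight requests / time per request."""
    return concurrency / (latency_ms / 1000.0)


# Assumption: p50 approximates the mean latency.
estimates = {
    nprobes: implied_qps(QUERY_CONCURRENCY, p50)
    for nprobes, (p50, _) in RESULTS.items()
}

for nprobes, est in estimates.items():
    measured = RESULTS[nprobes][1]
    print(f"nprobes={nprobes:>3}: implied {est:.2f} QPS, measured {measured}")
```

The implied values agree with the measured throughput to within a few percent, so the roughly 10 s per-query latency is consistent with a saturated server working through 32 queries at a time rather than with a measurement artifact.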
Why nprobes does nothing here
Firnflow logs the underlying Lance execution plan for each query.
Sample for one query (whitespace inserted for readability):
The IVF index is being used. ANNIVFPartition nodes appear
once per query sub-vector, fan into a single MultivectorScoring
node, then top-K and projection on top.
No IO is happening. iops=0, requests=0, bytes_read=0,
indices_loaded=0. The IVF index and the document fragments are
already resident in memory from the warmup; no S3, no page-cache
miss.
Cost is in scoring, not in probing. index_comparisons is
~19-22 million per query. Across 300 queries that is roughly
6 billion comparisons, and with 32 queries in flight the MaxSim
stage saturates the 16-core CPU regardless of how aggressively
you prune at the partition level.
So nprobes is asking "look at how many IVF partitions per query
sub-vector before scoring", and the answer is "look at all of them
if you want, the scoring step is going to dominate either way".
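To make the scoring cost concrete, here is a minimal NumPy sketch of MaxSim late-interaction scoring. It is not the Lance MultivectorScoring implementation: the function names, shapes, and the uncompressed float vectors are illustrative assumptions (the real index scores against PQ codes). The point is only that the work grows with query vectors times document vectors times candidate documents, independent of how the candidates were found.

```python
import numpy as np


def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score: for each query vector take its best-matching
    document vector, then sum those maxima over the query vectors."""
    sims = query_vecs @ doc_vecs.T  # shape: (n_query_vecs, n_doc_vecs)
    return float(sims.max(axis=1).sum())


def comparison_count(n_query_vecs: int, candidate_doc_lens: list[int]) -> int:
    """Vector-to-vector comparisons needed to score every candidate document."""
    return n_query_vecs * sum(candidate_doc_lens)


# Tiny deterministic example: 2 query vectors, 3 document vectors.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc = np.array([[1.0, 0.0], [0.0, 0.5], [0.5, 0.5]])
score = maxsim(query, doc)  # best matches are 1.0 and 0.5

# Hypothetical sizes, chosen only to show how quickly the count grows.
cost = comparison_count(n_query_vecs=16, candidate_doc_lens=[200, 300])

print(score, cost)
```

Whether index_comparisons in the plan counts exactly these pairs is not confirmed by the source; the sketch only illustrates why a MaxSim stage can dominate once the index is fully in memory.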
What this means for the tuning page
For SciFact-scale multivector workloads on Firn 0.7.1:
nprobes is not a latency knob. Setting it high (the
reference config's 100) doesn't cost anything. Setting it low
(PLAID's 8, or even 1) doesn't help. Leave it at whatever default
feels sensible; the script default of 20 is fine.
nprobes is not a recall knob either, at this scale. With
~72 IVF partitions and 12-22 query sub-vectors, even
nprobes=1 per sub-vector reaches enough partitions in aggregate
that all top-100 docs are recovered.
The real latency knob is the scoring stage. That's controlled
by num_sub_vectors (PQ codebook coarseness — fewer sub-vectors
= cheaper but lower-fidelity MaxSim) and by corpus size. A
follow-up sweep on num_sub_vectors is the natural next step.
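The "enough partitions in aggregate" argument can be put in rough numbers. The sketch below estimates how many distinct IVF partitions a query touches when each of its vectors probes nprobes partitions. It assumes every query vector picks its partitions uniformly at random and independently of the others, which real query vectors do not do (they tend to cluster), so treat the result as an optimistic estimate of coverage, not a prediction of recall.

```python
def expected_partitions_covered(
    num_partitions: int, n_query_vecs: int, nprobes: int
) -> float:
    """Expected number of distinct partitions probed by one query.

    Assumption: each query vector probes `nprobes` partitions chosen
    uniformly at random, independently of the other query vectors.
    """
    nprobes = min(nprobes, num_partitions)
    # Probability that a given partition is missed by every query vector.
    miss = (1 - nprobes / num_partitions) ** n_query_vecs
    return num_partitions * (1 - miss)


NUM_PARTITIONS = 72  # sqrt(rows) default from the setup

for nprobes in (1, 8, 20, 100):
    low = expected_partitions_covered(NUM_PARTITIONS, 12, nprobes)
    high = expected_partitions_covered(NUM_PARTITIONS, 22, nprobes)
    print(f"nprobes={nprobes:>3}: ~{low:.1f} to ~{high:.1f} of {NUM_PARTITIONS}")
```

Under these assumptions nprobes=1 still touches roughly 11 to 19 of the 72 partitions per query, and nprobes=8 already covers most of the index, which fits the observation that the quality metrics do not move.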
This is a SciFact-only finding. On other corpora (NFCorpus, FiQA,
Quora), and the much larger ones in particular, the IVF probe stage
is probably where the latency story shifts. Worth re-running there
to find where nprobes actually starts to matter.
Host
12th Gen Intel i9-12900KF, 16 cores / 24 threads
30 GB RAM
MinIO + firnflow + bench all colocated, loopback networking