Rally track branch: msmarco_18_36_recall (in msmarco-v2-vector/)
Demonstrate linear scale-out of Elasticsearch vector search using the msmarco-v2-vector Rally track (~138M Cohere 1024-dim embeddings) with GPU indexing. Four tiers double data and nodes at each step, keeping data-per-node roughly constant:
| Tier | Documents | Data Nodes | Shards | Corpus Groups (base64) |
|---|---|---|---|---|
| 1/8 | 18,000,000 | 1 | 1 | 1 |
| 1/4 | 36,000,000 | 2 | 2 | 1–2 |
| 1/2 | 72,000,000 | 4 | 4 | 1–4 |
| Full | 138,000,000 | 8 | 8 | 1–8 |
Each base64 corpus group holds 18M docs (6 files × 3M), except group 8 which holds 12M (4 × 3M). The remaining 364,198 docs (file 47, the "parallel" corpus) are excluded from initial indexing and recall measurement.
Rally's KnnRecallRunner compares approximate kNN hits against brute-force
ground truth stored in queries-recall-*.json.bz2. The ground truth
differs per subset because the true nearest neighbors depend on which
documents are in the index.
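The recall metric itself is a plain top-k set overlap between the approximate hits and the brute-force neighbors. A minimal sketch, assuming both result sets are available as ranked doc-ID lists (the function and variable names are illustrative, not the runner's actual API):

```python
def knn_recall(approx_ids, truth_ids, k):
    """Fraction of the true top-k nearest neighbors found by the approximate search."""
    truth_top_k = set(truth_ids[:k])
    hits = sum(1 for doc_id in approx_ids[:k] if doc_id in truth_top_k)
    return hits / k

# Example: the approximate search recovers 98 of the true top-100
truth = [f"doc_{i}" for i in range(1000)]
approx = truth[:98] + ["other_1", "other_2"]
print(knn_recall(approx, truth, k=100))  # → 0.98
```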
Existing ground truth files:
| File | Corpus size | Top-K per query | Queries |
|---|---|---|---|
| queries-recall.json.bz2 | Full (138M) | 1000 | 76 |
| queries-recall-10m.json.bz2 | 10M | 100 | 76 |
New ground truth files to generate:
| File | Corpus size | Top-K per query | Queries |
|---|---|---|---|
| queries-recall-18m.json.bz2 | 18M | 1000 | 76 |
| queries-recall-36m.json.bz2 | 36M | 1000 | 76 |
| queries-recall-72m.json.bz2 | 72M | 1000 | 76 |
| queries-recall-full.json.bz2 | Full (138M) | 1000 | 76 |
All files use the same 76 queries (from msmarco-passage-v2/trec-dl-2022/judged
with Cohere embed-english-v3.0 embeddings). Only the ids (nearest neighbor
lists) change per subset.
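For reference, each line of these files is one JSON record per query; the verification snippet further down shows that the `ids` field is a ranked list of [doc_id, score] pairs. A sketch of one abbreviated record (the query-ID and embedding field names here are assumptions, and the embedding is truncated):

```json
{"query_id": "2000138", "emb": [0.0123, -0.0456], "ids": [["msmarco_doc_01_00000000", 0.731058], ["msmarco_doc_01_00000042", 0.729871]]}
```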
The full dataset generation (queries-recall-full.json.bz2) serves as a
validation baseline — compare it against the existing queries-recall.json.bz2
to confirm correctness of the generation process at full scale.
The ground truth script uses train[:N] on the HuggingFace dataset, while
Rally ingests from pre-built corpus files. These produce the same document
set because:
- Corpus file generation (`_tools/parse_documents.py`) slices the HF `train` split sequentially: `train[0:3M]` → file 01, `train[3M:6M]` → file 02, etc.
- Rally's bulk operation reads corpus files in the order listed in the `corpora` array (01, 02, 03, ...) and reads documents sequentially within each file. `ingest-doc-count` is a global total that simply stops the sequential read after N documents (see `PartitionBulkIndexParamSource` in Rally's `esrally/track/params.py`).
- `generate_ground_truth.py` uses `train[:N]`, which is `train[0:N]`: the same first N documents from HuggingFace, in the same order.
For each subset tier the doc count equals the total docs in the listed corpora (18M = 6 files × 3M in group 1; 36M = 12 files × 3M in groups 1–2; 72M = 24 files × 3M in groups 1–4), so Rally ingests every document in those corpus groups with no partial file reads. The ground truth and the indexed data therefore contain identical document sets.
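As a sanity check, the tier totals follow directly from the corpus layout described above (6 files of 3M docs per group, 4 files in group 8). A quick sketch:

```python
# Verify each tier's ingest-doc-count lands exactly on corpus-file boundaries
# (3M docs per file), so Rally never needs a partial file read.
DOCS_PER_FILE = 3_000_000
FILES_PER_GROUP = {g: 6 for g in range(1, 8)} | {8: 4}  # group 8 holds only 12M docs

def docs_in_groups(n_groups):
    """Total documents in corpus groups 1..n_groups."""
    return sum(FILES_PER_GROUP[g] for g in range(1, n_groups + 1)) * DOCS_PER_FILE

for n_groups, expected in [(1, 18_000_000), (2, 36_000_000), (4, 72_000_000), (8, 138_000_000)]:
    total = docs_in_groups(n_groups)
    assert total == expected and total % DOCS_PER_FILE == 0
    print(f"groups 1-{n_groups}: {total:,} docs")
```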
Rather than standing up an Elasticsearch instance for brute-force script_score queries, we compute exact dot-product similarities directly in numpy:
- Load the 76 query embeddings from `queries-recall.json.bz2`
- Stream document vectors from `Cohere/msmarco-v2-embed-english-v3` on HF
- L2-normalize each document vector (matching `vg.normalize` in the corpus pipeline)
- Compute `sigmoid(1, e, -dot(q, d))` to match the ES script_score formula
- Maintain per-query top-1000 via numpy `argpartition`
- Write JSONL output matching the existing ground truth format

See `_tools/generate_ground_truth.py` for usage.
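The streaming top-k step above can be sketched in numpy as follows. This is a simplified illustration of the approach, not the script itself (batching, I/O, and the HF streaming loop are omitted):

```python
import numpy as np

def update_top_k(queries, doc_vecs, doc_ids, top_scores, top_ids, k=1000):
    """Fold one batch of document vectors into the running per-query top-k.

    queries:    (Q, D) float64 query embeddings
    doc_vecs:   (B, D) float64 raw document embeddings for this batch
    doc_ids:    list of B document ID strings
    top_scores: list of Q float64 arrays (running top-k scores per query)
    top_ids:    list of Q lists of doc IDs, aligned with top_scores
    """
    # L2-normalize documents, matching vg.normalize in the corpus pipeline
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    # sigmoid(1, e, -dot(q, d)) reduces to the logistic function 1 / (1 + e^-dot)
    scores = 1.0 / (1.0 + np.exp(-(queries @ doc_vecs.T)))  # shape (Q, B)
    for q in range(queries.shape[0]):
        merged = np.concatenate([top_scores[q], scores[q]])
        merged_ids = top_ids[q] + doc_ids
        if merged.size <= k:
            keep = np.argsort(merged)[::-1]
        else:
            keep = np.argpartition(merged, -k)[-k:]      # unordered top-k, linear time
            keep = keep[np.argsort(merged[keep])[::-1]]  # then sort only those k
        top_scores[q] = merged[keep]
        top_ids[q] = [merged_ids[i] for i in keep]
    return top_scores, top_ids
```

`argpartition` keeps the per-batch cost linear in the batch size; only the k survivors are fully sorted, which is why the matmul, not the top-k bookkeeping, dominates compute.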
The bottleneck is dataset download and Arrow decoding, not dot-product compute.
The actual matrix multiplication for 36M docs × 76 queries × 1024 dims takes
under 2 minutes on a modern CPU. Downloading and decoding the HuggingFace
dataset (Arrow format, ~50–100 GB for 36M rows selecting only _id + emb)
takes 10–30 minutes depending on network bandwidth and decode parallelism.
Recommendation: c7i.4xlarge (16 vCPU, 32 GB RAM, 12.5 Gbps, ~$0.71/hr)
- 32 GB RAM is plenty (working set is ~2 GB per batch; total peaks ~4 GB)
- 16 vCPUs provide enough parallelism for HF dataset decoding (`--workers 8`), and numpy uses the remaining cores for matmul via OpenBLAS
- 12.5 Gbps network saturates HF download throughput
- Total cost: well under $1 for all runs combined
Do more vCPUs help?
Marginally. The datasets library parallelizes Arrow shard decoding across
num_proc workers, so more CPUs help with decode/parse. numpy matmul also
benefits from more cores (OpenBLAS thread pool). But the gains flatten beyond
~16 cores because:
- HF download bandwidth is the primary constraint, not CPU
- The per-query top-K maintenance (numpy argpartition over 76 queries × top-1000) is cheap and gains nothing from extra cores
- Arrow decode parallelism has diminishing returns beyond 8–12 workers
A c7i.8xlarge (32 vCPU, $1.43/hr) would be ~20–30% faster for decode but
costs twice as much; not worth it for a one-off job that completes in under
an hour regardless.
# On the AWS instance — create an isolated Python environment:
python3 -m venv ground-truth-env
source ground-truth-env/bin/activate
pip install datasets numpy huggingface_hub
# Point the HuggingFace cache to a disk with enough space.
# load_dataset downloads all parquet shards (~500 GB+) before selecting rows,
# and the default cache (~/.cache/huggingface/) is often on a small root
# filesystem. Set these to a path on your large data volume:
export HF_HOME=/home/esbench/.rally/hf_cache
export HF_DATASETS_CACHE=/home/esbench/.rally/hf_cache/datasets
# 18M ground truth
python msmarco-v2-vector/_tools/generate_ground_truth.py \
--doc-count 18000000 \
--output queries-recall-18m.json \
--workers 8
bzip2 queries-recall-18m.json
# 36M ground truth
python msmarco-v2-vector/_tools/generate_ground_truth.py \
--doc-count 36000000 \
--output queries-recall-36m.json \
--workers 8
bzip2 queries-recall-36m.json
# 72M ground truth
python msmarco-v2-vector/_tools/generate_ground_truth.py \
--doc-count 72000000 \
--output queries-recall-72m.json \
--workers 8
bzip2 queries-recall-72m.json
# Full dataset ground truth (for validation against existing file).
# Use 138,000,000 (not 138,364,198) — the extra 364,198 docs are in
# file 47 (the "parallel" corpus), which is NOT included in the default
# initial-indexing corpora (groups 1-8). The existing
# queries-recall.json.bz2 was generated against 138M docs.
python msmarco-v2-vector/_tools/generate_ground_truth.py \
--doc-count 138000000 \
--output queries-recall-full.json \
--workers 8
bzip2 queries-recall-full.json
# Clean up when done
deactivate
rm -rf ground-truth-env
rm -rf /home/esbench/.rally/hf_cache

After generation, verify format consistency:
python -c "
import bz2, json
for f in ['queries-recall-18m.json.bz2', 'queries-recall-36m.json.bz2',
'queries-recall-72m.json.bz2', 'queries-recall-full.json.bz2']:
with bz2.open(f, 'r') as fh:
lines = fh.readlines()
print(f'{f}: {len(lines)} queries')
q = json.loads(lines[0])
print(f' keys: {list(q.keys())}')
print(f' ids len: {len(q[\"ids\"])}')
print(f' top score: {q[\"ids\"][0][1]}, bottom: {q[\"ids\"][-1][1]}')
"

Validate the generated ground truth against the existing files to confirm correctness. The existing files were produced by ES brute-force script_score; ours use numpy float64 matmul. Expect near-100% overlap, with any differences at score-tie boundaries due to floating-point accumulation order.
10M validation (compare top-100, since the existing file stores 100 neighbors):
python msmarco-v2-vector/_tools/compare_ground_truth.py \
msmarco-v2-vector/queries-recall-10m.json.bz2 \
queries-recall-10m.json.bz2

Output:
Average overlap: 100.0/100 (99.97%)
Worst overlap: 98/100 (query 2025747)
Max score diff on shared docs: 0.0000001000
NEAR MATCH — tiny differences likely from float precision in score ties.
The 2 mismatched docs in query 2025747 all have score 0.604882 — a tie-break
cluster where float32 (ES script_score) and float64 (numpy) accumulation order
produces different rankings.
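The comparison itself boils down to a per-query set overlap of the top-k doc IDs. A sketch of that core logic (this is not the actual compare_ground_truth.py code; it assumes each JSONL line's `ids` field is a ranked list of [doc_id, score] pairs and that both files list queries in the same order):

```python
import bz2
import json

def overlap_per_query(file_a, file_b, k):
    """Yield the top-k doc-ID overlap for each query across two ground truth files.

    Assumes each JSONL line has an "ids" field of ranked [doc_id, score]
    pairs and that both files contain the same queries in the same order.
    """
    def opener(path):
        return bz2.open(path, "rt") if path.endswith(".bz2") else open(path)

    with opener(file_a) as fa, opener(file_b) as fb:
        for line_a, line_b in zip(fa, fb):
            ids_a = {doc for doc, _score in json.loads(line_a)["ids"][:k]}
            ids_b = {doc for doc, _score in json.loads(line_b)["ids"][:k]}
            yield len(ids_a & ids_b)

# Average and worst overlap, as reported in the outputs above:
# overlaps = list(overlap_per_query("a.json.bz2", "b.json.bz2", 1000))
# print(sum(overlaps) / len(overlaps), min(overlaps))
```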
Full dataset validation (compare top-1000):
The existing queries-recall.json.bz2 was generated against an ES index that
included the parallel corpus (file 47, 364,198 docs with 69_ prefix IDs),
totaling 138,364,198 documents. Our queries-recall-full.json was generated
with --doc-count 138000000 (groups 1-8 only, no parallel corpus).
python msmarco-v2-vector/_tools/compare_ground_truth.py \
msmarco-v2-vector/queries-recall.json.bz2 \
queries-recall-full.json

Output:
Average overlap: 999.5/1000 (99.95%)
Worst overlap: 992/1000 (query 2033396)
Max score diff on shared docs: 0.0000001000
NEAR MATCH — tiny differences likely from float precision in score ties.
All mismatches follow the same pattern: docs "only in A" (the existing file) have
69_-prefix IDs — these are neighbors from the 364K parallel corpus that our
generation excluded. Docs "only in B" (our file) are borderline docs pushed into
the top-1000 because the 69_ docs are absent. For example, query 2033396
(worst case, 992/1000 overlap) has 8 69_-prefix docs in the existing file at
ranks 258–792 with scores well above the score boundary; without them, 8
lower-scoring docs from other files fill the tail.
The `Max score diff on shared docs: 0.0000001` line confirms that scoring is identical for documents present in both sets, so the generation script is correct.
When running the benchmark, verify that Rally picked the correct recall file
and ingested the expected number of documents. Two log lines in Rally's log
file (~/.rally/logs/rally.log) confirm this:
1. Recall file selection — emitted by KnnRecallRunner when the recall
operation runs. Look for:
grep "recall_doc_set" ~/.rally/logs/rally.log
Expected output per tier:
recall_doc_set='18m' (type=str), using recall file: queries-recall-18m.json.bz2
recall_doc_set='36m' (type=str), using recall file: queries-recall-36m.json.bz2
recall_doc_set='72m' (type=str), using recall file: queries-recall-72m.json.bz2
recall_doc_set='full' (type=str), using recall file: queries-recall.json.bz2
The recall-doc-set parameter is set automatically in operations/default.json
based on initial_indexing_ingest_doc_count. If recall_doc_set shows an
unexpected value (e.g. -1 or wrong tier), check the parameter file.
2. Recall results — emitted after each recall operation completes:
grep "Recall results" ~/.rally/logs/rally.log
This shows avg_recall, min_recall, k, num_candidates, and NDCG scores.
3. Bulk indexing completion — Rally logs the total number of docs indexed when the bulk operation finishes. Search for the operation name:
grep "initial-documents-indexing" ~/.rally/logs/rally.log
Verify the total doc count matches the expected tier size (18M, 36M, 72M, or 138M).
Add filename constants for the new recall files and extend KnnRecallRunner to select the right file based on the `recall-doc-set` parameter. Extend the selection logic to detect the 18M, 36M, and 72M doc counts.
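A sketch of what that selection could look like (the constant and function names here are hypothetical, not the actual track code; only the doc-set-to-filename mapping follows the expected log lines listed earlier):

```python
# Hypothetical sketch of the recall-file selection; names are illustrative.
RECALL_FILES = {
    "10m": "queries-recall-10m.json.bz2",
    "18m": "queries-recall-18m.json.bz2",
    "36m": "queries-recall-36m.json.bz2",
    "72m": "queries-recall-72m.json.bz2",
}

def select_recall_file(recall_doc_set):
    # Any unrecognized value (including "full") falls back to the full-corpus file.
    return RECALL_FILES.get(recall_doc_set, "queries-recall.json.bz2")

print(select_recall_file("36m"))   # → queries-recall-36m.json.bz2
print(select_recall_file("full"))  # → queries-recall.json.bz2
```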
New GPU-scaling parameter files (in params/):
| File | Docs | Shards | Index Type |
|---|---|---|---|
| params/params-gpu-scaling-1node-18m.json | 18,000,000 | 1 | hnsw + GPU |
| params/params-gpu-scaling-2node-36m.json | 36,000,000 | 2 | hnsw + GPU |
| params/params-gpu-scaling-4node-72m.json | 72,000,000 | 4 | hnsw + GPU |
| params/params-gpu-scaling-8node-full.json | 138,000,000 | 8 | hnsw + GPU |
# Tier 1: 1 data node, 18M docs
esrally race --track-path=/path/to/msmarco-v2-vector \
--track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-1node-18m.json \
--target-hosts=<1-node-cluster> \
--pipeline=benchmark-only
# Tier 2: 2 data nodes, 36M docs
esrally race --track-path=/path/to/msmarco-v2-vector \
--track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-2node-36m.json \
--target-hosts=<2-node-cluster> \
--pipeline=benchmark-only
# Tier 3: 4 data nodes, 72M docs
esrally race --track-path=/path/to/msmarco-v2-vector \
--track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-4node-72m.json \
--target-hosts=<4-node-cluster> \
--pipeline=benchmark-only
# Tier 4: 8 data nodes, full dataset
esrally race --track-path=/path/to/msmarco-v2-vector \
--track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-8node-full.json \
--target-hosts=<8-node-cluster> \
--pipeline=benchmark-only

Seven search configurations are measured at each tier (all with k=100), providing a recall-vs-latency curve:
| k | num_candidates | Description |
|---|---|---|
| 100 | 150 | Just above k, fastest |
| 100 | 400 | Moderate |
| 100 | 800 | Good quality |
| 100 | 1000 | Standard high-quality |
| 100 | 1500 | Mild over-retrieval |
| 100 | 2000 | Over-retrieval |
| 100 | 3000 | Heavy over-retrieval |
These appear in the `search_ops` array in each parameter file and produce both throughput/latency metrics (via `knn-search-*`) and recall metrics (via `knn-recall-*`).
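One plausible shape for that array, assuming each entry is a [k, num_candidates] pair (check the track's actual params files for the real schema):

```json
"search_ops": [
  [100, 150],
  [100, 400],
  [100, 800],
  [100, 1000],
  [100, 1500],
  [100, 2000],
  [100, 3000]
]
```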