Scaling Benchmark Plan: msmarco-v2-vector with GPU Indexing

Rally track branch: msmarco_18_36_recall (in msmarco-v2-vector/)

Goal

Demonstrate linear scale-out of Elasticsearch vector search using the msmarco-v2-vector Rally track (~138M Cohere 1024-dim embeddings) with GPU indexing. Four tiers roughly double the data and the node count at each step, keeping data-per-node approximately constant:

Tier Documents Data Nodes Shards Corpus Groups (base64)
1/8 18,000,000 1 1 1
1/4 36,000,000 2 2 1–2
1/2 72,000,000 4 4 1–4
Full 138,000,000 8 8 1–8

Each base64 corpus group holds 18M docs (6 files × 3M), except group 8 which holds 12M (4 × 3M). The remaining 364,198 docs (file 47, the "parallel" corpus) are excluded from initial indexing and recall measurement.
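
A quick arithmetic check of those corpus-group sizes (plain Python, just for the bookkeeping):

# Groups 1-7 hold 6 files x 3M docs each; group 8 holds 4 files x 3M docs.
group_sizes = [6 * 3_000_000] * 7 + [4 * 3_000_000]
assert sum(group_sizes) == 138_000_000               # the "Full" tier
assert 138_000_000 + 364_198 == 138_364_198          # plus the parallel corpus (file 47)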

Recall Measurement

Rally's KnnRecallRunner compares approximate kNN hits against brute-force ground truth stored in queries-recall-*.json.bz2. The ground truth differs per subset because the true nearest neighbors depend on which documents are in the index.
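
Conceptually, the reported recall is the overlap between the approximate result list and the true neighbor list for each query; a minimal sketch of that metric (not Rally's actual implementation):

def knn_recall(approx_ids, true_ids, k=100):
    # approx_ids: doc IDs returned by the approximate kNN search
    # true_ids:   brute-force neighbors from queries-recall-*.json.bz2
    return len(set(approx_ids[:k]) & set(true_ids[:k])) / k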

Existing ground truth files:

File Corpus size Top-K per query Queries
queries-recall.json.bz2 Full (138M) 1000 76
queries-recall-10m.json.bz2 10M 100 76

New ground truth files to generate:

File Corpus size Top-K per query Queries
queries-recall-18m.json.bz2 18M 1000 76
queries-recall-36m.json.bz2 36M 1000 76
queries-recall-72m.json.bz2 72M 1000 76
queries-recall-full.json.bz2 Full (138M) 1000 76

All files use the same 76 queries (from msmarco-passage-v2/trec-dl-2022/judged with Cohere embed-english-v3.0 embeddings). Only the ids (nearest neighbor lists) change per subset.

The full dataset generation (queries-recall-full.json.bz2) serves as a validation baseline — compare it against the existing queries-recall.json.bz2 to confirm correctness of the generation process at full scale.

Document ordering: why HuggingFace slicing matches Rally ingestion

The ground truth script uses train[:N] on the HuggingFace dataset, while Rally ingests from pre-built corpus files. These produce the same document set because:

  1. Corpus file generation (_tools/parse_documents.py) slices the HF train split sequentially: train[0:3M] → file 01, train[3M:6M] → file 02, etc.
  2. Rally's bulk operation reads corpus files in the order listed in the corpora array (01, 02, 03...) and reads documents sequentially within each file. ingest-doc-count is a global total that simply stops the sequential read after N documents (see PartitionBulkIndexParamSource in Rally's esrally/track/params.py).
  3. generate_ground_truth.py uses train[:N] which is train[0:N] — the same first N documents from HuggingFace, in the same order.

For both subset tiers the doc count equals the total docs in the listed corpora (18M = 6 files × 3M in group 1; 36M = 12 files × 3M in groups 1–2), so Rally ingests every document in those corpus groups with no partial file reads. The ground truth and the indexed data will contain identical document sets.
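
For illustration, these are the slice semantics that make the two paths line up (a sketch; the actual code is in _tools/parse_documents.py and _tools/generate_ground_truth.py):

from datasets import load_dataset

DATASET = "Cohere/msmarco-v2-embed-english-v3"

# Corpus file 01 is built from the first 3M rows of the train split:
file_01_rows = load_dataset(DATASET, split="train[0:3000000]")

# train[:N] is shorthand for train[0:N], so the 18M ground-truth run reads the
# same rows, in the same order, as corpus files 01-06 (group 1) concatenated:
tier_18m_rows = load_dataset(DATASET, split="train[:18000000]")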

Ground Truth Generation

Approach: numpy brute-force over Hugging Face dataset

Rather than standing up an Elasticsearch instance for brute-force script_score queries, we compute exact dot-product similarities directly in numpy:

  1. Load the 76 query embeddings from queries-recall.json.bz2
  2. Stream document vectors from Cohere/msmarco-v2-embed-english-v3 on HF
  3. L2-normalize each document vector (matching vg.normalize in the corpus pipeline)
  4. Compute sigmoid(1, e, -dot(q, d)), i.e. the logistic 1 / (1 + e^(-q·d)), to match the ES script_score formula
  5. Maintain per-query top-1000 via numpy argpartition
  6. Write JSONL output matching the existing ground truth format
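
A minimal sketch of steps 3-5, assuming document vectors arrive as numpy batches (the real logic, including streaming and I/O, is in _tools/generate_ground_truth.py):

import numpy as np

def score_batch(queries, doc_batch):
    # queries:   (76, 1024) float64 query embeddings
    # doc_batch: (B, 1024) float64 document embeddings from the HF dataset
    # Step 3: L2-normalize each document vector (matching vg.normalize).
    docs = doc_batch / np.linalg.norm(doc_batch, axis=1, keepdims=True)
    # Step 4: sigmoid(1, e, -dot) is the logistic function of the dot product.
    dots = queries @ docs.T                      # (76, B) exact dot products
    return 1.0 / (1.0 + np.exp(-dots))

def top_k(scores_row, doc_ids, k=1000):
    # Step 5: argpartition finds the k best in O(B); sort only those k.
    idx = np.argpartition(-scores_row, k)[:k]
    idx = idx[np.argsort(-scores_row[idx])]
    return [(doc_ids[i], float(scores_row[i])) for i in idx]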

Script

_tools/generate_ground_truth.py — see file for usage.

AWS Instance for Generation

The bottleneck is dataset download and Arrow decoding, not dot-product compute. The actual matrix multiplication for 36M docs × 76 queries × 1024 dims takes under 2 minutes on a modern CPU. Downloading and decoding the HuggingFace dataset (Arrow format, ~50–100 GB for 36M rows selecting only _id + emb) takes 10–30 minutes depending on network bandwidth and decode parallelism.
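
A back-of-the-envelope check of that compute claim (the sustained-throughput figure is an assumption, not a measurement):

# Exact dot products for 36M docs x 76 queries x 1024 dims:
flops = 2 * 36_000_000 * 76 * 1024      # one multiply + one add per dimension
print(f"{flops / 1e12:.1f} TFLOPs")     # ~5.6 TFLOPs
# Assuming sustained float64 matmul throughput on the order of 0.1-1 TFLOP/s
# for OpenBLAS on 16 modern cores, the matmul alone takes seconds to about a
# minute; batching and Python overhead account for the rest of "under 2 minutes".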

Recommendation: c7i.4xlarge (16 vCPU, 32 GB RAM, 12.5 Gbps, ~$0.71/hr)

  • 32 GB RAM is plenty (working set is ~2 GB per batch; total peaks ~4 GB)
  • 16 vCPUs provide enough parallelism for HF dataset decoding (--workers 8), and numpy's OpenBLAS matmul uses the remaining cores
  • 12.5 Gbps network saturates HF download throughput
  • Total cost: well under $1 for both runs combined

Do more vCPUs help?

Marginally. The datasets library parallelizes Arrow shard decoding across num_proc workers, so more CPUs help with decode/parse. numpy matmul also benefits from more cores (OpenBLAS thread pool). But the gains flatten beyond ~16 cores because:

  • HF download bandwidth is the primary constraint, not CPU
  • The per-query top-K maintenance uses numpy argpartition (76 queries × top-1000), which is cheap relative to the matmul and gains nothing from extra cores
  • Arrow decode parallelism has diminishing returns beyond 8–12 workers

A c7i.8xlarge (32 vCPU, $1.43/hr) would be ~20–30% faster for decode but costs twice as much; not worth it for a one-off job that completes in under an hour regardless.

Running the Generation

# On the AWS instance — create an isolated Python environment:
python3 -m venv ground-truth-env
source ground-truth-env/bin/activate
pip install datasets numpy huggingface_hub

# Point the HuggingFace cache to a disk with enough space.
# load_dataset downloads all parquet shards (~500 GB+) before selecting rows,
# and the default cache (~/.cache/huggingface/) is often on a small root
# filesystem.  Set these to a path on your large data volume:
export HF_HOME=/home/esbench/.rally/hf_cache
export HF_DATASETS_CACHE=/home/esbench/.rally/hf_cache/datasets

# 18M ground truth
python msmarco-v2-vector/_tools/generate_ground_truth.py \
    --doc-count 18000000 \
    --output queries-recall-18m.json \
    --workers 8
bzip2 queries-recall-18m.json

# 36M ground truth
python msmarco-v2-vector/_tools/generate_ground_truth.py \
    --doc-count 36000000 \
    --output queries-recall-36m.json \
    --workers 8
bzip2 queries-recall-36m.json

# 72M ground truth
python msmarco-v2-vector/_tools/generate_ground_truth.py \
    --doc-count 72000000 \
    --output queries-recall-72m.json \
    --workers 8
bzip2 queries-recall-72m.json

# Full dataset ground truth (for validation against existing file).
# Use 138,000,000 (not 138,364,198) — the extra 364,198 docs are in
# file 47 (the "parallel" corpus), which is NOT included in the default
# initial-indexing corpora (groups 1-8).  The existing
# queries-recall.json.bz2 was generated against 138M docs.
python msmarco-v2-vector/_tools/generate_ground_truth.py \
    --doc-count 138000000 \
    --output queries-recall-full.json \
    --workers 8
bzip2 queries-recall-full.json

# Clean up when done
deactivate
rm -rf ground-truth-env
rm -rf /home/esbench/.rally/hf_cache

Validation

After generation, verify format consistency:

python -c "
import bz2, json
for f in ['queries-recall-18m.json.bz2', 'queries-recall-36m.json.bz2',
          'queries-recall-72m.json.bz2', 'queries-recall-full.json.bz2']:
    with bz2.open(f, 'r') as fh:
        lines = fh.readlines()
    print(f'{f}: {len(lines)} queries')
    q = json.loads(lines[0])
    print(f'  keys: {list(q.keys())}')
    print(f'  ids len: {len(q[\"ids\"])}')
    print(f'  top score: {q[\"ids\"][0][1]}, bottom: {q[\"ids\"][-1][1]}')
"

Validate the generated ground truth against the existing files to confirm correctness. The existing files were produced by ES brute-force script_score (Java doubles); ours use numpy float64 matmul. Expect well above 99.9% overlap, with any differences confined to score-tie boundaries caused by floating-point accumulation order.
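
compare_ground_truth.py reports per-query overlap between two files; a minimal sketch of that check (not the actual script), assuming queries appear in the same order in both files and using the "ids" format exercised by the snippet above:

import bz2, json

def load_neighbor_ids(path, k):
    # One JSON object per line; "ids" is a list of [doc_id, score] pairs.
    with bz2.open(path, "rt") as fh:
        return [[doc_id for doc_id, _ in json.loads(line)["ids"][:k]] for line in fh]

def average_overlap(file_a, file_b, k=100):
    a = load_neighbor_ids(file_a, k)
    b = load_neighbor_ids(file_b, k)
    overlaps = [len(set(x) & set(y)) for x, y in zip(a, b)]
    return sum(overlaps) / len(overlaps)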

10M validation (compare top-100, since the existing file stores 100 neighbors):

python msmarco-v2-vector/_tools/compare_ground_truth.py \
    msmarco-v2-vector/queries-recall-10m.json.bz2 \
    queries-recall-10m.json.bz2

Output:

Average overlap: 100.0/100 (99.97%)
Worst overlap:   98/100 (query 2025747)
Max score diff on shared docs: 0.0000001000

NEAR MATCH — tiny differences likely from float precision in score ties.

The 2 mismatched docs in query 2025747 both have score 0.604882, a tie-break cluster where the differing floating-point precision and accumulation order of the ES script_score and numpy computations produce different rankings.

Full dataset validation (compare top-1000):

The existing queries-recall.json.bz2 was generated against an ES index that included the parallel corpus (file 47, 364,198 docs with 69_ prefix IDs), totaling 138,364,198 documents. Our queries-recall-full.json was generated with --doc-count 138000000 (groups 1-8 only, no parallel corpus).

python msmarco-v2-vector/_tools/compare_ground_truth.py \
    msmarco-v2-vector/queries-recall.json.bz2 \
    queries-recall-full.json

Output:

Average overlap: 999.5/1000 (99.95%)
Worst overlap:   992/1000 (query 2033396)
Max score diff on shared docs: 0.0000001000

NEAR MATCH — tiny differences likely from float precision in score ties.

All mismatches follow the same pattern: docs "only in A" (the existing file) have 69_-prefix IDs — these are neighbors from the 364K parallel corpus that our generation excluded. Docs "only in B" (our file) are borderline docs pushed into the top-1000 because the 69_ docs are absent. For example, query 2033396 (worst case, 992/1000 overlap) has 8 69_-prefix docs in the existing file at ranks 258–792 with scores well above the score boundary; without them, 8 lower-scoring docs from other files fill the tail.

The max score diff on shared docs of 0.0000001 confirms that scoring matches (to within float rounding) for documents present in both sets; the generation script is correct.

Runtime validation: confirming recall file and doc count

When running the benchmark, verify that Rally picked the correct recall file and ingested the expected number of documents. Two log lines in Rally's log file (~/.rally/logs/rally.log) confirm this:

1. Recall file selection — emitted by KnnRecallRunner when the recall operation runs. Look for:

grep "recall_doc_set" ~/.rally/logs/rally.log

Expected output per tier:

recall_doc_set='18m' (type=str), using recall file: queries-recall-18m.json.bz2
recall_doc_set='36m' (type=str), using recall file: queries-recall-36m.json.bz2
recall_doc_set='72m' (type=str), using recall file: queries-recall-72m.json.bz2
recall_doc_set='full' (type=str), using recall file: queries-recall.json.bz2

The recall-doc-set parameter is set automatically in operations/default.json based on initial_indexing_ingest_doc_count. If recall_doc_set shows an unexpected value (e.g. -1 or wrong tier), check the parameter file.

2. Recall results — emitted after each recall operation completes:

grep "Recall results" ~/.rally/logs/rally.log

This shows avg_recall, min_recall, k, num_candidates, and NDCG scores.

3. Bulk indexing completion — Rally logs the total number of docs indexed when the bulk operation finishes. Search for the operation name:

grep "initial-documents-indexing" ~/.rally/logs/rally.log

Verify the total doc count matches the expected tier size (18M, 36M, 72M, or 138M).

Rally Track Changes

track.py

Add filename constants for the new recall files and extend KnnRecallRunner to select the right file based on the recall-doc-set parameter.
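
A hypothetical sketch of that selection (constant names and parameter plumbing are illustrative, not the actual track.py code):

QUERIES_RECALL_FILES = {
    "18m": "queries-recall-18m.json.bz2",
    "36m": "queries-recall-36m.json.bz2",
    "72m": "queries-recall-72m.json.bz2",
    "full": "queries-recall.json.bz2",   # the full tier keeps the existing file
}

def recall_file_for(params):
    # Fall back to the full-corpus file when recall-doc-set is missing or unknown.
    doc_set = str(params.get("recall-doc-set", "full"))
    return QUERIES_RECALL_FILES.get(doc_set, "queries-recall.json.bz2")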

operations/default.json

Extend the recall-doc-set selection logic to detect the 18M, 36M, and 72M doc counts.
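
The mapping that logic needs to express, written here as plain Python for clarity (the actual logic lives in operations/default.json, not in this sketch):

# Illustrative only: map initial_indexing_ingest_doc_count to a recall-doc-set value.
TIER_TO_DOC_SET = {
    18_000_000: "18m",
    36_000_000: "36m",
    72_000_000: "72m",
    138_000_000: "full",
}

def recall_doc_set(initial_indexing_ingest_doc_count: int) -> str:
    return TIER_TO_DOC_SET.get(initial_indexing_ingest_doc_count, "full")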

Parameter Files

New GPU-scaling parameter files (in params/):

File Docs Shards Index Type
params/params-gpu-scaling-1node-18m.json 18,000,000 1 hnsw + GPU
params/params-gpu-scaling-2node-36m.json 36,000,000 2 hnsw + GPU
params/params-gpu-scaling-4node-72m.json 72,000,000 4 hnsw + GPU
params/params-gpu-scaling-8node-full.json 138,000,000 8 hnsw + GPU

Running the Benchmark

# Tier 1: 1 data node, 18M docs
esrally race --track-path=/path/to/msmarco-v2-vector \
    --track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-1node-18m.json \
    --target-hosts=<1-node-cluster> \
    --pipeline=benchmark-only

# Tier 2: 2 data nodes, 36M docs
esrally race --track-path=/path/to/msmarco-v2-vector \
    --track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-2node-36m.json \
    --target-hosts=<2-node-cluster> \
    --pipeline=benchmark-only

# Tier 3: 4 data nodes, 72M docs
esrally race --track-path=/path/to/msmarco-v2-vector \
    --track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-4node-72m.json \
    --target-hosts=<4-node-cluster> \
    --pipeline=benchmark-only

# Tier 4: 8 data nodes, full dataset
esrally race --track-path=/path/to/msmarco-v2-vector \
    --track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-8node-full.json \
    --target-hosts=<8-node-cluster> \
    --pipeline=benchmark-only

Recall Data Points

Seven search configurations are measured at each tier (all with k=100), providing a recall-vs-latency curve:

k num_candidates Description
100 150 Just above k, fastest
100 400 Moderate
100 800 Good quality
100 1000 Standard high-quality
100 1500 Mild over-retrieval
100 2000 Over-retrieval
100 3000 Heavy over-retrieval

These appear in the search_ops array in each parameter file and produce throughput/latency metrics (via knn-search-*) as well as recall metrics (via knn-recall-*).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment