Scaling Benchmark Plan: msmarco-v2-vector with GPU Indexing

Rally track branch: msmarco_18_36_recall (in msmarco-v2-vector/)

Goal

Demonstrate linear scale-out of Elasticsearch vector search using the msmarco-v2-vector Rally track (~138M Cohere 1024-dim embeddings) with GPU indexing. Four tiers roughly double the data and the node count at each step, keeping data-per-node approximately constant:

Tier Documents Data Nodes Shards Corpus Groups (base64)
1/8 18,000,000 1 1 1
1/4 36,000,000 2 2 1–2
1/2 72,000,000 4 4 1–4
Full 138,000,000 8 8 1–8

Each base64 corpus group holds 18M docs (6 files × 3M), except group 8 which holds 12M (4 × 3M). The remaining 364,198 docs (file 47, the "parallel" corpus) are excluded from initial indexing and recall measurement.
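
A quick arithmetic check of those corpus-group sizes (plain Python, just for the bookkeeping):

# Groups 1-7 hold 6 files x 3M docs each; group 8 holds 4 files x 3M docs.
group_sizes = [6 * 3_000_000] * 7 + [4 * 3_000_000]
assert sum(group_sizes) == 138_000_000               # the "Full" tier
assert 138_000_000 + 364_198 == 138_364_198          # plus the parallel corpus (file 47)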

Recall Measurement

Rally's KnnRecallRunner compares approximate kNN hits against brute-force ground truth stored in queries-recall-*.json.bz2. The ground truth differs per subset because the true nearest neighbors depend on which documents are in the index.
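
Conceptually, the reported recall is the overlap between the approximate result list and the true neighbor list for each query; a minimal sketch of that metric (not Rally's actual implementation):

def knn_recall(approx_ids, true_ids, k=100):
    # approx_ids: doc IDs returned by the approximate kNN search
    # true_ids:   brute-force neighbors from queries-recall-*.json.bz2
    return len(set(approx_ids[:k]) & set(true_ids[:k])) / k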

Existing ground truth files:

File Corpus size Top-K per query Queries
queries-recall.json.bz2 Full (138M) 1000 76
queries-recall-10m.json.bz2 10M 100 76

New ground truth files to generate:

File Corpus size Top-K per query Queries
queries-recall-18m.json.bz2 18M 1000 76
queries-recall-36m.json.bz2 36M 1000 76
queries-recall-72m.json.bz2 72M 1000 76
queries-recall-full.json.bz2 Full (138M) 1000 76

All files use the same 76 queries (from msmarco-passage-v2/trec-dl-2022/judged with Cohere embed-english-v3.0 embeddings). Only the ids (nearest neighbor lists) change per subset.

The full dataset generation (queries-recall-full.json.bz2) serves as a validation baseline — compare it against the existing queries-recall.json.bz2 to confirm correctness of the generation process at full scale.

Document ordering: why HuggingFace slicing matches Rally ingestion

The ground truth script uses train[:N] on the HuggingFace dataset, while Rally ingests from pre-built corpus files. These produce the same document set because:

  1. Corpus file generation (_tools/parse_documents.py) slices the HF train split sequentially: train[0:3M] → file 01, train[3M:6M] → file 02, etc.
  2. Rally's bulk operation reads corpus files in the order listed in the corpora array (01, 02, 03...) and reads documents sequentially within each file. ingest-doc-count is a global total that simply stops the sequential read after N documents (see PartitionBulkIndexParamSource in Rally's esrally/track/params.py).
  3. generate_ground_truth.py uses train[:N] which is train[0:N] — the same first N documents from HuggingFace, in the same order.

For both subset tiers the doc count equals the total docs in the listed corpora (18M = 6 files × 3M in group 1; 36M = 12 files × 3M in groups 1–2), so Rally ingests every document in those corpus groups with no partial file reads. The ground truth and the indexed data will contain identical document sets.
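
For illustration, these are the slice semantics that make the two paths line up (a sketch; the actual code is in _tools/parse_documents.py and _tools/generate_ground_truth.py):

from datasets import load_dataset

DATASET = "Cohere/msmarco-v2-embed-english-v3"

# Corpus file 01 is built from the first 3M rows of the train split:
file_01_rows = load_dataset(DATASET, split="train[0:3000000]")

# train[:N] is shorthand for train[0:N], so the 18M ground-truth run reads the
# same rows, in the same order, as corpus files 01-06 (group 1) concatenated:
tier_18m_rows = load_dataset(DATASET, split="train[:18000000]")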

Ground Truth Generation

Approach: numpy brute-force over Hugging Face dataset

Rather than standing up an Elasticsearch instance for brute-force script_score queries, we compute exact dot-product similarities directly in numpy:

  1. Load the 76 query embeddings from queries-recall.json.bz2
  2. Stream document vectors from Cohere/msmarco-v2-embed-english-v3 on HF
  3. L2-normalize each document vector (matching vg.normalize in the corpus pipeline)
  4. Compute sigmoid(1, e, -dot(q, d)), i.e. the logistic 1 / (1 + e^(-q·d)), to match the ES script_score formula
  5. Maintain per-query top-1000 via numpy argpartition
  6. Write JSONL output matching the existing ground truth format
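
A minimal sketch of steps 3-5, assuming document vectors arrive as numpy batches (the real logic, including streaming and I/O, is in _tools/generate_ground_truth.py):

import numpy as np

def score_batch(queries, doc_batch):
    # queries:   (76, 1024) float64 query embeddings
    # doc_batch: (B, 1024) float64 document embeddings from the HF dataset
    # Step 3: L2-normalize each document vector (matching vg.normalize).
    docs = doc_batch / np.linalg.norm(doc_batch, axis=1, keepdims=True)
    # Step 4: sigmoid(1, e, -dot) is the logistic function of the dot product.
    dots = queries @ docs.T                      # (76, B) exact dot products
    return 1.0 / (1.0 + np.exp(-dots))

def top_k(scores_row, doc_ids, k=1000):
    # Step 5: argpartition finds the k best in O(B); sort only those k.
    idx = np.argpartition(-scores_row, k)[:k]
    idx = idx[np.argsort(-scores_row[idx])]
    return [(doc_ids[i], float(scores_row[i])) for i in idx]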

Script

_tools/generate_ground_truth.py — see file for usage.

AWS Instance for Generation

The bottleneck is dataset download and Arrow decoding, not dot-product compute. The actual matrix multiplication for 36M docs × 76 queries × 1024 dims takes under 2 minutes on a modern CPU. Downloading and decoding the HuggingFace dataset (Arrow format, ~50–100 GB for 36M rows selecting only _id + emb) takes 10–30 minutes depending on network bandwidth and decode parallelism.
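
A back-of-the-envelope check of that compute claim (the sustained-throughput figure is an assumption, not a measurement):

# Exact dot products for 36M docs x 76 queries x 1024 dims:
flops = 2 * 36_000_000 * 76 * 1024      # one multiply + one add per dimension
print(f"{flops / 1e12:.1f} TFLOPs")     # ~5.6 TFLOPs
# Assuming sustained float64 matmul throughput on the order of 0.1-1 TFLOP/s
# for OpenBLAS on 16 modern cores, the matmul alone takes seconds to about a
# minute; batching and Python overhead account for the rest of "under 2 minutes".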

Recommendation: c7i.4xlarge (16 vCPU, 32 GB RAM, 12.5 Gbps, ~$0.71/hr)

  • 32 GB RAM is plenty (working set is ~2 GB per batch; total peaks ~4 GB)
  • 16 vCPUs provide enough parallelism for HF dataset decoding (--workers 8), and numpy's OpenBLAS matmul uses the remaining cores
  • 12.5 Gbps network saturates HF download throughput
  • Total cost: well under $1 for both runs combined

Do more vCPUs help?

Marginally. The datasets library parallelizes Arrow shard decoding across num_proc workers, so more CPUs help with decode/parse. numpy matmul also benefits from more cores (OpenBLAS thread pool). But the gains flatten beyond ~16 cores because:

  • HF download bandwidth is the primary constraint, not CPU
  • The per-query top-K maintenance uses numpy argpartition (76 queries × top-1000), which is cheap relative to the matmul and gains nothing from extra cores
  • Arrow decode parallelism has diminishing returns beyond 8–12 workers

A c7i.8xlarge (32 vCPU, $1.43/hr) would be ~20–30% faster for decode but costs twice as much; not worth it for a one-off job that completes in under an hour regardless.

Running the Generation

# On the AWS instance — create an isolated Python environment:
python3 -m venv ground-truth-env
source ground-truth-env/bin/activate
pip install datasets numpy huggingface_hub

# Point the HuggingFace cache to a disk with enough space.
# load_dataset downloads all parquet shards (~500 GB+) before selecting rows,
# and the default cache (~/.cache/huggingface/) is often on a small root
# filesystem.  Set these to a path on your large data volume:
export HF_HOME=/home/esbench/.rally/hf_cache
export HF_DATASETS_CACHE=/home/esbench/.rally/hf_cache/datasets

# 18M ground truth
python msmarco-v2-vector/_tools/generate_ground_truth.py \
    --doc-count 18000000 \
    --output queries-recall-18m.json \
    --workers 8
bzip2 queries-recall-18m.json

# 36M ground truth
python msmarco-v2-vector/_tools/generate_ground_truth.py \
    --doc-count 36000000 \
    --output queries-recall-36m.json \
    --workers 8
bzip2 queries-recall-36m.json

# 72M ground truth
python msmarco-v2-vector/_tools/generate_ground_truth.py \
    --doc-count 72000000 \
    --output queries-recall-72m.json \
    --workers 8
bzip2 queries-recall-72m.json

# Full dataset ground truth (for validation against existing file).
# Use 138,000,000 (not 138,364,198) — the extra 364,198 docs are in
# file 47 (the "parallel" corpus), which is NOT included in the default
# initial-indexing corpora (groups 1-8).  The existing
# queries-recall.json.bz2 was generated against 138M docs.
python msmarco-v2-vector/_tools/generate_ground_truth.py \
    --doc-count 138000000 \
    --output queries-recall-full.json \
    --workers 8
bzip2 queries-recall-full.json

# Clean up when done
deactivate
rm -rf ground-truth-env
rm -rf /home/esbench/.rally/hf_cache

Validation

After generation, verify format consistency:

python -c "
import bz2, json
for f in ['queries-recall-18m.json.bz2', 'queries-recall-36m.json.bz2',
          'queries-recall-72m.json.bz2', 'queries-recall-full.json.bz2']:
    with bz2.open(f, 'r') as fh:
        lines = fh.readlines()
    print(f'{f}: {len(lines)} queries')
    q = json.loads(lines[0])
    print(f'  keys: {list(q.keys())}')
    print(f'  ids len: {len(q[\"ids\"])}')
    print(f'  top score: {q[\"ids\"][0][1]}, bottom: {q[\"ids\"][-1][1]}')
"

Validate the generated ground truth against the existing files to confirm correctness. The existing files were produced by ES brute-force script_score (Java doubles); ours use numpy float64 matmul. Expect well above 99.9% overlap, with any differences confined to score-tie boundaries caused by floating-point accumulation order.
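
compare_ground_truth.py reports per-query overlap between two files; a minimal sketch of that check (not the actual script), assuming queries appear in the same order in both files and using the "ids" format exercised by the snippet above:

import bz2, json

def load_neighbor_ids(path, k):
    # One JSON object per line; "ids" is a list of [doc_id, score] pairs.
    with bz2.open(path, "rt") as fh:
        return [[doc_id for doc_id, _ in json.loads(line)["ids"][:k]] for line in fh]

def average_overlap(file_a, file_b, k=100):
    a = load_neighbor_ids(file_a, k)
    b = load_neighbor_ids(file_b, k)
    overlaps = [len(set(x) & set(y)) for x, y in zip(a, b)]
    return sum(overlaps) / len(overlaps)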

10M validation (compare top-100, since the existing file stores 100 neighbors):

python msmarco-v2-vector/_tools/compare_ground_truth.py \
    msmarco-v2-vector/queries-recall-10m.json.bz2 \
    queries-recall-10m.json.bz2

Output:

Average overlap: 100.0/100 (99.97%)
Worst overlap:   98/100 (query 2025747)
Max score diff on shared docs: 0.0000001000

NEAR MATCH — tiny differences likely from float precision in score ties.

The 2 mismatched docs in query 2025747 both have score 0.604882, a tie-break cluster where the differing floating-point precision and accumulation order of the ES script_score and numpy computations produce different rankings.

Full dataset validation (compare top-1000):

The existing queries-recall.json.bz2 was generated against an ES index that included the parallel corpus (file 47, 364,198 docs with 69_ prefix IDs), totaling 138,364,198 documents. Our queries-recall-full.json was generated with --doc-count 138000000 (groups 1-8 only, no parallel corpus).

python msmarco-v2-vector/_tools/compare_ground_truth.py \
    msmarco-v2-vector/queries-recall.json.bz2 \
    queries-recall-full.json

Output:

Average overlap: 999.5/1000 (99.95%)
Worst overlap:   992/1000 (query 2033396)
Max score diff on shared docs: 0.0000001000

NEAR MATCH — tiny differences likely from float precision in score ties.

All mismatches follow the same pattern: docs "only in A" (the existing file) have 69_-prefix IDs — these are neighbors from the 364K parallel corpus that our generation excluded. Docs "only in B" (our file) are borderline docs pushed into the top-1000 because the 69_ docs are absent. For example, query 2033396 (worst case, 992/1000 overlap) has 8 69_-prefix docs in the existing file at ranks 258–792 with scores well above the score boundary; without them, 8 lower-scoring docs from other files fill the tail.

The max score diff on shared docs of 0.0000001 confirms that scoring matches (to within float rounding) for documents present in both sets; the generation script is correct.

Runtime validation: confirming recall file and doc count

When running the benchmark, verify that Rally picked the correct recall file and ingested the expected number of documents. Two log lines in Rally's log file (~/.rally/logs/rally.log) confirm this:

1. Recall file selection — emitted by KnnRecallRunner when the recall operation runs. Look for:

grep "recall_doc_set" ~/.rally/logs/rally.log

Expected output per tier:

recall_doc_set='18m' (type=str), using recall file: queries-recall-18m.json.bz2
recall_doc_set='36m' (type=str), using recall file: queries-recall-36m.json.bz2
recall_doc_set='72m' (type=str), using recall file: queries-recall-72m.json.bz2
recall_doc_set='full' (type=str), using recall file: queries-recall.json.bz2

The recall-doc-set parameter is set automatically in operations/default.json based on initial_indexing_ingest_doc_count. If recall_doc_set shows an unexpected value (e.g. -1 or wrong tier), check the parameter file.

2. Recall results — emitted after each recall operation completes:

grep "Recall results" ~/.rally/logs/rally.log

This shows avg_recall, min_recall, k, num_candidates, and NDCG scores.

3. Bulk indexing completion — Rally logs the total number of docs indexed when the bulk operation finishes. Search for the operation name:

grep "initial-documents-indexing" ~/.rally/logs/rally.log

Verify the total doc count matches the expected tier size (18M, 36M, 72M, or 138M).

Rally Track Changes

track.py

Add filename constants for the new recall files and extend KnnRecallRunner to select the right file based on the recall-doc-set parameter.
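
A hypothetical sketch of that selection (constant names and parameter plumbing are illustrative, not the actual track.py code):

QUERIES_RECALL_FILES = {
    "18m": "queries-recall-18m.json.bz2",
    "36m": "queries-recall-36m.json.bz2",
    "72m": "queries-recall-72m.json.bz2",
    "full": "queries-recall.json.bz2",   # the full tier keeps the existing file
}

def recall_file_for(params):
    # Fall back to the full-corpus file when recall-doc-set is missing or unknown.
    doc_set = str(params.get("recall-doc-set", "full"))
    return QUERIES_RECALL_FILES.get(doc_set, "queries-recall.json.bz2")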

operations/default.json

Extend the recall-doc-set selection logic to detect the 18M, 36M, and 72M doc counts.
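
The mapping that logic needs to express, written here as plain Python for clarity (the actual logic lives in operations/default.json, not in this sketch):

# Illustrative only: map initial_indexing_ingest_doc_count to a recall-doc-set value.
TIER_TO_DOC_SET = {
    18_000_000: "18m",
    36_000_000: "36m",
    72_000_000: "72m",
    138_000_000: "full",
}

def recall_doc_set(initial_indexing_ingest_doc_count: int) -> str:
    return TIER_TO_DOC_SET.get(initial_indexing_ingest_doc_count, "full")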

Parameter Files

New GPU-scaling parameter files (in params/):

File Docs Shards Index Type
params/params-gpu-scaling-1node-18m.json 18,000,000 1 hnsw + GPU
params/params-gpu-scaling-2node-36m.json 36,000,000 2 hnsw + GPU
params/params-gpu-scaling-4node-72m.json 72,000,000 4 hnsw + GPU
params/params-gpu-scaling-8node-full.json 138,000,000 8 hnsw + GPU

Running the Benchmark

# Tier 1: 1 data node, 18M docs
esrally race --track-path=/path/to/msmarco-v2-vector \
    --track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-1node-18m.json \
    --target-hosts=<1-node-cluster> \
    --pipeline=benchmark-only

# Tier 2: 2 data nodes, 36M docs
esrally race --track-path=/path/to/msmarco-v2-vector \
    --track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-2node-36m.json \
    --target-hosts=<2-node-cluster> \
    --pipeline=benchmark-only

# Tier 3: 4 data nodes, 72M docs
esrally race --track-path=/path/to/msmarco-v2-vector \
    --track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-4node-72m.json \
    --target-hosts=<4-node-cluster> \
    --pipeline=benchmark-only

# Tier 4: 8 data nodes, full dataset
esrally race --track-path=/path/to/msmarco-v2-vector \
    --track-params=@/path/to/msmarco-v2-vector/params/params-gpu-scaling-8node-full.json \
    --target-hosts=<8-node-cluster> \
    --pipeline=benchmark-only

Recall Data Points

Seven search configurations are measured at each tier (all with k=100), providing a recall-vs-latency curve:

k num_candidates Description
100 150 Just above k, fastest
100 400 Moderate
100 800 Good quality
100 1000 Standard high-quality
100 1500 Mild over-retrieval
100 2000 Over-retrieval
100 3000 Heavy over-retrieval

These appear in the search_ops array in each parameter file and produce throughput/latency metrics (via knn-search-*) as well as recall metrics (via knn-recall-*).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment