Skip to content

Instantly share code, notes, and snippets.

@louspringer
Created March 17, 2026 21:26
Show Gist options
  • Select an option

  • Save louspringer/8e53f54ad6caec7e7301912b6dc5cd28 to your computer and use it in GitHub Desktop.

Select an option

Save louspringer/8e53f54ad6caec7e7301912b6dc5cd28 to your computer and use it in GitHub Desktop.
120B benchmark: tokens/s, interpretation, Prometheus correlation, charts (gx10 Qwen 122B)

120B benchmark: tokens/s and interpretation

Endpoint: gx10 120B (Qwen 122B) at http://gx10-83fb.tail3dac72.ts.net:8002
Script: benchmark_120b_tokens_per_second.py


Quick reference

What we measure How
Output tokens/s Non-streaming completion; usage.completion_tokens / wall_time
TTFT (time to first token) Optional streaming run; time until first content chunk
Variance Multiple runs (--runs N); mean ± std; use --runs 5 --warmup
Concurrency Parallel requests (--concurrent C); aggregate throughput; combine with --runs N for variance
Full spread --full-spread: warmup, TTFT, then C=1, C=2, C=4 each with 5 runs (all permutations)

See script help: python3 scripts/benchmark_120b_tokens_per_second.py --help


Interpreting throughput (good / better / best)

The following ranges are realistic for self-hosted 70B–120B (not vendor fantasy numbers). Full analysis and caveats: bench.md Gist by louspringer.

Band Output tokens/s Interpretation
Poor < 10 Underpowered GPU, bad quantization, CPU offload, or VRAM pressure
Good 10–20 Most self-hosted 70B–120B setups; our 17.3 t/s sits here
Strong 20–35 High-end GPUs (A100/H100), efficient quantization, clean pipeline
Excellent 35–60+ Multi-GPU sharding, tensor parallelism, kernel-level optimizations
Vendor tier 100+ Cloud APIs, batching, speculative decoding — not apples-to-apples

Bottom line: ~17 output tokens/s = healthy, no obvious misconfiguration; not elite, not broken.


What’s not in a single run

  • TTFT — Responsiveness (e.g. 0.5 s vs 8 s) matters for UX; measure with --ttft.
  • Concurrency — Throughput can drop under load (e.g. 17 t/s @ 1 user → 9 t/s @ 2 users); use --concurrent 2 or 4 to probe. Use --concurrent C --runs 5 for mean ± std of aggregate t/s.
  • Variance — Use --runs 5 (and optional --warmup) for mean ± std.
  • All permutations--full-spread runs warmup, TTFT, then C=1 (5 runs), C=2 (5 rounds), C=4 (5 rounds) and prints a summary table.

How we call the LLM (no LangChain)

Benchmarks and Cursor/Goose use the OpenAI-compatible HTTP API (POST /v1/chat/completions) directly. We do not use LangChain for this:

  • Benchmark: Needs precise timing (TTFT, wall time); direct HTTP gives full control and no extra framework overhead.
  • Cursor/Goose: Use the configured provider (base URL + model id); no app-level LangChain in this repo.

If you add an application that chains prompts or tools, LangChain (or LangGraph) can be useful there; for measuring and configuring the 120B endpoint, direct HTTP is the right choice.

See: SMOKE_TEST_120B, 120B_SERVE_RUNBOOK, GOOSE_LLM_GX10_ACCESS.


Correlate with Prometheus (same period)

To compare benchmark results with GPU, host, and LLM metrics from the same time window, use the Prometheus range query script. Metrics come from the gx10 telemetry exporter (port 9092) and are stored in Prometheus on Zane (TELEMETRY).

1. Record the benchmark period (optional)
When running a full-spread benchmark, write the run window to a JSON file:

python3 scripts/benchmark_120b_tokens_per_second.py --full-spread \
  --record-period docs/evidence/benchmark_period_latest.json

The script prints the period and the correlate command.

2. Query Prometheus for that period

# Using the period file from the last run
python3 scripts/correlate_benchmark_prometheus.py --period-file docs/evidence/benchmark_period_latest.json

# Or last N minutes (e.g. right after a benchmark)
python3 scripts/correlate_benchmark_prometheus.py --minutes 10

# Or explicit start/end (ISO or unix)
python3 scripts/correlate_benchmark_prometheus.py \
  --start 2026-03-17T20:43:15Z --end 2026-03-17T20:49:00Z

3. Metrics summarized

Metric Description
GPU utilization % gx10_gpu_utilization_percent — correlate with throughput (e.g. C=1 vs C=4).
GPU memory used gx10_gpu_memory_bytes state=used — check for pressure.
Host load gx10_host_load 1m — CPU load during benchmark.
120B up gx10_llm_state port=8002 state=up — should be 1 for the whole period.

JSON output: Add --json for machine-readable min/max/mean per metric. Set PROMETHEUS_URL if Prometheus is not at http://zane.tail3dac72.ts.net/prometheus.


Graphs and charts (correlated data)

An HTML report embeds benchmark summary and Prometheus time series in one page (bar chart + line charts). Generate it after a full-spread run that wrote a period file:

# 1. Run benchmark and record period (includes results in JSON)
python3 scripts/benchmark_120b_tokens_per_second.py --full-spread \
  --record-period docs/evidence/benchmark_period_latest.json

# 2. Generate report (fetches Prometheus for same period, builds HTML)
python3 scripts/benchmark_prometheus_report.py \
  --period-file docs/evidence/benchmark_period_latest.json \
  --output docs/evidence/benchmark_report.html

Open docs/evidence/benchmark_report.html in a browser. The report contains:

Chart Content
Bar chart TTFT (s) and throughput (t/s) for C=1, C=2, C=4 (with ±std in labels).
GPU over time GPU utilization % and GPU memory used (GB) during the period.
Host & LLM over time Host load 1m and 120B-up (1=yes) over the period.

Data is embedded in the HTML (no live Prometheus in the browser). If Prometheus had no data for the period, the line charts will be empty but the benchmark bar chart still shows. Use PROMETHEUS_URL if your Prometheus is not at the default Zane URL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment