120B benchmark: tokens/s and interpretation

Endpoint: gx10 120B (Qwen 122B) at http://gx10-83fb.tail3dac72.ts.net:8002
Script: benchmark_120b_tokens_per_second.py

Quick reference

What we measure	How
Output tokens/s	Non-streaming completion; `usage.completion_tokens / wall_time`
TTFT (time to first token)	Optional streaming run; time until first content chunk
Variance	Multiple runs (`--runs N`); mean ± std; use `--runs 5 --warmup`
Concurrency	Parallel requests (`--concurrent C`); aggregate throughput; combine with `--runs N` for variance
Full spread	`--full-spread`: warmup, TTFT, then C=1, C=2, C=4 each with 5 runs (all permutations)

See script help: python3 scripts/benchmark_120b_tokens_per_second.py --help

Interpreting throughput (good / better / best)

The following ranges are realistic for self-hosted 70B–120B (not vendor fantasy numbers). Full analysis and caveats: bench.md Gist by louspringer.

Band	Output tokens/s	Interpretation
Poor	< 10	Underpowered GPU, bad quantization, CPU offload, or VRAM pressure
Good	10–20	Most self-hosted 70B–120B setups; our 17.3 t/s sits here
Strong	20–35	High-end GPUs (A100/H100), efficient quantization, clean pipeline
Excellent	35–60+	Multi-GPU sharding, tensor parallelism, kernel-level optimizations
Vendor tier	100+	Cloud APIs, batching, speculative decoding — not apples-to-apples

Bottom line: ~17 output tokens/s = healthy, no obvious misconfiguration; not elite, not broken.

What’s not in a single run

TTFT — Responsiveness (e.g. 0.5 s vs 8 s) matters for UX; measure with --ttft.
Concurrency — Throughput can drop under load (e.g. 17 t/s @ 1 user → 9 t/s @ 2 users); use --concurrent 2 or 4 to probe. Use --concurrent C --runs 5 for mean ± std of aggregate t/s.
Variance — Use --runs 5 (and optional --warmup) for mean ± std.
All permutations — --full-spread runs warmup, TTFT, then C=1 (5 runs), C=2 (5 rounds), C=4 (5 rounds) and prints a summary table.

How we call the LLM (no LangChain)

Benchmarks and Cursor/Goose use the OpenAI-compatible HTTP API (POST /v1/chat/completions) directly. We do not use LangChain for this:

Benchmark: Needs precise timing (TTFT, wall time); direct HTTP gives full control and no extra framework overhead.
Cursor/Goose: Use the configured provider (base URL + model id); no app-level LangChain in this repo.

If you add an application that chains prompts or tools, LangChain (or LangGraph) can be useful there; for measuring and configuring the 120B endpoint, direct HTTP is the right choice.

See: SMOKE_TEST_120B, 120B_SERVE_RUNBOOK, GOOSE_LLM_GX10_ACCESS.

Correlate with Prometheus (same period)

To compare benchmark results with GPU, host, and LLM metrics from the same time window, use the Prometheus range query script. Metrics come from the gx10 telemetry exporter (port 9092) and are stored in Prometheus on Zane (TELEMETRY).

1. Record the benchmark period (optional)
When running a full-spread benchmark, write the run window to a JSON file:

python3 scripts/benchmark_120b_tokens_per_second.py --full-spread \
  --record-period docs/evidence/benchmark_period_latest.json

The script prints the period and the correlate command.

2. Query Prometheus for that period

# Using the period file from the last run
python3 scripts/correlate_benchmark_prometheus.py --period-file docs/evidence/benchmark_period_latest.json

# Or last N minutes (e.g. right after a benchmark)
python3 scripts/correlate_benchmark_prometheus.py --minutes 10

# Or explicit start/end (ISO or unix)
python3 scripts/correlate_benchmark_prometheus.py \
  --start 2026-03-17T20:43:15Z --end 2026-03-17T20:49:00Z

3. Metrics summarized

Metric	Description
GPU utilization %	`gx10_gpu_utilization_percent` — correlate with throughput (e.g. C=1 vs C=4).
GPU memory used	`gx10_gpu_memory_bytes` state=used — check for pressure.
Host load	`gx10_host_load` 1m — CPU load during benchmark.
120B up	`gx10_llm_state` port=8002 state=up — should be 1 for the whole period.

JSON output: Add --json for machine-readable min/max/mean per metric. Set PROMETHEUS_URL if Prometheus is not at http://zane.tail3dac72.ts.net/prometheus.

Graphs and charts (correlated data)

An HTML report embeds benchmark summary and Prometheus time series in one page (bar chart + line charts). Generate it after a full-spread run that wrote a period file:

# 1. Run benchmark and record period (includes results in JSON)
python3 scripts/benchmark_120b_tokens_per_second.py --full-spread \
  --record-period docs/evidence/benchmark_period_latest.json

# 2. Generate report (fetches Prometheus for same period, builds HTML)
python3 scripts/benchmark_prometheus_report.py \
  --period-file docs/evidence/benchmark_period_latest.json \
  --output docs/evidence/benchmark_report.html

Open docs/evidence/benchmark_report.html in a browser. The report contains:

Chart	Content
Bar chart	TTFT (s) and throughput (t/s) for C=1, C=2, C=4 (with ±std in labels).
GPU over time	GPU utilization % and GPU memory used (GB) during the period.
Host & LLM over time	Host load 1m and 120B-up (1=yes) over the period.

Data is embedded in the HTML (no live Prometheus in the browser). If Prometheus had no data for the period, the line charts will be empty but the benchmark bar chart still shows. Use PROMETHEUS_URL if your Prometheus is not at the default Zane URL.

louspringer/BENCHMARK_120B.md

Select an option

No results found