Endpoint: gx10 120B (Qwen 122B) at http://gx10-83fb.tail3dac72.ts.net:8002
Script: benchmark_120b_tokens_per_second.py
| What we measure | How |
|---|---|
| Output tokens/s | Non-streaming completion; usage.completion_tokens / wall_time |
| TTFT (time to first token) | Optional streaming run; time until first content chunk |
| Variance | Multiple runs (--runs N); mean ± std; use --runs 5 --warmup |
| Concurrency | Parallel requests (--concurrent C); aggregate throughput; combine with --runs N for variance |
| Full spread | --full-spread: warmup, TTFT, then C=1, C=2, C=4 each with 5 runs (all permutations) |
See script help: python3 scripts/benchmark_120b_tokens_per_second.py --help
The following ranges are realistic for self-hosted 70B–120B (not vendor fantasy numbers). Full analysis and caveats: bench.md Gist by louspringer.
| Band | Output tokens/s | Interpretation |
|---|---|---|
| Poor | < 10 | Underpowered GPU, bad quantization, CPU offload, or VRAM pressure |
| Good | 10–20 | Most self-hosted 70B–120B setups; our 17.3 t/s sits here |
| Strong | 20–35 | High-end GPUs (A100/H100), efficient quantization, clean pipeline |
| Excellent | 35–60+ | Multi-GPU sharding, tensor parallelism, kernel-level optimizations |
| Vendor tier | 100+ | Cloud APIs, batching, speculative decoding — not apples-to-apples |
Bottom line: ~17 output tokens/s = healthy, no obvious misconfiguration; not elite, not broken.
- TTFT — Responsiveness (e.g. 0.5 s vs 8 s) matters for UX; measure with
--ttft. - Concurrency — Throughput can drop under load (e.g. 17 t/s @ 1 user → 9 t/s @ 2 users); use
--concurrent 2or4to probe. Use--concurrent C --runs 5for mean ± std of aggregate t/s. - Variance — Use
--runs 5(and optional--warmup) for mean ± std. - All permutations —
--full-spreadruns warmup, TTFT, then C=1 (5 runs), C=2 (5 rounds), C=4 (5 rounds) and prints a summary table.
Benchmarks and Cursor/Goose use the OpenAI-compatible HTTP API (POST /v1/chat/completions) directly. We do not use LangChain for this:
- Benchmark: Needs precise timing (TTFT, wall time); direct HTTP gives full control and no extra framework overhead.
- Cursor/Goose: Use the configured provider (base URL + model id); no app-level LangChain in this repo.
If you add an application that chains prompts or tools, LangChain (or LangGraph) can be useful there; for measuring and configuring the 120B endpoint, direct HTTP is the right choice.
See: SMOKE_TEST_120B, 120B_SERVE_RUNBOOK, GOOSE_LLM_GX10_ACCESS.
To compare benchmark results with GPU, host, and LLM metrics from the same time window, use the Prometheus range query script. Metrics come from the gx10 telemetry exporter (port 9092) and are stored in Prometheus on Zane (TELEMETRY).
1. Record the benchmark period (optional)
When running a full-spread benchmark, write the run window to a JSON file:
python3 scripts/benchmark_120b_tokens_per_second.py --full-spread \
--record-period docs/evidence/benchmark_period_latest.jsonThe script prints the period and the correlate command.
2. Query Prometheus for that period
# Using the period file from the last run
python3 scripts/correlate_benchmark_prometheus.py --period-file docs/evidence/benchmark_period_latest.json
# Or last N minutes (e.g. right after a benchmark)
python3 scripts/correlate_benchmark_prometheus.py --minutes 10
# Or explicit start/end (ISO or unix)
python3 scripts/correlate_benchmark_prometheus.py \
--start 2026-03-17T20:43:15Z --end 2026-03-17T20:49:00Z3. Metrics summarized
| Metric | Description |
|---|---|
| GPU utilization % | gx10_gpu_utilization_percent — correlate with throughput (e.g. C=1 vs C=4). |
| GPU memory used | gx10_gpu_memory_bytes state=used — check for pressure. |
| Host load | gx10_host_load 1m — CPU load during benchmark. |
| 120B up | gx10_llm_state port=8002 state=up — should be 1 for the whole period. |
JSON output: Add --json for machine-readable min/max/mean per metric. Set PROMETHEUS_URL if Prometheus is not at http://zane.tail3dac72.ts.net/prometheus.
An HTML report embeds benchmark summary and Prometheus time series in one page (bar chart + line charts). Generate it after a full-spread run that wrote a period file:
# 1. Run benchmark and record period (includes results in JSON)
python3 scripts/benchmark_120b_tokens_per_second.py --full-spread \
--record-period docs/evidence/benchmark_period_latest.json
# 2. Generate report (fetches Prometheus for same period, builds HTML)
python3 scripts/benchmark_prometheus_report.py \
--period-file docs/evidence/benchmark_period_latest.json \
--output docs/evidence/benchmark_report.htmlOpen docs/evidence/benchmark_report.html in a browser. The report contains:
| Chart | Content |
|---|---|
| Bar chart | TTFT (s) and throughput (t/s) for C=1, C=2, C=4 (with ±std in labels). |
| GPU over time | GPU utilization % and GPU memory used (GB) during the period. |
| Host & LLM over time | Host load 1m and 120B-up (1=yes) over the period. |
Data is embedded in the HTML (no live Prometheus in the browser). If Prometheus had no data for the period, the line charts will be empty but the benchmark bar chart still shows. Use PROMETHEUS_URL if your Prometheus is not at the default Zane URL.