@louspringer
Created March 17, 2026 20:15

You ran a single-shot throughput probe against a 120B model endpoint. Let’s decode it cleanly, then define what “good / better / best” actually means in this context.


What you actually measured (no fluff)

  • Prompt tokens: 33 → trivial input, not the bottleneck
  • Completion tokens: 256 → fixed-size output sample
  • Wall time: 14.78 s → end-to-end latency (includes overhead)
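  • Throughput: 17.33 tokens/sec → completion tokens ÷ wall time (256 / 14.78), i.e. an end-to-end rate with overhead included

👉 Translation: Your box (gx10) is generating ~17 tokens per second end to end; its steady-state decode speed is at least that.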
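
The arithmetic behind that figure is small enough to check by hand. A minimal sketch: the function names are made up for illustration, and the 1.0 s TTFT in the second call is a placeholder, not something the probe measured.

```python
def tokens_per_second(completion_tokens: int, wall_seconds: float) -> float:
    """End-to-end rate: every completion token divided by total wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return completion_tokens / wall_seconds


def decode_only_rate(completion_tokens: int, wall_seconds: float,
                     ttft_seconds: float) -> float:
    """Rate after the first token: the remaining N-1 tokens over (wall - TTFT)."""
    if not 0 <= ttft_seconds < wall_seconds:
        raise ValueError("ttft_seconds must be in [0, wall_seconds)")
    return (completion_tokens - 1) / (wall_seconds - ttft_seconds)


# Numbers from the probe: 256 completion tokens in 14.78 s.
print(f"end-to-end: {tokens_per_second(256, 14.78):.2f} tokens/sec")

# Hypothetical TTFT of 1.0 s, only to show which way the correction goes.
print(f"decode-only (assumed TTFT): {decode_only_rate(256, 14.78, 1.0):.2f} tokens/sec")
```

The first call lands at roughly 17.32, a hair under the reported 17.33, which is consistent with the wall time having been rounded for display. Any nonzero TTFT pushes the decode-only rate above the end-to-end one.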


What matters vs what doesn’t

What matters

  • Output tokens/sec → core generation throughput (GPU + model + quantization)
  • Time-to-first-token (TTFT) → responsiveness (not shown here)
  • Concurrency behavior → does it collapse under load?

What doesn’t (much here)

  • Prompt tokens (too small to matter)
  • Single-run wall time (no variance / no warm vs cold separation)

Ground truth ranges (120B-class models)

These are realistic field ranges, not vendor fantasy numbers:

🟥 Poor

  • < 10 tokens/sec

  • Symptoms:

    • Underpowered GPU or bad quantization
    • CPU offload choking
    • VRAM pressure / swapping

🟨 Acceptable / “Good”

  • 10–20 tokens/sec
  • This is where most self-hosted 70B–120B setups land
  • Your result (17.3 t/s) is here

🟩 Strong / “Better”

  • 20–35 tokens/sec

  • Requires:

    • High-end GPUs (A100/H100 class or well-optimized multi-GPU)
    • Efficient quantization (AWQ, GPTQ, etc.)
    • Clean pipeline (no I/O stalls)

🟦 Excellent / “Best-in-class self-hosted”

  • 35–60+ tokens/sec

  • Typically:

    • Multi-GPU sharding done right
    • Tensor parallelism tuned
    • Kernel-level optimizations
    • Minimal overhead stack

🟪 “Unfair comparison” tier

  • 100+ tokens/sec

  • That’s:

    • Vendor infra (OpenAI, Anthropic, etc.)
    • Custom kernels + batching + speculative decoding
    • Not apples-to-apples with your setup
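
The tiers above fit in a small lookup, which is handy once you start logging many runs. A sketch under two assumptions: boundaries are half-open (exactly 20 counts as "better"), and the unlabeled gap between "60+" and "100+" is folded into "best". The function name and labels are illustrative.

```python
def classify_throughput(tokens_per_sec: float) -> str:
    """Map a single-stream tokens/sec figure onto the tiers listed above."""
    if tokens_per_sec < 0:
        raise ValueError("tokens_per_sec must be non-negative")
    if tokens_per_sec < 10:
        return "poor"
    if tokens_per_sec < 20:
        return "good"
    if tokens_per_sec < 35:
        return "better"
    if tokens_per_sec < 100:
        # Assumption: the 60-100 range is not labeled above, so it is
        # treated as part of the best-in-class self-hosted tier.
        return "best"
    return "unfair comparison"


for tps in (8.0, 17.33, 28.0, 45.0, 120.0):
    print(f"{tps:>6.2f} t/s -> {classify_throughput(tps)}")
```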

Where your result sits

17.3 tokens/sec = solidly “good”

Interpretation:

  • Your system is healthy
  • No obvious misconfiguration
  • Likely not GPU-maxed, but also not broken

What this implies about gx10

Based on that number alone:

  • You’re probably:

    • Running quantized 120B (not full precision)
    • On mid-to-high tier GPU(s) but not fully saturated
  • Bottlenecks could be:

    • Memory bandwidth
    • Interconnect (if multi-GPU)
    • Kernel efficiency (common with Qwen builds)

What’s missing (and more important than the number)

This is a smoke test, not a performance characterization.

You don’t yet know:

  1. TTFT (time to first token)

    • Could be 0.5s or 8s → huge UX difference
  2. Throughput under concurrency

    • 17 t/s single-user might become:

      • 9 t/s @ 2 users
      • 3 t/s @ 4 users
    • That’s where systems actually fail

  3. Token-rate variance

    • Was this steady or bursty?
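
TTFT and burstiness both fall out of one thing the single-shot probe did not record: the arrival time of each streamed token. A minimal sketch, assuming you can capture those timestamps from a streaming response; `summarize_stream` and the synthetic timestamps below are illustrative, not output from gx10.

```python
from statistics import mean, pstdev


def summarize_stream(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, decode rate, and inter-token gap stats from arrival times."""
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    decode_window = token_times[-1] - token_times[0]
    return {
        "ttft_s": token_times[0] - request_start,
        # Tokens after the first, divided by the time they took to arrive.
        "decode_tps": (len(token_times) - 1) / decode_window,
        "mean_gap_s": mean(gaps),
        # A large spread or max gap relative to the mean indicates bursty output.
        "gap_stdev_s": pstdev(gaps),
        "max_gap_s": max(gaps),
    }


# Synthetic example: first token at 0.5 s, then one token every 50 ms.
times = [0.5 + 0.05 * i for i in range(5)]
for key, value in summarize_stream(0.0, times).items():
    print(f"{key}: {value:.3f}")
```

In a real run, `request_start` and `token_times` would come from `time.perf_counter()` calls around the request and inside the stream-reading loop.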

If you want to move from “good” → “better”

High-impact levers:

1. Quantization tuning

  • Move to AWQ / GPTQ tuned for your hardware
  • Bad quantization can cost 30–50% throughput

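2. Batching / speculative decoding

  • If supported → large gains: batching raises aggregate throughput across concurrent requests, speculative decoding raises single-stream speed
  • This is how vendors hit absurd speeds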

3. GPU utilization check

  • If GPU utilization sits below 90% during generation → you’re likely leaving performance on the table

4. KV cache + memory layout

  • Often the hidden limiter on large models

Bottom line

  • Your number (17.3 t/s) is:

    • Not slow
    • Not elite
    • Exactly where a sane 120B local deployment should be

If you told me this was broken, I’d push back. It isn’t.


If you want, the next step is to turn this into a proper benchmark harness:

  • TTFT
  • sustained throughput
  • concurrency curve
  • cost-per-token (the one that actually matters)

That’s where the real signal shows up.
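
A rough shape for the concurrency-curve and cost-per-token parts of that harness. This is a sketch, not a finished tool: `simulated_stream` is a stand-in for a real streaming client, and the $2.00/hour figure is a made-up placeholder. Because the simulated streams are just overlapping sleeps, aggregate throughput here scales almost linearly with concurrency; a real GPU endpoint will not, and that gap is exactly what the curve is meant to expose.

```python
import asyncio
import time


async def simulated_stream(n_tokens: int, per_token_s: float):
    """Hypothetical stand-in for a streaming client: one token per sleep."""
    for _ in range(n_tokens):
        await asyncio.sleep(per_token_s)
        yield "tok"


async def consume(stream) -> int:
    """Drain a token stream and return how many tokens arrived."""
    count = 0
    async for _ in stream:
        count += 1
    return count


async def aggregate_tps(concurrency: int, n_tokens: int, per_token_s: float) -> float:
    """Run `concurrency` streams at once; return total tokens per wall second."""
    start = time.perf_counter()
    counts = await asyncio.gather(
        *(consume(simulated_stream(n_tokens, per_token_s)) for _ in range(concurrency))
    )
    return sum(counts) / (time.perf_counter() - start)


def cost_per_million_tokens(hourly_cost: float, tokens_per_sec: float) -> float:
    """Cost of generating one million tokens at a sustained rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return hourly_cost / (tokens_per_sec * 3600) * 1_000_000


HOURLY_COST = 2.00  # placeholder: amortized hardware + power, not a real figure

for users in (1, 2, 4):
    tps = asyncio.run(aggregate_tps(users, n_tokens=20, per_token_s=0.01))
    cost = cost_per_million_tokens(HOURLY_COST, tps)
    print(f"{users} user(s): {tps:.1f} t/s aggregate, ${cost:.2f} per 1M tokens")
```

To point it at gx10, replace `simulated_stream` with an async generator that yields tokens from the real endpoint and keep the rest unchanged.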
