@erudenko
Created March 5, 2026 02:16
LLM Speed Benchmark: 6 Models, OpenRouter vs Direct API — claudish + Claude Code

LLM Speed Benchmark: 6 Models, OpenRouter vs Direct API

A practical speed benchmark comparing 6 frontier coding LLMs on the same task, routed through claudish — an open-source proxy that lets Claude Code use any AI model.

Each model is tested via two routes: OpenRouter (proxy) and the provider's native Direct API.

What We Tested

Task: Generate a TypeScript function parseQueryParams — parse URL query parameters into Record<string, string>, handling edge cases (missing values, duplicate keys, encoded characters), with JSDoc.

This is a representative real-world coding task: small, well-defined, requires understanding of URL parsing, TypeScript types, and documentation conventions.

Models tested:

Model                   Provider     OpenRouter ID                  Context  In $/M  Out $/M
--------------------------------------------------------------------------------------------
MiniMax M2.5            MiniMax      minimax/minimax-m2.5              197K   $0.29    $1.20
Kimi K2.5               Moonshot AI  moonshotai/kimi-k2.5              262K   $0.45    $2.20
GLM-5                   Zhipu AI     z-ai/glm-5                        203K   $0.80    $2.56
Gemini 3 Flash Preview  Google       google/gemini-3-flash-preview    1049K   $0.50    $3.00
GPT-5.1 Codex Mini      OpenAI       openai/gpt-5.1-codex-mini         400K   $0.25    $2.00
Qwen3.5 Plus            Alibaba      qwen/qwen3.5-plus-02-15          1000K   $0.26    $1.56

Prices from OpenRouter as of March 5, 2026.

Methodology

  • 5 rounds of the identical prompt, all 12 model-routes (6 models x 2 routes) launched in parallel per round
  • Timing: wall-clock ms from claudish invocation to completion (includes proxy overhead)
  • Two routes per model: OpenRouter (OR) and Direct API (native provider endpoint)
  • Single-shot mode with --json output, no system prompts, no conversation history — cold start each time
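  • Direct API routes use claudish provider shortcuts: g@ (Google), oai@ (OpenAI), kimi@ (Moonshot), mm@ (MiniMax), glm@ (Zhipu)
  • Qwen has no direct route in claudish, so its "Direct" entry falls back to OpenRouter; the two Qwen rows are repeat measurements of the same route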
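
The round structure described above (launch everything at once, time each call by wall clock, wait for all to finish) can be sketched in Python as follows. This is only an illustration under assumptions: `time_command` and `run_round` are hypothetical helper names, and the demo commands are placeholders standing in for the real claudish invocations.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def time_command(argv: list[str]) -> tuple[int, float]:
    """Run one command and return (exit_code, elapsed wall-clock ms)."""
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return proc.returncode, elapsed_ms


def run_round(commands: dict[str, list[str]]) -> dict[str, tuple[int, float]]:
    """Start every command in parallel and wait for all of them (one round)."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = {tag: pool.submit(time_command, argv) for tag, argv in commands.items()}
        return {tag: fut.result() for tag, fut in futures.items()}


# Placeholder commands so the sketch runs anywhere; in the benchmark each
# entry would be one model-route invocation.
demo = {
    "fast": [sys.executable, "-c", "pass"],
    "slow": [sys.executable, "-c", "import time; time.sleep(0.2)"],
}
for tag, (code, ms) in run_round(demo).items():
    print(f"{tag}: exit={code} {ms:.0f}ms")
```

Because all commands in a round share the machine and the network link, the measured times include any contention between them, which is one reason the figures are end-to-end latencies rather than pure inference times.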

Test Environment

Machine      MacBook Pro M1 Max, 64 GB RAM
OS           macOS 26.3 (Tahoe)
Network      ~1.8s baseline latency to OpenRouter API
Claude Code  v2.1.69
Claudish     v5.5.2
Date         March 5, 2026, ~12:50 PM UTC+2

Results

Full Leaderboard (sorted by mean speed)

#   Model                  Route      Mean     Min     Max  StdDev  OK   In $/M  Out $/M
----------------------------------------------------------------------------------------
1   Gemini 3 Flash         OR        32.6s   28.6s   41.7s    4.7s   5  $0.50   $3.00
2   GPT-5.1 Codex Mini     OR        32.7s   29.9s   41.5s    4.4s   5  $0.25   $2.00
3   GPT-5.1 Codex Mini     Direct    32.9s   28.2s   40.2s    4.1s   5  $0.25   $2.00
4   Gemini 3 Flash         Direct    33.4s   29.1s   41.7s    4.7s   5  $0.50   $3.00
5   MiniMax M2.5           OR        40.4s   31.5s   50.2s    6.2s   5  $0.29   $1.20
6   Qwen3.5 Plus           Direct    41.7s   37.5s   50.2s    4.6s   5  $0.26   $1.56
7   Qwen3.5 Plus           OR        42.7s   35.9s   57.9s    7.8s   5  $0.26   $1.56
8   Kimi K2.5              Direct    43.8s   35.5s   53.5s    5.9s   5  $0.45   $2.20
9   GLM-5                  OR        47.4s   35.7s   61.3s    9.1s   5  $0.80   $2.56
10  Kimi K2.5              OR        48.7s   39.0s   65.6s    9.4s   5  $0.45   $2.20
11  MiniMax M2.5           Direct     FAIL       -       -       -   0  $0.29   $1.20
12  GLM-5                  Direct     FAIL       -       -       -   0  $0.80   $2.56

Direct vs OpenRouter Comparison

Model                   OR Mean   Direct     Diff   Faster
----------------------------------------------------------
Gemini 3 Flash            32.6s    33.4s     0.8s    OR   2%
GPT-5.1 Codex Mini        32.7s    32.9s     0.2s    OR   1%
Kimi K2.5                 48.7s    43.8s     4.9s Direct  10%
MiniMax M2.5              40.4s   FAILED        -       OR
Qwen3.5 Plus              42.7s    41.7s     1.0s Direct   2%
GLM-5                     47.4s   FAILED        -       OR
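
The Diff and Faster columns follow from the two means. A minimal sketch of that arithmetic, mirroring the formula in the script's report section (`compare_routes` is a hypothetical name; the inputs are the unrounded Kimi K2.5 means computed from the raw data):

```python
def compare_routes(or_mean_ms: float, direct_mean_ms: float) -> tuple[str, float, float]:
    """Return (faster route, absolute difference in ms, difference as % of the slower mean)."""
    diff = or_mean_ms - direct_mean_ms
    winner = "Direct" if diff > 0 else "OR"
    pct = abs(diff) / max(or_mean_ms, direct_mean_ms) * 100
    return winner, abs(diff), pct


# Kimi K2.5: OR mean 48736.6 ms, Direct mean 43838.8 ms
winner, diff_ms, pct = compare_routes(48736.6, 43838.8)
print(f"{winner} faster by {diff_ms / 1000:.1f}s ({pct:.0f}%)")
```

Note that the percentage is relative to the slower of the two means, so it reads as "the faster route saved this share of the slower route's time".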

Raw Data (ms per round)

Model (Route)                    R1     R2     R3     R4     R5
---------------------------------------------------------------
Gemini 3 Flash (OR)           32362  28618  30177  30304  41705
GPT-5.1 Codex Mini (OR)       30522  31153  30536  29894  41513
GPT-5.1 Codex Mini (Direct)   34030  28189  31368  30930  40184
Gemini 3 Flash (Direct)       35521  29132  30856  29927  41668
MiniMax M2.5 (OR)             42435  40693  37116  31497  50234
Qwen3.5 Plus (Direct)         41992  38196  40572  37493  50227
Qwen3.5 Plus (OR)             38661  35925  40021  41107  57929
Kimi K2.5 (Direct)            41590  35539  42686  53506  45873
GLM-5 (OR)                    39490  49774  50911  35738  61322
Kimi K2.5 (OR)                40626  38983  65555  49001  49518
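
Each leaderboard row can be reproduced from its raw timings. The sketch below does this for one row; `summarize` is a hypothetical helper, and it uses the population standard deviation (dividing by N), which is what the script's formula computes.

```python
import statistics


def summarize(times_ms: list[int]) -> dict[str, float]:
    """Mean, min, max and population standard deviation of one route's timings."""
    return {
        "mean": statistics.fmean(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
        "std": statistics.pstdev(times_ms),  # divides by N, like the script
    }


gemini_or = [32362, 28618, 30177, 30304, 41705]
row = summarize(gemini_or)
print(
    f"mean={row['mean'] / 1000:.1f}s min={row['min'] / 1000:.1f}s "
    f"max={row['max'] / 1000:.1f}s std={row['std'] / 1000:.1f}s"
)
# prints: mean=32.6s min=28.6s max=41.7s std=4.7s
```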

Key Findings

1. Gemini 3 Flash and GPT-5.1 Codex Mini are virtually tied for speed

Both consistently landed at ~32-33s mean. The race was so close that the ranking flipped between routes — Gemini won via OpenRouter, GPT won via Direct. For practical purposes, they're equal.

2. GPT-5.1 Codex Mini is the best value

At $0.25/M input + $2.00/M output, it undercuts Gemini 3 Flash ($0.50/M + $3.00/M) on both prices while matching it on speed. If you want the fastest tier and are cost-sensitive, this is the clear pick.

3. OpenRouter adds almost zero overhead for fast models

For Gemini and GPT, Direct API was actually 0.2-0.8s slower than OpenRouter — within noise. The proxy overhead is negligible for fast models because the routing time is tiny compared to inference time.

4. Direct API is notably faster for Kimi

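Kimi K2.5 averaged 10% faster via Direct API (43.8s vs 48.7s). This may mean OpenRouter's routing adds latency for this provider, but the 4.9s gap is smaller than either route's standard deviation (5.9s Direct, 9.4s OR), and the OR mean is pulled up by a single 65.6s run in Round 3. Five rounds are not enough to confirm the effect.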
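
One rough way to judge whether a gap like this stands out from run-to-run noise is Welch's t statistic, which divides the difference of the means by its estimated standard error. This check is not part of the benchmark script; it is a sketch using the raw Kimi K2.5 timings, with `welch_t` as a hypothetical helper name.

```python
import math
import statistics


def welch_t(a: list[int], b: list[int]) -> float:
    """Welch's t statistic for two independent samples with unequal variances."""
    var_a = statistics.variance(a)  # sample variance (divides by N-1)
    var_b = statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / std_err


kimi_or = [40626, 38983, 65555, 49001, 49518]
kimi_direct = [41590, 35539, 42686, 53506, 45873]
t = welch_t(kimi_or, kimi_direct)
print(f"t = {t:.2f}")
```

With these ten timings the statistic comes out at roughly 0.9, well below the values of about 2 or more that are usually taken as evidence of a real difference at this sample size.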

5. MiniMax M2.5 is the budget king

At $0.29/M in + $1.20/M out (cheapest output pricing), it's competitive at 40s mean. If you don't need sub-35s response times, it's the most cost-effective option.
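
The price comparisons in findings 2 and 5 can be turned into a per-request cost. The token counts below are assumptions for illustration only (the benchmark did not record token usage), and `request_cost` is a hypothetical helper; the prices are the ones listed in the model table.

```python
# (input $/M tokens, output $/M tokens), from the model table
PRICES = {
    "MiniMax M2.5": (0.29, 1.20),
    "Kimi K2.5": (0.45, 2.20),
    "GLM-5": (0.80, 2.56),
    "Gemini 3 Flash": (0.50, 3.00),
    "GPT-5.1 Codex Mini": (0.25, 2.00),
    "Qwen3.5 Plus": (0.26, 1.56),
}


def request_cost(price_in: float, price_out: float, tokens_in: int, tokens_out: int) -> float:
    """Cost in dollars of one request, given per-million-token prices."""
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


# Assumed sizes: a short prompt and a few hundred tokens of generated code.
TOKENS_IN, TOKENS_OUT = 200, 600

costs = {name: request_cost(p_in, p_out, TOKENS_IN, TOKENS_OUT) for name, (p_in, p_out) in PRICES.items()}
for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} ${cost:.6f}")
```

For output-heavy tasks like code generation the output price dominates, which is why the ranking tracks the Out $/M column closely.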

6. GLM-5 is the worst value

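Second-slowest route (47.4s via OpenRouter; only Kimi K2.5 via OpenRouter was slower at 48.7s), second most inconsistent (9.1s std dev vs 9.4s for Kimi K2.5 via OpenRouter), AND the most expensive input ($0.80/M) with the second most expensive output ($2.56/M, behind Gemini 3 Flash at $3.00/M). Hard to recommend over any competitor.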

7. All models slowed down in Round 5

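Eight of the ten working routes were 27-49% slower in Round 5 than their average over Rounds 1-4. The two Kimi K2.5 routes were only 2-6% slower, but both had already spiked in earlier rounds (65.6s in R3 via OpenRouter, 53.5s in R4 via Direct). This likely reflects increased load (time-of-day effect) or rate limiting from running 12 parallel requests per round.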
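
The Round 5 effect can be checked per route by comparing R5 with the mean of R1-R4. A short sketch over the raw data table (`round5_slowdown` is a hypothetical helper name):

```python
import statistics

# Raw timings in ms per round (R1..R5), copied from the raw data table
RAW_MS = {
    "Gemini 3 Flash (OR)": [32362, 28618, 30177, 30304, 41705],
    "GPT-5.1 Codex Mini (OR)": [30522, 31153, 30536, 29894, 41513],
    "GPT-5.1 Codex Mini (Direct)": [34030, 28189, 31368, 30930, 40184],
    "Gemini 3 Flash (Direct)": [35521, 29132, 30856, 29927, 41668],
    "MiniMax M2.5 (OR)": [42435, 40693, 37116, 31497, 50234],
    "Qwen3.5 Plus (Direct)": [41992, 38196, 40572, 37493, 50227],
    "Qwen3.5 Plus (OR)": [38661, 35925, 40021, 41107, 57929],
    "Kimi K2.5 (Direct)": [41590, 35539, 42686, 53506, 45873],
    "GLM-5 (OR)": [39490, 49774, 50911, 35738, 61322],
    "Kimi K2.5 (OR)": [40626, 38983, 65555, 49001, 49518],
}


def round5_slowdown(times_ms: list[int]) -> float:
    """Percentage by which R5 exceeds the mean of R1-R4."""
    baseline = statistics.fmean(times_ms[:4])
    return (times_ms[4] / baseline - 1) * 100


slowdowns = {tag: round5_slowdown(times) for tag, times in RAW_MS.items()}
for tag, pct in sorted(slowdowns.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag:<30} {pct:+5.1f}%")
```

With these figures most routes land between roughly 27% and 49%, while the two Kimi K2.5 routes come out in the low single digits because their earlier rounds already contained slow runs.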

How to Run This Yourself

Prerequisites

  1. Install Claude Code
  2. Install claudish: npm install -g claudish
  3. Set your API keys:
    export OPENROUTER_API_KEY='...'       # Required for OpenRouter routes
    export GEMINI_API_KEY='...'           # For g@ direct
    export OPENAI_API_KEY='...'           # For oai@ direct
    export MOONSHOT_API_KEY='...'         # For kimi@ direct
    export MINIMAX_API_KEY='...'          # For mm@ direct
    export ZHIPU_API_KEY='...'            # For glm@ direct
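
Before a long run it can help to confirm which keys are actually exported, since a missing key only shows up as a failed route. A small preflight sketch, using the variable names listed above (`missing_keys` is a hypothetical helper, not part of claudish):

```python
import os
from collections.abc import Mapping

REQUIRED = ["OPENROUTER_API_KEY"]
# Optional keys and the direct-route shortcut each one enables
OPTIONAL = {
    "GEMINI_API_KEY": "g@",
    "OPENAI_API_KEY": "oai@",
    "MOONSHOT_API_KEY": "kimi@",
    "MINIMAX_API_KEY": "mm@",
    "ZHIPU_API_KEY": "glm@",
}


def missing_keys(names, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


for name in missing_keys(REQUIRED):
    print(f"missing required key: {name}")
for name in missing_keys(OPTIONAL):
    print(f"missing optional key: {name} (direct route {OPTIONAL[name]} will fail)")
```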

Run

# Download the test script
curl -O https://gist.githubusercontent.com/.../speed-test.sh
chmod +x speed-test.sh

# Run with default 5 rounds
./speed-test.sh

# Run with custom rounds
./speed-test.sh 10

Customize Models

Edit the OR_MODELS and DIRECT_MODELS arrays in the script. Find model IDs with:

claudish --models <search-term>
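
Each array entry is a pipe-separated record: model_id | label | route | $/M input | $/M output. When editing the arrays, a quick parse like the sketch below can catch a malformed entry before a run; `ModelRoute` and `parse_entry` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRoute:
    model_id: str
    label: str
    route: str
    price_in: float   # $ per million input tokens
    price_out: float  # $ per million output tokens


def parse_entry(entry: str) -> ModelRoute:
    """Parse one 'model_id|label|route|in|out' entry; raises ValueError if malformed."""
    model_id, label, route, price_in, price_out = entry.split("|")
    return ModelRoute(model_id, label, route, float(price_in), float(price_out))


glm = parse_entry("openrouter@z-ai/glm-5|GLM-5|OR|0.80|2.56")
print(glm)
```

The Python report section keeps its own hardcoded copy of the labels and prices, so any change to the arrays has to be mirrored there in the same order.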

Caveats

  • End-to-end latency, not pure inference. Includes: claudish proxy startup, API routing, queue time, inference, and response streaming.
  • Single task type — results may differ for longer prompts, multi-turn, or different languages.
  • 5 rounds shows trends but isn't statistically rigorous. For publication-grade results, run 20+ rounds.
  • Time-of-day effects: load patterns vary. Our R5 slowdown is consistent with this, although rate limiting from the 12 parallel requests could explain it as well.
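  • Direct API failures for MiniMax (auth format mismatch) and GLM (env var naming) are claudish-specific, not model issues.
  • The Qwen3.5 Plus "Direct" rows are not a native-endpoint measurement. claudish has no direct Qwen route, so both Qwen rows went through OpenRouter, and the 1.0s gap between them is a rough indication of run-to-run noise.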
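
To put the "5 rounds vs 20+ rounds" caveat in numbers, the standard error of the mean shrinks with the square root of the number of rounds. The sketch below uses the Gemini 3 Flash (OR) timings and assumes, purely for illustration, that a 20-round run would show the same spread.

```python
import math
import statistics


def standard_error(times_ms: list[int]) -> float:
    """Standard error of the mean, using the sample standard deviation."""
    return statistics.stdev(times_ms) / math.sqrt(len(times_ms))


gemini_or = [32362, 28618, 30177, 30304, 41705]
se_5 = standard_error(gemini_or)
# Hypothetical: same spread, but 20 rounds instead of 5
se_20 = statistics.stdev(gemini_or) / math.sqrt(20)
print(f"standard error with 5 rounds:  {se_5 / 1000:.1f}s")
print(f"standard error with 20 rounds: {se_20 / 1000:.1f}s (assuming the same spread)")
```

With five rounds the uncertainty on this mean is on the order of 2s, larger than the 0.2-0.8s gaps between the top four routes; quadrupling the rounds halves it.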

License

MIT — use freely, attribution appreciated.


Tested with claudish v5.5.2 on March 5, 2026.

#!/bin/bash
# ============================================================================
# LLM Speed Benchmark via claudish (Claude Code + OpenRouter / Direct APIs)
# ============================================================================
# Runs the same coding task on multiple models in parallel, repeats N rounds,
# and reports mean/min/max times with pricing (input & output separately).
#
# Tests each model via TWO routes:
# 1. OpenRouter (consistent proxy)
# 2. Direct API (native provider endpoint)
#
# Prerequisites:
# - Claude Code: https://docs.anthropic.com/en/docs/claude-code
# - claudish: npm install -g claudish (https://github.com/MadAppGang/claude-code)
# - API keys: OPENROUTER_API_KEY (required)
# GEMINI_API_KEY, OPENAI_API_KEY, MINIMAX_API_KEY,
# MOONSHOT_API_KEY, ZHIPU_API_KEY (optional, for direct tests)
#
# Usage:
# ./speed-test.sh # 5 rounds (default)
# ./speed-test.sh 3 # 3 rounds
# ============================================================================
set -euo pipefail
ROUNDS=${1:-5}
TASK='Write a TypeScript function called `parseQueryParams` that takes a URL string and returns a Record<string, string> of query parameters. Handle edge cases like missing values, duplicate keys (last wins), and encoded characters. Include JSDoc comment. Output ONLY the code, no explanation.'
OUTDIR="/tmp/speed-test-$(date +%s)"
mkdir -p "$OUTDIR"
# --- Models config: model_id | label | route | $/M input | $/M output ---
# OpenRouter routes
OR_MODELS=(
"openrouter@minimax/minimax-m2.5|MiniMax M2.5|OR|0.29|1.20"
"openrouter@moonshotai/kimi-k2.5|Kimi K2.5|OR|0.45|2.20"
"openrouter@z-ai/glm-5|GLM-5|OR|0.80|2.56"
"openrouter@google/gemini-3-flash-preview|Gemini 3 Flash|OR|0.50|3.00"
"openrouter@openai/gpt-5.1-codex-mini|GPT-5.1 Codex Mini|OR|0.25|2.00"
"openrouter@qwen/qwen3.5-plus-02-15|Qwen3.5 Plus|OR|0.26|1.56"
)
# Direct API routes (same models, native endpoints)
# NOTE: the model IDs after mm@, kimi@ and oai@ are assumptions that mirror the
# OpenRouter IDs; confirm them with `claudish --models <search-term>`.
DIRECT_MODELS=(
  "mm@minimax-m2.5|MiniMax M2.5|Direct|0.29|1.20"
  "kimi@kimi-k2.5|Kimi K2.5|Direct|0.45|2.20"
  "glm@glm-5|GLM-5|Direct|0.80|2.56"
  "g@gemini-3-flash-preview|Gemini 3 Flash|Direct|0.50|3.00"
  "oai@gpt-5.1-codex-mini|GPT-5.1 Codex Mini|Direct|0.25|2.00"
  "openrouter@qwen/qwen3.5-plus-02-15|Qwen3.5 Plus|Direct|0.26|1.56"
)
# Note: Qwen has no direct API via claudish, so its "Direct" entry falls back to OpenRouter
# Combine all models
ALL_MODELS=("${OR_MODELS[@]}" "${DIRECT_MODELS[@]}")
NUM_MODELS=${#ALL_MODELS[@]}
echo "=== LLM Speed Benchmark ($ROUNDS rounds x $NUM_MODELS model-routes) ==="
echo "Task: TypeScript parseQueryParams function"
echo "Output: $OUTDIR"
echo ""
# --- Run all rounds ---
for round in $(seq 1 "$ROUNDS"); do
  echo "--- Round $round/$ROUNDS ---"
  mkdir -p "$OUTDIR/round-$round"
  for i in $(seq 0 $((NUM_MODELS - 1))); do
    entry="${ALL_MODELS[$i]}"
    model=$(echo "$entry" | cut -d'|' -f1)
    label=$(echo "$entry" | cut -d'|' -f2)
    route=$(echo "$entry" | cut -d'|' -f3)
    tag="$label ($route)"
    (
      start=$(python3 -c 'import time; print(int(time.time()*1000))')
      # Capture the exit code with `||`: under `set -e` a bare failing command
      # would abort this subshell before the failure could be recorded.
      exit_code=0
      claudish --model "$model" --json "$TASK" \
        > "$OUTDIR/round-$round/$i.json" 2> "$OUTDIR/round-$round/$i.err" || exit_code=$?
      end=$(python3 -c 'import time; print(int(time.time()*1000))')
      elapsed=$(( end - start ))
      if [ "$exit_code" -eq 0 ] && [ -s "$OUTDIR/round-$round/$i.json" ]; then
        echo "$elapsed" > "$OUTDIR/round-$round/$i.time"
        echo " R$round $tag: ${elapsed}ms"
      else
        echo "ERR" > "$OUTDIR/round-$round/$i.time"
        err=$(head -1 "$OUTDIR/round-$round/$i.err" 2>/dev/null || true)
        echo " R$round $tag: FAILED ($err)"
      fi
    ) &
  done
  # Block until every model-route in this round has finished
  wait
  echo ""
done
# --- Calculate and display results ---
echo "=== Results ==="
echo ""
python3 - "$OUTDIR" "$ROUNDS" "$NUM_MODELS" <<'PYEOF'
import sys, os, math

outdir = sys.argv[1]
rounds = int(sys.argv[2])
num_models = int(sys.argv[3])

# Hardcoded model metadata (must match the order of ALL_MODELS in the shell script)
meta = [
    # OpenRouter
    ("MiniMax M2.5", "OR", 0.29, 1.20),
    ("Kimi K2.5", "OR", 0.45, 2.20),
    ("GLM-5", "OR", 0.80, 2.56),
    ("Gemini 3 Flash", "OR", 0.50, 3.00),
    ("GPT-5.1 Codex Mini", "OR", 0.25, 2.00),
    ("Qwen3.5 Plus", "OR", 0.26, 1.56),
    # Direct
    ("MiniMax M2.5", "Direct", 0.29, 1.20),
    ("Kimi K2.5", "Direct", 0.45, 2.20),
    ("GLM-5", "Direct", 0.80, 2.56),
    ("Gemini 3 Flash", "Direct", 0.50, 3.00),
    ("GPT-5.1 Codex Mini", "Direct", 0.25, 2.00),
    ("Qwen3.5 Plus", "Direct", 0.26, 1.56),
]

results = []
for i in range(num_models):
    label, route, price_in, price_out = meta[i]
    times = []
    for r in range(1, rounds + 1):
        tf = os.path.join(outdir, f"round-{r}", f"{i}.time")
        try:
            with open(tf) as fh:
                val = fh.read().strip()
            if val != "ERR":
                times.append(int(val))
        except (OSError, ValueError):
            # Missing or malformed timing file: count the round as failed
            pass
    if times:
        mean = sum(times) / len(times)
        mn = min(times)
        mx = max(times)
        # Population standard deviation (divides by N, not N-1)
        std = math.sqrt(sum((t - mean) ** 2 for t in times) / len(times))
    else:
        # Sentinel mean so that fully failed routes sort to the bottom
        mean, mn, mx, std = 999999, 0, 0, 0
    results.append({
        "label": label, "route": route,
        "price_in": price_in, "price_out": price_out,
        "times": times, "mean": mean, "min": mn, "max": mx,
        "std": std, "ok": len(times),
    })

# --- Table 1: All results sorted by mean ---
results.sort(key=lambda x: x["mean"])
hdr = f"{'#':<3} {'Model':<22} {'Route':<7} {'Mean':>7} {'Min':>7} {'Max':>7} {'StdDev':>7} {'OK':>3} {'In $/M':>7} {'Out $/M':>8}"
print(hdr)
print("-" * len(hdr))
for rank, r in enumerate(results, 1):
    if r["ok"] == 0:
        print(f"{rank:<3} {r['label']:<22} {r['route']:<7} {'FAIL':>7} {'-':>7} {'-':>7} {'-':>7} {r['ok']:>3} ${r['price_in']:<6.2f} ${r['price_out']:<7.2f}")
    else:
        print(f"{rank:<3} {r['label']:<22} {r['route']:<7} {r['mean']/1000:>6.1f}s {r['min']/1000:>6.1f}s {r['max']/1000:>6.1f}s {r['std']/1000:>6.1f}s {r['ok']:>3} ${r['price_in']:<6.2f} ${r['price_out']:<7.2f}")

# --- Table 2: Direct vs OpenRouter comparison ---
print()
print("=== Direct vs OpenRouter (speed difference) ===")
print()
or_map = {r["label"]: r for r in results if r["route"] == "OR" and r["ok"] > 0}
dr_map = {r["label"]: r for r in results if r["route"] == "Direct" and r["ok"] > 0}
hdr2 = f"{'Model':<22} {'OR Mean':>8} {'Direct':>8} {'Diff':>8} {'Faster':>8}"
print(hdr2)
print("-" * len(hdr2))
for label in ["Gemini 3 Flash", "GPT-5.1 Codex Mini", "Kimi K2.5", "MiniMax M2.5", "Qwen3.5 Plus", "GLM-5"]:
    or_r = or_map.get(label)
    dr_r = dr_map.get(label)
    if or_r and dr_r:
        or_s = f"{or_r['mean']/1000:.1f}s"
        dr_s = f"{dr_r['mean']/1000:.1f}s"
        diff = or_r['mean'] - dr_r['mean']
        diff_s = f"{abs(diff)/1000:.1f}s"
        winner = "Direct" if diff > 0 else "OR"
        # Percentage is relative to the slower of the two means
        pct = abs(diff) / max(or_r['mean'], dr_r['mean']) * 100
        print(f"{label:<22} {or_s:>8} {dr_s:>8} {diff_s:>8} {winner:>5} {pct:>3.0f}%")
    elif or_r:
        or_s = f"{or_r['mean']/1000:.1f}s"
        print(f"{label:<22} {or_s:>8} {'FAILED':>8} {'-':>8} {'OR':>8}")
    elif dr_r:
        dr_s = f"{dr_r['mean']/1000:.1f}s"
        print(f"{label:<22} {'FAILED':>8} {dr_s:>8} {'-':>8} {'Direct':>8}")
    else:
        print(f"{label:<22} {'FAILED':>8} {'FAILED':>8} {'-':>8} {'-':>8}")

# --- Raw data ---
print()
print("--- Raw times (ms) per round ---")
for r in results:
    tag = f"{r['label']} ({r['route']})"
    vals = " ".join(f"{t:>6}" for t in r["times"]) if r["times"] else "no data"
    print(f" {tag:<30} {vals}")
print()
print(f"Rounds: {rounds} | Sorted by mean (fastest first)")
print(f"Output: {outdir}")
PYEOF
echo ""
echo "Done! View model outputs: cat $OUTDIR/round-1/0.json | python3 -m json.tool"