Analysis Date: May 21, 2026
Benchmark: GPQA Diamond (graduate-level scientific reasoning)
Workload: 5M input tokens, 300K output tokens, 40M cache reads
Verification: Actual GLM 5.1 usage data (40.7M cache, 5.2M input, 337K output = $24.15)
This analysis set out to identify the "best value" text LLMs available via Venice.ai's API for agentic problem-solving workloads. Using GPQA Diamond as the benchmark, we calculate a "value score" (GPQA percentage divided by total cost) to find models that deliver above-average reasoning per dollar.
The best bargains score roughly 75-88% on GPQA Diamond (about 80-93% of the top score of 94.3%) for $0.56-$2.06 per workload, while premium "Fast" variants charge 6× more than their standard counterparts for the same GPQA score.
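As a minimal sketch of the value score described above (GPQA percentage divided by total workload cost), the Python below recomputes a few scores from figures that appear in the tables of this analysis. The function name `value_score` is an illustrative choice, not part of any API.

```python
def value_score(gpqa_percent: float, total_cost_usd: float) -> float:
    """Value score as defined in this analysis: GPQA Diamond % per dollar of workload cost."""
    if total_cost_usd <= 0:
        raise ValueError("total cost must be positive")
    return gpqa_percent / total_cost_usd


# (GPQA %, total workload cost in USD), copied from the tables in this analysis.
# Scores marked "~" in the tables are entered here as plain numbers.
examples = {
    "GLM 4.7 Flash": (80.0, 0.65),
    "DeepSeek V4 Flash": (88.0, 2.06),
    "Claude Opus 4.7": (94.2, 69.00),
    "Claude Opus 4.7 Fast": (94.2, 414.00),
}

scores = {name: value_score(gpqa, cost) for name, (gpqa, cost) in examples.items()}

# Print from best value to worst.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {score:7.2f}")
```

The same division shows why the "Fast" variants rank last: an identical GPQA score at 6× the cost yields one sixth of the value score.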
All prices are per 1 million tokens (confirmed via actual GLM 5.1 billing):
| Component | GLM 5.1 Actual | Calculation |
|---|---|---|
| Cache Reads | $0.33/M | 40.7M × $0.33 = $13.43 ✓ |
| Input | $1.75/M | 5.2M × $1.75 = $9.10 ✓ |
| Output | $5.50/M | 0.337M × $5.50 = $1.85 ✓ |
| Total | — | $24.38 (within 1% of the actual $24.15) |
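The per-component arithmetic in the table above can be checked with a short script. This is a sketch that assumes the three GLM 5.1 prices shown in the table; the helper name `workload_cost` and the dictionary keys are hypothetical. It prices both the actual usage from the verification line and the standard 5M / 300K / 40M workload used for every model in this analysis, so the first result can be compared against the billed $24.15.

```python
# Per-million-token prices for GLM 5.1 in USD, taken from the table above.
PRICES = {"cache_read": 0.33, "input": 1.75, "output": 5.50}


def workload_cost(tokens_millions: dict, prices: dict = PRICES) -> float:
    """Sum (millions of tokens × price per million) over each billing component."""
    return sum(tokens_millions[k] * prices[k] for k in tokens_millions)


# Actual GLM 5.1 usage quoted in the verification line (in millions of tokens).
actual_usage = {"cache_read": 40.7, "input": 5.2, "output": 0.337}

# Standard workload assumed throughout this analysis.
standard_workload = {"cache_read": 40.0, "input": 5.0, "output": 0.3}

actual_cost = workload_cost(actual_usage)
standard_cost = workload_cost(standard_workload)

print(f"actual usage:      ${actual_cost:.2f}")
print(f"standard workload: ${standard_cost:.2f}")
```

The standard-workload figure is the one that feeds the Total Cost column for GLM 5.1 in the comparison tables.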
| Model | Input $/M | Output $/M | Cache $/M | Total Cost | GPQA Diamond | Value Score | Verdict |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 Flash | $0.17 | $0.35 | $0.03 | $2.06 | ~88% | 42.7 | ⭐ BEST VALUE |
| DeepSeek V4 Pro | $1.73 | $3.80 | $0.33 | $22.25 | 90.1% | 4.05 | ⭐ Excellent |
| Grok 4.20 | $1.42 | $2.83 | $0.23 | $16.94 | ~88% | 5.19 | ⭐ Good |
| Grok 4.3 | $1.42 | $2.83 | $0.23 | $16.94 | ~88% | 5.19 | ⭐ Good |
| Grok Build 0.1 | $1.00 | $2.00 | $0.20 | $13.00 | ~85% | 6.54 | ⭐ Good |
| Model | Input $/M | Output $/M | Cache $/M | Total Cost | GPQA Diamond | Value Score | Verdict |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 Fast | $36.00 | $180.00 | $3.60 | $414.00 | 94.2% | 0.23 | 🔴 Terrible Value |
| Claude Opus 4.6 Fast | $36.00 | $180.00 | $3.60 | $414.00 | 91.3% | 0.22 | 🔴 Terrible Value |
| GPT-5.5 Pro | $37.50 | $225.00 | — | $187.50 | 93.6% | 0.50 | 🔴 Overpriced |
| GPT-5.4 Pro | $37.50 | $225.00 | — | $187.50 | 92.0% | 0.49 | 🔴 Overpriced |
| Model | Input $/M | Output $/M | Cache $/M | Total Cost | GPQA Diamond | Value Score | Verdict |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | $6.00 | $30.00 | $0.60 | $69.00 | 94.2% | 1.37 | 🟡 Premium but Fair |
| Claude Opus 4.6 | $6.00 | $30.00 | $0.60 | $69.00 | 91.3% | 1.32 | 🟡 Standard |
| Claude Opus 4.5 | $6.00 | $30.00 | $0.60 | $69.00 | ~90% | 1.30 | 🟡 Standard |
| GPT-5.5 | $6.25 | $37.50 | $0.63 | $82.63 | 93.6% | 1.13 | 🟡 Premium |
| GPT-5.4 | $3.13 | $18.80 | $0.31 | $41.03 | 92.0% | 2.24 | 🟢 Good Value |
| GPT-5.4 Mini | $0.94 | $5.63 | $0.09 | $11.44 | ~85% | 7.43 | 🟢 Very Good |
| Gemini 3.1 Pro | $2.50 | $15.00 | $0.50 | $32.50 | 94.3% | 2.90 | 🟢 Good Value |
| Model | Total Cost | GPQA Diamond | Value Score | Verdict |
|---|---|---|---|---|
| Qwen 3.5 9B | $0.56 | ~75% | 133.9 | ⭐ EXCEPTIONAL |
| Mistral Small 3.2 | $0.56 | ~75% | 133.9 | ⭐ EXCEPTIONAL |
| GLM 4.7 Flash | $0.65 | ~80% | 123.1 | ⭐ EXCEPTIONAL |
| GLM 4.7 Flash Heretic | $0.70 | ~80% | 114.3 | ⭐ EXCEPTIONAL |
| DeepSeek V3.2 | $1.89 | ~75% | 39.7 | ⭐ Very Good |
| MiniMax M2.5 | $2.21 | ~80% | 36.2 | ⭐ Very Good |
| Qwen 3.6 27B | $2.63 | 87.8% | 33.4 | ⭐ Very Good |
| Kimi K2.5 | $3.85 | 87.6% | 22.8 | ⭐ Outstanding |
| Kimi K2.6 | $5.53 | ~88% | 15.9 | ⭐ Outstanding |
| Qwen 3.6 Plus | $5.57 | ~87% | 15.6 | ⭐ Outstanding |
| Model | Total Cost | GPQA Diamond | Value Score | Verdict |
|---|---|---|---|---|
| GLM 5 | $13.60 | ~84% | 6.18 | 🟢 Good |
| GLM 5.1 | $23.60 | 86.2% | 3.65 | 🟡 Your Actual Experience |
| GPT-5.2 | $16.16 | ~88% | 5.45 | 🟡 Expensive |
| Claude Sonnet 4.6 | $19.44 | ~85% | 4.37 | 🟡 Expensive |
| Claude Sonnet 4.5 | $20.25 | ~84% | 4.15 | 🟡 Expensive |
| Rank | Model | Cost | GPQA | Value Score | Why It Stands Out |
|---|---|---|---|---|---|
| 1 | Qwen 3.5 9B | $0.56 | ~75% | 133.9 | Cheapest viable option |
| 2 | Mistral Small 3.2 | $0.56 | ~75% | 133.9 | Same price, similar performance |
| 3 | GLM 4.7 Flash | $0.65 | ~80% | 123.1 | ~80% GPQA for under $1 |
| 4 | GLM 4.7 Flash Heretic | $0.70 | ~80% | 114.3 | Slightly pricier Flash variant |
| 5 | DeepSeek V4 Flash | $2.06 | ~88% | 42.7 | The Sweet Spot |
| 6 | DeepSeek V3.2 | $1.89 | ~75% | 39.7 | Proven DeepSeek quality |
| 7 | MiniMax M2.5 | $2.21 | ~80% | 36.2 | Strong agentic performance |
| 8 | Qwen 3.6 27B | $2.63 | 87.8% | 33.4 | Near-frontier, verified |
| 9 | Kimi K2.5 | $3.85 | 87.6% | 22.8 | Open weights, excellent |
| 10 | Kimi K2.6 | $5.53 | ~88% | 15.9 | Best open-weights upgrade |
| Model | Cost | GPQA | Value Score | Why Avoid |
|---|---|---|---|---|
| Claude Opus 4.7 Fast | $414.00 | 94.2% | 0.23 | 6× cost for speed, same accuracy |
| Claude Opus 4.6 Fast | $414.00 | 91.3% | 0.22 | Same problem |
| GPT-5.5 Pro | $187.50 | 93.6% | 0.50 | 2.3× cost, minimal gain over 5.5 |
| GPT-5.4 Pro | $187.50 | 92.0% | 0.49 | Same issue |
| Use Case | Recommended Model | Cost | Why |
|---|---|---|---|
| Ultra-budget, any capability | Qwen 3.5 9B | $0.56 | Cheapest option that works |
| Best value under $1 | GLM 4.7 Flash | $0.65 | ~80% GPQA at throwaway price |
| Sweet spot (performance/price) | DeepSeek V4 Flash | $2.06 | ~88% GPQA, proven reliable |
| Best open-weights | Kimi K2.6 | $5.53 | Full control, strong performance |
| Verified high performance | Qwen 3.6 27B | $2.63 | 87.8% GPQA, published scores |
| Frontier quality, fair price | GPT-5.4 | $41.03 | 92% GPQA, half the cost of 5.5 |
| Near-best reasoning | Claude Opus 4.7 | $69.00 | 94.2% GPQA (0.1 point below Gemini 3.1 Pro), premium price |
| Google ecosystem | Gemini 3.1 Pro | $32.50 | 94.3% GPQA, competitive |
| Never use | Any "Fast" or "Pro" variant | $187.50-$414.00 | Same GPQA score as the standard model at 2.3-6× the price |
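One way to turn the recommendations above into a repeatable rule is to pick the cheapest model that clears a minimum GPQA score. The sketch below assumes a subset of the models and figures from the tables in this analysis; the `Model` type and `cheapest_meeting` helper are hypothetical names, and the "~" estimates are treated as exact numbers, which overstates their precision.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    cost_usd: float      # total workload cost from the tables in this analysis
    gpqa_percent: float  # "~" estimates are entered as plain numbers


# A subset of the models compared in this analysis.
MODELS = [
    Model("Qwen 3.5 9B", 0.56, 75.0),
    Model("GLM 4.7 Flash", 0.65, 80.0),
    Model("DeepSeek V4 Flash", 2.06, 88.0),
    Model("Qwen 3.6 27B", 2.63, 87.8),
    Model("Kimi K2.6", 5.53, 88.0),
    Model("DeepSeek V4 Pro", 22.25, 90.1),
    Model("Gemini 3.1 Pro", 32.50, 94.3),
    Model("GPT-5.4", 41.03, 92.0),
    Model("Claude Opus 4.7", 69.00, 94.2),
]


def cheapest_meeting(min_gpqa: float, models=MODELS) -> Optional[Model]:
    """Return the lowest-cost model scoring at least min_gpqa, or None if none qualifies."""
    eligible = [m for m in models if m.gpqa_percent >= min_gpqa]
    return min(eligible, key=lambda m: m.cost_usd) if eligible else None


for floor in (75, 80, 85, 90, 94):
    pick = cheapest_meeting(floor)
    print(f"GPQA >= {floor}%: {pick.name if pick else 'no model qualifies'}")
```

A hard threshold like this ignores everything GPQA does not measure, such as instruction following and tool use, so it is best read as a first filter rather than a final choice.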
GPQA Diamond was selected because:
- Available across all models in analysis
- Tests graduate-level scientific reasoning (physics, chemistry, biology)
- "Google-proof" questions requiring genuine expertise
- Human baselines: PhD-level experts score ~65%, skilled non-experts ~34%
- Directly relevant to agentic problem-solving capabilities
The following are notes based on trying several low-cost models on a real-world ETL task, involving extracting text data from an internal-only website via a SOCKS proxy, writing it to disk in a structured format, cleaning it up, summarizing it, and putting the summary into a JSON file.
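For orientation, the shape of the task described above can be sketched in Python. This is not the pipeline the models actually produced: the proxy address, file names, and function names are placeholders, the tag-stripping is deliberately crude, and the summarizer is a stand-in that just keeps the first sentences. The fetch step assumes the third-party `requests` package installed with SOCKS support (`pip install "requests[socks]"`).

```python
import json
import re
from pathlib import Path
from typing import Callable, Iterable


def fetch_via_socks(url: str, proxy: str = "socks5h://127.0.0.1:1080") -> str:
    """Fetch a page through a SOCKS proxy (the proxy address is a placeholder)."""
    import requests  # imported lazily so the other steps run without it

    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    resp.raise_for_status()
    return resp.text


def clean(raw: str) -> str:
    """Strip HTML tags and collapse whitespace (crude; a real parser is more robust)."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def summarize(text: str, max_sentences: int = 2) -> str:
    """Placeholder summary: keep the first few sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:max_sentences])


def run_pipeline(
    urls: Iterable[str],
    out_dir: Path,
    fetch: Callable[[str], str] = fetch_via_socks,
) -> Path:
    """Extract each URL, write cleaned text to disk, and collect summaries in one JSON file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    records = []
    for i, url in enumerate(urls):
        text = clean(fetch(url))
        text_path = out_dir / f"doc_{i:03d}.txt"
        text_path.write_text(text, encoding="utf-8")
        records.append({"url": url, "text_file": text_path.name, "summary": summarize(text)})
    summary_path = out_dir / "summaries.json"
    summary_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return summary_path
```

The `socks5h` scheme asks the proxy to resolve hostnames, which matters when the target site is only resolvable from inside the network. Passing `fetch` as a parameter lets the clean, summarize, and write steps be exercised without any network access.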
GLM 5.1 (listed under "fair value") worked well overall, but costs started to add up. The initial prototype of the ETL pipeline cost about $20 to create and test to proof-of-concept level. I really wanted something cheaper to work from the instructions that GLM 5.1 produced.
Tried using Kimi K2.6 and it waaaaaay overthought things, going in circles with itself while trying to curate a dataset. It didn't seem to like that it (rightly) needed to delete a lot of crap from the dataset, and it spun for minutes deciding on a course of action before implementing it. Burned 94K tokens of context and $0.39 just doing a basic curation task! The actual reasoning was not terrible, but it seems to have miserably poor self-confidence; embarrassing almost. This persisted with a clean context as well as when it used inherited context from GLM 5.1.
Qwen 3.5 9B was just too stupid to follow the GLM 5.1 instructions. It went around in circles, disregarded instructions, and tried to build scripts for itself, which didn't work. Disappointing, as it seemed like a good option and has a respectable GPQA score (~75%) for its price.
Qwen 3.6 Plus Uncensored isn't in the automated comparison above (only the standard Qwen 3.6 Plus is), so it may not be a "good value" by the numbers. But, true to some forum comments I read on HuggingFace, it works much better than Qwen 3.5 9B. It was actually able to follow instructions, scrape the web for some content, analyze it, and so on. It appears to be fairly expensive, though, possibly on par with GLM 5.1.
GLM 4.7 Flash Heretic seems to work well with GLM 5.1's work instructions, and is good at tool use. Unfortunate, since it's basically what I was using all along, but I guess I at least had it pegged as a good model to use. It did fail completely at first when trying to use a SOCKS proxy that was already set up and working. Once I reset the session (cleared context), I was able to beat it into submission and make it less dumb, and in the end it was able to use the SOCKS proxy and automate a basic ETL task.