@Jachimo
Last active May 23, 2026 00:31
Venice "Bargain" Models - 22 May 2026
| Model | GPQA_Diamond_Pct | Total_Cost_USD |
|---|---|---|
| Qwen 3.5 9B | 75 | 0.56 |
| Mistral Small 3.2 | 75 | 0.56 |
| GLM 4.7 Flash | 80 | 0.65 |
| GLM 4.7 Flash Heretic | 80 | 0.70 |
| DeepSeek V3.2 | 75 | 1.89 |
| DeepSeek V4 Flash | 88 | 2.06 |
| MiniMax M2.5 | 80 | 2.21 |
| Qwen 3.6 27B | 87.8 | 2.63 |
| Kimi K2.5 | 87.6 | 3.85 |
| Kimi K2.6 | 88 | 5.53 |
| Qwen 3.6 Plus | 87 | 5.57 |
| GLM 4.6 | 79.1 | 5.45 |
| Gemini 3 Flash | 78 | 6.30 |
| GPT-5.4 Mini | 85 | 11.44 |
| Grok Build 0.1 | 85 | 13.00 |
| GLM 5 | 84 | 13.60 |
| Grok 4.20 | 88 | 16.94 |
| Grok 4.3 | 88 | 16.94 |
| GPT-5.2 | 88 | 16.16 |
| Claude Sonnet 4.6 | 85 | 19.44 |
| Claude Sonnet 4.5 | 84 | 20.25 |
| DeepSeek V4 Pro | 90.1 | 22.25 |
| GLM 5.1 | 86.2 | 23.60 |
| Gemini 3.1 Pro | 94.3 | 32.50 |
| GPT-5.4 | 92.0 | 41.03 |
| Claude Opus 4.6 | 91.3 | 69.00 |
| Claude Opus 4.7 | 94.2 | 69.00 |
| Claude Opus 4.5 | 90 | 69.00 |
| GPT-5.5 | 93.6 | 82.63 |
| GPT-5.5 Pro | 93.6 | 187.50 |
| GPT-5.4 Pro | 92.0 | 187.50 |
| Claude Opus 4.7 Fast | 94.2 | 414.00 |
| Claude Opus 4.6 Fast | 91.3 | 414.00 |

Venice.ai API Text Model Value Analysis

Premium & Mid-Range Models Benchmarked for Agentic Reasoning

Analysis Date: May 21, 2026
Benchmark: GPQA Diamond (graduate-level scientific reasoning)
Workload: 5M input tokens, 300K output tokens, 40M cache reads
Verification: Actual GLM 5.1 usage data (40.7M cache, 5.2M input, 337K output = $24.15)


Summary

This analysis set out to identify the "best value" text LLMs available via Venice.ai's API for agentic problem-solving workloads. Using GPQA Diamond as the benchmark, we calculate a "value score" (GPQA percentage divided by total cost) to find models that deliver above-average reasoning per dollar.
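As a minimal sketch of the value score described above (the function name `value_score` is only an illustrative choice, not something from the original analysis), the calculation is a single division of GPQA percentage by total workload cost in USD:

```python
def value_score(gpqa_pct: float, total_cost_usd: float) -> float:
    """GPQA Diamond percentage points per dollar of total workload cost."""
    if total_cost_usd <= 0:
        raise ValueError("total cost must be positive")
    return gpqa_pct / total_cost_usd


# DeepSeek V4 Flash: ~88% GPQA at $2.06 for the reference workload
print(round(value_score(88, 2.06), 1))      # 42.7
# GLM 5.1: 86.2% GPQA at $23.60
print(round(value_score(86.2, 23.60), 2))   # 3.65
```

Because the score is a plain ratio, very cheap models can score highly even with modest GPQA results, so it is worth reading it alongside the raw GPQA column rather than on its own.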

Key Findings

The best bargains score 75-88% on GPQA Diamond for $0.56-$2.06 per workload, while the premium "Fast" variants cost 6× as much as their standard counterparts ($414.00 vs. $69.00) for identical GPQA scores.


Pricing

All prices are per 1 million tokens (confirmed via actual GLM 5.1 billing):

| Component | GLM 5.1 | Actual Calculation |
|---|---|---|
| Cache Reads | $0.33/M | 40.7M × $0.33 = $13.43 |
| Input | $1.75/M | 5.2M × $1.75 = $9.10 |
| Output | $5.50/M | 0.337M × $5.50 = $1.85 |
| Total | | $24.38 (within about 1% of the $24.15 actually billed) |
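The workload cost is each token volume multiplied by its per-million price, then summed. The sketch below is illustrative only: the function name `workload_cost` and its parameter names are assumptions, while the default volumes are the reference workload stated at the top of the analysis (5M input, 300K output, 40M cache reads).

```python
def workload_cost(input_per_m: float, output_per_m: float, cache_per_m: float,
                  input_m: float = 5.0, output_m: float = 0.3,
                  cache_m: float = 40.0) -> float:
    """Total USD cost: token volumes (in millions) times per-million prices."""
    return (input_m * input_per_m
            + output_m * output_per_m
            + cache_m * cache_per_m)


# GLM 5.1 per-million prices from the table above, reference workload volumes
glm_51_cost = workload_cost(input_per_m=1.75, output_per_m=5.50, cache_per_m=0.33)
print(f"${glm_51_cost:.2f}")  # $23.60
```

Passing the actual usage volumes (`input_m=5.2`, `output_m=0.337`, `cache_m=40.7`) instead of the defaults gives the estimate for the verification run.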

Premium & Flagship Models

The Standout Bargains 🟢

| Model | Input $/M | Output $/M | Cache $/M | Total Cost | GPQA Diamond | Value Score | Verdict |
|---|---|---|---|---|---|---|---|
| DeepSeek V4 Flash | $0.17 | $0.35 | $0.03 | $2.06 | ~88% | 42.7 | BEST VALUE |
| DeepSeek V4 Pro | $1.73 | $3.80 | $0.33 | $22.25 | 90.1% | 4.05 | Excellent |
| Grok 4.20 | $1.42 | $2.83 | $0.23 | $16.94 | ~88% | 5.19 | Good |
| Grok 4.3 | $1.42 | $2.83 | $0.23 | $16.94 | ~88% | 5.19 | Good |
| Grok Build 0.1 | $1.00 | $2.00 | $0.20 | $13.00 | ~85% | 6.54 | Good |

The Expensive Frontier 🔴

| Model | Input $/M | Output $/M | Cache $/M | Total Cost | GPQA Diamond | Value Score | Verdict |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 Fast | $36.00 | $180.00 | $3.60 | $414.00 | 94.2% | 0.23 | 🔴 Terrible Value |
| Claude Opus 4.6 Fast | $36.00 | $180.00 | $3.60 | $414.00 | 91.3% | 0.22 | 🔴 Terrible Value |
| GPT-5.5 Pro | $37.50 | $225.00 | (not listed) | $187.50 | 93.6% | 0.50 | 🔴 Overpriced |
| GPT-5.4 Pro | $37.50 | $225.00 | (not listed) | $187.50 | 92.0% | 0.49 | 🔴 Overpriced |

The Balanced Middle 🟡

| Model | Input $/M | Output $/M | Cache $/M | Total Cost | GPQA Diamond | Value Score | Verdict |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.7 | $6.00 | $30.00 | $0.60 | $69.00 | 94.2% | 1.37 | 🟡 Premium but Fair |
| Claude Opus 4.6 | $6.00 | $30.00 | $0.60 | $69.00 | 91.3% | 1.32 | 🟡 Standard |
| Claude Opus 4.5 | $6.00 | $30.00 | $0.60 | $69.00 | ~90% | 1.30 | 🟡 Standard |
| GPT-5.5 | $6.25 | $37.50 | $0.63 | $82.63 | 93.6% | 1.13 | 🟡 Premium |
| GPT-5.4 | $3.13 | $18.80 | $0.31 | $41.03 | 92.0% | 2.24 | 🟢 Good Value |
| GPT-5.4 Mini | $0.94 | $5.63 | $0.09 | $11.44 | ~85% | 7.43 | 🟢 Very Good |
| Gemini 3.1 Pro | $2.50 | $15.00 | $0.50 | $32.50 | 94.3% | 2.90 | 🟢 Good Value |

Mid-Range Models

Outstanding Bargains 🟢

| Model | Total Cost | GPQA Diamond | Value Score | Verdict |
|---|---|---|---|---|
| Qwen 3.5 9B | $0.56 | ~75% | 133.9 | EXCEPTIONAL |
| Mistral Small 3.2 | $0.56 | ~75% | 133.9 | EXCEPTIONAL |
| GLM 4.7 Flash | $0.65 | ~80% | 123.1 | EXCEPTIONAL |
| GLM 4.7 Flash Heretic | $0.70 | ~80% | 114.3 | EXCEPTIONAL |
| DeepSeek V3.2 | $1.89 | ~75% | 39.7 | Very Good |
| MiniMax M2.5 | $2.21 | ~80% | 36.2 | Very Good |
| Qwen 3.6 27B | $2.63 | 87.8% | 33.4 | Very Good |
| Kimi K2.5 | $3.85 | 87.6% | 22.8 | Outstanding |
| Kimi K2.6 | $5.53 | ~88% | 15.9 | Outstanding |
| Qwen 3.6 Plus | $5.57 | ~87% | 15.6 | Outstanding |

Fair Value 🟡

| Model | Total Cost | GPQA Diamond | Value Score | Verdict |
|---|---|---|---|---|
| GLM 5 | $13.60 | ~84% | 6.18 | 🟢 Good |
| GLM 5.1 | $23.60 | 86.2% | 3.65 | 🟡 Verified against actual usage |
| GPT-5.2 | $16.16 | ~88% | 5.45 | 🟡 Expensive |
| Claude Sonnet 4.6 | $19.44 | ~85% | 4.37 | 🟡 Expensive |
| Claude Sonnet 4.5 | $20.25 | ~84% | 4.15 | 🟡 Expensive |

Top 10 Bargains Ranked

| Rank | Model | Cost | GPQA | Value Score | Why It Stands Out |
|---|---|---|---|---|---|
| 1 | Qwen 3.5 9B | $0.56 | ~75% | 133.9 | Cheapest viable option |
| 2 | Mistral Small 3.2 | $0.56 | ~75% | 133.9 | Same price, similar performance |
| 3 | GLM 4.7 Flash | $0.65 | ~80% | 123.1 | ~80% GPQA for under $1 |
| 4 | GLM 4.7 Flash Heretic | $0.70 | ~80% | 114.3 | Slightly pricier Flash variant |
| 5 | DeepSeek V4 Flash | $2.06 | ~88% | 42.7 | The Sweet Spot |
| 6 | DeepSeek V3.2 | $1.89 | ~75% | 39.7 | Proven DeepSeek quality |
| 7 | MiniMax M2.5 | $2.21 | ~80% | 36.2 | Strong agentic performance |
| 8 | Qwen 3.6 27B | $2.63 | 87.8% | 33.4 | Near-frontier, verified |
| 9 | Kimi K2.5 | $3.85 | 87.6% | 22.8 | Open weights, excellent |
| 10 | Kimi K2.6 | $5.53 | ~88% | 15.9 | Best open-weights upgrade |
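The ranking above is a sort by value score, highest first. The sketch below reproduces that ordering for a handful of rows; the GPQA and cost figures are copied from this analysis, while the variable names and the choice of rows are purely illustrative.

```python
# (model, GPQA Diamond %, total workload cost in USD), deliberately unsorted
models = [
    ("Claude Opus 4.7", 94.2, 69.00),
    ("DeepSeek V3.2", 75, 1.89),
    ("Qwen 3.5 9B", 75, 0.56),
    ("DeepSeek V4 Flash", 88, 2.06),
    ("Mistral Small 3.2", 75, 0.56),
    ("GLM 4.7 Flash", 80, 0.65),
]

# Sort by GPQA points per dollar, best value first.
# Python's sort is stable, so tied models keep their input order.
ranked = sorted(models, key=lambda m: m[1] / m[2], reverse=True)

for rank, (name, gpqa, cost) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {gpqa / cost:.1f} (GPQA {gpqa}%, ${cost:.2f})")
```

Note that DeepSeek V4 Flash lands above DeepSeek V3.2 despite costing more, because its higher GPQA score more than offsets the price difference.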

Models to Avoid

| Model | Cost | GPQA | Value Score | Why Avoid |
|---|---|---|---|---|
| Claude Opus 4.7 Fast | $414.00 | 94.2% | 0.23 | 6× cost for speed, same accuracy |
| Claude Opus 4.6 Fast | $414.00 | 91.3% | 0.22 | Same problem |
| GPT-5.5 Pro | $187.50 | 93.6% | 0.50 | 2.3× cost, no GPQA gain over 5.5 |
| GPT-5.4 Pro | $187.50 | 92.0% | 0.49 | Same issue |
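The cost multiples quoted above can be checked by dividing each variant's workload cost by that of its standard counterpart. This is a small sketch using figures from this analysis; the helper name `premium` and the pairing of each variant with a base model are assumptions made for illustration.

```python
# (variant, variant cost, base model, base cost, variant GPQA %, base GPQA %)
pairs = [
    ("Claude Opus 4.7 Fast", 414.00, "Claude Opus 4.7", 69.00, 94.2, 94.2),
    ("Claude Opus 4.6 Fast", 414.00, "Claude Opus 4.6", 69.00, 91.3, 91.3),
    ("GPT-5.5 Pro", 187.50, "GPT-5.5", 82.63, 93.6, 93.6),
    ("GPT-5.4 Pro", 187.50, "GPT-5.4", 41.03, 92.0, 92.0),
]


def premium(variant_cost, base_cost, variant_gpqa, base_gpqa):
    """Return (cost multiple, GPQA gain in percentage points) of a variant."""
    return variant_cost / base_cost, variant_gpqa - base_gpqa


for name, v_cost, base, b_cost, v_gpqa, b_gpqa in pairs:
    multiple, gain = premium(v_cost, b_cost, v_gpqa, b_gpqa)
    print(f"{name}: {multiple:.1f}x the cost of {base}, "
          f"GPQA gain {gain:+.1f} points")
```

On these numbers every variant shows a GPQA gain of zero, so the entire premium is paid for something other than benchmark accuracy (speed, in the case of the "Fast" models).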

Recommendations by Use Case

| Use Case | Recommended Model | Cost | Why |
|---|---|---|---|
| Ultra-budget, any capability | Qwen 3.5 9B | $0.56 | Cheapest option that works |
| Best value under $1 | GLM 4.7 Flash | $0.65 | 80% GPQA at throwaway price |
| Sweet spot (performance/price) | DeepSeek V4 Flash | $2.06 | 88% GPQA, proven reliable |
| Best open-weights | Kimi K2.6 | $5.53 | Full control, strong performance |
| Verified high performance | Qwen 3.6 27B | $2.63 | 87.8% GPQA, published scores |
| Frontier quality, fair price | GPT-5.4 | $41.03 | 92% GPQA, half the cost of 5.5 |
| Top-tier reasoning | Claude Opus 4.7 | $69.00 | 94.2% GPQA, justifies premium |
| Google ecosystem | Gemini 3.1 Pro | $32.50 | 94.3% GPQA, competitive |
| Never use | Any "Fast" or "Pro" variant | $187.50-$414.00 | Same GPQA score at 2.3-6× the price |

Methodology Notes

Benchmark Selection

GPQA Diamond was selected because:

  • Available across all models in analysis
  • Tests graduate-level scientific reasoning (physics, chemistry, biology)
  • "Google-proof" questions requiring genuine expertise
  • Human baselines: PhD experts score ~65%, skilled non-experts ~34%
  • Directly relevant to agentic problem-solving capabilities

Subjective Notes

The following are notes based on trying several low-cost models on a real-world ETL task, involving extracting text data from an internal-only website via a SOCKS proxy, writing it to disk in a structured format, cleaning it up, summarizing it, and putting the summary into a JSON file.

GLM 5.1 (listed under "fair value") worked well overall, but costs started to add up. The initial prototype of the ETL pipeline cost about $20 to create and test to proof-of-concept level. I really wanted something cheaper to carry out the instructions that GLM 5.1 produced.

Tried using Kimi K2.6 and it waaaaaay overthought / went in circles with itself, while trying to curate a dataset. It didn't seem to like that it (rightly) needed to delete a lot of crap from the dataset and spun for minutes determining a course of action before implementing it. Burned 94K of context and $0.39 just doing a basic curation task! Actual reasoning was not terrible but it seems to have miserably poor self-confidence; embarrassing almost. This persisted with a clean context as well as when it used inherited context from GLM 5.1.

Qwen 3.5 9B was just too stupid to follow the GLM 5.1 instructions, too. It went around in circles, disregarded instructions and tried to build scripts for itself, which didn't work, and generally just spun around. Disappointing, as it seemed like a good option and has good GPQA scores.

Qwen 3.6 Plus Uncensored isn't in the automated comparison above, so it may not be a "good value" by the numbers. But, true to some forum comments I read on HuggingFace, it seems to work much better than Qwen 3.5 9B (like a lot better). It was actually able to follow instructions, scrape the web for some content, analyze it, etc. But it appears to be fairly expensive, possibly on par with GLM 5.1?

GLM 4.7 Flash Heretic seems to work well with GLM 5.1's work instructions, and is good at tool use. Unfortunate, since it's basically what I was using all along. But I guess I at least had it pegged as a good model to use. Unfortunately, it failed completely at trying to use a SOCKS proxy that was already set up and working. Once I reset the session (cleared context) I was able to beat it into submission and make it less dumb, and in the end it was able to use a SOCKS proxy and automate a basic ETL task.
