Date: 2026-03-22/24
Total models tested: 35 (34 local + 1 API reference)
Total tasks: 6 (across 2 rounds)
Total runtime: ~8 hours
All models run on the same machine via llama-server in router mode (--models-preset, --models-max 1). Each model is loaded one at a time using its preset configuration from preset.ini, which defines per-model parameters (temperature, context size, quantization, GPU layers, etc.). No temperature or sampling overrides were applied by the benchmark script — all inference used each model's own preset settings.
The benchmark script sends each task as an OpenAI-compatible /v1/chat/completions request and records the response along with the server-reported timings and usage stats. Wall clock time is measured for each request (includes model thinking time, prompt processing, and generation).
Six coding tasks across four languages, each targeting a different skill:
Round 1:
Task 1 — Generate Python Function
- Write `top_k_frequent(nums, k)` — return the k most frequent elements from a list of integers.
- Constraints: no `collections.Counter`, type hints required, time complexity better than O(n log n), specific tie-breaking rules.
- Note: the prompt wording "as a sorted list" was ambiguous — both value-ascending and frequency-descending interpretations were accepted.
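The intended better-than-O(n log n) approach is bucket sort by frequency. A minimal sketch follows; the tie-breaking shown (ascending value within equal frequency) is one of the two accepted interpretations, and the exact prompt rules are an assumption:

```python
def top_k_frequent(nums: list[int], k: int) -> list[int]:
    """Return the k most frequent elements via bucket sort, O(n) overall."""
    # Count frequencies by hand (collections.Counter was forbidden).
    counts: dict[int, int] = {}
    for n in nums:
        counts[n] = counts.get(n, 0) + 1

    # buckets[f] holds every value that occurs exactly f times.
    buckets: list[list[int]] = [[] for _ in range(len(nums) + 1)]
    for value, freq in counts.items():
        buckets[freq].append(value)

    # Walk from the highest frequency down, collecting k values.
    # Sorting inside a bucket is cheap: bucket sizes are small in practice,
    # and this is only one possible tie-breaking rule.
    result: list[int] = []
    for freq in range(len(buckets) - 1, 0, -1):
        for value in sorted(buckets[freq]):
            result.append(value)
            if len(result) == k:
                return result
    return result
```

Models that reached for `sorted()` over the whole count table paid the O(n log n) penalty this structure avoids.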
Task 2 — Fix JavaScript Bugs (~300 lines)
- A full async task queue module with `EventEmitter`, `PriorityQueue`, `TaskQueue`, `RateLimiter`, and `ScheduledTaskRunner`. Contains 3 functional bugs and 3 code smells.
- Functional bugs (cause incorrect behavior):
  1. `_bubbleUp` operator precedence (Critical) — breaks the heap.
  2. `once()` wrong reference (Critical) — once never unregisters.
  3. `clear()` order of operations (Medium) — stats adjustment is a no-op.
- Code smells (bad practice but don't cause incorrect behavior):
  4. `_executeWithTimeout` resolve-after-reject — a settled Promise ignores subsequent calls.
  5. `getTask` loose equality (`==` vs `===`).
  6. `drain` listener accumulation (`on` vs `once`).
- Scoring focused on the 3 functional bugs. Code smell fixes received minor bonus credit.
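The `_bubbleUp` bug is the classic precedence trap in the heap parent-index calculation. The buggy JS line isn't reproduced in this report, so its exact form is an assumption; a Python sketch of a correct min-heap bubble-up, with the trap noted in a comment:

```python
def bubble_up(heap: list[int], i: int) -> None:
    """Restore the min-heap property by moving heap[i] toward the root."""
    while i > 0:
        # Parentheses required: without them, i - 1 // 2 evaluates as
        # i - (1 // 2) == i, so parent == i and the loop exits at once,
        # silently leaving the heap out of order.
        parent = (i - 1) // 2
        if heap[parent] <= heap[i]:
            break
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent
```

The failure mode is quiet: nothing crashes, the queue simply stops returning items in priority order, which is why this one was rated Critical.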
Task 3 — Refactor TypeScript (~300 lines)
- A module with a logger, LRU cache store, and request pipeline. Contains heavy duplication.
- Key targets: unify 4 identical log functions, simplify `shouldLog`, use modern array methods/template literals, preserve all 23 exports.
Round 2:
Task 4 — Fix and Extend Rust CLI
- A `clap`-based task manager with 4 bugs (2 compilation errors, 2 logic bugs) plus a feature request to add an `edit` subcommand.
- Bugs:
  1. `Priority` missing `ValueEnum` derive — code won't compile; clap can't parse the enum.
  2. `next_id` never incremented — all tasks get ID 0.
  3. `mark_done` uses an immutable reference — won't compile.
  4. `remove` retains the wrong items (`==` should be `!=`).
- Scored proportionally: each bug fixed ≈ 3.5 points, edit feature ≈ 2 points, base ≈ 2. The `ValueEnum` bug tests Rust ecosystem knowledge (clap's derive API) rather than general coding ability — in a real workflow the compiler would catch it immediately.
Task 5 — Add Go HTTP Middleware
- Add logging middleware, per-IP rate limiting (10 req/s), and graceful shutdown to an existing `net/http` server. Stdlib only.
Task 6 — Build JS SSE Client
- Implement a `ServerSentEvents` class from scratch in vanilla JS. Parse the SSE wire format from a ReadableStream: data accumulation, event types, last event ID, retry, null byte handling, CR/CRLF/LF line endings, and stream chunk boundaries.
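The parsing rules being graded come from the SSE specification and are language-agnostic, so here is a Python sketch of the core field handling (the task required vanilla JS). It covers comment lines, multi-line data, CR/CRLF/LF endings, and the null-byte id rule, but deliberately omits ReadableStream chunk boundaries and retry, which the real task also graded:

```python
def parse_sse_events(stream: str) -> list[tuple[str, str, str]]:
    """Parse SSE wire format; returns a list of (event, data, last_id) tuples."""
    events: list[tuple[str, str, str]] = []
    data_lines: list[str] = []
    event_type = ""
    last_id = ""
    # Normalize CRLF and lone CR to LF before splitting into lines.
    lines = stream.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    for line in lines:
        if line == "":                     # blank line dispatches the event
            if data_lines:
                events.append((event_type or "message",
                               "\n".join(data_lines), last_id))
            data_lines, event_type = [], ""
            continue
        if line.startswith(":"):           # comment line, ignored
            continue
        field, _, value = line.partition(":")
        value = value[1:] if value.startswith(" ") else value
        if field == "data":
            data_lines.append(value)       # multi-line data joined with \n
        elif field == "event":
            event_type = value
        elif field == "id" and "\x00" not in value:
            last_id = value                # ids containing NUL are ignored
    return events
```

Several failure modes in the results table map directly onto these rules: collapsing blank lines (llama-4-scout) means the dispatch branch never fires, and recreating the decoder per chunk breaks multi-byte characters at chunk boundaries.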
| Range | Meaning |
|---|---|
| 1–4 | Broken, didn't follow instructions, or mostly wrong |
| 5–8 | Attempted but significant issues (new bugs, missed critical items, wrong output) |
| 9–12 | Functional with notable gaps (missed some bugs/opportunities, minor issues) |
| 13–16 | Good — correct with minor imperfections |
| 17–20 | Excellent — thorough, clean, nothing meaningful to criticize |
Task-specific notes:
- Python: models using `sorted()` capped at 11–12 (doesn't meet the complexity requirement). Incorrect output scored 5–8. Using `Counter` scored 2.
- JS Bug Fix: scored on the 3 functional bugs. Code smells treated as bonuses.
- TS Refactor: Scored on duplication reduction, modernization, export preservation, and whether code got shorter.
- Rust CLI: scored proportionally per bug fixed. `ValueEnum` is one of 4 bugs, not a hard gate — models that fixed 3/4 bugs and added the edit feature scored 15.
- Go Middleware: all 3 features must be present. Rate limiter quality (sliding window vs counter reset vs broken Ticker) differentiates within the 13–17 range.
- JS SSE: Must handle ReadableStream, TextDecoder, multi-line data, CR/CRLF, null byte in ID, comment lines, and partial chunks across reads.
The scoring process was iteratively refined during the review:
- JS Bug Fix correction: three items initially classified as bugs (`_executeWithTimeout` resolve-after-reject, `getTask` loose equality, `drain` on vs once) were reclassified as code smells after analysis showed they don't cause incorrect behavior in JavaScript. Scores were adjusted to not penalize models for correctly leaving them unchanged.
- Python ambiguity: the prompt's "as a sorted list" wording doesn't clearly specify sort order. Both value-ascending and frequency-descending interpretations were accepted.
- Rust proportional scoring: initially, missing `ValueEnum` was treated as a hard gate (capping scores at 10). This was corrected to proportional scoring (each bug ≈ 3.5 points) since `ValueEnum` tests framework-specific knowledge rather than coding ability, and in a real workflow the compiler would catch it immediately.
| Rank | Model | Size (GB) | Avg Wall | Py | JS | TS | Rs | Go | SSE | Total (/120) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-haiku-4-5 † | API | ~15s | 18 | 17 | 16 | 17 | 18 | 17 | 103 |
| 2 | gpt-oss-120b-q8kxl | 60.0 | 28s | 18 | 17 | 13 | 18 | 17 | 17 | 100 |
| 2 | qwen35-35b-a3b-q8 | 34.4 | 112s | 17 | 18 | 15 | 18 | 16 | 16 | 100 |
| 4 | gpt-oss-120b-q8kxl-nonthink | 60.0 | 28s | 18 | 17 | 13 | 18 | 17 | 16 | 99 |
| 4 | gpt-oss-20b-q8kxl | 12.3 | 22s | 17 | 17 | 14 | 18 | 16 | 17 | 99 |
| 6 | qwen35-122b-a10b-iq4nl-nonthink | 57.2 | 68s | 17 | 14 | 14 | 18 | 16 | 14 | 93 |
| 7 | devstral-small-2-24b-2512-q8kxl | 27.0 | 93s | 15 | 14 | 15 | 18 | 14 | 14 | 90 |
| 7 | qwen35-122b-a10b-iq4nl | 57.2 | 204s | 17 | 14 | 14 | 15 | 15 | 15 | 90 |
| 9 | qwen3-coder-next-q40 | 42.2 | 21s | 17 | 14 | 14 | 15 | 14 | 15 | 89 |
| 9 | qwen3-coder-next-q6k | 61.1 | 24s | 16 | 16 | 13 | 15 | 14 | 15 | 89 |
| 9 | llama-4-scout-q40 | 57.0 | 48s | 16 | 14 | 13 | 18 | 14 | 14 | 89 |
| 12 | qwen35-27b-q8-nonthink | 26.6 | 113s | 13 | 17 | 12 | 18 | 14 | 13 | 87 |
| 13 | devstral-2-123b-2512-q40 | 66.1 | 262s | 11 | 13 | 14 | 18 | 15 | 15 | 86 |
| 13 | qwen35-122b-a10b-q4km | 71.5 | 354s | 15 | 14 | 13 | 15 | 15 | 14 | 86 |
| 13 | qwen35-35b-a3b-claude-distill-q6k | 26.6 | 54s | 17 | 16 | 10 | 15 | 14 | 14 | 86 |
| 13 | qwen35-35b-a3b-q6k-nonthink | 27.0 | 17s | 17 | 16 | 15 | 15 | 11 | 12 | 86 |
| 17 | qwen35-35b-a3b-q8-nonthink | 34.4 | 24s | 17 | 16 | 10 | 15 | 14 | 11 | 83 |
| 18 | minimax-m25-reap-139b-q4km | 78.4 | 108s | 11 | 13 | 16 | 15 | 14 | 13 | 82 |
| 18 | qwen35-27b-claude-distill-q8-nonthink | 26.6 | 186s | 6 | 14 | 11 | 18 | 14 | 14 | 82 |
| 20 | qwen35-9b-q8 | 8.9 | 68s | 11 | 14 | 15 | 15 | 12 | 14 | 81 |
| 20 | kimi-linear-48b-a3b-q6kl | 37.9 | 22s | 11 | 13 | 16 | 15 | 13 | 13 | 81 |
| 22 | qwen3-coder-30b-a3b-1m-q8kxl | 33.5 | 18s | 11 | 10 | 12 | 15 | 14 | 14 | 76 |
| 22 | qwen3-coder-30b-a3b-q4kxl | 16.5 | 14s | 11 | 10 | 12 | 15 | 14 | 14 | 76 |
| 22 | devstral-small-2-24b-2512-q40 | 12.6 | 55s | 5 | 13 | 15 | 15 | 14 | 14 | 76 |
| 22 | mistral-small-4-119b-iq4nl-nonthink | 63.7 | 64s | 9 | 13 | 11 | 15 | 14 | 14 | 76 |
| 26 | qwen3-coder-30b-a3b-q8kxl | 33.5 | 18s | 11 | 10 | 13 | 15 | 10 | 14 | 73 |
† claude-haiku-4-5 is an API reference model (Claude Haiku 4.5), not a local LLM. Timings are agent wall-clock time and are not directly comparable to local inference times.
*qwen35-27b-claude-distill-q8-nonthink scored 18 on Rust (caught ValueEnum) despite lower scores elsewhere — an unusual profile.
Incomplete results (ERR on 1+ tasks):
| Model | Size (GB) | Avg Wall | Completed | Score |
|---|---|---|---|---|
| qwen35-27b-q8 | 26.6 | 388s | 5/6 | 83+ |
| nemotron-cascade-2-30b-a3b-q8 | 31.3 | 156s | 4/6 | 83+ |
| qwen35-35b-a3b-q6k-think | 27.0 | 94s | 5/6 | 76 |
| gpt-oss-20b-coding-distill-mxfp4 | 12.8 | 30s | 4/6 | 57+ |
| step-35-flash-q4km | 111.7 | 545s | 3/6 | 54+ |
| nemotron-cascade-2-30b-a3b-q40 | 17.0 | 159s | 3/6 | 51+ |
| qwen35-08b-q8 | 0.8 | 21s | 4/6 | 29+ |
| glm-47-flash-q8kxl | 33.2 | 469s | 2/6 | 19+ |
| mistral-small-4-119b-iq4nl | 63.7 | 347s | 1/6 | 15+ |
ERR = model crashed (proxy error), produced empty content, or timed out. These models are excluded from the main ranking.
claude-haiku-4-5 (Rank #1, 103/120) † — Highest score in the benchmark. Strong across all 6 tasks, no failures. Token bucket rate limiter in Go, TextDecoder stream:true in SSE, O(n) Python, all 3 JS functional bugs found. Self-corrected by noticing a missing import. As an API reference model, not directly comparable to local LLMs on speed or size.
gpt-oss-120b-q8kxl (Rank #2, 100/120) — Tied for first among local models. Highest Python score (18/20, bucket sort O(n)), strong across all languages. The Go middleware included a cleanup goroutine for the rate limiter — production-quality code. At 60 GB it's large, but at 28s average wall time it's one of the fastest.
qwen35-35b-a3b-q8 (Rank #2, 100/120) — Tied for first among local models. The only model to score 18 on the JS bug fix (found all 3 functional bugs plus all code smells). Also scored 18 on Rust (caught ValueEnum). As a 35B MoE (3B active), it's half the size of gpt-oss-120b but 4x slower (112s avg) due to thinking.
gpt-oss-20b-q8kxl (Rank #4, 99/120) — One point behind the local leaders at a fraction of the size. 12.3 GB, 22s average. The best local model in the benchmark by any efficiency metric. Scored 17+ on every task except TS refactoring (14).
Beyond raw scores, models differ in how they think about code. For agentic coding, the ideal collaborator writes clean, minimal solutions — not sloppy, not over-engineered.
Tier 1 — Top collaborators (score 93+)
| Model | Style | Strength | Weakness |
|---|---|---|---|
| claude-haiku-4-5 † | Thorough generalist | Highest benchmark score (103). Correct O(n) Python, all JS bugs found, comprehensive Go with token bucket + responseWriter wrapper. SSE handles all edge cases (TextDecoder stream:true, CRLF, null byte, retry). Self-corrects (noticed missing import in Go). | No local deployment — API only. Markdown fences in output (standard for agentic use). TS refactor solid but not exceptional. |
| gpt-oss-120b-q8kxl | Pragmatist | Writes the minimum needed. Cleanest SSE (107 lines). Uses stdlib correctly. Go middleware includes cleanup goroutine — production-quality without being asked. | The cleanup goroutine is arguably over-engineering for a task that didn't ask for it. |
| gpt-oss-120b-q8kxl-nonthink | Same pragmatist | Identical quality to thinking variant at identical speed. 1 point less but no thinking overhead. | Same minor tendency to add production touches. |
| qwen35-35b-a3b-q8 | Craftsman | Finds elegant stdlib solutions (Python: 3-line heapq.nlargest). Only model to catch all JS bugs. | Slow (112s). SSE uses setTimeout recursion instead of a while loop. 9061 avg tokens — thinks a lot. |
| gpt-oss-20b-q8kxl | Careful engineer | Always correct. Fastest top-tier model (22s). Tiny (12.3 GB). Strong across all languages. | Over-handles edges nobody asked about (Python: checks empty/k<=0). Comments more than needed. |
| qwen35-122b-a10b-iq4nl-nonthink | Thorough | Consistent across languages (stdev 1.8). Catches ValueEnum. 388-line SSE is thorough with input validation. | 57 GB and 68s for scores that 27 GB models match. Verbose — SSE is 3.5x longer than gpt-oss-120b's. |
Tier 2 — Good collaborators (score 86–90)
| Model | Style | Strength | Weakness |
|---|---|---|---|
| devstral-small-2-24b-2512-q8kxl | Steady | Most consistent model (stdev 1.5). Good TS refactoring (unified logs). Strong Rust (caught ValueEnum). | Slow for size (93s at 27 GB). Dense model bottleneck. SSE missing TextDecoder reuse. |
| qwen35-122b-a10b-iq4nl | Thoughtful | Strong Python (17). Good SSE with proper TextDecoder reuse. Consistent across tasks. | Missed ValueEnum (ecosystem knowledge gap). 204s average — very slow. 57 GB. |
| qwen3-coder-next-q40 | Builder | Structures code well. Fastest quality model (21s). Good Go/SSE architecture. Python used bucket sort (O(n)). | Over-engineers: SSE has unused AbortController, creates new TextDecoder per chunk (breaks multi-byte chars), complex CR handling that's actually wrong. |
| qwen3-coder-next-q6k | Same builder | Slightly better JS bug-fix than q40 (16 vs 14). | Same over-engineering tendency. 61 GB — 50% larger than q40 for identical quality. |
| llama-4-scout-q40 | Surface modernist | Uses modern JS features (private # fields in SSE). Good Rust (18). Even scores. | SSE looks modern but is fundamentally broken: split(/[\n\r]+/) collapses blank lines, so events never dispatch correctly. Style over substance. |
| devstral-2-123b-2512-q40 | Slow but correct | Strong Rust (18). Good SSE and Go. Fixes bugs methodically. | 262s avg, 66 GB — worst speed/quality ratio. Creates new TextDecoder per SSE chunk. |
| qwen35-122b-a10b-q4km | Verbose thinker | Solid across tasks. Proper min-heap in Python. | 354s avg, 71 GB — extremely slow. 185-line SSE for what others do in 107. Missed ValueEnum. |
| qwen35-35b-a3b-claude-distill-q6k | Fluent writer | Readable output. Fast (54s). Good Python (17). Decent JS bug-fix (16). | Distillation traded precision for fluency. TS refactoring is worst score (10) — code got longer. Missed ValueEnum. |
| qwen35-35b-a3b-q6k-nonthink | Efficient generalist | Fastest non-coder model at this quality tier (17s). Good Python (17) and JS (16). Clean TS refactor (15). | Go middleware has bad handler wiring and missing imports — won't compile. CRLF bug in SSE. Missed ValueEnum. |
Tier 3 — Adequate collaborators (score 76–85)
| Model | Style | Strength | Weakness |
|---|---|---|---|
| qwen35-27b-q8-nonthink | Precise debugger | Best JS bug-fix among nonthink models (17). Strong Rust (18). | Uneven — TS refactoring is weakest (12). SSE missing multi-line data handling. 113s avg. |
| qwen35-35b-a3b-q8-nonthink | Quick generalist | Fast (24s). Good Python (17) and JS (16). Same model as the top-scoring local qwen35-35b-a3b-q8, but without thinking. | Without thinking, misses ValueEnum, produces weaker SSE (11) and TS (10). The thinking gap is real for this model. |
| minimax-m25-reap-139b-q4km | Refactorer | Best TS refactoring (16). Clean structural improvements. | Weak at algorithms (Python 11). SSE missing comment handling. 78 GB for middling quality. |
| qwen35-27b-claude-distill-q8-nonthink | Inconsistent | Caught ValueEnum (18 on Rust) — unusual for a distill model. Decent Go/SSE. | Python score is worst in the benchmark (6) — inverted heap logic. 186s avg. Wildly uneven profile. |
| qwen35-9b-q8 | Compact all-rounder | Best TS refactoring for its size (15). Only 8.9 GB. | Weak algorithms (Python 11). Go rate limiter holds mutex during entire request — serializes all traffic. |
| kimi-linear-48b-a3b-q6kl | Minimalist | Simplest approach every time. Most token-efficient (1412 avg). Python reads like pseudocode. Fast (22s). | Too simple — misses complexity requirements, weaker bug detection, SSE missing comment handling. |
| qwen3-coder-30b-a3b-q4kxl | Fast minimalist | Fastest model (14s). Reliable (6/6). Clean Go middleware. | Missed once() JS bug. Naive sorted() for Python. Missed ValueEnum. Speed without depth. |
| qwen3-coder-30b-a3b-1m-q8kxl | Same as q4kxl | 1M context variant — no quality advantage over base. 18s. | Same weaknesses. The extra context didn't help on any task. |
| devstral-small-2-24b-2512-q40 | Budget option | Good TS refactoring (15). Decent Go and SSE (14 each). 12.6 GB. | Buggy Python (5) — inverted heap comparison. Q4_0 quant clearly hurts vs Q8_K_XL. |
| mistral-small-4-119b-iq4nl-nonthink | Late starter | Decent Go middleware (14). Clean SSE parsing logic. Uses Set for callbacks (prevents duplicates). | SSE doesn't auto-start — requires a manual processStream() call, breaking the spec. Creates a new TextDecoder per chunk. Missed ValueEnum. |
Tier 4 — Unreliable or insufficient (score <73 or high error rate)
| Model | Style | Strength | Weakness |
|---|---|---|---|
| qwen3-coder-30b-a3b-q8kxl | Broken specialist | Clean code structure when it works. | Go rate limiter uses time.Ticker — fundamentally wrong (allows ~1 req/s not 10). Missed once(). Missed ValueEnum. |
| qwen35-27b-q8 | Overthinking | Good Python (17). Strong Rust (18). Catches ValueEnum. | 388s avg — by far the slowest usable model. Crashed on SSE. Thinking generates massive token counts for modest gains. |
| qwen35-35b-a3b-q6k-think | Capable but fragile | Best Python in benchmark (18, O(n) bucket sort). Good Rust (16, FromStr). Fixes JS code smells as bonus. | Go task ERR (invalid JSON). SSE has critical const reassignment bug that crashes parsing on any standard field. Its faster nonthink sibling actually scores higher. |
| nemotron-cascade-2-30b-a3b-q8 | Unreliable | Good when it works (Rust 18, Go 16). Correct sliding-window rate limiter. | Failed 2/6 tasks (SSE, round 2). Missing logging function name didn't match common patterns. |
| nemotron-cascade-2-30b-a3b-q40 | Broken | Good round 1 scores (Python 17, TS 16). | Failed all 3 round 2 tasks. Completely unstable — cannot be relied on. |
| gpt-oss-20b-coding-distill-mxfp4 | Format-broken | Perfect JS bug-fix (18). Good Go middleware. | Server can't parse its output format (channel tags) — fails 2/6 tasks. Unusable without server-side fixes. |
| step-35-flash-q4km | Unreliable giant | Strong Rust (18) and Python (16) when it works. | 111.7 GB, 545s avg. Failed 3/6 tasks. Too slow and too unreliable. |
| qwen35-08b-q8 | Too small | Fast (21s). | Fundamentally insufficient for coding tasks. Uses Counter (forbidden). 281-line SSE with maxEvents, bufferResetCount — massive over-engineering for a broken implementation. The philosopher model. |
| glm-47-flash-q8kxl | Context-limited | Caught ValueEnum in Rust (one of few). | Crashed 4/6 tasks. Can't handle inputs over ~200 lines. 469s when it works. |
| mistral-small-4-119b-iq4nl | Broken | One successful task (SSE: 14) shows it CAN code. | Crashed 5/6 tasks. Puts everything in reasoning_content with empty code output. Hallucinated non-existent bugs. |
For agentic coding workflows where speed, reliability, and clean output matter more than peak score:
| Model | Score | Wall | Size | Reliable | Consistent | Clean Code | Agentic Pick? |
|---|---|---|---|---|---|---|---|
| claude-haiku-4-5 † | 103 | ~15s | API | 6/6 | excellent | thorough | Best score; API only |
| gpt-oss-20b-q8kxl | 99 | 22s | 12.3 | 6/6 | excellent | careful | Best local agent |
| gpt-oss-120b-q8kxl-nonthink | 99 | 28s | 60.0 | 6/6 | good | cleanest | Best local if VRAM allows |
| qwen3-coder-next-q40 | 89 | 21s | 42.2 | 6/6 | excellent | over-engineers | Fast but watch quality |
| kimi-linear-48b-a3b-q6kl | 81 | 22s | 37.9 | 6/6 | good | minimal | Simple tasks only |
| qwen35-35b-a3b-q8 | 100 | 112s | 34.4 | 6/6 | excellent | elegant | Best local quality, too slow for agents |
Quality per second, quality per GB:
| Model | Score | Avg Wall | Size | Score/Min | Score/GB |
|---|---|---|---|---|---|
| gpt-oss-20b-q8kxl | 99 | 22s | 12.3 | 270 | 8.0 |
| gpt-oss-120b-q8kxl-nonthink | 99 | 28s | 60.0 | 216 | 1.7 |
| gpt-oss-120b-q8kxl | 100 | 28s | 60.0 | 214 | 1.7 |
| qwen3-coder-next-q40 | 89 | 21s | 42.2 | 254 | 2.1 |
| kimi-linear-48b-a3b-q6kl | 81 | 22s | 37.9 | 221 | 2.1 |
| qwen35-35b-a3b-q8-nonthink | 83 | 24s | 34.4 | 208 | 2.4 |
| llama-4-scout-q40 | 89 | 48s | 57.0 | 111 | 1.6 |
| qwen35-122b-a10b-iq4nl-nonthink | 93 | 68s | 57.2 | 82 | 1.6 |
| qwen35-35b-a3b-q8 | 100 | 112s | 34.4 | 54 | 2.9 |
| Base Model | Think Score | Think Wall | NoThink Score | NoThink Wall | Delta |
|---|---|---|---|---|---|
| qwen35-35b-a3b-q8 | 100 | 112s | 83 | 24s | +17 |
| gpt-oss-120b-q8kxl | 100 | 28s | 99 | 28s | +1 |
| qwen35-27b-q8 | 83+ | 388s | 87 | 113s | -4+ |
| qwen35-122b-a10b-iq4nl | 90 | 204s | 93 | 68s | -3 |
| qwen35-35b-a3b-q6k | 76 (ERR Go) | 94s | 86 | 17s | -10+ |
- qwen35-35b-a3b-q8 thinking vs nonthink: 17 point gap — the largest in the benchmark at Q8. The nonthink variant missed ValueEnum and produced weaker SSE/TS code. Thinking was clearly worth the 4.7x wall time cost for this model.
- qwen35-35b-a3b-q6k nonthink OUTPERFORMS think — at Q6_K quant, nonthink scores 86 vs 76 (ERR on Go), and runs 5.5x faster. The Go ERR may be a server artifact, but even on 5 comparable tasks, nonthink is competitive. Thinking gave worse SSE (const bug) and no benefit here.
- gpt-oss-120b: only 1 point difference — thinking barely helps this model. The nonthink variant is essentially free performance.
- qwen35-122b nonthink outperformed thinking by 3 points AND was 3x faster — the nonthink variant caught ValueEnum while the thinking variant didn't. Thinking hurt this model.
- qwen35-27b nonthink outperformed thinking by 4+ points on completed tasks, though the thinking variant crashed on one task.
Conclusion: Thinking only clearly helps qwen35-35b at Q8. At Q6_K quant, thinking showed no benefit and produced a critical bug in SSE. For all other models, nonthink was equal or better.
The ValueEnum derive was a strong differentiator but scored proportionally alongside other bugs. Without it the code won't compile, but in a real workflow the compiler would catch it immediately — it tests ecosystem knowledge, not coding ability.
| Caught ValueEnum | Count |
|---|---|
| Yes (ValueEnum or FromStr) | 15/33 |
| No | 16/33 |
| Error | 2/33 |
| Base Model | Base Score | Distill Score | Gap |
|---|---|---|---|
| qwen35-35b-a3b-q8 | 100 | 86 | -14 |
| qwen35-27b-q8-nonthink | 87 | 82 | -5 |
Distillation consistently degraded coding performance. The 35b distill lost 14 points — mostly from weaker TS refactoring and missing some JS/Rust bugs.
| Model Family | Quant A | Score | Quant B | Score | Better |
|---|---|---|---|---|---|
| qwen35-122b-a10b | IQ4_NL (57 GB) | 90 | Q4_K_M (72 GB) | 86 | Smaller |
| devstral-small-24b | Q4_0 (13 GB) | 76 | Q8_K_XL (27 GB) | 90 | Larger |
| qwen3-coder-30b-a3b | Q4_K_XL (17 GB) | 76 | Q8_K_XL (34 GB) | 73 | Smaller |
| qwen35-35b-a3b (nonthink) | Q6_K (27 GB) | 86 | Q8 (34 GB) | 83 | Smaller |
No consistent pattern — quantization effects are model-dependent.
- mistral-small-4-119b-iq4nl — Crashed on 5 of 6 tasks. Only completed JS SSE (scoring 14). Unusable.
- glm-47-flash-q8kxl — Crashed on 4 of 6 tasks. Only completed Rust (11) and Python (4). Can't handle larger inputs.
- qwen35-08b-q8 — Too small. Crashed on Go, produced broken code on everything else. Score 29 from 4 tasks.
- qwen35-35b-a3b-q6k-think — Failed Go middleware (invalid JSON from server). Its nonthink sibling at the same quant scored higher (86 vs 76) and ran 5.5x faster.
- nemotron-cascade-2-30b-a3b-q40 — Worked in round 1 (46/60) but failed all 3 round 2 tasks. Unstable.
- step-35-flash-q4km — 111.7 GB, 545s average, crashed on 3 tasks. When it works it's good (18 on Rust), but too unreliable and slow.
JS Bug Fix (Round 1):
| Bug | Fixed | Difficulty |
|---|---|---|
| `_bubbleUp` parentheses | 24/26 | Easy |
| `once()` wrong reference | 19/26 | Medium |
| `clear()` order of operations | 10/26 | Hard |
Rust CLI (Round 2):
| Bug | Fixed | Difficulty |
|---|---|---|
| `mark_done` immutable ref | 30/30 | Easy |
| `next_id` not incremented | 29/30 | Easy |
| `remove` wrong retain | 28/30 | Easy |
| `ValueEnum` missing | 14/30 | Hard |
The Go middleware task separated models by implementation quality:
| Approach | Models | Quality |
|---|---|---|
| Sliding window (timestamp list) | 8 | Best — accurate, no burst edge cases |
| Counter with per-IP reset | 12 | Good — simple, minor burst window at reset |
| Counter with global ticker reset | 3 | Acceptable — all IPs reset simultaneously |
| Ticker-based (1 tick = 1 request) | 1 | Wrong — allows ~1 req/s instead of 10 |
| Mutex held during request | 1 | Bug — serializes all requests |