Date: 2026-03-22/24
Total models tested: 35 (34 local + 1 API reference)
Total tasks: 6 (across 2 rounds)
Total runtime: ~8 hours
All models run on the same machine via llama-server in router mode (--models-preset, --models-max 1). Each model is loaded one at a time using its preset configuration from preset.ini, which defines per-model parameters (temperature, context size, quantization, GPU layers, etc.). No temperature or sampling overrides were applied by the benchmark script — all inference used each model's own preset settings.
The benchmark script sends each task as an OpenAI-compatible /v1/chat/completions request and records the response along with the server-reported timings and usage stats. Wall clock time is measured for each request (includes model thinking time, prompt processing, and generation).
Six coding tasks across four languages, each targeting a different skill:
Round 1:
Task 1 — Generate Python Function
- Write `top_k_frequent(nums, k)` — return the k most frequent elements from a list of integers.
- Constraints: no `collections.Counter`, type hints required, time complexity better than O(n log n), specific tie-breaking rules.
- Note: the prompt wording "as a sorted list" was ambiguous — both value-ascending and frequency-descending interpretations were accepted.
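The intended better-than-O(n log n) approach is bucket sort by frequency. A minimal sketch follows; the tie-breaking shown (ascending value within equal frequency) is one of the two accepted interpretations, and the exact prompt rules are an assumption:

```python
def top_k_frequent(nums: list[int], k: int) -> list[int]:
    """Return the k most frequent elements via bucket sort, O(n) overall."""
    # Count frequencies by hand (collections.Counter was forbidden).
    counts: dict[int, int] = {}
    for n in nums:
        counts[n] = counts.get(n, 0) + 1

    # buckets[f] holds every value that occurs exactly f times.
    buckets: list[list[int]] = [[] for _ in range(len(nums) + 1)]
    for value, freq in counts.items():
        buckets[freq].append(value)

    # Walk from the highest frequency down, collecting k values.
    # Sorting inside a bucket is cheap: bucket sizes are small in practice,
    # and this is only one possible tie-breaking rule.
    result: list[int] = []
    for freq in range(len(buckets) - 1, 0, -1):
        for value in sorted(buckets[freq]):
            result.append(value)
            if len(result) == k:
                return result
    return result
```

Models that reached for `sorted()` over the whole count table paid the O(n log n) penalty this structure avoids.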
Task 2 — Fix JavaScript Bugs (~300 lines)
- A full async task queue module with `EventEmitter`, `PriorityQueue`, `TaskQueue`, `RateLimiter`, and `ScheduledTaskRunner`. Contains 3 functional bugs and 3 code smells.
- Functional bugs (cause incorrect behavior):
  1. `_bubbleUp` operator precedence (Critical) — breaks the heap.
  2. `once()` wrong reference (Critical) — once never unregisters.
  3. `clear()` order of operations (Medium) — stats adjustment is a no-op.
- Code smells (bad practice but don't cause incorrect behavior):
  4. `_executeWithTimeout` resolve-after-reject — a settled Promise ignores subsequent calls.
  5. `getTask` loose equality (`==` vs `===`).
  6. `drain` listener accumulation (`on` vs `once`).
- Scoring focused on the 3 functional bugs. Code smell fixes received minor bonus credit.
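The `_bubbleUp` bug is the classic precedence trap in the heap parent-index calculation. The buggy JS line isn't reproduced in this report, so its exact form is an assumption; a Python sketch of a correct min-heap bubble-up, with the trap noted in a comment:

```python
def bubble_up(heap: list[int], i: int) -> None:
    """Restore the min-heap property by moving heap[i] toward the root."""
    while i > 0:
        # Parentheses required: without them, i - 1 // 2 evaluates as
        # i - (1 // 2) == i, so parent == i and the loop exits at once,
        # silently leaving the heap out of order.
        parent = (i - 1) // 2
        if heap[parent] <= heap[i]:
            break
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent
```

The failure mode is quiet: nothing crashes, the queue simply stops returning items in priority order, which is why this one was rated Critical.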
Task 3 — Refactor TypeScript (~300 lines)
- A module with a logger, LRU cache store, and request pipeline. Contains heavy duplication.
- Key targets: unify 4 identical log functions, simplify `shouldLog`, use modern array methods/template literals, preserve all 23 exports.
Round 2:
Task 4 — Fix and Extend Rust CLI
- A `clap`-based task manager with 4 bugs (2 compilation errors, 2 logic bugs) plus a feature request to add an `edit` subcommand.
- Bugs:
  1. `Priority` missing `ValueEnum` derive — code won't compile; clap can't parse the enum.
  2. `next_id` never incremented — all tasks get ID 0.
  3. `mark_done` uses an immutable reference — won't compile.
  4. `remove` retains the wrong items (`==` should be `!=`).
- Scored proportionally: each bug fixed ≈ 3.5 points, edit feature ≈ 2 points, base ≈ 2. The `ValueEnum` bug tests Rust ecosystem knowledge (clap's derive API) rather than general coding ability — in a real workflow the compiler would catch it immediately.
Task 5 — Add Go HTTP Middleware
- Add logging middleware, per-IP rate limiting (10 req/s), and graceful shutdown to an existing `net/http` server. Stdlib only.
Task 6 — Build JS SSE Client
- Implement a `ServerSentEvents` class from scratch in vanilla JS. Parse the SSE wire format from a ReadableStream: data accumulation, event types, last event ID, retry, null byte handling, CR/CRLF/LF line endings, and stream chunk boundaries.
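The parsing rules being graded come from the SSE specification and are language-agnostic, so here is a Python sketch of the core field handling (the task required vanilla JS). It covers comment lines, multi-line data, CR/CRLF/LF endings, and the null-byte id rule, but deliberately omits ReadableStream chunk boundaries and retry, which the real task also graded:

```python
def parse_sse_events(stream: str) -> list[tuple[str, str, str]]:
    """Parse SSE wire format; returns a list of (event, data, last_id) tuples."""
    events: list[tuple[str, str, str]] = []
    data_lines: list[str] = []
    event_type = ""
    last_id = ""
    # Normalize CRLF and lone CR to LF before splitting into lines.
    lines = stream.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    for line in lines:
        if line == "":                     # blank line dispatches the event
            if data_lines:
                events.append((event_type or "message",
                               "\n".join(data_lines), last_id))
            data_lines, event_type = [], ""
            continue
        if line.startswith(":"):           # comment line, ignored
            continue
        field, _, value = line.partition(":")
        value = value[1:] if value.startswith(" ") else value
        if field == "data":
            data_lines.append(value)       # multi-line data joined with \n
        elif field == "event":
            event_type = value
        elif field == "id" and "\x00" not in value:
            last_id = value                # ids containing NUL are ignored
    return events
```

Several failure modes in the results table map directly onto these rules: collapsing blank lines (llama-4-scout) means the dispatch branch never fires, and recreating the decoder per chunk breaks multi-byte characters at chunk boundaries.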
| Range | Meaning |
|---|---|
| 1–4 | Broken, didn't follow instructions, or mostly wrong |
| 5–8 | Attempted but significant issues (new bugs, missed critical items, wrong output) |
| 9–12 | Functional with notable gaps (missed some bugs/opportunities, minor issues) |
| 13–16 | Good — correct with minor imperfections |
| 17–20 | Excellent — thorough, clean, nothing meaningful to criticize |
Task-specific notes:
- Python: models using `sorted()` capped at 11–12 (doesn't meet the complexity requirement). Incorrect output scored 5–8. Using `Counter` scored 2.
- JS Bug Fix: scored on the 3 functional bugs. Code smells treated as bonuses.
- TS Refactor: Scored on duplication reduction, modernization, export preservation, and whether code got shorter.
- Rust CLI: scored proportionally per bug fixed. `ValueEnum` is one of 4 bugs, not a hard gate — models that fixed 3/4 bugs and added the edit feature scored 15.
- Go Middleware: all 3 features must be present. Rate limiter quality (sliding window vs counter reset vs broken Ticker) differentiates within the 13–17 range.
- JS SSE: Must handle ReadableStream, TextDecoder, multi-line data, CR/CRLF, null byte in ID, comment lines, and partial chunks across reads.
The scoring process was iteratively refined during the review:
- JS Bug Fix correction: three items initially classified as bugs (`_executeWithTimeout` resolve-after-reject, `getTask` loose equality, `drain` on vs once) were reclassified as code smells after analysis showed they don't cause incorrect behavior in JavaScript. Scores were adjusted to not penalize models for correctly leaving them unchanged.
- Python ambiguity: the prompt's "as a sorted list" wording doesn't clearly specify sort order. Both value-ascending and frequency-descending interpretations were accepted.
- Rust proportional scoring: initially, missing `ValueEnum` was treated as a hard gate (capping scores at 10). This was corrected to proportional scoring (each bug ≈ 3.5 points) since `ValueEnum` tests framework-specific knowledge rather than coding ability, and in a real workflow the compiler would catch it immediately.
| Rank | Model | Size (GB) | Avg Wall | Py | JS | TS | Rs | Go | SSE | Total (/120) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-haiku-4-5 † | API | ~15s | 18 | 17 | 16 | 17 | 18 | 17 | 103 |
| 2 | gpt-oss-120b-q8kxl | 60.0 | 28s | 18 | 17 | 13 | 18 | 17 | 17 | 100 |
| 2 | qwen35-35b-a3b-q8 | 34.4 | 112s | 17 | 18 | 15 | 18 | 16 | 16 | 100 |
| 4 | gpt-oss-120b-q8kxl-nonthink | 60.0 | 28s | 18 | 17 | 13 | 18 | 17 | 16 | 99 |
| 4 | gpt-oss-20b-q8kxl | 12.3 | 22s | 17 | 17 | 14 | 18 | 16 | 17 | 99 |
| 6 | qwen35-122b-a10b-iq4nl-nonthink | 57.2 | 68s | 17 | 14 | 14 | 18 | 16 | 14 | 93 |
| 7 | devstral-small-2-24b-2512-q8kxl | 27.0 | 93s | 15 | 14 | 15 | 18 | 14 | 14 | 90 |
| 7 | qwen35-122b-a10b-iq4nl | 57.2 | 204s | 17 | 14 | 14 | 15 | 15 | 15 | 90 |
| 9 | qwen3-coder-next-q40 | 42.2 | 21s | 17 | 14 | 14 | 15 | 14 | 15 | 89 |
| 9 | qwen3-coder-next-q6k | 61.1 | 24s | 16 | 16 | 13 | 15 | 14 | 15 | 89 |
| 9 | llama-4-scout-q40 | 57.0 | 48s | 16 | 14 | 13 | 18 | 14 | 14 | 89 |
| 12 | qwen35-27b-q8-nonthink | 26.6 | 113s | 13 | 17 | 12 | 18 | 14 | 13 | 87 |
| 13 | devstral-2-123b-2512-q40 | 66.1 | 262s | 11 | 13 | 14 | 18 | 15 | 15 | 86 |
| 13 | qwen35-122b-a10b-q4km | 71.5 | 354s | 15 | 14 | 13 | 15 | 15 | 14 | 86 |
| 13 | qwen35-35b-a3b-claude-distill-q6k | 26.6 | 54s | 17 | 16 | 10 | 15 | 14 | 14 | 86 |
| 13 | qwen35-35b-a3b-q6k-nonthink | 27.0 | 17s | 17 | 16 | 15 | 15 | 11 | 12 | 86 |
| 17 | qwen35-35b-a3b-q8-nonthink | 34.4 | 24s | 17 | 16 | 10 | 15 | 14 | 11 | 83 |
| 18 | minimax-m25-reap-139b-q4km | 78.4 | 108s | 11 | 13 | 16 | 15 | 14 | 13 | 82 |
| 18 | qwen35-27b-claude-distill-q8-nonthink | 26.6 | 186s | 6 | 14 | 11 | 18 | 14 | 14 | 82 |
| 20 | qwen35-9b-q8 | 8.9 | 68s | 11 | 14 | 15 | 15 | 12 | 14 | 81 |
| 20 | kimi-linear-48b-a3b-q6kl | 37.9 | 22s | 11 | 13 | 16 | 15 | 13 | 13 | 81 |
| 22 | qwen3-coder-30b-a3b-1m-q8kxl | 33.5 | 18s | 11 | 10 | 12 | 15 | 14 | 14 | 76 |
| 22 | qwen3-coder-30b-a3b-q4kxl | 16.5 | 14s | 11 | 10 | 12 | 15 | 14 | 14 | 76 |
| 22 | devstral-small-2-24b-2512-q40 | 12.6 | 55s | 5 | 13 | 15 | 15 | 14 | 14 | 76 |
| 22 | mistral-small-4-119b-iq4nl-nonthink | 63.7 | 64s | 9 | 13 | 11 | 15 | 14 | 14 | 76 |
| 26 | qwen3-coder-30b-a3b-q8kxl | 33.5 | 18s | 11 | 10 | 13 | 15 | 10 | 14 | 73 |
† claude-haiku-4-5 is an API reference model (Claude Haiku 4.5), not a local LLM. Timings are agent wall-clock time and are not directly comparable to local inference times.
*qwen35-27b-claude-distill-q8-nonthink scored 18 on Rust (caught ValueEnum) despite lower scores elsewhere — an unusual profile.
Incomplete results (ERR on 1+ tasks):
| Model | Size (GB) | Avg Wall | Completed | Score |
|---|---|---|---|---|
| qwen35-27b-q8 | 26.6 | 388s | 5/6 | 83+ |
| nemotron-cascade-2-30b-a3b-q8 | 31.3 | 156s | 4/6 | 83+ |
| qwen35-35b-a3b-q6k-think | 27.0 | 94s | 5/6 | 76 |
| gpt-oss-20b-coding-distill-mxfp4 | 12.8 | 30s | 4/6 | 57+ |
| step-35-flash-q4km | 111.7 | 545s | 3/6 | 54+ |
| nemotron-cascade-2-30b-a3b-q40 | 17.0 | 159s | 3/6 | 51+ |
| qwen35-08b-q8 | 0.8 | 21s | 4/6 | 29+ |
| glm-47-flash-q8kxl | 33.2 | 469s | 2/6 | 19+ |
| mistral-small-4-119b-iq4nl | 63.7 | 347s | 1/6 | 15+ |
ERR = model crashed (proxy error), produced empty content, or timed out. These models are excluded from the main ranking.
claude-haiku-4-5 (Rank #1, 103/120) † — Highest score in the benchmark. Strong across all 6 tasks, no failures. Token bucket rate limiter in Go, TextDecoder stream:true in SSE, O(n) Python, all 3 JS functional bugs found. Self-corrected by noticing a missing import. As an API reference model, not directly comparable to local LLMs on speed or size.
gpt-oss-120b-q8kxl (Rank #2, 100/120) — Tied for first among local models. Highest Python score (18/20, bucket sort O(n)), strong across all languages. The Go middleware included a cleanup goroutine for the rate limiter — production-quality code. At 60 GB it's large, but at 28s average wall time it's one of the fastest.
qwen35-35b-a3b-q8 (Rank #2, 100/120) — Tied for first among local models. The only model to score 18 on the JS bug fix (found all 3 functional bugs plus all code smells). Also scored 18 on Rust (caught ValueEnum). As a 35B MoE (3B active), it's half the size of gpt-oss-120b but 4x slower (112s avg) due to thinking.
gpt-oss-20b-q8kxl (Rank #4, 99/120) — One point behind the local leaders at a fraction of the size. 12.3 GB, 22s average. The best local model in the benchmark by any efficiency metric. Scored 17+ on every task except TS refactoring (14).
Beyond raw scores, models differ in how they think about code. For agentic coding, the ideal collaborator writes clean, minimal solutions — not sloppy, not over-engineered.
Tier 1 — Top collaborators (score 93+)
| Model | Style | Strength | Weakness |
|---|---|---|---|
| claude-haiku-4-5 † | Thorough generalist | Highest benchmark score (103). Correct O(n) Python, all JS bugs found, comprehensive Go with token bucket + responseWriter wrapper. SSE handles all edge cases (TextDecoder stream:true, CRLF, null byte, retry). Self-corrects (noticed missing import in Go). | No local deployment — API only. Markdown fences in output (standard for agentic use). TS refactor solid but not exceptional. |
| gpt-oss-120b-q8kxl | Pragmatist | Writes the minimum needed. Cleanest SSE (107 lines). Uses stdlib correctly. Go middleware includes cleanup goroutine — production-quality without being asked. | The cleanup goroutine is arguably over-engineering for a task that didn't ask for it. |
| gpt-oss-120b-q8kxl-nonthink | Same pragmatist | Identical quality to thinking variant at identical speed. 1 point less but no thinking overhead. | Same minor tendency to add production touches. |
| qwen35-35b-a3b-q8 | Craftsman | Finds elegant stdlib solutions (Python: 3-line heapq.nlargest). Only model to catch all JS bugs. | Slow (112s). SSE uses setTimeout recursion instead of a while loop. 9061 avg tokens — thinks a lot. |
| gpt-oss-20b-q8kxl | Careful engineer | Always correct. Fastest top-tier model (22s). Tiny (12.3 GB). Strong across all languages. | Over-handles edges nobody asked about (Python: checks empty/k<=0). Comments more than needed. |
| qwen35-122b-a10b-iq4nl-nonthink | Thorough | Consistent across languages (stdev 1.8). Catches ValueEnum. 388-line SSE is thorough with input validation. | 57 GB and 68s for scores that 27 GB models match. Verbose — SSE is 3.5x longer than gpt-oss-120b's. |
Tier 2 — Good collaborators (score 86–90)
| Model | Style | Strength | Weakness |
|---|---|---|---|
| devstral-small-2-24b-2512-q8kxl | Steady | Most consistent model (stdev 1.5). Good TS refactoring (unified logs). Strong Rust (caught ValueEnum). | Slow for size (93s at 27 GB). Dense model bottleneck. SSE missing TextDecoder reuse. |
| qwen35-122b-a10b-iq4nl | Thoughtful | Strong Python (17). Good SSE with proper TextDecoder reuse. Consistent across tasks. | Missed ValueEnum (ecosystem knowledge gap). 204s average — very slow. 57 GB. |
| qwen3-coder-next-q40 | Builder | Structures code well. Fastest quality model (21s). Good Go/SSE architecture. Python used bucket sort (O(n)). | Over-engineers: SSE has unused AbortController, creates new TextDecoder per chunk (breaks multi-byte chars), complex CR handling that's actually wrong. |
| qwen3-coder-next-q6k | Same builder | Slightly better JS bug-fix than q40 (16 vs 14). | Same over-engineering tendency. 61 GB — 50% larger than q40 for identical quality. |
| llama-4-scout-q40 | Surface modernist | Uses modern JS features (private # fields in SSE). Good Rust (18). Even scores. | SSE looks modern but is fundamentally broken: split(/[\n\r]+/) collapses blank lines, so events never dispatch correctly. Style over substance. |
| devstral-2-123b-2512-q40 | Slow but correct | Strong Rust (18). Good SSE and Go. Fixes bugs methodically. | 262s avg, 66 GB — worst speed/quality ratio. Creates new TextDecoder per SSE chunk. |
| qwen35-122b-a10b-q4km | Verbose thinker | Solid across tasks. Proper min-heap in Python. | 354s avg, 71 GB — extremely slow. 185-line SSE for what others do in 107. Missed ValueEnum. |
| qwen35-35b-a3b-claude-distill-q6k | Fluent writer | Readable output. Fast (54s). Good Python (17). Decent JS bug-fix (16). | Distillation traded precision for fluency. TS refactoring is worst score (10) — code got longer. Missed ValueEnum. |
| qwen35-35b-a3b-q6k-nonthink | Efficient generalist | Fastest non-coder model at this quality tier (17s). Good Python (17) and JS (16). Clean TS refactor (15). | Go middleware has bad handler wiring and missing imports — won't compile. CRLF bug in SSE. Missed ValueEnum. |
Tier 3 — Adequate collaborators (score 76–85)
| Model | Style | Strength | Weakness |
|---|---|---|---|
| qwen35-27b-q8-nonthink | Precise debugger | Best JS bug-fix among nonthink models (17). Strong Rust (18). | Uneven — TS refactoring is weakest (12). SSE missing multi-line data handling. 113s avg. |
| qwen35-35b-a3b-q8-nonthink | Quick generalist | Fast (24s). Good Python (17) and JS (16). Same model as the top-scoring local qwen35-35b-a3b-q8, but without thinking. | Without thinking, misses ValueEnum, produces weaker SSE (11) and TS (10). The thinking gap is real for this model. |
| minimax-m25-reap-139b-q4km | Refactorer | Best TS refactoring (16). Clean structural improvements. | Weak at algorithms (Python 11). SSE missing comment handling. 78 GB for middling quality. |
| qwen35-27b-claude-distill-q8-nonthink | Inconsistent | Caught ValueEnum (18 on Rust) — unusual for a distill model. Decent Go/SSE. | Python score is worst in the benchmark (6) — inverted heap logic. 186s avg. Wildly uneven profile. |
| qwen35-9b-q8 | Compact all-rounder | Best TS refactoring for its size (15). Only 8.9 GB. | Weak algorithms (Python 11). Go rate limiter holds mutex during entire request — serializes all traffic. |
| kimi-linear-48b-a3b-q6kl | Minimalist | Simplest approach every time. Most token-efficient (1412 avg). Python reads like pseudocode. Fast (22s). | Too simple — misses complexity requirements, weaker bug detection, SSE missing comment handling. |
| qwen3-coder-30b-a3b-q4kxl | Fast minimalist | Fastest model (14s). Reliable (6/6). Clean Go middleware. | Missed once() JS bug. Naive sorted() for Python. Missed ValueEnum. Speed without depth. |
| qwen3-coder-30b-a3b-1m-q8kxl | Same as q4kxl | 1M context variant — no quality advantage over base. 18s. | Same weaknesses. The extra context didn't help on any task. |
| devstral-small-2-24b-2512-q40 | Budget option | Good TS refactoring (15). Decent Go and SSE (14 each). 12.6 GB. | Buggy Python (5) — inverted heap comparison. Q4_0 quant clearly hurts vs Q8_K_XL. |
| mistral-small-4-119b-iq4nl-nonthink | Late starter | Decent Go middleware (14). Clean SSE parsing logic. Uses Set for callbacks (prevents duplicates). | SSE doesn't auto-start — requires a manual processStream() call, breaking the spec. Creates a new TextDecoder per chunk. Missed ValueEnum. |
Tier 4 — Unreliable or insufficient (score <73 or high error rate)
| Model | Style | Strength | Weakness |
|---|---|---|---|
| qwen3-coder-30b-a3b-q8kxl | Broken specialist | Clean code structure when it works. | Go rate limiter uses time.Ticker — fundamentally wrong (allows ~1 req/s not 10). Missed once(). Missed ValueEnum. |
| qwen35-27b-q8 | Overthinking | Good Python (17). Strong Rust (18). Catches ValueEnum. | 388s avg — by far the slowest usable model. Crashed on SSE. Thinking generates massive token counts for modest gains. |
| qwen35-35b-a3b-q6k-think | Capable but fragile | Best Python in benchmark (18, O(n) bucket sort). Good Rust (16, FromStr). Fixes JS code smells as bonus. | Go task ERR (invalid JSON). SSE has critical const reassignment bug that crashes parsing on any standard field. Its faster nonthink sibling actually scores higher. |
| nemotron-cascade-2-30b-a3b-q8 | Unreliable | Good when it works (Rust 18, Go 16). Correct sliding-window rate limiter. | Failed 2/6 tasks (SSE, round 2). Missing logging function name didn't match common patterns. |
| nemotron-cascade-2-30b-a3b-q40 | Broken | Good round 1 scores (Python 17, TS 16). | Failed all 3 round 2 tasks. Completely unstable — cannot be relied on. |
| gpt-oss-20b-coding-distill-mxfp4 | Format-broken | Perfect JS bug-fix (18). Good Go middleware. | Server can't parse its output format (channel tags) — fails 2/6 tasks. Unusable without server-side fixes. |
| step-35-flash-q4km | Unreliable giant | Strong Rust (18) and Python (16) when it works. | 111.7 GB, 545s avg. Failed 3/6 tasks. Too slow and too unreliable. |
| qwen35-08b-q8 | Too small | Fast (21s). | Fundamentally insufficient for coding tasks. Uses Counter (forbidden). 281-line SSE with maxEvents, bufferResetCount — massive over-engineering for a broken implementation. The philosopher model. |
| glm-47-flash-q8kxl | Context-limited | Caught ValueEnum in Rust (one of few). | Crashed 4/6 tasks. Can't handle inputs over ~200 lines. 469s when it works. |
| mistral-small-4-119b-iq4nl | Broken | One successful task (SSE: 14) shows it CAN code. | Crashed 5/6 tasks. Puts everything in reasoning_content with empty code output. Hallucinated non-existent bugs. |
For agentic coding workflows where speed, reliability, and clean output matter more than peak score:
| Model | Score | Wall | Size | Reliable | Consistent | Clean Code | Agentic Pick? |
|---|---|---|---|---|---|---|---|
| claude-haiku-4-5 † | 103 | ~15s | API | 6/6 | excellent | thorough | Best score; API only |
| gpt-oss-20b-q8kxl | 99 | 22s | 12.3 | 6/6 | excellent | careful | Best local agent |
| gpt-oss-120b-q8kxl-nonthink | 99 | 28s | 60.0 | 6/6 | good | cleanest | Best local if VRAM allows |
| qwen3-coder-next-q40 | 89 | 21s | 42.2 | 6/6 | excellent | over-engineers | Fast but watch quality |
| kimi-linear-48b-a3b-q6kl | 81 | 22s | 37.9 | 6/6 | good | minimal | Simple tasks only |
| qwen35-35b-a3b-q8 | 100 | 112s | 34.4 | 6/6 | excellent | elegant | Best local quality, too slow for agents |
Quality per second, quality per GB:
| Model | Score | Avg Wall | Size | Score/Min | Score/GB |
|---|---|---|---|---|---|
| gpt-oss-20b-q8kxl | 99 | 22s | 12.3 | 270 | 8.0 |
| gpt-oss-120b-q8kxl-nonthink | 99 | 28s | 60.0 | 216 | 1.7 |
| gpt-oss-120b-q8kxl | 100 | 28s | 60.0 | 214 | 1.7 |
| qwen3-coder-next-q40 | 89 | 21s | 42.2 | 254 | 2.1 |
| kimi-linear-48b-a3b-q6kl | 81 | 22s | 37.9 | 221 | 2.1 |
| qwen35-35b-a3b-q8-nonthink | 83 | 24s | 34.4 | 208 | 2.4 |
| llama-4-scout-q40 | 89 | 48s | 57.0 | 111 | 1.6 |
| qwen35-122b-a10b-iq4nl-nonthink | 93 | 68s | 57.2 | 82 | 1.6 |
| qwen35-35b-a3b-q8 | 100 | 112s | 34.4 | 54 | 2.9 |
| Base Model | Think Score | Think Wall | NoThink Score | NoThink Wall | Delta |
|---|---|---|---|---|---|
| qwen35-35b-a3b-q8 | 100 | 112s | 83 | 24s | +17 |
| gpt-oss-120b-q8kxl | 100 | 28s | 99 | 28s | +1 |
| qwen35-27b-q8 | 83+ | 388s | 87 | 113s | -4+ |
| qwen35-122b-a10b-iq4nl | 90 | 204s | 93 | 68s | -3 |
| qwen35-35b-a3b-q6k | 76 (ERR Go) | 94s | 86 | 17s | -10+ |
- qwen35-35b-a3b-q8 thinking vs nonthink: 17 point gap — the largest in the benchmark at Q8. The nonthink variant missed ValueEnum and produced weaker SSE/TS code. Thinking was clearly worth the 4.7x wall time cost for this model.
- qwen35-35b-a3b-q6k nonthink OUTPERFORMS think — at Q6_K quant, nonthink scores 86 vs 76 (ERR on Go), and runs 5.5x faster. The Go ERR may be a server artifact, but even on 5 comparable tasks, nonthink is competitive. Thinking gave worse SSE (const bug) and no benefit here.
- gpt-oss-120b: only 1 point difference — thinking barely helps this model. The nonthink variant is essentially free performance.
- qwen35-122b nonthink outperformed thinking by 3 points AND was 3x faster — the nonthink variant caught ValueEnum while the thinking variant didn't. Thinking hurt this model.
- qwen35-27b nonthink outperformed thinking by 4+ points on completed tasks, though the thinking variant crashed on one task.
Conclusion: Thinking only clearly helps qwen35-35b at Q8. At Q6_K quant, thinking showed no benefit and produced a critical bug in SSE. For all other models, nonthink was equal or better.
The ValueEnum derive was a strong differentiator but scored proportionally alongside other bugs. Without it the code won't compile, but in a real workflow the compiler would catch it immediately — it tests ecosystem knowledge, not coding ability.
| Caught ValueEnum | Count |
|---|---|
| Yes (ValueEnum or FromStr) | 15/33 |
| No | 16/33 |
| Error | 2/33 |
| Base Model | Base Score | Distill Score | Gap |
|---|---|---|---|
| qwen35-35b-a3b-q8 | 100 | 86 | -14 |
| qwen35-27b-q8-nonthink | 87 | 82 | -5 |
Distillation consistently degraded coding performance. The 35b distill lost 14 points — mostly from weaker TS refactoring and missing some JS/Rust bugs.
| Model Family | Quant A | Score | Quant B | Score | Better |
|---|---|---|---|---|---|
| qwen35-122b-a10b | IQ4_NL (57 GB) | 90 | Q4_K_M (72 GB) | 86 | Smaller |
| devstral-small-24b | Q4_0 (13 GB) | 76 | Q8_K_XL (27 GB) | 90 | Larger |
| qwen3-coder-30b-a3b | Q4_K_XL (17 GB) | 76 | Q8_K_XL (34 GB) | 73 | Smaller |
| qwen35-35b-a3b (nonthink) | Q6_K (27 GB) | 86 | Q8 (34 GB) | 83 | Smaller |
No consistent pattern — quantization effects are model-dependent.
- mistral-small-4-119b-iq4nl — Crashed on 5 of 6 tasks. Only completed JS SSE (scoring 14). Unusable.
- glm-47-flash-q8kxl — Crashed on 4 of 6 tasks. Only completed Rust (11) and Python (4). Can't handle larger inputs.
- qwen35-08b-q8 — Too small. Crashed on Go, produced broken code on everything else. Score 29 from 4 tasks.
- qwen35-35b-a3b-q6k-think — Failed Go middleware (invalid JSON from server). Its nonthink sibling at the same quant scored higher (86 vs 76) and ran 5.5x faster.
- nemotron-cascade-2-30b-a3b-q40 — Worked in round 1 (46/60) but failed all 3 round 2 tasks. Unstable.
- step-35-flash-q4km — 111.7 GB, 545s average, crashed on 3 tasks. When it works it's good (18 on Rust), but too unreliable and slow.
JS Bug Fix (Round 1):
| Bug | Fixed | Difficulty |
|---|---|---|
| `_bubbleUp` parentheses | 24/26 | Easy |
| `once()` wrong reference | 19/26 | Medium |
| `clear()` order of operations | 10/26 | Hard |
Rust CLI (Round 2):
| Bug | Fixed | Difficulty |
|---|---|---|
| `mark_done` immutable ref | 30/30 | Easy |
| `next_id` not incremented | 29/30 | Easy |
| `remove` wrong retain | 28/30 | Easy |
| `ValueEnum` missing | 14/30 | Hard |
The Go middleware task separated models by implementation quality:
| Approach | Models | Quality |
|---|---|---|
| Sliding window (timestamp list) | 8 | Best — accurate, no burst edge cases |
| Counter with per-IP reset | 12 | Good — simple, minor burst window at reset |
| Counter with global ticker reset | 3 | Acceptable — all IPs reset simultaneously |
| Ticker-based (1 tick = 1 request) | 1 | Wrong — allows ~1 req/s instead of 10 |
| Mutex held during request | 1 | Bug — serializes all requests |