Side-by-side comparison of the native tool architecture (main) and the skill+bash architecture (skills) across 85+ QA scenarios. Both runs use the same Qwen 3.5 397B model, the same hardware, the same NATS infrastructure, and the same QA evaluator. Only the os-assistant binary and the scenario acceptance criteria differ.
Skills scored 48 GREAT to main's 44, with an identical BAD count (23 each). Token usage was comparable (10.3M for both), and skills needed fewer LLM calls (4,869 vs 5,649). Median time to first audio was 1.5s for skills vs 1.7s for main.
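To put the headline figures on a common scale, the relative change from main to skills can be worked out with a few lines of Python. This is only a sketch of the arithmetic: the `relative_change` helper and the `results` dictionary are illustrative names, not part of the QA harness, and the values are the ones quoted above.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Return the change from baseline to candidate as a percentage of baseline."""
    return (candidate - baseline) / baseline * 100.0


# Figures quoted in the summary, stored as (main, skills).
# The metric names are hypothetical labels chosen for this sketch.
results = {
    "great": (44, 48),
    "bad": (23, 23),
    "llm_calls": (5649, 4869),
    "median_ttfa_s": (1.7, 1.5),
}

for metric, (main, skills) in results.items():
    change = relative_change(main, skills)
    print(f"{metric}: main={main} skills={skills} change={change:+.1f}%")
```

With these inputs, skills comes out at roughly 9.1% more GREAT results, 13.8% fewer LLM calls, and an 11.8% lower median time to first audio, with no change in BAD count.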
| Main | Skills |
|---|---|