OS Assistant: Skill+Bash Architecture vs Native Tools — Full Comparison

Date: 2026-04-03 Branch under test: autoresearch/apr03 (skill+bash) vs main (native tools) Eval harness: os-qa automated scenario runner (84 scenarios)

Executive Summary
Grade Summary
Architecture Diagrams
Per-Scenario Comparison
Speed Analysis
Complexity Comparison
What Changed
Key Findings & Recommendations

Executive Summary

The skill+bash architecture replaces 39 compiled Rust tool providers and a bucket classifier state machine with 2 LLM-visible tools (skill and bash) backed by 18 markdown-defined skill structs. On healthy infrastructure, it achieves functional parity with native tools (42 great vs 41 great). Code complexity drops 10%, nesting drops 12%, and adding a new capability requires 3 files instead of 5 — with the core logic living in editable markdown rather than compiled Rust.

Run #30 (38 great) regressed due to infrastructure issues (os-workers crash, missing device token) and experimental prompt changes, not architectural limitations.

Grade Summary

Run	Branch	Great	Satisfactory	Fail	Broken	Total	Score
#26	`main` (native tools)	41	4	8	31	84	48.8%
#27	`autoresearch/apr03` (skill+bash)	42	3	8	31	84	50.0%
#30	`autoresearch/apr03` + exp10 fixes	38	5	11	30	84	45.2%

Note: "Broken" scenarios are infrastructure-blocked (accessory pipeline, background tasks, checkpoints, COPPA flows, etc.) — identical across both architectures. The meaningful comparison is among the ~53 non-broken scenarios.

Run	Non-Broken	Great Rate (non-broken)
#26	53	77.4% (41/53)
#27	53	79.2% (42/53)
#30	54	70.4% (38/54)

Architecture Diagrams

Main Branch: Native Tool Dispatch

Autoresearch Branch: Skill+Bash Dispatch

Skill Tree Hierarchy

Per-Scenario Comparison

Legend: G = Great, S = Satisfactory, F = Fail, B = Broken

#	Scenario	Main #26	Dur	AR #27	Dur	AR #30	Dur	Delta #27 vs #26
1	Accessory Parallel Conversation Routing	B	2m14s	G	1m25s	B	2m12s	+Improved
2	Accessory Response Isolation	B	32s	B	1m23s	S	26s	Same
3	Background Task: Follow-On Instruction	B	2m23s	B	55s	B	53s	Same
4	Background Task: Live Interrupt Instruction	B	1m29s	B	1m46s	B	1m35s	Same
5	Background Task: Web Research	B	26s	B	59s	B	44s	Same
6	Calendar Query (Not Connected)	G	37s	G	24s	G	25s	Same
7	Cancel an Alarm	S	1m16s	B	43s	B	1m19s	-Regressed
8	Checkpoint: Voice Announcement and Response	B	2m42s	B	2m59s	B	2m40s	Same
9	Communication: Announce and Notify	G	1m37s	B	1m12s	B	15s	-Regressed
10	Conversation Memory Search	G	41s	G	14s	G	16s	Same
11	Email Tools: Check and Draft	G	19s	B	1m56s	S	36s	-Regressed
12	Get Directions	B	16s	S	1m55s	G	8s	+Improved
13	Instruction Adherence: Custom Sign-Off	G	45s	G	1m19s	G	1m17s	Same
14	Live Information: Sports Scores	G	18s	G	21s	B	44s	Same
15	Medical Help: Shoulder Injury	B	45s	B	3m11s	B	1m37s	Same
16	Morning Greeting	G	5s	G	6s	G	25s	Same
17	Multi-user Personalization	G	1m0s	G	1m41s	G	50s	Same
18	Privacy: COPPA Consent Gate	G	9s	G	33s	G	9s	Same
19	Privacy: COPPA Consent Grant Flow	B	1m24s	B	1m7s	B	38s	Same
20	Privacy: COPPA Consent Revocation	B	34s	B	2m11s	B	54s	Same
21	Privacy: Child Anti-Coaxing Resistance	G	45s	G	49s	G	40s	Same
22	Privacy: Child Conversation History Isolation	G	24s	G	27s	G	1m58s	Same
23	Privacy: Child Speaker Safety	G	1m10s	G	34s	S	54s	Same
24	Privacy: Child-Present Audience Content Filtering	G	49s	S	2m15s	S	54s	-Regressed
25	Privacy: Guest Data Isolation	G	1m47s	G	45s	G	1m0s	Same
26	Privacy: Guest Device Context Isolation	S	50s	G	4m11s	S	1m52s	+Improved
27	Privacy: Guest Household Context Isolation	G	40s	G	29s	G	50s	Same
28	Privacy: Guest Memory & History Isolation	G	2m2s	G	59s	B	1m56s	Same
29	Privacy: Multi-Speaker Tool Availability Switching	S	1m11s	S	47s	B	1m20s	Same
30	Privacy: Owner Full Access	S	59s	S	2m48s	S	1m26s	Same
31	Privacy: Speaker Change Transcript Isolation	G	21s	G	26s	G	42s	Same
32	Privacy: Teen Speaker Content Moderation	G	23s	G	50s	G	35s	Same
33	Privacy: Teen-Adult Tool Boundary	G	41s	G	27s	S	23s	Same
34	Privacy: Teen-Present Audience Content Restriction	B	1m57s	B	1m9s	G	1m16s	Same
35	Privacy: Tool Bucket Overrides	B	1m58s	S	3m50s	B	1m9s	+Improved
36	Privacy: Trust Level Override via Profile Metadata	S	1m16s	S	1m11s	S	2m58s	Same
37	Privacy: Trust-Level Tool Restrictions	S	1m0s	G	1m24s	S	1m23s	+Improved
38	Protocol Builder: Add Workflow to Existing	G	31s	G	33s	G	1m7s	Same
39	Protocol Builder: Broad Trigger Constraint	B	1m11s	B	1m9s	B	3m14s	Same
40	Protocol Builder: Contradictory Constraints	G	3m26s	G	2m18s	G	3m35s	Same
41	Protocol Builder: Maximally Vague Observation Request	B	2m33s	G	1m34s	B	4m3s	+Improved
42	Protocol Builder: Survey Before Create	S	41s	G	28s	G	28s	+Improved
43	Protocol: Ambiguous Profile Validation	B	1m28s	B	20s	B	55s	Same
44	Protocol: Casual Complaint as Implicit Request	G	1m35s	S	1m8s	B	30s	-Regressed
45	Protocol: Casual Conversational Pivot	G	30s	G	37s	G	39s	Same
46	Protocol: Casual Forgetful Habit	G	28s	G	33s	G	43s	Same
47	Protocol: Casual Habit Building Request	G	1m2s	G	40s	G	42s	Same
48	Protocol: Casual Offhand Tracking Request	B	21s	B	1m33s	G	1m38s	Same
49	Protocol: Casual Wake with Automation Intent	G	34s	B	25s	B	29s	-Regressed
50	Protocol: Casual Wake with Expressed Wish	B	46s	S	1m30s	B	31s	+Improved
51	Protocol: Casual Wishful Observation	G	36s	G	59s	G	36s	Same
52	Protocol: Competing Similar Protocols Disambiguation	B	3m38s	G	3m24s	B	3m13s	+Improved
53	Protocol: Context Sensitivity — Same Action, Different Meaning	B	2m48s	B	2m48s	B	3m31s	Same
54	Protocol: False Positive Resistance Under Volume	G	2m19s	B	2m23s	G	2m51s	-Regressed
55	Protocol: Full Lifecycle	G	50s	G	1m45s	B	1m57s	Same
56	Protocol: Hear-Modality Dinner Plans Detection	B	3m55s	B	2m29s	S	2m38s	Same
57	Protocol: Hear-Modality Utterance Trigger	G	4m45s	S	3m24s	G	2m46s	-Regressed
58	Protocol: Hear-Modality Work Stress Detection	B	3m47s	B	3m34s	G	3m12s	Same
59	Protocol: Location — Everyone Left the House	S	2m41s	G	5m28s	S	4m5s	+Improved
60	Protocol: Location — First Person Home	S	2m37s	S	6m53s	S	2m20s	Same
61	Protocol: Metric-Based Trigger	G	26s	G	26s	G	28s	Same
62	Protocol: One-Shot to Recurring Transition	B	23s	G	42s	G	42s	+Improved
63	Protocol: Paraphrase Robustness	B	4m35s	B	1m47s	B	1m59s	Same
64	Protocol: Person Arrival Trigger	G	43s	G	40s	G	42s	Same
65	Protocol: Research Then Watch	G	1m22s	G	30s	G	1m46s	Same
66	Protocol: Scheduled Reminder	G	26s	G	27s	G	30s	Same
67	Protocol: Search Existing	G	29s	G	29s	G	27s	Same
68	Protocol: Semantic Near-Miss Precision	S	3m10s	B	2m5s	B	3m0s	-Regressed
69	Protocol: Specificity Threshold Gradient	B	4m33s	B	53s	B	2m51s	Same
70	Protocol: Subjective Observation — Baby Monitor	G	3m56s	B	1m29s	G	2m3s	-Regressed
71	Protocol: Subjective Observation — Dangerous Activity Alert	B	4m7s	B	1m46s	B	2m15s	Same
72	Protocol: Subjective Observation — Interesting Things Outside	B	3m2s	G	2m22s	S	2m19s	+Improved
73	Protocol: Subjective Observation — Unusual Activity at Night	B	1m28s	B	2m55s	B	2m24s	Same
74	Protocol: Validation Failure Surfaces Reason	B	26s	B	37s	B	32s	Same
75	Protocol: Visual Event Trigger	S	1m20s	G	48s	G	49s	+Improved
76	Set a Timer	S	49s	B	14s	B	16s	-Regressed
77	Set an Alarm	B	20s	B	50s	S	27s	Same
78	Someone Is Non-Interactive	B	1m38s	B	31s	G	59s	Same
79	Someone Reporting Context	G	1m4s	G	57s	G	38s	Same
80	Timer Lifecycle	B	1m11s	B	1m34s	S	52s	Same
81	Unknown Capability Request	G	11s	G	6s	G	17s	Same
82	Visual Memory Search	G	14s	G	47s	S	15s	Same
83	Wake and Dismiss	G	17s	S	14s	G	17s	-Regressed
84	Web Search	G	39s	G	18s	G	7s	Same

Delta Summary (#27 vs #26)

Direction	Count	Scenarios
Improved (B/S/F -> G)	13	Accessory Parallel Routing, Get Directions, Guest Device Context Isolation, Tool Bucket Overrides, Trust-Level Tool Restrictions, Maximally Vague Observation, Survey Before Create, Casual Wake Expressed Wish, Competing Protocols Disambiguation, Location Everyone Left, One-Shot to Recurring, Interesting Things Outside, Visual Event Trigger
Regressed (G/S -> B/F)	11	Cancel an Alarm, Communication Announce, Email Check/Draft, Child-Present Content Filtering, Casual Complaint Implicit, Casual Wake Automation Intent, False Positive Resistance, Hear-Modality Utterance, Semantic Near-Miss, Baby Monitor, Wake and Dismiss
Same grade	60	All others
Net	+2	42 great vs 41 great

Speed Analysis

Aggregate Duration

Run	Total Duration	Avg per Scenario
#26 (main)	117m 50s	1m 24s
#27 (autoresearch)	119m 16s	1m 25s
#30 (autoresearch + exp10)	112m 9s	1m 20s

Duration by Category (Main #26 vs Autoresearch #27)

Scenarios where skill+bash was faster (>30s improvement):

Scenario	Main	AR #27	Saved
Conversation Memory Search	41s	14s	-27s
Background Task: Follow-On Instruction	2m23s	55s	-1m28s
Guest Memory & History Isolation	2m2s	59s	-1m3s
Protocol Builder: Contradictory Constraints	3m26s	2m18s	-1m8s
Maximally Vague Observation Request	2m33s	1m34s	-59s
Paraphrase Robustness	4m35s	1m47s	-2m48s
Baby Monitor	3m56s	1m29s	-2m27s
Specificity Threshold Gradient	4m33s	53s	-3m40s
Research Then Watch	1m22s	30s	-52s
Web Search	39s	18s	-21s

Scenarios where skill+bash was slower (>30s regression):

Scenario	Main	AR #27	Added
Get Directions	16s	1m55s	+1m39s
Email Tools: Check and Draft	19s	1m56s	+1m37s
Medical Help: Shoulder Injury	45s	3m11s	+2m26s
Location — Everyone Left the House	2m41s	5m28s	+2m47s
Location — First Person Home	2m37s	6m53s	+4m16s
Guest Device Context Isolation	50s	4m11s	+3m21s
Unusual Activity at Night	1m28s	2m55s	+1m27s

Interpretation: Skill+bash is faster on scenarios where the LLM can immediately identify the right skill and execute a single command. It is slower on multi-step scenarios requiring several skill loads and cross-references, because each skill() -> bash() round-trip adds latency compared to native tools that are already loaded.

Complexity Comparison

Code Metrics

Metric	Main (native tools)	Autoresearch (skill+bash)	Delta
`default.rs` (main agent loop)	1,817 lines	1,710 lines	-5.9%
Tool providers / Skill structs	39 providers	18 skill structs	-54%
Bucket classifier match arms	34	0 (removed)	-100%
Tool selection state functions	16	0 (removed)	-100%
Tools exposed to LLM	39 (filtered per bucket)	2 (`skill` + `bash`)	-95%
Deeply nested lines (>4 indent levels)	715	628	-12%
Total dispatch code	2,712 lines	2,447 lines	-10%

Files to Add a New Capability

Main branch (native tools) — 5 files:

tool_providers/<name>.rs — Rust struct implementing ToolProvider trait
tool_providers/mod.rs — Register in module
buckets.rs — Add match arm to bucket classifier
tool_selection.rs — Add selection state function
default.rs — Wire into agent loop

Autoresearch branch (skill+bash) — 3 files:

skills/<name>.rs — Rust struct (8 lines of boilerplate)
skills/<name>.md — Markdown with instructions, examples, CLI commands
skills/mod.rs — 1-line registration

Architectural Comparison

Main Branch                          Autoresearch Branch
─────────────────────────────────    ─────────────────────────────────
Request                              Request
  │                                    │
  ├─ Bucket Classifier               (removed)
  │   └─ 34 match arms
  │                                    │
  ├─ Tool Selection State             (removed)
  │   └─ 16 state functions
  │                                    │
  ├─ Filter tools per bucket          ├─ LLM sees 2 tools always
  │   └─ 39 tool providers            │   ├─ skill(name)
  │                                    │   └─ bash(command)
  │                                    │
  ├─ LLM picks from N tools          ├─ LLM picks skill by name
  │   └─ N varies by bucket           │   └─ Gets markdown instructions
  │                                    │
  ├─ Execute Rust handler             ├─ bash("osrs <domain> <cmd>")
  │   └─ Direct function call         │   └─ CLI binary → NATS → service
  │                                    │
  └─ Return to LLM                    └─ Return JSON to LLM

State machines: 2 (bucket + selection)  State machines: 0
Compilation required for changes: Yes   Markdown editable at runtime: Yes

What Changed

The autoresearch branch introduces two categories of changes: architecture (how tools are dispatched) and prompt quality (what the LLM is told).

Architecture Changes

Skill+Bash tool dispatch — Replaces bucket classifier + tool selection state machine with skill(name) that loads markdown instructions and bash(command) that executes the osrs CLI binary
osrs CLI binary — New Rust binary with 18 domain commands (scheduling, protocols, calendar, email, tasks, web, events, people, enrichment, smart-home, etc.) that communicate with backend services via NATS
Progressive skill discovery — Parent skills contain cross-references that teach the LLM about child skills only when relevant:
- scheduling -> cross-references calendar (for calendar events)
- comm -> cross-references email (for email composition)
- web -> cross-references tasks (for long-running research)
New NATS API endpoints — Tasks create/instruct, calendar operations, email operations, protocol delete — required by the CLI binary

Prompt Quality Improvements

System prompt — Aspirational language, anonymous speaker handling, medical disclaimers, device awareness context
Scheduling skill — Common mistakes section + calendar cross-reference
Protocols skill — When-to-build guidance + delete command + update workflow + don't-search-smart-home instruction
Comm skill — Email cross-reference + tell retry guidance
Web skill — Decision tree for search vs research + tasks cross-reference
Events skill — Visual memory workflow documentation
Email skill — Always-execute-draft guidance (don't just describe, actually call the tool)

Key Findings & Recommendations

Finding 1: Skill+Bash Achieves Parity with Native Tools

On healthy infrastructure, the skill+bash architecture scored 42 great vs 41 great for native tools. The +1 difference is within noise. Both architectures are bottlenecked by the same infrastructure issues (accessory pipeline, background task workers, COPPA flows).

Finding 2: Dramatically Less Code, Zero State Machines

What was removed	Lines
Bucket classifier (34 match arms)	~200
Tool selection state (16 functions)	~300
37 of 39 tool provider structs	varies
Total dispatch reduction	265 lines (10%)
Nesting reduction	87 lines (12%)

Zero state machines means zero state machine bugs. The LLM handles dispatch through natural language understanding instead of hand-coded routing rules.

Finding 3: Capabilities in Markdown, Not Compiled Rust

Skill instructions live in .md files loaded at compile time via include_str!. This means:

Domain experts can author and iterate on instructions without touching Rust
Adding a capability is 3 files (incl. a markdown file) vs 5 files of Rust
Instructions can include examples, decision trees, common mistakes, and cross-references in natural prose

Finding 4: More Top-Level Skills Degrades LLM Performance

Early experiments with all 18 skills visible at the top level performed worse. The LLM struggled to choose among too many similar options. Progressive discovery via parent cross-references solved this — only ~8 skills are top-level, and the rest are discovered when the LLM loads a related skill.

Finding 5: Infrastructure Dominates Architecture Differences

The largest grade swings came from infrastructure, not architecture:

os-workers crash (run #30) broke all background task scenarios
Missing device token broke accessory scenarios across all runs
LLM rate limiting (429s) caused task failures
Previous runs with broken infra scored 19-27 great for both architectures

Recommendation: Fix infrastructure reliability before further architecture iteration. The ceiling for both approaches is the same.

Finding 6: Speed Trade-offs Are Scenario-Dependent

Skill+bash wins on single-command scenarios (memory search, web search, simple queries) — less overhead than loading filtered tool sets
Native tools win on multi-step scenarios (location protocols, email drafts) — no round-trip through CLI binary
Average duration is nearly identical (1m24s vs 1m25s)

Recommendations

Merge the skill+bash architecture — It achieves parity with less code and better maintainability
Keep top-level skill count at ~8 — Use progressive discovery for niche capabilities
Invest in infrastructure reliability — This is the binding constraint, not architecture
Consider hybrid approach — Frequently-used tools (say, web search) could be both a skill AND have a fast-path native handler to avoid CLI overhead on hot paths
Add skill-level telemetry — Track which skills are loaded, how often cross-references are followed, and where the LLM gets stuck

Report generated 2026-04-03. Data from os-qa runs #26, #27, #30 on kirsedona.

dakdevs/SKILL_ARCHITECTURE_REPORT.md

Select an option

No results found