Rigorously evaluate the tether plugin through diverse tasks, tracking workflow fidelity, workspace quality, and friction points to inform its evolution.

Design Goals:
- Executable by any Claude Code session (portable)
- Can run full suite or subset in parallel
- Self-documenting: captures results for later evolution
- Results persist in `workspace/` for analysis
# 1. Navigate to the test harness directory for your task
cd /Users/cck/CC/plugins/marketplaces/crinzo-plugins/scratch/[TASK-ID]
# 2. Ensure workspace directory exists (tether creates files here)
mkdir -p workspace
# 3. Check workspace state
/tether:workspace
# 4. Run the test task
/tether:skills [task prompt from table]
# 5. After completion, score the task using the rubric
# 6. Document friction in the results table
Each task runs in its own isolated folder:
crinzo-plugins/
├── tether/ # The plugin being tested
├── scratch/ # Test harness root
│ ├── A1/
│ │ └── workspace/ # A1's isolated workspace
│ ├── A2/
│ │ └── workspace/ # A2's isolated workspace
│ ├── A3/
│ │ └── workspace/
│ ├── B1/
│ │ └── workspace/
│ ... (one folder per task)
│ ├── E2/
│ │ └── workspace/
│ └── EVAL_REPORT.md # Aggregate results
Why isolation matters:
- Each task gets clean workspace state
- Parallel sessions don't conflict
- Lineage tests (C1→C2) can still reference parent by copying workspace file
- Easy to compare results across tasks
# Create all task folders upfront
cd /Users/cck/CC/plugins/marketplaces/crinzo-plugins/scratch
mkdir -p A1/workspace A2/workspace A3/workspace
mkdir -p B1/workspace B2/workspace B3/workspace
mkdir -p C1/workspace C2/workspace
mkdir -p D1/workspace D2/workspace
mkdir -p E1/workspace E2/workspace
# From any scratch/[TASK]/ folder, verify tether is accessible
/tether:workspace # Should show empty workspace

For each test task, track:
- Pre-task: Workspace file count, active tasks
- Routing Decision: What Assess returned (full/direct/clarify)
- Workspace File Created: Yes/No, filename
- Path Quality: Clear data transformation? (1-5)
- Delta Quality: Truly minimal? (1-5)
- Thinking Traces Quality: Substantive findings? (1-5)
- Build Fidelity: Stayed on Path? Within Delta?
- Completion: File renamed correctly? Delivered filled?
- Friction Points: Where did workflow stall or add overhead?
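The pre-task fields above can be captured with a small shell helper run from a task folder. This is a minimal sketch, not part of tether: `pre_state` is a hypothetical name, and it assumes active tasks are the workspace files whose names end in `_active.md`.

```bash
# Hypothetical helper: summarize a task's workspace before invoking tether.
# ASSUMPTION: active tasks are the files whose names end in _active.md.
pre_state() {
  local ws="${1:-workspace}" total active
  # Count only markdown files directly inside the workspace folder.
  total=$(find "$ws" -maxdepth 1 -type f -name '*.md' 2>/dev/null | wc -l | tr -d ' ')
  active=$(find "$ws" -maxdepth 1 -type f -name '*_active.md' 2>/dev/null | wc -l | tr -d ' ')
  echo "files=$total active=$active"
}
```

Calling `pre_state workspace` before `/tether:skills` gives a baseline to compare against after the task finishes.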
| ID | Category | Prompt | Expected Route | Tests |
|---|---|---|---|---|
| A1 | Routing | "Add a debug log to the assess agent's routing logic" | direct | Trivial task skips workspace |
| A2 | Routing | "Add a new command that shows workspace file statistics" | full | Full workflow triggers |
| A3 | Routing | "Improve the plugin" | clarify | Ambiguous prompts halt |
| B1 | Quality | "Add validation to ensure Path and Delta exist before Build proceeds" | full | Exploration depth |
| B2 | Quality | "Add line numbers to Thinking Traces output" | full | Delta minimality |
| B3 | Quality | "Create a workspace file template generator" | full | Path clarity |
| C1 | Lineage | "Create a function that parses workspace filenames into components" | full | Parent task |
| C2 | Lineage | "Use the filename parser to show lineage tree" | full | Child inherits _from-NNN |
| D1 | Creep | "Add a configuration system for tether settings" | full | Creep-prone scope |
| D2 | Creep | "Add descriptive error messages to assess routing" | full | Subtle over-engineering |
| E1 | Edge | "Integrate workspace with a GraphQL API" | clarify/blocked | Impossible task handling |
| E2 | Edge | "Fix the typo 'returing' in assess.md" | direct | Minimal overhead test |
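The Expected Route column can be encoded so that the "matched expectation" check is mechanical. The sketch below mirrors the table above; the function names `expected_route` and `route_matches` are hypothetical, and treating `blocked` as an acceptable outcome for E1 is an assumption drawn from the `clarify/blocked` entry.

```bash
# Hypothetical lookup of the expected route(s) per task ID, taken from the task table.
expected_route() {
  case "$1" in
    A1|E2) echo "direct" ;;
    A3) echo "clarify" ;;
    E1) echo "clarify blocked" ;;   # ASSUMPTION: either outcome counts as a match
    A2|B1|B2|B3|C1|C2|D1|D2) echo "full" ;;
    *) return 1 ;;                  # unknown task ID
  esac
}

# Print Y if the observed route is one of the expected routes, otherwise N.
route_matches() {
  local id="$1" observed="$2" want
  for want in $(expected_route "$id"); do
    if [ "$observed" = "$want" ]; then
      echo Y
      return 0
    fi
  done
  echo N
}
```

For example, `route_matches A1 full` prints `N`, which would be recorded in the Match? column.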
/tether:skills Add a debug log to the assess agent's routing logic
Expected: Direct route (no workspace file)
Validates: Assess correctly identifies trivial, ephemeral tasks
Friction Check: If full route triggered, overhead is excessive
/tether:skills Add a new command that shows workspace file statistics
Expected: Full route → Anchor → Build
Validates: Three-phase orchestration works end-to-end
Friction Check: All phases execute in sequence, gate enforced
/tether:skills Improve the plugin
Expected: Clarify route (returns question)
Validates: Ambiguous requests don't blindly execute
Friction Check: Quality of clarifying question
/tether:skills Add validation to ensure Path and Delta exist before Build proceeds
Expected: Full route with substantive exploration
Validates: Anchor explores code, finds patterns
Scoring Focus: Thinking Traces quality (1-5)
/tether:skills Add line numbers to Thinking Traces output
Expected: Full route, tight Delta
Validates: Delta stays minimal (not "refactor traces system")
Scoring Focus: Delta quality (1-5)
/tether:skills Create a workspace file template generator
Expected: Full route, clear Path
Validates: Path describes actual data transformation
Scoring Focus: Path quality (1-5)
/tether:skills Create a function that parses workspace filenames into components
Expected: Full route, creates NNN_*_active.md
Validates: Workspace file created correctly
Note: Record NNN for C2
/tether:skills Use the filename parser to show lineage tree
Expected: Full route with _from-{C1-NNN} suffix
Validates: Lineage inheritance works
Scoring Focus: Did child read parent's Thinking Traces?
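When checking C2, it can help to split a workspace filename into its parts so the `_from-NNN` suffix is verified mechanically rather than by eye. The sketch below assumes the layout `NNN_slug_status[_from-NNN].md`. The source only confirms the `NNN_*_active.md` pattern, the `_complete`/`_blocked` renames, and the `_from-NNN` suffix, so the exact ordering and the helper name `parse_workspace_name` are assumptions.

```bash
# Hypothetical parser for workspace filenames.
# ASSUMED layout: NNN_slug_status[_from-NNN].md
parse_workspace_name() {
  local name parsed
  name=$(basename "$1")
  # Print the parsed fields only when the whole name matches the assumed layout.
  parsed=$(printf '%s\n' "$name" | sed -nE \
    's/^([0-9]{3})_(.+)_(active|complete|blocked)(_from-([0-9]{3}))?\.md$/seq=\1 slug=\2 status=\3 parent=\5/p')
  [ -n "$parsed" ] || return 1
  # A file without a _from-NNN suffix has no parent.
  case "$parsed" in
    *parent=) parsed="${parsed}none" ;;
  esac
  printf '%s\n' "$parsed"
}
```

For C2, the `parent` field should equal the `seq` recorded for C1's workspace file.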
/tether:skills Add a configuration system for tether settings
Expected: Full route, likely to creep
During Build: Invoke /tether:creep
Validates: Creep detection catches over-engineering
Watch For: "flexible", "extensible", multiple files touched
/tether:skills Add descriptive error messages to assess routing
Expected: Full route
Validates: Subtle creep (abstractions, "future-proof" code)
Scoring Focus: Did implementation exceed stated Delta?
/tether:skills Integrate workspace with a GraphQL API
Expected: Clarify (asks about API details) OR Blocked
Validates: Graceful handling of impossible/underspecified tasks
Friction Check: How does workflow handle inability to proceed?
/tether:skills Fix the typo 'returing' in assess.md
Expected: Direct route (single-line fix)
Validates: Trivial tasks don't incur workflow overhead
Friction Check: If full route, note excessive ceremony
┌─────────────────────────────────────────────────────────────┐
│ TASK: [ID] │
├─────────────────────────────────────────────────────────────┤
│ 1. PRE-STATE │
│ /tether:workspace │
│ Record: Active tasks, workspace file count │
├─────────────────────────────────────────────────────────────┤
│ 2. INVOKE │
│ /tether:skills [prompt from task table] │
├─────────────────────────────────────────────────────────────┤
│ 3. OBSERVE ROUTING │
│ - Route returned: [full/direct/clarify] │
│ - Matched expectation: [Y/N] │
│ - Reasoning quality: [1-5] │
├─────────────────────────────────────────────────────────────┤
│ 4. OBSERVE ANCHOR (if full route) │
│ - Workspace file: [filename or "none"] │
│ - Path clarity: [1-5] │
│ - Delta minimality: [1-5] │
│ - Traces substance: [1-5] │
├─────────────────────────────────────────────────────────────┤
│ 5. OBSERVE BUILD │
│ - Stayed on Path: [Y/N] │
│ - Within Delta: [Y/N] │
│ - Traces expanded: [Y/N] │
│ - Files touched: [list] │
├─────────────────────────────────────────────────────────────┤
│ 6. VERIFY COMPLETION │
│ - Delivered filled: [Y/N] │
│ - File renamed: [Y/N] to [_complete/_blocked] │
├─────────────────────────────────────────────────────────────┤
│ 7. CREEP CHECK (D tasks) │
│ /tether:creep │
│ - Creep detected: [Y/N] │
│ - Off Path items: [list or "none"] │
│ - Exceeds Delta: [list or "none"] │
├─────────────────────────────────────────────────────────────┤
│ 8. FRICTION LOG │
│ - Unnecessary overhead: [describe] │
│ - Missing steps: [describe] │
│ - Awkward UX: [describe] │
│ - Evolution idea: [describe] │
└─────────────────────────────────────────────────────────────┘
Tasks can be run in parallel by different sessions since each has its own isolated folder.

Independent Tasks (can run simultaneously):
- A1, A3, B2, B3, D2, E1, E2

Sequential Dependencies:
- C1 must complete before C2 (lineage test)
- After C1 completes, copy its workspace file to C2's workspace folder

Suggested Parallel Split:
Session 1: cd scratch/A1 && run, cd scratch/A2 && run, cd scratch/A3 && run, cd scratch/E2 && run
Session 2: cd scratch/B1 && run, cd scratch/B2 && run, cd scratch/B3 && run
Session 3: cd scratch/C1 && run, then copy workspace to C2, cd scratch/C2 && run
Session 4: cd scratch/D1 && run, cd scratch/D2 && run, cd scratch/E1 && run
Workspace Coordination (With Isolation):
- Each task has its own clean workspace folder
- No cross-contamination between tests
- For C2 lineage test, copy C1's workspace files before running C2:
  cp scratch/C1/workspace/*.md scratch/C2/workspace/
- Results collection: aggregate from all `scratch/*/workspace/` folders
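The C2 copy step can be guarded so it fails loudly if C1 has not produced a workspace file yet. A minimal sketch under that assumption; `seed_child_workspace` is a hypothetical helper, not a tether command.

```bash
# Hypothetical helper: copy a parent task's workspace files into a child task folder.
# usage: seed_child_workspace scratch/C1 scratch/C2
seed_child_workspace() {
  local parent_ws="$1/workspace" child_ws="$2/workspace" f copied=0
  mkdir -p "$child_ws"
  for f in "$parent_ws"/*.md; do
    [ -e "$f" ] || continue   # the glob matched nothing
    cp "$f" "$child_ws/" && copied=$((copied + 1))
  done
  if [ "$copied" -eq 0 ]; then
    echo "no workspace files in $parent_ws; run the parent task first" >&2
    return 1
  fi
  echo "copied $copied file(s) to $child_ws"
}
```

Running it before C2 keeps the lineage test from silently starting with an empty parent.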
| Score | Path Clarity | Delta Minimality | Traces Substance | Reasoning Quality |
|---|---|---|---|---|
| 5 | Crystal clear data transformation, obvious flow | Smallest possible change, nothing extraneous | Rich insights, file:line refs, decision rationale | Precisely matches task complexity |
| 4 | Clear transformation, minor ambiguity | Tight scope, 1-2 extra touches | Good findings, some references | Good match, minor quibble |
| 3 | Understandable but vague | Reasonable scope, some bloat | Basic exploration summary | Acceptable but imperfect |
| 2 | Unclear, requires interpretation | Noticeable scope creep | Formulaic, little value | Mismatched complexity |
| 1 | Missing or incomprehensible | Major scope explosion | Empty or useless | Wrong route entirely |
After running all tasks, fill in:
| ID | Route | Match? | Path | Delta | Traces | Build Fidelity | Complete? | Friction Notes |
|---|---|---|---|---|---|---|---|---|
| A1 | | | - | - | - | | | |
| A2 | | | | | | | | |
| A3 | | | - | - | - | | | |
| B1 | | | | | | | | |
| B2 | | | | | | | | |
| B3 | | | | | | | | |
| C1 | | | | | | | | |
| C2 | | | | | | | | |
| D1 | | | | | | | | |
| D2 | | | | | | | | |
| E1 | | | | | | | | |
| E2 | | | - | - | - | | | |

Legend: Route = full/direct/clarify, Match = Y/N, Scores = 1-5, Build Fidelity = Y/N, Complete = Y/N

| Metric | Target | Actual |
|---|---|---|
| Routing Accuracy | >80% | _/12 |
| Path Quality Avg | >3.5 | _/5 |
| Delta Quality Avg | >3.5 | _/5 |
| Traces Quality Avg | >3.5 | _/5 |
| Build Fidelity | >90% | _/N |
| Completion Rate | >80% | _/N |
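Routing accuracy and the three score averages can be computed from the filled results rows. The sketch below assumes the rows have been exported in a simplified pipe-separated form, `ID|Route|Match|Path|Delta|Traces|Fidelity|Complete`, with `-` for scores that do not apply. Both that format and the name `aggregate_metrics` are assumptions, not something tether produces.

```bash
# Hypothetical aggregator; reads rows on stdin in the ASSUMED form
#   ID|Route|Match|Path|Delta|Traces|Fidelity|Complete
aggregate_metrics() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    /^[ \t]*$/ { next }                      # skip blank lines
    {
      n++
      if (trim($3) == "Y") matched++
      for (c = 4; c <= 6; c++) {             # Path, Delta, Traces
        v = trim($c)
        if (v ~ /^[1-5]$/) { sum[c] += v; cnt[c]++ }   # "-" is ignored
      }
    }
    END {
      printf "routing=%d/%d (%.0f%%)\n", matched, n, (n ? 100 * matched / n : 0)
      split("path delta traces", label, " ")
      for (c = 4; c <= 6; c++)
        printf "%s_avg=%.2f\n", label[c - 3], (cnt[c] ? sum[c] / cnt[c] : 0)
    }'
}
```

Build Fidelity and Completion Rate are left out of the sketch; they could be counted the same way as the Match column.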
For each friction point discovered:
FRICTION: [short name]
Task: [ID where discovered]
What happened: [concrete description]
Expected: [what should have happened]
Impact: [minor/moderate/severe]
Evolution: [proposed fix]
Files to modify: [agent/command files]
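To keep friction entries consistent across parallel sessions, the template above could be appended to the report by a small helper. This is a sketch only: `log_friction` and its argument order are hypothetical, and it simply rejects impact values other than the three listed in the template.

```bash
# Hypothetical helper: append one friction entry to a report file.
# usage: log_friction REPORT NAME TASK WHAT EXPECTED IMPACT EVOLUTION FILES
log_friction() {
  local report="$1"
  case "$6" in
    minor|moderate|severe) ;;
    *) echo "impact must be minor, moderate, or severe" >&2; return 1 ;;
  esac
  {
    printf 'FRICTION: %s\n' "$2"
    printf 'Task: %s\n' "$3"
    printf 'What happened: %s\n' "$4"
    printf 'Expected: %s\n' "$5"
    printf 'Impact: %s\n' "$6"
    printf 'Evolution: %s\n' "$7"
    printf 'Files to modify: %s\n' "$8"
    printf '\n'
  } >> "$report"
}
```

Each session can call it with `scratch/EVAL_REPORT.md` as the first argument so all entries land in the aggregate report.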
After evaluation, produce:
- Evaluation Report (create at `scratch/EVAL_REPORT.md`)
  - Filled results capture table
  - Aggregate metrics comparison
  - Friction point catalog
- Evolution Backlog (append to report)
  - Prioritized list of improvements
  - Specific file changes with line references
  - Estimated impact: high/medium/low
- Workspace Archive
  - All workspace files in `scratch/*/workspace/` folders
  - Keep as test artifacts for regression testing
  - Query all: `ls scratch/*/workspace/*.md`
| File | Purpose | When to Review |
|---|---|---|
| `tether/agents/assess.md` | Routing logic | After A1-A3 tasks |
| `tether/agents/anchor.md` | Workspace creation | After B1-B3 tasks |
| `tether/agents/code-builder.md` | Implementation | After all Build phases |
| `tether/agents/tether-orchestrator.md` | Phase coordination | If phases don't sequence properly |
| `tether/commands/creep.md` | Creep detection | After D1-D2 tasks |
| `tether/commands/workspace.md` | State queries | If workspace queries fail |
| `tether/skills/SKILL.md` | Entry point | If invocation pattern is off |

Phase 1: Routing Validation
├── A1: Direct route test
├── A2: Full workflow test
└── A3: Clarify route test
Phase 2: Quality Assessment
├── B1: Exploration depth
├── B2: Delta minimality
└── B3: Path clarity
Phase 3: Lineage Testing (SEQUENTIAL)
├── C1: Create parent task
└── C2: Create child task (uses C1's NNN)
Phase 4: Creep Detection
├── D1: Obvious creep-prone
└── D2: Subtle creep
Phase 5: Edge Cases
├── E1: Impossible/blocked
└── E2: Minimal overhead
Phase 6: Analysis
├── Fill results table
├── Calculate aggregate metrics
├── Catalog all friction points
└── Write evaluation report
Phase 7: Evolution Planning
├── Prioritize friction by impact
├── Identify top 3-5 changes
└── Document specific file edits
Success criteria:
- Routing accuracy >80% (10+ correct routes)
- Path quality average >3.5
- Delta quality average >3.5
- Traces quality average >3.5
- Build fidelity >90%
- Lineage inheritance works (C2 references C1)
- Creep detection catches D1 over-engineering
- Trivial tasks (A1, E2) don't create workspace files

Warning signs:
- Trivial tasks trigger full workflow (excessive overhead)
- Workspace files are formulaic (low Traces scores)
- Creep slips through undetected
- Phases bleed into each other (agent lane violations)
- Gate not enforced (Build without Path/Delta)
- Manual intervention required to complete tasks
- Clarify route triggers on clear requests
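The criterion that trivial tasks (A1, E2) leave no workspace file behind can be checked directly from the harness folders. A minimal sketch; `check_trivial_tasks` is a hypothetical helper, and it assumes any `.md` file in those workspaces was created by tether during the run.

```bash
# Hypothetical check: the direct-route tasks should leave their workspaces empty.
# usage: check_trivial_tasks [path to scratch folder]
check_trivial_tasks() {
  local root="${1:-scratch}" id rc=0
  for id in A1 E2; do
    if [ -n "$(find "$root/$id/workspace" -maxdepth 1 -type f -name '*.md' 2>/dev/null)" ]; then
      echo "FAIL: $id created a workspace file"
      rc=1
    else
      echo "OK: $id left its workspace empty"
    fi
  done
  return "$rc"
}
```

A non-zero exit status here corresponds to the "trivial tasks trigger full workflow" warning sign.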
- Create test harness structure:
  cd /Users/cck/CC/plugins/marketplaces/crinzo-plugins/scratch
  mkdir -p A1/workspace A2/workspace A3/workspace
  mkdir -p B1/workspace B2/workspace B3/workspace
  mkdir -p C1/workspace C2/workspace
  mkdir -p D1/workspace D2/workspace
  mkdir -p E1/workspace E2/workspace
- Navigate to first test folder and verify plugin:
  cd /Users/cck/CC/plugins/marketplaces/crinzo-plugins/scratch/A1
  /tether:workspace # Should show empty workspace
- Begin with A1 (direct route test):
  /tether:skills Add a debug log to the assess agent's routing logic
- Document results using per-task protocol template
- Move to next task folder and repeat:
  cd ../A2
  /tether:skills Add a new command that shows workspace file statistics
- After all tasks complete, aggregate results:
  # Collect all workspace files for review
  ls scratch/*/workspace/*.md