Squad Charter: Eval-to-Action Loop -- Closing the loop between weekend compete runs and weekly engineering investments

Squad Name: Eval-to-Action Loop

Cycle / Dates: [TBD]

DRI (Eng): [TBD]

DRI (PM): Kay Venkatrajan

Workstream / Area: Agent Quality / Eval Infrastructure

Teams Channel / Alias: [TBD]


Mission

Build an automated pipeline that analyzes agent trajectories from weekend compete runs, identifies behavioral hotspots (loops, token waste, stalls), and surfaces them as ranked, code-linked GitHub issues -- so the CLI team can select 2-3 concrete investments each Monday and validate whether those changes moved the numbers by the following week.

Success means: by end of cycle, Monday eval meetings start with auto-generated, prioritized issues instead of raw logs, and at least one fix is validated through the before/after loop.
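
To make the loop concrete, here is a minimal sketch of the intended flow. All names (Hotspot, analyze_run, rank_hotspots, file_issues) and the repo path are illustrative placeholders, not the actual implementation; the real stages and interfaces are defined in the milestones below.

```python
"""Minimal end-to-end sketch of the eval-to-action loop (illustrative stubs only)."""
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class Hotspot:
    """One behavioral finding from a trajectory (field names are illustrative)."""
    kind: str              # e.g. "loop", "stall", "token_waste"
    task_id: str           # linked eval task, used later for before/after validation
    wasted_tokens: int     # crude impact estimate used for ranking
    evidence: str          # trajectory excerpt to embed in the issue


def analyze_run(run_dir: Path) -> List[Hotspot]:
    """Parse every trajectory in a weekend run and detect hotspots (stub; see Milestone 1)."""
    return []


def rank_hotspots(hotspots: List[Hotspot], top_n: int = 5) -> List[Hotspot]:
    """Rank findings by estimated impact so Monday triage sees the worst first."""
    return sorted(hotspots, key=lambda h: h.wasted_tokens, reverse=True)[:top_n]


def file_issues(hotspots: List[Hotspot], repo: str) -> None:
    """Turn ranked hotspots into code-linked GitHub issues (stub; see Milestone 2)."""
    for h in hotspots:
        print(f"[{repo}] would open issue: {h.kind} on task {h.task_id} ({h.wasted_tokens} tokens)")


if __name__ == "__main__":
    # Placeholder paths and repo name for illustration only.
    file_issues(rank_hotspots(analyze_run(Path("runs/latest"))), repo="org/agent-repo")
```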


Business Impact

Today, the weekly eval meeting is passive reporting -- scores go up or down, the team discusses possible causes, and then disperses to act on intuition. Insights decay between meetings, the same failure modes recur across weeks, and there is no institutional memory of what was tried and what worked.

This squad closes that gap by:

  • Reducing time-to-action: From "someone reads trajectory logs" to "auto-generated issues with candidate code locations"
  • Improving agent quality: Systematic detection of loops, prompt inflation, and tool routing failures that waste tokens and degrade eval scores
  • Building institutional memory: Week-over-week tracking of which fixes moved numbers, compounding improvement over time
  • Aligning engineering work to outcomes: Every weekly investment is tied to measurable eval metrics, not guesswork

Success Metrics

Primary:

  • Number of auto-generated GitHub issues from trajectory analysis per eval cycle (target: top 3-5 ranked issues per weekend run)
  • Percentage of Monday meeting agenda driven by auto-generated issues vs manual log reading (target: >80%)

Secondary:

  • Reduction in recurring hotspot types week-over-week (demonstrates fixes are landing)
  • Token savings from addressed hotspots (measured by before/after comparison on same eval tasks)
  • Time from "hotspot detected" to "PR opened" (target: <2 days)

Validation:

  • At least one fix shipped and validated through the before/after loop within the cycle
  • Monday eval meeting participants report the auto-generated issues are actionable (qualitative)

Scope

In Scope:

  • Harden the trajectory analysis prototype for batch processing across full weekend runs (CLI, Claude Code, Codex CLI x Sonnet 4.6, Opus 4.7, GPT 5.4)
  • Turn-by-turn conversation analysis: reconstruct agent reasoning, tool calls, prompts, and observations per step to surface exactly where and why the agent went wrong (a data-model sketch follows this list)
  • Integrate analysis into the compete pipeline (post-run automation)
  • Build hotspot interpreter: translate behavioral symptoms into search strategies for the coding agent
  • Auto-generate GitHub issues with evidence (trajectory excerpts, token counts), candidate code locations, and linked eval tasks
  • Integrate into coding agent pipeline: trigger issue creation automatically from the latest run's analysis output, so hotspots flow directly into the agent repo as actionable issues
  • Build before/after validation: compare same eval tasks across weeks to measure fix impact
  • Investigate Kusto/MSBench data availability (what's already captured, can we query it directly)
  • Investigate whether Copilot CLI has or can add OTel instrumentation at per-turn/per-tool-call level
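
As a sketch of what the turn-by-turn analysis and hotspot detection could look like, the snippet below assumes each trajectory can be flattened into an ordered list of turns; the real ATIF schema may differ, and the detection thresholds are placeholders to be tuned against real compete runs.

```python
"""Sketch of turn-level reconstruction and simple hotspot heuristics.

Assumes a trajectory reduces to an ordered list of turns, each with a tool call,
its arguments, and token counts; the actual ATIF layout may differ.
"""
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Turn:
    index: int
    reasoning: str           # model's visible reasoning for the step
    tool: Optional[str]      # tool invoked this turn, if any
    tool_args: str           # serialized arguments, used to spot repeats
    observation: str         # tool output / environment response
    prompt_tokens: int
    completion_tokens: int


def detect_loops(turns: List[Turn], window: int = 3) -> List[int]:
    """Flag turns that repeat the same tool call with the same arguments within a short window."""
    flagged: List[int] = []
    seen: Dict[Tuple[Optional[str], str], int] = {}
    for t in turns:
        key = (t.tool, t.tool_args)
        if t.tool and key in seen and t.index - seen[key] <= window:
            flagged.append(t.index)
        seen[key] = t.index
    return flagged


def detect_prompt_inflation(turns: List[Turn], growth_factor: float = 3.0) -> bool:
    """Flag trajectories whose prompt grows far beyond its starting size."""
    if len(turns) < 2 or turns[0].prompt_tokens == 0:
        return False
    return turns[-1].prompt_tokens / turns[0].prompt_tokens > growth_factor


def detect_stalls(turns: List[Turn], min_idle: int = 2) -> List[int]:
    """Flag runs of consecutive turns that invoke no tool at all."""
    flagged, idle = [], 0
    for t in turns:
        idle = idle + 1 if t.tool is None else 0
        if idle >= min_idle:
            flagged.append(t.index)
    return flagged
```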

Out of Scope:

  • Changing the compete run infrastructure itself (MSBench orchestration)
  • Building a full observability platform or dashboarding system
  • Cross-run aggregation and regression detection across weeks (future enhancement)
  • Shift-left: running MSBench from within the agent repo (future investigation)
  • Modifying the agent's core behavior -- this squad identifies and surfaces issues, the CLI team owns the fixes

Plan and Milestones

Milestone 1: Analysis engine production-ready

  • Harden prototype for batch processing (handle all trajectory formats, error resilience, parallel execution) -- see the sketch after this milestone
  • Turn-by-turn conversation analysis: per-step reconstruction of agent reasoning, tool calls, prompts, and observations to pinpoint where the agent went wrong
  • Validate against a full weekend's worth of compete runs
  • Integrate as a post-run step in the compete pipeline
  • Deliverable: automated analysis runs after every weekend compete run, producing per-trajectory reports with turn-level detail
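
A possible shape for the batch step, assuming one JSON trajectory file per task under the run directory; the paths, file layout, and report format are illustrative. The intent it shows is the Milestone 1 requirement: one bad file cannot sink the weekend batch, and per-trajectory reports are produced in parallel.

```python
"""Sketch of the batch analysis step (assumed file layout: one JSON trajectory per file)."""
import json
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path


def analyze_one(path: Path) -> dict:
    """Analyze a single trajectory file; never raise, so one bad file cannot fail the batch."""
    try:
        trajectory = json.loads(path.read_text())
        # Turn-by-turn reconstruction and hotspot detection would run here.
        return {"file": str(path), "turns": len(trajectory.get("turns", [])), "status": "ok"}
    except Exception as exc:  # resilience over strictness in batch mode
        return {"file": str(path), "status": "error", "error": str(exc)}


def analyze_run(run_dir: Path, out_dir: Path, workers: int = 8) -> None:
    """Fan out per-trajectory analysis and write one report per trajectory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(run_dir.glob("**/*.json"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(analyze_one, f): f for f in files}
        for fut in as_completed(futures):
            report = fut.result()
            name = Path(report["file"]).stem + ".report.json"
            (out_dir / name).write_text(json.dumps(report, indent=2))
```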

Milestone 2: Hotspot interpreter + coding agent pipeline integration

  • Build hotspot type --> search strategy mapping -- see the sketch after this milestone
  • Integrate with coding agent for candidate code identification
  • Build coding agent pipeline integration: auto-trigger issue creation from analysis output after each run
  • Auto-generate GitHub issues with evidence, candidate code, and linked eval tasks
  • Deliverable: top-N issues auto-created in the agent repo after each weekend run, triggered through the coding agent pipeline
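
One way the interpreter and issue creation could fit together, assuming the GitHub CLI (gh) is installed and authenticated; the search strategies, hotspot fields, and repo name are placeholders to be refined against real compete runs.

```python
"""Sketch of hotspot-to-search-strategy mapping and automated issue creation."""
import subprocess

# Behavioral symptom -> how the coding agent should search the agent repo for
# candidate code locations (illustrative mapping, to be tuned over real runs).
SEARCH_STRATEGIES = {
    "loop": "Find retry / tool-dispatch logic that could re-issue an identical tool call.",
    "token_waste": "Find prompt assembly code that appends context without truncation or dedup.",
    "stall": "Find turn-handling code paths where the agent can respond without acting.",
}


def build_issue_body(hotspot: dict, candidate_locations: list[str]) -> str:
    """Assemble the evidence-first issue body: excerpt, token cost, code pointers, eval task."""
    locations = "\n".join(f"- {loc}" for loc in candidate_locations) or "- (none found)"
    excerpt = "\n".join("    " + line for line in hotspot["evidence"].splitlines())
    return (
        f"Hotspot type: {hotspot['kind']}\n"
        f"Wasted tokens (estimate): {hotspot['wasted_tokens']}\n"
        f"Linked eval task: {hotspot['task_id']}\n\n"
        f"Trajectory excerpt (indented):\n{excerpt}\n\n"
        f"Candidate code locations:\n{locations}\n"
    )


def create_issue(title: str, body: str, repo: str = "org/agent-repo") -> None:
    """Open the issue in the agent repo via the GitHub CLI (repo name is a placeholder)."""
    subprocess.run(
        ["gh", "issue", "create", "--repo", repo, "--title", title, "--body", body],
        check=True,
    )
```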

Milestone 3: Closed-loop validation

  • Build before/after comparison on linked eval tasks -- see the sketch after this milestone
  • Track which issues were addressed and whether metrics improved
  • Deliverable: Monday meeting includes "last week's fixes: what moved" section
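
A sketch of the comparison step, assuming each weekly run can be summarized per eval task as a score plus total tokens; the actual metric names and values come from the compete pipeline output.

```python
"""Sketch of the before/after comparison over linked eval tasks."""
from typing import Dict


def compare_runs(before: Dict[str, dict], after: Dict[str, dict]) -> Dict[str, dict]:
    """Return per-task deltas for every eval task present in both weekly runs."""
    shared = before.keys() & after.keys()
    return {
        task: {
            "score_delta": after[task]["score"] - before[task]["score"],
            "token_delta": after[task]["tokens"] - before[task]["tokens"],
        }
        for task in sorted(shared)
    }


def what_moved(deltas: Dict[str, dict]) -> str:
    """Render the "last week's fixes: what moved" summary for the Monday meeting."""
    lines = [
        f"{task}: score {d['score_delta']:+.2f}, tokens {d['token_delta']:+d}"
        for task, d in deltas.items()
    ]
    return "\n".join(lines) or "No overlapping eval tasks between runs."
```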

Dependencies and Risks

Dependencies:

  • Access to MSBench compete run output (ATIF trajectory files) and pipeline integration points
  • Clarity on Kusto schema -- what data is already captured, what gaps exist
  • Coding agent pipeline access and permissions for automated issue creation in the agent repo
  • Coding agent availability for candidate code search (or alternative approach)
  • CLI team engagement to triage and act on generated issues

Risks:

  • Data availability: If Kusto tables only have summary-level data (not per-turn granularity), the analysis engine must process raw ATIF files, adding pipeline complexity
  • Pipeline integration: Coding agent pipeline may have constraints on automated issue creation (rate limits, permissions, repo access) that require coordination
  • Signal quality: Auto-generated issues may have high false-positive rate initially; needs tuning against real compete runs before the team trusts the output
  • Adoption: If Monday meeting workflow doesn't shift to using generated issues, the loop doesn't close -- requires PM + team buy-in
  • Scope creep: Aggregation, regression detection, and shift-left are compelling but out of scope for this cycle

Squad Composition

Name | Role | Discipline | Allocation | Notes
Kay Venkatrajan | Core | PM (DRI) | [TBD] | Design direction, prototyping, dependency management, stakeholder alignment, prioritization
[TBD] | Core | Engineering | [TBD] | Analysis engine, turn-by-turn analysis
[TBD] | Core | Engineering | [TBD] | Hotspot interpreter, issue generation
[TBD] | Core | Engineering (Infra) | [TBD] | Pipeline integration, coding agent pipeline integration, MSBench/compete pipeline, OTel integration, Kusto updates
[TBD] | Advisor | Engineering | [TBD] | MSBench/compete pipeline expertise
[TBD] | Advisor | Engineering | [TBD] | Copilot CLI architecture (OTel, agent internals)
[TBD] | Advisor | Engineering | [TBD] | Coding agent pipeline (issue creation, permissions, API surface)
  • Core: Full-time or near full-time ownership
  • Contributor: Partial allocation (~30-50%)
  • Advisor: Guidance only (<30%)

Definition of Done

The squad is successful when:

  1. Automated analysis runs after every weekend compete run -- no manual trigger needed, produces per-trajectory diagnostics, turn-by-turn analysis, and hotspot reports
  2. Top 3-5 ranked issues are auto-created in the agent repo via the coding agent pipeline each week, with evidence (trajectory data, token counts), candidate code locations, and linked eval tasks for validation
  3. Monday eval meeting uses generated issues as the primary agenda -- the team selects 2-3 investments from the ranked list, not from manual log reading
  4. At least one fix is validated through the before/after loop -- a PR is shipped, the same eval tasks are re-run, and the metric improvement (or lack thereof) is reported in the following Monday meeting
  5. Coding agent pipeline integration is operational -- issue creation flows automatically from analysis output to the agent repo without manual intervention
  6. Unknowns are resolved: Kusto data availability is documented, OTel instrumentation feasibility is assessed, and shift-left viability is investigated (with a recommendation, not necessarily implemented)

References
