Squad Charter: Eval-to-Action Loop -- Closing the loop between weekend compete runs and weekly engineering investments

Squad Name: Eval-to-Action Loop

Cycle / Dates: [TBD]

DRI (Eng): [TBD]

DRI (PM): Kay Venkatrajan

Workstream / Area: Agent Quality / Eval Infrastructure

Teams Channel / Alias: [TBD]


Mission

Build an automated pipeline that analyzes agent trajectories from weekend compete runs, identifies behavioral hotspots (loops, token waste, stalls), and surfaces them as ranked, code-linked GitHub issues -- so the CLI team can select 2-3 concrete investments each Monday and validate whether those changes moved the numbers by the following week.

Success means: by end of cycle, Monday eval meetings start with auto-generated, prioritized issues instead of raw logs, and at least one fix is validated through the before/after loop.
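
To make the loop concrete, here is a minimal sketch of the intended flow. All names (Hotspot, analyze_run, rank_hotspots, file_issues) and the repo path are illustrative placeholders, not the actual implementation; the real stages and interfaces are defined in the milestones below.

```python
"""Minimal end-to-end sketch of the eval-to-action loop (illustrative stubs only)."""
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class Hotspot:
    """One behavioral finding from a trajectory (field names are illustrative)."""
    kind: str              # e.g. "loop", "stall", "token_waste"
    task_id: str           # linked eval task, used later for before/after validation
    wasted_tokens: int     # crude impact estimate used for ranking
    evidence: str          # trajectory excerpt to embed in the issue


def analyze_run(run_dir: Path) -> List[Hotspot]:
    """Parse every trajectory in a weekend run and detect hotspots (stub; see Milestone 1)."""
    return []


def rank_hotspots(hotspots: List[Hotspot], top_n: int = 5) -> List[Hotspot]:
    """Rank findings by estimated impact so Monday triage sees the worst first."""
    return sorted(hotspots, key=lambda h: h.wasted_tokens, reverse=True)[:top_n]


def file_issues(hotspots: List[Hotspot], repo: str) -> None:
    """Turn ranked hotspots into code-linked GitHub issues (stub; see Milestone 2)."""
    for h in hotspots:
        print(f"[{repo}] would open issue: {h.kind} on task {h.task_id} ({h.wasted_tokens} tokens)")


if __name__ == "__main__":
    # Placeholder paths and repo name for illustration only.
    file_issues(rank_hotspots(analyze_run(Path("runs/latest"))), repo="org/agent-repo")
```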


Business Impact

Today, the weekly eval meeting is passive reporting -- scores go up or down, the team discusses possible causes, and then disperses to act on intuition. Insights decay between meetings, the same failure modes recur across weeks, and there is no institutional memory of what was tried and what worked.

This squad closes that gap by:

  • Reducing time-to-action: From "someone reads trajectory logs" to "auto-generated issues with candidate code locations"
  • Improving agent quality: Systematic detection of loops, prompt inflation, and tool routing failures that waste tokens and degrade eval scores
  • Building institutional memory: Week-over-week tracking of which fixes moved numbers, compounding improvement over time
  • Aligning engineering work to outcomes: Every weekly investment is tied to measurable eval metrics, not guesswork

Success Metrics

Primary:

  • Number of auto-generated GitHub issues from trajectory analysis per eval cycle (target: top 3-5 ranked issues per weekend run)
  • Percentage of Monday meeting agenda driven by auto-generated issues vs manual log reading (target: >80%)

Secondary:

  • Reduction in recurring hotspot types week-over-week (demonstrates fixes are landing)
  • Token savings from addressed hotspots (measured by before/after comparison on same eval tasks)
  • Time from "hotspot detected" to "PR opened" (target: <2 days)

Validation:

  • At least one fix shipped and validated through the before/after loop within the cycle
  • Monday eval meeting participants report the auto-generated issues are actionable (qualitative)

Scope

In Scope:

  • Harden the trajectory analysis prototype for batch processing across full weekend runs (CLI, Claude Code, Codex CLI x Sonnet 4.6, Opus 4.7, GPT 5.4)
  • Turn-by-turn conversation analysis: reconstruct agent reasoning, tool calls, prompts, and observations per step to surface exactly where and why the agent went wrong (a data-model sketch follows this list)
  • Integrate analysis into the compete pipeline (post-run automation)
  • Build hotspot interpreter: translate behavioral symptoms into search strategies for the coding agent
  • Auto-generate GitHub issues with evidence (trajectory excerpts, token counts), candidate code locations, and linked eval tasks
  • Integrate into coding agent pipeline: trigger issue creation automatically from the latest run's analysis output, so hotspots flow directly into the agent repo as actionable issues
  • Build before/after validation: compare same eval tasks across weeks to measure fix impact
  • Investigate Kusto/MSBench data availability (what's already captured, can we query it directly)
  • Investigate whether Copilot CLI has or can add OTel instrumentation at per-turn/per-tool-call level
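
As a sketch of what the turn-by-turn analysis and hotspot detection could look like, the snippet below assumes each trajectory can be flattened into an ordered list of turns; the real ATIF schema may differ, and the detection thresholds are placeholders to be tuned against real compete runs.

```python
"""Sketch of turn-level reconstruction and simple hotspot heuristics.

Assumes a trajectory reduces to an ordered list of turns, each with a tool call,
its arguments, and token counts; the actual ATIF layout may differ.
"""
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Turn:
    index: int
    reasoning: str           # model's visible reasoning for the step
    tool: Optional[str]      # tool invoked this turn, if any
    tool_args: str           # serialized arguments, used to spot repeats
    observation: str         # tool output / environment response
    prompt_tokens: int
    completion_tokens: int


def detect_loops(turns: List[Turn], window: int = 3) -> List[int]:
    """Flag turns that repeat the same tool call with the same arguments within a short window."""
    flagged: List[int] = []
    seen: Dict[Tuple[Optional[str], str], int] = {}
    for t in turns:
        key = (t.tool, t.tool_args)
        if t.tool and key in seen and t.index - seen[key] <= window:
            flagged.append(t.index)
        seen[key] = t.index
    return flagged


def detect_prompt_inflation(turns: List[Turn], growth_factor: float = 3.0) -> bool:
    """Flag trajectories whose prompt grows far beyond its starting size."""
    if len(turns) < 2 or turns[0].prompt_tokens == 0:
        return False
    return turns[-1].prompt_tokens / turns[0].prompt_tokens > growth_factor


def detect_stalls(turns: List[Turn], min_idle: int = 2) -> List[int]:
    """Flag runs of consecutive turns that invoke no tool at all."""
    flagged, idle = [], 0
    for t in turns:
        idle = idle + 1 if t.tool is None else 0
        if idle >= min_idle:
            flagged.append(t.index)
    return flagged
```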

Out of Scope:

  • Changing the compete run infrastructure itself (MSBench orchestration)
  • Building a full observability platform or dashboarding system
  • Cross-run aggregation and regression detection across weeks (future enhancement)
  • Shift-left: running MSBench from within the agent repo (future investigation)
  • Modifying the agent's core behavior -- this squad identifies and surfaces issues, the CLI team owns the fixes

Plan and Milestones

Milestone 1: Analysis engine production-ready

  • Harden prototype for batch processing (handle all trajectory formats, error resilience, parallel execution) -- see the sketch after this milestone
  • Turn-by-turn conversation analysis: per-step reconstruction of agent reasoning, tool calls, prompts, and observations to pinpoint where the agent went wrong
  • Validate against a full weekend's worth of compete runs
  • Integrate as a post-run step in the compete pipeline
  • Deliverable: automated analysis runs after every weekend compete run, producing per-trajectory reports with turn-level detail
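
A possible shape for the batch step, assuming one JSON trajectory file per task under the run directory; the paths, file layout, and report format are illustrative. The intent it shows is the Milestone 1 requirement: one bad file cannot sink the weekend batch, and per-trajectory reports are produced in parallel.

```python
"""Sketch of the batch analysis step (assumed file layout: one JSON trajectory per file)."""
import json
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path


def analyze_one(path: Path) -> dict:
    """Analyze a single trajectory file; never raise, so one bad file cannot fail the batch."""
    try:
        trajectory = json.loads(path.read_text())
        # Turn-by-turn reconstruction and hotspot detection would run here.
        return {"file": str(path), "turns": len(trajectory.get("turns", [])), "status": "ok"}
    except Exception as exc:  # resilience over strictness in batch mode
        return {"file": str(path), "status": "error", "error": str(exc)}


def analyze_run(run_dir: Path, out_dir: Path, workers: int = 8) -> None:
    """Fan out per-trajectory analysis and write one report per trajectory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(run_dir.glob("**/*.json"))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(analyze_one, f): f for f in files}
        for fut in as_completed(futures):
            report = fut.result()
            name = Path(report["file"]).stem + ".report.json"
            (out_dir / name).write_text(json.dumps(report, indent=2))
```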

Milestone 2: Hotspot interpreter + coding agent pipeline integration

  • Build hotspot type --> search strategy mapping -- see the sketch after this milestone
  • Integrate with coding agent for candidate code identification
  • Build coding agent pipeline integration: auto-trigger issue creation from analysis output after each run
  • Auto-generate GitHub issues with evidence, candidate code, and linked eval tasks
  • Deliverable: top-N issues auto-created in the agent repo after each weekend run, triggered through the coding agent pipeline
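
One way the interpreter and issue creation could fit together, assuming the GitHub CLI (gh) is installed and authenticated; the search strategies, hotspot fields, and repo name are placeholders to be refined against real compete runs.

```python
"""Sketch of hotspot-to-search-strategy mapping and automated issue creation."""
import subprocess

# Behavioral symptom -> how the coding agent should search the agent repo for
# candidate code locations (illustrative mapping, to be tuned over real runs).
SEARCH_STRATEGIES = {
    "loop": "Find retry / tool-dispatch logic that could re-issue an identical tool call.",
    "token_waste": "Find prompt assembly code that appends context without truncation or dedup.",
    "stall": "Find turn-handling code paths where the agent can respond without acting.",
}


def build_issue_body(hotspot: dict, candidate_locations: list[str]) -> str:
    """Assemble the evidence-first issue body: excerpt, token cost, code pointers, eval task."""
    locations = "\n".join(f"- {loc}" for loc in candidate_locations) or "- (none found)"
    excerpt = "\n".join("    " + line for line in hotspot["evidence"].splitlines())
    return (
        f"Hotspot type: {hotspot['kind']}\n"
        f"Wasted tokens (estimate): {hotspot['wasted_tokens']}\n"
        f"Linked eval task: {hotspot['task_id']}\n\n"
        f"Trajectory excerpt (indented):\n{excerpt}\n\n"
        f"Candidate code locations:\n{locations}\n"
    )


def create_issue(title: str, body: str, repo: str = "org/agent-repo") -> None:
    """Open the issue in the agent repo via the GitHub CLI (repo name is a placeholder)."""
    subprocess.run(
        ["gh", "issue", "create", "--repo", repo, "--title", title, "--body", body],
        check=True,
    )
```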

Milestone 3: Closed-loop validation

  • Build before/after comparison on linked eval tasks -- see the sketch after this milestone
  • Track which issues were addressed and whether metrics improved
  • Deliverable: Monday meeting includes "last week's fixes: what moved" section
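
A sketch of the comparison step, assuming each weekly run can be summarized per eval task as a score plus total tokens; the actual metric names and values come from the compete pipeline output.

```python
"""Sketch of the before/after comparison over linked eval tasks."""
from typing import Dict


def compare_runs(before: Dict[str, dict], after: Dict[str, dict]) -> Dict[str, dict]:
    """Return per-task deltas for every eval task present in both weekly runs."""
    shared = before.keys() & after.keys()
    return {
        task: {
            "score_delta": after[task]["score"] - before[task]["score"],
            "token_delta": after[task]["tokens"] - before[task]["tokens"],
        }
        for task in sorted(shared)
    }


def what_moved(deltas: Dict[str, dict]) -> str:
    """Render the "last week's fixes: what moved" summary for the Monday meeting."""
    lines = [
        f"{task}: score {d['score_delta']:+.2f}, tokens {d['token_delta']:+d}"
        for task, d in deltas.items()
    ]
    return "\n".join(lines) or "No overlapping eval tasks between runs."
```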

Dependencies and Risks

Dependencies:

  • Access to MSBench compete run output (ATIF trajectory files) and pipeline integration points
  • Clarity on Kusto schema -- what data is already captured, what gaps exist
  • Coding agent pipeline access and permissions for automated issue creation in the agent repo
  • Coding agent availability for candidate code search (or alternative approach)
  • CLI team engagement to triage and act on generated issues

Risks:

  • Data availability: If Kusto tables only have summary-level data (not per-turn granularity), the analysis engine must process raw ATIF files, adding pipeline complexity
  • Pipeline integration: Coding agent pipeline may have constraints on automated issue creation (rate limits, permissions, repo access) that require coordination
  • Signal quality: Auto-generated issues may have high false-positive rate initially; needs tuning against real compete runs before the team trusts the output
  • Adoption: If Monday meeting workflow doesn't shift to using generated issues, the loop doesn't close -- requires PM + team buy-in
  • Scope creep: Aggregation, regression detection, and shift-left are compelling but out of scope for this cycle

Squad Composition

Name | Role | Discipline | Allocation | Notes
Kay Venkatrajan | Core | PM (DRI) | [TBD] | Design direction, prototyping, dependency management, stakeholder alignment, prioritization
[TBD] | Core | Engineering | [TBD] | Analysis engine, turn-by-turn analysis
[TBD] | Core | Engineering | [TBD] | Hotspot interpreter, issue generation
[TBD] | Core | Engineering (Infra) | [TBD] | Pipeline integration, coding agent pipeline integration, MSBench/compete pipeline, OTel integration, Kusto updates
[TBD] | Advisor | Engineering | [TBD] | MSBench/compete pipeline expertise
[TBD] | Advisor | Engineering | [TBD] | Copilot CLI architecture (OTel, agent internals)
[TBD] | Advisor | Engineering | [TBD] | Coding agent pipeline (issue creation, permissions, API surface)
  • Core: Full-time or near full-time ownership
  • Contributor: Partial allocation (~30-50%)
  • Advisor: Guidance only (<30%)

Definition of Done

The squad is successful when:

  1. Automated analysis runs after every weekend compete run -- no manual trigger needed, produces per-trajectory diagnostics, turn-by-turn analysis, and hotspot reports
  2. Top 3-5 ranked issues are auto-created in the agent repo via the coding agent pipeline each week, with evidence (trajectory data, token counts), candidate code locations, and linked eval tasks for validation
  3. Monday eval meeting uses generated issues as the primary agenda -- the team selects 2-3 investments from the ranked list, not from manual log reading
  4. At least one fix is validated through the before/after loop -- a PR is shipped, the same eval tasks are re-run, and the metric improvement (or lack thereof) is reported in the following Monday meeting
  5. Coding agent pipeline integration is operational -- issue creation flows automatically from analysis output to the agent repo without manual intervention
  6. Unknowns are resolved: Kusto data availability is documented, OTel instrumentation feasibility is assessed, and shift-left viability is investigated (with a recommendation, not necessarily implemented)

References
