@kvenkatrajan
kvenkatrajan / 0-summary.md
Last active May 20, 2026 03:20
OfficeBench Opus 4.7: Copilot CLI (25953648966) vs Claude Code (25953652156) -- 24 task comparative analysis

OfficeBench — Copilot CLI vs Claude Code (Claude Opus 4.7)

Benchmark: OfficeBench (300 tasks)
Model: Claude Opus 4.7
Run IDs: Copilot CLI 25953648966 vs Claude Code 25953652156
Date: 2026-05-19


Summary

@kvenkatrajan
kvenkatrajan / 0-summary.md
Last active May 20, 2026 00:37
OfficeBench Sonnet 4.6: Copilot CLI (25953648143) vs Claude Code (25953658705) -- 16 task comparative analysis with cache rate analysis

OfficeBench Failure Analysis: Copilot CLI vs Claude Code (Sonnet 4.6)

Run Details

| Field | Value |
| --- | --- |
| Copilot CLI Run | 25953648143 |
| Claude Code Run | 25953658705 |
| Model | claude-sonnet-4.6 (both runs) |
| Benchmark | OfficeBench (300 tasks) |
@kvenkatrajan
kvenkatrajan / 0-summary.md
Last active May 19, 2026 16:18
SWE-Bench Pro GPT-5.4: Copilot CLI (25953654333) vs Codex CLI (25953664039) -- 39 task comparative analysis

SWE-Bench Pro Failure Analysis: Copilot CLI vs Codex CLI (GPT-5.4)

Run Details

| Field | Value |
| --- | --- |
| Copilot CLI Run | 25953654333 |
| Codex CLI Run | 25953664039 |
| Model | gpt-5.4 (both runs) |
| Benchmark | SWE-Bench Pro (731 tasks) |
@kvenkatrajan
kvenkatrajan / 0-summary.md
Last active May 19, 2026 13:03
SWE-Bench Opus 4.7: Copilot CLI (25953624787) vs Claude Code (25953632656) — 17 improvement targets

SWE-Bench Failure Analysis: Copilot CLI vs Claude Code (Opus 4.7)

Run Details

| Field | Value |
| --- | --- |
| Copilot CLI Run | 25953624787 |
| Claude Code Run | 25953632656 |
| Model | claude-opus-4.7 (both runs) |
| Benchmark | SWE-Bench Verified (500 tasks) |
@kvenkatrajan
kvenkatrajan / 0-summary.md
Last active May 19, 2026 01:01
SWE-Bench Sonnet 4.6: Copilot CLI vs Claude Code — 19 task comparative analysis with turn-by-turn hotspots

SWE-Bench Failure Analysis: Copilot CLI vs Claude Code (Sonnet 4.6)

Run Details

| Field | Value |
| --- | --- |
| Copilot CLI Run | 25953625783 |
| Claude Code Run | 25953626217 |
| Model | claude-sonnet-4.6 (both runs) |
| Benchmark | SWE-Bench Verified (500 tasks) |
@kvenkatrajan
kvenkatrajan / 25940055958-ask-user-readable-enums.md
Last active May 18, 2026 13:09
Run Analysis: Trajectory hotspot analysis and suggested fixes from compete runs
@kvenkatrajan
kvenkatrajan / 25940055224-agent-idle-notification-arrives-when-not-read.md
Last active May 18, 2026 12:56
Run Analysis: Trajectory hotspot analysis and suggested fixes from compete runs
@kvenkatrajan
kvenkatrajan / _squad_charter.md
Last active May 12, 2026 13:00
Squad Charter: Eval-to-Action Loop -- Closing the loop between weekend compete runs and weekly engineering investments

Squad Charter: Eval-to-Action Loop -- Closing the loop between weekend compete runs and weekly engineering investments

@kvenkatrajan
kvenkatrajan / _eval_loop_gist.md
Last active May 13, 2026 17:13
From Passive Evals to Action-Driven Product Impact — Closing the loop between weekend compete runs and weekly engineering investments

How May We Turn Weekly Eval Reporting into Action-Driven Product Impact

The Problem

Every Monday, the team meets to review results from the weekend compete runs -- Copilot CLI, Claude Code, and Codex CLI evaluated across Sonnet 4.6, Opus 4.7, and GPT-5.4. Today, this meeting is largely passive reporting: scores go up, scores go down, the team discusses possible reasons, and then everyone disperses to work on whatever feels most promising.

What's missing is a closed loop:

Weekend Runs --> Identify Signals --> Select Investments --> Execute --> Validate
@kvenkatrajan
kvenkatrajan / agent-trajectory-analysis-design.md
Last active May 11, 2026 04:59
Agent Trajectory Analysis — Design Overview

Agent Trajectory Analysis — Design Overview

What is this?

A deterministic analysis pipeline that takes raw agent trajectory recordings (ATIF v1.6 format) and produces structured reports showing exactly what the agent did, how much it cost, and what went wrong.

ATIF trajectory.json → OTel spans → normalize → metrics + diagnostics + turns → reports → repo mapping → code fixes / sub-issues
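
As a rough illustration of the stage ordering above, the first few stages (spans, normalize, metrics, diagnostics) can be sketched as pure functions composed in sequence. This is only a sketch under stated assumptions: the `steps`, `tool`, `tokens`, and `error` keys and the `Span` type are hypothetical placeholders, not the ATIF v1.6 schema or real OTel span objects, and the actual pipeline may be organized quite differently.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for one span; not a real OTel type."""
    turn: int
    name: str
    tokens: int
    failed: bool


def to_spans(trajectory: dict[str, Any]) -> list[Span]:
    # Assumed input shape: {"steps": [{"tool": ..., "tokens": ..., "error": ...}]}.
    # These keys are invented for illustration only.
    return [
        Span(
            turn=i,
            name=str(step.get("tool", "unknown")),
            tokens=int(step.get("tokens", 0)),
            failed=bool(step.get("error")),
        )
        for i, step in enumerate(trajectory.get("steps", []))
    ]


def normalize(spans: list[Span]) -> list[Span]:
    # Sort by turn so later stages see a stable order.
    return sorted(spans, key=lambda s: s.turn)


def metrics(spans: list[Span]) -> dict[str, int]:
    return {
        "turns": len(spans),
        "total_tokens": sum(s.tokens for s in spans),
        "failed_turns": sum(1 for s in spans if s.failed),
    }


def diagnostics(spans: list[Span]) -> list[str]:
    return [f"turn {s.turn}: {s.name} failed" for s in spans if s.failed]


def analyze(trajectory: dict[str, Any]) -> dict[str, Any]:
    # No randomness, clocks, or I/O: the same input always yields the same report.
    spans = normalize(to_spans(trajectory))
    return {"metrics": metrics(spans), "diagnostics": diagnostics(spans)}


sample = {
    "steps": [
        {"tool": "bash", "tokens": 120},
        {"tool": "edit", "tokens": 80, "error": "patch rejected"},
    ]
}
report = analyze(sample)
```

Keeping each stage free of side effects is one simple way to get the determinism the overview describes: rerunning `analyze` on the same recording produces an identical report, which makes week-over-week comparisons meaningful.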