| name | code-review |
|---|---|
| description | Use when you want a comprehensive code review of current branch changes. This skill runs GPT-5.5 (high) via the OpenAI API, synthesizes its feedback, then immediately fixes all valid issues and iterates until clean. If a related PR or GitHub issue is already in conversation context, the skill will pass it to the reviewer for intent. |
Review current branch changes with GPT-5.5 (high) via the OpenAI API, synthesize feedback, fix all valid issues, and iterate until clean.
- Check conversation context for a related PR/issue and grab its details if present
- Run GPT-5.5 review as a
run_in_backgroundtask with codex underscript(pseudo-terminal), plus a Monitor watchdog for startup (see Step 1) - Audit what codex touched — codex can edit files; reconcile with the user's intended changeset before synthesizing
- Synthesize feedback — evaluate each issue for validity
- Fix all valid issues immediately
- Run the project's CI-equivalent verification yourself (Step 4) — codex's own "tests passed" line is not enough; codex tends to scope tests to the changed package while CI runs the whole tree.
- Re-run review — iterate until only nitpicks remain (do NOT stage changes)
Do NOT ask the user for a PR or issue number. Instead, scan the existing conversation for an already-mentioned reference to the work being reviewed:
- A GitHub URL the user pasted (
https://github.com/<org>/<repo>/issues/1234,/pull/5157) - A bare issue/PR reference the user typed (
#1234,PR 5157,issue 1234) - A branch name that matches an open PR (you'd typically know this because the user said "let's tackle X" earlier and you opened/checked out that branch)
If you find one, fetch it once before invoking codex:
- PR:
gh pr view <num> --json title,body,url - Issue:
gh issue view <num> --json title,body,labels,url
Condense title + body + URL into a short <context>...</context> block (under ~600 chars — extract the actual goal/acceptance criteria; don't dump huge bodies) and prepend it to the codex prompt so the reviewer understands the intent of the work.
If you don't find one, skip this step entirely. Don't ask, don't speculate, just review what's in the diff.
If the user typed free-text after /code-review (e.g. "focus on the caching logic"), prepend that too — it's an explicit reviewer instruction.
The scope is decided by Codex per the rules baked into the prompt below. Two buckets, FIRST non-empty wins:
- Working-tree + staged (union of
git diff --staged --name-only,git diff --name-only, andgit ls-files --others --exclude-standard) — anything in flight: staged-for-commit, modified-but-unstaged tracked files, and untracked files, all reviewed together. The split between staged and unstaged is irrelevant to review scope; if there's WIP, it all goes in. - Branch vs origin/master (
git diff --name-only origin/master...HEAD) — falls back to the committed work on this branch only when bucket 1 is empty (clean working tree).
Never use git diff origin/master (two-dot) for scope — it pulls in unrelated local changes to files not touched by this branch.
Before invoking codex, hash every file in the in-flight set so you can detect anything codex modifies on disk in Step 1.5.
Hash file contents, not git status --porcelain. Status codes encode the staged/unstaged split, which the user changes routinely as they review hunks. A git add flips MM → M without touching disk content; if the baseline is git status output, a normal user staging action looks identical to a codex edit and triggers a false alarm. Disk content hashes are immune — git add never touches the working copy.
Use a per-session temp path (don't hard-code /tmp/code-review-baseline.txt — two parallel sessions would clobber each other). Note: BSD mktemp (macOS) requires the Xs to be the last characters in the template — no .txt/.out suffix after them — so use bare XXXXXXXX patterns for portability:
BASELINE_FILE=${BASELINE_FILE:-$(mktemp /tmp/code-review-baseline.XXXXXXXX)}
# In-flight set = staged + modified-unstaged + untracked. Same selection
# codex itself uses for review scope.
{ git diff --staged --name-only
git diff --name-only
git ls-files --others --exclude-standard
} | sort -u | while read -r f; do
[ -f "$f" ] && shasum -a 256 "$f"
done > "$BASELINE_FILE"Write the prompt to a file first, then run codex with stdin closed and output redirected to a known file. Both of those matter:
- Closing stdin (
< /dev/null) avoids the failure mode where codex printsReading additional input from stdin...and then waits forever for EOF that never comes. Without< /dev/nullit can hang indefinitely with zero output. - Inlining a long prompt as a CLI arg often loses pieces to shell escaping (especially
$, backticks, smart quotes). Writing to a file and passing"$(cat /tmp/codex-prompt.txt)"is reliable.
Use unique per-session file paths — never bare $OUT_FILE or /tmp/codex-prompt.txt. Two parallel /code-review invocations (different Claude sessions, different worktrees, or different rounds within the same session) will otherwise race and corrupt each other's output. Use mktemp (or a fixed-prefix with a per-session id) for both files, and remember them across rounds.
# The Bash tool does NOT persist shell variables between calls — a var set in
# one call is gone in the next, and `${VAR:-$(mktemp ...)}` re-mktemps a NEW
# file every call. So choose FIXED LITERAL paths once per /code-review
# invocation (session-unique via a slug you pick, so parallel runs don't
# collide) and write those literals verbatim into EVERY call below (setup,
# launch, watchdog, read). Example slug: the branch name or a short random id.
PROMPT_FILE=/tmp/codex-prompt-REVIEWSLUG # <- replace REVIEWSLUG everywhere
OUT_FILE=/tmp/codex-review-REVIEWSLUG
echo "PROMPT_FILE=$PROMPT_FILE"
echo "OUT_FILE=$OUT_FILE"
cat > "$PROMPT_FILE" <<'PROMPT'
<CONTEXT_BLOCK_IF_ANY>
/review against latest origin master. Only review files that belong to this branch. Determine the in-scope file set using TWO-BUCKET precedence — pick the FIRST non-empty bucket:
1. In-flight changes: union of 'git diff --staged --name-only' (staged), 'git diff --name-only' (modified tracked, unstaged), and 'git ls-files --others --exclude-standard' (untracked). If this union is non-empty, review ALL of these together — don't split staged vs unstaged.
2. Branch vs master: 'git diff --name-only origin/master...HEAD'. Use this only when bucket 1 is empty (clean working tree).
Do NOT use 'git diff origin/master' (two-dot) — it includes unrelated local changes.
VERIFICATION — KEEP IT NARROW
Your verification scope is the **directly-touched packages only** — the Go packages or TS workspaces containing the files in the in-scope set. Do NOT run the full CI-equivalent target across the repo. The wrapper agent owns the broader CI-scoped run; if you duplicate that work you waste time and pull in unrelated flakes (toolchain skew, transient DB-state tests in untouched subsystems, broad-tree race detectors) that have nothing to do with the diff and aren't yours to debug.
How to derive the narrow target:
- **Go**: take every changed `*.go` file, map it to its package (`go list -e <file>` or directory walk), then run `go test -race -short -timeout=5m <packages>` on that set. One package usually; rarely more than two or three.
- **TypeScript**: the workspace(s) owning the changed files — e.g. `pnpm check iap`, `pnpm check signature`. Do NOT run `pnpm check ts-all`.
- **Other tooling**: targeted equivalents (one test file, one binary) — not the whole suite.
If you need to run a quick sanity check against another package to validate a finding (e.g. you suspect a reverse dep), name it explicitly in your findings — but the default is "touched packages only."
If those narrow tests fail, investigate and fix. Per AGENTS.md "No 'Pre-existing' Errors Exist," a failure in a package you DID touch is yours, even if it's a test you didn't directly modify. A failure in a package you DIDN'T touch — surface it as a finding ("ran broader X to validate Y, hit unrelated failure Z") and let the wrapper decide; don't chase it.
State the exact command you ran AND the scope (touched packages list) in your Findings output so the wrapper agent can layer the broader CI-scoped run on top.
PROMPT
# Run this as a SEPARATE Bash call with the tool's run_in_background:true (the
# prompt file above is written in its own prior foreground call). `script -q`
# gives codex a pseudo-terminal so it survives being detached; output lands in
# $OUT_FILE with \r line endings (strip them on read: tr -d '\r' < $OUT_FILE).
# Substitute the literal /tmp/...REVIEWSLUG paths — shell vars don't persist
# between calls.
: > "$OUT_FILE" # truncate so this round starts clean
script -q "$OUT_FILE" codex --dangerously-bypass-approvals-and-sandbox --model gpt-5.5 --config model_reasoning_effort="high" exec "$(cat "$PROMPT_FILE")" < /dev/null<CONTEXT_BLOCK_IF_ANY> is the <context>...</context> block from Step 0 if the user passed a PR/issue number; omit the placeholder entirely otherwise. Free-text instructions from the user also go before the /review line.
Why script + run_in_background is the one combination that works (macOS). codex needs a controlling TTY — with no terminal it aborts immediately at 0 bytes (exit code 144). Every detachment strips the TTY:
- Bare foreground call (no
&, just a long Bash call): works only until the harness auto-backgrounds it mid-flight — which it does at some unpredictable point on long runs — at which moment codex loses its terminal and dies (0 bytes / exit 144). It's nondeterministic: short rounds return inline and survive; long ones get backgrounded and die. Do NOT rely on it. - Shell
&: the orphan is reaped with the launching shell's process group → dies at 0 bytes. run_in_background: trueWITHOUTscript: detached, no TTY → 0 bytes / exit 144.setsid/nohup/disown:setsidstrips the controlling terminal (codex aborts); the others don't save the orphan from reaping.
script -q "$OUT_FILE" codex … allocates a pseudo-terminal that survives detachment, so launching it with the Bash tool's run_in_background: true runs codex to completion as a tracked task. This is the verified-reliable recipe — use it for every round.
(Optional belt-and-suspenders: set BASH_DEFAULT_TIMEOUT_MS/BASH_MAX_TIMEOUT_MS to 600000 in settings.json so foreground commands have headroom — but it does not stop auto-backgrounding, so it is not a substitute for the script recipe.)
Because the launch is a background task, don't wait idly for the completion notification to discover a 0-byte death — that can take many minutes. Right after launching, arm a Monitor on $OUT_FILE so you learn within ~50s whether codex actually started:
# Monitor command — emits ONE line then exits: "started" once real output
# appears, or "stuck" if still ≲300 bytes after 60s (the dead-at-startup
# signature). Use the literal $OUT_FILE path.
deadline=$(($(date +%s)+60))
while true; do
cur=$(stat -f%z "$OUT_FILE" 2>/dev/null || echo 0)
if [ "$cur" -gt 300 ]; then echo "codex-started: $cur bytes"; exit 0; fi
if [ "$(date +%s)" -ge "$deadline" ]; then echo "stuck: $cur bytes after 60s"; exit 1; fi
sleep 5
doneIf the watchdog reports stuck, codex died at startup. Stop the task (TaskStop, or kill the exact launcher PID — never a -f command-line pattern, which matches your own cleanup command and the launcher shell), then relaunch once with the same script + run_in_background recipe. Retry at most once; if it stalls again, surface the failure to the user. If it reports started, keep working and read $OUT_FILE when the background-task completion notification arrives. Slow growth (multi-minute pauses while codex reads files) is normal once it's past the header.
Reading the output: script records terminal \r, so always read via tr -d '\r' < "$OUT_FILE" (e.g. tr -d '\r' < "$OUT_FILE" | tail -c 4000 for the findings block) — a raw read shows doubled/garbled lines.
If you ever do need to kill a codex process, kill it by its exact PID — never by a -f command-line pattern (pkill -f "codex.*gpt-5.5"). The pattern matches the launcher's own shell and the cleanup command itself, so it kills the very process you meant to keep.
Codex runs with --dangerously-bypass-approvals-and-sandbox so it can run tests and verify findings — useful, but it also means codex CAN edit files. After codex returns, before synthesizing findings:
-
Re-hash the in-flight set and diff against the baseline content hashes. Files whose hash changed had their disk content modified — that's codex (or you, if you happened to edit something during the run, which is rare). Files whose hash matches were not touched, even if their
git statuscode changed (the user stages hunks as they review; that's normal and not a codex signal).AFTER_FILE=$(mktemp /tmp/code-review-after.XXXXXXXX) { git diff --staged --name-only git diff --name-only git ls-files --others --exclude-standard } | sort -u | while read -r f; do [ -f "$f" ] && shasum -a 256 "$f" done > "$AFTER_FILE" # Files with a different hash, OR present in only one of the two lists, # are codex-touched. Compare on the second column (the path) to also # catch new files codex added. diff "$BASELINE_FILE" "$AFTER_FILE"
-
Specifically check
git diff -- '*_test.go'against any test files codex modified. Codex's most failure-prone pattern is writing a test that asserts behavior requiring a production-code edit codex also wrote — accepting one without the other (or both viagit add -u) ships failing CI. The test and the prod change must tell the same story. -
For each codex-touched file, run
git diff <file>and classify:- Drive-by fix to a real bug it spotted while reviewing — usually welcome; surface it to the user as part of synthesis so they can keep it
- Hallucinated additions (constants/models that don't exist, types referencing non-existent APIs) — revert
- Speculative refactor codex thinks improves the code — surface to the user; let them decide
-
NEVER run
git add -u,git add ., or any wildcard staging during the review. That launders codex's edits into the user's reviewed staged set and destroys hunk-level review. Stage explicit paths only, and only when the user asks. (The user themselves staging hunks as they review is fine — they know what they're staging. The hash-based detection above is immune to it.) -
Surface every codex-touched file in your synthesis with a one-line classification per file. The user decides per-file whether to keep or revert.
Present findings, but critically evaluate each issue before acting:
- Is the issue valid? Does it describe a real problem, not a hypothetical one?
- Is the fix correct? Would the suggested approach actually improve things?
- Discard invalid feedback. Reviewers sometimes hallucinate issues or misread code.
- Do NOT skip valid issues because they seem "small." If a fix is valid and correct, it gets done.
The first instinct when an issue is "out of scope" or "structural" is to slap a TODO(...) comment on it and move on. Resist that instinct. A deferred finding is still a finding the reviewer will surface again next round — and if you can't justify the deferral to a fresh-eyes pass, you couldn't justify it to a code-reviewer either.
Defer only when ALL of these hold:
- The fix genuinely doesn't fit the branch's scope (different system, infrastructure rewrite, or needs a separate design decision).
- There's a concrete tracking artifact — a filed GitHub issue, a referenced ticket, a named follow-up branch. "I'll do it later" is not tracking.
- The in-code comment names the issue, names the fix, names the tracking artifact, and explains why this branch isn't the place. A bare
TODO(...)line isn't enough.
If you can't satisfy all three — fix it now.
Present synthesis in this format:
## Code Review Summary
### Valid Issues
[Issues that are real and should be fixed]
### Discarded
[Issues that were evaluated and found invalid, with brief reason]
### Codex side-effect edits
[Per-file classification from Step 1.5: `path/to/file` — keep / revert / user-decides, with one-sentence reason. Omit this section if codex didn't edit anything.]Immediately implement all fixes. Do not ask for permission — the user invoked the review to get things fixed.
After fixing, run build checks to verify nothing broke.
This is the wrapper's job, not codex's. Codex is instructed (Step 1's VERIFICATION block) to test ONLY the directly-touched packages — that keeps its run fast and avoids flakes in unrelated subsystems. The broader CI-scoped run lives here: you (the wrapper) layer it on top to catch reverse-dep breakage that the narrow run can't see.
If codex reports its narrow-scope tests passed, that's expected and useful — but it's NOT a substitute for the full CI-scoped run. A clean codex review combined with a too-narrow test run gives false confidence — the PR opens, CI catches breakage in a reverse dep, and the user has to debug it under time pressure.
Run the project's CI-equivalent verification yourself. Source the command from .github/workflows/*.yml, AGENTS.md / CLAUDE.md essential-commands sections, etc.
Source scope from the workflow's paths: / paths-ignore: filter, not just its run: step. A workflow that runs go test ./go/... but is filtered to paths: ['go/**', '!go/lambda/**', '!go/cmd/**'] has an effective scope that excludes lambda and cmd — running ./go/... raw locally surfaces failures CI would never see. The filter is part of the verification command. (See Step 1's VERIFICATION block for the worked example.) Same logic for TS/JS workflows scoped to specific workspaces.
Apply that scope: enumerate the touched packages plus their in-paths: reverse deps, or use go list to filter ./go/... against the workflow's allowed paths. Do not run packages CI's filter excludes — failures there are out of scope and not yours to debug.
When a failure appears in a package this branch didn't touch (but IS in scope per the workflow filter), the FIRST diagnostic step is to rebase onto the latest base branch (usually origin/master or origin/main). Master is enforced green, so two common signals show up:
- Master moved on while you were working. A recently-landed commit on master already fixes the failure — rebasing makes it go away with zero diagnostic time spent.
- Master moved on and the new behavior conflicts with the branch — rebasing surfaces this as a real conflict to resolve, instead of as a confusing CI failure later.
Either way, rebase first, debug second. Use the git-rebase skill — it handles the safety backup, conflict-aware resolution, and post-rebase verification. After rebasing, re-run the test command (in the same scoped target as before). Only escalate to package-level debugging (re-generate codegen, pnpm clean, go mod tidy, inspect the failing test) when the failure survives a fresh rebase.
If the failure does survive AND it's in an in-scope package: per AGENTS.md "No 'Pre-existing' Errors Exist" — fix it in this branch. Do not check master to see if it's also red. Do not skip the failing test. Do not push and let CI catch it.
After applying fixes:
- Re-hash the in-flight set into
$BASELINE_FILEbefore each round (same hashing snippet as Step 1's "Capture the pre-codex baseline"), so the next audit detects edits codex makes in the upcoming round, not in the one you just finished. - Re-run the Codex review — relaunch with the same
script+run_in_backgroundrecipe + Monitor watchdog (Step 1). Truncate$OUT_FILE(: > "$OUT_FILE") first so this round's output starts clean. - Re-audit what codex touched (Step 1.5).
- Synthesize and fix again.
- Re-run Step 4 (CI-equivalent verification) — every iteration must end with a clean full-tree test run, not just a clean review.
If codex re-raises an issue you addressed in a previous round, that is almost always one of:
- You didn't actually fix it. Re-read your diff against the finding — sometimes the "fix" was a comment that didn't change behavior. The reviewer is right; fix it for real.
- You deferred it and the deferral isn't visible to the reviewer. A bare
TODO(...)comment doesn't justify a P1 finding away. Either fix it in this branch or make the deferral self-documenting (severity assessment, named follow-up artifact, explanation of why this branch isn't the place). - The reviewer disagrees with your deferral severity. This means YOUR severity assessment was wrong, not the reviewer's. If a P1 keeps coming back, treat it as a P1 and fix it.
Re-raised findings are NEVER a reason to stop iterating. They are a reason to look harder at what you skipped.
Stop iterating only when:
- The reviewer surfaces no new issues AND doesn't re-raise old ones.
- Remaining feedback is subjective nitpicks (style preferences, naming bikesheds — not correctness, not perm gaps, not race windows).
Typically 2-3 rounds is sufficient — but if codex keeps surfacing the same thing on round 4, the answer is to actually fix it, not to declare victory.
If the OpenAI API call fails:
- Report the error
- Do not retry more than once
- Ask the user how to proceed