Practical ways to make the most of AI in your codebase

The mistake most engineers make in their first month with AI is thinking of it as code. It isn't. It's a teammate. And not even one teammate, because the best planner today is Opus, the best coder is Codex, and in two months that roster shifts. Tomorrow Sonnet 5.6 might be the best planner and Opus 4.8 might be the best coder. You're working with a rotating contractor pool whose members keep getting better at their jobs.

That changes what you optimize for. You don't build workflows that depend on a specific model. You build workflows that benefit from any model getting better, the way a good engineering org runs no matter who's playing which role.

Which means the right answer isn't "switch to Cursor" or "use Claude Code" or "buy this new wrapper." Those are implementation details. The right answer is the boring one: shift things left, write more rules and hooks, lean into the places where you cede control and be deliberate about where you take control back, and treat project plans as institutional memory for the why, the way you would for a junior engineer who's about to inherit your codebase. These are good engineering practices, compounded 100x by AI rather than reliant on it.

A startup engineer DM'd me a series of questions over a couple of weeks, asking how this looks in practice. They're the questions every engineer asks the first time they try to bring AI into their workflow, so I'm posting the answers in one place. If you're earlier in this and looking for what to do (not what to think about doing), this is for you.

The team behind these answers: two devs plus Claude shipping a React + Node + Postgres product (ResiDesk). The patterns aren't React-specific; they apply across stacks.

The single frame to hold while reading: separate what must be deterministic (hooks, types, lint, tests) from what benefits from judgement (taste, review priorities, design tradeoffs). Push enforcement into tooling so the LLM isn't the load-bearing layer. Push judgement into prompts so tooling isn't pretending to have taste.


Part 1: The five practical questions

1. How do you keep code consistent across multiple devs?

You lift conventions out of human memory into three tiers of shared assets, all readable by humans and agents.

| Tier | Artifact | What it does |
| --- | --- | --- |
| 1. Rules | .cursor/rules/*.mdc | Frontend, testing, typescript, version-control. Glob-scoped, auto-loads when files match |
| 2. Skills | .claude/skills/*/SKILL.md | Recipes: ship-pr, full-stack-testing, frontend-debug |
| 3. Hooks | .claude/hooks/*.sh | Pre-commit and PostToolUse scripts. Reject bad code before it lands |

Concrete examples on the React side:

  • check-no-dynamic-imports.sh blocks import() outside React.lazy
  • check-no-any-types.sh blocks any in TS files
  • check-imports-resolve.sh verifies every import path resolves
  • frontend.mdc declares PII tagging rules, theme tokens, the icon library, the LoadingShim component, useEffect anti-patterns

Both devs and Claude read the same rules before they edit. Update the .mdc once, the whole team benefits on the next commit. Every correction becomes a rule. Every rule prevents a class of future inconsistencies.
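
For a sense of how small these hooks are, here is the idea behind check-no-any-types sketched in TypeScript. The real hooks are plain bash scripts; the staged-file scan and the naive regex below are illustrative assumptions, not the repo's implementation.

```ts
// check-no-any-types.ts -- illustrative sketch only; the actual hook is a bash script.
// Scans staged .ts/.tsx files for explicit `any` and fails if any are found.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

const staged = execSync("git diff --cached --name-only --diff-filter=ACM", { encoding: "utf8" })
  .split("\n")
  .filter((f) => /\.tsx?$/.test(f));

const offenders: string[] = [];
for (const file of staged) {
  readFileSync(file, "utf8")
    .split("\n")
    .forEach((line, i) => {
      // Naive check: flag explicit `: any` annotations and `as any` casts.
      if (/:\s*any\b|\bas any\b/.test(line)) {
        offenders.push(`${file}:${i + 1}  ${line.trim()}`);
      }
    });
}

if (offenders.length > 0) {
  console.error("Explicit `any` is not allowed:\n" + offenders.join("\n"));
  process.exit(1); // a non-zero exit blocks the commit (or the agent's edit)
}
```

The same shape works whether the script runs as a Husky pre-commit step or as a PostToolUse hook during an agent session: read the changed files, grep for the banned pattern, exit non-zero.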

flowchart TD
    subgraph T1["Tier 1 - Cursor Rules (.cursor/rules/*.mdc)"]
        R1[frontend.mdc<br/>PII / theme / icons]
        R2[testing.mdc<br/>Jest + RTL]
        R3[typescript.mdc<br/>no any, strict]
        R4[root.mdc / architecture.mdc<br/>imports, scope discipline]
    end
    subgraph T2["Tier 2 - Skills (.claude/skills/SKILL.md)"]
        S1[ship-pr]
        S2[full-stack-testing]
        S3[web-design-guidelines]
        S4[frontend-debug]
    end
    subgraph T3["Tier 3 - Hooks (.claude/hooks/*.sh)"]
        H1[no-dynamic-imports]
        H2[no-any-types]
        H3[imports-resolve]
        H4[commit-atomicity]
    end
    OUT(["Dev A + Dev B + Claude<br/>same output shape"])
    T1 --> T2 --> T3 -->|enforced| OUT

2. How do you ensure code quality?

Layered gates, cheapest first. Nothing ships on vibes. Four gates:

  1. Author-time, under 1s. Cursor rules in editor context, TS strict, hooks on save.
  2. Pre-commit, under 10s. Atomicity, no-any, no-dynamic-import, import-resolve.
  3. Pre-push, under 60s. pnpm type-check and Jest.
  4. CI review agents, async. Domain-specific reviewers run on the diff. They read full file contents, not the diff alone. That catches stale deps, unused params, and API-surface traps invisible in a patch view.

ship-pr orchestrates these in order so devs run them with one command. Quality is a pipeline, not a code-review round.
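
ship-pr itself is a skill (a markdown recipe), not a script, but the ordering it enforces is easy to picture as a tiny fail-fast runner. A minimal sketch, assuming only the pnpm scripts named in this post (pnpm precommit, pnpm type-check, pnpm test); the real orchestration also runs git status/diff and the review agents.

```ts
// run-gates.ts -- illustrative sketch only; ship-pr is a skill, not this script.
// What it demonstrates: gates run cheapest-first, and the pipeline stops at the first failure.
import { execSync } from "node:child_process";

const gates: { name: string; cmd: string }[] = [
  { name: "pre-commit checks (atomicity, no-any, imports)", cmd: "pnpm precommit" },
  { name: "type-check", cmd: "pnpm type-check" },
  { name: "unit tests", cmd: "pnpm test" },
];

for (const gate of gates) {
  try {
    execSync(gate.cmd, { stdio: "inherit" });
    console.log(`PASS: ${gate.name}`);
  } catch {
    console.error(`FAIL: ${gate.name} -- fix this before the more expensive gates run`);
    process.exit(1);
  }
}
console.log("Deterministic gates passed; hand the diff to the review agents.");
```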

flowchart LR
    G1[Gate 1<br/>Author-time<br/>&lt; 1s<br/>Cursor rules + TS]
    G2[Gate 2<br/>Pre-commit<br/>&lt; 10s<br/>no-any, atomicity,<br/>imports]
    G3[Gate 3<br/>Pre-push<br/>&lt; 60s<br/>type-check + Jest]
    G4[Gate 4<br/>Review agents<br/>async<br/>domain + arch]
    G1 --> G2 --> G3 --> G4
    style G1 fill:#a5d8ff
    style G2 fill:#d0bfff
    style G3 fill:#fff3bf
    style G4 fill:#b2f2bb

3. How do you find and address tech debt?

Surfaced by agents during normal work. Tagged. Paid down in atomic commits.

Surfacing:

  • Reviewer passes emit findings in three buckets: must-fix, nice-to-fix, and API-surface traps. The third bucket is your debt backlog.
  • codebase-orientation, deslop, and refactor-safely skills run structured audits on demand.
  • Multi-week debt becomes a dated ProjectPlans/YYYY-MM-DD-*.md.

Addressing:

  • Rule: one logical change per commit, target under 300 lines and under 8 files.
  • check-commit-atomicity.sh warns on oversized or cross-cutting commits (sketched after this list).
  • Scope discipline rule in CLAUDE.md: do not fix pre-existing errors or unrelated code unless asked. No drive-by cleanups. Counterintuitive, but it's what lets you ship reviewable diffs continuously instead of giant refactor PRs that block the team.
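
A rough sketch of what the atomicity check does, written in TypeScript rather than the repo's bash. The 300-line / 8-file budget comes from the rule above; everything else here is an illustrative assumption.

```ts
// check-commit-atomicity.ts -- sketch of the idea behind check-commit-atomicity.sh (the real hook is bash).
// Warns when a staged commit exceeds the "one logical change" budget.
import { execSync } from "node:child_process";

const MAX_LINES = 300;
const MAX_FILES = 8;

// `--numstat` prints "<added>\t<deleted>\t<path>" per staged file.
const numstat = execSync("git diff --cached --numstat", { encoding: "utf8" })
  .trim()
  .split("\n")
  .filter(Boolean);

const files = numstat.length;
const lines = numstat.reduce((sum, row) => {
  const [added, deleted] = row.split("\t");
  // Binary files report "-"; count them as 0 changed lines.
  return sum + (Number(added) || 0) + (Number(deleted) || 0);
}, 0);

if (files > MAX_FILES || lines > MAX_LINES) {
  console.warn(
    `Commit touches ${files} files / ${lines} lines (budget: ${MAX_FILES} files / ${MAX_LINES} lines).\n` +
      "Consider splitting into smaller, atomic commits."
  );
  // Warn-only, matching the behavior described above; exiting non-zero here would make it blocking.
}
```
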
flowchart TB
    subgraph S1["1. Surface"]
        SA[Review agents flag<br/>API surface traps]
        SB[refactor-safely / deslop /<br/>codebase-orientation skills]
        SC[ProjectPlans/YYYY-MM-DD-*.md<br/>for multi-week debt]
    end
    subgraph S2["2. Tag and prioritize"]
        TA[Must-fix / Nice-to-fix / Trap]
        TB[Security &gt; data loss &gt; bug &gt; perf &gt; slop]
        TC[Max 5 issues per pass]
    end
    subgraph S3["3. Pay down continuously"]
        PA[Atomic commits<br/>&lt; 300 lines / &lt; 8 files]
        PB[Scope discipline<br/>no drive-by fixes]
        PC[Agent dispatch<br/>backlog as agent tasks]
    end
    S1 --> S2 --> S3

4. How do you review code beyond Claude Code Action?

Claude Code Action is one reviewer of several. The same pipeline runs locally before push. Domain specialists go deeper than a generic LLM pass.

The review stack, in order:

  1. Local self-review via the claude-code-review command. Same prompts as CI, runs on uncommitted and untracked files too.
  2. Domain-specialist subagents: frontend-reviewer, backend-reviewer, database-reviewer, jobs-reviewer, llm-reviewer, integration-reviewer. Each looks only at its directories and uses a specialized prompt.
  3. architecture-reviewer, always runs. Checks import direction, abstraction levels, dependency boundaries. The Jeff Dean test (more on this below).
  4. Human review on PR. By this point AI has cleared formatting, obvious bugs, unused deps, type errors, stale references. Humans focus on product intent, UX, edge cases, tradeoffs.
  5. pr-screenshot-review skill for any visual change. Screenshot in a real browser, annotate, upload to PR. No React change ships without visual evidence.

By the time a PR hits the lead dev, around 80% of nitpicks are resolved. Reviews become conversations about design.

flowchart LR
    DEV[Dev diff<br/>committed +<br/>untracked]
    S1[1. Local self-review<br/>claude-code-review]
    S2[2. Domain agents<br/>frontend / backend /<br/>db / jobs / llm]
    S3[3. Architecture<br/>reviewer]
    S4[4. Claude Code<br/>Action on PR]
    S5[5. pr-screenshot-review]
    S6[6. Human review on PR]
    SHIP([Ship])
    DEV --> S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> SHIP
    style DEV fill:#a5d8ff
    style SHIP fill:#73FBD3

5. Do you write unit and end-to-end tests?

Yes, both. Different mandates and different tools.

Unit tests (Jest):

  • Colocated next to source, plus Web/tests/
  • Target: domain logic and data shaping. Normalizers, filters, reducers, hooks with pure logic.
  • Mandate: every non-trivial utility ships with a test.
  • Run: pnpm test. Fast and deterministic.
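
To make the mandate concrete, a sketch with hypothetical names (normalizePhone is not from the repo): a small data-shaping utility and the colocated Jest test that ships with it.

```ts
// normalizePhone.ts -- hypothetical "data shaping" utility; names are illustrative, not from the repo.
export function normalizePhone(raw: string): string | null {
  const digits = raw.replace(/\D/g, "");
  return digits.length >= 10 ? digits : null;
}

// normalizePhone.test.ts -- colocated Jest test; runs with `pnpm test`, fast and deterministic.
import { normalizePhone } from "./normalizePhone";

describe("normalizePhone", () => {
  it("strips formatting down to bare digits", () => {
    expect(normalizePhone("(555) 010-4477")).toBe("5550104477");
  });

  it("rejects values that are not phone-shaped", () => {
    expect(normalizePhone("n/a")).toBeNull();
  });
});
```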

End-to-end tests (Playwright via agent-browser):

  • Stack: Playwright plus the agent-browser CLI, driven by full-stack-testing.
  • Target: golden-path user flows. Login, inbox, conversation, dashboard, brand-hub, demo-mode toggle.
  • Selectors: data-cy="...", stable contracts, not DOM traversal.
  • Login: read-only RESIDESK_AGENT_READONLY_PASSWORD agent.
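
A sketch of what a golden-path spec looks like under these conventions. The data-cy values and routes are hypothetical and a configured Playwright baseURL is assumed; the read-only password env var is the one named above.

```ts
// login.spec.ts -- illustrative sketch; selectors and routes are made up, the data-cy convention is not.
import { test, expect } from "@playwright/test";

test("golden path: read-only agent can log in and see the inbox", async ({ page }) => {
  await page.goto("/login");

  // Stable contracts: target data-cy attributes, never DOM structure.
  await page.locator('[data-cy="login-email"]').fill("agent@example.com");
  await page
    .locator('[data-cy="login-password"]')
    .fill(process.env.RESIDESK_AGENT_READONLY_PASSWORD ?? "");
  await page.locator('[data-cy="login-submit"]').click();

  await expect(page.locator('[data-cy="inbox-list"]')).toBeVisible();
});
```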

The unlock: agents can run the e2e suite themselves. When Claude changes a React component, it can launch agent-browser, exercise the flow, screenshot it, verify behavior, and only then ask a human to look. Tests stop being a human chore. They become a self-service capability for the agent.

flowchart TB
    subgraph U["Unit tests - Jest"]
        UT[Target: domain logic<br/>+ data shaping]
        UE[normalizers, filters,<br/>reducers, pure hooks]
        UL[Web/tests/ +<br/>colocated *.test.js]
        UR[pnpm test - fast,<br/>deterministic]
    end
    subgraph E["E2E tests - Playwright + agent-browser"]
        ET[Target: golden-path<br/>user flows]
        EE[login, inbox, conversation,<br/>dashboard, brand-hub]
        EL[Selectors: data-cy attrs<br/>stable contracts]
        ER[full-stack-testing skill]
    end
    NOTE[The unlock: agents run<br/>both suites themselves]
    U --> NOTE
    E --> NOTE

Part 2: The four follow-ups

6. Claude loses context and ignores rules. How do you fight that?

Stop fighting it. Push anything that must happen into deterministic tooling. Claude is non-deterministic by design. Treating its context as a reliable enforcement surface is the wrong mental model.

The split:

| Layer | Determinism | What lives here |
| --- | --- | --- |
| Agent context (CLAUDE.md, .cursor/rules/*.mdc, skills, subagents) | Best-effort | Taste, judgement, review priorities |
| Tooling (git hooks, husky, lint-staged, tsc, eslint, prettier, CI) | 100%, the commit fails or it doesn't | Style, types, import resolution, formatting, tests, atomicity |

Five concrete mitigations:

  1. Hooks over prose. Said "please don't do X" twice? Promote X to a hook. A bash script in .claude/hooks/ blocks during the session. Husky pre-commit blocks at commit time.
  2. Scoped rules via globs. .mdc files declare globs: Web/src/**/* so the frontend rule loads only when editing Web files.
  3. Subagents with fresh context. Reviewer subagents spawn clean every time. They don't inherit author-agent drift.
  4. Skills are invoked by name, not remembered. /ship-pr looks up the skill at runtime instead of trying to recall a 20-step protocol.
  5. Test the rule, don't trust it. If the rule says "buttons must never block on network," an e2e test clicks the button under a stalled network and asserts UI responds. Machine-verified reality beats remembered convention.
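
A sketch of that stalled-network test, assuming hypothetical data-cy values and a configured Playwright baseURL. The interception trick is a page.route handler that never resolves the request, so every API call hangs while the UI is exercised.

```ts
// stalled-network.spec.ts -- sketch of "test the rule, don't trust it"; selectors and routes are hypothetical.
import { test, expect } from "@playwright/test";

test("send button never blocks the UI when the network stalls", async ({ page }) => {
  // Stall every API call: intercept and never continue() or fulfill(), so requests hang forever.
  await page.route("**/api/**", () => {
    /* intentionally left pending */
  });

  await page.goto("/conversation/123");
  await page.locator('[data-cy="reply-input"]').fill("Following up on the work order.");
  await page.locator('[data-cy="send-button"]').click();

  // The rule under test: the UI stays responsive (shows a pending state, input stays usable)
  // even though the request will never resolve.
  await expect(page.locator('[data-cy="send-pending"]')).toBeVisible();
  await expect(page.locator('[data-cy="reply-input"]')).toBeEnabled();
});
```
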
flowchart LR
    subgraph L["Best-effort (Claude might drift)"]
        L1[CLAUDE.md global brief]
        L2[.cursor/rules/*.mdc<br/>scoped via globs]
        L3[Skills - opt-in,<br/>invoked by name]
        L4[Review subagents -<br/>fresh context each run]
        LU[Use for: taste, judgement,<br/>review priorities]
    end
    subgraph R["100% deterministic (commit fails)"]
        R1[.claude/hooks/*.sh<br/>session-time]
        R2[Husky pre-commit]
        R3[lint-staged]
        R4[ESLint / Prettier / CI]
        RU[Use for: style, types,<br/>imports, atomicity, tests]
    end
    L -.->|said it twice?<br/>promote to hook| R

7. Do you use ESLint, or only agents? Doesn't ship-pr replace git hooks?

Both. Deterministic tools at two layers, agents on top. ship-pr orchestrates the stack, doesn't replace it.

The stack:

| Layer | Trigger | Tools |
| --- | --- | --- |
| .claude/hooks/*.sh | PostToolUse during agent session | check-no-any-types, check-no-dynamic-imports, check-imports-resolve, check-commit-atomicity, check-portless-installed |
| Husky pre-commit | git commit | pnpm type-check, check-web-duplicate-files, check-no-js-in-converted-dirs, then pnpm precommit runs lint-staged |
| lint-staged (per file) | via husky | prettier --write, tsc --noEmit, check-imports, check-ts-patterns for .ts/.tsx; check-imports for .js/.jsx |
| ESLint and Prettier | editor save plus pre-commit | .eslintrc at root and at Web/.eslintrc.js, .prettierrc at root |
| CI | on push | full test suite, type-check, review agents |

Why both layers? Git hooks protect the repo so nothing bad reaches the remote. Claude hooks protect the session so the agent can't write the bad thing in the first place; you don't waste tokens fixing what a script would catch.

ship-pr runs git status/git diff, then layers A and B (deterministic), then domain review agents (semantic), then summarizes. Agents call into the same deterministic stack a human would use, then add a semantic pass that humans get too tired to do consistently.

flowchart TB
    subgraph A["Layer A - .claude/hooks/*.sh (session-time, protect agent)"]
        A1[check-no-any-types]
        A2[check-no-dynamic-imports]
        A3[check-imports-resolve]
        A4[check-commit-atomicity]
    end
    subgraph B["Layer B - Husky pre-commit + lint-staged (commit-time, protect repo)"]
        B1[pnpm type-check]
        B2[prettier --write]
        B3[eslint .eslintrc]
        B4[check-ts-patterns / imports]
    end
    subgraph C["Layer C - ship-pr skill (orchestration + semantic review)"]
        C1[git status + diff]
        C2[run layers A and B]
        C3[domain review agents]
        C4[ship / fixes / stop]
    end
    A --> B --> C

8. Do you commit project plans? Read them again? Even for small fixes?

Commit everything. The repo has 345 plans in ProjectPlans/ today. Agents are good at retrieving the plan relevant to the work at hand. Every plan you commit is an asset.

Diff captures the WHAT. Plan captures the WHY:

| Artifact | Captures |
| --- | --- |
| Commit diff and history | Every line, every rename, every deletion |
| Project plan | Constraints, alternatives considered, stakeholder asks, invariants you had to preserve |

Six months later, git log tells you the code changed. The plan tells you whether the reason is still valid. Without the plan, the next agent reverse-engineers intent from a diff and usually gets it wrong.

When to write a plan (create-project-plans skill):

  • Multi-step feature work: always.
  • Migrations or refactors touching more than 5 files: always.
  • Small bug fix: a one-paragraph plan in the commit body is fine.
  • One-line fix: skip. The commit is the plan.

Why this scales:

  • Plans are dated (YYYY-MM-DD-topic.md) so agents filter by recency when staleness matters.
  • They live under ProjectPlans/, not in CLAUDE.md, so they don't bloat always-on context. Retrieval is agentic. The agent greps when it needs to.

Plan for a future where a cloud agent picks up a Linear ticket and ships it solo. That agent needs the why. Without it written down, the work needs a human in the loop, which defeats the point.

flowchart TB
    subgraph PLAN["ProjectPlans/YYYY-MM-DD-*.md"]
        P1[Constraints / invariants]
        P2[Alternatives considered]
        P3[Stakeholder asks /<br/>deadlines]
    end
    subgraph GIT["Git log + diff"]
        G1[Every line that changed]
        G2[Renames + deletions]
        G3[Author + date +<br/>atomic message]
    end
    PLAN -->|WHY| AGENT
    GIT -->|WHAT| AGENT
    AGENT[Cloud agent picks up<br/>a Linear ticket and<br/>greps both]
    style PLAN fill:#e5dbff
    style GIT fill:#dbe4ff
    style AGENT fill:#b2f2bb

9. Do all improvements go into docs? Aren't you worried about doc bloat?

Add and prune. Docs are a crutch. Code is the truth.

The priority when you find something worth capturing:

  1. Make the code express it. A well-named function plus a test case documents behavior better than prose (see the sketch after this list).
  2. Make a hook enforce it. A pre-commit script never drifts from its own description.
  3. Put it in a scoped doc. Frontend-only rule? frontend.mdc with globs: Web/src/**/*. Loads only when relevant.
  4. Last resort: a global doc (CLAUDE.md, root.mdc). Expensive slot. Anything here pays a token cost on every turn. Prune aggressively.
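
A hedged illustration of the first rung, with hypothetical names: instead of a prose rule ("debounce search inputs by 300 ms"), a named constant and a test carry the convention, and the test fails the moment someone changes the behavior.

```ts
// searchDebounce.ts -- hypothetical example of rung 1: the convention lives in code, not prose.
// Rule being encoded: "search inputs wait 300 ms before firing a query."
export const SEARCH_DEBOUNCE_MS = 300;

export function debounce<T extends unknown[]>(fn: (...args: T) => void, waitMs = SEARCH_DEBOUNCE_MS) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// searchDebounce.test.ts -- the test is the documentation.
import { SEARCH_DEBOUNCE_MS, debounce } from "./searchDebounce";

jest.useFakeTimers();

test("search queries wait for the debounce window before firing", () => {
  const fire = jest.fn();
  const debounced = debounce(fire);

  debounced("unit 4B");
  jest.advanceTimersByTime(SEARCH_DEBOUNCE_MS - 1);
  expect(fire).not.toHaveBeenCalled();

  jest.advanceTimersByTime(1);
  expect(fire).toHaveBeenCalledWith("unit 4B");
});
```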

How you keep the context window healthy:

  • Glob-scoped rules. Of around 30 .cursor/rules/*.mdc files, a handful auto-load on any given task.
  • Skills are opt-in. .claude/skills/*/SKILL.md doesn't load until someone invokes the skill. Around 80 skills on disk; typical session loads 0–2.
  • Subagents have isolated contexts. The architecture reviewer never sees the frontend rule.
  • MEMORY.md is an index, not a memory. Each memory is its own file, loaded only when relevant.
  • Pruning is a first-class activity. Rule replaced by a hook? Delete the rule. Skill superseded? Delete the skill. deslop and refactor-safely run this pass periodically.

The failure mode you're worried about (an "effective code" handbook that grows until it no longer fits in anyone's context window) is real for teams that only add. Treat docs the way you treat feature flags. Every one has a cost. Delete the moment it stops earning.

flowchart LR
    S1[1. CODE<br/>named function +<br/>test case]
    S2[2. HOOK<br/>pre-commit script<br/>never drifts]
    S3[3. SCOPED DOC<br/>.mdc with glob<br/>loads when relevant]
    S4[4. GLOBAL DOC<br/>CLAUDE.md / root.mdc<br/>expensive slot - prune]
    S1 -->|if can't| S2
    S2 -->|if can't| S3
    S3 -->|last resort| S4
    style S1 fill:#b2f2bb
    style S2 fill:#a5d8ff
    style S3 fill:#d0bfff
    style S4 fill:#ffd8a8

Part 3: The architecture-reviewer subagent

What it is

It's not a skill. It's a Claude Code subagent. The distinction matters:

| | Skill (.claude/skills/) | Subagent (.claude/agents/) |
| --- | --- | --- |
| Trigger | User or agent invokes /skill-name | Main agent delegates via the Agent tool |
| Context | Shares the main agent's context | Fresh, isolated context every run |
| Tools | Inherits caller's tools | Scoped (we restrict to Read/Grep/Glob) |
| Use case | Recipes and workflows | Specialist roles called into service |
A subagent is a single markdown file with frontmatter. The system prompt is the markdown body. No build step, no plugin, no dependencies. Built in-house, not pulled from a community marketplace. The structure is the standard .claude/agents/*.md contract, so the pattern ports cleanly.

flowchart TB
    subgraph FM["Frontmatter (the contract)"]
        F1[name + description]
        F2[tools: Read / Grep / Glob]
        F3[model: haiku<br/>fast + cheap]
        F4[read-only by design]
    end
    subgraph BODY["Prompt body (system prompt is the markdown)"]
        B1[Jeff Dean test:<br/>Is this fundamentally<br/>WRONG?]
        B2[Flag ONLY:<br/>import direction, new deps,<br/>abstraction, approach mistakes]
        B3[Do NOT flag:<br/>naming, style, tests, micro-perf,<br/>error handling]
        B4[Grep-before-flag<br/>1. Verify pattern exists<br/>2. Skip if convention]
        B5[Severity gate<br/>1. Production impact?<br/>2. Code or business?<br/>3. Worth interrupt?]
    end
    NOTE[Most PRs return<br/>'No architecture issues.'<br/>That is the point.]
    FM --> BODY --> NOTE
    style FM fill:#dbe4ff
    style BODY fill:#e5dbff
    style NOTE fill:#fff3bf

The full file (.claude/agents/architecture-reviewer.md)

```markdown
---
name: architecture-reviewer
description: Cross-cutting architecture reviewer. Always runs on every PR.
  Checks import direction, dependency boundaries, abstraction levels, and
  fundamental approach, the Jeff Dean test.
tools: Read, Grep, Glob
model: haiku
---

You review PRs for cross-cutting architecture issues that domain-specific
reviewers miss. You receive the full PR diff across all directories.

Apply the Jeff Dean test: **"Is this fundamentally the wrong approach?"**
If the approach is reasonable, return "No architecture issues." Most PRs
will pass.

Flag ONLY:

**Import direction violations**
- Server/ importing directly from Database/ models instead of
  Database/Models/*/methods
- Frontend (Web/, WebInternal/) importing server-side code
- Cross-domain coupling: one PMS integration importing from another
  (e.g., Entrata/ importing from Yardi/)
- LLM consumers importing directly from AI providers instead of
  routing through LLM/

**New dependencies**
- New packages added to any package.json. Flag for visibility.
  Not necessarily a problem, but requires explicit approval per
  repo policy.

**Wrong abstraction level**
- Business logic in controllers (should be in services)
- SQL queries in route handlers or controllers (should be in
  Database/Models/*/methods)
- UI rendering logic in API handlers
- API calls in React components instead of hooks or services

**Fundamental approach mistakes**
- Polling where the system already has webhooks or event listeners
- Client-side computation that should be server-side (large data
  processing, aggregations)
- Synchronous blocking work in request handlers that should be queued
  (Queue/, ScheduledJobs/)
- Reimplementing existing utilities from helperLibrary/ or components/

## Before flagging an issue
- Use Grep to verify the pattern you're flagging. Check that the
  import or dependency you're concerned about actually exists in
  the diff.
- Check whether the pattern you consider wrong is used elsewhere in
  the codebase. If the "violation" is the established convention,
  do not flag it.

Do NOT flag: naming, code organization within a single domain, style,
testing gaps, performance micro-optimizations, missing error handling
(domain agents cover that), file structure within a directory.

Return a bullet list. If nothing significant, return "No architecture
issues."

## Severity gate, apply before reporting ANY finding
1. Would this cause a user-visible bug, data corruption, or security
   breach in production? If no, skip.
2. Is this a product/business decision rather than a code defect?
   If yes, skip. The author has domain context you lack.
3. Is the real-world impact proportional to the interruption cost?
   If the impact is negligible, do not report it.

When in doubt, do NOT flag. False positives erode trust faster than
missed minor issues cause harm.

## Follow-up review mode
When the orchestrator says this is a FOLLOW-UP review:
- You will receive a skip-list from comment-triage. Respect it
  absolutely:
  - RESOLVED/DISMISSED: Do NOT re-raise under any circumstances
  - OPEN: Check if fixed in new code; report "still_open" or
    "resolved"
- Review ONLY the incremental diff for new issues. Do NOT re-review
  unchanged code.
- After round 1, only report new security/data-loss issues; drop
  everything else.
- Format: [STILL_OPEN] F1: desc | [RESOLVED] F2: desc | [NEW] desc
```

Design choices worth cribbing

  1. Frontmatter is the contract. Four fields: name, description, tools, model. The description is what the orchestrator reads when deciding to delegate. Write it like a job posting, not a summary.

  2. Model: haiku, not sonnet or opus. Reviewers run on every PR, often multiple times. Haiku is around 10x cheaper and 3x faster. The job is narrow and structured. If Haiku misses something subtle, bump up. Start with Haiku.

  3. Tools restricted to read-only. Read, Grep, Glob. The reviewer can look at code, can't edit it, can't run commands. It can never "fix" something and introduce a new bug. Its output is always advisory. Copy this restriction. A reviewer that can edit is no longer a reviewer.

  4. The Jeff Dean test. The framing ("Is this fundamentally the wrong approach?") reorients the model toward judgement instead of pattern-matching. Most PRs pass. That's the point. "Most PRs will pass" is a load-bearing sentence in the prompt. Without it, the model feels obligated to find something.

  5. Positive AND negative spec. The prompt says what TO flag (four categories with concrete bullets) and what NOT TO flag. Non-optional. If you only tell the model what to flag, it flags everything because of the "I must be helpful" bias. The "Do NOT flag" section does as much work as the "Flag ONLY" section.

  6. Grep before flag. Two lines that prevent the most common failure mode: the model imagines a violation, writes it up confidently, and you have to check. Forcing the model to grep first eliminates around half of hallucinated findings.

  7. Severity gate. Three questions before reporting any finding:

    • Would this cause a user-visible bug, data corruption, or security breach?
    • Is this a product/business decision rather than a code defect?
    • Is the real-world impact proportional to the interruption cost?

    Codebase-agnostic. Copy verbatim. The third question is the one most teams forget.

  8. "False positives erode trust faster than missed minor issues." Worth keeping in the prompt. Models respond to stated principles even when they look redundant.

  9. Follow-up review mode with skip-list. The agent supports a "round 2 of review" mode where it gets findings already resolved or dismissed. Without this, every follow-up re-raises nits the author addressed. Pair with a comment-triage agent that parses PR history and produces the skip-list.

What to adapt for your codebase:

| Section | Change to |
| --- | --- |
| Import direction bullets | Your dir layout. Ours is Server/, Database/Models/*/methods, Web/, PMS integrations. Swap in yours. |
| Wrong abstraction bullets | Your controller/service/model split. If you don't use that layering, flag what you care about. |
| Fundamental approach bullets | Patterns you want to discourage in your stack. Polling vs webhooks is universal. Others are local. |
| tools: line | Keep as is. |
| model: line | Start with haiku. |
| Severity gate | Copy verbatim. |
| Follow-up mode | Copy verbatim. |

Community skills can give you the shape: severity gate, model choice, tools restriction. The list of "flag ONLY" rules has to be specific to how your code is organized. Otherwise the reviewer recites generic best-practice bullets that don't map to your repo.


Part 4: How we'd interview an AI-first staff engineer

The single question across the whole interview: are they a partner with AI, or a passenger? Anyone can paste a prompt and ship something. The bar at staff is whether they can drive AI through ambiguity, recognize when it's wrong, and recover quickly.

Track 1: Their passion project (around 45 min)

A take-home is gameable. A passion project is something they've already built (almost certainly with AI) and lived with long enough to know where it's brittle. They can't fake their way through it. They either understand the bones of what AI generated, or they don't.

  1. Brief tour, 5 min. Watch whether they describe it in product terms (what it does, who it's for, what they learned shipping it) or only in technical terms.
  2. Live debug, 15 min. Pick something you noticed during the tour, ask them to fix it with AI. Watch: do they reproduce first? Form a hypothesis before prompting? Read AI output critically or paste-and-pray?
  3. Add a feature, 15 min. Make something up on the spot, around 30 lines. Tests whether they can extend AI-written code they didn't fully internalize when it was generated. Lots of engineers ship the first version and then can't extend their own creation.
  4. Product taste, 10 min. "Two more weeks, where would you take this?" The AI-resistant part. Models can write code. They can't tell you which feature matters.

Track 2: Bug in unfamiliar code (around 45 min)

Staff engineers join unfamiliar codebases all the time. Can they navigate strange code with AI as a partner, not a crutch?

Hand them either a real bug from your codebase or a synthetic one. Real is better, since it shows you how they'd onboard.

What you watch:

  • First moves. Do they orient before prompting? ls, git log, README, search for the symptom. Or do they immediately ask AI "what does this codebase do?" The latter is a tell.
  • Loop tightness. How fast is hypothesis to verify to revise? Junior+AI: one prompt, accept it. Staff+AI: hypothesis, ask AI to verify a specific claim, iterate. The grain size is much smaller.
  • When stuck. They will get stuck. Do they re-orient (re-read code, run a print, simplify) or prompt-spam variations of the same question?
  • When done. Do they verify the fix addresses root cause, or only that the symptom went away?
flowchart LR
    subgraph T1["Track 1 - Passion project (~45m)"]
        L1[1. Brief tour 5m<br/>product OR<br/>only technical?]
        L2[2. Live debug 15m<br/>reproduce? hypothesize?<br/>read AI critically?]
        L3[3. Add feature 15m<br/>extend AI-written<br/>code?]
        L4[4. Product taste 10m<br/>where would you<br/>take this next?]
    end
    subgraph T2["Track 2 - Bug in unfamiliar code (~45m)"]
        R1[First moves<br/>orient or<br/>prompt-spam?]
        R2[Loop tightness<br/>hypothesis grain<br/>size]
        R3[When stuck<br/>re-orient or<br/>spam?]
        R4[When done<br/>root cause +<br/>test?]
    end
    SHARE[Screen-shared end to end -<br/>prompts ARE 80% of the signal]
    T1 --> SHARE
    T2 --> SHARE
    style T1 fill:#dbe4ff
    style T2 fill:#e5dbff
    style SHARE fill:#b2f2bb

Green flags, ordered by weight

  1. Resilient under failure. When AI hallucinates a function, generates broken code, or solves the wrong problem, they don't blame the model. They diagnose why the prompt led the model astray and adjust. Strongest signal.
  2. Prompts with context, not vibes. They paste relevant code, the error, the constraints. Specific ask: "this function returns undefined when X is null; the caller passes Y; constrain the fix to the function body." Not "fix this."
  3. Thinks out loud BEFORE prompting. Narrates hypotheses, then prompts. Means they're using AI to verify thinking, not generate it.
  4. Reads AI output critically. Notices when the model "fixed" something by adding a try/catch that swallows the error. Notices wrong imports.
  5. Knows when to abandon a thread. AI is good at sucking you into local maxima. After around 3 failed iterations, a staff candidate steps back, closes the chat, re-reads the code.
  6. Articulates tradeoffs. Says why they didn't pick the alternatives. Pushes back on the model's first answer.
  7. Product taste in the extension question. Opinionated, grounded in real signal (user behavior, what's painful in their own use).

Red flags

  • Blames the model. ("AI was being dumb today.")
  • Can't say no to AI suggestions. Accepts every diff.
  • Misses hallucinations. Calls a function from the AI's response that doesn't exist.
  • Locked into one model or one prompt style.
  • Slow to test. Writes 200 lines without running anything.
  • Can't work without AI. Briefly take AI away ("imagine your IDE has no Copilot for the next 5 minutes; describe the fix in plain English"). If they freeze, that's a signal.
  • Won't ship. Endless investigation, no fix.
flowchart LR
    subgraph G["GREEN - hire signals"]
        G1[Resilient under model failure<br/>strongest signal]
        G2[Prompts with context, not vibes]
        G3[Thinks out loud BEFORE prompting]
        G4[Reads AI output critically]
        G5[Knows when to abandon a thread]
        G6[Articulates tradeoffs]
        G7[Product taste in extension]
    end
    subgraph R["RED - pass signals"]
        R1[Blames the model]
        R2[Cannot say no to AI suggestions]
        R3[Misses hallucinations]
        R4[Locked into one model / style]
        R5[Slow to test]
        R6[Cannot work without AI]
        R7[Will not ship]
    end
    style G fill:#d3f9d8
    style R fill:#ffc9c9

Schedule

| Time | Activity |
| --- | --- |
| 0:00–0:05 | Intro and scope |
| 0:05–0:50 | Passion project: tour, debug, extend, taste |
| 0:50–0:55 | Reset |
| 0:55–1:35 | Unfamiliar-codebase bug |
| 1:35–1:50 | Their questions, debrief |

Just under two hours total. Insist on screen sharing the entire time. The prompts they write are 80% of the signal.


Where this approach struggles

Three places it doesn't pay off:

  • Solo devs. Most of the leverage comes from review agents catching what a human author misses. Without a second perspective in the loop, you get the rules + hooks layer but not the review layer. Still useful, less compounding.
  • No test infrastructure. The e2e-via-agent unlock assumes the agent has something to run. If you don't have Jest or Playwright wired up, an agent can write code but can't verify the behavior, so the safety story falls apart.
  • First two weeks. Writing rules, hooks, and skills is upfront cost. Expect to feel slower for the first sprint or two. The compounding kicks in around week three, when the rules start catching repeats automatically and you stop re-explaining the same conventions.

If you're a solo dev on a fresh codebase, start with hooks + a single CLAUDE.md. Skip the review agents and skills until you have either a teammate or enough volume to make them earn their keep.


TL;DR

| Question | One-liner |
| --- | --- |
| Consistency across devs | Rules + skills + hooks. Conventions as code, not tribal knowledge. |
| Code quality | Four layered gates: author-time, pre-commit, pre-push, review agents. |
| Tech debt | Surfaced continuously by review agents, paid down in atomic commits. |
| Code review | Domain specialist subagents locally and on CI. Humans focus on design. |
| Tests | Jest for logic, Playwright for flows, both runnable by agents. |
| Context drift | Hooks for deterministic, prompts for judgement. Promote rules to hooks. |
| ESLint vs agents | Both. Two layers of hooks (Claude + git). ship-pr orchestrates. |
| Project plans | Commit everything. WHY in plans, WHAT in git. Plan for cloud agents to read both. |
| Doc bloat | Add AND prune. Docs are a crutch, code is the truth. Scope via globs. |
| Architecture-reviewer | Custom subagent. Single markdown file. Read-only tools. Haiku model. Severity gate copy-paste. |
| AI-first interview | Two tracks, screen-shared. Passion project + unfamiliar bug. Resilience under failure is the strongest green flag. |

The single sentence: build workflows that benefit from any model getting better, not workflows that lean on a specific one. Separate deterministic from judgement, push each into the layer that handles it best, and treat your tooling (Cursor, Claude Code, your wrapper of the month) as a swappable implementation detail. The roster will keep rotating. Your foundation shouldn't.
