This research investigates how engineers can use AI coding agents (Claude Code, Cursor, and similar tools) to systematically track promotion evidence, with specific focus on the Senior-to-Staff (P3-to-P4) transition and neurodivergent-friendly design.
Four findings cut across all subtasks:
- The claim-evidence-impact chain is universal. Every promotion framework, from Julia Evans' brag doc to Dropbox's six-dimension rubric, converges on the same structure: state a claim, back it with concrete evidence, quantify impact. The formats differ; the underlying pattern does not.
- Existing tools solve the wrong half of the problem. Brag doc generators and engineering analytics platforms capture code artifacts well (commits, PRs, cycle time) but fundamentally miss the qualitative signal that Staff-level promotions require: architecture influence, mentoring, cross-team coordination, problem selection. The "what" is automated; the "why" is not.
- Capture must be a side effect of work, not a separate act. This is simultaneously the strongest finding from the neurodivergent research (executive dysfunction makes deliberate journaling unsustainable) and the strongest pattern in AI agent meta-workflows (session tracking hooks, automated standup generation). Systems that require the user to remember, initiate, and write from scratch fail. Systems that extract evidence from artifacts the user already produces succeed.
- No public system implements the full pipeline. Despite an extensive search, no published tool connects session capture to daily rollups to weekly summaries to quarterly reviews to promotion packets. Individual components exist (session tracking, brag doc generators, weekly digests). The vault pipeline's tiered aggregation architecture is genuinely novel.
Every promotion framework studied shares the same atomic unit of evidence. The terminology varies (STAR method, person-verb-task-impact, three-part claim validation), but the structure is identical:
| Component | What it answers | Example |
|---|---|---|
| Claim | "What level behavior am I demonstrating?" | "I operate at Staff level in technical leadership" |
| Evidence | "What specifically did I do?" | "Designed and drove migration architecture across 3 teams" |
| Impact | "What measurable outcome resulted?" | "Reduced deploy failures by 40%" |
Evidence quality follows a hierarchy from weakest to strongest:
- "I worked on X" (activity)
- "I delivered X" (output)
- "I delivered X which resulted in Y" (output + outcome)
- "I identified the need for X, designed the approach, drove execution across N teams, resulting in Y" (agency + scope + outcome)
The last form is what Staff-level packets require. It captures not just what happened but the judgment and influence that caused it to happen.
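To make the target concrete, here is a minimal sketch of what a single evidence record could look like when it carries the full chain. The `EvidenceRecord` type, its field names, and the sample values are illustrative choices, not taken from any cited framework:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceRecord:
    """One atomic unit of promotion evidence (claim -> evidence -> impact).

    Illustrative sketch: field names are not from any cited framework.
    """
    claim: str                 # level behavior being demonstrated
    evidence: str              # what specifically was done
    impact: str                # measurable outcome
    agency: str = ""           # how the work was identified and initiated (Staff signal)
    scope: list[str] = field(default_factory=list)      # teams or orgs touched
    artifacts: list[str] = field(default_factory=list)  # links to PRs, docs, incidents

# Strongest-form example from the hierarchy above (values are placeholders)
record = EvidenceRecord(
    claim="Operates at Staff level in technical leadership",
    evidence="Designed and drove migration architecture across 3 teams",
    impact="Reduced deploy failures by 40%",
    agency="Identified the migration need from recurring incident reviews",
    scope=["platform", "payments", "infra"],
)
```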
Sources: Julia Evans' brag document format, Will Larson's StaffEng promotion packets, John Ogden's promotion case format, the ByteByteGo Tech Promotion Algorithm.
Across all rubric frameworks studied (Dropbox, Etsy, GitLab, Square, Carta, Box, Levels.fyi), the Senior-to-Staff transition requires a consistent shift:
| Dimension | Senior (P3) | Staff (P4) |
|---|---|---|
| Scope | Single team, single project | Multi-team, multi-project |
| Ambiguity | Well-defined problems | Self-identified problems, novel solutions |
| Impact | Feature/project outcomes | Organizational or company-wide outcomes |
| Leadership | Mentoring individuals | Setting technical direction for teams |
| Design | Component/feature design | System architecture, cross-cutting concerns |
| Communication | Clear within team | Persuasive across org, written artifacts that influence decisions |
| Evidence duration | 6-month track record | 12+ month sustained track record |
The two most common failure modes for Staff packets: (1) listing features shipped without an organizational impact narrative, and (2) presenting evidence of excellent Senior work rather than Staff-level scope and ambiguity.
This has a direct implication for automated evidence capture: git-based tools that count PRs and lines changed will produce evidence that looks like Senior work (output). Staff evidence requires capturing the reasoning, influence, and organizational context around that output.
Sources: Levels.fyi Standard SWE Level Framework, Carta's Impact-Sphere Model, Dropbox Core Responsibilities, Etsy Engineering Career Ladder, Rafael Cepeda's promotion coaching, progression.fyi (100+ public frameworks).
Promotion evidence formats cluster into three complementary tiers:
Capture formats (raw accumulation): Julia Evans' brag doc, Gergely Orosz's work log, Bragdocs.com. These optimize for regular recording of accomplishments. They solve the "forgetting what you did" problem but don't organize evidence against leveling criteria.
Packet formats (decision-ready narrative): Will Larson's promotion packet, John Ogden's promotion case, Kanishk Agrawal's promotion document. These structure evidence into the format a promotion committee expects: projects, impact, mentorship, glue work, gap analysis, advocacy. They solve the "organizing your case" problem but assume evidence already exists.
Rubric frameworks (evaluation dimensions): Dropbox 6-dim, Etsy 5-dim, Box 7-dim, GitLab 3-pillar, Square 2-section. These define what evidence must cover. They solve the "what counts" problem but provide no mechanism for capture.
An effective system needs all three: continuous capture feeding into a structured packet organized against a rubric. No single tool does this today.
The tooling landscape splits into three tiers, none of which solve the full problem:
Tier 1: Brag doc generators. BragDoc.ai (open source, git-to-LLM pipeline), Brag AI (GitHub activity summarizer), Reflect (lookback-based GitHub contribution reports). These extract from git commits and PRs, use LLMs to generate "achievements," and produce formatted documents. They handle code artifacts well.
Tier 2: Enterprise engineering analytics. Jellyfish, LinearB, Swarmia, Hatica, Waydev, Allstacks. These ingest data from git, Jira, CI/CD, and calendars to produce team and org-level dashboards. Jellyfish explicitly markets "advocate for promotions with data." But these platforms serve managers and VPs, not individual ICs building promotion cases. The data is quantitative (cycle time, PR throughput) rather than narrative.
Tier 3: LLM-powered summarizers. An academic multi-agent pipeline (arXiv 2505.17710) that analyzes repos via PyDriller, the Claude Code skills ecosystem (career-ops for job search, session loggers for work tracking), and Cursor Automations for weekly digests. These are the closest to useful but are early-stage, narrow, or not specifically designed for promotion evidence.
Captured well:
- Code output (commits, PRs, lines changed)
- Velocity metrics (cycle time, deployment frequency)
- Work categorization (features vs bugs vs docs)
- Quantitative throughput

Fundamentally missed:
- Mentoring and coaching (no artifact trail in git)
- Architecture influence (lives in docs and chat, not code)
- Cross-team coordination (happens in meetings and threads)
- Problem selection (zero signal in any tool)
- Organizational influence (changing how teams work)
- Glue work (incident response, onboarding, enabling documentation)
- Quality of decisions (a 10-line change that prevents a month of tech debt has the same git signal as a typo fix)
Tools that auto-extract from artifacts capture the "what" but not the "why." For Senior-to-Staff promotions, the "why" is the entire case. A Staff engineer's value is measured by the decisions they make and the work they cause others to do, not the code they personally ship.
Enterprise platforms like Jellyfish try to bridge this by connecting engineering work to business outcomes, but the connection is organizational (team X spent Y% on initiative Z), not individual (engineer A's architecture decision prevented 6 months of rework).
No tool combines: (1) automatic artifact extraction from multiple sources with (2) LLM-powered narrative generation framed around leveling dimensions with (3) human-in-the-loop confirmation/correction.
Sources: BragDoc.ai, Brag AI, Reflect, Jellyfish, Pensero, LinearB, Swarmia, Hatica, Waydev, Allstacks, Gitmore, Coderbuds, arXiv 2505.17710, awesome-agent-skills.
Multiple independent implementations solve the same problem: capturing what happened in a coding session so knowledge compounds across sessions. The implementations vary (iannuttall/claude-sessions, Takuya Matsuyama's Inkdrop workflow, bx2's session logger, automatic session tracking hooks, daily standup automation), but they share a common trait: capture happens as a side effect of working, not as a separate action.
The most effective implementations hook into the agent's lifecycle (session start/end, PR merge, file save) rather than requiring the user to remember to log. This matches the neurodivergent research finding that external triggers succeed where internal motivation fails.
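As an illustration of the pattern, here is a minimal sketch of a session-end hook handler. It assumes Claude Code's documented behavior of passing hook input as JSON on stdin (fields such as `session_id` and `cwd`); the vault path, log format, and filename scheme are our own placeholder choices. The script would be registered as a command handler under a session-end lifecycle event in `settings.json`:

```python
#!/usr/bin/env python3
"""Append a one-line session record to a daily markdown log.

Sketch only: assumes hook input arrives as JSON on stdin, as Claude Code
documents for lifecycle hooks. Paths and format are placeholders.
"""
import json
import sys
from datetime import date
from pathlib import Path

payload = json.load(sys.stdin)                 # hook input from the agent
log_dir = Path.home() / "vault" / "sessions"   # hypothetical vault layout
log_dir.mkdir(parents=True, exist_ok=True)

log_file = log_dir / f"{date.today().isoformat()}.md"
with log_file.open("a") as f:
    f.write(
        f"- session `{payload.get('session_id', 'unknown')}` "
        f"ended in `{payload.get('cwd', '?')}`\n"
    )
```

The point is the trigger, not the script: the engineer does nothing, and the record exists anyway.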
- Skills (SKILL.md files): Markdown instruction packages. Matt Pocock's public repo, alirezarezvani/claude-skills (232+ skills spanning marketing, product, compliance). Skills aren't limited to engineering.
- Hooks (settings.json lifecycle events): 15 lifecycle events with shell, prompt, and agent handler types. The agent handler can spawn sub-agents for deep verification.
- Custom slash commands: Markdown prompt templates. The most accessible entry point for personal workflows.
Two frameworks represent the frontier: the Ralph Loop (intercepts session exits, auto-re-feeds prompts, dual-condition exit gate) and Sandcastle (Matt Pocock's parallel sandboxed agent orchestrator). Both are code-focused today, but the pattern generalizes to reporting pipelines: process N sessions into daily summaries, then weeklies, etc.
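A sketch of that generalization, under the same placeholder vault layout as above: roll one day's session logs into a daily note, which a weekly job would consume in turn. A real pipeline would hand the concatenated text to an LLM for summarization rather than copying it verbatim:

```python
from pathlib import Path

def rollup_day(session_dir: Path, daily_dir: Path, day: str) -> None:
    """Aggregate one day's session logs into a single daily note.

    Sketch only: the summarization step is elided; a real pipeline
    would summarize `body` with an LLM instead of concatenating.
    """
    entries = sorted(session_dir.glob(f"{day}*.md"))
    if not entries:
        return  # graceful no-op: backfill can regenerate this later
    body = "\n".join(p.read_text() for p in entries)
    daily_dir.mkdir(parents=True, exist_ok=True)
    (daily_dir / f"{day}.md").write_text(f"# Daily rollup: {day}\n\n{body}\n")
```

The same shape repeats at each tier: weekly consumes dailies, quarterly consumes weeklies.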
Andrej Karpathy's "LLM Wiki" concept (April 2026): structured markdown files with a coding agent doing the writing, linking, categorizing, and consistency checking. One topic grew to ~100 articles and 400,000 words without Karpathy writing directly. Tiago Forte's updated Second Brain framework proposes AI performing Progressive Summarization at scale. REM Labs' Dream Engine implements nightly processing.
The common thread: plain markdown as storage, AI agent as active organizer (not just retriever). This inverts the traditional PKM model.
Cursor Automations extend the same idea to always-on agents triggered by schedules or events (messages, merged PRs, incidents). Cursor estimates hundreds of automations per hour across its user base. The positioning is shifting from "coding assistant" to "reusable internal workflow engine."
The counter-evidence is sobering. The METR 2025 study found experienced open-source developers were 19% slower with AI tools. HBR coined "workslop" for AI-generated content that looks polished but lacks substance. MIT's 2025 report found 95% of enterprise AI pilots failed to generate measurable returns.
AI meta-workflows succeed when they capture signal that would otherwise be lost, transform existing artifacts into new formats, or maintain consistency humans can't sustain. They fail when they generate content that requires human verification anyway, add process overhead that didn't exist before, or produce volume without substance.
Sources: iannuttall/claude-sessions, devas.life journaling, bx2 session logger, Matt Pocock's skills repo, alirezarezvani/claude-skills, Claude Code hooks/memory docs, Ralph Loop, Sandcastle, Karpathy LLM Wiki, Forte Labs AI Second Brain, REM Labs Dream Engine, Cursor Automations, METR 2025 study, HBR workslop article.
Career evidence tracking requires task initiation, working memory (recall), consistency, and organizational decision-making: exactly the executive functions that ADHD impairs. 10.57% of developers self-report concentration or memory disorders (Stack Overflow survey). The abandonment cycle is predictable: download, feel hopeful, use for a week, miss a day, guilt spiral, abandon.
For AuDHD (ADHD plus autism), the challenges compound: demand avoidance means that evidence tracking that feels externally imposed triggers a freeze response. The autistic preference for routine conflicts with ADHD novelty-seeking. A rigid system triggers ADHD resistance; a flexible one triggers autistic anxiety.
- "Just set a weekly reminder." Reminders create a demand that triggers avoidance. The notification becomes noise.
- "Make it fun with gamification." Works for 1-2 weeks (novelty). Then dopamine fades and the gamification layer adds maintenance complexity.
- "Use a beautiful template." Triggers hyperfocus on building the system. The system is never used for its purpose.
- "Start small, build up." "Just one sentence" still requires initiation + recall + opening the tool. Three executive function demands for one sentence.
- "Leverage hyperfocus for monthly reviews." Hyperfocus is not schedulable. It is triggered by interest, urgency, or novelty, not calendar events.
Pattern 1: Automation-first capture (evidence as a side effect). The system ingests artifacts the engineer already produces (git, Asana, design docs, chat) and surfaces them as potential evidence. Zero additional executive function is required. Supporting research comes from Zapier's neurodiversity work, behavior-sensing academic frameworks, and the RescueTime model.
Design requirements: zero-touch ingestion from APIs, capture first / organize later, no blank pages (always start with pre-populated content).
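A sketch of the zero-touch principle applied to the artifact source every engineer already has, git. Only standard `git log` flags are used; the repo path, author filter, and output shape are placeholder choices:

```python
import subprocess
from pathlib import Path

def ingest_git(repo: Path, author: str, since: str = "1 day ago") -> list[str]:
    """Collect recent commits as candidate evidence, with no user action.

    Sketch only: uses standard `git log` flags (--author, --since,
    --pretty); a fuller version would also pull PRs, docs, and chat.
    """
    result = subprocess.run(
        ["git", "-C", str(repo), "log",
         f"--author={author}", f"--since={since}", "--pretty=format:%h %s"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

Because ingestion is pull-based, the same function doubles as backfill: running it with a wider `since` window reconstructs missed days.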
Pattern 2: Recognition over recall. Present a draft for the user to confirm, correct, or augment rather than asking them to generate from scratch. Recognition is dramatically easier than recall (Nielsen's Usability Heuristic #6), and the gap is amplified by ADHD working memory deficits.
Instead of "What did you accomplish this quarter?" present "Here are 47 PRs you merged, 12 design docs you authored, and 3 incidents you resolved. Which of these demonstrate Staff-level impact?"
The user's job shifts from generate (high EF) to curate (low EF), from recall (working memory dependent) to recognize (pattern matching, an ADHD strength), from write from blank page (initiation barrier) to edit pre-filled draft (lower activation energy).
Design requirements: AI-generated first draft, binary or multiple-choice interactions, progressive refinement (coarse signal first, detail in optional follow-up passes).
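A sketch of what the interaction shift looks like in practice: the system supplies pre-collected candidates and asks only for a binary signal, so the user curates rather than generates. The candidate list is whatever the ingestion layer produced:

```python
def curate(candidates: list[str]) -> list[str]:
    """Binary-choice review: keep or skip each pre-filled candidate.

    Sketch only. One keystroke per item keeps the per-artifact cost
    near zero; free-text detail is an optional later pass.
    """
    kept = []
    for item in candidates:
        answer = input(f"Staff-level evidence? [y/N] {item}\n> ")
        if answer.strip().lower() == "y":
            kept.append(item)
    return kept
```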
Pattern 3: Social scaffolding and external accountability. Academic research shows task management for ADHD adults is "relationally and affectively co-constructed" (Chen et al., 2026). Evidence capture needs an external agent (human or AI) providing activation energy.
Body doubling (working alongside another person), AI as companion ("Let's review your week" walk-through), and hook-based triggers (PR merge, session end, 1:1 prep) all provide external structure. The distinction matters: a habit requires internal initiation (executive function); a hook is triggered by an external event (no executive function required). Participants in the Chen et al. study preferred AI that acted as a supportive companion rather than a taskmaster.
Design requirements: external triggers not internal motivation, companion tone not taskmaster, embedded in existing workflows.
Every activity costs finite energy ("spoons"). Writing a brag doc entry from memory costs 3-4 spoons. Reviewing an AI-generated draft costs 1. Passive artifact collection costs zero. Responding to "Was this PR impactful? Y/N" costs 0.5. The total spoon cost of the evidence system must approach zero for day-to-day operation.
- Zero executive function for capture (automated or side-effect)
- Recognition over recall for review (present artifacts, ask "is this important?")
- External triggers over internal motivation (hooks, not reminders)
- Companion, not taskmaster (demands trigger avoidance, especially for AuDHD with PDA traits)
- Graceful degradation on missed sessions (backfill from artifacts must equal real-time quality)
- Capture first, organize later or never (classification at capture time kills the system)
- Single tool, embedded in workflow (every additional app costs spoons)
- Dopamine-compatible feedback (instant, concrete signal rather than deferred rewards)
Sources: Chen et al. 2026 (AI scaffolding for ADHD), arXiv 2602.09381 (metacognition scaffolding), arXiv 2507.06864 (neurodivergent-aware productivity), arXiv 2312.05029 (SE with ADHD case study), PMC 5729117 (working memory and ADHD), Zapier neurodiversity, ADDitude, ADDA body doubling, UI-Patterns Good Defaults, Nielsen's Usability Heuristics.
The research findings validate and challenge the vault pipeline (session -> daily -> weekly -> quarterly -> half-review) in specific ways.
Capture-at-point-of-work is structurally correct. The vault's /vault-session skill matches the most effective pattern across all research: capture as a side effect of working, not as a separate deliberate action. Every successful system in the neurodivergent research and the AI meta-workflow ecosystem follows this same principle.
Tiered aggregation is genuinely novel. No public system implements the full session-to-daily-to-weekly-to-quarterly-to-review pipeline. Individual components exist (session tracking, weekly digests, brag docs), but nobody has connected them into a compounding pipeline. The vault is ahead of the market.
Markdown-first architecture is validated by Karpathy and Forte. Plain markdown as storage with AI as active organizer (not just retriever) is exactly the vault's architecture. Karpathy's LLM Wiki and Forte's updated Second Brain framework both converge on this design independently.
Hook-driven activation matches neurodivergent design principles. The vault's lifecycle hooks (session end triggers capture, calendar triggers daily/weekly) implement the "external triggers over internal motivation" principle that the ADHD research identifies as essential.
Backfill is a critical feature, not an edge case. The neurodivergent research is emphatic: graceful degradation on missed sessions is a requirement. The vault's ability to backfill from artifacts with same-day quality output directly addresses the most common system-killer for ADHD engineers (miss a day, guilt spiral, abandon).
The promotion-specific gap. The vault captures rich qualitative signal (session context, decisions, reasoning) that existing tools miss, but it doesn't currently structure that signal against leveling dimensions. The claim-evidence-impact chain is the universal format for promotion evidence, and the vault's reporting pipeline doesn't explicitly produce it.
Recognition-over-recall is underused. The vault's daily/weekly/quarterly skills rely on the user providing input (reflection prompts). The neurodivergent research suggests the system should present a pre-filled draft assembled from session data and ask the user to curate, not generate.
The over-engineering risk is real. The METR 19% slowdown finding and HBR's "workslop" concept are warnings. The vault succeeds where others fail because it captures signal that would genuinely be lost. But every additional layer of automation must pass the test: does verifying the output cost less effort than producing it manually?
No tool combines: (1) automatic artifact extraction from multiple sources, (2) LLM-powered narrative generation framed around leveling dimensions, (3) human-in-the-loop confirmation/correction, and (4) neurodivergent-friendly design (zero EF capture, recognition-over-recall review, external triggers).
The vault pipeline already has components (1) and (3). Adding (2) (structuring evidence against rubric dimensions during weekly/quarterly rollups) and strengthening (4) (more recognition-over-recall, fewer open-ended reflection prompts) would create the system the research shows is needed but doesn't exist.
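A sketch of what adding component (2) could look like: during the weekly rollup, each curated evidence item is tagged against rubric dimensions before being written out. The dimension names come from the Senior-vs-Staff table above; the classification step is a stub where a real implementation would call an LLM and then ask the user to confirm:

```python
DIMENSIONS = [
    "scope", "ambiguity", "impact", "leadership", "design", "communication",
]

def classify(item: str) -> list[str]:
    """Stub classifier. A real implementation would ask an LLM which
    dimensions this item evidences, then have the user confirm
    (human in the loop). The keyword match is a placeholder."""
    return [d for d in DIMENSIONS if d in item.lower()]

def weekly_rollup(items: list[str]) -> dict[str, list[str]]:
    """Group the week's curated evidence by rubric dimension, so the
    quarterly packet can be assembled claim-by-claim."""
    grouped: dict[str, list[str]] = {d: [] for d in DIMENSIONS}
    for item in items:
        for dim in classify(item) or ["unclassified"]:
            grouped.setdefault(dim, []).append(item)
    return grouped
```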
- Julia Evans: Get your work recognized: write a brag document
- Gergely Orosz: A Work Log Template for Software Engineers
- GetWorkRecognized: 3 free Brag Document Templates
- Bragdocs.com
- Will Larson: Promotion packets (StaffEng)
- John Ogden: How to write a promotion case
- Kanishk Agrawal: How should a promotion doc be?
- Steve Huynh / ByteByteGo: The Tech Promotion Algorithm
- Levels.fyi Standard SWE Level Framework
- Carta: Engineering levels at Carta
- Dropbox: Engineering Career Framework
- Box: One Rubric Changed Box's Engineering Performance
- Etsy: Engineering Career Ladder
- GitLab: Engineering Career Framework
- Square: Growth Framework for Engineers
- Rafael Cepeda: Helping software engineers with their promotions
- progression.fyi
- Sean Goedecke: How I got promoted to staff engineer twice
- BragDoc.ai / GitHub
- Brag AI / GitHub
- Reflect
- Jellyfish People Management
- Pensero
- LinearB
- Swarmia
- Hatica
- Waydev
- Allstacks
- Gitmore
- Coderbuds
- Cortex
- arXiv 2505.17710: LLM Contribution Summarization
- iannuttall/claude-sessions
- Automating development journaling with Claude Code (devas.life)
- bx2: Building a session logger for Claude Code
- Daily standup slash command (DEV Community)
- Automatic Session Tracking skill
- Matt Pocock's skills repo
- alirezarezvani/claude-skills
- Claude Code hooks guide
- awesome-claude-code curated list
- claude-mem plugin
- Ralph Loop plugin
- frankbria/ralph-claude-code
- Sandcastle repo
- ghuntley.com/loop
- Karpathy LLM Wiki gist
- MindStudio: What is Karpathy's LLM Wiki
- Forte Labs: Introducing the AI Second Brain
- REM Labs: Building a Second Brain with AI in 2026
- kepano/obsidian-skills
- AgriciDaniel/claude-obsidian
- Cursor: Build agents that run automatically
- METR: Measuring Impact of Early-2025 AI on Developer Productivity
- HBR: AI-Generated Workslop Is Destroying Productivity
- Fortune: AI Productivity Paradox
- CIO: The AI productivity trap
- Chen et al.: Not Just Me and My To-Do List (2026)
- Scaffolding Metacognition with GenAI for ADHD Task Management (2026)
- Toward Neurodivergent-Aware Productivity Framework (2025)
- Challenges, Strengths, and Strategies of SE with ADHD (2023)
- Working Memory and Organizational Skills in ADHD (PMC)
- Designing Body Doubling for ADHD in VR (2025)
- Disclosure of Neurodivergence in Software Workplaces (ACM 2024)
- The ADHD Developer's Survival Guide (DEV Community)
- ADHD-Proofing Your Developer Workflow (Super Productivity)
- Neurodivergent? Why you should learn how to automate (Zapier)
- Recognition Over Recall (Research Collective)
- Good Defaults Design Pattern (UI Patterns)
- Inclusive UX/UI for Neurodivergent Users (Medium)
- AuDHD and Pathological Demand Avoidance (Substack)
- What Is AuDHD? 2026 Guide (Sachs Center)
- Spoon Theory for Autism and ADHD (Neurodivergent Insights)
- ADHD Fatigue: Spoon Theory and Habit Stacking (ADDitude)
- The ADHD Body Double (ADDA)