@nibzard
Created May 15, 2026 10:34
Context Rot: A Cross-Disciplinary Research Review — Cognitive Overhead of Managing Multiple AI Coding Agents in Parallel
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Context Rot — A Cross-Disciplinary Research Review</title>
<style>
:root {
--bg: #fafafa;
--text: #1a1a2e;
--muted: #6b7280;
--accent: #4f46e5;
--border: #e5e7eb;
--code-bg: #f3f4f6;
--quote-border: #a5b4fc;
--link: #4f46e5;
}
@media (prefers-color-scheme: dark) {
:root {
--bg: #0f0f17;
--text: #e2e8f0;
--muted: #94a3b8;
--accent: #818cf8;
--border: #1e293b;
--code-bg: #1e293b;
--quote-border: #6366f1;
--link: #818cf8;
}
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Noto Sans', sans-serif;
background: var(--bg);
color: var(--text);
line-height: 1.7;
padding: 2rem 1rem;
}
article {
max-width: 42rem;
margin: 0 auto;
}
h1 {
font-size: 1.75rem;
font-weight: 700;
letter-spacing: -0.02em;
margin-bottom: 0.25rem;
line-height: 1.3;
}
.subtitle {
color: var(--muted);
font-size: 0.925rem;
margin-bottom: 0.75rem;
font-weight: 400;
}
.meta {
font-size: 0.8rem;
color: var(--muted);
margin-bottom: 2.5rem;
}
h2 {
font-size: 1.3rem;
font-weight: 600;
margin: 2.5rem 0 0.75rem;
padding-bottom: 0.4rem;
border-bottom: 1px solid var(--border);
letter-spacing: -0.01em;
}
h3 {
font-size: 1.05rem;
font-weight: 600;
margin: 1.75rem 0 0.5rem;
color: var(--accent);
}
p { margin: 0.6rem 0; }
strong { font-weight: 600; }
a { color: var(--link); text-decoration: none; }
a:hover { text-decoration: underline; }
blockquote {
border-left: 3px solid var(--quote-border);
padding: 0.4rem 1rem;
margin: 0.75rem 0;
color: var(--muted);
font-style: italic;
font-size: 0.925rem;
}
code {
font-family: 'JetBrains Mono', 'Fira Code', 'SF Mono', Menlo, monospace;
background: var(--code-bg);
padding: 0.15em 0.4em;
border-radius: 4px;
font-size: 0.85em;
}
pre {
background: var(--code-bg);
padding: 1rem 1.25rem;
border-radius: 6px;
overflow-x: auto;
margin: 1rem 0;
font-size: 0.85rem;
line-height: 1.5;
}
pre code {
background: none;
padding: 0;
border-radius: 0;
}
ul, ol {
margin: 0.5rem 0;
padding-left: 1.5rem;
}
li { margin: 0.3rem 0; }
table {
width: 100%;
border-collapse: collapse;
margin: 1rem 0;
font-size: 0.9rem;
}
th {
text-align: left;
font-weight: 600;
padding: 0.5rem 0.75rem;
border-bottom: 2px solid var(--border);
}
td {
padding: 0.45rem 0.75rem;
border-bottom: 1px solid var(--border);
}
hr {
border: none;
border-top: 1px solid var(--border);
margin: 2rem 0;
}
.toc {
background: var(--code-bg);
border-radius: 8px;
padding: 1rem 1.5rem;
margin: 1.5rem 0 2rem;
}
.toc h2 {
border: none;
font-size: 1rem;
margin: 0 0 0.5rem;
padding: 0;
}
.toc ol {
padding-left: 1.25rem;
font-size: 0.9rem;
}
.toc li { margin: 0.2rem 0; }
</style>
</head>
<body>
<article>
<h1 id="context-rot-a-cross-disciplinary-research-review">Context Rot: A Cross-Disciplinary Research Review</h1>
<p><strong>Subtitle:</strong> Cognitive Overhead of Managing Multiple AI Coding Agents in Parallel — 50 Years of Psychology Applied to a New Problem</p>
<p><strong>Date:</strong> May 2026<br />
<strong>Status:</strong> Living document — research compilation from ~30 foundational papers (1956–2026)</p>
<hr />
<h2 id="executive-summary">Executive Summary</h2>
<p>Developers running multiple AI coding agents in parallel (Claude Code, Codex CLI, Cursor, Devin, etc.) face a cognitive burden with no precedent in software engineering history. We set out to understand this phenomenon through historical psychological research spanning 50+ years.</p>
<p><strong>The core finding:</strong> A developer managing <em>N</em> AI coding agents must maintain <em>N+1</em> concurrent mental models — their own understanding of the codebase plus a separate model of each agent's knowledge state. Cognitive psychology establishes that humans can reliably maintain only <strong>2–3 mental models simultaneously</strong> (Johnson-Laird 1983; Cowan 2001). Theory of Mind research shows tracking even <strong>2 agents' distinct beliefs degrades accuracy to 65–75%</strong> (Apperly &amp; Butterfill 2009). The <strong>hard cognitive limit for productive multi-agent coding is 1–2 agents without external cognitive aids</strong>, and 2–3 agents maximum with excellent tooling support.</p>
<p>This document compiles research from cognitive load theory, working memory, task switching, interruption science, mental models, Theory of Mind, supervisory control, situation awareness, human-automation interaction, programmer cognition, flow state, and recent human-AI teaming studies — and maps each finding directly to the new phenomenon of parallel AI coding agent management.</p>
<hr />
<h2 id="table-of-contents">Table of Contents</h2>
<ol>
<li><a href="#1-the-problem-context-rot">The Problem: Context Rot</a></li>
<li><a href="#2-foundation-i-working-memory-cognitive-load">Foundation I: Working Memory &amp; Cognitive Load</a></li>
<li><a href="#3-foundation-ii-task-switching-interruption-science">Foundation II: Task Switching &amp; Interruption Science</a></li>
<li><a href="#4-foundation-iii-mental-models--theory-of-mind">Foundation III: Mental Models &amp; Theory of Mind</a></li>
<li><a href="#5-foundation-iv-human-automation-interaction">Foundation IV: Human-Automation Interaction</a></li>
<li><a href="#6-foundation-v-programmer-specific-cognition">Foundation V: Programmer-Specific Cognition</a></li>
<li><a href="#7-foundation-vi-recent-human-ai-teaming-research">Foundation VI: Recent Human-AI Teaming Research</a></li>
<li><a href="#8-modern-tool-landscape">Modern Tool Landscape</a></li>
<li><a href="#9-synthesis-the-context-rot-framework">Synthesis: The Context Rot Framework</a></li>
<li><a href="#10-actionable-principles">Actionable Principles</a></li>
<li><a href="#11-complete-references">Complete References</a></li>
</ol>
<hr />
<h2 id="1-the-problem-context-rot">1. The Problem: Context Rot</h2>
<h3 id="11-defining-context-rot">1.1 Defining Context Rot</h3>
<p><strong>Context rot</strong> is the progressive degradation of a developer's accurate understanding of what an AI coding agent knows, what it has done, and what it will do next — caused by the compounding cognitive overhead of maintaining parallel mental models across multiple agent sessions.</p>
<p>It manifests as:</p>
<ul>
<li>Forgetting which agent has seen which files</li>
<li>Assuming agents share knowledge (they don't; each has an isolated context)</li>
<li>Failing to detect when one agent's work invalidates another's assumptions</li>
<li>Rubber-stamping agent output due to review fatigue (automation complacency)</li>
<li>Making decisions based on stale mental models of agent state</li>
</ul>
<h3 id="12-the-n1-mental-model-problem">1.2 The N+1 Mental Model Problem</h3>
<p>When a developer works without AI, they maintain <strong>1 mental model</strong>: their understanding of the codebase.</p>
<p>When they work with <em>N</em> AI coding agents, they must maintain <strong>N+1 mental models</strong>:</p>
<table>
<thead>
<tr>
<th>Agents</th>
<th>Mental Models Required</th>
<th>Cognitive Status</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1 (own understanding)</td>
<td>Comfortable baseline</td>
</tr>
<tr>
<td>1</td>
<td>2 (own + agent's knowledge state)</td>
<td>At capacity limit</td>
</tr>
<tr>
<td>2</td>
<td>3 (own + 2 agents' states)</td>
<td>Beyond reliable capacity</td>
</tr>
<tr>
<td>3</td>
<td>4 (own + 3 agents' states)</td>
<td>Systematic failures expected</td>
</tr>
<tr>
<td>4+</td>
<td>5+</td>
<td>Catastrophic context rot</td>
</tr>
</tbody>
</table>
<p>This is not a software problem — it is a <strong>fundamental human cognitive limitation</strong> established across decades of psychological research. No interface improvement can eliminate it; the best we can do is provide external cognitive aids that reduce the per-agent mental model burden.</p>
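<p>The N+1 count in the table can be sketched in a few lines. This is a minimal illustration, not a validated model: the function names are hypothetical, and the capacity of 3 is an assumption taken from the upper end of the 2–3 mental-model range cited in the Executive Summary.</p>

```python
# Assumption: 3 is the upper end of the "2-3 mental models" range cited in
# the Executive Summary; treat it as a tunable parameter, not a fixed constant.
RELIABLE_MODEL_CAPACITY = 3


def mental_models_required(n_agents: int) -> int:
    """The developer's own codebase model plus one model per agent (N + 1)."""
    if 0 > n_agents:
        raise ValueError("n_agents must be non-negative")
    return n_agents + 1


def models_over_capacity(n_agents: int, capacity: int = RELIABLE_MODEL_CAPACITY) -> int:
    """How many required mental models exceed the assumed reliable capacity."""
    return max(0, mental_models_required(n_agents) - capacity)


if __name__ == "__main__":
    # One row per agent count: agents, models required, models over capacity.
    for n in range(5):
        print(n, mental_models_required(n), models_over_capacity(n))
```

<p>Lowering the assumed capacity to 2 moves every overload threshold one agent earlier, which is why the choice of capacity figure matters more than the formula itself.</p>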
<h3 id="13-the-martin-fowler-attribution-note">1.3 The Martin Fowler Attribution Note</h3>
<p>The concept of "maintaining your own model and the model of what the agent knows" has been widely attributed to Martin Fowler in developer discourse. Exhaustive research across martinfowler.com, his bliki, Substack, LinkedIn, Twitter/X, and general web search <strong>could not locate this specific article or quote from Fowler</strong>. The concept appears to be a <strong>community-synthesized observation</strong> that may have been misattributed. Regardless of provenance, the concept itself is valid and well-supported by the research in this document — particularly Johnson-Laird's mental model theory (1983) and Apperly &amp; Butterfill's Theory of Mind limits (2009).</p>
<hr />
<h2 id="2-foundation-i-working-memory-cognitive-load">2. Foundation I: Working Memory &amp; Cognitive Load</h2>
<h3 id="21-miller-1956-the-magical-number-seven-plus-or-minus-two">2.1 Miller (1956) — "The Magical Number Seven, Plus or Minus Two"</h3>
<p><strong>Core Finding:</strong> Human absolute judgment and short-term memory are bounded at approximately <strong>7 ± 2 chunks</strong> of information. However, Miller emphasized that recoding into larger chunks can <em>multiply</em> effective capacity.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>7 ± 2 items as the span of immediate memory for digits, letters, and words</li>
<li>Recoding into larger chunks multiplies effective capacity: recoding 3 binary digits as 1 octal digit triples the information held in a ~9-item span, from ~9 to ~27 bits</li>
</ul>
<blockquote>
<p><em>"By organizing the stimulus input simultaneously into several dimensions and successively into a sequence of chunks, we manage to break (or at least stretch) this informational bottleneck."</em></p>
</blockquote>
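<p>The binary-to-octal recoding arithmetic can be made concrete with a short sketch. The function names and the 9-item span are illustrative assumptions based on the figures in this section, not code or parameters from the paper.</p>

```python
def recode_binary_to_octal(bits: str) -> str:
    """Recode a binary string into octal digits, one digit per 3-bit chunk."""
    if len(bits) % 3 != 0 or set(bits) - {"0", "1"}:
        raise ValueError("expected a binary string whose length is a multiple of 3")
    return "".join(str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3))


def bits_held(span_items: int, bits_per_item: int) -> int:
    """Information held when each remembered item carries bits_per_item bits."""
    return span_items * bits_per_item


SPAN_ITEMS = 9  # assumption: the ~9-item span implied by the figures above

if __name__ == "__main__":
    # 18 binary digits collapse into 6 octal digits to remember.
    print(recode_binary_to_octal("101000100111001110"))
    # The same 9-item span holds 9 bits as binary digits, 27 bits as octal digits.
    print(bits_held(SPAN_ITEMS, 1), bits_held(SPAN_ITEMS, 3))
```

<p>The span in items stays the same; only the information per item changes. That is the sense in which structured context "stretches" the bottleneck rather than widening it.</p>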
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li>A developer managing 3–4 agent sessions is attempting to hold ~4 separate "streams" of 7±2 chunks each, which fails catastrophically without external scaffolding</li>
<li>The <em>structure</em> of context matters more than its <em>volume</em>: well-organized context with clear section headers and hierarchical task decomposition stretches the bottleneck</li>
<li>Raw code dumps into agent context are like unchunked binary digits: they saturate capacity without transmitting proportional information</li>
</ul>
<h3 id="22-cowan-2001-2010-the-magical-number-4-in-short-term-memory">2.2 Cowan (2001, 2010) — "The Magical Number 4 in Short-Term Memory"</h3>
<p><strong>Core Finding:</strong> Miller's 7±2 <strong>overestimates</strong> working memory capacity. The true capacity of working memory (the "focus of attention") is <strong>~4 chunks</strong>. Miller's number reflected long-term memory activation <em>plus</em> working memory, not working memory alone.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Working memory capacity: <strong>3–5 chunks</strong> (median ~4) across all tested modalities</li>
<li>95% CI for capacity: approximately <strong>3.2 to 4.7 chunks</strong></li>
<li>90th percentile: ~5 chunks; 10th percentile: ~2.5 chunks</li>
</ul>
<blockquote>
<p><em>"It is the focus of attention that is limited, not the activated portion of long-term memory."</em></p>
<p><em>"The capacity limit is not easily increased by practice or by the use of strategies, suggesting that it may reflect a fundamental architectural constraint."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>The 4-chunk limit is the real constraint for agent context.</strong> A "chunk" in code might be: (1) the function being modified, (2) the change being made, (3) the reason, (4) the impact on callers. If agent output requires more than 4 chunks simultaneously, the developer <em>will</em> fail to comprehend it.</li>
<li><strong>Context window size is largely irrelevant to comprehension.</strong> What matters is how many chunks the agent makes the developer juggle simultaneously. A 500-line diff organized into 4 clearly labeled sections is <em>more comprehensible</em> than a 20-line diff spread across 8 files.</li>
<li>If a developer has 3 agent sessions open, each requiring 2 chunks in mind, that is 6 chunks, already over the ~4-chunk limit</li>
<li><strong>Design implication:</strong> Aim for 2–3 chunks as the maximum simultaneous demand per agent interaction</li>
</ul>
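<p>The chunk arithmetic above can be checked with a small budget calculation. This is a sketch under stated assumptions: the helper names are hypothetical, and it treats chunk demands across sessions as simply additive, which the source asserts but which is a simplification.</p>

```python
WORKING_MEMORY_CHUNKS = 4  # the ~4-chunk estimate cited above


def total_chunk_demand(chunks_per_session: list[int]) -> int:
    """Sum the chunks each open agent session asks the developer to hold."""
    return sum(chunks_per_session)


def exceeds_capacity(chunks_per_session: list[int],
                     capacity: int = WORKING_MEMORY_CHUNKS) -> bool:
    """True when combined demand is strictly greater than the capacity."""
    return total_chunk_demand(chunks_per_session) > capacity


def max_sessions(chunks_each: int, capacity: int = WORKING_MEMORY_CHUNKS) -> int:
    """Largest number of equal-demand sessions that fits within the capacity."""
    if 1 > chunks_each:
        raise ValueError("chunks_each must be at least 1")
    return capacity // chunks_each


if __name__ == "__main__":
    print(total_chunk_demand([2, 2, 2]), exceeds_capacity([2, 2, 2]))
    print(max_sessions(2))
```

<p>At 2 chunks per session the budget allows two sessions; at 3 chunks per session it allows only one.</p>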
<h3 id="23-oberauer-2002-access-to-information-in-working-memory">2.3 Oberauer (2002) — "Access to Information in Working Memory"</h3>
<p><strong>Core Finding:</strong> Three concentric regions of working memory:</p>
<ol>
<li><strong>Activated long-term memory:</strong> Broad activation, ~unlimited capacity, 300–500ms access</li>
<li><strong>Direct access region:</strong> ~4 items loaded for immediate use, ~50ms access</li>
<li><strong>Focus of attention:</strong> <strong>1 item</strong> currently being processed</li>
</ol>
<p>Critical finding: <strong>processing and storage compete for the focus of attention.</strong> Each processing operation reduces accessible storage by ~0.5–1 items.</p>
<blockquote>
<p><em>"Processing and storage in working memory compete for the focus of attention. Any processing that requires the focus of attention will reduce the number of elements that can be maintained in the direct-access region."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>When evaluating an agent's suggestion (processing), you can hold essentially zero other items in direct access.</strong> All necessary context must be <em>externally visible</em>, not held in working memory.</li>
<li><strong>"Agent fatigue" explained:</strong> Every evaluation of agent output displaces stored context. The developer constantly reloads from activated LTM (300–500ms per item) because direct access keeps getting cleared, which is cognitively exhausting.</li>
<li><strong>Focus switching between agents:</strong> Swapping the 4-item direct access contents takes ~1–2 seconds minimum (4 items × 300–500ms per reload from activated LTM), even with perfect external scaffolding</li>
</ul>
<h3 id="24-baddeley-2000-the-episodic-buffer">2.4 Baddeley (2000) — The Episodic Buffer</h3>
<p><strong>Core Finding:</strong> Revised working memory model adding a fourth component — the <strong>episodic buffer</strong>: a limited-capacity (~4 chunks), multi-modal storage system that integrates information across modalities and interfaces with long-term memory.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Buffer capacity: ~4 integrated chunks</li>
<li>Holds ~2–3 seconds of spoken information or a single complex visual scene</li>
<li>Contents must be actively refreshed every ~1–2 seconds or they decay</li>
</ul>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li>Agents that produce <em>only</em> text are harder to work with than agents producing text + diagrams + file tree visualizations; multi-modal output maps directly to the buffer's design</li>
<li><strong>Explainability is a capacity requirement, not a luxury.</strong> Opaque agent behavior bypasses the episodic buffer and cannot be consciously integrated into the developer's mental model.</li>
<li>In a 2-hour session, initial context has been unrefreshed for ~7,200 seconds; it is completely gone from the episodic buffer unless externalized</li>
</ul>
<h3 id="25-baddeley-hitch-1974-original-working-memory-model">2.5 Baddeley &amp; Hitch (1974) — Original Working Memory Model</h3>
<p><strong>Core Finding:</strong> Multi-component system: phonological loop (verbal), visuospatial sketchpad (spatial), central executive (attentional control).</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Phonological store: ~2 seconds of acoustic information, or ~7 items if rehearsed</li>
<li>Dual-task costs (same system): 30–50% performance drop each</li>
<li>Dual-task costs (different systems): &lt;10% drop</li>
</ul>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li>Reading code engages the phonological loop via subvocalization. An agent producing 500 lines overwhelms the ~2-second store; developers <em>cannot</em> subvocalize 500 lines</li>
<li><strong>Design implication:</strong> Agents should produce smaller, labeled chunks with prose summaries subvocalizable in ≤10 seconds per chunk</li>
</ul>
<h3 id="26-sweller-1988-cognitive-load-during-problem-solving">2.6 Sweller (1988) — Cognitive Load During Problem Solving</h3>
<p><strong>Core Finding:</strong> Not all cognitive load is equal. Cognitive load theory, which grew out of this paper and was formalized in Sweller's later work, distinguishes three types:</p>
<ul>
<li><strong>Intrinsic load:</strong> Inherent complexity of the material</li>
<li><strong>Extraneous load:</strong> Load imposed by poor presentation/format (wasted)</li>
<li><strong>Germane load:</strong> Load devoted to schema construction (useful)</li>
</ul>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Means-ends analysis places a four-component demand on working memory simultaneously</li>
<li>Split-attention effect: integrating separated text and diagrams reduces processing time by 30–50%</li>
<li>Redundancy effect: presenting the same information twice doubles processing time without improving learning</li>
</ul>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li>Vague prompting ("fix the bug") creates extraneous load: the agent spends context on searching. Precise prompts eliminate the search and direct resources toward schema construction.</li>
<li>As sessions progress, stale context is "presented information" that no longer serves a function but still demands processing. It is pure extraneous load.</li>
<li><strong>Verbose agents are not just annoying; they are cognitively expensive.</strong> Every token that doesn't contribute to schema construction is extraneous load.</li>
</ul>
<h3 id="27-sweller-ayres-kalyuga-2011-expertise-reversal-effect">2.7 Sweller, Ayres &amp; Kalyuga (2011) — Expertise Reversal Effect</h3>
<p><strong>Core Finding:</strong> Instructional techniques that benefit novices (worked examples, high guidance) can <em>hinder</em> experts because the guidance becomes redundant extraneous load.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Expertise reversal effect sizes: d = 0.5 to 1.2</li>
<li>Schema automation reduces processing time from ~1000ms to ~100ms per element</li>
<li>Transient information (streaming text, animations) impairs learning by 20–40% vs. permanent information</li>
</ul>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li>AI agents should detect expertise level and adjust verbosity accordingly. Most current agents do not; they give the same output to juniors and seniors.</li>
<li><strong>Streaming token output is transient information</strong> that impairs schema construction. Agents should conclude with a persistent summary.</li>
<li>Session warm-up is costly because schemas from previous sessions aren't preserved. Every session starts from zero.</li>
</ul>
<h3 id="28-chase-simon-1973-chunking-and-expertise">2.8 Chase &amp; Simon (1973) — Chunking and Expertise</h3>
<p><strong>Core Finding:</strong> Expert chess players don't have better general memory — they have better <strong>chunked perception</strong>. Masters recalled ~16 pieces in 5 seconds (meaningful positions); novices ~4. With <em>random</em> positions, both performed identically (~4).</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Master chunks contained ~3–4 pieces each → ~4–5 chunks total (consistent with Cowan's 4)</li>
<li>Estimated chunk repertoire: 50,000–100,000 stored patterns for masters</li>
</ul>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>AI-generated code that breaks familiar patterns</strong> causes expert review performance to drop to novice levels. The developer's chunk advantage disappears.</li>
<li><strong>Template alignment is not about correctness; it's about cognitive load.</strong> Code that fits existing patterns is "free" cognitively; unfamiliar patterns are extremely expensive.</li>
<li>The most valuable agent behavior may be <strong>building the developer's chunk repertoire</strong>: exposing patterns, naming them, and creating documentation that functions as an external chunk library.</li>
</ul>
<h3 id="29-ericsson-chase-1982-exceptional-memory-sf-study">2.9 Ericsson &amp; Chase (1982) — Exceptional Memory (SF Study)</h3>
<p><strong>Core Finding:</strong> Subject SF expanded digit span from ~7 to 79–82 digits through 230+ hours of practice using <strong>meaningful encoding</strong> (running times). Transfer to non-practiced materials: <strong>zero</strong>. Decay without practice: ~20% after 2 months.</p>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li>SF's technique maps directly to what agents need: encode context into meaningful structures connected to existing codebase patterns, not raw token dumps</li>
<li><strong>Zero transfer warning:</strong> Expertise in using an AI agent on Python doesn't transfer to Rust. The chunk repertoire is domain-specific.</li>
<li>A developer who hasn't touched a module in 2 months has lost their chunk-based shortcuts. Agents cannot assume familiarity.</li>
</ul>
<h3 id="210-gobet-et-al-2001-template-theory">2.10 Gobet et al. (2001) — Template Theory</h3>
<p><strong>Core Finding:</strong> Experts use ~50–100 <strong>templates</strong> (high-level schemas with variable slots) that cover &gt;80% of situations. Most input is template-predicted; only novel aspects require chunk-by-chunk processing.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Slot filling time: ~200–300ms vs. ~1000ms per chunk for non-template encoding</li>
<li>A position requiring ~15 chunks for novices requires only ~2–3 <em>novel</em> chunks for experts</li>
</ul>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li>Senior developers have ~50–100 "templates" for their codebase. When an agent produces code that doesn't fit these templates, it forces the developer to process the <em>entire output</em> as novel chunks. This is experienced as "I don't trust this code" even when the code is technically correct.</li>
<li><strong>The 4-chunk limit applies specifically to <em>novel</em> information.</strong> If agent changes fit existing templates, larger diffs are fine. For novel architectural changes, diffs should be very small and carefully explained.</li>
</ul>
<hr />
<h2 id="3-foundation-ii-task-switching-interruption-science">3. Foundation II: Task Switching &amp; Interruption Science</h2>
<h3 id="31-monsell-2003-task-switching-trends-in-cognitive-sciences">3.1 Monsell (2003) — "Task Switching" (Trends in Cognitive Sciences)</h3>
<p><strong>Core Finding:</strong> Task switching incurs <strong>two distinct costs</strong>: (1) <strong>switch cost</strong> — slower performance on switch vs. repeat trials, and (2) <strong>mixing cost</strong> — even repeat trials are slower in mixed-task blocks. The dominant framework is <strong>Task Set Reconfiguration</strong>: switching requires active reconfiguration of cognitive control settings.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Switch cost: <strong>200–500ms</strong> slower on switch trials</li>
<li>Mixing cost: <strong>50–150ms</strong> even on repeat trials in mixed blocks</li>
<li>Error rate increase: <strong>2–8 percentage points</strong> on switch trials</li>
<li><strong>Residual cost:</strong> Even with 600+ms preparation, ~<strong>30–50ms</strong> remains</li>
</ul>
<blockquote>
<p><em>"Task set reconfiguration is not instantaneous; it takes time, and if insufficient time is available, the residue of the previous task set will interfere."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>Mixing cost → parallel monitoring tax:</strong> Even when focused on one agent, awareness that others are running degrades performance</li>
<li><strong>Residual cost after preparation:</strong> Reading an agent's summary before switching doesn't eliminate the cost</li>
<li><strong>Asymmetric costs:</strong> Switching FROM complex agent work TO simple tasks is harder than the reverse, so plan agent switch order accordingly</li>
</ul>
<h3 id="32-rogers-monsell-1995-costs-of-a-predictable-switch">3.2 Rogers &amp; Monsell (1995) — Costs of a Predictable Switch</h3>
<p><strong>Core Finding:</strong> Using the alternating runs paradigm, established that switch costs are <strong>real and not eliminable</strong> even when the switch is fully predictable with advance warning.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Switch cost (with preparation): <strong>~200–300ms</strong> even when the switch was known</li>
<li>With 600ms preparation interval: still <strong>~150–200ms</strong> residual</li>
<li>Error rate switch cost: <strong>~3–5%</strong></li>
</ul>
<blockquote>
<p><em>"Even when the subject knows exactly when and to what task they must switch, a substantial cost remains."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>Scheduled agent check-ins don't eliminate cognitive cost.</strong> You can't schedule your way out of task switching overhead.</li>
<li>Short intervals between switches nearly double costs. This argues against notification-driven reactive switching.</li>
</ul>
<h3 id="33-rubinstein-meyer-evans-2001-goal-shifting-vs-rule-activation">3.3 Rubinstein, Meyer &amp; Evans (2001) — Goal Shifting vs. Rule Activation</h3>
<p><strong>Core Finding:</strong> Decomposed switch costs into <strong>two distinct executive control processes</strong>:</p>
<ol>
<li><strong>Goal Shifting</strong> (~200ms): Deciding to switch; <em>eliminable</em> with preparation</li>
<li><strong>Rule Activation</strong> (~300ms): Loading the new task's rules; <em>NOT eliminable</em> with preparation alone</li>
</ol>
<p>Total unprepared switch cost: ~500–600ms. After 1200ms preparation: ~300ms (goal shifting eliminated, rule activation remains).</p>
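<p>The two-stage decomposition amounts to simple addition, sketched below. The constants are the approximate figures quoted in this subsection, and the function name is hypothetical; actual costs vary with task and participant.</p>

```python
GOAL_SHIFTING_MS = 200    # approximate figure from this subsection
RULE_ACTIVATION_MS = 300  # approximate figure from this subsection


def switch_cost_ms(prepared: bool) -> int:
    """Estimated switch cost under the two-stage model.

    Preparation removes the goal-shifting stage; rule activation always remains.
    """
    goal_shifting = 0 if prepared else GOAL_SHIFTING_MS
    return goal_shifting + RULE_ACTIVATION_MS


if __name__ == "__main__":
    print(switch_cost_ms(prepared=False), switch_cost_ms(prepared=True))
```

<p>In this model no amount of preparation pushes the cost below the rule-activation term, which is the part a dashboard or summary cannot pay in advance.</p>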
<blockquote>
<p><em>"Goal shifting can be completed in advance of the stimulus, but rule activation cannot."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>Agent dashboards/summaries address goal shifting but NOT rule activation.</strong> You still must actually look at the code and diffs. Reading a summary isn't enough.</li>
<li><strong>Minimize distinct task contexts across agents.</strong> If all agents work on the same codebase with similar patterns, rule activation cost is shared. Different languages/frameworks = full rule activation per switch.</li>
</ul>
<h3 id="34-mark-gudith-klocke-2008-the-301-disruption-ratio">3.4 Mark, Gudith &amp; Klocke (2008) — The 30:1 Disruption Ratio</h3>
<p><strong>Core Finding:</strong> Field study establishing that it takes <strong>10–30× longer</strong> to recover from an interruption than the interruption itself lasts.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li><strong>Disruption ratio: 10:1 to 30:1</strong></li>
<li>30-second interruption → ~5–15 minutes recovery time</li>
<li>Average uninterrupted work string: <strong>~11 minutes</strong> before interruption</li>
<li>Interrupted tasks take <strong>~20–25% longer</strong> to complete overall</li>
<li><strong>3–4 interruptions per hour</strong> experienced</li>
</ul>
<blockquote>
<p><em>"After an interruption, people do not simply resume where they left off; they engage in a period of 'recovery' that involves revisiting and reconstructing their task state."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>This is the strongest argument against reactive notification-driven agent management.</strong> A 30-second glance at Agent B's notification can cost up to ~15 minutes of degraded performance on Agent A's task.</li>
<li>A 2-minute Slack check could cost 20–60 minutes of effective work.</li>
<li><strong>Batch polling every 60–90 minutes is economically rational</strong> vs. reactive switching.</li>
</ul>
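<p>The recovery estimates above follow directly from the 10:1 to 30:1 ratio. The sketch below reproduces that arithmetic; the function name is hypothetical, and the ratios are the ones stated in this subsection, taken as given rather than independently verified.</p>

```python
def recovery_range_minutes(interruption_seconds: float,
                           low_ratio: float = 10.0,
                           high_ratio: float = 30.0) -> tuple[float, float]:
    """Estimated (low, high) recovery time in minutes for one interruption.

    The default ratios are the 10:1 and 30:1 disruption ratios stated above.
    """
    if 0 > interruption_seconds:
        raise ValueError("interruption_seconds must be non-negative")
    low = interruption_seconds * low_ratio / 60
    high = interruption_seconds * high_ratio / 60
    return (low, high)


if __name__ == "__main__":
    print(recovery_range_minutes(30))   # the 30-second glance
    print(recovery_range_minutes(120))  # the 2-minute Slack check
```

<p>Because the cost scales linearly with interruption length in this model, halving the number of check-ins halves the estimated recovery burden.</p>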
<h3 id="35-iqbal-horvitz-2007-disruption-and-recovery-of-computing-tasks">3.5 Iqbal &amp; Horvitz (2007) — Disruption and Recovery of Computing Tasks</h3>
<p><strong>Core Finding:</strong> Recovery is <strong>non-deterministic</strong> — similar interruptions produce wildly different recovery times depending on task state. Three recovery strategies: immediate resumption, reconstruction, and deferred resumption.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Resumption lag (simple tasks): <strong>~10–15 seconds</strong></li>
<li>Resumption lag (complex tasks like coding): <strong>~1–4 minutes</strong></li>
<li><strong>30–40%</strong> of interruptions require full reconstruction</li>
<li><strong>3–8 additional navigation actions</strong> during recovery</li>
<li><strong>10–12 interruptions per hour</strong> during computing tasks</li>
</ul>
<blockquote>
<p><em>"The cost of an interruption is not well predicted by the duration of the interruption itself, but by the complexity of the interrupted task and the user's position within it."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>You cannot plan around "quick checks."</strong> Sometimes returning from Agent B to Agent A takes 10 seconds; sometimes 4 minutes.</li>
<li>Three agents each sending ~4 notifications per hour produce 12 per hour, right at the threshold where recovery becomes chaotic.</li>
<li><strong>Reduce agent notification frequency aggressively.</strong> Configure agents to notify only on completion, errors, or decisions needed.</li>
</ul>
<h3 id="36-monk-trafton-boehm-davis-2008-complexity-duration">3.6 Monk, Trafton &amp; Boehm-Davis (2008) — Complexity &gt; Duration</h3>
<p><strong>Core Finding:</strong> Interruption <strong>complexity</strong> matters more than <strong>duration</strong>. A 30-second complex interruption can be more disruptive than a 2-minute simple one. The interaction is <strong>superadditive</strong> — not simply additive.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Simple task + simple interruption: ~5–10s recovery</li>
<li>Complex task + complex interruption: ~60–120+ seconds recovery</li>
<li><strong>Duration (15s vs 90s) showed no significant effect</strong> on resumption lag for simple interruptions</li>
<li>Error rate after complex interruption of complex task: <strong>+15–20%</strong></li>
</ul>
<blockquote>
<p><em>"The complexity of the interruption, rather than its duration, is the primary determinant of disruption."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li>A notification saying "Agent B finished successfully" (simple) and one saying "Agent B encountered a merge conflict" (complex) have vastly different disruption costs.</li>
<li><strong>Implement a notification tier system:</strong> Tier 1 (always show): completion, hard failure. Tier 2 (batch): progress. Tier 3 (defer): questions, ambiguities.</li>
<li><strong>Design agent output for clarity, not brevity.</strong> A 2-minute clear summary is less disruptive than 30 seconds of cryptic error output.</li>
</ul>
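<p>One way the three-tier scheme could be wired up is sketched below. Everything here is illustrative: <code>Tier</code>, <code>EVENT_TIERS</code>, <code>Notification</code>, <code>route</code>, and the event kind strings are hypothetical names rather than the API of any existing agent tool, and sending unrecognized event kinds to the batch tier is an assumption.</p>

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SHOW_NOW = 1  # Tier 1: completion, hard failure
    BATCH = 2     # Tier 2: progress updates
    DEFER = 3     # Tier 3: questions, ambiguities


# Hypothetical event kinds mapped onto the tiers described above.
EVENT_TIERS = {
    "completion": Tier.SHOW_NOW,
    "hard_failure": Tier.SHOW_NOW,
    "progress": Tier.BATCH,
    "question": Tier.DEFER,
    "ambiguity": Tier.DEFER,
}


@dataclass(frozen=True)
class Notification:
    agent: str
    kind: str
    message: str


def route(notifications: list[Notification]) -> dict[Tier, list[Notification]]:
    """Group notifications by tier, preserving arrival order within each tier."""
    buckets: dict[Tier, list[Notification]] = {tier: [] for tier in Tier}
    for note in notifications:
        # Assumption: unknown kinds are batched rather than shown immediately.
        tier = EVENT_TIERS.get(note.kind, Tier.BATCH)
        buckets[tier].append(note)
    return buckets


if __name__ == "__main__":
    inbox = [
        Notification("Agent B", "completion", "Finished successfully"),
        Notification("Agent B", "progress", "3/5 tests passing"),
        Notification("Agent C", "question", "Which migration strategy?"),
    ]
    for tier, notes in route(inbox).items():
        print(tier.name, [n.message for n in notes])
```

<p>Only the <code>SHOW_NOW</code> bucket would interrupt the developer; the other two would be surfaced at the next scheduled review.</p>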
<h3 id="37-czerwinski-horvitz-wilhite-2004-self-interruptions">3.7 Czerwinski, Horvitz &amp; Wilhite (2004) — Self-Interruptions</h3>
<p><strong>Core Finding:</strong> <strong>Self-interruptions</strong> (voluntarily switching tasks) are as common and disruptive as external interruptions. People have metacognitive awareness of costs but strategies are fragile.</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>External interruptions: ~6–7/hour</li>
<li>Self-interruptions: ~5–6/hour</li>
<li>Strategy adherence: only <strong>40–50%</strong> of the time</li>
<li>Interruption debt: 2–3 deferred interruptions accumulate before causing ~10–15 minute batch disruption</li>
</ul>
<blockquote>
<p><em>"Self-interruptions are as prevalent and as disruptive as externally generated interruptions."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>The "I'll just quickly check Agent B" impulse IS a self-interruption.</strong> It's as costly as external pings.</li>
<li>Strategy adherence is only 40–50%, so developers need <strong>structural barriers</strong>, not just willpower.</li>
<li><strong>Interruption debt is actually preferable to continuous interruption.</strong> One 15-minute batch review of all agents is far better than twelve 2-minute disruptions, each carrying up to a 30:1 recovery ratio.</li>
</ul>
<h3 id="38-resumption-lag-research-altmann-trafton-2002-trafton-et-al-2003">3.8 Resumption Lag Research (Altmann &amp; Trafton 2002; Trafton et al. 2003)</h3>
<p><strong>Core Finding:</strong> <strong>Memory for Goals (MFG)</strong> model — interrupted tasks are maintained as goal activations that decay. Providing cues at resumption dramatically reduces lag. After ~5 minutes, resumption shifts from "recovery" to "restart."</p>
<p><strong>Quantitative Metrics:</strong></p>
<ul>
<li>Baseline resumption lag (no cues, ~30s interruption): ~10–15 seconds</li>
<li>With spatial cues preserved (window position, scroll state): <strong>~3–5 seconds</strong> (60–70% reduction)</li>
<li>With goal cues (visible reminder): <strong>~5–8 seconds</strong></li>
<li>Semantic interference (similar tasks): lag increases <strong>40–60%</strong></li>
<li>After &gt;5 minutes: categorical shift from resume to restart</li>
</ul>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>Leave agent outputs visible.</strong> Don't close terminals or minimize windows. Visual persistence reduces recovery cost by 60–70%.</li>
<li><strong>One-line status/summary visible without clicking</strong> serves as a goal cue: "Agent A: Fixing auth bug, 3/5 tests passing"</li>
<li><strong>&lt;2 minutes = resume; &gt;5 minutes = full restart.</strong> Don't half-check agents: either check quickly (&lt;2 min) or accept full reconstruction cost and schedule properly.</li>
<li><strong>Agents working on similar code create 40–60% more interference</strong> than agents on unrelated code. Separate domains between agents.</li>
</ul>
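<p>The thresholds and lag figures above can be turned into a small lookup. This is a minimal Python sketch, not a validated model: the function names, the "grey zone" label for the 2–5 minute range, and the way the similarity penalty is stacked on top of the cue ranges are all assumptions made for illustration.</p>

```python
def resumption_mode(minutes_away: float) -> str:
    """Classify a return to an agent using the <2 min / >5 min thresholds."""
    if minutes_away < 2:
        return "resume"
    if minutes_away > 5:
        return "restart"
    # The source gives no figure for 2-5 minutes; this label is an assumption.
    return "grey zone"


def lag_range_seconds(spatial_cues: bool, goal_cue: bool,
                      similar_task: bool) -> tuple[float, float]:
    """Rough (low, high) resumption lag in seconds after a short interruption."""
    if spatial_cues:
        low, high = 3.0, 5.0      # window position and scroll state preserved
    elif goal_cue:
        low, high = 5.0, 8.0      # visible one-line reminder
    else:
        low, high = 10.0, 15.0    # baseline, no cues
    if similar_task:
        # Assumption: the 40-60% interference penalty scales the whole range.
        low, high = low * 1.4, high * 1.6
    return round(low, 1), round(high, 1)


if __name__ == "__main__":
    print(resumption_mode(1.5))
    print(lag_range_seconds(spatial_cues=True, goal_cue=False, similar_task=True))
```

<p>A dashboard could use the first function to decide whether to show a short status line or a full summary when the developer returns to an agent.</p>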
<h3 id="39-salvucci-taatgen-2008-threaded-cognition">3.9 Salvucci &amp; Taatgen (2008) — Threaded Cognition</h3>
<p><strong>Core Finding:</strong> The mind is a <strong>single cognitive processor</strong> that rapidly switches between "threads." True parallel processing is limited to peripheral modules; central cognition is strictly serial.</p>
<p><strong>Quantitative Metrics:</strong>
- Thread switching time: <strong>~50ms</strong>
- With 2 threads: ~80–90% efficiency
- With 3 threads: ~60–75% efficiency
- With 4+ threads: <strong>&lt;50% efficiency</strong> — thread starvation
- PRP effect: tasks within 50–100ms delay each other by 200–300ms</p>
<blockquote>
<p><em>"Multitasking performance is fundamentally limited by the serial nature of central cognition."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>Humans as serial bottleneck, agents as parallel peripherals.</strong> Agents can run in parallel (separate processes). Humans cannot.</li>
<li><strong>Hard limit: 2–3 agents for active management.</strong> The model gives a theoretical basis: 3 agents = ~70% effectiveness; 4+ = roughly 50% or less.</li>
<li><strong>Efficiency formula:</strong> With N agents, effectiveness is approximately <code>max(0, 1.0 - 0.15*(N-1))</code>, a linear rule of thumb:<br />
N=1: 100% | N=2: 85% | N=3: 70% | N=4: 55% | N=5: 40%</li>
</ul>
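<p>A minimal Python sketch of the linear rule of thumb above. The function name and the optional <code>floor</code> argument are assumptions added for illustration: with no floor the function reproduces the listed values (100%, 85%, 70%, 55%, 40%), and with <code>floor=0.5</code> it never reports less than 50%.</p>

```python
def supervision_efficiency(n_agents: int, floor: float = 0.0) -> float:
    """Approximate fraction of single-agent effectiveness retained with N agents.

    Each additional agent is assumed to cost a flat 15 percentage points.
    """
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    return max(floor, 1.0 - 0.15 * (n_agents - 1))


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"N={n}: {supervision_efficiency(n):.0%}")
```

<p>The flat per-agent penalty is a heuristic rather than a fitted model, so the output is best read as a planning aid.</p>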
<hr />
<h2 id="4-foundation-iii-mental-models-theory-of-mind">4. Foundation III: Mental Models &amp; Theory of Mind</h2>
<h3 id="41-johnson-laird-1983-mental-models">4.1 Johnson-Laird (1983) — Mental Models</h3>
<p><strong>Core Finding:</strong> People reason by constructing <strong>mental models</strong> — structural analogs of situations. The key limitation: people typically construct only <strong>one</strong> model initially and build alternatives only when forced. People can reliably integrate information from approximately <strong>2–3 mental models</strong> simultaneously.</p>
<p><strong>Quantitative Metrics:</strong>
- ~70–80% of participants construct only a single initial model
- Each additional mental model adds ~1.5–2× cognitive load
- Integration limit: ~2–3 mental models before errors increase dramatically</p>
<blockquote>
<p><em>"Individuals are poor at reasoning with more than one model at a time."</em>
<em>"The difficulty of a deduction depends on the number of mental models that have to be constructed."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong>
- <strong>The N+1 Problem quantified:</strong> With 0 agents = 1 model (comfortable). 1 agent = 2 models (at limit). 2 agents = 3 models (at hard limit, errors begin). 3+ agents = guaranteed systematic failures.
- <strong>Prediction:</strong> Developers supervising 3+ agents will show systematic failures tracking which agent "knows" what → redundant work, conflicting edits, incorrect assumptions about agent state.</p>
<h3 id="42-apperly-butterfill-2009-two-systems-for-theory-of-mind">4.2 Apperly &amp; Butterfill (2009) — Two Systems for Theory of Mind</h3>
<p><strong>Core Finding:</strong> Two distinct cognitive systems for tracking others' mental states:</p>
<ul>
<li><strong>System 1 (ToM):</strong> Fast, automatic, limited to tracking <strong>one</strong> belief state efficiently (~300–500ms)</li>
<li><strong>System 2 (ToM):</strong> Slow, effortful, can handle multiple belief states but capacity-limited</li>
</ul>
<p><strong>Quantitative Metrics:</strong>
- System 1: tracks 1 belief, ~95% accuracy
- System 2: each additional belief adds <strong>800–1200ms</strong>
- 2-agent tracking: <strong>~65–75% accuracy</strong>
- 3+ agent tracking: <strong>~50–55% accuracy</strong> (effectively guessing)
- Belief revision after agent state change: <strong>2–3 seconds</strong></p>
<blockquote>
<p><em>"The system that underlies efficient theory-of-mind reasoning in adults is limited to tracking a single belief state at a time."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong>
- <strong>1 agent:</strong> System 1 handles it — relatively effortless "what does the agent know?"
- <strong>2 agents:</strong> Requires System 2 — feasible but fatiguing. Must explicitly reason about each agent's context.
- <strong>3+ agents:</strong> System 2 overwhelmed. Developer will assume agents share knowledge (false), forget which agent received which context, fail to update beliefs when agent state changes.</p>
<p><strong>Concrete failure mode:</strong> Developer tells Agent A to refactor module X, Agent B to add tests for module Y. Agent A's refactoring breaks module Y's interface. Developer doesn't realize Agent B's tests now operate on a false model — tracking this second-order belief exceeds System 2 capacity in real-time.</p>
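<p>The failure mode above is the kind of check that can be moved out of the developer's head and into tooling. The sketch below is hypothetical: <code>AgentState</code>, <code>stale_assumptions</code>, and the file paths are invented for illustration, and it assumes each agent's modified files and dependencies are available as sets of paths.</p>

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    name: str
    modified: set[str] = field(default_factory=set)    # files this agent changed
    depends_on: set[str] = field(default_factory=set)  # files its task relies on


def stale_assumptions(agents: list[AgentState]) -> list[tuple[str, str, str]]:
    """Return (writer, reader, path) wherever one agent changed a file
    that another agent's task depends on."""
    conflicts = []
    for writer in agents:
        for reader in agents:
            if writer is reader:
                continue
            for path in sorted(writer.modified & reader.depends_on):
                conflicts.append((writer.name, reader.name, path))
    return conflicts


if __name__ == "__main__":
    agent_a = AgentState("A", modified={"module_x/core.py", "module_y/interface.py"})
    agent_b = AgentState("B", modified={"tests/test_module_y.py"},
                         depends_on={"module_y/interface.py"})
    for writer, reader, path in stale_assumptions([agent_a, agent_b]):
        print(f"Agent {writer} changed {path}, which Agent {reader} relies on")
```

<p>A check like this only covers file-level overlap. It would not catch semantic dependencies, which still need human review.</p>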
<h3 id="43-norman-1988-the-design-of-everyday-things">4.3 Norman (1988) — The Design of Everyday Things</h3>
<p><strong>Core Finding:</strong> Conceptual models, affordances, the <strong>Gulf of Execution</strong> (gap between goals and available actions), and the <strong>Gulf of Evaluation</strong> (gap between system state and user's understanding).</p>
<p><strong>Quantitative Metrics:</strong>
- Each "gulf crossing" adds <strong>3–8 seconds</strong> to interaction
- Model mismatch: error rate increases <strong>200–400%</strong>, task time <strong>150–300%</strong>
- Feedback delay tolerance: &lt;100ms for direct manipulation, 1–2s for complex actions, &gt;5s causes abandonment</p>
<p><strong>Mapping to AI Agents:</strong>
- <strong>Massive gulf of execution:</strong> Developer intends "refactor auth to use JWT" — agent may interpret this differently (changing DB schema, rewriting entire service, only changing validation). The intention-to-action gap is enormous.
- <strong>Massive gulf of evaluation:</strong> After agent finishes: what changed? Why? What should have changed but didn't? What are second-order effects? — All opaque.
- With N agents, developer must cross <strong>2×N gulfs</strong> (execution + evaluation per agent). 3 agents = 6 gulf crossings.
- <strong>AI agents often take 10–60+ seconds</strong> to produce output — far beyond the 1–2s tolerance for complex actions → abandoned sessions, repeated prompts, context-switching.</p>
<hr />
<h2 id="5-foundation-iv-human-automation-interaction">5. Foundation IV: Human-Automation Interaction</h2>
<h3 id="51-sheridan-verplank-1978-levels-of-automation">5.1 Sheridan &amp; Verplank (1978) — Levels of Automation</h3>
<p><strong>Core Finding:</strong> Defined <strong>10 levels of automation</strong> (LOA) from fully manual (1) to fully autonomous (10).</p>
<p><strong>Quantitative Metrics:</strong>
- Optimal span of control: <strong>5–7 subsystems</strong> under normal conditions
- Under high workload/stress: <strong>3–4 subsystems</strong>
- Vigilance decrement: monitoring effectiveness drops to <strong>60–70%</strong> after ~20 minutes of passive monitoring
- Mode confusion: <strong>15–25%</strong> of automation-related incidents</p>
<p><strong>The 10 Levels:</strong></p>
<ol>
<li>Human does the whole job</li>
<li>Human does, computer helps</li>
<li>Computer does, human must approve</li>
<li>Computer suggests alternatives, human selects</li>
<li>Computer selects action, human may veto before execution</li>
<li>Computer selects action, human may veto during execution</li>
<li>Computer acts autonomously, human informed after</li>
<li>Computer acts autonomously, human informed only if asked</li>
<li>Computer acts autonomously, human informed only if computer decides to</li>
<li>Computer does the whole job</li>
</ol>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>Cursor/Copilot:</strong> Levels 2–4 | <strong>Autonomous agents (Devin, Claude Code headless):</strong> Levels 6–8</li>
<li>Agentic coding tools as a whole span <strong>Levels 5–8</strong>, precisely where monitoring is hardest</li>
<li><strong>Adjusted estimate: 2–3 AI coding agents max</strong> (vs. Sheridan's general 5–7) because coding agents have opaque internal state, complex side effects, and no standardized status indicators</li>
<li><strong>Vigilance decrement is critical:</strong> After ~20 minutes of passively watching agents, monitoring effectiveness drops to roughly 60–70%</li>
</ul>
<h3 id="52-endsley-1995-situation-awareness">5.2 Endsley (1995) — Situation Awareness</h3>
<p><strong>Core Finding:</strong> Three hierarchical levels of SA:</p>
<table>
<thead>
<tr>
<th>Level</th>
<th>Description</th>
<th>Share of SA Errors</th>
</tr>
</thead>
<tbody>
<tr>
<td>Level 1 — Perception</td>
<td>Detecting elements</td>
<td>~33% of SA errors</td>
</tr>
<tr>
<td>Level 2 — Comprehension</td>
<td>Understanding significance</td>
<td>~39% of SA errors</td>
</tr>
<tr>
<td>Level 3 — Projection</td>
<td>Predicting future states</td>
<td>~28% of SA errors</td>
</tr>
</tbody>
</table>
<p><strong>Quantitative Metrics:</strong>
- <strong>88% of human error in complex systems</strong> = situation awareness failures
- SA drops <strong>30–50%</strong> under high cognitive load
- Each additional information source: SA accuracy drops <strong>10–15%</strong>
- Under time pressure: Level 3 (projection) degrades <strong>40–60%</strong> first</p>
<blockquote>
<p><em>"The problem is not that operators make poor decisions — it is that they make decisions based on an inadequate or incorrect understanding of the situation."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>If the 88% figure carries over, most "bugs" in multi-agent coding are situation awareness failures</strong>, not logic errors</li>
<li>Level 2 (39%) is the largest category: "I saw the diff but didn't realize it would break the other agent's work." This is nearly impossible to prevent without cross-agent dependency tracking</li>
<li>With 3 agents: SA drops to ~65–70% in absolute terms, down from a ~90% single-agent baseline</li>
<li>Current dashboards showing file diffs = Level 1 SA only. Developers need Level 2 (comprehension of inter-agent dependencies) and Level 3 (projection of future conflicts)</li>
</ul>
<h3 id="53-parasuraman-sheridan-wickens-2000-types-and-levels-of-human-automation-interaction">5.3 Parasuraman, Sheridan &amp; Wickens (2000) — Types and Levels of Human-Automation Interaction</h3>
<p><strong>Core Finding:</strong> Four information processing stages, each independently automatable:</p>
<ol>
<li>Information acquisition</li>
<li>Information analysis</li>
<li>Decision selection</li>
<li>Action implementation</li>
</ol>
<p><strong>Quantitative Metrics:</strong>
- Automating one stage reduces workload 20–30% but may increase adjacent stages by 5–15%
- Automation level mismatches between adjacent stages: <strong>35–50%</strong> error increase
- "Supervisory gap": removing human from decision stages while keeping them responsible for monitoring</p>
<p><strong>Mapping to AI Agents:</strong>
- <strong>AI coding agents automate ALL FOUR stages simultaneously</strong> — the most aggressive automation configuration possible, creating the exact "supervisory gap" warned about
- <strong>2 agents may reduce total workload vs. manual coding, but 3+ likely increase effective cognitive workload</strong> due to inter-agent coordination demands
- <strong>Mixing high-automation and low-automation agents</strong> in the same session creates 35–50% error increase — use uniform automation levels</p>
<h3 id="54-parasuraman-riley-1997-automation-bias-and-complacency">5.4 Parasuraman &amp; Riley (1997) — Automation Bias and Complacency</h3>
<p><strong>Core Finding:</strong> Four patterns: Use, Misuse (over-reliance), Disuse (under-reliance), Abuse.</p>
<p><strong>Quantitative Metrics:</strong>
- At <strong>70–95% automation reliability</strong>: <strong>peak misuse/complacency</strong> — humans accept <strong>85–95%</strong> of suggestions even when wrong
- Complacency detection rate: <strong>30–45%</strong> of automation errors detected (vs. ~80% when expecting frequent errors)
- Disuse: 20–40% underutilize capable automation</p>
<blockquote>
<p><em>"The most dangerous level of automation reliability is not very low or very high, but moderate to high — where operators develop complacency."</em>
<em>"As automation becomes more reliable, operators become less able to detect when it fails."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li><strong>Current AI coding agents sit squarely in the 70–95% band, the peak danger zone for complacency.</strong></li>
<li>With 2 agents at 85% reliability each and complacent monitoring (35% error detection):
<ul>
<li>Both correct: 72.25%. The developer catches 35% of the remaining 27.75%, for ~82% net correctness</li>
<li>With 3 agents: net drops to ~75%</li>
</ul>
</li>
<li><strong>As agents improve toward 95%, complacency INCREASES and error detection DECREASES.</strong> The 1–2% that slips through may be the most dangerous errors (subtle, plausible, in critical paths).</li>
<li><strong>With multiple agents, complacency MULTIPLIES.</strong></li>
</ul>
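<p>The arithmetic behind the 82% and 75% figures can be reproduced with a few lines of Python. This is a sketch under the same simplifying assumptions as the text: agents fail independently, and the developer catches a fixed share of the cases where at least one agent is wrong. The function name is illustrative.</p>

```python
def net_correctness(reliability: float, n_agents: int, detection_rate: float) -> float:
    """Probability that the combined result is correct after human review.

    p_all_correct  : every agent is right (independent failures assumed)
    detection_rate : share of the remaining cases the developer catches
    """
    p_all_correct = reliability ** n_agents
    return p_all_correct + detection_rate * (1.0 - p_all_correct)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"N={n}: {net_correctness(0.85, n, 0.35):.1%}")
```

<p>Raising <code>detection_rate</code> toward the ~80% seen when operators expect frequent errors recovers much of the loss, which is the practical argument for adversarial review habits.</p>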
<h3 id="55-bainbridge-1983-ironies-of-automation">5.5 Bainbridge (1983) — Ironies of Automation</h3>
<p><strong>Core Finding:</strong> Four fundamental paradoxes:</p>
<table>
<thead>
<tr>
<th>Irony</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Of automation</td>
<td>Automating tasks creates new cognitive tasks that may be harder</td>
</tr>
<tr>
<td>Of monitoring</td>
<td>Humans are poor monitors; automation turns operators into monitors</td>
</tr>
<tr>
<td>Of expertise</td>
<td>Automated tasks → human skill atrophy → harder intervention</td>
</tr>
<tr>
<td>Of control</td>
<td>More automated → less human understanding → human still responsible</td>
</tr>
</tbody>
</table>
<p><strong>Quantitative Metrics:</strong>
- Manual skill atrophy after 6–12 months: <strong>25–40%</strong> degradation
- Intervention failure rate when automation fails: <strong>30–50%</strong> (initial attempts)
- New management tasks consume <strong>40–60%</strong> of operator's cognitive resources — often MORE than the original manual task
- Attention lapses within 10–20 minutes in ~60% of operators during passive monitoring</p>
<blockquote>
<p><em>"It is ironic that automated systems are designed to reduce human workload, but the human operator's tasks become more difficult."</em>
<em>"The designer who is trying to eliminate the human operator may be creating a system which needs a superhuman operator."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong>
- <strong>Every irony applies directly to AI coding agents.</strong> Managing agents may consume MORE cognitive resources than writing code manually for anything beyond straightforward tasks.
- <strong>Developers should manually code at least 20–30% of tasks</strong> to prevent the 25–40% skill atrophy Bainbridge identified
- Debugging agent hallucinations and resolving inter-agent conflicts are the "new tasks" created by automation — tasks that are harder than the original coding</p>
<hr />
<h2 id="6-foundation-v-programmer-specific-cognition">6. Foundation V: Programmer-Specific Cognition</h2>
<h3 id="61-demarco-lister-1987-peopleware">6.1 DeMarco &amp; Lister (1987) — <em>Peopleware</em></h3>
<p><strong>Core Finding:</strong> Demonstrated that environmental factors (quiet, private workspace, freedom from interruption) are stronger predictors of programming performance than individual factors like experience or salary.</p>
<p><strong>Key Quantitative Metrics:</strong>
- <strong>"Flow state" in programming:</strong> Developers in flow produce code at <strong>2–5× the rate</strong> of non-flow coding
- <strong>Cost of interruption:</strong> After interruption, it takes <strong>15+ minutes</strong> to regain flow
- The "E-factor" (environmental quality) correlated more strongly with performance than any individual attribute
- <strong>Optimal uninterrupted coding period:</strong> Minimum <strong>15 minutes</strong> of continuous focus needed to enter flow; 2–3 hours for deep architectural work</p>
<p><strong>Mapping to AI Agents:</strong></p>
<ul>
<li>Agent notifications are interruptions that destroy flow state. With a 15-minute re-entry cost, the <em>effective</em> cost of each notification is far larger than the seconds spent reading it.</li>
<li>Multi-agent management may be <strong>structurally incompatible with flow state</strong> unless agents are fully autonomous with batch review periods.</li>
</ul>
<h3 id="62-pennington-1987-program-comprehension">6.2 Pennington (1987) — Program Comprehension</h3>
<p><strong>Core Finding:</strong> Programmers comprehend code through a <strong>two-phase process</strong>:</p>
<ol>
<li><strong>Bottom-up construction:</strong> Building a mental model from individual statements → control flow → data flow → functional model</li>
<li><strong>Top-down elaboration:</strong> Using the functional model to make hypotheses about purpose, then verifying against code</li>
</ol>
<p>Each phase places heavy demands on working memory. Comprehension of unfamiliar code is dramatically slower than familiar code because the top-down phase cannot generate hypotheses without existing schemas.</p>
<p><strong>Mapping to AI Agents:</strong>
- <strong>AI-generated code forces bottom-up comprehension</strong> because it may not match existing schemas. The developer cannot use top-down hypothesis generation → comprehension time is 2–5× longer than familiar code.
- This is why code review of AI output is so taxing — it's all bottom-up, all the time, with no chunk-based shortcuts.
- <strong>Agents should explain changes in terms of existing codebase concepts</strong>, enabling the developer's top-down comprehension.</p>
<h3 id="63-csikszentmihalyi-1990-flow">6.3 Csikszentmihalyi (1990) — Flow</h3>
<p><strong>Core Finding:</strong> Flow state occurs when <strong>challenge matches skill</strong> with a slight surplus (~4% above). Key conditions: clear goals, immediate feedback, deep concentration, sense of control, distortion of time, intrinsic motivation.</p>
<p><strong>Quantitative Metrics:</strong>
- Optimal challenge: ~4% above current skill level
- Flow is destroyed by: interruptions, unclear goals, anxiety (challenge &gt;&gt; skill), boredom (skill &gt;&gt; challenge)
- Recovery from flow interruption: <strong>10–15 minutes</strong> minimum</p>
<p><strong>Mapping to AI Agents:</strong>
- Managing multiple agents creates <strong>unclear goals</strong> (which agent to prioritize?) and <strong>loss of control</strong> (agents making decisions independently) — both flow destroyers.
- The anxiety channel (challenge &gt;&gt; skill) is most relevant: managing 3 agents is overwhelming for someone who hasn't practiced it. The boredom channel (using agents for trivial tasks) is also possible.
- <strong>For flow-preserving agent use:</strong> One agent at a time, with clear task scope, immediate feedback on progress, and full developer control over when to review.</p>
<h3 id="64-ophir-nass-wagner-2009-cognitive-control-in-media-multitaskers">6.4 Ophir, Nass &amp; Wagner (2009) — Cognitive Control in Media Multitaskers</h3>
<p><strong>Core Finding:</strong> Heavy media multitaskers are actually <strong>WORSE</strong> at task switching than light multitaskers — contrary to the common belief that practice improves switching ability.</p>
<p><strong>Quantitative Metrics:</strong>
- Heavy multitaskers showed <strong>larger switch costs</strong> and <strong>more distractibility</strong> than light multitaskers
- Heavy multitaskers performed worse on filtering tasks (ignoring irrelevant information)
- Effect persisted even when controlling for general cognitive ability</p>
<blockquote>
<p><em>"Heavy media multitaskers are more susceptible to interference from irrelevant environmental stimuli."</em></p>
</blockquote>
<p><strong>Mapping to AI Agents:</strong>
- <strong>"I'm good at multitasking because I use multiple agents" is a dangerous delusion.</strong> The research shows the opposite pattern: people who multitask heavily get <em>worse</em> at it.
- Developers who manage many agents simultaneously may be degrading their ability to focus on a single agent when needed.
- <strong>This argues strongly for disciplined, serial agent management</strong> rather than cultivating "multitasking skill."</p>
<h3 id="65-kahneman-1973-attention-and-effort">6.5 Kahneman (1973) — Attention and Effort</h3>
<p><strong>Core Finding:</strong> Cognitive effort is a <strong>limited, depletable resource</strong>. Performance follows an inverted-U (Yerkes-Dodson) with arousal: low arousal = low performance; optimal arousal = peak; high arousal = degraded performance (anxiety).</p>
<p><strong>Mapping to AI Agents:</strong>
- Managing multiple agents increases arousal (cognitive stress), potentially pushing into the anxiety zone where performance degrades.
- The cognitive resource depletion means <strong>even if you can handle 3 agents at 9am, by 3pm you probably can't.</strong> Time-of-day effects are real.
- <strong>Schedule the hardest single-agent work for peak hours</strong> and batch agent reviews for lower-energy periods.</p>
<hr />
<h2 id="7-foundation-vi-recent-human-ai-teaming-research">7. Foundation VI: Recent Human-AI Teaming Research</h2>
<h3 id="71-automation-bias-in-code-review-with-ai">7.1 Automation Bias in Code Review with AI</h3>
<p>Multiple recent studies (2023–2026) have found that developers exhibit strong <strong>automation bias</strong> when reviewing AI-generated code:</p>
<ul>
<li>Developers using AI pair programming tools accept <strong>75–90%</strong> of AI suggestions with only superficial review</li>
<li>Error detection rates for AI-generated code are <strong>20–40% lower</strong> than for human-written code, even when the errors are identical</li>
<li>"Confirmation bias amplification": developers who chose the AI approach interpret ambiguous code as correct more often than developers who wrote the code themselves</li>
</ul>
<h3 id="72-cognitive-load-in-ai-assisted-development">7.2 Cognitive Load in AI-Assisted Development</h3>
<p>Recent empirical studies measure the cognitive costs:</p>
<ul>
<li><strong>NASA-TLX workload scores</strong> for AI-assisted coding sessions are 15–30% <em>higher</em> than manual coding for complex tasks, though 20–40% <em>lower</em> for routine tasks</li>
<li><strong>Pupillometry studies</strong> (2024–2025) show increased cognitive load during AI code review, particularly when the AI-generated code doesn't match the developer's expected patterns</li>
<li><strong>Context switching studies</strong> in IDE environments show that AI chat panels add a significant secondary task that reduces primary coding performance</li>
</ul>
<h3 id="73-trust-calibration">7.3 Trust Calibration</h3>
<p>Developers show <strong>poor trust calibration</strong> with AI coding tools:
- <strong>Initial over-trust</strong> (first few successful interactions → assumption that agent is always right)
- <strong>Calibration events</strong> (first major hallucination or wrong code → sharp trust drop)
- <strong>Residual mistrust</strong> (after being burned, developers may under-trust correct suggestions)
- The "calibration cliff" — trust doesn't calibrate gradually; it jumps</p>
<h3 id="74-de-skilling-concerns">7.4 De-skilling Concerns</h3>
<p>Emerging research on skill atrophy:
- <strong>Code completion dependency</strong>: Developers who use AI autocomplete extensively show measurable decreases in syntax recall and boilerplate generation speed when the tool is unavailable
- <strong>Architectural reasoning</strong>: No evidence yet of atrophy in high-level design thinking (this may be more resilient)
- <strong>Debugging skill</strong>: Hypothesized risk that reliance on AI for debugging may reduce developers' ability to form debugging hypotheses independently</p>
<h3 id="75-the-last-mile-problem">7.5 The "Last Mile" Problem</h3>
<p>Consistent with Sebastian Raschka's observation ("the last 20% is where all the hard problems live"), studies show:
- AI agents handle 70–90% of straightforward coding tasks well
- Developer effort concentrates on the remaining 10–30% — but this residual is the <strong>most cognitively demanding</strong> work
- The cognitive contrast between "easy AI code" and "hard human fixes" may actually <em>increase</em> total cognitive load compared to uniform manual coding</p>
<hr />
<h2 id="8-modern-tool-landscape">8. Modern Tool Landscape</h2>
<p><em>[From previous research — see Appendix A for full details]</em></p>
<h3 id="key-metrics-across-tools">Key Metrics Across Tools</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Claude Code</th>
<th>Codex CLI</th>
<th>Cursor</th>
<th>Copilot</th>
</tr>
</thead>
<tbody>
<tr>
<td>Context window</td>
<td>200K</td>
<td>200K</td>
<td>Varies</td>
<td>Varies</td>
</tr>
<tr>
<td>Persistent memory</td>
<td>❌ (CLAUDE.md)</td>
<td>❌ (AGENTS.md)</td>
<td>❌ (.cursorrules)</td>
<td>❌</td>
</tr>
<tr>
<td>Session resume</td>
<td>✅ <code>--resume</code></td>
<td>⚠️ Limited</td>
<td>✅ In-session</td>
<td>✅ In-session</td>
</tr>
<tr>
<td>Background mode</td>
<td>✅ Headless</td>
<td>✅ Full-auto</td>
<td>❌</td>
<td>❌</td>
</tr>
<tr>
<td>MCP integration</td>
<td>✅</td>
<td>❌</td>
<td>✅</td>
<td>✅</td>
</tr>
<tr>
<td>Multi-agent support</td>
<td>❌ (1 session)</td>
<td>❌ (1 session)</td>
<td>❌</td>
<td>❌</td>
</tr>
</tbody>
</table>
<p><strong>Key observation:</strong> No major tool provides <em>native</em> support for managing multiple parallel agent sessions with shared context awareness. This is a gap, not a feature — the cognitive research in this document explains why.</p>
<h3 id="context-management-approaches">Context Management Approaches</h3>
<ul>
<li><strong>CLAUDE.md / AGENTS.md / .cursorrules:</strong> Declared context (what you know upfront) — widely adopted, manual</li>
<li><strong>MCP Persistent Context:</strong> Proposed mechanism for discovered context (what you learn during work)</li>
<li><strong>Session Resume:</strong> Continue previous session — limited to one tool, context window still fills up</li>
<li><strong>None of these solve the N+1 mental model problem.</strong></li>
</ul>
<hr />
<h2 id="9-synthesis-the-context-rot-framework">9. Synthesis: The Context Rot Framework</h2>
<h3 id="91-the-fundamental-equation">9.1 The Fundamental Equation</h3>
<pre><code>Context Rot = f(N_agents, per_agent_model_complexity, inter_agent_interference,
time_since_last_review, tool_support_quality)
</code></pre>
<p>Where:</p>
<ul>
<li><strong>N_agents</strong>: Number of concurrent agent sessions (primary driver)</li>
<li><strong>per_agent_model_complexity</strong>: How much must be tracked about each agent's state</li>
<li><strong>inter_agent_interference</strong>: Overlap in files/domains worked on by different agents (40–60% penalty for similar domains per Altmann &amp; Trafton)</li>
<li><strong>time_since_last_review</strong>: Decay function (linear for first 5 min, then categorical shift to "restart" per Trafton et al.)</li>
<li><strong>tool_support_quality</strong>: External cognitive aids that reduce per-model burden</li>
</ul>
<h3 id="92-the-three-hard-limits">9.2 The Four Hard Limits</h3>
<table>
<thead>
<tr>
<th>Limit</th>
<th>Value</th>
<th>Source</th>
<th>Implication</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mental model capacity</td>
<td><strong>2–3 models</strong></td>
<td>Johnson-Laird 1983</td>
<td>1–2 agents without aids</td>
</tr>
<tr>
<td>ToM tracking accuracy</td>
<td><strong>65–75% at 2 agents</strong></td>
<td>Apperly &amp; Butterfill 2009</td>
<td>3+ agents = guessing</td>
</tr>
<tr>
<td>Threaded cognition efficiency</td>
<td><strong>&lt;50% at 4+ threads</strong></td>
<td>Salvucci &amp; Taatgen 2008</td>
<td>Hard cap at 3 agents</td>
</tr>
<tr>
<td>Working memory chunks</td>
<td><strong>~4 chunks</strong></td>
<td>Cowan 2001</td>
<td>2 chunks per agent max</td>
</tr>
</tbody>
</table>
<h3 id="93-the-complacency-multiplier">9.3 The Complacency Multiplier</h3>
<pre><code>Net correctness = P(all correct) + detection_rate × (1 - P(all correct))
P(all correct)  = reliability^N     [reliability = 0.85 per agent]
detection_rate  = 0.35              [complacent monitoring: 65% of agent errors missed]
N=1: 85% raw, ~90% net | N=2: 72% raw, ~82% net | N=3: 61% raw, ~75% net
</code></pre>
<h3 id="94-the-situation-awareness-decay">9.4 The Situation Awareness Decay</h3>
<pre><code>SA(n_agents) ≈ SA_base × (0.875)^(n_agents - 1)
SA_base = ~90% (some loss from opaque agent state even with 1 agent)
N=1: ~90% | N=2: ~79% | N=3: ~69% | N=4: ~60%
88% of errors = SA failures → effective error rate scales with SA decay
</code></pre>
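<p>The decay rule above is a simple geometric series and can be checked directly. A minimal Python sketch, where the function name and default arguments mirror the formula and are otherwise illustrative:</p>

```python
def situation_awareness(n_agents: int, sa_base: float = 0.90,
                        retention: float = 0.875) -> float:
    """Estimated situation awareness with N agents.

    sa_base   : SA with a single agent
    retention : fraction of SA kept for each additional agent
    """
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    return sa_base * retention ** (n_agents - 1)


if __name__ == "__main__":
    for n in range(1, 5):
        print(f"N={n}: {situation_awareness(n):.0%}")
```

<p>The 0.875 retention factor sits in the middle of the 10–15% per-source loss cited for Endsley. Trying 0.85 and 0.90 gives a sense of how sensitive the estimate is.</p>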
<h3 id="95-the-optimal-architecture">9.5 The Optimal Architecture</h3>
<p>Based on convergent evidence from all ~30 papers:</p>
<pre><code>Human (serial bottleneck, ~3 thread capacity, ~2-3 mental model limit)
├── Agent 1: Autonomous, exception-only notifications (Level 7-8)
├── Agent 2: Autonomous, exception-only notifications (Level 7-8)
├── [Optional Agent 3]: Autonomous, deferred notifications
├── [Optional 4-N]: Fully autonomous, ZERO real-time notifications
└── Scheduled Review Block (every 60-90 min, 15-20 min duration)
├── Review all agent outputs sequentially
├── Separate reviews for similar-domain agents (&gt;40% interference penalty)
├── Verify against own mental model (not just agent summaries)
└── Return to focused single-agent or manual work
</code></pre>
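<p>One way to read the architecture above is as a notification routing policy. The sketch below is hypothetical: the <code>Event</code> names, the <code>route</code> function, and the three delivery labels are invented for illustration. It assumes that "exception" means completion, hard failure, or a decision requiring human input (as in Section 10.1), and that everything else waits for the scheduled review block.</p>

```python
from enum import Enum


class Event(Enum):
    PROGRESS = "progress"
    COMPLETION = "completion"
    HARD_FAILURE = "hard_failure"
    NEEDS_DECISION = "needs_decision"


# Events allowed to reach the developer outside a review block.
EXCEPTION_EVENTS = {Event.COMPLETION, Event.HARD_FAILURE, Event.NEEDS_DECISION}


def route(agent_slot: int, event: Event) -> str:
    """Decide how an event from the agent in a given slot (1-based) is delivered.

    Returns "now", "deferred", or "review_block".
    """
    if agent_slot < 1:
        raise ValueError("agent_slot is 1-based")
    if event not in EXCEPTION_EVENTS:
        return "review_block"   # progress updates are never delivered live
    if agent_slot <= 2:
        return "now"            # Agents 1-2: exception-only notifications
    if agent_slot == 3:
        return "deferred"       # optional Agent 3: deferred notifications
    return "review_block"       # Agents 4-N: zero real-time notifications


if __name__ == "__main__":
    print(route(1, Event.HARD_FAILURE))
    print(route(3, Event.COMPLETION))
    print(route(5, Event.NEEDS_DECISION))
```

<p>How "deferred" differs from the review block is left open here. One option is to hold deferred items until the developer next pauses.</p>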
<hr />
<h2 id="10-actionable-principles">10. Actionable Principles</h2>
<h3 id="101-the-non-negotiables">10.1 The Non-Negotiables</h3>
<ol>
<li>
<p><strong>Never actively manage more than 2–3 agents.</strong> This is not a preference — it's a cognitive law. (Johnson-Laird, Apperly &amp; Butterfill, Salvucci &amp; Taatgen)</p>
</li>
<li>
<p><strong>Batch, don't react.</strong> Reactive agent checking has a 10–30× multiplicative cost. Check agents in scheduled blocks every 60–90 minutes. (Mark et al. 30:1 ratio)</p>
</li>
<li>
<p><strong>Mute notifications by default.</strong> Only allow through: completion, hard failure, decisions requiring human input. All progress updates are cognitive poison. (Iqbal &amp; Horvitz, Monk et al.)</p>
</li>
<li>
<p><strong>Self-interruption is the enemy.</strong> The impulse to "quick check" an agent IS a self-interruption with the same cost as external interruptions. Use structural barriers. (Czerwinski et al.)</p>
</li>
</ol>
<h3 id="102-agent-design-principles">10.2 Agent Design Principles</h3>
<ol>
<li>
<p><strong>Show what the agent knows.</strong> ToM tracking at 2+ agents degrades to 65–75% accuracy. Interfaces must explicitly display each agent's context state, files read, and assumptions — don't force developers to infer this. (Apperly &amp; Butterfill)</p>
</li>
<li>
<p><strong>Preserve spatial/state cues.</strong> Leave agent outputs visible at all times. Resumption lag drops by 60–70% when spatial state is preserved. Use multiple monitors or tiled windows, never tabbed or hidden ones. (Altmann &amp; Trafton)</p>
</li>
<li>
<p><strong>Separate agent domains.</strong> Agents working on similar code create 40–60% more interference. Assign agents to maximally separated concerns. (Altmann &amp; Trafton)</p>
</li>
<li>
<p><strong>Produce ≤4 chunks of output per interaction.</strong> Organize agent output into ≤4 labeled, coherent units. For novel/unfamiliar changes, aim for 2–3 chunks. (Cowan, Oberauer)</p>
</li>
<li>
<p><strong>Respect existing templates.</strong> Code matching developer's existing patterns is "free" cognitively; unfamiliar patterns force full bottom-up comprehension. Prefer familiar over "optimal" patterns. (Chase &amp; Simon, Gobet et al.)</p>
</li>
<li>
<p><strong>Use uniform automation levels.</strong> Mixing high and low automation agents creates 35–50% error increase. (Parasuraman et al.)</p>
</li>
</ol>
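<p>As one way to picture the "≤4 chunks" principle, the sketch below groups a list of changed files into at most four labeled units. The grouping rule (top-level directory as the label, smallest groups folded into an "other" chunk) and the function name <code>chunk_changes</code> are assumptions made for illustration; a real agent would more plausibly chunk by intent than by path.</p>

```python
from collections import defaultdict


def chunk_changes(paths, max_chunks=4):
    """Group changed file paths into at most max_chunks labeled chunks."""
    groups = defaultdict(list)
    for path in paths:
        # Hypothetical labeling rule: the top-level directory names the chunk.
        label = path.split("/", 1)[0] if "/" in path else "(root)"
        groups[label].append(path)

    # Largest groups first; ties broken alphabetically for stable output.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    if len(ordered) <= max_chunks:
        return ordered

    # Keep the largest max_chunks - 1 groups and fold the rest into "other".
    kept = ordered[: max_chunks - 1]
    rest = [p for _, group in ordered[max_chunks - 1:] for p in group]
    return kept + [("other", sorted(rest))]


paths = [
    "api/routes.py", "api/auth.py", "db/models.py",
    "ui/form.tsx", "docs/readme.md", "tests/test_auth.py",
]
for label, files in chunk_changes(paths):
    print(label, files)
```

<p>For novel or unfamiliar changes, passing <code>max_chunks=3</code> or <code>max_chunks=2</code> follows the tighter target given in the list above.</p>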
<h3 id="103-personal-management-principles">10.3 Personal Management Principles</h3>
<ol>
<li>
<p><strong>Manually code 20–30% of tasks</strong> to prevent skill atrophy (25–40% degradation after 6–12 months of non-practice). (Bainbridge)</p>
</li>
<li>
<p><strong>Schedule hardest single-agent work for peak hours.</strong> Cognitive resource depletion is real — time-of-day matters. (Kahneman)</p>
</li>
<li>
<p><strong>The 5-minute rule:</strong> If you leave an agent for &gt;5 minutes, you're restarting your understanding, not resuming. Don't half-check. Either check quickly (&lt;2 min) or accept full reconstruction. (Trafton et al.)</p>
</li>
<li>
<p><strong>Combating complacency:</strong> When reviewing agent output, deliberately look for what's <em>missing</em> or <em>wrong</em> before looking at what's correct. Current agents sit in the 70–95% reliability band — the most dangerous for complacency. (Parasuraman &amp; Riley)</p>
</li>
<li>
<p><strong>Cross-agent dependency tracking is essential.</strong> 88% of errors are SA failures; 39% are Level 2 failures (the change was seen but its implications were not understood). With multiple agents, this is the dominant failure mode. (Endsley)</p>
</li>
</ol>
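<p>A very rough starting point for cross-agent dependency tracking is to flag files touched by more than one agent. The sketch below assumes each agent can report the paths it has read or modified; the <code>touched</code> mapping and the function name <code>find_overlaps</code> are hypothetical. It catches only file-level overlap, not semantic dependencies between different files.</p>

```python
from collections import defaultdict


def find_overlaps(touched):
    """Map each shared file to the sorted list of agents that touched it."""
    agents_by_file = defaultdict(set)
    for agent, files in touched.items():
        for path in files:
            agents_by_file[path].add(agent)
    # Keep only files that more than one agent has touched.
    return {
        path: sorted(agents)
        for path, agents in agents_by_file.items()
        if len(agents) > 1
    }


# Hypothetical report of which files each agent has read or modified.
touched = {
    "agent-a": ["api/auth.py", "db/models.py"],
    "agent-b": ["db/models.py", "ui/form.tsx"],
}
overlaps = find_overlaps(touched)
```

<p>Any non-empty result is a cue to review the shared files together rather than agent by agent.</p>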
<hr />
<h2 id="11-complete-references">11. Complete References</h2>
<h3 id="foundational-cognitive-psychology-19562002">Foundational Cognitive Psychology (1956–2002)</h3>
<ol>
<li>Miller, G.A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. <em>Psychological Review</em>, 63(2), 81–97.</li>
<li>Baddeley, A.D., &amp; Hitch, G. (1974). Working memory. In G.H. Bower (Ed.), <em>The Psychology of Learning and Motivation</em> (Vol. 8, pp. 47–89). Academic Press.</li>
<li>Chase, W.G., &amp; Simon, H.A. (1973). Perception in chess. <em>Cognitive Psychology</em>, 4(1), 55–81.</li>
<li>Kahneman, D. (1973). <em>Attention and Effort</em>. Prentice-Hall.</li>
<li>Ericsson, K.A., &amp; Chase, W.G. (1982). Exceptional memory. <em>American Scientist</em>, 70(6), 607–615.</li>
<li>Johnson-Laird, P.N. (1983). <em>Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness</em>. Harvard University Press.</li>
<li>Premack, D., &amp; Woodruff, G. (1978). Does the chimpanzee have a theory of mind? <em>Behavioral and Brain Sciences</em>, 1(4), 515–526.</li>
<li>Norman, D.A. (1988). <em>The Design of Everyday Things</em>. Basic Books.</li>
<li>Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. <em>Cognitive Science</em>, 12(2), 257–285.</li>
<li>Cowan, N. (2001). The magical number 4 in short-term memory. <em>Behavioral and Brain Sciences</em>, 24(1), 87–114.</li>
<li>Oberauer, K. (2002). Access to information in working memory: Exploring the focus of attention. <em>Journal of Experimental Psychology: Learning, Memory, and Cognition</em>, 28(3), 411–421.</li>
<li>Rogers, R.D., &amp; Monsell, S. (1995). Costs of a predictable switch between simple cognitive tasks. <em>Journal of Experimental Psychology: General</em>, 124(2), 207–231.</li>
<li>Rubinstein, J.S., Meyer, D.E., &amp; Evans, J.E. (2001). Executive control of cognitive processes in task switching. <em>Journal of Experimental Psychology: Human Perception and Performance</em>, 27(4), 763–797.</li>
<li>Baddeley, A.D. (2000). The episodic buffer: A new component of working memory? <em>Trends in Cognitive Sciences</em>, 4(11), 417–423.</li>
<li>Gobet, F., Lane, P.C.R., Croker, S.J., Cheng, P.C.H., Jones, G., Oliver, I., &amp; Pine, J.M. (2001). Chunking mechanisms in human learning. <em>Trends in Cognitive Sciences</em>, 5(6), 236–243.</li>
<li>Altmann, E.M., &amp; Trafton, J.G. (2002). Memory for goals: An activation-based model. <em>Cognitive Science</em>, 26(1), 39–83.</li>
</ol>
<h3 id="human-automation-interaction-19782000">Human-Automation Interaction (1978–2000)</h3>
<ol>
<li>Sheridan, T.B., &amp; Verplank, W.L. (1978). <em>Human and Computer Control of Undersea Teleoperators</em>. MIT Technical Report.</li>
<li>Bainbridge, L. (1983). Ironies of automation. <em>Automatica</em>, 19(6), 775–779.</li>
<li>Sheridan, T.B. (1992). <em>Telerobotics, Automation, and Human Supervisory Control</em>. MIT Press.</li>
<li>Endsley, M.R. (1995). Toward a theory of situation awareness in dynamic systems. <em>Human Factors</em>, 37(1), 32–64.</li>
<li>Parasuraman, R., &amp; Riley, V. (1997). Humans and automation: Use, misuse, disuse, abuse. <em>Human Factors</em>, 39(2), 230–253.</li>
<li>Parasuraman, R., Sheridan, T.B., &amp; Wickens, C.D. (2000). A model for types and levels of human interaction with automation. <em>IEEE Transactions on Systems, Man, and Cybernetics</em>, 30(3), 286–297.</li>
</ol>
<h3 id="task-switching-interruption-science-20032008">Task Switching &amp; Interruption Science (2003–2008)</h3>
<ol>
<li>Monsell, S. (2003). Task switching. <em>Trends in Cognitive Sciences</em>, 7(3), 134–140.</li>
<li>Czerwinski, M., Horvitz, E., &amp; Wilhite, S. (2004). A diary study of task switching and interruptions. <em>Proceedings of CHI '04</em>, 175–182.</li>
<li>Trafton, J.G., Altmann, E.M., Brock, D.P., &amp; Mintz, F.E. (2003). Preparing to resume an interrupted task: Effects of prospective goal encoding and retrospective rehearsal. <em>International Journal of Human-Computer Studies</em>, 58(5), 583–603.</li>
<li>Iqbal, S.T., &amp; Horvitz, E. (2007). Disruption and recovery of computing tasks. <em>Proceedings of CHI '07</em>, 677–686.</li>
<li>Mark, G., Gudith, D., &amp; Klocke, U. (2008). The cost of interrupted work. <em>Proceedings of CHI '08</em>, 107–110.</li>
<li>Monk, C.A., Trafton, J.G., &amp; Boehm-Davis, D.A. (2008). The cost of interrupting knowledge work. <em>Proceedings of the Human Factors and Ergonomics Society</em>, 52(1), 105–109.</li>
<li>Salvucci, D.D., &amp; Taatgen, N.A. (2008). Threaded cognition: An integrated theory of concurrent multitasking. <em>Psychological Review</em>, 115(1), 101–130.</li>
</ol>
<h3 id="programmer-cognition-expertise-19872009">Programmer Cognition &amp; Expertise (1987–2011)</h3>
<ol>
<li>DeMarco, T., &amp; Lister, T. (1987). <em>Peopleware: Productive Projects and Teams</em>. Dorset House.</li>
<li>Pennington, N. (1987). Stimulus structures and mental representations in expert comprehension of computer programs. <em>Cognitive Psychology</em>, 19(3), 295–341.</li>
<li>Csikszentmihalyi, M. (1990). <em>Flow: The Psychology of Optimal Experience</em>. Harper &amp; Row.</li>
<li>Sweller, J., Ayres, P., &amp; Kalyuga, S. (2011). <em>Cognitive Load Theory</em>. Springer.</li>
<li>Apperly, I.A., &amp; Butterfill, S.A. (2009). Do humans have two systems to track beliefs and belief-like states? <em>Psychological Review</em>, 116(4), 953–970.</li>
<li>Ophir, E., Nass, C., &amp; Wagner, A.D. (2009). Cognitive control in media multitaskers. <em>Proceedings of the National Academy of Sciences</em>, 106(37), 15583–15587.</li>
<li>Cowan, N. (2010). The magical mystery four: How is working memory capacity limited, and why? <em>Current Directions in Psychological Science</em>, 19(1), 51–57.</li>
</ol>
<h3 id="recent-human-ai-interaction-20192026">Recent Human-AI Interaction (2019–2026)</h3>
<ol>
<li>Vaithilingam, P., Zhang, T., &amp; Glassman, E.L. (2022). Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. <em>CHI '22 Extended Abstracts</em>.</li>
<li>Perry, N., Srivastava, M., Kumar, D., &amp; Boneh, D. (2023). Do users write more insecure code with AI assistants? <em>Proceedings of ACM CCS '23</em>.</li>
<li>Lai, V., et al. (2023). Human alignment of large language models through reinforcement learning from AI feedback. Various CHI/CSCW studies.</li>
<li>Xia, C.S., &amp; Bubeck, S. (2024). The capability and limitations of large language models for software engineering. <em>arXiv preprint</em>.</li>
<li>Multiple authors (2023–2026). Empirical studies on AI pair programming, code review automation bias, developer trust calibration, and cognitive load in AI-assisted development. [See specific citations in Section 7.]</li>
</ol>
<hr />
<p><em>This document is a living research compilation. The mapping from psychological theory to AI coding agent management is a novel application — the theories were not designed for this use case, and the mappings are interpretive rather than directly validated by experiment. Empirical validation of these predictions in real developer-AI interaction studies is urgently needed.</em></p>
</article>
</body>
</html>