Codebase Wiki

A pattern for auto-generating and maintaining project documentation with LLMs.

The Problem

Most LLM-assisted coding looks like this: the agent gets a task, greps around to understand the codebase, builds a mental model from scratch, does the work, and forgets everything. Next task, same thing. Every session starts at zero. The agent spends half its context window just figuring out where it is.

The standard fix is to write a CLAUDE.md or a README with project conventions and key files. This works for small projects. But once you have 50 database models, 200 API routes, 16 source directories, and an 8-layer architecture — no human is going to keep a hand-written reference current. The README drifts. The architecture doc was accurate six months ago. The API reference covers 40 of your 200 routes. Nobody updates the deployment guide after the third infrastructure change. This is the normal state of documentation in any real codebase: permanently behind.

We had exactly this problem. Our monorepo has two applications (a Next.js CRM and a Node.js pipeline orchestrator), a shared PostgreSQL database with 97 Prisma models, and enough internal surface area that even we — the people who built it — couldn't hold the full picture in our heads. Our LLM agents were doing redundant exploration on every task. We needed a reference that stayed current without anyone maintaining it by hand.

The Core Idea

Instead of writing documentation ourselves or asking agents to figure things out each time, we have LLMs generate and maintain an entire wiki from the codebase automatically. Not a summary. Not a README. A full, interlinked collection of 27 markdown pages covering architecture, API surfaces, data models, deployment procedures, known gotchas, active development areas, and documentation gaps — all derived from the actual code, the actual git history, and the actual dependency graph.

The key difference from RAG: the wiki is a compiled artifact. RAG retrieves raw chunks at query time and hopes the LLM can piece them together. The wiki has already done the synthesis. The cross-references are already there. The route tables are already built. The architecture diagrams are already drawn. When an agent needs to understand the actualization API, it reads one page — not 15 source files.

The key difference from hand-written docs: nobody maintains it. The wiki regenerates from code. When the code changes, affected pages update. When new modules appear, new pages appear. The humans are in charge of the codebase. The LLM is in charge of describing it.

This is Andrej Karpathy's LLM Wiki idea, adapted from personal knowledge management to codebase documentation. Same core insight — LLMs are good at the bookkeeping humans abandon — applied to a different domain.

Who Reads This

The wiki has two audiences, and this matters for how it's structured.

LLM agents working on tasks in the repo. Before starting work, the agent reads index.md — a catalog of every wiki page with a one-line summary. It finds the relevant pages, reads them, and starts the task with a pre-built understanding of the area it's working in. This replaces the "grep around and hope" phase. An agent working on the voice API reads components/api-producer-voice.md and immediately knows the 9 endpoints, their handler files, the auth pattern, and the known edge cases — without touching a single source file.

Humans onboarding or context-switching. A new developer reads overview.md and architecture.md to understand the system. A senior developer who hasn't touched the ads module in months reads components/api-ads.md to refresh their memory. The wiki is browsable in Obsidian, GitHub, or any markdown viewer. It's just a directory of .md files in the repo.

Architecture

Three layers, same as Karpathy's pattern but adapted for code:

The codebase is the source of truth. Git history, source files, database schema, package manifests, configuration. The wiki reads from this layer but never writes to it. This is immutable input.

The code graph is the analytical layer. We use GitNexus to build a graph of the repository: files, functions, classes, routes, imports, call chains, and — critically — Leiden community detection that automatically clusters related symbols into named groups like "Http", "Knowledge", "Ui", "Agent". The graph typically has 30,000+ nodes and 50,000+ edges for a medium-sized monorepo. This graph is what makes deterministic planning possible — more on that below.

The wiki is the output layer. A directory of markdown files (wiki/) with YAML frontmatter, mermaid diagrams, route tables, and cross-references. The LLM owns this layer entirely. It creates pages, updates them, maintains links, and keeps everything consistent. Humans read it; the LLM writes it.

Supporting files sit alongside the wiki pages:

index.md — catalog of every page with one-line summaries (the LLM reads this first to find relevant pages)
log.md — append-only record of generation runs
_plan.json — the page list and tier structure (persisted for resume on failure)
_briefs/ — per-page precomputed investigation hints

How It Works

The wiki generates in phases. Each phase has a clear input and output. If any phase fails, the pipeline can resume from the last successful checkpoint.

Phase 1: Plan

This is where we diverged most from the original pattern.

The obvious approach is to ask an LLM to scan the codebase and decide which pages to write. We tried this. It worked — mostly. The LLM would produce a reasonable page list in about 3 minutes. But "reasonable" isn't "complete." It would miss entire modules. Our pipeline has a roadmap helper, an architect planner, an eval system, a manager chat — the LLM's plan consistently overlooked 3-4 of these because they weren't prominent in the files it happened to scan first. Coverage was ~85%, and we couldn't predict which 15% would be missing.

So we replaced the LLM planner with deterministic code. The plan phase reads the code graph (communities, routes, app directories) and mechanically produces the page list. Every Leiden community above a size threshold gets a page. Every route prefix with 5+ routes gets an API page. Every app in the monorepo gets a component page. Standard reference pages (overview, architecture, deployment, gotchas, data model, gaps) are always included.

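The result: 100% coverage by construction, in 2 seconds instead of 3 minutes. Coverage here means every app, every route prefix with 5+ routes, and every community above the size threshold. There's nothing to miss because the algorithm is exhaustive. If a new module appears in the graph, it gets a page. No LLM judgment involved.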
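To make the rules above concrete, here is a minimal sketch of a deterministic planner. It is not the actual implementation: the `GraphSummary` shape, the `components/` output paths, the `minCommunitySize` parameter, and the idea that a route prefix is the first two path segments are all assumptions made for illustration.

```typescript
// Hypothetical graph summary; a real code graph exposes its own schema.
interface Community { name: string; size: number }
interface GraphSummary {
  communities: Community[];
  routes: string[]; // full route paths, e.g. "/producer/voice/list"
  apps: string[];   // app directories in the monorepo
}
interface PlannedPage { path: string; reason: string }

// The standard reference pages named in the text.
const STANDARD_PAGES = ["overview", "architecture", "deployment", "gotchas", "data-model", "gaps"];

function slug(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Assumption: a route prefix is the first `depth` path segments.
function routePrefix(route: string, depth: number): string {
  return route.split("/").filter((seg) => seg.length > 0).slice(0, depth).join("/");
}

function buildPlan(
  graph: GraphSummary,
  minCommunitySize: number, // hypothetical threshold; the text gives no value
  minRoutes: number = 5,    // "5+ routes" from the text
  prefixDepth: number = 2
): PlannedPage[] {
  const pages = new Map<string, PlannedPage>();
  const add = (path: string, reason: string): void => {
    if (!pages.has(path)) pages.set(path, { path, reason }); // first rule wins
  };

  for (const name of STANDARD_PAGES) add(`${name}.md`, "standard reference page");
  for (const app of graph.apps) add(`components/${slug(app)}.md`, "monorepo app");
  for (const c of graph.communities) {
    if (c.size >= minCommunitySize) add(`components/${slug(c.name)}.md`, `community (${c.size} symbols)`);
  }

  // Count routes per prefix, then emit one API page per busy prefix.
  const routeCounts = new Map<string, number>();
  for (const route of graph.routes) {
    const prefix = routePrefix(route, prefixDepth);
    if (prefix.length === 0) continue;
    routeCounts.set(prefix, (routeCounts.get(prefix) ?? 0) + 1);
  }
  routeCounts.forEach((count, prefix) => {
    if (count >= minRoutes) add(`components/api-${slug(prefix)}.md`, `${count} routes`);
  });

  // Sort so the same graph always yields the same plan.
  return Array.from(pages.values()).sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));
}
```

Because every rule is a plain loop over graph data, adding a module to the graph changes the plan without any prompt or model being involved.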

Pages are organized into three tiers:

Tier 0 — overview.md, decisions.md (written first, no dependencies)
Tier 1 — architecture.md, gotchas.md, deployment.md, data-model.md (can reference Tier 0)
Tier 2 — everything else: component pages, API pages, integration pages, gap analysis (can reference Tier 0 + 1)

The tier structure matters because each tier's output becomes context for the next tier — a "ledger" that tells the writer what's already been documented, preventing duplication.
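One way to picture the ledger is as a short text block built from the pages of strictly lower tiers. The sketch below assumes a `PlanEntry` with a numeric tier and a one-line `tldr` per written page; both shapes and the ledger wording are hypothetical.

```typescript
interface PlanEntry { path: string; tier: number }
interface WrittenPage { path: string; tldr: string }

// Builds the ledger shown to a writer working on a page in `tier`.
function ledgerFor(tier: number, plan: PlanEntry[], written: Map<string, WrittenPage>): string {
  const lines: string[] = [];
  for (const entry of plan) {
    // Same-tier pages are written in parallel, so only lower tiers are visible.
    if (entry.tier >= tier) continue;
    const page = written.get(entry.path);
    if (page !== undefined) lines.push(`- ${page.path}: ${page.tldr}`);
  }
  return lines.length === 0 ? "" : ["Already documented (do not repeat):", ...lines].join("\n");
}
```

A Tier 0 writer gets an empty ledger, a Tier 1 writer sees the Tier 0 pages, and a Tier 2 writer sees both.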

Phase 2: Brief

For every page in the plan, we precompute a brief — a markdown file with everything the writer LLM needs to know before it starts investigating. The brief contains:

Which files to read first (handler files, schema definitions, key modules)
Pre-fetched facts — route tables, community member lists, import chains, Prisma model counts — all queried from the code graph in advance
Working GitNexus queries — tested cypher queries the writer can run to dig deeper
Scope boundaries — which adjacent wiki pages cover nearby topics, so the writer doesn't duplicate

This is the highest-leverage optimization in the entire system. Without briefs, the writer LLM spends 30-60% of its time just figuring out what to look at — running exploratory searches, reading wrong files, backtracking. With briefs, it starts with the right files, the right facts, and the right boundaries. Every token the LLM spends rediscovering is a token it doesn't spend synthesizing. Briefs convert discovery time into writing time.

A brief for an API page might include the complete route table (11 endpoints with handler file paths), the top 15 cluster members with their files, and a reminder not to cover topics handled by adjacent pages. The writer opens one brief, reads the listed source files, and writes a focused, accurate page.
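As a rough illustration, a brief like the one described could be rendered from prefetched facts as follows. The `BriefInput` fields and section headings are assumptions, not the real brief format; the graph queries are passed through as opaque strings.

```typescript
// Hypothetical brief inputs, all prefetched from the code graph before the session starts.
interface RouteRow { method: string; path: string; handler: string }
interface BriefInput {
  page: string;             // e.g. "components/api-producer-voice.md"
  readFirst: string[];      // handler files, schema definitions, key modules
  routes: RouteRow[];       // complete route table for this page
  clusterMembers: string[]; // community members, most relevant first
  graphQueries: string[];   // already-tested queries, passed through verbatim
  adjacentPages: string[];  // pages whose topics this one must not repeat
}

function bullets(items: string[]): string[] {
  return items.length > 0 ? items.map((item) => `- ${item}`) : ["- (none)"];
}

function renderBrief(input: BriefInput, maxMembers: number = 15): string {
  const routeRows = input.routes.map((r) => `| ${r.method} | ${r.path} | ${r.handler} |`);
  const lines: string[] = [
    `# Brief: ${input.page}`,
    "",
    "## Read first",
    ...bullets(input.readFirst),
    "",
    `## Routes (${input.routes.length})`,
    "| Method | Path | Handler |",
    "| --- | --- | --- |",
    ...routeRows,
    "",
    "## Top cluster members",
    ...bullets(input.clusterMembers.slice(0, maxMembers)), // cap the list, as in "top 15"
    "",
    "## Graph queries",
    ...bullets(input.graphQueries),
    "",
    "## Out of scope (covered elsewhere)",
    ...bullets(input.adjacentPages),
  ];
  return lines.join("\n");
}
```

Everything in the brief is computed before the writer session starts, so the session's tokens go to reading the listed files and writing.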

Phase 3: Write

One LLM session per page. The writer receives:

A methodology prompt (page template, diagram conventions, quality rules)
The brief for this specific page
The ledger of already-written pages (from previous tiers)
Access to code-reading tools and the code graph

The writer investigates the code, builds understanding, and writes a markdown page with:

YAML frontmatter (title, type, sources list, confidence level, tags)
A one-line TL;DR
Structured sections with route tables, architecture diagrams, known gotchas
At least one mermaid diagram
A "See also" section linking related wiki pages
A "Backlinks" section (added automatically in post-processing)
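The frontmatter fields listed above could be modeled and serialized roughly like this. The field names come from the list; the example values for `type` and the three-level `confidence` scale are assumptions.

```typescript
// Field names follow the list above; the allowed values are assumptions.
interface PageFrontmatter {
  title: string;
  type: string;                          // e.g. "component" (hypothetical value)
  sources: string[];                     // source files the page was derived from
  confidence: "high" | "medium" | "low"; // hypothetical scale
  tags: string[];
}

// JSON string literals are valid YAML 1.2 double-quoted scalars, so
// JSON.stringify is used for quoting instead of a YAML library.
function yamlList(key: string, items: string[]): string[] {
  if (items.length === 0) return [`${key}: []`];
  return [`${key}:`, ...items.map((item) => `  - ${JSON.stringify(item)}`)];
}

function renderFrontmatter(fm: PageFrontmatter): string {
  return [
    "---",
    `title: ${JSON.stringify(fm.title)}`,
    `type: ${JSON.stringify(fm.type)}`,
    ...yamlList("sources", fm.sources),
    `confidence: ${fm.confidence}`,
    ...yamlList("tags", fm.tags),
    "---",
    "",
  ].join("\n");
}
```

The `sources` list is the field that matters most downstream, since it ties each page back to the files it was derived from.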

Completion contract: the prompt explicitly states that the writer must call the write tool before ending its turn. This sounds obvious, but without it, LLM sessions sometimes "think about" the page, plan what they'd write, and then end without actually writing anything. The contract eliminates this failure mode.
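A minimal sketch of the contract follows. The text only describes the prompt line; the wording of `COMPLETION_CONTRACT`, the `SessionEvent` shape, and the after-the-fact check are assumptions added for illustration, and no real agent API is implied.

```typescript
// Hypothetical wording; the text only says the prompt must state this explicitly.
const COMPLETION_CONTRACT = "You must call the write tool to save the page before ending your turn.";

// Hypothetical, simplified view of what a session emitted.
type SessionEvent =
  | { kind: "text"; content: string }
  | { kind: "tool_call"; tool: string; targetPath?: string };

// Appends the contract as the last line of the writer prompt.
function withContract(prompt: string): string {
  return `${prompt.trim()}\n\n${COMPLETION_CONTRACT}`;
}

// True only if the session actually wrote the expected page.
function contractSatisfied(events: SessionEvent[], pagePath: string, writeTool: string = "write"): boolean {
  return events.some(
    (e) => e.kind === "tool_call" && e.tool === writeTool && e.targetPath === pagePath
  );
}
```

A check like `contractSatisfied` is optional, but it turns a silent "thought about it, wrote nothing" session into a detectable failure that can be retried.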

Liveness detection: instead of wall-clock timeouts (which either kill slow-but-working sessions or waste time on stuck ones), we monitor OpenCode's SSE event stream. If the session is actively generating tokens or calling tools, it's alive. If it goes truly idle with no activity, that's when we intervene. This matches the session's actual state rather than guessing.
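The idea can be sketched as an idle watchdog that is reset by activity rather than by elapsed time. The class below is an assumption about how one might structure it; it takes timestamps as arguments and makes no claims about OpenCode's event names or payloads.

```typescript
// Tracks time since the last observed activity, not time since session start.
class LivenessMonitor {
  private lastActivityMs: number;
  private readonly idleLimitMs: number;

  constructor(idleLimitMs: number, startedAtMs: number) {
    this.idleLimitMs = idleLimitMs;
    this.lastActivityMs = startedAtMs;
  }

  // Call this for every stream event (token output, tool call, and so on).
  recordActivity(atMs: number): void {
    this.lastActivityMs = Math.max(this.lastActivityMs, atMs);
  }

  idleFor(nowMs: number): number {
    return nowMs - this.lastActivityMs;
  }

  // A session is stalled only when it has been silent longer than the limit.
  isStalled(nowMs: number): boolean {
    return this.idleFor(nowMs) > this.idleLimitMs;
  }
}
```

In use, the event-stream handler would call `recordActivity(Date.now())` and a periodic check would call `isStalled(Date.now())`. A session that runs for ten minutes while steadily emitting events never trips the limit, while one that dies after 30 seconds is caught as soon as the idle limit elapses.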

Pages within the same tier write in parallel (up to 3 concurrent sessions). A full 27-page wiki generates in about 20 minutes.

Phase 4: Compile (Optional)

If multiple models wrote the same page (the multi-model deliberation mode), a compiler model merges the versions — keeping the richest content from each, resolving contradictions, and producing one canonical page. If only one model wrote the page (the common case for speed), this phase is skipped entirely.

Phase 5: Index

A deterministic pass builds index.md — a catalog of every page with a one-line summary describing what it covers and when to read it. This is the entry point for any agent or human approaching the wiki. It also builds backlinks: every page gets a "Backlinks" section listing other pages that reference it, making the wiki navigable in both directions.
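The backlink half of this pass is a simple inversion of the link graph. The sketch below assumes pages link to each other with standard markdown links whose targets are paths relative to the wiki root; the actual link style is not stated in the text.

```typescript
// Assumption: links look like [Voice API](components/api-producer-voice.md),
// optionally with a #fragment, and targets are relative to the wiki root.
function extractLinkTargets(markdown: string): string[] {
  const linkPattern = /\]\(([^)#\s]+\.md)(?:#[^)\s]*)?\)/g;
  const targets: string[] = [];
  let match: RegExpExecArray | null = linkPattern.exec(markdown);
  while (match !== null) {
    const target = match[1];
    if (target !== undefined) targets.push(target);
    match = linkPattern.exec(markdown);
  }
  return targets;
}

// pages: wiki path -> markdown body.
// Returns: wiki path -> sorted list of pages that link to it.
function buildBacklinks(pages: Map<string, string>): Map<string, string[]> {
  const referrers = new Map<string, Set<string>>();
  pages.forEach((_body, path) => referrers.set(path, new Set<string>()));

  pages.forEach((body, source) => {
    for (const target of extractLinkTargets(body)) {
      if (target === source) continue;             // ignore self-links
      const set = referrers.get(target);
      if (set !== undefined) set.add(source);      // ignore links to unknown pages
    }
  });

  const backlinks = new Map<string, string[]>();
  referrers.forEach((sources, path) => backlinks.set(path, Array.from(sources).sort()));
  return backlinks;
}
```

Pages that end up with an empty list are the ones nothing else references.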

Keeping It Current

The initial generation is the expensive part. Keeping the wiki current is cheap.

Ingest runs after every completed task. It takes the git diff, finds the wiki pages whose sources: frontmatter lists a changed file, and regenerates only those pages. A task that modifies 3 files might touch 2 wiki pages. This takes 1-3 minutes, compared with roughly 20 for a full generation.
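The page-selection step of ingest can be sketched as a lookup from changed files to pages. The `PageSources` shape is hypothetical, and so is the convention that a `sources` entry is either an exact file path or a directory prefix ending in "/".

```typescript
interface PageSources { path: string; sources: string[] }

// Assumption: a sources entry is an exact file path, or a directory prefix ending in "/".
function sourceMatches(source: string, changedFile: string): boolean {
  return source.endsWith("/") ? changedFile.startsWith(source) : source === changedFile;
}

// Returns the wiki pages whose sources overlap the changed files, sorted for stable output.
function pagesToRegenerate(changedFiles: string[], pages: PageSources[]): string[] {
  return pages
    .filter((page) => page.sources.some((src) => changedFiles.some((file) => sourceMatches(src, file))))
    .map((page) => page.path)
    .sort();
}
```

The list of changed files could come from something like `git diff --name-only` between the commits before and after the task. Pages that reference none of the changed files are left untouched.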

Lint runs periodically (every 20 tasks by default). It health-checks the wiki: are there stale pages referencing deleted files? Contradictions between pages? Orphan pages with no backlinks? Important code areas with no wiki coverage? The output is a report, not automatic fixes — a human or agent decides what to act on.
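Two of these checks, stale sources and orphan pages, are mechanical enough to sketch directly. The shapes below are hypothetical, and the contradiction and coverage checks are deliberately left out because they need more than a lookup.

```typescript
interface LintPage { path: string; sources: string[]; backlinks: string[] }
interface LintReport {
  stale: { page: string; missingSources: string[] }[]; // pages citing files that no longer exist
  orphans: string[];                                   // pages nothing links to
}

// Produces a report only; nothing is fixed or deleted here.
function lintWiki(
  pages: LintPage[],
  existingFiles: Set<string>,
  exempt: string[] = ["index.md", "log.md"] // assumption: these never need backlinks
): LintReport {
  const exemptSet = new Set<string>(exempt);
  const report: LintReport = { stale: [], orphans: [] };
  for (const page of pages) {
    const missing = page.sources.filter((src) => !existingFiles.has(src));
    if (missing.length > 0) report.stale.push({ page: page.path, missingSources: missing });
    if (page.backlinks.length === 0 && !exemptSet.has(page.path)) report.orphans.push(page.path);
  }
  return report;
}
```

Returning a plain report keeps the decision about what to act on with a human or an agent, as described above.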

The combination means the wiki stays roughly current without any human maintenance. It drifts slightly between ingests, then catches up. The lint pass catches anything the ingests missed.

Why Deterministic Planning

This is the single most important lesson from building this system, and it generalizes beyond wikis:

Don't use LLMs for tasks that need completeness. Use them for tasks that need judgment.

Planning — deciding which pages to write — is a completeness task. You need every significant code area covered. Missing one is a bug. An LLM scanning a large codebase will always have blind spots because it samples rather than enumerates. A deterministic algorithm reading a code graph enumerates by definition.

Writing — producing the actual page content — is a judgment task. You need synthesis, prioritization, clear prose, useful diagrams. An LLM is excellent at this. A deterministic algorithm would produce garbage.

The pattern: deterministic skeleton, LLM flesh. Use code to decide what to write. Use LLMs to decide how to write it. This gets you 100% coverage with high-quality content.

We tried the opposite (LLM for everything) and got 85% coverage with high-quality content. That missing 15% made the wiki unreliable — you couldn't trust that a topic would be there. Once we switched to deterministic planning, the wiki became a reference you could depend on.

Scope Isolation

A subtle but critical problem: when you have pages like api-producer-courses and api-producer-voice, and routes like /producer/courses/* and /producer/voice/*, how does the planner know which page covers which routes?

Naive substring matching fails. "producer" appears in both page names and both route prefixes. Without careful scoping, one page swallows routes that belong to another, and you get gaps: routes that no page covers because the planner thinks they're already handled.

We use subset-based matching: a page covers a route only if every token in the page name appears in the route. api-producer-courses (tokens: api, producer, courses) does NOT cover /producer/voice (tokens: producer, voice) because "courses" is not in the route. This is strict, correct, and prevents scope bleed. When no existing page covers a route prefix, a new page is created automatically.

This matters because it's the mechanism that makes 100% coverage actually work. Deterministic planning is only as good as its coverage logic. Get the matching wrong and you're back to gaps.
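A minimal sketch of subset-based matching follows. One detail is an assumption: taken literally, the `api` token never appears in a route, so the sketch treats it as a page-type marker and drops it before the subset test. The function names are hypothetical.

```typescript
// Assumption: tokens that only mark the page type (here "api") never appear
// in a route, so they are excluded before the subset test.
const PAGE_TYPE_TOKENS = new Set<string>(["api"]);

function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// A page covers a route only if every remaining page token appears in the route.
function pageCoversRoute(pageName: string, route: string): boolean {
  const routeTokens = new Set<string>(tokenize(route));
  const required = tokenize(pageName).filter((t) => !PAGE_TYPE_TOKENS.has(t));
  return required.length > 0 && required.every((t) => routeTokens.has(t));
}

// For contrast: the naive check that causes scope bleed (any shared token matches).
function naiveSharesToken(pageName: string, route: string): boolean {
  return tokenize(pageName).some((t) => route.toLowerCase().indexOf(t) !== -1);
}

// Route prefixes that no existing page covers; each one needs a new page.
function uncoveredPrefixes(pageNames: string[], routePrefixes: string[]): string[] {
  return routePrefixes.filter((prefix) => !pageNames.some((page) => pageCoversRoute(page, prefix)));
}
```

With these definitions, `pageCoversRoute("api-producer-courses", "/producer/voice/*")` is false while the naive check is true, which is exactly the scope bleed the strict rule prevents. `uncoveredPrefixes` is what would trigger the automatic creation of a new page.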

What We Learned

Prefetch everything you can. The biggest quality improvement came not from better prompts or better models, but from giving the writer more facts upfront. Route tables, member lists, import chains, model counts — all queryable from the code graph before the LLM session starts. The brief turns a research task into a writing task.

Completion contracts matter. LLM sessions can end silently — the model finishes its reasoning, feels satisfied, and closes the turn without producing output. Explicitly stating "you must call the write tool before ending" in the prompt eliminates this. It's a one-line fix for a class of failures that's otherwise hard to debug.

Use platform signals, not timers. Wall-clock timeouts are either too aggressive (killing sessions that are slow but working) or too lenient (waiting 5 minutes for a session that died after 30 seconds). SSE event streams tell you what the session is actually doing. If it's generating tokens, let it run. If it's truly idle, that's your signal.

Idempotent by default. The write phase skips pages that already exist on disk. This means you can re-run the entire pipeline after a partial failure and it picks up where it left off. No wasted compute, no risk of overwriting good pages. Simple, but easy to forget to build in.

The wiki is a git artifact. It lives in the repo, gets committed, has version history. You can diff it, review it, revert it. This is better than a database or an external service because it follows the same lifecycle as the code it describes.

Applying This

This pattern isn't specific to our stack. The requirements are:

A code graph — something that can tell you "what are the major clusters in this codebase" and "what routes/endpoints exist." GitNexus, CodeQL, tree-sitter-graph, or even a sufficiently detailed ctags output would work. The graph is what makes deterministic planning possible.
An LLM with tool access — the writer needs to read files and (ideally) query the graph. Any coding agent setup works: Claude Code, Cursor, OpenCode, Codex, Aider with tool use.
A prompt template — what a page should look like, what frontmatter to include, what diagrams to draw. This is your schema layer — equivalent to CLAUDE.md in Karpathy's framing. Co-evolve it with the LLM as you learn what works for your codebase.
A trigger — either manual ("regenerate wiki") or automatic ("regenerate after every merged PR" / "regenerate nightly"). The cheaper you make regeneration, the more often you can run it.

The exact page types, the tier structure, the brief format — all of that adapts to your project. A microservices repo might have one page per service instead of one page per route prefix. A monolith might have pages organized by domain module. A library might have pages organized by public API surface. The pattern is the same: graph-driven plan, precomputed briefs, LLM-written pages, incremental updates.

Numbers

For our 27-page wiki covering a ~36,000-node codebase:

Metric	Value
Plan phase	2 seconds
Full generation	22 minutes
Total wiki size	~310 KB
Pages	27
Smallest page	731 bytes (active-tasks — correctly empty)
Largest page	29 KB (data-model — 97 Prisma models)
Prisma models documented	97/97
Enums documented	23/23
API routes covered	200+
Commit hashes cited	10/10 verified real
File:line references	spot-checked, all correct

Note

This document describes a pattern, not a library. There's no package to install. The implementation is ~700 lines of TypeScript for the deterministic planner, plus prompt templates and phase orchestration that are specific to our pipeline. The ideas generalize; the code doesn't need to.

The most transferable pieces are: deterministic planning from a code graph, per-page briefs with prefetched facts, completion contracts in writer prompts, and subset-based scope matching. If you build nothing else, build those four things. The rest is plumbing.

Built by the Vechkasov team. The wiki system runs as part of our AI pipeline orchestrator, generating documentation for 20+ projects from their code graphs.
