@joonan30
Last active April 12, 2026 20:53
LLM Wiki: AI for Biology -- Collaborator Guide

LLM Wiki: Building a Personal Knowledge Base for Academic Papers with AI Agents

A methodology for using Claude Code + OpenAI Codex CLI to build and maintain a structured, searchable wiki from academic PDFs — designed for researchers who read dozens of papers and want compounding knowledge.

The Concept

Inspired by Karpathy's LLM Wiki pattern:

Original PDF → LLM markdown summary (sources/) → Structured wiki page (wiki/) → Overview synthesis

Each paper goes through a 3-tier pipeline:

  1. papers/: Original PDF (immutable archive)
  2. sources/: LLM-generated structured summary (7 standard sections)
  3. wiki/{category}/: Structured wiki page with cross-references ([[wikilinks]])

Overview pages synthesize across papers — this is where the real knowledge compounding happens.

Repository Structure

llm-wiki/
├── CLAUDE.md               # Schema, workflow, rules for AI agents
├── index.md                # Full page catalog
├── papers/                 # Original PDFs (cp, never symlink)
│   └── {author}-{year}-{title-5-words}.pdf
├── sources/                # PDF summaries (English only)
│   └── {author}-{year}-{title-5-words}.md
└── wiki/                   # Structured wiki pages (English only)
    ├── {category}/         # 25+ categories
    └── overviews/          # Synthesis pages (the real value)

Paper Naming Convention

All files (PDF, source, wiki) share the same name:

{first-author-lastname}-{year}-{first-5-title-words}.{ext}

Example: pollard-2006-an-rna-gene-expressed-during.pdf
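The stem can be derived mechanically from paper metadata. A minimal Python sketch (the paper_stem helper is illustrative, not part of the pipeline):

```python
import re

def paper_stem(first_author_lastname: str, year: int, title: str, n_words: int = 5) -> str:
    """Build the shared {author}-{year}-{first-5-title-words} stem."""
    words = re.findall(r"[a-z0-9]+", title.lower())[:n_words]
    return "-".join([first_author_lastname.lower(), str(year)] + words)

print(paper_stem("Pollard", 2006, "An RNA gene expressed during cortical development"))
# pollard-2006-an-rna-gene-expressed-during
```

Because the PDF, source summary, and wiki page all share this stem, any one file name resolves the other two.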

Single Paper Ingest (Claude Code)

Step 1: Copy PDF and extract text

# Using opendataloader-pdf (best quality, needs Java)
export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"
python3 -c "
import opendataloader_pdf, tempfile, os, re, sys
pdf = sys.argv[1]
with tempfile.TemporaryDirectory() as d:
    opendataloader_pdf.convert(pdf, output_dir=d, format='markdown',
                               pages='1-15', image_output='off', quiet=True)
    stem = os.path.splitext(os.path.basename(pdf))[0]
    text = open(f'{d}/{stem}.md').read()
lines = [l for l in text.splitlines() if not re.match(r'!\[image \d+\]', l)]
print('\n'.join(lines)[:12000])
" "/path/to/paper.pdf"

# Fallback: pypdf (faster, lower quality)
python3 -c "
import pypdf, sys
reader = pypdf.PdfReader(sys.argv[1])
text = ''
for page in reader.pages[:15]:
    t = page.extract_text()
    if t: text += t + '\n'
    if len(text) > 12000: break
print(text[:12000])
" "/path/to/paper.pdf"

Step 2: Create source file

---
title: "Paper Title"
authors: Author List
year: YYYY
doi: DOI
category: category_name
pdf_path: /full/path/to/papers/filename.pdf
pdf_filename: filename.pdf
source_collection: collection_name
---

A one-line summary plus 7 standard sections: Document Information, Key Contributions, Methodology and Architecture, Key Results and Benchmarks, Limitations and Future Work, Related Work, Glossary.

Step 3: Create wiki page with [[wikilinks]] to related papers

Step 4: Update index.md

Batch Paper Ingest with Codex CLI (5+ papers)

When processing many papers at once, delegate to OpenAI's Codex CLI to save Claude's context window.

Prerequisites

  • Codex CLI installed: npm install -g @openai/codex
  • Authenticated: codex login

The Workaround: Korean/Unicode Path Issue

Codex's websocket connection fails when the git repository path contains non-ASCII characters (e.g. Korean folder names). Solution:

# 1. Claude extracts text to /tmp (ASCII-only path)
mkdir -p /tmp/llm-wiki-ingest
python3 -c "..." paper.pdf > /tmp/llm-wiki-ingest/paper.txt

# 2. Run Codex from /tmp with --skip-git-repo-check
cd /tmp/llm-wiki-ingest
codex exec -m "gpt-5.4" \
  -c 'reasoning_effort="high"' \
  --full-auto \
  --skip-git-repo-check \
  "Read paper.txt and create source.md and wiki.md files..."

# 3. Fix paths and copy results back to project
sed -i '' 's|/short/path/|/full/unicode/path/|g' *.md
cp *-source.md /project/sources/
cp *-wiki.md /project/wiki/category/

Parallel Batch Processing

Run 4-5 Codex instances in parallel with & and wait:

cd /tmp/llm-wiki-ingest

codex exec -m "gpt-5.4" -c 'reasoning_effort="high"' \
  --full-auto --skip-git-repo-check \
  "Read paper1.txt. Create paper1-source.md and paper1-wiki.md..." &

codex exec -m "gpt-5.4" -c 'reasoning_effort="high"' \
  --full-auto --skip-git-repo-check \
  "Read paper2.txt. Create paper2-source.md and paper2-wiki.md..." &

wait
echo "Batch complete"

Known Limitations

| Issue | Cause | Workaround |
| --- | --- | --- |
| UTF-8 websocket error | Non-ASCII chars in git repo path | --skip-git-repo-check + work from /tmp |
| Model gpt-5.4-high rejected | Not a valid model name | Use gpt-5.4 with -c 'reasoning_effort="high"' separately |
| ChatGPT account model limits | Some models are API-key only | Use the default model or authenticate with an API key |

Codex vs Claude Agent: When to Use Which

| | Codex CLI | Claude Agent tool |
| --- | --- | --- |
| Best for | Batch processing 5+ papers | Complex tasks needing wiki context |
| Context | Fresh per invocation | Shares session context |
| Parallelism | Shell & + wait | Agent tool with run_in_background |
| Path issues | Needs ASCII path workaround | No path issues |
| Model | gpt-5.4 | Claude (same session) |
| Quality | Good for structured extraction | Better for synthesis/cross-referencing |

The Knowledge Tree Method

The most valuable part of this workflow is knowledge tree expansion — starting from a topic and branching outward:

Root question (e.g., "non-cortical brain cell types")
├── 1st wave: Direct overview pages
│   ├── ARHGAP11B dedicated page
│   ├── Thalamic molecular architecture
│   ├── Cerebellar cell diversity
│   ├── Complement synaptic pruning
│   └── WM vs GM astrocyte biology
├── 2nd wave: Deeper branches from discoveries
│   ├── Dopaminergic neuron diversity (from brainstem section)
│   ├── Human Accelerated Regions (from brain evolution)
│   └── Brain region-specific disease vulnerability
└── 3rd wave: Cross-cutting themes
    ├── Circadian regulation in brain evolution
    ├── Hypothalamus cell type atlas
    └── ... (continues)

How it works in practice:

  1. Ask a question → Claude searches wiki → answers from existing sources
  2. If wiki is insufficient → read original PDFs → update wiki
  3. Follow-up questions → branch into new topics → create new overview pages
  4. Cross-reference → link new findings to existing pages → knowledge compounds

Each conversation session produces 5-15 new or updated wiki pages. After a few sessions, the wiki becomes a searchable, cross-referenced knowledge graph that any future conversation can draw from.

Rules in CLAUDE.md

Key rules that make this work:

# Answer only from wiki content (no web search)
# If wiki is insufficient, read original PDF
# If topic has no papers, say so and ask user for PDF
# All content in English
# PDFs stored as real files in papers/ (never symlink)
# pdf_path always points to papers/ folder
# Consistent YAML frontmatter in every file
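The frontmatter-consistency rule can be enforced mechanically. A minimal sketch, assuming the source-file schema shown in this guide (the check_frontmatter helper and the ad-hoc key parsing are assumptions; real enforcement might use a YAML library):

```python
import re

REQUIRED = ["title", "authors", "year", "category", "pdf_path", "pdf_filename"]

def check_frontmatter(md_text: str) -> list[str]:
    """Return the required YAML frontmatter keys missing from a source file."""
    m = re.match(r"---\n(.*?)\n---", md_text, re.DOTALL)
    if not m:
        return REQUIRED  # no frontmatter block at all
    keys = {line.split(":", 1)[0].strip()
            for line in m.group(1).splitlines() if ":" in line}
    return [k for k in REQUIRED if k not in keys]

sample = '---\ntitle: "X"\nyear: 2024\n---\n\n## One-line Summary\n'
print(check_frontmatter(sample))
```

Running a check like this over sources/ after a batch ingest catches files where an agent dropped a field.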

Scaling Search with QMD

As recommended by Karpathy, at small scale (~100 sources), a simple index.md suffices. But once the wiki grows past ~500 pages, you need a proper search engine.

QMD is a local search engine for markdown files with:

  • Hybrid search: BM25 keyword (lex) + semantic vector (vec) + hypothetical document (hyde)
  • LLM re-ranking: Results are re-ranked by relevance
  • Fully on-device: No data leaves your machine

Setup as Claude Code MCP server

QMD runs as an MCP (Model Context Protocol) server that Claude Code can call directly. Once configured, Claude automatically searches the wiki via QMD instead of basic grep.

Search example

{
  "searches": [
    {"type": "lex", "query": "\"noncoding\" \"de novo\" autism"},
    {"type": "vec", "query": "how do noncoding variants contribute to ASD risk"},
    {"type": "hyde", "query": "De novo noncoding mutations in regulatory regions such as promoters and enhancers contribute to autism risk by disrupting TF binding sites and enhancer-promoter contacts."}
  ],
  "intent": "de novo noncoding mutations and autism",
  "limit": 100,
  "candidateLimit": 200
}

When to use what

| Scale | Search method |
| --- | --- |
| < 100 pages | index.md + Claude's Grep |
| 100-500 pages | Grep works, QMD is faster |
| 500+ pages | QMD is essential — semantic search finds related pages that keyword search misses |

At our current scale (~2,500 pages), QMD consistently finds related overview pages and cross-category connections that grep-based search misses.

Stats (as of April 2026)

  • Source files: ~1,100
  • Wiki pages: ~1,500 across 25 categories
  • Overview pages: ~60 synthesis pages
  • Papers: ~1,100 PDFs

Getting Started

  1. Create the folder structure
  2. Write a CLAUDE.md with your schema and rules
  3. Start with 5-10 papers in your field
  4. Ask Claude Code questions → let it build the wiki
  5. Follow curiosity → branch the knowledge tree

The wiki becomes more valuable with every paper added, because new papers connect to existing ones through [[wikilinks]] and overview pages.


Built with Claude Code (Anthropic) + Codex CLI (OpenAI)

LLM Wiki: AI for Biology — Collaborator Guide

A shared knowledge base of AI/deep learning papers in biology research. Maintained by Joon An and collaborators.

What is this?

This project follows Andrej Karpathy's "LLM Wiki" pattern: use an LLM to convert research PDFs into structured markdown, then build a queryable wiki on top.

Our version is specialized for AI in biology — genomic deep learning, single-cell foundation models, GWAS methods, neuroscience genetics, and more.

Current Scale (April 2026)

| Component | Count |
| --- | --- |
| Source summaries (sources/) | 1,060 |
| Wiki pages (wiki/) | 1,427 |
| Categories | 25 |
| Interactive visualizations | 8 |
| PDF collections | 4 Dropbox folders + direct additions |

How it Works: 3-Tier Architecture

Raw PDF (immutable, papers/)
  → sources/*.md   (LLM-generated summary, structured YAML frontmatter)
    → wiki/**/*.md  (final wiki page, cross-linked with [[wikilinks]])

  1. PDF (papers/): The original paper. All PDFs are copied here (no symlinks).
  2. Source markdown (sources/): A structured summary extracted by Claude (using opendataloader-pdf or pypdf). Contains title, authors, year, DOI, methodology, results, etc.
  3. Wiki page (wiki/{category}/): The final, polished page. Cross-linked to related papers via [[wikilinks]].

Folder Structure

llm-wiki/
├── CLAUDE.md               # Full schema, workflow, rules for Claude
├── index.md                # Page catalog (category + key papers)
├── log.md                  # Work log
├── scripts/
│   ├── paper_monitor.py    # Automated paper discovery
│   └── monitor_reports/    # Daily scan reports
├── papers/                 # Original PDFs (canonical storage)
│   └── {author}-{year}-{short-title}.pdf
├── sources/                # PDF summaries (1,060 files)
│   └── {author}-{year}-{short-title}.md
├── wiki/                   # Wiki pages (1,427 files)
│   ├── genomic-dl/         # DNA LMs, variant prediction, regulatory genomics
│   ├── single-cell-dl/     # scRNA-seq DL, cell type annotation
│   ├── single-cell-foundation/  # Geneformer, scGPT, virtual cells
│   ├── protein-ai/         # Protein LMs, structure prediction
│   ├── gwas/               # GWAS, EWAS, rare variant methods, population genetics
│   ├── neuroscience/       # ASD, schizophrenia, psychiatric genetics
│   ├── brain-development/  # Normal brain dev, cortical biology, neurogenesis
│   ├── brain-atlas/        # Brain cell atlases, BICCN, spatial transcriptomics
│   ├── long-read/          # PacBio, Oxford Nanopore
│   ├── lrRNA/              # Long-read RNA-seq: Iso-seq, MAS-seq, ONT
│   ├── drug-resistance/    # Cancer proteogenomics, drug resistance
│   ├── methylation-ai/     # DNA methylation AI/DL, epigenetic clocks
│   ├── methylation/        # General DNA methylation biology
│   ├── medical-llm/        # Medical/clinical LLMs, NLP for EHR
│   ├── statistics/         # FDR, rare variants, batch effects, Bayesian
│   ├── sex-differences-biology/  # Sex-specific genetic architecture
│   ├── reproductive-biology/     # Germline development, genomic imprinting
│   ├── meiosis/            # Meiotic recombination, crossover mechanisms
│   ├── synapse-evolution/  # Synapse molecular evolution
│   ├── aging/              # Longevity genetics, lifespan QTL
│   ├── organoid/           # Non-brain organoids
│   ├── single-cell-methylation/  # Single-cell DNA methylation
│   ├── concepts/           # Key ML/DL concepts
│   ├── overviews/          # Synthesis pages spanning multiple papers
│   └── other/              # Cross-cutting, evolution, benchmarks, misc
└── interactives/           # HTML/CSS/JS interactive visualizations
    ├── asd-cohorts/        # ASD cohort comparison
    ├── asd-models/         # ASD genetic architecture models
    ├── cancer-kinase-atlas/ # Cancer kinase network
    ├── cfm-explained/      # Conditional flow matching tutorial
    ├── glm/                # Genomic language model comparison
    ├── organoids/          # Organoid development timeline
    └── postmortem-brain-atlas/  # Brain atlas overview

Daily Workflow

1. Paper Monitoring (Automated)

A scheduled task runs paper_monitor.py every morning at 8 AM. It:

  1. Scans bioRxiv and medRxiv for new preprints in relevant categories
  2. Searches PubMed for new articles in 16 monitored journals:
    • Nature, Science, Cell
    • Nature Genetics, Nature Medicine, Nature Neuroscience, Nature Methods, Nature Biotechnology, Nature Machine Intelligence, Nature Communications
    • American Journal of Human Genetics, Neuron
    • Cell Genomics, Genome Biology, Genome Medicine, Genome Research
  3. Keyword-searches PubMed for each wiki category
  4. Scores relevance (+2 for high-value keywords, +1 for medium, penalties for irrelevant domains)
  5. Checks Unpaywall API for open access PDFs
  6. Writes a report to scripts/monitor_reports/monitor-YYYY-MM-DD.md

# Manual run: scan last 7 days, minimum score 2
python3 scripts/paper_monitor.py --days 7 --min-score 2

# Tighter filter
python3 scripts/paper_monitor.py --days 3 --min-score 4
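The relevance-scoring step above can be sketched as follows (the keyword sets and weights here are illustrative assumptions; the actual lists in paper_monitor.py are not shown in this guide):

```python
# Toy version of the monitor's scoring rule: +2 per high-value keyword,
# +1 per medium keyword, penalties for off-domain terms.
HIGH = {"foundation model", "variant effect", "single-cell"}
MEDIUM = {"deep learning", "transformer", "gwas"}
PENALTY = {"agriculture", "materials science"}

def score_abstract(text: str) -> int:
    """Weighted keyword hits minus off-domain penalties."""
    t = text.lower()
    score = sum(2 for kw in HIGH if kw in t)
    score += sum(1 for kw in MEDIUM if kw in t)
    score -= sum(2 for kw in PENALTY if kw in t)
    return score

print(score_abstract("A single-cell foundation model using a transformer"))  # 5
```

A paper then clears the --min-score threshold only when enough relevant terms co-occur, which keeps the daily report short.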

2. Review Monitor Report

The report has three sections:

  • Auto-Ingest Ready: Score >= 4 with OA PDF available. These can be ingested immediately.
  • Manual Review: Score >= 2 but no OA PDF or needs human judgment.
  • Quick Ingest Commands: Copy-paste curl + Claude commands for each OA paper.

3. Ingest a Paper

Tell Claude Code:

"Add this paper to the wiki: /path/to/paper.pdf"

Or from a monitor report:

"Ingest the top 5 papers from today's monitor report"

Claude will automatically:

  1. Copy PDF to papers/ (always copy, never symlink)
  2. Extract text with opendataloader-pdf (Java required) or pypdf (fallback)
  3. Write structured source summary to sources/
  4. Create wiki page in wiki/{category}/
  5. Update index.md with a one-line summary

4. Query the Wiki

cd ~/Dropbox/연구-참고문헌/llm-wiki
claude

Ask questions like:

"Compare DNA foundation model architectures"
"What are the best methods for rare variant association testing?"
"Summarize ASD genetics papers in this wiki"
"Create an overview comparing single-cell foundation models"

Important: Claude answers only from papers in the wiki — no web search. If a topic isn't covered, provide the PDF.

Search with QMD (MCP Server)

At our scale (2,500+ pages), we use QMD as a local search engine, configured as a Claude Code MCP server. As Karpathy notes, QMD provides hybrid BM25/vector search with LLM re-ranking, all on-device.

QMD supports three search modes:

  • lex (BM25): Exact keyword matching — "CWAS" "noncoding" autism
  • vec (semantic): Meaning-based — "how do noncoding variants contribute to ASD risk"
  • hyde (hypothetical document): Write what the answer looks like for best recall

Combining all three in a single query produces the best results, especially for finding related overview pages across categories.

Tip: Set limit: 100 and candidateLimit: 200 — with 2,500+ documents, the default limit of 10 misses too many relevant results.

5. Build Interactive Visualizations

For complex topics, create interactive HTML pages:

"Create an interactive visualization comparing genomic language model architectures"

These go to interactives/{topic-name}/ and get registered in interactives/index.html.

6. Save Knowledge as Overviews

When Claude synthesizes an answer across multiple papers, save it:

"Save this as an overview page"

This creates a page in wiki/overviews/. Questions become permanent knowledge that compounds over time.


How to Browse

Option A: Obsidian (Recommended)

  1. Open llm-wiki/ as an Obsidian Vault
  2. You get: [[wikilinks]] graph view, full-text search, markdown rendering
  3. Navigate via index.md or the graph view

Option B: Claude Code

cd ~/Dropbox/연구-참고문헌/llm-wiki
claude

Claude reads CLAUDE.md + index.md, finds relevant wiki pages, and synthesizes answers based only on papers in the wiki.


PDF Text Extraction

Primary: opendataloader-pdf — best quality markdown output with heading structure preserved. Requires Java.

export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"
python3 -c "
import opendataloader_pdf, tempfile, os, re, sys
pdf = sys.argv[1]
with tempfile.TemporaryDirectory() as d:
    opendataloader_pdf.convert(pdf, output_dir=d, format='markdown', pages='1-15', image_output='off', quiet=True)
    stem = os.path.splitext(os.path.basename(pdf))[0]
    text = open(f'{d}/{stem}.md').read()
lines = [l for l in text.splitlines() if not re.match(r'!\[image \d+\]', l)]
print('\n'.join(lines)[:12000])
" "/path/to/paper.pdf"

Fallback: pypdf — simpler, no Java needed.

python3 -c "
import pypdf, sys
reader = pypdf.PdfReader(sys.argv[1])
text = ''
for page in reader.pages[:40]:
    t = page.extract_text()
    if t: text += t + '\n'
    if len(text) > 12000: break
print(text[:12000])
" "/path/to/paper.pdf"

Installation:

pip3 install opendataloader-pdf --break-system-packages
pip3 install pypdf --break-system-packages
# Java: brew install openjdk

Lessons Learned

1. Strict "Papers Only" Principle

Never use web search to fill gaps. This is the most important rule.

  • All answers must be grounded in papers that are in the wiki.
  • If the wiki doesn't have enough information, go back to the original PDF and re-read it.
  • If the wiki has no papers on a topic, say so and ask for the PDF.
  • Overview pages must only cite papers that exist in the wiki.

This prevents hallucination and ensures every claim is traceable to a specific paper.

2. PDF Management

  • Always copy, never symlink. PDFs go to papers/ as real files.
  • Naming convention: {first-author-lastname}-{year}-{first-5-title-words}.pdf — same stem for PDF, source, and wiki page.
  • Never use external paths (Downloads, Desktop) in pdf_path.

3. Category Classification

Classify by METHOD, not topic:

| Principle | Example |
| --- | --- |
| Method-based | EWAS paper studying methylation → gwas, not methylation-ai |
| Strict boundaries | methylation-ai = AI/DL models only, not any paper mentioning methylation |
| Wet-lab → catch-all | Pure experimental biology → other |

Categories are lab-specific. Define yours in CLAUDE.md and Claude follows them automatically.

4. Deduplication

Duplicates accumulate during batch processing. Common causes:

  • Same DOI across categories
  • Re-processed files getting -v2 names

Prevention: Check for existing DOI before creating files. Run dedup passes after batch ingestion.
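The DOI pre-check can be sketched as follows (the helper names are illustrative, and the ad-hoc frontmatter parsing is an assumption; a YAML library would be more robust):

```python
import pathlib
import re

def doi_of(md_text: str):
    """Extract the doi: field from a source file's YAML frontmatter, lowercased."""
    m = re.search(r"^doi:\s*(\S+)", md_text, re.MULTILINE)
    return m.group(1).lower() if m else None

def existing_doi_files(sources_dir: str, doi: str) -> list[str]:
    """List source files whose frontmatter already carries this DOI."""
    return sorted(p.name for p in pathlib.Path(sources_dir).glob("*.md")
                  if doi_of(p.read_text(encoding="utf-8", errors="ignore")) == doi.lower())

# Skip ingestion when existing_doi_files("sources", some_doi) is non-empty.
```

Comparing DOIs case-insensitively matters in practice, since publishers and agents are inconsistent about casing.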

5. Knowledge Compounding

The most valuable part isn't individual pages — it's overview pages. Questions become permanent knowledge that can be refined.

6. What Doesn't Work

  • Web search for gap-filling — breaks paper-only principle
  • Claude's built-in PDF reader for batch processing — too slow, can hallucinate structure
  • Over-categorizing — 25 categories is already a lot, use other liberally
  • Trusting auto-classification blindly — LLMs classify by keywords, not understanding. Periodic human review is essential.

File Formats

Source File (sources/)

---
title: "Exact Paper Title"
authors: Author List
year: 2024
doi: 10.xxxx/xxxxx
category: genomic-dl
pdf_path: /full/path/to/papers/file.pdf
pdf_filename: file.pdf
source_collection: collection_name
---

## One-line Summary
## 1. Document Information
## 2. Key Contributions
## 3. Methodology and Architecture
## 4. Key Results and Benchmarks
## 5. Limitations and Future Work
## 6. Related Work
## 7. Glossary

Wiki Page (wiki/{category}/)

---
title: "Paper Title"
authors: Author list
year: 2024
doi: 10.xxxx/xxxxx
source: source_filename.md
category: genomic-dl
tags: []
---

## Summary
## Key Contributions
## Methodology and Architecture
## Results
## Related Papers
- [[category/page]] — relationship
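Cross-references drift as pages are renamed or re-categorized. A minimal link-checker sketch, assuming wikilinks take the [[category/page]] form shown above (the broken_wikilinks helper is hypothetical, not part of the workflow):

```python
import pathlib
import re

def broken_wikilinks(wiki_dir: str) -> list[tuple[str, str]]:
    """Return (page, link) pairs where a [[category/page]] link has no target file."""
    root = pathlib.Path(wiki_dir)
    # Stems like "gwas/some-paper" for every markdown page under the wiki root.
    stems = {p.relative_to(root).with_suffix("").as_posix() for p in root.rglob("*.md")}
    broken = []
    for page in sorted(root.rglob("*.md")):
        text = page.read_text(encoding="utf-8", errors="ignore")
        for link in re.findall(r"\[\[([^\]|#]+)", text):
            if link.strip() not in stems:
                broken.append((page.name, link.strip()))
    return broken
```

Running this after a batch ingest surfaces links that agents invented or that a rename silently orphaned.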

Category Definitions (Our Lab's Example)

Note: These categories reflect our lab's research focus. Define your own in CLAUDE.md.

| Category | What goes here |
| --- | --- |
| genomic-dl | DNA language models, variant effect prediction, regulatory genomics |
| single-cell-dl | scRNA-seq deep learning, cell type annotation, integration |
| single-cell-foundation | Geneformer, scGPT, large single-cell foundation models |
| single-cell-methylation | Single-cell DNA methylation analysis |
| protein-ai | Protein language models, structure prediction |
| gwas | GWAS, EWAS, rare variant testing, population genetics |
| neuroscience | ASD, schizophrenia, psychiatric genetics |
| brain-development | Normal brain development, cortical biology, neurogenesis |
| brain-atlas | Brain cell atlases, BICCN, spatial transcriptomics |
| long-read | PacBio, Oxford Nanopore sequencing methods |
| lrRNA | Long-read RNA-seq: Iso-seq, MAS-seq, ONT |
| drug-resistance | Cancer proteogenomics, drug resistance mechanisms |
| methylation-ai | DNA methylation AI/DL models only |
| methylation | General DNA methylation biology |
| medical-llm | Medical/clinical LLMs, NLP for EHR, healthcare AI |
| statistics | Statistical methods (FDR, Bayesian, batch effects) |
| sex-differences-biology | Sex-specific genetic architecture, XWAS |
| reproductive-biology | Germline development, PGC, genomic imprinting |
| meiosis | Meiotic recombination, crossover mechanisms |
| synapse-evolution | Synapse molecular evolution |
| aging | Longevity genetics, lifespan QTL |
| organoid | Non-brain organoids |
| concepts | General ML/DL concepts used across biology |
| overviews | Synthesis pages spanning multiple papers |
| other | Cross-cutting topics, wet-lab biology, reviews, benchmarks |

Getting Started as a Collaborator

  1. Get access to the shared Dropbox folder containing llm-wiki/
  2. Install Obsidian (free) and open llm-wiki/ as a vault
  3. Browse: Start from index.md or search for your topic
  4. To add papers: Use Claude Code — "Add this paper: /path/to/paper.pdf"
  5. Read CLAUDE.md for the full specification

Requirements for Claude Code Usage

  • Claude Code CLI installed
  • Python 3 with opendataloader-pdf and pypdf
  • Java (for opendataloader-pdf: brew install openjdk)
  • CLAUDE.md in the repo root teaches Claude all the rules automatically

Inspired by Karpathy's LLM Wiki. Built with Claude Code.
