@joonan30
Created April 9, 2026 08:57
LLM Wiki — AI for Biology: Claude Code instructions for managing a personal research knowledge base (Karpathy LLM Wiki pattern)

LLM Wiki — AI for Biology

Important: Do Not Use Git Worktrees

Save all files directly on the main branch. Do not create a Git worktree.

A personal knowledge base for AI/deep learning papers in biology research. Follows Karpathy's LLM Wiki pattern: Original PDF → LLM markdown summary (sources) → Structured wiki page (wiki).

Language policy: All wiki content is in English. Conversation with Claude can be in Korean or English.

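Important: Paper-Grounded Answers

The purpose of this project is to accumulate knowledge grounded in the content of papers.

  • Base answers only on paper content that is in the wiki (sources/, wiki/).
  • Do not use web search (WebSearch, WebFetch). Do not search the web to fill in information that is missing from the wiki.
  • If the wiki content is insufficient, read the original PDF of the paper to supplement it (using Bash + opendataloader-pdf).
  • If the wiki has no paper on the topic at all, say so and ask the user for a PDF.
  • When writing Overview pages, use only papers in the wiki as sources.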

Interactive Visualization Rules

  • When creating a new interactive, always add an entry to interactives/index.html.
  • Entry format: list item with title, date, and a short description. No icons.
  • All interactive pages use white background (background: #fff).
  • Interactive files go in interactives/{topic-name}/ subdirectories.
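
The entry format above (title, date, short description, no icons) can be sketched as a small helper. This is only an illustration: the function name index_entry, the exact `<li>` markup, and the relative link to the topic subdirectory are assumptions, since the rules do not prescribe the HTML itself.

```python
from html import escape


def index_entry(topic: str, title: str, date: str, description: str) -> str:
    """Build one list item for interactives/index.html (hypothetical markup).

    The link is relative to interactives/index.html and points at the
    interactives/{topic-name}/ subdirectory. No icon is emitted.
    """
    href = escape(f"{topic}/", quote=True)
    return (
        f'<li><a href="{href}">{escape(title)}</a> '
        f"({escape(date)}): {escape(description)}</li>"
    )


if __name__ == "__main__":
    print(index_entry("example-topic", "Example Title", "2026-04-09", "A short description."))
```

Escaping the text fields keeps titles that contain characters such as & or < from breaking the index page.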

Statistics (2026-04-09)

  • Source markdown: 1,054 files (sources/)
  • Wiki pages: 1,421 files across 25 categories (wiki/)
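
Counts like the ones above can be recomputed from the directory layout rather than maintained by hand. The sketch below assumes that wiki pages sit exactly one level below wiki/ (wiki/{category}/page.md); the function name wiki_statistics and the returned keys are hypothetical.

```python
from pathlib import Path


def wiki_statistics(root: Path) -> dict[str, int]:
    """Count source summaries, wiki pages, and non-empty wiki categories."""
    source_files = sum(1 for _ in (root / "sources").glob("*.md"))
    wiki_pages = list((root / "wiki").glob("*/*.md"))
    # A category is counted only if it holds at least one markdown page.
    categories = {page.parent.name for page in wiki_pages}
    return {
        "source_files": source_files,
        "wiki_pages": len(wiki_pages),
        "categories": len(categories),
    }


if __name__ == "__main__":
    print(wiki_statistics(Path(".")))
```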

Structure

llm-wiki/
├── CLAUDE.md               # This file — schema, workflow, usage
├── index.md                # Full page catalog (category + key papers)
├── log.md                  # Work log
├── scripts/                # Ingest & build scripts
├── papers/                 # Original PDF files (canonical storage)
│   └── {author}-{year}-{title-5-words}.pdf
├── sources/                # PDF summaries (all English)
│   └── {author}-{year}-{title-5-words}.md
├── interactives/           # Interactive HTML visualizations
│   └── {topic-name}/
└── wiki/                   # Structured wiki pages (all English)
    └── {category}/

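Naming Convention

All paper-related files (PDF, source markdown, wiki markdown) follow the same rule:

{first-author-last-name}-{year}-{first-5-title-words-joined-by-hyphens}.{extension}

Details:

  • Use only the first author's last name, lowercased, with special characters removed
  • The year has four digits
  • Take the first 5 words of the title, lowercased, with special characters removed and spaces replaced by -
  • Consortium papers use the consortium name (e.g., 1000-genomes-project-2015-...)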
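
The naming rule can be sketched as a small function. Several details here are assumptions that the rule does not spell out: accented letters are folded to ASCII before special characters are removed, a hyphen inside a word counts as a special character (so "Single-cell" becomes "singlecell"), and the names paper_filename and _slug_words are hypothetical.

```python
import re
import unicodedata


def _slug_words(text: str) -> list[str]:
    """Lowercase the text and split it into words without special characters."""
    # Assumption: fold accents (for example u-umlaut becomes u) instead of dropping the letter.
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    words = []
    for raw in folded.lower().split():
        cleaned = re.sub(r"[^a-z0-9]", "", raw)
        if cleaned:
            words.append(cleaned)
    return words


def paper_filename(author: str, year: int, title: str, ext: str = "pdf") -> str:
    """Build {author}-{year}-{first five title words}.{ext}.

    Pass the first author's last name, or the consortium name for
    consortium papers.
    """
    if not 1000 <= year <= 9999:
        raise ValueError("year must have four digits")
    author_part = "-".join(_slug_words(author))
    title_part = "-".join(_slug_words(title)[:5])
    return f"{author_part}-{year}-{title_part}.{ext}"


if __name__ == "__main__":
    # Hypothetical paper, used only to show the shape of the result.
    print(paper_filename("Kim", 2024, "A Deep Model for Gene Regulation: Methods", "md"))
```

Calling the same function with ext set to "pdf" and "md" keeps the PDF, the source summary, and the wiki page on one shared stem.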

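Rules for the Lab's Own Papers

Lab papers are distinguished by the source_collection and status fields:

| Stage | source_collection | status | Description |
|---|---|---|---|
| Published | lab-papers | published | Lab paper already published in a journal |
| Under review | our-manuscript | under-review | Paper submitted to a journal and currently under review |
| In preparation | our-manuscript | in-preparation | Paper not yet submitted |

Manuscripts in preparation or under review (when no PDF exists):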

source_collection: our-manuscript
status: in-preparation   # or under-review
pdf_path: ""
pdf_filename: ""

Published papers:

source_collection: lab-papers
status: published

Note: anlab is a separate collection used for the lab's reading list (external papers). Do not use it for papers authored by the lab.
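
The allowed combinations of source_collection and status can be checked mechanically. The following is a minimal sketch: the names LAB_STATES and check_lab_status are hypothetical, and collections outside the two lab-authored ones (such as anlab) are simply passed through without a status check.

```python
from typing import Optional

# Allowed status values for each lab-authored collection.
LAB_STATES = {
    "lab-papers": {"published"},
    "our-manuscript": {"under-review", "in-preparation"},
}


def check_lab_status(source_collection: str, status: str) -> Optional[str]:
    """Return an error message for an invalid combination, else None."""
    allowed = LAB_STATES.get(source_collection)
    if allowed is None:
        # Not a lab-authored collection (for example anlab); nothing to check.
        return None
    if status not in allowed:
        expected = ", ".join(sorted(allowed))
        return f"{source_collection} expects status in [{expected}], got {status!r}"
    return None


if __name__ == "__main__":
    print(check_lab_status("our-manuscript", "published"))
```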

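PDF Management Rules

Principle: Every PDF is stored as a real file in the papers/ folder.

  • When the user provides a PDF from an external path, always copy it into papers/ with cp. Do not create a symlink.
  • The filename must follow the naming convention above.
  • pdf_path must always be an absolute path inside papers/.
  • pdf_filename must match the basename of pdf_path.
  • Never put an external path (~/Downloads/, ~/Desktop/, etc.) in pdf_path.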
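
These rules can be sketched as a copy step plus a consistency check. The workflow itself uses cp; shutil.copyfile is swapped in here only to keep the example in Python. The function names store_pdf and check_pdf_fields are hypothetical, and treating two empty fields as valid is an assumption based on the manuscript case where no PDF exists.

```python
import shutil
from pathlib import Path


def store_pdf(external_pdf: Path, papers_dir: Path, filename: str) -> dict[str, str]:
    """Copy a PDF into papers/ as a real file and return the front-matter fields."""
    papers_dir = papers_dir.resolve()
    papers_dir.mkdir(parents=True, exist_ok=True)
    target = papers_dir / filename
    # copyfile writes the file contents, so the result is never a symlink.
    shutil.copyfile(external_pdf, target)
    return {"pdf_path": str(target), "pdf_filename": target.name}


def check_pdf_fields(pdf_path: str, pdf_filename: str, papers_dir: Path) -> list[str]:
    """Return a list of rule violations (empty when the fields are consistent)."""
    if pdf_path == "" and pdf_filename == "":
        return []  # assumption: manuscripts without a PDF leave both fields empty
    problems = []
    path = Path(pdf_path)
    if not path.is_absolute():
        problems.append("pdf_path is not an absolute path")
    if path.parent != papers_dir.resolve():
        problems.append("pdf_path does not point inside papers/")
    if path.name != pdf_filename:
        problems.append("pdf_filename does not match the basename of pdf_path")
    if path.is_symlink():
        problems.append("pdf_path is a symlink, not a real file")
    return problems
```

Running check_pdf_fields over every source file is one way to catch a stray ~/Downloads/ path before it spreads into wiki pages.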

Add a New Paper (Ingest)

Step 1 — Copy PDF to papers/ and extract text:

Use opendataloader-pdf (requires Java). If it fails, fall back to pypdf.

# Put the Homebrew OpenJDK on PATH; opendataloader-pdf needs Java.
export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"
python3 -c "
import opendataloader_pdf, tempfile, os, re, sys
pdf = sys.argv[1]
# Convert the first 15 pages to markdown inside a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    opendataloader_pdf.convert(pdf, output_dir=d, format='markdown', pages='1-15', image_output='off', quiet=True)
    stem = os.path.splitext(os.path.basename(pdf))[0]
    with open(f'{d}/{stem}.md', encoding='utf-8') as fh:
        text = fh.read()
# Drop image placeholder lines and cap the output at 12,000 characters.
lines = [l for l in text.splitlines() if not re.match(r'!\[image \d+\]', l)]
print('\n'.join(lines)[:12000])
" "/path/to/paper.pdf"

pypdf fallback:

python3 -c "
import pypdf, sys
reader = pypdf.PdfReader(sys.argv[1])
text = ''
for page in reader.pages[:40]:
    t = page.extract_text()
    if t: text += t + '\n'
    if len(text) > 12000: break
print(text[:12000])
" "/path/to/paper.pdf"

Step 2 — Write source file to sources/{filename}.md

Step 3 — Create wiki page at wiki/{category}/{filename}.md

Step 4 — Update index.md
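
The four steps map onto four fixed locations, so the remaining work for a paper can be derived from its filename stem and category. This sketch follows the directory layout in the Structure section; the names ingest_targets and pending_steps are hypothetical, and it only checks that files exist (for index.md it cannot tell whether the new entry was added).

```python
from pathlib import Path


def ingest_targets(root: Path, stem: str, category: str) -> dict[str, Path]:
    """Return the files that Steps 1 to 4 create or update for one paper."""
    return {
        "pdf": root / "papers" / f"{stem}.pdf",            # Step 1
        "source": root / "sources" / f"{stem}.md",         # Step 2
        "wiki": root / "wiki" / category / f"{stem}.md",   # Step 3
        "index": root / "index.md",                        # Step 4
    }


def pending_steps(targets: dict[str, Path]) -> list[str]:
    """List the targets that do not exist yet, in step order."""
    return [name for name, path in targets.items() if not path.exists()]


if __name__ == "__main__":
    # Hypothetical stem and category, used only for illustration.
    targets = ingest_targets(Path("."), "kim-2024-a-deep-model-for-gene", "genomic-dl")
    print(pending_steps(targets))
```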

Wiki Page Schema

Source Files (sources/)

---
title: "Paper Title"
authors: Author List
year: YYYY
doi: DOI
category: category_name
pdf_path: /full/path/to/paper.pdf
pdf_filename: filename.pdf
source_collection: collection_name
---

## One-line Summary
## 1. Document Information
## 2. Key Contributions
## 3. Methodology and Architecture
## 4. Key Results and Benchmarks
## 5. Limitations and Future Work
## 6. Related Work
## 7. Glossary

Wiki Pages (wiki/{category}/)

---
title: "Exact English Title"
authors: Author list
year: YYYY
doi: DOI
source: source_filename.md
category: category_name
pdf_path: /full/path.pdf
pdf_filename: filename.pdf
source_collection: collection_name
tags: []
---

## Summary
## Key Contributions
## Methodology and Architecture
## Results
## Related Papers
- [[category/page]] — relationship
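
Because Related Papers entries use [[category/page]] wikilinks, dangling links can be detected by resolving each target to a file under wiki/. The sketch below assumes that a target maps to wiki/{target}.md and that Obsidian-style aliases (`|label`) and heading anchors (`#heading`) may appear; the names WIKILINK, related_pages, and missing_pages are hypothetical.

```python
import re
from pathlib import Path

# Capture the link target; ignore an optional #heading and an optional |alias.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def related_pages(markdown: str) -> list[str]:
    """Return every wikilink target in the text, in order of appearance."""
    return [target.strip() for target in WIKILINK.findall(markdown)]


def missing_pages(markdown: str, wiki_dir: Path) -> list[str]:
    """Return the link targets that have no matching file under wiki_dir."""
    return [
        target for target in related_pages(markdown)
        if not (wiki_dir / f"{target}.md").is_file()
    ]
```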

Overview Pages (wiki/overviews/)

---
title: "Topic Title"
tags: [relevant-tags]
---

## Overview
## Timeline / Comparison Table
## Related Pages

Category Definitions

| Category | Includes |
|---|---|
| genomic-dl | DNA LMs, variant effect prediction, regulatory genomics, sequence models |
| single-cell-dl | scRNA-seq DL, cell type annotation, integration, imputation, perturbation |
| single-cell-foundation | Geneformer, scGPT, virtual cells, large single-cell foundation models |
| single-cell-methylation | Single-cell DNA methylation analysis, epigenomic profiling |
| protein-ai | Protein LMs, structure prediction, PTM prediction |
| gwas | GWAS, common/rare variant methods, population genetics, LD, variant interpretation |
| neuroscience | ASD genetics, schizophrenia genetics, psychiatric genetics, disease gene functional studies |
| brain-development | Normal brain development, cortical biology, cerebral organoid methodology, neurogenesis |
| brain-atlas | Brain cell atlases, BICCN, spatial transcriptomics |
| organoid | Non-brain organoids: lung, kidney, liver, heart, gut, retinal; iPSC differentiation |
| long-read | PacBio, Oxford Nanopore, long-read DNA sequencing methods |
| lrRNA | Long-read RNA-seq: Iso-seq, MAS-seq, ONT cDNA/dRNA, transcript isoforms |
| drug-resistance | Cancer proteogenomics, drug resistance, cancer genomics, immunotherapy |
| methylation-ai | DNA methylation AI, epigenetic clocks |
| methylation | General DNA methylation biology |
| statistics | Statistical methods: FDR, rare variants, batch effects, Bayesian |
| medical-llm | Medical/clinical LLMs, NLP for EHR, clinical NLP |
| sex-differences-biology | Sex-specific genetic architecture, XWAS, sex-biased disease, X-inactivation |
| reproductive-biology | Germline development, PGC reprogramming, meiotic recombination, genomic imprinting |
| meiosis | Meiotic recombination, crossover mechanisms, synaptonemal complex |
| synapse-evolution | Synapse molecular evolution, postsynaptic density, comparative synaptomics |
| aging | Longevity genetics, lifespan QTL, aging biology |
| other | Cross-cutting, evolution, networks, benchmarks, misc |
| concepts | Key concepts, methods, algorithms explained |
| overviews | Synthesis pages, timelines, comparison tables |

Design Principles

  • 3-tier architecture: Raw PDF (immutable) → sources/*.md (summaries) → wiki/**/*.md (final)
  • Single wiki: All AI for Biology in one wiki, organized by category
  • Obsidian compatible: [[wikilinks]], standard markdown
  • English only: All content in English (for paper writing + RAG)
  • PDF extraction: Bash + opendataloader-pdf (NOT Claude Read tool). Fallback to pypdf.
  • Consistent YAML: Every source file has title, authors, year, doi, category, pdf_path, pdf_filename, source_collection
  • Knowledge compounding: Answers to queries saved as overviews/ pages