LLM Wiki — AI for Biology

Important: Do not use Git worktrees

Save all files directly on the main branch. Do not create Git worktrees.

A personal knowledge base for AI/deep learning papers in biology research. Follows Karpathy's LLM Wiki pattern: Original PDF → LLM markdown summary (sources) → Structured wiki page (wiki).

Language policy: All wiki content is in English. Conversation with Claude can be in Korean or English.

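Important: Paper-based answering principle

The purpose of this project is to accumulate knowledge grounded in the content of papers.

Base answers only on paper content that is in the wiki (sources/, wiki/).
Do not use web search (WebSearch, WebFetch). Do not search the web to fill in information missing from the wiki.
If the wiki content is insufficient, you may supplement it by reading the paper's original PDF (using Bash + opendataloader-pdf).
If the wiki has no paper at all on the topic, say so and ask the user for a PDF.
When writing Overview pages, likewise use only papers in the wiki as sources.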

Interactive Visualization Rules

When creating a new interactive, always add an entry to interactives/index.html.
Entry format: list item with title, date, and a short description. No icons.
All interactive pages use white background (background: #fff).
Interactive files go in interactives/{topic-name}/ subdirectories.

Statistics (2026-04-09)

Source markdown: 1,054 files (sources/)
Wiki pages: 1,421 files across 25 categories (wiki/)

Structure

llm-wiki/
├── CLAUDE.md               # This file — schema, workflow, usage
├── index.md                # Full page catalog (category + key papers)
├── log.md                  # Work log
├── scripts/                # Ingest & build scripts
├── papers/                 # Original PDF files (canonical storage)
│   └── {author}-{year}-{title-5-words}.pdf
├── sources/                # PDF summaries (all English)
│   └── {author}-{year}-{title-5-words}.md
├── interactives/           # Interactive HTML visualizations
│   └── {topic-name}/
└── wiki/                   # Structured wiki pages (all English)
    └── {category}/

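File Naming Convention

All paper-related files (PDF, source markdown, wiki markdown) follow the same rule:

{first author's last name}-{year}-{first 5 words of the title joined with -}.extension

Detailed rules:

Use only the first author's last name, lowercase, with special characters removed
Year is 4 digits
First 5 words of the title, lowercase, special characters removed, spaces replaced with -
For consortium papers, use the consortium name (e.g., 1000-genomes-project-2015-...)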
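
A minimal Python sketch of the naming rules above. The helper names `_slug_words` and `paper_filename` are hypothetical, and several details are assumptions the rules do not settle: accented letters are folded to ASCII, a hyphen inside a word is treated as a special character and removed (so "Single-Cell" becomes "singlecell"), and words that consist only of punctuation are dropped before the first five are taken. For a consortium paper, the caller passes the consortium name in place of the last name.

```python
import re
import unicodedata


def _slug_words(text):
    """Lowercase, split on whitespace, and strip non-alphanumerics from each word."""
    # Assumption: accented letters are folded to ASCII (e.g. "ü" -> "u").
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    words = (re.sub(r"[^a-z0-9]", "", w) for w in ascii_text.lower().split())
    return [w for w in words if w]  # drop words that were only punctuation


def paper_filename(author, year, title, ext="pdf"):
    """Build {author}-{year}-{first-5-title-words}.{ext} (hypothetical helper)."""
    if not re.fullmatch(r"\d{4}", str(year)):
        raise ValueError("year must be 4 digits")
    author_part = "-".join(_slug_words(author))  # consortium names may span several words
    title_part = "-".join(_slug_words(title)[:5])
    return f"{author_part}-{year}-{title_part}.{ext}"
```

With made-up inputs, `paper_filename("O'Brien", 2024, "Deep Learning for Single-Cell Data: A Survey")` returns `obrien-2024-deep-learning-for-singlecell-data.pdf`. Passing `ext="md"` gives the matching source or wiki filename.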

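Rules for the Lab's Own Papers

Lab papers are distinguished by the source_collection and status fields:

State	`source_collection`	`status`	Description
Published	`lab-papers`	`published`	Lab paper already published in a journal
Under review	`our-manuscript`	`under-review`	Paper submitted to a journal and under review
In preparation	`our-manuscript`	`in-preparation`	Paper not yet submitted

In-preparation or under-review papers (when there is no PDF):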

source_collection: our-manuscript
status: in-preparation   # or under-review
pdf_path: ""
pdf_filename: ""

Published papers:

source_collection: lab-papers
status: published

Note: anlab is a separate collection used for the lab's reading list (external papers). Do not use it for papers authored by the lab.
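
The state table above can be checked mechanically. The following is a small sketch, assuming the front matter has already been loaded into a plain dict. The names `LAB_PAPER_STATES` and `check_lab_paper` are hypothetical. The second check only verifies that the two PDF fields are both empty or both set, since the rules allow a manuscript to lack a PDF.

```python
# Allowed (source_collection, status) pairs for lab-authored papers, per the table above.
LAB_PAPER_STATES = {
    ("lab-papers", "published"),
    ("our-manuscript", "under-review"),
    ("our-manuscript", "in-preparation"),
}


def check_lab_paper(meta):
    """Return a list of problems found in a lab paper's front matter dict."""
    problems = []
    pair = (meta.get("source_collection"), meta.get("status"))
    if pair not in LAB_PAPER_STATES:
        problems.append(f"invalid source_collection/status pair: {pair}")
    # The two PDF fields should be both empty (no PDF yet) or both filled in.
    if bool(meta.get("pdf_path")) != bool(meta.get("pdf_filename")):
        problems.append("pdf_path and pdf_filename must both be empty or both be set")
    return problems
```

An entry that uses `anlab` is reported as invalid here, which matches the rule that `anlab` is not used for lab-authored papers.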

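PDF Management Rules

Principle: Store every PDF as an actual file in the papers/ folder.

When the user provides a PDF from an external path, always copy it into the papers/ folder with the cp command. Do not create symlinks.
The filename must follow the file naming convention above.
pdf_path must always point to an absolute path inside papers/.
pdf_filename must match the basename of pdf_path.
Never put an external path (~/Downloads/, ~/Desktop/, etc.) in pdf_path.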
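
The PDF rules above lend themselves to a quick lint. This is a sketch only: `check_pdf_fields` is a hypothetical helper, `papers_dir` is assumed to be the absolute path of the papers/ folder, and PDFs are assumed to sit directly in that folder with no subdirectories, as the Structure section suggests.

```python
import os


def check_pdf_fields(pdf_path, pdf_filename, papers_dir):
    """Return a list of rule violations for one paper's PDF fields."""
    problems = []
    if not os.path.isabs(pdf_path):
        problems.append("pdf_path is not absolute")
    # pdf_path must sit directly inside papers/ (no external locations).
    if os.path.dirname(os.path.normpath(pdf_path)) != os.path.normpath(papers_dir):
        problems.append("pdf_path is not inside papers/")
    if os.path.basename(pdf_path) != pdf_filename:
        problems.append("pdf_filename does not match basename of pdf_path")
    if os.path.islink(pdf_path):
        problems.append("pdf_path is a symlink, expected a real copy")
    elif not os.path.isfile(pdf_path):
        problems.append("PDF file is missing")
    return problems
```

An empty list means the entry passes. Manuscripts with empty `pdf_path` and `pdf_filename` would need to be skipped before calling this check.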

Add a New Paper (Ingest)

Step 1 — Copy PDF to papers/ and extract text:

Use opendataloader-pdf (requires Java). If it fails, fall back to pypdf.

export PATH="/opt/homebrew/opt/openjdk/bin:$PATH"
python3 -c "
import opendataloader_pdf, tempfile, os, re, sys
pdf = sys.argv[1]
with tempfile.TemporaryDirectory() as d:
    opendataloader_pdf.convert(pdf, output_dir=d, format='markdown', pages='1-15', image_output='off', quiet=True)
    stem = os.path.splitext(os.path.basename(pdf))[0]
    text = open(f'{d}/{stem}.md').read()
lines = [l for l in text.splitlines() if not re.match(r'!\[image \d+\]', l)]
print('\n'.join(lines)[:12000])
" "/path/to/paper.pdf"

pypdf fallback:

python3 -c "
import pypdf, sys
reader = pypdf.PdfReader(sys.argv[1])
text = ''
for page in reader.pages[:40]:
    t = page.extract_text()
    if t: text += t + '\n'
    if len(text) > 12000: break
print(text[:12000])
" "/path/to/paper.pdf"

Step 2 — Write source file to sources/{filename}.md

Step 3 — Create wiki page at wiki/{category}/{filename}.md

Step 4 — Update index.md

Wiki Page Schema

Source Files (sources/)

---
title: "Paper Title"
authors: Author List
year: YYYY
doi: DOI
category: category_name
pdf_path: /full/path/to/paper.pdf
pdf_filename: filename.pdf
source_collection: collection_name
---

## One-line Summary
## 1. Document Information
## 2. Key Contributions
## 3. Methodology and Architecture
## 4. Key Results and Benchmarks
## 5. Limitations and Future Work
## 6. Related Work
## 7. Glossary

Wiki Pages (wiki/{category}/)

---
title: "Exact English Title"
authors: Author list
year: YYYY
doi: DOI
source: source_filename.md
category: category_name
pdf_path: /full/path.pdf
pdf_filename: filename.pdf
source_collection: collection_name
tags: []
---

## Summary
## Key Contributions
## Methodology and Architecture
## Results
## Related Papers
- [[category/page]] — relationship
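
The Related Papers links in this schema can be checked for dangling targets. The sketch below assumes that a link `[[category/page]]` resolves to the file wiki/{category}/{page}.md. That mapping is inferred from the Structure section and is not stated explicitly. The names `WIKILINK`, `wikilink_targets`, and `broken_links` are hypothetical.

```python
import os
import re

# Matches [[target]], [[target|alias]], and [[target#heading]]; captures only the target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def wikilink_targets(markdown_text):
    """Return link targets such as 'category/page', without alias or heading parts."""
    return [m.group(1).strip() for m in WIKILINK.finditer(markdown_text)]


def broken_links(markdown_text, wiki_dir="wiki"):
    """Return targets with no matching wiki/{category}/{page}.md file (assumed layout)."""
    return [
        t for t in wikilink_targets(markdown_text)
        if not os.path.isfile(os.path.join(wiki_dir, t + ".md"))
    ]
```

Running `broken_links` over each wiki page from the repository root lists links whose target page has not been created yet.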

Overview Pages (wiki/overviews/)

---
title: "Topic Title"
tags: [relevant-tags]
---

## Overview
## Timeline / Comparison Table
## Related Pages

Category Definitions

Category	Includes
`genomic-dl`	DNA LMs, variant effect prediction, regulatory genomics, sequence models
`single-cell-dl`	scRNA-seq DL, cell type annotation, integration, imputation, perturbation
`single-cell-foundation`	Geneformer, scGPT, virtual cells, large single-cell foundation models
`single-cell-methylation`	Single-cell DNA methylation analysis, epigenomic profiling
`protein-ai`	Protein LMs, structure prediction, PTM prediction
`gwas`	GWAS, common/rare variant methods, population genetics, LD, variant interpretation
`neuroscience`	ASD genetics, schizophrenia genetics, psychiatric genetics, disease gene functional studies
`brain-development`	Normal brain development, cortical biology, cerebral organoid methodology, neurogenesis
`brain-atlas`	Brain cell atlases, BICCN, spatial transcriptomics
`organoid`	Non-brain organoids: lung, kidney, liver, heart, gut, retinal; iPSC differentiation
`long-read`	PacBio, Oxford Nanopore, long-read DNA sequencing methods
`lrRNA`	Long-read RNA-seq: Iso-seq, MAS-seq, ONT cDNA/dRNA, transcript isoforms
`drug-resistance`	Cancer proteogenomics, drug resistance, cancer genomics, immunotherapy
`methylation-ai`	DNA methylation AI, epigenetic clocks
`methylation`	General DNA methylation biology
`statistics`	Statistical methods: FDR, rare variants, batch effects, Bayesian
`medical-llm`	Medical/clinical LLMs, NLP for EHR, clinical NLP
`sex-differences-biology`	Sex-specific genetic architecture, XWAS, sex-biased disease, X-inactivation
`reproductive-biology`	Germline development, PGC reprogramming, meiotic recombination, genomic imprinting
`meiosis`	Meiotic recombination, crossover mechanisms, synaptonemal complex
`synapse-evolution`	Synapse molecular evolution, postsynaptic density, comparative synaptomics
`aging`	Longevity genetics, lifespan QTL, aging biology
`other`	Cross-cutting, evolution, networks, benchmarks, misc
`concepts`	Key concepts, methods, algorithms explained
`overviews`	Synthesis pages, timelines, comparison tables

Design Principles

3-tier architecture: Raw PDF (immutable) → sources/*.md (summaries) → wiki/**/*.md (final)
Single wiki: All AI for Biology in one wiki, organized by category
Obsidian compatible: [[wikilinks]], standard markdown
English only: All content in English (for paper writing + RAG)
PDF extraction: Bash + opendataloader-pdf (NOT Claude Read tool). Fallback to pypdf.
Consistent YAML: Every source file has title, authors, year, doi, category, pdf_path, pdf_filename, source_collection
Knowledge compounding: Answers to queries saved as overviews/ pages
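
The "Consistent YAML" principle above can be enforced with a short check. This sketch scans the leading `---` block line by line instead of using a YAML parser, so it only recognizes unindented `key: value` lines. The names `REQUIRED_SOURCE_KEYS`, `front_matter_keys`, and `missing_source_keys` are hypothetical.

```python
# Keys every source file must carry, per the Consistent YAML principle.
REQUIRED_SOURCE_KEYS = [
    "title", "authors", "year", "doi", "category",
    "pdf_path", "pdf_filename", "source_collection",
]


def front_matter_keys(markdown_text):
    """Return the top-level keys found in a leading '---' front matter block."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Only unindented "key: value" lines count as top-level keys.
        if ":" in line and not line.startswith((" ", "\t", "#", "-")):
            keys.append(line.split(":", 1)[0].strip())
    return []  # Block never closed, so treat it as absent.


def missing_source_keys(markdown_text):
    """Return the required keys that the front matter does not define."""
    present = set(front_matter_keys(markdown_text))
    return [k for k in REQUIRED_SOURCE_KEYS if k not in present]
```

Calling `missing_source_keys` on the text of each file under sources/ returns an empty list when all eight fields are present.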

