If you're working with AI agents (like Claude Code) on larger codebases, you've probably hit this wall:
The agent doesn't know what your files do.
Sure, it can read code. But when you have 50+ files, the agent wastes time:
- Opening every file to understand what it does
- Searching for related files by grepping random keywords
- Asking you "what does this file do?"
- Losing context about which agent created which file
- Breaking changes because it doesn't know dependencies
There's a better way.
Instead of this:
src/
cli.py # What does this do? 🤷
core.py # No idea without reading it
utils.py # Generic name, could be anything
You get this:
src/
cli.py → SOM-SCR-0014-v1.0.0 "Management CLI" [cli, admin, sqlite]
core.py → SOM-SCR-0012-v1.1.0 "Main proxy addon" [proxy, mitmproxy]
utils.py → SOM-SCR-0008-v2.0.0 "Date utilities" [utilities, datetime]
Every file has:
- A semantic ID (category + sequence + version)
- A one-line description
- Semantic tags for discovery
- The agent that created it
- How to run/use it
- When it was created/modified
Embedded in the file header (first 13 lines), so it's Git-trackable and always in sync.
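To make the ID format concrete, here is a minimal sketch that splits an ID such as SOM-SCR-0014-v1.0.0 into its parts. The layout (prefix, category, four-digit sequence, semver) is inferred from the examples above. The regex and the `parse_file_id` helper are illustrative assumptions, not code from the ghost-catalog tools.

```python
import re
from typing import NamedTuple

# Assumed layout, inferred from the examples: SOM-<CATEGORY>-<SEQUENCE>-v<MAJOR.MINOR.PATCH>
FILE_ID_RE = re.compile(
    r"^(?P<prefix>[A-Z]+)-(?P<category>[A-Z]+)-(?P<sequence>\d{4})"
    r"-v(?P<version>\d+\.\d+\.\d+)$"
)


class FileId(NamedTuple):
    prefix: str
    category: str
    sequence: int
    version: tuple


def parse_file_id(file_id: str) -> FileId:
    """Split a semantic file ID into prefix, category, sequence and version."""
    match = FILE_ID_RE.match(file_id)
    if match is None:
        raise ValueError(f"not a valid file ID: {file_id!r}")
    version = tuple(int(part) for part in match.group("version").split("."))
    return FileId(
        match.group("prefix"),
        match.group("category"),
        int(match.group("sequence")),
        version,
    )


parsed = parse_file_id("SOM-SCR-0014-v1.0.0")
print(parsed.category, parsed.sequence, parsed.version)  # SCR 14 (1, 0, 0)
```

Because the version is parsed into a tuple of integers, IDs can be compared or sorted by version without string tricks.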
I spent a weekend building a complete, production-ready system for this. Three things:
Documentation:
- Why semantic IDs beat UUIDs for AI development
- Complete technical specification
- Practical guide with real-world use cases
- Decision framework (when to use / not use)
Working tools:
- CLI tool (Python): Initialize, sync, search, validate, export
- TUI browser (Go/Bubble Tea): Interactive catalog browser with vim keybindings
- AI integration (Python): Auto-tag files with GPT-4/Claude, analyze dependencies, generate docs
Workflows:
- Project setup (3 commands)
- Team onboarding (30 minutes vs 3 days)
- Daily development
- AI agent coordination
Before (Traditional):
Day 1: Read README, still confused
Day 2: Open 50 files trying to understand architecture
Day 3: Finally find the right file to edit
First PR: Day 5
After (With Catalog):
Minute 1: Launch TUI, filter by "documentation" category
Minute 5: Read SOM-DOC-0001 (Quickstart)
Minute 10: Search by tag "authentication"
Minute 15: Open SOM-SCR-0023 (auth module), understand it
Minute 30: Make first code contribution
Time to first contribution: about 5 days → about 30 minutes
Scenario: You have Claude, GPT-4, and a human developer all working on the same codebase.
Problem: Who changed what? Where are the OpenTelemetry files? What's safe to refactor?
Solution:
# Find all OTel files
ghost-catalog search --tag opentelemetry
# → 4 files found instantly
# Check who's working on what
ghost-catalog list --agent AGENT-CLAUDE-002 --sort modified
# → See Claude's recent work
# Before refactoring db_handler.py, check impact
ghost-catalog-ai analyze-deps --file-id SOM-SCR-XXXX
# → Shows 3 files depend on this (core.py, cli.py, collector.py)
Result: No conflicts, clear coordination, safe refactoring.
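The impact check boils down to a reverse lookup over a dependency map. Here is a sketch, assuming the dependencies are available as a plain dict from each file to the files it depends on. The file names come from the example above; the data and the `dependents_of` helper are hypothetical, and `ghost-catalog-ai analyze-deps` may well work differently.

```python
from collections import defaultdict

# Hypothetical data: each file mapped to the files it depends on
DEPENDS_ON = {
    "core.py": ["db_handler.py"],
    "cli.py": ["db_handler.py"],
    "collector.py": ["db_handler.py", "core.py"],
    "db_handler.py": [],
}


def dependents_of(target, depends_on):
    """Return the sorted list of files that directly depend on target."""
    reverse = defaultdict(set)
    for source, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(source)
    return sorted(reverse[target])


print(dependents_of("db_handler.py", DEPENDS_ON))
# ['cli.py', 'collector.py', 'core.py']
```

This only covers direct dependents; a transitive impact check would walk the reversed map recursively.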
Scenario: Production bug in collector.py. Need to know:
- When was this created?
- Who (which agent) created it?
- Has it changed recently?
- What does it depend on?
Without catalog:
git log --follow collector.py # Shows commits, but unclear
git blame collector.py # Shows line-by-line changes
# Still unclear: What does this file actually DO?
With catalog:
ghost-catalog info SOM-SCR-0013-v1.0.0
Output:
File: SOM-SCR-0013-v1.0.0
Name: collector.py
Description: Intelligence collector with Team Cymru integration
Created: 2025-11-23
Modified: 2025-11-24 (TODAY!) ← Bug likely introduced today
Agent: AGENT-CLAUDE-002
Tags: [intel, asn, teamcymru, networking]
Execution: python -m ghost_shell.intel.collect
Immediately know:
- What it does (intelligence collection)
- When it changed (today)
- Who changed it (Claude)
- How to run it
- Related functionality (tags)
Time to understand: 2 hours → 30 seconds
Scenario: Your project has grown to 200+ files across multiple modules.
Challenges:
- Find all files related to a feature
- Identify unmaintained files
- Audit version consistency before release
- Generate project structure documentation
Solutions:
1. Find all proxy-related files:
ghost-catalog search --tag proxy
# → 5 files: blocker.py, core.py, fingerprint.py, cookies.py, PROXY_GUIDE.md
2. Find stale files (not updated in 6 months):
SELECT file_id, name, modified,
julianday('now') - julianday(modified) AS days_old
FROM file_catalog
WHERE days_old > 180;
3. Pre-release version audit:
ghost-catalog list --format json | jq '.[] | select(.version | startswith("0."))'
# → All pre-1.0 files that need review before release
4. Auto-generate project documentation:
ghost-catalog-ai generate-docs --type structure --project-id MY-PROJECT
Output: Complete markdown file with all files organized by category, with descriptions and tags.
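If jq isn't available, the pre-release audit from item 3 is easy to reproduce in Python. The sketch below assumes the JSON export is a list of objects with a `version` field, which is all the jq filter implies. The sample records and the `pre_release_entries` helper are hypothetical.

```python
import json

# Hypothetical sample of what `ghost-catalog list --format json` might emit;
# only the "version" field is implied by the jq filter.
sample_output = """
[
  {"file_id": "SOM-SCR-0014-v1.0.0", "name": "cli.py", "version": "1.0.0"},
  {"file_id": "SOM-SCR-0031-v0.3.0", "name": "exporter.py", "version": "0.3.0"},
  {"file_id": "SOM-CFG-0002-v0.9.1", "name": "settings.yaml", "version": "0.9.1"}
]
"""


def pre_release_entries(entries):
    """Keep entries whose version starts with "0.", like the jq filter does."""
    return [entry for entry in entries if entry["version"].startswith("0.")]


entries = json.loads(sample_output)
for entry in pre_release_entries(entries):
    print(entry["file_id"], entry["name"])
```

In practice you would feed it the real export, for example by piping the CLI output into `json.load(sys.stdin)`.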
Scenario: Security team needs to:
- List all files containing authentication code
- Prove all code has been reviewed in last quarter
- Show change tracking and accountability
Catalog queries:
# All security-related files
ghost-catalog search --tag security --format csv > security_audit.csv
-- Files not touched in 90 days (SQL; the modified date is used as a proxy for review)
SELECT file_id, name, modified, agent_id
FROM file_catalog
WHERE modified < date('now', '-90 days')
AND category IN ('script', 'configuration');
-- Agent activity report (SQL)
SELECT agent_id, COUNT(*) AS files_modified
FROM file_catalog
WHERE modified >= date('now', '-90 days')
GROUP BY agent_id;
Result: Complete audit trail with agent accountability.
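You can try the agent activity report without the real catalog. This sketch builds a throwaway in-memory SQLite table containing only the columns the queries above use. The real ghost-catalog schema has more tables and columns, and the sample rows (including the `AGENT-HUMAN-001` ID) are invented for illustration.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_catalog ("
    "file_id TEXT, name TEXT, category TEXT, modified TEXT, agent_id TEXT)"
)


def days_ago(n):
    """ISO date n days before today, matching SQLite's date() text format."""
    return (date.today() - timedelta(days=n)).isoformat()


# Invented sample rows
conn.executemany(
    "INSERT INTO file_catalog VALUES (?, ?, ?, ?, ?)",
    [
        ("SOM-SCR-0013-v1.0.0", "collector.py", "script", days_ago(1), "AGENT-CLAUDE-002"),
        ("SOM-SCR-0012-v1.1.0", "core.py", "script", days_ago(30), "AGENT-CLAUDE-002"),
        ("SOM-SCR-0008-v2.0.0", "utils.py", "script", days_ago(200), "AGENT-HUMAN-001"),
    ],
)


def agent_activity(conn, days=90):
    """Count files modified per agent within the last `days` days."""
    query = """
        SELECT agent_id, COUNT(*) AS files_modified
        FROM file_catalog
        WHERE modified >= date('now', ?)
        GROUP BY agent_id
        ORDER BY files_modified DESC
    """
    return conn.execute(query, (f"-{days} days",)).fetchall()


print(agent_activity(conn))  # [('AGENT-CLAUDE-002', 2)]
```

The date comparison works on plain text because ISO dates sort lexicographically, which is also why the original queries need `modified` stored in `YYYY-MM-DD` form.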
1. Multi-Agent Development
- You have Claude, GPT-4, and humans all contributing
- Need clear coordination and handoffs
- Example: AI pair programming with multiple LLMs
2. Large Codebases (50+ files)
- Hard to remember what each file does
- Frequent onboarding of new developers
- Example: Microservices architecture, monorepos
3. Rapid Onboarding
- New team members join frequently
- Want them productive in hours, not days
- Example: Open-source projects, consulting teams
4. Compliance-Heavy Environments
- Need audit trails
- Track who changed what
- Example: Healthcare, finance, government projects
5. AI-Generated Code
- AI agents generate lots of files
- Need to track which agent created what
- Example: Cursor, Claude Code, GitHub Copilot heavy usage
1. Tiny Projects (1-10 files)
- Overhead isn't worth it
- Use good filenames instead
2. Volatile Early-Stage Projects
- Architecture changes daily
- Files created/deleted constantly
- Wait until things stabilize
3. Pure Human Teams with Established Tools
- Team already uses Jira, Confluence, etc. effectively
- No AI agent involvement
- Stick with what works
4. External Codebases You Don't Control
- Can't add headers to third-party libraries
- Use external documentation instead
| Git Commits | Catalog Headers |
|---|---|
| ✅ Tracks changes over time | ✅ Tracks current state |
| ❌ Doesn't explain current purpose | ✅ Always shows what file does |
| ❌ Noisy (hundreds of commits) | ✅ Concise (one header) |
| ❌ No semantic tags | ✅ Tags for discovery |
Use both: Git for history, Catalog for current state.
| Code Comments | Catalog Headers |
|---|---|
| ✅ Inline explanations | ✅ Structured metadata |
| ❌ Scattered throughout file | ✅ First 13 lines (consistent) |
| ❌ No machine-readable format | ✅ Parseable (grep, SQL) |
| ❌ No versioning | ✅ Semver built-in |
Use both: Comments for code logic, Headers for file metadata.
| README | Catalog |
|---|---|
| ✅ Human-friendly narrative | ✅ Machine-queryable data |
| ❌ Gets out of sync | ✅ Lives with file (auto-synced) |
| ❌ Static | ✅ Auto-generated from catalog |
| ❌ One file | ✅ Every file documented |
Use both: README for overview, Catalog for every file.
| IDEs | Catalog |
|---|---|
| ✅ Go-to-definition | ✅ Go-to-purpose |
| ✅ Syntax highlighting | ✅ Semantic discovery (tags) |
| ❌ IDE-specific | ✅ Tool-agnostic (CLI/TUI) |
| ❌ No AI agent awareness | ✅ Agent tracking built-in |
Use both: IDE for coding, Catalog for discovery.
# 1. Initialize
python ghost_catalog_cli.py init
# 2. Sync your codebase
python ghost_catalog_cli.py sync
# 3. Browse interactively
./ghost_catalog_tui
That's it. You now have a semantic catalog of your codebase.
📚 Documentation:
- Technical specification (1,500 lines)
- Practical guide with use cases (3,000 lines)
- Implementation suite guide (650 lines)
💻 Working Code:
- CLI tool (Python, 776 lines)
- TUI browser (Go/Bubble Tea, 653 lines)
- AI integration (Python, 651 lines)
📊 Database:
- SQLite schema (4 tables, 6 indexes)
- 20+ analytics queries
- Dependency tracking
🎯 Everything Needed:
- Installation instructions
- Complete workflows
- Troubleshooting guide
- Production checklist
🔗 https://github.com/SoMaCoSF/ghost-catalog
Clone and use:
git clone https://github.com/SoMaCoSF/ghost-catalog.git
cd ghost-catalog
pip install -r requirements.txt
python ghost_catalog_cli.py init
python ghost_catalog_cli.py sync
Repository includes:
- All working tools (CLI, TUI, AI integration)
- Complete documentation (linked to gists)
- Example files and workflows
- MIT License
All documentation is also available as gists:
- Technical Deep Dive: https://gist.github.com/SoMaCoSF/38d0d859192546ca4add36e4f7351c7d
  - Architecture, schemas, specifications
- Practical Guide: https://gist.github.com/SoMaCoSF/edf5ba3afd8e8849903b9400add4d406
  - Tutorials, use cases, decision framework
- Implementation Suite: https://gist.github.com/SoMaCoSF/6e62939c6b9810aa9e4f3c604ac9f9fe
  - Working code (CLI, TUI, AI)
- How to Use (this document): https://gist.github.com/SoMaCoSF/b52b7f68d09d4752138fb0712e153c21
  - Quick start, scenarios, examples
✅ You should use this if you:
- Work with AI coding assistants (Claude Code, Cursor, Copilot)
- Manage codebases with 50+ files
- Onboard new developers frequently
- Use multiple AI agents on the same project
- Need compliance/audit trails
- Want semantic code organization
❌ Skip this if you:
- Have 1-10 files
- Work solo with no AI
- Early-stage project (architecture not stable)
- Don't control the codebase (third-party)
I was working with Claude Code on a privacy proxy project (Ghost_Shell) that merged two existing codebases (50+ files). Every time Claude had to understand a file, it would:
- Read the entire file (wasting tokens)
- Ask me what it does
- Forget 10 minutes later and ask again
I thought: What if every file had a "name tag" that told the agent:
- What it is
- What it does
- When it was created
- Which agent made it
- How to run it
So I built it. And it works incredibly well.
Now when Claude asks "what does this file do?", I just say:
Read the file_id header
And it gets everything it needs in 13 lines instead of reading 500 lines of code.
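Reading only the header is cheap to do in code as well. This minimal sketch assumes only that the header occupies the first 13 lines of the file. The `read_header` helper and the demo file are hypothetical and say nothing about the real header layout.

```python
import tempfile
from itertools import islice
from pathlib import Path

HEADER_LINES = 13  # header length stated in this post


def read_header(path, max_lines=HEADER_LINES):
    """Return at most the first max_lines lines, without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, max_lines)]


# Demo with a throwaway 500-line file
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.py"
    demo.write_text(
        "\n".join(f"# line {i}" for i in range(1, 501)), encoding="utf-8"
    )
    header = read_header(demo)

print(len(header))   # 13
print(header[-1])    # "# line 13"
```

`islice` stops iterating after the 13th line, so the cost stays the same whether the file has 50 lines or 5,000.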
# 1. Clone the repo
git clone https://github.com/SoMaCoSF/ghost-catalog.git
cd ghost-catalog
# 2. Install
pip install -r requirements.txt
# 3. Initialize in your project
cd /path/to/your/project
python /path/to/ghost-catalog/ghost_catalog_cli.py init
# 4. Create your first file with a header
# (Use the template from docs)
# 5. Sync
python /path/to/ghost-catalog/ghost_catalog_cli.py sync

# 1. Backup your project
cp -r my_project my_project.backup
# 2. Add catalog headers to existing files
# (Use the bulk migration script from the docs)
# 3. Or use AI to help
python ghost_catalog_ai_integration.py auto-tag \
--api-key $OPENAI_API_KEY \
--dry-run # Preview first
# 4. Sync to database
python ghost_catalog_cli.py sync
# 5. Validate
python ghost_catalog_cli.py validate

# 1. Start with new files only
# Add headers to new files as you create them
# 2. Add headers to frequently-edited files
# When you edit a file, add a header
# 3. Eventually cover 80% of important files
# Don't worry about 100% coverage

Before:
You: "Claude, refactor the authentication module"
Claude: "Sure, which file is that?"
You: "auth.py"
Claude: *reads entire 500-line file* "What does this depend on?"
You: "utils.py and db.py"
Claude: *reads both files* "Ok, I'll refactor"
After:
You: "Claude, refactor the authentication module"
Claude: *checks catalog* "Found SOM-SCR-0023 (auth.py)"
Claude: *checks dependencies in catalog* "Depends on SOM-SCR-0008 (utils) and SOM-SCR-0012 (db)"
Claude: "I'll refactor, this will impact 3 files: auth.py, login.py, session.py"
You: "Go ahead"
Result: Faster, more accurate, fewer back-and-forth questions.
Step 1: Add to .claude/commands/catalog.md:
# Catalog System Commands
When working with this codebase, use the catalog system:
## Find a file
/catalog search <query>
## Get file info
/catalog info <SOM-SCR-XXXX>
## List by category
/catalog list --category script
## Check dependencies
/catalog deps <SOM-SCR-XXXX>
Step 2: Add to workspace instructions (.claude/settings.json):
{
"instructions": [
"This codebase uses the Ghost Catalog system.",
"Every file has a header with metadata (first 13 lines).",
"Before reading a file, check its file_id header to understand what it does.",
"File IDs format: SOM-<CATEGORY>-<SEQUENCE>-v<VERSION>",
"Categories: SCR=script, DOC=documentation, CFG=configuration, TST=test",
"Use 'ghost-catalog search' to find files by tag or description."
]
}
GitHub Issues: https://github.com/SoMaCoSF/ghost-catalog/issues
Discussions: https://github.com/SoMaCoSF/ghost-catalog/discussions
Documentation: See gists linked above
Questions? Open a discussion on GitHub.
Found a bug? Open an issue.
Want to contribute? PRs welcome!
MIT License - Use freely in personal or commercial projects.
Problem: AI agents waste time understanding your codebase.
Solution: Add semantic file IDs (like "name tags" for code files).
Result:
- Onboarding: 3 days → 30 minutes
- File discovery: Minutes → 5 seconds
- AI agent coordination: Chaos → Clear
- Code archaeology: Hours → Instant
Get started:
git clone https://github.com/SoMaCoSF/ghost-catalog.git
cd ghost-catalog
pip install -r requirements.txt
python ghost_catalog_cli.py init
python ghost_catalog_cli.py sync
Read more: Check the gists linked in this post.
Built this over a weekend because I was tired of AI agents asking "what does this file do?" every 5 minutes. Hope it helps your workflow too!
— SoMaCoSF
P.S. If you use this, I'd love to hear about it. Drop a comment or open a discussion on GitHub!