Introducing Ghost Catalog: A Semantic File ID System for AI-Agent Development

The Problem

If you're working with AI agents (like Claude Code) on larger codebases, you've probably hit this wall:

The agent doesn't know what your files do.

Sure, it can read code. But when you have 50+ files, the agent wastes time:

Opening every file to understand what it does
Searching for related files by grepping random keywords
Asking you "what does this file do?"
Losing track of which agent created which file
Making breaking changes because it doesn't know the dependencies

There's a better way.

The Solution: Semantic File IDs

Instead of this:

src/
  cli.py          # What does this do? 🤷
  core.py         # No idea without reading it
  utils.py        # Generic name, could be anything

You get this:

src/
  cli.py          → SOM-SCR-0014-v1.0.0 "Management CLI" [cli, admin, sqlite]
  core.py         → SOM-SCR-0012-v1.1.0 "Main proxy addon" [proxy, mitmproxy]
  utils.py        → SOM-SCR-0008-v2.0.0 "Date utilities" [utilities, datetime]

Every file has:

A semantic ID (category + sequence + version)
A one-line description
Semantic tags for discovery
The agent that created it
How to run/use it
When it was created/modified

All of this is embedded in the file header (the first 13 lines), so it's tracked by Git and stays in sync with the file.
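To make the ID format concrete, here is a minimal sketch of how an ID such as SOM-SCR-0014-v1.0.0 could be split into its parts. The regular expression and the names `FileId` and `parse_file_id` are assumptions for illustration, inferred from the example IDs in this post; the actual CLI may parse IDs differently.

```python
import re
from typing import NamedTuple

# Assumed pattern, inferred from example IDs such as SOM-SCR-0014-v1.0.0:
# <PREFIX>-<3-letter CATEGORY>-<4-digit SEQUENCE>-v<MAJOR.MINOR.PATCH>
FILE_ID_RE = re.compile(
    r"^(?P<prefix>[A-Z]+)-(?P<category>[A-Z]{3})-(?P<sequence>\d{4})"
    r"-v(?P<version>\d+\.\d+\.\d+)$"
)


class FileId(NamedTuple):
    prefix: str
    category: str
    sequence: int
    version: tuple  # (major, minor, patch) as integers


def parse_file_id(text: str) -> FileId:
    """Split a file ID into prefix, category, sequence and version."""
    match = FILE_ID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid file ID: {text!r}")
    version = tuple(int(part) for part in match.group("version").split("."))
    return FileId(
        match.group("prefix"),
        match.group("category"),
        int(match.group("sequence")),
        version,
    )


print(parse_file_id("SOM-SCR-0014-v1.0.0"))
```

Storing the version as a tuple of integers means versions compare correctly (for example, (1, 10, 0) sorts after (1, 2, 0)), which plain string comparison would get wrong.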

What I Built

I spent a weekend building a complete, production-ready system for this. It has three parts:

1. Documentation (7,200+ lines)

Why semantic IDs beat UUIDs for AI development
Complete technical specification
Practical guide with real-world use cases
Decision framework (when to use / not use)

2. Working Tools (2,700+ lines of code)

CLI tool (Python): Initialize, sync, search, validate, export
TUI browser (Go/Bubble Tea): Interactive catalog browser with vim keybindings
AI integration (Python): Auto-tag files with GPT-4/Claude, analyze dependencies, generate docs

3. Complete Workflows

Project setup (3 commands)
Team onboarding (30 minutes vs 3 days)
Daily development
AI agent coordination

Real-World Use Cases

Use Case 1: New Developer Onboarding

Before (Traditional):

Day 1: Read README, still confused
Day 2: Open 50 files trying to understand architecture
Day 3: Finally find the right file to edit
First PR: Day 5

After (With Catalog):

Minute 1: Launch TUI, filter by "documentation" category
Minute 5: Read SOM-DOC-0001 (Quickstart)
Minute 10: Search by tag "authentication"
Minute 15: Open SOM-SCR-0023 (auth module), understand it
Minute 30: Make first code contribution

Time to first contribution: 5 days → 30 minutes

Use Case 2: AI Agent Coordination

Scenario: You have Claude, GPT-4, and a human developer all working on the same codebase.

Problem: Who changed what? Where are the OpenTelemetry files? What's safe to refactor?

Solution:

# Find all OTel files
ghost-catalog search --tag opentelemetry
# → 4 files found instantly

# Check who's working on what
ghost-catalog list --agent AGENT-CLAUDE-002 --sort modified
# → See Claude's recent work

# Before refactoring db_handler.py, check impact
ghost-catalog-ai analyze-deps --file-id SOM-SCR-XXXX
# → Shows 3 files depend on this (core.py, cli.py, collector.py)

Result: No conflicts, clear coordination, safe refactoring.
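The impact check above boils down to a reverse-dependency lookup. The sketch below shows the idea on a hand-written dependency map; it is not how ghost-catalog-ai computes dependencies, and the `DEPENDS_ON` data and function names are hypothetical, chosen to mirror the db_handler.py example.

```python
from collections import defaultdict

# Hypothetical dependency edges: each file maps to the files it imports.
DEPENDS_ON = {
    "core.py": ["db_handler.py"],
    "cli.py": ["db_handler.py"],
    "collector.py": ["db_handler.py", "core.py"],
    "db_handler.py": [],
}


def build_reverse_index(depends_on):
    """Invert the mapping: for each file, which files depend on it directly."""
    reverse = defaultdict(set)
    for source, targets in depends_on.items():
        for target in targets:
            reverse[target].add(source)
    return reverse


def impacted_by(target, depends_on):
    """Return every file that depends on `target`, directly or transitively."""
    reverse = build_reverse_index(depends_on)
    seen, stack = set(), [target]
    while stack:
        current = stack.pop()
        for dependent in reverse.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return sorted(seen)


print(impacted_by("db_handler.py", DEPENDS_ON))
```

Walking the reverse index transitively matters for refactoring: a file that only imports core.py is still affected if core.py breaks because of a change in db_handler.py.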

Use Case 3: Code Archaeology

Scenario: Production bug in collector.py. Need to know:

When was this created?
Who (which agent) created it?
Has it changed recently?
What does it depend on?

Without catalog:

git log --follow collector.py  # Shows commits, but unclear
git blame collector.py          # Shows line-by-line changes
# Still unclear: What does this file actually DO?

With catalog:

ghost-catalog info SOM-SCR-0013-v1.0.0

Output:

File: SOM-SCR-0013-v1.0.0
Name: collector.py
Description: Intelligence collector with Team Cymru integration
Created: 2025-11-23
Modified: 2025-11-24 (TODAY!)  ← Bug likely introduced today
Agent: AGENT-CLAUDE-002
Tags: [intel, asn, teamcymru, networking]
Execution: python -m ghost_shell.intel.collect

Immediately know:

What it does (intelligence collection)
When it changed (today)
Who changed it (Claude)
How to run it
Related functionality (tags)

Time to understand: 2 hours → 30 seconds

Use Case 4: Managing Large Codebases

Scenario: Your project has grown to 200+ files across multiple modules.

Challenges:

Find all files related to a feature
Identify unmaintained files
Audit version consistency before release
Generate project structure documentation

Solutions:

1. Find all proxy-related files:

ghost-catalog search --tag proxy
# → 5 files: blocker.py, core.py, fingerprint.py, cookies.py, PROXY_GUIDE.md

2. Find stale files (not updated in 6 months):

SELECT file_id, name, modified,
       julianday('now') - julianday(modified) as days_old
FROM file_catalog
WHERE days_old > 180;

3. Pre-release version audit:

ghost-catalog list --format json | jq '.[] | select(.version | startswith("0."))'
# → All pre-1.0 files that need review before release

4. Auto-generate project documentation:

ghost-catalog-ai generate-docs --type structure --project-id MY-PROJECT

Output: Complete markdown file with all files organized by category, with descriptions and tags.
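The stale-file query from step 2 can also be run from Python. This is a sketch under assumptions: the table is reduced to the three columns the query uses (the real schema has more), the sample rows are made up, and the reference date is passed in explicitly instead of using 'now' so the result is reproducible.

```python
import sqlite3


def find_stale(conn, today, max_age_days=180):
    """Return (file_id, name, days_old) for files not modified in max_age_days."""
    query = """
        SELECT file_id, name,
               CAST(julianday(?) - julianday(modified) AS INTEGER) AS days_old
        FROM file_catalog
        WHERE julianday(?) - julianday(modified) > ?
        ORDER BY days_old DESC
    """
    return conn.execute(query, (today, today, max_age_days)).fetchall()


# In-memory database with an assumed, minimal version of the catalog table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_catalog (file_id TEXT PRIMARY KEY, name TEXT, modified TEXT)"
)
conn.executemany(
    "INSERT INTO file_catalog VALUES (?, ?, ?)",
    [
        ("SOM-SCR-0008-v2.0.0", "utils.py", "2025-01-10"),  # hypothetical dates
        ("SOM-SCR-0012-v1.1.0", "core.py", "2025-11-01"),
    ],
)

stale = find_stale(conn, today="2025-11-24")
for file_id, name, days_old in stale:
    print(f"{file_id}  {name}  {days_old} days old")
```

The WHERE clause repeats the date expression rather than referring to the days_old alias. SQLite accepts the alias form used above, but most other SQL databases do not, so the repeated expression is the more portable choice.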

Use Case 5: Compliance & Security Audits

Scenario: Security team needs to:

List all files containing authentication code
Prove all code has been reviewed in the last quarter
Show change tracking and accountability

Catalog queries:

# All security-related files (shell)
ghost-catalog search --tag security --format csv > security_audit.csv

-- Files not modified in the last 90 days, i.e. candidates for review (SQL)
SELECT file_id, name, modified, agent_id
FROM file_catalog
WHERE modified < date('now', '-90 days')
  AND category IN ('script', 'configuration');

-- Agent activity report for the last 90 days (SQL)
SELECT agent_id, COUNT(*) as files_modified
FROM file_catalog
WHERE modified >= date('now', '-90 days')
GROUP BY agent_id;

Result: Complete audit trail with agent accountability.
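The agent activity report can be scripted for recurring audits. The sketch below assumes a cut-down file_catalog table with only the columns the report needs; the rows and the AGENT-HUMAN-001 identifier are invented for the demo, and the audit date is a parameter so the same report can be regenerated later.

```python
import sqlite3


def agent_activity(conn, as_of, window_days=90):
    """Count catalog entries per agent modified in the window ending at as_of."""
    query = """
        SELECT agent_id, COUNT(*) AS files_modified
        FROM file_catalog
        WHERE modified >= date(?, ?)
        GROUP BY agent_id
        ORDER BY files_modified DESC, agent_id
    """
    return conn.execute(query, (as_of, f"-{window_days} days")).fetchall()


# Assumed minimal table layout; the real schema has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_catalog ("
    "file_id TEXT PRIMARY KEY, name TEXT, modified TEXT, agent_id TEXT)"
)
conn.executemany(
    "INSERT INTO file_catalog VALUES (?, ?, ?, ?)",
    [
        ("SOM-SCR-0013-v1.0.0", "collector.py", "2025-11-20", "AGENT-CLAUDE-002"),
        ("SOM-SCR-0012-v1.1.0", "core.py", "2025-10-01", "AGENT-CLAUDE-002"),
        ("SOM-SCR-0014-v1.0.0", "cli.py", "2025-09-15", "AGENT-HUMAN-001"),
        ("SOM-SCR-0008-v2.0.0", "utils.py", "2025-03-01", "AGENT-HUMAN-001"),
    ],
)

report = agent_activity(conn, as_of="2025-11-24")
for agent_id, files_modified in report:
    print(f"{agent_id}: {files_modified}")
```

Note that the modified column records the last change, not a review, so a report like this shows activity rather than proof of review.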

Scenarios: When to Use This

✅ Perfect For

1. Multi-Agent Development

You have Claude, GPT-4, and humans all contributing
Need clear coordination and handoffs
Example: AI pair programming with multiple LLMs

2. Large Codebases (50+ files)

Hard to remember what each file does
Frequent onboarding of new developers
Example: Microservices architecture, monorepos

3. Rapid Onboarding

New team members join frequently
Want them productive in hours, not days
Example: Open-source projects, consulting teams

4. Compliance-Heavy Environments

Need audit trails
Track who changed what
Example: Healthcare, finance, government projects

5. AI-Generated Code

AI agents generate lots of files
Need to track which agent created what
Example: Heavy use of Cursor, Claude Code, or GitHub Copilot

❌ Not Ideal For

1. Tiny Projects (1-10 files)

Overhead isn't worth it
Use good filenames instead

2. Volatile Early-Stage Projects

Architecture changes daily
Files created/deleted constantly
Wait until things stabilize

3. Pure Human Teams with Established Tools

Team already uses Jira, Confluence, etc. effectively
No AI agent involvement
Stick with what works

4. External Codebases You Don't Control

Can't add headers to third-party libraries
Use external documentation instead

Comparison: This vs. Alternatives

vs. Git Commit Messages

Git Commits	Catalog Headers
✅ Tracks changes over time	✅ Tracks current state
❌ Doesn't explain current purpose	✅ Always shows what file does
❌ Noisy (hundreds of commits)	✅ Concise (one header)
❌ No semantic tags	✅ Tags for discovery

Use both: Git for history, Catalog for current state.

vs. Code Comments

Code Comments	Catalog Headers
✅ Inline explanations	✅ Structured metadata
❌ Scattered throughout file	✅ First 13 lines (consistent)
❌ No machine-readable format	✅ Parseable (grep, SQL)
❌ No versioning	✅ Semver built-in

Use both: Comments for code logic, Headers for file metadata.

vs. README Documentation

README	Catalog
✅ Human-friendly narrative	✅ Machine-queryable data
❌ Gets out of sync	✅ Lives with file (auto-synced)
❌ Static	✅ Auto-generated from catalog
❌ One file	✅ Every file documented

Use both: README for overview, Catalog for every file.

vs. IDEs (VSCode, IntelliJ)

IDEs	Catalog
✅ Go-to-definition	✅ Go-to-purpose
✅ Syntax highlighting	✅ Semantic discovery (tags)
❌ IDE-specific	✅ Tool-agnostic (CLI/TUI)
❌ No AI agent awareness	✅ Agent tracking built-in

Use both: IDE for coding, Catalog for discovery.

Quick Start (3 Commands)

# 1. Initialize
python ghost_catalog_cli.py init

# 2. Sync your codebase
python ghost_catalog_cli.py sync

# 3. Browse interactively
./ghost_catalog_tui

That's it. You now have a semantic catalog of your codebase.

What's Included

📚 Documentation:

Technical specification (1,500 lines)
Practical guide with use cases (3,000 lines)
Implementation suite guide (650 lines)

💻 Working Code:

CLI tool (Python, 776 lines)
TUI browser (Go/Bubble Tea, 653 lines)
AI integration (Python, 651 lines)

📊 Database:

SQLite schema (4 tables, 6 indexes)
20+ analytics queries
Dependency tracking

🎯 Everything Needed:

Installation instructions
Complete workflows
Troubleshooting guide
Production checklist

GitHub Repository

🔗 https://github.com/SoMaCoSF/ghost-catalog

Clone and use:

git clone https://github.com/SoMaCoSF/ghost-catalog.git
cd ghost-catalog
pip install -r requirements.txt
python ghost_catalog_cli.py init
python ghost_catalog_cli.py sync

Repository includes:

All working tools (CLI, TUI, AI integration)
Complete documentation (linked to gists)
Example files and workflows
MIT License

Gists (Complete Documentation)

All documentation is also available as gists:

Technical Deep Dive: https://gist.github.com/SoMaCoSF/38d0d859192546ca4add36e4f7351c7d
- Architecture, schemas, specifications
Practical Guide: https://gist.github.com/SoMaCoSF/edf5ba3afd8e8849903b9400add4d406
- Tutorials, use cases, decision framework
Implementation Suite: https://gist.github.com/SoMaCoSF/6e62939c6b9810aa9e4f3c604ac9f9fe
- Working code (CLI, TUI, AI)
How to Use (this document): https://gist.github.com/SoMaCoSF/b52b7f68d09d4752138fb0712e153c21
- Quick start, scenarios, examples

Who This Is For

✅ You should use this if you:

Work with AI coding assistants (Claude Code, Cursor, Copilot)
Manage codebases with 50+ files
Onboard new developers frequently
Use multiple AI agents on the same project
Need compliance/audit trails
Want semantic code organization

❌ Skip this if you:

Have 1-10 files
Work solo with no AI
Are on an early-stage project (architecture not stable)
Don't control the codebase (third-party)

Why I Built This

I was working with Claude Code on a privacy proxy project (Ghost_Shell) that merged two existing codebases (50+ files). Every time Claude had to understand a file, it would:

Read the entire file (wasting tokens)
Ask me what it does
Forget 10 minutes later and ask again

I thought: What if every file had a "name tag" that told the agent:

What it is
What it does
When it was created
Which agent made it
How to run it

So I built it. And it works incredibly well.

Now when Claude asks "what does this file do?", I just say:

Read the file_id header

And it gets everything it needs in 13 lines instead of reading 500 lines of code.

How to Incorporate Into Your Project

Option 1: Start Fresh (New Project)

# 1. Clone the repo
git clone https://github.com/SoMaCoSF/ghost-catalog.git
cd ghost-catalog

# 2. Install
pip install -r requirements.txt

# 3. Initialize in your project
cd /path/to/your/project
python /path/to/ghost-catalog/ghost_catalog_cli.py init

# 4. Create your first file with a header
# (Use the template from docs)

# 5. Sync
python /path/to/ghost-catalog/ghost_catalog_cli.py sync

Option 2: Migrate Existing Project

# 1. Backup your project
cp -r my_project my_project.backup

# 2. Add catalog headers to existing files
# (Use the bulk migration script from the docs)

# 3. Or use AI to help
python ghost_catalog_ai_integration.py auto-tag \
  --api-key $OPENAI_API_KEY \
  --dry-run  # Preview first

# 4. Sync to database
python ghost_catalog_cli.py sync

# 5. Validate
python ghost_catalog_cli.py validate

Option 3: Gradual Adoption

# 1. Start with new files only
# Add headers to new files as you create them

# 2. Add headers to frequently-edited files
# When you edit a file, add a header

# 3. Eventually cover 80% of important files
# Don't worry about 100% coverage
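If you adopt gradually, it helps to know how far along you are. Here is a rough sketch of a coverage check that counts how many files carry an ID in their first 13 lines. The ID pattern is inferred from the examples in this post, and the `# file_id:` line in the demo is a made-up header format, not the official template.

```python
import re
import tempfile
from itertools import islice
from pathlib import Path

HEADER_LINES = 13  # the header is described as the first 13 lines of a file
# Assumed ID shape, inferred from examples such as SOM-SCR-0014-v1.0.0.
FILE_ID_RE = re.compile(r"[A-Z]+-[A-Z]{3}-\d{4}-v\d+\.\d+\.\d+")


def has_header(path: Path) -> bool:
    """True if a file ID appears within the first HEADER_LINES lines."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return any(FILE_ID_RE.search(line) for line in islice(handle, HEADER_LINES))


def coverage(root: Path, pattern: str = "*.py") -> float:
    """Fraction of files matching pattern under root that carry a header."""
    files = [path for path in root.rglob(pattern) if path.is_file()]
    if not files:
        return 0.0
    return sum(has_header(path) for path in files) / len(files)


# Tiny demo in a temporary directory: one tagged file, one untagged file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tagged.py").write_text(
        "# file_id: SOM-SCR-0001-v1.0.0\nprint('hi')\n", encoding="utf-8"
    )
    (root / "untagged.py").write_text("print('no header')\n", encoding="utf-8")
    demo_coverage = coverage(root)
    print(f"coverage: {demo_coverage:.0%}")
```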

Example Integration: Claude Code Workflow

Before:

You: "Claude, refactor the authentication module"
Claude: "Sure, which file is that?"
You: "auth.py"
Claude: *reads entire 500-line file* "What does this depend on?"
You: "utils.py and db.py"
Claude: *reads both files* "Ok, I'll refactor"

After:

You: "Claude, refactor the authentication module"
Claude: *checks catalog* "Found SOM-SCR-0023 (auth.py)"
Claude: *checks dependencies in catalog* "Depends on SOM-SCR-0008 (utils) and SOM-SCR-0012 (db)"
Claude: "I'll refactor, this will impact 3 files: auth.py, login.py, session.py"
You: "Go ahead"

Result: Faster, more accurate, fewer back-and-forth questions.

Add This to Your Claude Code Workspace

Step 1: Add to .claude/commands/catalog.md:

# Catalog System Commands

When working with this codebase, use the catalog system:

## Find a file
/catalog search <query>

## Get file info
/catalog info <SOM-SCR-XXXX>

## List by category
/catalog list --category script

## Check dependencies
/catalog deps <SOM-SCR-XXXX>

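Step 2: Add workspace instructions. Claude Code loads project instructions from a CLAUDE.md file, so if your version ignores an "instructions" key in .claude/settings.json, put the same lines in CLAUDE.md instead: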

{
  "instructions": [
    "This codebase uses the Ghost Catalog system.",
    "Every file has a header with metadata (first 13 lines).",
    "Before reading a file, check its file_id header to understand what it does.",
    "File IDs format: SOM-<CATEGORY>-<SEQUENCE>-v<VERSION>",
    "Categories: SCR=script, DOC=documentation, CFG=configuration, TST=test",
    "Use 'ghost-catalog search' to find files by tag or description."
  ]
}
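The "check the header before reading the file" instruction amounts to reading at most 13 lines and looking for an ID. A minimal sketch of that step follows; `read_header` is a hypothetical helper, the ID pattern is inferred from the format string above, and the `# file_id:` and `# description:` lines are an invented header layout used only for the demo.

```python
import io
import re
from itertools import islice

HEADER_LINES = 13
# Assumed ID shape: <PREFIX>-<CATEGORY>-<SEQUENCE>-v<VERSION>
FILE_ID_RE = re.compile(r"[A-Z]+-[A-Z]{3}-\d{4}-v\d+\.\d+\.\d+")


def read_header(stream, limit=HEADER_LINES):
    """Read at most `limit` lines and return (file_id or None, header_lines)."""
    lines = [line.rstrip("\n") for line in islice(stream, limit)]
    for line in lines:
        match = FILE_ID_RE.search(line)
        if match:
            return match.group(0), lines
    return None, lines


# Simulate a 500-line source file with a two-line header in a made-up layout.
sample = io.StringIO(
    "# file_id: SOM-SCR-0023-v1.0.0\n"
    "# description: auth module\n"
    + "x = 1\n" * 500
)
file_id, header = read_header(sample)
print(file_id, len(header))
```

Because `islice` stops after 13 lines, the rest of the file is never read, which is where the token savings come from.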

Community & Support

GitHub Issues: https://github.com/SoMaCoSF/ghost-catalog/issues
Discussions: https://github.com/SoMaCoSF/ghost-catalog/discussions
Documentation: See gists linked above

Questions? Open a discussion on GitHub.

Found a bug? Open an issue.

Want to contribute? PRs welcome!

License

MIT License - Use freely in personal or commercial projects.

TL;DR

Problem: AI agents waste time understanding your codebase.

Solution: Add semantic file IDs (like "name tags" for code files).

Result:

Onboarding: 3 days → 30 minutes
File discovery: Minutes → 5 seconds
AI agent coordination: Chaos → Clear
Code archaeology: Hours → Instant

Get started:

git clone https://github.com/SoMaCoSF/ghost-catalog.git
cd ghost-catalog
pip install -r requirements.txt
python ghost_catalog_cli.py init
python ghost_catalog_cli.py sync

Read more: Check the gists linked in this post.

Built this over a weekend because I was tired of AI agents asking "what does this file do?" every 5 minutes. Hope it helps your workflow too!

— SoMaCoSF

P.S. If you use this, I'd love to hear about it. Drop a comment or open a discussion on GitHub!
