@SoMaCoSF
Last active November 26, 2025 16:06
Ghost Catalog: Introduction and Use Cases - Complete guide with GitHub repository link

Introducing Ghost Catalog: A Semantic File ID System for AI-Agent Development

The Problem

If you're working with AI agents (like Claude Code) on larger codebases, you've probably hit this wall:

The agent doesn't know what your files do.

Sure, it can read code. But when you have 50+ files, the agent wastes time:

  • Opening every file to understand what it does
  • Searching for related files by grepping random keywords
  • Asking you "what does this file do?"
  • Losing context about which agent created which file
  • Breaking changes because it doesn't know dependencies

There's a better way.


The Solution: Semantic File IDs

Instead of this:

src/
  cli.py          # What does this do? 🤷
  core.py         # No idea without reading it
  utils.py        # Generic name, could be anything

You get this:

src/
  cli.py          → SOM-SCR-0014-v1.0.0 "Management CLI" [cli, admin, sqlite]
  core.py         → SOM-SCR-0012-v1.1.0 "Main proxy addon" [proxy, mitmproxy]
  utils.py        → SOM-SCR-0008-v2.0.0 "Date utilities" [utilities, datetime]

Every file has:

  • A semantic ID (category + sequence + version)
  • A one-line description
  • Semantic tags for discovery
  • The agent that created it
  • How to run/use it
  • When it was created/modified

Embedded in the file header (first 13 lines), so it's Git-trackable and always in sync.
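To make the ID format concrete, here is a minimal Python sketch that splits an ID such as SOM-SCR-0014-v1.0.0 into its parts. The three-letter category code and four-digit sequence are assumptions read off the examples in this post, and `parse_file_id` is a hypothetical helper, not part of the Ghost Catalog tools.

```python
import re
from typing import NamedTuple, Tuple

# Assumed shape, inferred from examples like SOM-SCR-0014-v1.0.0:
# prefix, 3-letter category code, 4-digit sequence, semantic version.
FILE_ID_RE = re.compile(
    r"^(?P<prefix>[A-Z]+)-(?P<category>[A-Z]{3})-(?P<sequence>\d{4})"
    r"-v(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


class FileId(NamedTuple):
    prefix: str
    category: str
    sequence: int
    version: Tuple[int, int, int]


def parse_file_id(text: str) -> FileId:
    """Split a semantic file ID into its parts, or raise ValueError."""
    match = FILE_ID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid file ID: {text!r}")
    return FileId(
        prefix=match["prefix"],
        category=match["category"],
        sequence=int(match["sequence"]),
        version=(int(match["major"]), int(match["minor"]), int(match["patch"])),
    )


print(parse_file_id("SOM-SCR-0014-v1.0.0"))
```

Keeping the version as a tuple of integers means versions compare correctly (for example, (1, 10, 0) sorts after (1, 9, 0)), which plain string comparison would get wrong.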


What I Built

I spent a weekend building a complete, production-ready system for this. Three things:

1. Documentation (7,200+ lines)

  • Why semantic IDs beat UUIDs for AI development
  • Complete technical specification
  • Practical guide with real-world use cases
  • Decision framework (when to use / not use)

2. Working Tools (2,700+ lines of code)

  • CLI tool (Python): Initialize, sync, search, validate, export
  • TUI browser (Go/Bubble Tea): Interactive catalog browser with vim keybindings
  • AI integration (Python): Auto-tag files with GPT-4/Claude, analyze dependencies, generate docs

3. Complete Workflows

  • Project setup (3 commands)
  • Team onboarding (30 minutes vs 3 days)
  • Daily development
  • AI agent coordination

Real-World Use Cases

Use Case 1: New Developer Onboarding

Before (Traditional):

Day 1: Read README, still confused
Day 2: Open 50 files trying to understand architecture
Day 3: Finally find the right file to edit
First PR: Day 5

After (With Catalog):

Minute 1: Launch TUI, filter by "documentation" category
Minute 5: Read SOM-DOC-0001 (Quickstart)
Minute 10: Search by tag "authentication"
Minute 15: Open SOM-SCR-0023 (auth module), understand it
Minute 30: Make first code contribution

Time to first contribution: 5 days → 30 minutes


Use Case 2: AI Agent Coordination

Scenario: You have Claude, GPT-4, and a human developer all working on the same codebase.

Problem: Who changed what? Where are the OpenTelemetry files? What's safe to refactor?

Solution:

# Find all OTel files
ghost-catalog search --tag opentelemetry
# → 4 files found instantly

# Check who's working on what
ghost-catalog list --agent AGENT-CLAUDE-002 --sort modified
# → See Claude's recent work

# Before refactoring db_handler.py, check impact
ghost-catalog-ai analyze-deps --file-id SOM-SCR-XXXX
# → Shows 3 files depend on this (core.py, cli.py, collector.py)

Result: No conflicts, clear coordination, safe refactoring.
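The impact check amounts to a reverse lookup over a dependency map. The sketch below illustrates the idea in Python; the file names come from the example above, but the edges are made up and `dependents_of` is a hypothetical helper, not how `ghost-catalog-ai` is actually implemented.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical dependency data: file -> files it depends on.
# The names are borrowed from the example; the edges are invented.
DEPENDS_ON = {
    "core.py": ["db_handler.py"],
    "cli.py": ["db_handler.py"],
    "collector.py": ["db_handler.py", "core.py"],
    "db_handler.py": [],
}


def dependents_of(target: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Return the files that directly depend on `target`, sorted by name."""
    reverse = defaultdict(set)
    for source, deps in depends_on.items():
        for dep in deps:
            reverse[dep].add(source)
    return sorted(reverse.get(target, set()))


print(dependents_of("db_handler.py", DEPENDS_ON))
# → ['cli.py', 'collector.py', 'core.py']
```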


Use Case 3: Code Archaeology

Scenario: Production bug in collector.py. Need to know:

  • When was this created?
  • Who (which agent) created it?
  • Has it changed recently?
  • What does it depend on?

Without catalog:

git log --follow collector.py  # Shows commits, but unclear
git blame collector.py          # Shows line-by-line changes
# Still unclear: What does this file actually DO?

With catalog:

ghost-catalog info SOM-SCR-0013-v1.0.0

Output:

File: SOM-SCR-0013-v1.0.0
Name: collector.py
Description: Intelligence collector with Team Cymru integration
Created: 2025-11-23
Modified: 2025-11-24 (TODAY!)  ← Bug likely introduced today
Agent: AGENT-CLAUDE-002
Tags: [intel, asn, teamcymru, networking]
Execution: python -m ghost_shell.intel.collect

Immediately know:

  • What it does (intelligence collection)
  • When it changed (today)
  • Who changed it (Claude)
  • How to run it
  • Related functionality (tags)

Time to understand: 2 hours → 30 seconds


Use Case 4: Managing Large Codebases

Scenario: Your project has grown to 200+ files across multiple modules.

Challenges:

  • Find all files related to a feature
  • Identify unmaintained files
  • Audit version consistency before release
  • Generate project structure documentation

Solutions:

1. Find all proxy-related files:

ghost-catalog search --tag proxy
# → 5 files: blocker.py, core.py, fingerprint.py, cookies.py, PROXY_GUIDE.md

2. Find stale files (not updated in 6 months):

SELECT file_id, name, modified,
       julianday('now') - julianday(modified) as days_old
FROM file_catalog
WHERE days_old > 180;

3. Pre-release version audit:

ghost-catalog list --format json | jq '.[] | select(.version | startswith("0."))'
# → All pre-1.0 files that need review before release

4. Auto-generate project documentation:

ghost-catalog-ai generate-docs --type structure --project-id MY-PROJECT

Output: Complete markdown file with all files organized by category, with descriptions and tags.
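The stale-file query from step 2 can also be run from Python with the standard `sqlite3` module. This is a sketch only: the in-memory table below is a minimal stand-in with just the three columns the query uses, and `stale_files` is a hypothetical helper. The real catalog schema has more columns, and the database file location is not assumed here.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
# Minimal stand-in for the catalog table (assumption: real schema is larger).
conn.execute("CREATE TABLE file_catalog (file_id TEXT, name TEXT, modified TEXT)")

today = date.today()
rows = [
    ("SOM-SCR-0008-v2.0.0", "utils.py", (today - timedelta(days=400)).isoformat()),
    ("SOM-SCR-0012-v1.1.0", "core.py", (today - timedelta(days=10)).isoformat()),
]
conn.executemany("INSERT INTO file_catalog VALUES (?, ?, ?)", rows)


def stale_files(connection, max_age_days=180):
    """Return (file_id, name, days_old) rows older than max_age_days."""
    query = """
        SELECT file_id, name,
               CAST(julianday('now') - julianday(modified) AS INTEGER) AS days_old
        FROM file_catalog
        WHERE julianday('now') - julianday(modified) > ?
        ORDER BY days_old DESC
    """
    return connection.execute(query, (max_age_days,)).fetchall()


for file_id, name, days_old in stale_files(conn):
    print(f"{file_id}  {name}  {days_old} days since last change")
```

The age expression is repeated in the WHERE clause instead of reusing the `days_old` alias. SQLite accepts the alias there, but most other SQL databases do not, so the repeated form stays portable.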


Use Case 5: Compliance & Security Audits

Scenario: Security team needs to:

  • List all files containing authentication code
  • Prove all code has been reviewed in the last quarter
  • Show change tracking and accountability

Catalog queries:

# All security-related files
ghost-catalog search --tag security --format csv > security_audit.csv

-- Files not modified in the last 90 days (SQL, run against the catalog database)
SELECT file_id, name, modified, agent_id
FROM file_catalog
WHERE modified < date('now', '-90 days')
  AND category IN ('script', 'configuration');

-- Agent activity report (SQL)
SELECT agent_id, COUNT(*) as files_modified
FROM file_catalog
WHERE modified >= date('now', '-90 days')
GROUP BY agent_id;

Result: Complete audit trail with agent accountability.


Scenarios: When to Use This

Perfect For

1. Multi-Agent Development

  • You have Claude, GPT-4, and humans all contributing
  • Need clear coordination and handoffs
  • Example: AI pair programming with multiple LLMs

2. Large Codebases (50+ files)

  • Hard to remember what each file does
  • Frequent onboarding of new developers
  • Example: Microservices architecture, monorepos

3. Rapid Onboarding

  • New team members join frequently
  • Want them productive in hours, not days
  • Example: Open-source projects, consulting teams

4. Compliance-Heavy Environments

  • Need audit trails
  • Track who changed what
  • Example: Healthcare, finance, government projects

5. AI-Generated Code

  • AI agents generate lots of files
  • Need to track which agent created what
  • Example: Cursor, Claude Code, GitHub Copilot heavy usage

Not Ideal For

1. Tiny Projects (1-10 files)

  • Overhead isn't worth it
  • Use good filenames instead

2. Volatile Early-Stage Projects

  • Architecture changes daily
  • Files created/deleted constantly
  • Wait until things stabilize

3. Pure Human Teams with Established Tools

  • Team already uses Jira, Confluence, etc. effectively
  • No AI agent involvement
  • Stick with what works

4. External Codebases You Don't Control

  • Can't add headers to third-party libraries
  • Use external documentation instead

Comparison: This vs. Alternatives

vs. Git Commit Messages

| Git Commits | Catalog Headers |
|---|---|
| ✅ Tracks changes over time | ✅ Tracks current state |
| ❌ Doesn't explain current purpose | ✅ Always shows what file does |
| ❌ Noisy (hundreds of commits) | ✅ Concise (one header) |
| ❌ No semantic tags | ✅ Tags for discovery |

Use both: Git for history, Catalog for current state.


vs. Code Comments

| Code Comments | Catalog Headers |
|---|---|
| ✅ Inline explanations | ✅ Structured metadata |
| ❌ Scattered throughout file | ✅ First 13 lines (consistent) |
| ❌ No machine-readable format | ✅ Parseable (grep, SQL) |
| ❌ No versioning | ✅ Semver built-in |

Use both: Comments for code logic, Headers for file metadata.


vs. README Documentation

| README | Catalog |
|---|---|
| ✅ Human-friendly narrative | ✅ Machine-queryable data |
| ❌ Gets out of sync | ✅ Lives with file (auto-synced) |
| ❌ Static | ✅ Auto-generated from catalog |
| ❌ One file | ✅ Every file documented |

Use both: README for overview, Catalog for every file.


vs. IDEs (VSCode, IntelliJ)

| IDEs | Catalog |
|---|---|
| ✅ Go-to-definition | ✅ Go-to-purpose |
| ✅ Syntax highlighting | ✅ Semantic discovery (tags) |
| ❌ IDE-specific | ✅ Tool-agnostic (CLI/TUI) |
| ❌ No AI agent awareness | ✅ Agent tracking built-in |

Use both: IDE for coding, Catalog for discovery.


Quick Start (3 Commands)

# 1. Initialize
python ghost_catalog_cli.py init

# 2. Sync your codebase
python ghost_catalog_cli.py sync

# 3. Browse interactively
./ghost_catalog_tui

That's it. You now have a semantic catalog of your codebase.


What's Included

📚 Documentation:

  • Technical specification (1,500 lines)
  • Practical guide with use cases (3,000 lines)
  • Implementation suite guide (650 lines)

💻 Working Code:

  • CLI tool (Python, 776 lines)
  • TUI browser (Go/Bubble Tea, 653 lines)
  • AI integration (Python, 651 lines)

📊 Database:

  • SQLite schema (4 tables, 6 indexes)
  • 20+ analytics queries
  • Dependency tracking

🎯 Everything Needed:

  • Installation instructions
  • Complete workflows
  • Troubleshooting guide
  • Production checklist

GitHub Repository

🔗 https://github.com/SoMaCoSF/ghost-catalog

Clone and use:

git clone https://github.com/SoMaCoSF/ghost-catalog.git
cd ghost-catalog
pip install -r requirements.txt
python ghost_catalog_cli.py init
python ghost_catalog_cli.py sync

Repository includes:

  • All working tools (CLI, TUI, AI integration)
  • Complete documentation (linked to gists)
  • Example files and workflows
  • MIT License

Gists (Complete Documentation)

All documentation is also available as gists:

  1. Technical Deep Dive: https://gist.github.com/SoMaCoSF/38d0d859192546ca4add36e4f7351c7d
     • Architecture, schemas, specifications

  2. Practical Guide: https://gist.github.com/SoMaCoSF/edf5ba3afd8e8849903b9400add4d406
     • Tutorials, use cases, decision framework

  3. Implementation Suite: https://gist.github.com/SoMaCoSF/6e62939c6b9810aa9e4f3c604ac9f9fe
     • Working code (CLI, TUI, AI)

  4. How to Use (this document): https://gist.github.com/SoMaCoSF/b52b7f68d09d4752138fb0712e153c21
     • Quick start, scenarios, examples

Who This Is For

✅ You should use this if you:

  • Work with AI coding assistants (Claude Code, Cursor, Copilot)
  • Manage codebases with 50+ files
  • Onboard new developers frequently
  • Use multiple AI agents on the same project
  • Need compliance/audit trails
  • Want semantic code organization

❌ Skip this if you:

  • Have 1-10 files
  • Work solo with no AI
  • Have an early-stage project (architecture not stable)
  • Don't control the codebase (third-party)

Why I Built This

I was working with Claude Code on a privacy proxy project (Ghost_Shell) that merged two existing codebases (50+ files). Every time Claude had to understand a file, it would:

  1. Read the entire file (wasting tokens)
  2. Ask me what it does
  3. Forget 10 minutes later and ask again

I thought: What if every file had a "name tag" that told the agent:

  • What it is
  • What it does
  • When it was created
  • Which agent made it
  • How to run it

So I built it. And it works incredibly well.

Now when Claude asks "what does this file do?", I just say:

Read the file_id header

And it gets everything it needs in 13 lines instead of reading 500 lines of code.
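Reading only the header is cheap to do in code as well. A minimal Python sketch, assuming the header occupies the first 13 lines as described in this post (`read_header` is a hypothetical helper, not part of the Ghost Catalog CLI):

```python
from itertools import islice
from pathlib import Path

HEADER_LINES = 13  # header length stated in this post


def read_header(path, max_lines=HEADER_LINES):
    """Return the first `max_lines` lines of a file without reading the rest."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        # islice stops after max_lines, so a 500-line file is never fully read.
        return [line.rstrip("\n") for line in islice(handle, max_lines)]
```

Usage would look like `print("\n".join(read_header("src/core.py")))`, where the path is whatever file the agent is about to open.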


How to Incorporate Into Your Project

Option 1: Start Fresh (New Project)

# 1. Clone the repo
git clone https://github.com/SoMaCoSF/ghost-catalog.git
cd ghost-catalog

# 2. Install
pip install -r requirements.txt

# 3. Initialize in your project
cd /path/to/your/project
python /path/to/ghost-catalog/ghost_catalog_cli.py init

# 4. Create your first file with a header
# (Use the template from docs)

# 5. Sync
python /path/to/ghost-catalog/ghost_catalog_cli.py sync

Option 2: Migrate Existing Project

# 1. Backup your project
cp -r my_project my_project.backup

# 2. Add catalog headers to existing files
# (Use the bulk migration script from the docs)

# 3. Or use AI to help
python ghost_catalog_ai_integration.py auto-tag \
  --api-key $OPENAI_API_KEY \
  --dry-run  # Preview first

# 4. Sync to database
python ghost_catalog_cli.py sync

# 5. Validate
python ghost_catalog_cli.py validate

Option 3: Gradual Adoption

# 1. Start with new files only
# Add headers to new files as you create them

# 2. Add headers to frequently-edited files
# When you edit a file, add a header

# 3. Eventually cover 80% of important files
# Don't worry about 100% coverage
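To track progress toward a coverage target during gradual adoption, a small script can count how many files already carry a header. This is a rough sketch under a stated assumption: a file counts as covered if an ID shaped like SOM-SCR-0014-v1.0.0 appears anywhere in its first 13 lines. The project's own `validate` command may apply stricter rules, and `has_header` and `header_coverage` are hypothetical names.

```python
import re
from itertools import islice
from pathlib import Path

# Assumption: any ID shaped like SOM-SCR-0014-v1.0.0 in the first 13 lines
# counts as a header. The real validator may be stricter.
ID_PATTERN = re.compile(r"\b[A-Z]+-[A-Z]{3}-\d{4}-v\d+\.\d+\.\d+\b")


def has_header(path, max_lines=13):
    """Return True if a catalog-style ID appears in the first max_lines lines."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return any(ID_PATTERN.search(line) for line in islice(handle, max_lines))


def header_coverage(root, pattern="*.py"):
    """Return (files_with_header, total_files, fraction) under `root`."""
    files = sorted(p for p in Path(root).rglob(pattern) if p.is_file())
    covered = sum(1 for f in files if has_header(f))
    fraction = covered / len(files) if files else 0.0
    return covered, len(files), fraction


if __name__ == "__main__":
    covered, total, fraction = header_coverage(".")
    print(f"{covered}/{total} files have a header ({fraction:.0%})")
```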

Example Integration: Claude Code Workflow

Before:

You: "Claude, refactor the authentication module"
Claude: "Sure, which file is that?"
You: "auth.py"
Claude: *reads entire 500-line file* "What does this depend on?"
You: "utils.py and db.py"
Claude: *reads both files* "Ok, I'll refactor"

After:

You: "Claude, refactor the authentication module"
Claude: *checks catalog* "Found SOM-SCR-0023 (auth.py)"
Claude: *checks dependencies in catalog* "Depends on SOM-SCR-0008 (utils) and SOM-SCR-0012 (db)"
Claude: "I'll refactor, this will impact 3 files: auth.py, login.py, session.py"
You: "Go ahead"

Result: Faster, more accurate, fewer back-and-forth questions.


Add This to Your Claude Code Workspace

Step 1: Add to .claude/commands/catalog.md:

# Catalog System Commands

When working with this codebase, use the catalog system:

## Find a file
/catalog search <query>

## Get file info
/catalog info <SOM-SCR-XXXX>

## List by category
/catalog list --category script

## Check dependencies
/catalog deps <SOM-SCR-XXXX>

Step 2: Add to workspace instructions (.claude/settings.json):

{
  "instructions": [
    "This codebase uses the Ghost Catalog system.",
    "Every file has a header with metadata (first 13 lines).",
    "Before reading a file, check its file_id header to understand what it does.",
    "File IDs format: SOM-<CATEGORY>-<SEQUENCE>-v<VERSION>",
    "Categories: SCR=script, DOC=documentation, CFG=configuration, TST=test",
    "Use 'ghost-catalog search' to find files by tag or description."
  ]
}

Community & Support

GitHub Issues: https://github.com/SoMaCoSF/ghost-catalog/issues
Discussions: https://github.com/SoMaCoSF/ghost-catalog/discussions
Documentation: See gists linked above

Questions? Open a discussion on GitHub.

Found a bug? Open an issue.

Want to contribute? PRs welcome!


License

MIT License - Use freely in personal or commercial projects.


TL;DR

Problem: AI agents waste time understanding your codebase.

Solution: Add semantic file IDs (like "name tags" for code files).

Result:

  • Onboarding: 3 days → 30 minutes
  • File discovery: Minutes → 5 seconds
  • AI agent coordination: Chaos → Clear
  • Code archaeology: Hours → Instant

Get started:

git clone https://github.com/SoMaCoSF/ghost-catalog.git
cd ghost-catalog
pip install -r requirements.txt
python ghost_catalog_cli.py init
python ghost_catalog_cli.py sync

Read more: Check the gists linked in this post.


Built this over a weekend because I was tired of AI agents asking "what does this file do?" every 5 minutes. Hope it helps your workflow too!

— SoMaCoSF


P.S. If you use this, I'd love to hear about it. Drop a comment or open a discussion on GitHub!
