
Ghost_Shell File Catalog System: The Complete Guide

A Practical Guide to Understanding, Using, and Implementing Semantic File IDs

"Every file tells a story. The catalog system makes sure you can find it, understand it, and trust it."


Table of Contents

Part I: Understanding the System

  1. What Is This System?
  2. Why Not Just Use Traditional UUIDs?
  3. The Human Story: A Day in the Life
  4. Core Concepts Explained

Part II: How to Use It

  1. Quick Start: Your First 5 Minutes
  2. Reading File IDs Like a Pro
  3. Finding Files: The Four Methods
  4. Working with Tags
  5. Version Management

Part III: How It Works (The Technical Details)

  1. System Architecture
  2. The File ID Format Deep Dive
  3. Header Formats for Every File Type
  4. How IDs Are Generated
  5. The Registry: Current vs Future

Part IV: Practical Implementation

  1. Setting Up Your First Catalog
  2. Migrating an Existing Project
  3. CLI Tool Usage Guide
  4. Bubble Tea TUI Browser
  5. Automation & Workflows

Part V: Advanced Topics

  1. Dependency Tracking
  2. Analytics & Insights
  3. Integration with Development Tools
  4. Troubleshooting Common Issues

Part VI: Real-World Use Cases

  1. Use Case 1: Onboarding New Team Members
  2. Use Case 2: Managing Large Codebases
  3. Use Case 3: AI Agent Coordination
  4. Use Case 4: Code Archaeology
  5. Use Case 5: Compliance & Auditing

Part VII: Decision Framework

  1. When to Use This System
  2. When NOT to Use This System
  3. Comparison: SOM IDs vs UUIDs vs Git Hashes

Part I: Understanding the System

What Is This System?

The Elevator Pitch

Imagine if every file in your codebase had a name tag that told you:

  • What it is
  • What it does
  • Who created it
  • When it was made
  • What version it is
  • How to run it
  • What topics it relates to

That's the Ghost_Shell File Catalog System. Instead of random UUIDs like 550e8400-e29b-41d4-a716-446655440000, you get semantic IDs like SOM-SCR-0014-v1.0.0 that actually mean something.

The Big Picture

Traditional File System          Ghost_Shell Catalog System
─────────────────────           ────────────────────────────
📁 src/                          📁 src/
  📄 cli.py                        📄 cli.py
  📄 core.py                          ├─ ID: SOM-SCR-0014-v1.0.0
  📄 utils.py                         ├─ "Management CLI"
                                      ├─ Tags: [cli, admin]
                                      ├─ Agent: AGENT-CLAUDE-002
                                      └─ Run: python -m ghost_shell.cli

                                   📄 core.py
                                      ├─ ID: SOM-SCR-0012-v1.1.0
                                      ├─ "Main proxy addon"
                                      └─ Tags: [proxy, core]

Why It Exists

Problem: In a multi-agent development environment (humans + AI), it's hard to:

  • Track who changed what
  • Understand file purposes without reading code
  • Find related files quickly
  • Maintain version consistency
  • Coordinate between different agents

Solution: Embed rich metadata directly into every file using a semantic naming system.


Why Not Just Use Traditional UUIDs?

The UUID Approach

# File: 550e8400-e29b-41d4-a716-446655440000.py
# (What does this do? No idea without reading it.)

def process_data():
    pass

UUIDs are great for:

  • Distributed systems needing uniqueness guarantees
  • Database primary keys
  • API tokens
  • Anything where collision risk is critical

UUIDs are terrible for:

  • Human comprehension
  • Semantic grouping (finding related files)
  • Version tracking
  • Understanding file purpose at a glance

The SOM ID Approach

# ==============================================================================
# file_id: SOM-SCR-0014-v1.0.0
# name: cli.py
# description: Ghost_Shell unified management CLI
# project_id: GHOST-SHELL
# category: script
# tags: [cli, management, admin, opentelemetry]
# created: 2025-11-23
# modified: 2025-11-23
# version: 1.0.0
# agent_id: AGENT-CLAUDE-002
# execution: python -m ghost_shell.cli [command] [args]
# ==============================================================================

Instantly you know:

  • It's a script (SCR)
  • It's the 14th script in the project
  • It's version 1.0.0
  • It's a CLI tool (from tags)
  • Claude created it
  • How to run it

Side-by-Side Comparison

Feature             UUID                              SOM File ID
──────────────────  ────────────────────────────────  ─────────────────────────────
Uniqueness          ✅ Cryptographically guaranteed    ✅ Sequential within category
Human Readable      ❌ Completely opaque               ✅ Self-documenting
Semantic Grouping   ❌ No meaning                      ✅ Category-based
Version Tracking    ❌ Needs external system           ✅ Built-in semver
Searchability       ❌ Only exact match                ✅ By category, tag, agent
Git Friendly        ❌ Random strings in diffs         ✅ Readable diffs
Discoverability     ❌ Need registry lookup            ✅ Self-contained metadata
Agent Coordination  ❌ No agent info                   ✅ Agent ID embedded

The Decision

For a single-workspace, multi-agent development environment where:

  • Agents need to understand existing code quickly
  • Files have clear categories (scripts, docs, tests)
  • Version tracking matters
  • Human oversight is common

SOM IDs win because they prioritize comprehension over collision resistance.


The Human Story: A Day in the Life

Morning: Sarah the Developer

8:00 AM - Sarah joins the Ghost_Shell project. She's never seen the codebase before.

# Without catalog system:
$ ls ghost_shell/
cli.py  core.py  main.py  utils.py  blocker.py  collector.py

# She has to open each file and read it to understand what it does
# With catalog system:
$ ghost-catalog list

╭────────────────────────┬──────────────────────────────────╮
│ File ID                │ Description                      │
├────────────────────────┼──────────────────────────────────┤
│ SOM-SCR-0010-v1.0.0    │ OpenTelemetry setup              │
│ SOM-SCR-0011-v1.0.0    │ Traffic blocking                 │
│ SOM-SCR-0012-v1.1.0    │ Main proxy addon                 │
│ SOM-SCR-0013-v1.0.0    │ Intelligence collector           │
│ SOM-SCR-0014-v1.0.0    │ Management CLI                   │
╰────────────────────────┴──────────────────────────────────╯

# She immediately understands the architecture without reading code

Result: Sarah is productive in 5 minutes instead of 5 hours.


Midday: Alex the AI Agent

12:00 PM - Alex (an AI agent) needs to find all files related to OpenTelemetry to add new metrics.

# Without catalog system:
$ grep -r "opentelemetry" . --include="*.py"
# Returns 200+ lines of code matches, unclear which FILES to edit
# With catalog system:
$ ghost-catalog search --tag opentelemetry

Found 4 files:
- SOM-SCR-0010-v1.0.0  telemetry.py       (OpenTelemetry setup)
- SOM-SCR-0012-v1.1.0  core.py            (Main proxy - uses OTel)
- SOM-SCR-0013-v1.0.0  collector.py       (Intel collector - OTel metrics)
- SOM-SCR-0014-v1.0.0  cli.py             (CLI - OTel integration)

# Alex knows exactly which 4 files to modify

Result: Precise targeting instead of shotgun edits.


Afternoon: Mike the Manager

3:00 PM - Mike needs to audit which files haven't been updated in 6 months for the compliance report.

# Without catalog system:
$ find . -name "*.py" -mtime +180
# Returns raw file paths, no context about what they do
# With catalog system:
$ ghost-catalog list --sort modified | head -5

Oldest Modified Files:
╭────────────────────────┬──────────────────┬────────────╮
│ File ID                │ Description      │ Last Edit  │
├────────────────────────┼──────────────────┼────────────┤
│ SOM-SCR-0003-v1.0.0    │ Legacy parser    │ 2024-06-15 │
│ SOM-DOC-0001-v1.0.0    │ Old setup guide  │ 2024-07-20 │
╰────────────────────────┴──────────────────┴────────────╯

# Mike has actionable info with business context

Result: Compliance report done in 10 minutes, not 2 hours.


Evening: Emma the Onboarding Bot

6:00 PM - Emma (an AI onboarding agent) detects a new contributor and generates a personalized learning path.

# Emma queries the catalog database:
SELECT file_id, name, description, tags
FROM file_catalog
WHERE category = 'documentation'
ORDER BY created;

# Results:
# SOM-DOC-0001-v1.0.0  QUICKSTART.md       (Quick start guide)
# SOM-DOC-0003-v1.0.0  CODEBASE_OVERVIEW.md (Complete architecture)
# SOM-DOC-0004-v1.0.0  API_REFERENCE.md    (API documentation)

Emma generates:

Welcome! Here's your learning path:
1. Start: SOM-DOC-0001 (Quickstart - 10 min read)
2. Then: SOM-DOC-0003 (Architecture overview - 30 min read)
3. Deep dive: SOM-SCR-0012 (Core proxy code)

Result: New contributors have a clear path instead of drowning in files.


Core Concepts Explained

Concept 1: Semantic IDs

What: IDs that convey meaning through their structure

Example:

SOM-SCR-0014-v1.0.0
 │   │   │    │
 │   │   │    └─ Version 1.0.0 (semantic versioning)
 │   │   └────── Sequence 0014 (14th script)
 │   └────────── Category: SCR (script)
 └────────────── Namespace: SOM (Somacosf workspace)

Why it matters: You can tell the file's purpose without opening it.


Concept 2: Embedded Metadata

What: Every file contains a structured header with metadata

Why not a separate database?

  • Metadata travels with the file (Git-trackable)
  • No database corruption risk
  • Works offline immediately
  • Self-documenting code

The trade-off:

  • Slower searches (scan files vs query DB)
  • No relationships between files
  • Manual updates needed

The solution: Hybrid approach

  • Metadata in files (source of truth)
  • Database catalog (for fast queries)
  • Sync tool keeps them aligned
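
To make the hybrid model concrete, here is a minimal sketch of reading a header back out of a file, assuming the Python hash-comment format shown later in this guide (the function name and parsing details are illustrative, not part of the official tooling):

import re
from pathlib import Path

def read_header(path):
    """Parse '# key: value' lines from the top of a file into a dict (sketch)."""
    metadata = {}
    # Headers live in the first ~20 lines, before any code.
    for line in Path(path).read_text().splitlines()[:20]:
        match = re.match(r'#\s*(\w+):\s*(.+)', line)
        if match:
            metadata[match.group(1)] = match.group(2).strip()
    return metadata

# The header is the source of truth; a sync tool would upsert this
# dict into the catalog database for fast queries.
print(read_header('ghost_shell/cli.py').get('file_id'))  # SOM-SCR-0014-v1.0.0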

Concept 3: Categories

What: File types organized by purpose

The 9 Categories:

Code  Category        Purpose               Examples
────  ──────────────  ────────────────────  ──────────────────────
CMD   Slash commands  Claude Code commands  Custom workflows
SCR   Scripts         Executable code       cli.py, launcher.py
DOC   Documentation   Human-readable docs   README, guides
CFG   Configuration   Settings files        config.yaml
REG   Registry        Catalog metadata      agent_registry.md
TST   Tests           Test suites           test_*.py
TMP   Templates       Boilerplates          Project scaffolds
DTA   Data/schemas    Data definitions      schema.sql
LOG   Logs/diaries    Development logs      development_diary.md

Why categories matter:

# Find all test files instantly
$ ghost-catalog list --category test

# Find all config files
$ ghost-catalog list --category configuration

Concept 4: Tags

What: Free-form keywords that describe file characteristics

Examples:

# For a CLI tool:
tags: [cli, management, admin, sqlite]

# For a proxy module:
tags: [proxy, mitmproxy, blocking, privacy]

# For documentation:
tags: [onboarding, tutorial, quickstart]

Tag Strategy:

  • Functional: What it does (cli, api, database)
  • Technical: What it uses (opentelemetry, sqlite, mitmproxy)
  • Topical: What domain (security, networking, analytics)
  • Status: Implementation state (wip, deprecated, experimental)

Why tags matter:

# Find everything related to proxies
$ ghost-catalog search --tag proxy

# Find all OpenTelemetry-instrumented files
$ ghost-catalog search --tag opentelemetry

# Find work-in-progress files
$ ghost-catalog search --tag wip

Concept 5: Agent Tracking

What: Every file records which AI agent created/modified it

Format: AGENT-<TYPE>-<NUMBER>

  • AGENT-CLAUDE-002: Claude Sonnet 4.5
  • AGENT-GPT-001: GPT-4
  • AGENT-HUMAN-001: Human developer Sarah

Why it matters:

  1. Debugging: "Which agent introduced this bug?"
  2. Accountability: Track agent activity
  3. Context switching: "What has agent X been working on?"
  4. Handoffs: "Agent Y needs to continue agent X's work"

Example:

# Find all files Claude has worked on
$ ghost-catalog list --agent AGENT-CLAUDE-002

# Track agent activity over time
$ sqlite3 catalog.db "
  SELECT agent_id, COUNT(*) as files_created
  FROM file_catalog
  GROUP BY agent_id
  ORDER BY files_created DESC;
"

Concept 6: Semantic Versioning

What: Version numbers follow semver (MAJOR.MINOR.PATCH)

Rules:

  • PATCH (1.0.0 → 1.0.1): Bug fixes, docs updates
  • MINOR (1.0.1 → 1.1.0): New features, backward-compatible
  • MAJOR (1.1.0 → 2.0.0): Breaking changes

Why it matters:

# Find all v1.x files (stable)
$ ghost-catalog list | grep "v1\."

# Find pre-1.0 files (not release-ready)
$ sqlite3 catalog.db "
  SELECT file_id, name
  FROM file_catalog
  WHERE version LIKE '0.%';
"

# Audit: All files should be v1.0.0+ before release
$ ghost-catalog validate --strict-version 1.0.0

In the File ID:

SOM-SCR-0014-v1.0.0  ← Initial release
SOM-SCR-0014-v1.0.1  ← Bug fix
SOM-SCR-0014-v1.1.0  ← New feature added
SOM-SCR-0014-v2.0.0  ← Breaking refactor

Part II: How to Use It

Quick Start: Your First 5 Minutes

Step 1: Check if a File Has a Catalog Header

Open any Python file in the project:

# ==============================================================================
# file_id: SOM-SCR-0014-v1.0.0
# name: cli.py
# description: Ghost_Shell unified management CLI
# ...
# ==============================================================================

✅ Has header = Cataloged file ❌ No header = Needs a catalog entry


Step 2: Find All Cataloged Files

Method 1: Grep (works everywhere)

grep -r "file_id:" . --include="*.py" --include="*.md"

Method 2: CLI Tool (if installed)

ghost-catalog list

Method 3: TUI Browser (if installed)

ghost-catalog-tui

Step 3: Search for Something Specific

Find all scripts:

ghost-catalog list --category script

Find files with "proxy" in description:

ghost-catalog search "proxy"

Find files by tag:

ghost-catalog search --tag opentelemetry

Step 4: View File Details

ghost-catalog info SOM-SCR-0014-v1.0.0

Output:

╭─────────────────────────────────────────────╮
│ File: SOM-SCR-0014-v1.0.0                   │
├─────────────────────────────────────────────┤
│ Name:        cli.py                         │
│ Path:        ghost_shell/cli.py             │
│ Description: Management CLI                 │
│ Category:    script                         │
│ Tags:        [cli, management, admin]       │
│ Version:     1.0.0                          │
│ Created:     2025-11-23                     │
│ Modified:    2025-11-23                     │
│ Agent:       AGENT-CLAUDE-002               │
│ Execution:   python -m ghost_shell.cli      │
╰─────────────────────────────────────────────╯

Step 5: Open the File

# Option 1: Manual
code ghost_shell/cli.py

# Option 2: From TUI (press 'o' on selected file)

# Option 3: CLI tool
ghost-catalog info SOM-SCR-0014-v1.0.0 --open

Reading File IDs Like a Pro

Anatomy of a File ID

SOM-SCR-0014-v1.0.0
│   │   │    │
│   │   │    └── Version (semantic versioning)
│   │   └─────── Sequence number (unique within category)
│   └─────────── Category code (3 letters)
└─────────────── Namespace (SOM = Somacosf)

Quick Reference: Category Codes

CMD = Commands        CFG = Configuration
SCR = Scripts         REG = Registry
DOC = Documentation   TST = Tests
TMP = Templates       DTA = Data/Schemas
LOG = Logs/Diaries

Reading Examples

SOM-SCR-0012-v1.1.0

  • "This is the 12th script in the project"
  • "It's at version 1.1.0 (has received 1 feature update)"
  • "It's a SCR (runnable script)"

SOM-DOC-0003-v1.0.0

  • "This is the 3rd documentation file"
  • "It's at version 1.0.0 (stable release)"
  • "It's a DOC (documentation)"

SOM-TST-0001-v2.0.0

  • "This is the 1st test file"
  • "It's at version 2.0.0 (has had breaking changes)"
  • "It's a TST (test suite)"

What the Sequence Number Tells You

Low numbers (0001-0010): Core/foundational files

  • SOM-SCR-0001 - Usually the main entry point
  • SOM-DOC-0001 - Usually the README or quickstart

Mid numbers (0011-0050): Feature modules

  • SOM-SCR-0014 - CLI tool (added mid-project)

High numbers (0051+): Recent additions or utilities

  • SOM-SCR-0087 - Probably a recently added helper

What the Version Tells You

v0.x.x: Pre-release, experimental

  • v0.1.0 - Initial draft
  • v0.9.5 - Almost ready, but not stable

v1.x.x: Stable, production-ready

  • v1.0.0 - First stable release
  • v1.5.0 - Mature, with several features

v2.x.x+: Major revisions

  • v2.0.0 - Significant refactor
  • v3.0.0 - Another major rewrite

Finding Files: The Four Methods

Method 1: Bash/PowerShell (Always Available)

List all file IDs:

grep -r "file_id:" . --include="*.py" --include="*.md" | awk '{print $2}'

Find by category:

grep -r "file_id: SOM-SCR" . --include="*.py"

Find by tag:

grep -r "tags:.*opentelemetry" . --include="*.py"

Find recently modified:

grep -r "modified:" . --include="*.py" | sort -t: -k2 -r | head -10

Method 2: Repomix Output (Best for Analysis)

Generate:

repomix --output catalog.txt --style xml

Parse with Python:

import re

with open('catalog.txt') as f:
    content = f.read()

# Extract all file IDs
file_ids = re.findall(r'file_id:\s+(SOM-[A-Z]{3}-\d{4}-v[\d.]+)', content)

print(f"Found {len(file_ids)} cataloged files:")
for fid in sorted(set(file_ids)):
    print(f"  {fid}")

Method 3: CLI Tool (Recommended)

Installation:

# From source (Go)
git clone https://github.com/somacosf/ghost-catalog
cd ghost-catalog
go build -o ghost-catalog
sudo mv ghost-catalog /usr/local/bin/

# Or from releases
curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog
chmod +x ghost-catalog

Usage:

# List all files
ghost-catalog list

# Search
ghost-catalog search "proxy"

# Filter by category
ghost-catalog list --category script

# Filter by tag
ghost-catalog search --tag opentelemetry

# Get file info
ghost-catalog info SOM-SCR-0014-v1.0.0

# Validate headers
ghost-catalog validate

# Show statistics
ghost-catalog stats

Method 4: Bubble Tea TUI (Most Interactive)

Launch:

ghost-catalog-tui

Interface:

╔══════════════════╦═════════════════════════════════════╗
║  FILTERS         ║  FILE LIST                          ║
║                  ║  [Selected file highlighted]        ║
║ Categories       ║                                     ║
║ [x] Scripts      ║  FILE DETAILS                       ║
║ [ ] Docs         ║  [Metadata display]                 ║
║                  ║                                     ║
║ Tags             ║                                     ║
║ [x] proxy        ║                                     ║
╚══════════════════╩═════════════════════════════════════╝

Navigation:

  • ↑/↓: Move selection
  • Enter: View details
  • o: Open in editor
  • Tab: Switch panes
  • /: Search
  • q: Quit

When to Use Each Method

Method           Best For
───────────────  ────────────────────────────────
Bash/PowerShell  Quick checks, no tools installed
Repomix          Analysis, sharing with AI agents
CLI Tool         Daily development, scripting
TUI Browser      Exploring unfamiliar codebases

Working with Tags

Tag Philosophy

Tags are your semantic index. Think of them as:

  • Hashtags for code
  • Labels for organization
  • Keywords for search

Good Tag Examples

# CLI tool
tags: [cli, management, admin, interactive, rich]

# Database module
tags: [database, sqlite, duckdb, orm, persistence]

# Security feature
tags: [security, encryption, authentication, jwt]

# Experimental feature
tags: [experimental, wip, prototype, needs-review]

Bad Tag Examples

# Too generic
tags: [code, file, python]  ❌

# Too specific (use description instead)
tags: [this-uses-requests-library-version-2-31]  ❌

# Redundant with category
tags: [script]  ❌ (category already says "script")

Tag Guidelines

  1. Use lowercase with hyphens: multi-word-tag
  2. 3-7 tags per file (sweet spot)
  3. Mix functional + technical tags
  4. Include status if relevant: wip, deprecated, stable
  5. Think search: "What would I search for to find this?"

Querying by Tags

Single tag:

ghost-catalog search --tag opentelemetry

Multiple tags (AND):

ghost-catalog search --tag proxy --tag security

Multiple tags (OR) via SQL:

SELECT DISTINCT fc.file_id, fc.name
FROM file_catalog fc
JOIN file_tags ft ON fc.file_id = ft.file_id
WHERE ft.tag IN ('proxy', 'security');

Tag Analytics

Most used tags:

SELECT tag, COUNT(*) as usage_count
FROM file_tags
GROUP BY tag
ORDER BY usage_count DESC
LIMIT 10;

Tag co-occurrence (find related tags):

SELECT t1.tag, t2.tag, COUNT(*) as co_occurrences
FROM file_tags t1
JOIN file_tags t2 ON t1.file_id = t2.file_id AND t1.tag < t2.tag
GROUP BY t1.tag, t2.tag
ORDER BY co_occurrences DESC
LIMIT 20;

Result:

proxy + security = 5 files
cli + database = 3 files
opentelemetry + monitoring = 4 files

Version Management

Semantic Versioning Rules

MAJOR.MINOR.PATCH

Change Type                 Example                Bump
──────────────────────────  ─────────────────────  ─────────────────────
Bug fix                     Fixed null pointer     PATCH (1.0.0 → 1.0.1)
Documentation update        Added docstrings       PATCH
New function (compatible)   Added get_stats()      MINOR (1.0.1 → 1.1.0)
New optional parameter      def foo(x, y=None)     MINOR
Breaking change             Removed function       MAJOR (1.1.0 → 2.0.0)
Changed function signature  foo(x) → foo(x, y)     MAJOR

Version Consistency

Both places must match:

# file_id: SOM-SCR-0014-v1.0.0  ← Version here
# ...
# version: 1.0.0                ← Must match here

Validate:

ghost-catalog validate

Output:

✗ SOM-SCR-0008-v1.0.0  cli.py
  └─ Version mismatch: file_id has v1.0.0 but version field has 1.0.1

Auto-fix:

ghost-catalog validate --fix
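
If the CLI tool isn't available, the same consistency check is easy to script. A minimal sketch (regexes illustrative, matching the header format shown above):

import re

def check_version_consistency(header_text):
    """True if the vX.Y.Z in file_id matches the version field (sketch)."""
    id_match = re.search(r'file_id:\s*SOM-[A-Z]{3}-\d{4}-v([\d.]+)', header_text)
    ver_match = re.search(r'^#?\s*version:\s*([\d.]+)', header_text, re.MULTILINE)
    return bool(id_match and ver_match and id_match.group(1) == ver_match.group(1))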

Version Auditing

Pre-release files (v0.x.x):

SELECT file_id, name, version
FROM file_catalog
WHERE version LIKE '0.%'
ORDER BY version DESC;

Outdated files (not at latest major version):

-- Assuming v2.x is the latest
SELECT file_id, name, version
FROM file_catalog
WHERE CAST(substr(version, 1, 1) AS INTEGER) < 2;

Version distribution:

SELECT
    CAST(substr(version, 1, 1) AS INTEGER) as major_version,
    COUNT(*) as file_count
FROM file_catalog
GROUP BY major_version;

Result:

major_version | file_count
0             | 3          (pre-release)
1             | 18         (stable)
2             | 4          (latest)

Part III: How It Works (The Technical Details)

System Architecture

The Five Layers

graph TB
    subgraph "Layer 1: File System"
        A[Physical Files<br/>with Headers]
    end

    subgraph "Layer 2: Metadata Schema"
        B[Embedded Headers<br/>12+ Fields]
    end

    subgraph "Layer 3: Registry"
        C[SQLite Catalog DB]
        D[File System Headers]
    end

    subgraph "Layer 4: Access APIs"
        E[CLI Tool]
        F[TUI Browser]
        G[SQL Queries]
    end

    subgraph "Layer 5: Integration"
        H[Git Hooks]
        I[CI/CD Validation]
        J[Editor Plugins]
    end

    A --> B
    B --> C
    B --> D
    C --> E
    C --> F
    C --> G
    D --> E
    E --> H
    F --> I
    G --> J

Layer 1: File System

  • Physical .py, .md, .yaml files
  • Each has a header (first 10-20 lines)

Layer 2: Metadata Schema

  • Structured data in headers
  • 12+ fields per file
  • Format varies by file type (Python vs Markdown)

Layer 3: Registry

  • Current: Distributed (metadata in files)
  • Future: Centralized SQLite database
  • Sync tool keeps them aligned

Layer 4: Access APIs

  • CLI: Command-line queries
  • TUI: Interactive browser
  • SQL: Direct database queries

Layer 5: Integration

  • Git hooks: Validate on commit
  • CI/CD: Check in pipelines
  • Editors: Syntax highlighting, autocomplete

The File ID Format Deep Dive

Format Specification

SOM-<CATEGORY>-<SEQUENCE>-v<VERSION>

Constraints:
- SOM: Fixed namespace (3 chars)
- CATEGORY: One of 9 valid codes (3 chars, uppercase)
- SEQUENCE: 0001-9999 (4 digits, zero-padded)
- VERSION: MAJOR.MINOR.PATCH (semver)

Regex: ^SOM-[A-Z]{3}-\d{4}-v\d+\.\d+\.\d+$

Valid Examples

SOM-SCR-0001-v1.0.0  ✅
SOM-DOC-0042-v2.5.3  ✅
SOM-CFG-0007-v1.0.0  ✅

Invalid Examples

SOM-XYZ-0001-v1.0.0  ❌ (XYZ is not a valid category)
SOM-SCR-1-v1.0.0     ❌ (sequence must be 4 digits)
SOM-SCR-0001-1.0.0   ❌ (missing 'v' prefix)
SOM-SCR-0001-v1      ❌ (version must be full semver)
som-scr-0001-v1.0.0  ❌ (must be uppercase)

Parsing a File ID

Python:

import re

def parse_file_id(file_id):
    pattern = r'^SOM-([A-Z]{3})-(\d{4})-v(\d+)\.(\d+)\.(\d+)$'
    match = re.match(pattern, file_id)

    if not match:
        raise ValueError(f"Invalid file ID: {file_id}")

    return {
        'namespace': 'SOM',
        'category': match.group(1),
        'sequence': int(match.group(2)),
        'version_major': int(match.group(3)),
        'version_minor': int(match.group(4)),
        'version_patch': int(match.group(5)),
        'version': f"{match.group(3)}.{match.group(4)}.{match.group(5)}"
    }

# Usage
info = parse_file_id('SOM-SCR-0014-v1.0.0')
print(info)
# {'namespace': 'SOM', 'category': 'SCR', 'sequence': 14,
#  'version_major': 1, 'version_minor': 0, 'version_patch': 0,
#  'version': '1.0.0'}

Generating the Next File ID

def get_next_file_id(category, existing_ids):
    """Generate next file ID for a category."""
    # Find highest sequence in category
    max_seq = 0
    for file_id in existing_ids:
        info = parse_file_id(file_id)
        if info['category'] == category:
            max_seq = max(max_seq, info['sequence'])

    # Increment
    next_seq = max_seq + 1

    # Format
    return f"SOM-{category}-{next_seq:04d}-v1.0.0"

# Usage
existing = ['SOM-SCR-0012-v1.1.0', 'SOM-SCR-0013-v1.0.0', 'SOM-SCR-0014-v1.0.0']
next_id = get_next_file_id('SCR', existing)
print(next_id)  # SOM-SCR-0015-v1.0.0

Header Formats for Every File Type

Python Files

Template:

# ==============================================================================
# file_id: SOM-SCR-NNNN-vX.X.X
# name: filename.py
# description: One-line description
# project_id: PROJECT-NAME
# category: script
# tags: [tag1, tag2, tag3]
# created: YYYY-MM-DD
# modified: YYYY-MM-DD
# version: X.X.X
# agent_id: AGENT-XXX-NNN
# execution: python filename.py [args]
# ==============================================================================

"""Module docstring."""

# Your code here

Line count: 13 lines (including delimiters)
Format: Hash comments


Markdown Files

Template:

<!--
===============================================================================
file_id: SOM-DOC-NNNN-vX.X.X
name: filename.md
description: One-line description
project_id: PROJECT-NAME
category: documentation
tags: [tag1, tag2, tag3]
created: YYYY-MM-DD
modified: YYYY-MM-DD
version: X.X.X
agent:
  id: AGENT-XXX-NNN
  name: agent_name
  model: model_id
execution:
  type: documentation
  invocation: Read for [purpose]
===============================================================================
-->

# Document Title

Content here...

Line count: 19-21 lines
Format: HTML comment with YAML-like structure
Special: Nested agent and execution fields


YAML/Config Files

Template:

# ==============================================================================
# file_id: SOM-CFG-NNNN-vX.X.X
# name: config.yaml
# description: One-line description
# project_id: PROJECT-NAME
# category: configuration
# tags: [tag1, tag2]
# created: YYYY-MM-DD
# modified: YYYY-MM-DD
# version: X.X.X
# ==============================================================================

# Your config here
proxy:
  host: 127.0.0.1
  port: 8080

Line count: 11 lines
Format: YAML comments
Note: Simplified (no nested agent/execution fields)


PowerShell Scripts

Template:

# ==============================================================================
# file_id: SOM-SCR-NNNN-vX.X.X
# name: script.ps1
# description: One-line description
# project_id: PROJECT-NAME
# category: script
# tags: [tag1, tag2]
# created: YYYY-MM-DD
# modified: YYYY-MM-DD
# version: X.X.X
# agent_id: AGENT-XXX-NNN
# execution: .\script.ps1 [-Param value]
# ==============================================================================

param(
    [string]$Param1
)

# Your script here

Same structure as Python, but with PowerShell syntax.


How IDs Are Generated

Current Method: Manual Allocation

sequenceDiagram
    participant Agent
    participant Files as File System
    participant Dev as Development Diary

    Agent->>Agent: 1. Decide to create new file
    Agent->>Agent: 2. Determine category (SCR/DOC/CFG/etc)
    Agent->>Files: 3. grep "file_id: SOM-SCR" *.py
    Files-->>Agent: 4. Returns existing SCR file IDs
    Agent->>Agent: 5. Find max sequence (e.g., 0014)
    Agent->>Agent: 6. Increment (0014 → 0015)
    Agent->>Agent: 7. Format: SOM-SCR-0015-v1.0.0
    Agent->>Agent: 8. Populate header template
    Agent->>Files: 9. Write file with header
    Agent->>Dev: 10. Log in development diary

Steps:

  1. Decide file category (script/doc/config/etc)
  2. Search existing files: grep -r "file_id: SOM-SCR"
  3. Extract sequences, find max
  4. Increment: max + 1
  5. Zero-pad: 15 → 0015
  6. Construct: SOM-SCR-0015-v1.0.0
  7. Fill header template
  8. Write file

Pros:

  • Simple, no tools needed
  • Full control

Cons:

  • Manual, error-prone
  • Collision risk if multiple agents create simultaneously

Future Method: Automated via CLI

graph LR
    A[Agent requests<br/>new file] --> B[ghost-catalog generate]
    B --> C[Query catalog DB<br/>for max sequence]
    C --> D[Allocate next ID]
    D --> E[Lock sequence<br/>in DB]
    E --> F[Generate header]
    F --> G[Write file]
    G --> H[Register in catalog]

Command:

ghost-catalog generate \
  --file new_module.py \
  --category script \
  --description "New feature module" \
  --tags cli,admin,feature-x

Process:

  1. Query catalog DB for highest sequence in SCR category
  2. Lock next sequence (prevent collisions)
  3. Generate file ID: SOM-SCR-0015-v1.0.0
  4. Create header from template
  5. Write file with header + boilerplate code
  6. Insert into catalog DB
  7. Return file ID to agent

Pros:

  • Atomic (no collisions)
  • Consistent formatting
  • Automatic catalog registration

Cons:

  • Requires tool installation
  • Depends on catalog DB
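
A minimal sketch of the lock-and-allocate step, assuming a SQLite catalog with the file_catalog schema shown in the next section (the real ghost-catalog generate command is still future work, so this is illustrative only):

import sqlite3

def allocate_next_id(db_path, category_code):
    """Atomically reserve the next sequence number for a category (sketch)."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("BEGIN IMMEDIATE")  # write lock: two agents can't race
        # file_id positions 9-12 hold the zero-padded sequence (SOM-XXX-NNNN-...)
        row = conn.execute(
            "SELECT COALESCE(MAX(CAST(substr(file_id, 9, 4) AS INTEGER)), 0) "
            "FROM file_catalog WHERE file_id LIKE ?",
            (f"SOM-{category_code}-%",),
        ).fetchone()
        file_id = f"SOM-{category_code}-{row[0] + 1:04d}-v1.0.0"
        # Registering inside the same transaction is what locks the sequence.
        conn.execute("INSERT INTO file_catalog (file_id) VALUES (?)", (file_id,))
        conn.commit()
        return file_id
    finally:
        conn.close()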

The Registry: Current vs Future

Current: Distributed File Headers

ghost_shell/
├── cli.py
│   └─ Header: SOM-SCR-0014-v1.0.0
├── core.py
│   └─ Header: SOM-SCR-0012-v1.1.0
├── collector.py
│   └─ Header: SOM-SCR-0013-v1.0.0
└── README.md
    └─ Header: SOM-DOC-0001-v1.0.0

How it works:

  • Metadata lives in each file's header
  • No central database
  • Query via grep/awk/sed

Advantages:

  ✅ Git-trackable
  ✅ No database corruption
  ✅ Works offline
  ✅ Self-documenting

Disadvantages:

  ❌ Slow searches (scan all files)
  ❌ No relationships (can't link files)
  ❌ No indexing
  ❌ Hard to aggregate statistics


Future: Centralized Catalog Database

ghost_shell/
├── data/
│   └── catalog.db (SQLite)
│       ├── file_catalog table
│       ├── file_tags table
│       ├── agent_registry table
│       └── file_dependencies table
└── files with headers (source of truth)

How it works:

  • Metadata still in file headers (source of truth)
  • Synced to SQLite database (fast queries)
  • ghost-catalog sync keeps them aligned

Database Schema:

CREATE TABLE file_catalog (
    file_id TEXT PRIMARY KEY,
    name TEXT,
    path TEXT,
    description TEXT,
    category TEXT,
    version TEXT,
    created DATE,
    modified DATE,
    agent_id TEXT,
    execution TEXT,
    checksum TEXT
);

CREATE TABLE file_tags (
    file_id TEXT,
    tag TEXT,
    PRIMARY KEY (file_id, tag)
);

CREATE TABLE agent_registry (
    id TEXT PRIMARY KEY,
    name TEXT,
    model TEXT,
    first_seen TIMESTAMP
);

CREATE TABLE file_dependencies (
    file_id TEXT,
    depends_on_file_id TEXT,
    dependency_type TEXT
);

Sync Process:

# Scan file system, update database
ghost-catalog sync

# Output:
# Scanned: 50 files
# Updated: 3 files
# New: 1 file
# Unchanged: 46 files

Advantages:

  ✅ Fast queries (indexed)
  ✅ Complex filters (tags AND category)
  ✅ Relationships (dependencies)
  ✅ Statistics (COUNT, GROUP BY)
  ✅ Full-text search

Disadvantages:

  ❌ Database can desync
  ❌ Requires sync tool
  ❌ Extra maintenance
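
A minimal sketch of what a sync pass might do, combining the read_header() idea from Part I with the schema above (the actual ghost-catalog sync behavior is assumed, not documented here):

import sqlite3
from pathlib import Path

def sync_catalog(root, db_path):
    """Scan files and upsert their header metadata into the catalog DB (sketch)."""
    conn = sqlite3.connect(db_path)
    synced = 0
    for path in list(Path(root).rglob('*.py')) + list(Path(root).rglob('*.md')):
        meta = read_header(path)  # the header parser sketched in Part I
        if 'file_id' not in meta:
            continue  # uncataloged file: report it rather than guess
        conn.execute(
            "INSERT OR REPLACE INTO file_catalog "
            "(file_id, name, path, description, category, version, created, modified, agent_id) "
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
            (meta['file_id'], meta.get('name'), str(path), meta.get('description'),
             meta.get('category'), meta.get('version'), meta.get('created'),
             meta.get('modified'), meta.get('agent_id')),
        )
        synced += 1
    conn.commit()
    conn.close()
    print(f"Synced {synced} cataloged files")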


Part IV: Practical Implementation

Setting Up Your First Catalog

Step 1: Install Tools (Optional but Recommended)

CLI Tool:

# From GitHub releases
curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog-linux-amd64
chmod +x ghost-catalog-linux-amd64
sudo mv ghost-catalog-linux-amd64 /usr/local/bin/ghost-catalog

# Verify
ghost-catalog --version

TUI Browser:

# Same process
curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog-tui-linux-amd64
chmod +x ghost-catalog-tui-linux-amd64
sudo mv ghost-catalog-tui-linux-amd64 /usr/local/bin/ghost-catalog-tui

Step 2: Add Your First Header

For a Python script:

  1. Open the file
  2. Add header at the top (before any code):
# ==============================================================================
# file_id: SOM-SCR-0001-v1.0.0
# name: my_script.py
# description: My first cataloged script
# project_id: MY-PROJECT
# category: script
# tags: [example, first, tutorial]
# created: 2025-01-24
# modified: 2025-01-24
# version: 1.0.0
# agent_id: AGENT-HUMAN-001
# execution: python my_script.py
# ==============================================================================

"""My script docstring."""

def main():
    print("Hello from cataloged script!")

if __name__ == '__main__':
    main()
  3. Save the file

Step 3: Verify the Header

# Check that it parses correctly
grep "file_id:" my_script.py
# Output: # file_id: SOM-SCR-0001-v1.0.0

# Validate format
ghost-catalog validate my_script.py
# Output: ✓ SOM-SCR-0001-v1.0.0 - my_script.py

Step 4: Create the Catalog Database

# Initialize catalog
ghost-catalog init

# This creates:
# - data/catalog.db (SQLite database)
# - .ghost-catalog.config (settings)

# Scan and populate
ghost-catalog sync

# Output:
# Scanning...
# Found 1 cataloged file
# Inserted 1 file into catalog

Step 5: Query the Catalog

# List all files
ghost-catalog list

# Get file info
ghost-catalog info SOM-SCR-0001-v1.0.0

# Search
ghost-catalog search "first"

Migrating an Existing Project

Scenario: You Have 50 Files, None Have Headers

Goal: Add catalog headers to all files

Strategy: Automated bulk migration


Step 1: Audit Current State

# List all Python and Markdown files
find . -name "*.py" -o -name "*.md" > files_to_migrate.txt

# Count them
wc -l files_to_migrate.txt
# Output: 50 files

Step 2: Generate Headers (Bulk Script)

Create migrate_catalog.py:

#!/usr/bin/env python3
"""Bulk add catalog headers to files."""

from pathlib import Path
from datetime import date
import re

# Configuration
PROJECT_ID = "MY-PROJECT"
AGENT_ID = "AGENT-HUMAN-001"
TODAY = date.today().isoformat()

# Category mapping (file path → category)
CATEGORY_MAP = {
    'test_': 'test',
    'tests/': 'test',
    'docs/': 'documentation',
    'config/': 'configuration',
    'README': 'documentation',
}

def determine_category(filepath):
    """Guess category from file path."""
    name = filepath.name.lower()
    path_str = str(filepath).lower()

    for keyword, category in CATEGORY_MAP.items():
        if keyword in name or keyword in path_str:
            return category

    # Default: script
    return 'script' if filepath.suffix == '.py' else 'documentation'

def get_category_code(category):
    """Map category name to 3-letter code."""
    codes = {
        'script': 'SCR',
        'documentation': 'DOC',
        'configuration': 'CFG',
        'test': 'TST',
    }
    return codes.get(category, 'SCR')

def get_next_sequence(category_code, existing_ids):
    """Find next sequence for category."""
    max_seq = 0
    pattern = rf'SOM-{category_code}-(\d{{4}})-v'

    for file_id in existing_ids:
        match = re.search(pattern, file_id)
        if match:
            seq = int(match.group(1))
            max_seq = max(max_seq, seq)

    return max_seq + 1

def generate_python_header(file_id, filename, description, category):
    """Generate Python file header."""
    return f"""# ==============================================================================
# file_id: {file_id}
# name: {filename}
# description: {description}
# project_id: {PROJECT_ID}
# category: {category}
# tags: []
# created: {TODAY}
# modified: {TODAY}
# version: 1.0.0
# agent_id: {AGENT_ID}
# execution: python {filename}
# ==============================================================================

"""

def add_header_to_file(filepath, header):
    """Prepend header to file."""
    # Read existing content
    with open(filepath, 'r') as f:
        existing = f.read()

    # Write header + content
    with open(filepath, 'w') as f:
        f.write(header + existing)

def main():
    # Find all files
    py_files = list(Path('.').rglob('*.py'))
    md_files = list(Path('.').rglob('*.md'))
    all_files = py_files + md_files

    # Filter out already cataloged
    to_migrate = []
    existing_ids = []

    for filepath in all_files:
        with open(filepath, 'r') as f:
            content = f.read(500)  # Read first 500 chars

        if 'file_id:' in content:
            # Extract file ID
            match = re.search(r'file_id:\s+(SOM-[A-Z]{3}-\d{4}-v[\d.]+)', content)
            if match:
                existing_ids.append(match.group(1))
        else:
            to_migrate.append(filepath)

    print(f"Found {len(all_files)} total files")
    print(f"Already cataloged: {len(existing_ids)}")
    print(f"To migrate: {len(to_migrate)}")

    # Migrate each file
    for filepath in to_migrate:
        category = determine_category(filepath)
        category_code = get_category_code(category)
        sequence = get_next_sequence(category_code, existing_ids)

        file_id = f"SOM-{category_code}-{sequence:04d}-v1.0.0"

        # Generate description (TODO: use LLM for better descriptions)
        description = f"TODO: Add description for {filepath.name}"

        if filepath.suffix == '.py':
            header = generate_python_header(file_id, filepath.name, description, category)
        else:
            # Markdown header (similar structure)
            header = generate_markdown_header(file_id, filepath.name, description, category)

        # Add header
        add_header_to_file(filepath, header)
        existing_ids.append(file_id)

        print(f"✓ {file_id}{filepath}")

    print(f"\n✅ Migrated {len(to_migrate)} files")
    print("⚠️  Remember to update descriptions (search for 'TODO')")

if __name__ == '__main__':
    main()

Run it:

python migrate_catalog.py

Step 3: Review and Fix Descriptions

The script adds placeholder descriptions. You need to fill them in:

# Find all TODOs
grep -r "description: TODO" . --include="*.py" --include="*.md"

# For each file, update the description to something meaningful

Step 4: Sync to Database

ghost-catalog sync

# Output:
# Scanned: 50 files
# New: 50 files
# Sync complete

Step 5: Validate

ghost-catalog validate

# Fix any issues
ghost-catalog validate --fix

CLI Tool Usage Guide

Installation

From releases (recommended):

# Linux
curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog-linux-amd64
chmod +x ghost-catalog-linux-amd64
sudo mv ghost-catalog-linux-amd64 /usr/local/bin/ghost-catalog

# macOS
curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog-darwin-amd64
chmod +x ghost-catalog-darwin-amd64
sudo mv ghost-catalog-darwin-amd64 /usr/local/bin/ghost-catalog

# Windows
# Download ghost-catalog-windows-amd64.exe from releases
# Add to PATH

From source (if you have Go installed):

git clone https://github.com/somacosf/ghost-catalog
cd ghost-catalog
go build -o ghost-catalog
sudo mv ghost-catalog /usr/local/bin/

Basic Commands

Initialize catalog:

ghost-catalog init

Sync catalog with file system:

ghost-catalog sync

List all files:

ghost-catalog list

Search files:

ghost-catalog search "proxy"

Get file info:

ghost-catalog info SOM-SCR-0014-v1.0.0

Validate headers:

ghost-catalog validate

Advanced Usage

Filter by category:

ghost-catalog list --category script
ghost-catalog list --category documentation

Filter by tag:

ghost-catalog search --tag opentelemetry
ghost-catalog search --tag proxy --tag security

Filter by agent:

ghost-catalog list --agent AGENT-CLAUDE-002

Sort options:

ghost-catalog list --sort modified    # Recently modified first
ghost-catalog list --sort created     # Recently created first
ghost-catalog list --sort file_id     # Alphabetical by ID

Output formats:

ghost-catalog list --format table     # Default
ghost-catalog list --format json      # For scripting
ghost-catalog list --format csv       # For Excel

Generate new file:

ghost-catalog generate \
  --file new_module.py \
  --category script \
  --description "New feature module" \
  --tags cli,admin,feature-x \
  --project MY-PROJECT

Export catalog:

ghost-catalog export catalog.json
ghost-catalog export catalog.csv --format csv

Statistics:

ghost-catalog stats

Scripting with CLI

Get all file paths:

ghost-catalog list --format json | jq -r '.[].path'

Count files by category:

ghost-catalog list --format json | jq 'group_by(.category) | map({category: .[0].category, count: length})'

Find stale files (not modified in 90 days):

ghost-catalog list --format json | jq --arg cutoff "$(date -d '90 days ago' +%Y-%m-%d)" '.[] | select(.modified < $cutoff) | .file_id'

Bubble Tea TUI Browser

Launching

ghost-catalog-tui

Interface Tour

┌────────────────────────────────────────────────────────────┐
│              Ghost_Shell File Catalog Browser               │
├──────────────┬─────────────────────────────────────────────┤
│ FILTERS      │ FILE LIST                                   │
│              │ ┌────────────┬──────────────────────────┐   │
│ Categories   │ │ File ID    │ Description              │   │
│ [x] Scripts  │ ├────────────┼──────────────────────────┤   │
│ [ ] Docs     │ │►SOM-SCR-014│ Management CLI           │   │
│ [ ] Config   │ └────────────┴──────────────────────────┘   │
│              │                                             │
│ Tags         │ FILE DETAILS                                │
│ [x] cli      │ ┌───────────────────────────────────────┐   │
│ [ ] proxy    │ │ Name:    cli.py                       │   │
│              │ │ Path:    ghost_shell/cli.py           │   │
│ Search       │ │ Tags:    [cli, management, admin]     │   │
│ [_______]    │ │ Version: 1.0.0                        │   │
│              │ │ Agent:   AGENT-CLAUDE-002             │   │
│ Actions      │ └───────────────────────────────────────┘   │
│ [o] Open     │                                             │
│ [c] Copy     │                                             │
└──────────────┴─────────────────────────────────────────────┘
│ [?] Help │ [/] Search │ [Tab] Switch │ [q] Quit           │
└────────────────────────────────────────────────────────────┘

Keybindings

Key Action Description
/k Move up Navigate file list
/j Move down Navigate file list
Enter Select Show file details
Tab Switch pane Toggle sidebar/file list
Space Toggle filter Check/uncheck filter
/ Search Focus search box
Esc Clear Clear filters/search
o Open Open file in $EDITOR
c Copy path Copy file path to clipboard
d Dependencies Show dependency graph
g Git log Show git history
v Validate Re-validate header
r Refresh Re-scan catalog
? Help Show full help screen
q Quit Exit application

Workflows

Find all proxy-related files:

  1. Launch TUI: ghost-catalog-tui
  2. Navigate to Tags section (Tab)
  3. Find "proxy" tag, press Space to select
  4. File list filters to proxy files

Open a file:

  1. Navigate to file in list (↑/↓)
  2. Press Enter to view details
  3. Press 'o' to open in editor

Search:

  1. Press /
  2. Type search query
  3. Results update live
  4. Press Enter to select first result

Automation & Workflows

Git Hooks

Pre-commit: Validate headers

Create .git/hooks/pre-commit:

#!/bin/bash
# Pre-commit hook: Validate catalog headers

echo "Validating catalog headers..."

ghost-catalog validate --strict

if [ $? -ne 0 ]; then
    echo ""
    echo "❌ Catalog validation failed!"
    echo "Fix errors above or run: ghost-catalog validate --fix"
    exit 1
fi

echo "✓ Catalog validation passed"
exit 0

Make executable:

chmod +x .git/hooks/pre-commit

Post-merge: Sync catalog

Create .git/hooks/post-merge:

#!/bin/bash
# Post-merge hook: Re-sync catalog after merges

echo "Syncing catalog after merge..."
ghost-catalog sync --quiet
echo "✓ Catalog synced"

CI/CD Integration

GitHub Actions:

Create .github/workflows/catalog-check.yml:

name: Catalog Validation

on:
  pull_request:
    paths:
      - '**/*.py'
      - '**/*.md'
      - '**/*.yaml'

jobs:
  validate:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install ghost-catalog
        run: |
          curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog-linux-amd64
          chmod +x ghost-catalog-linux-amd64
          sudo mv ghost-catalog-linux-amd64 /usr/local/bin/ghost-catalog

      - name: Initialize catalog
        run: ghost-catalog init

      - name: Sync catalog
        run: ghost-catalog sync

      - name: Validate headers
        run: ghost-catalog validate --strict

      - name: Check for missing headers
        run: |
          missing=$(ghost-catalog sync --dry-run | grep -c "Missing header" || true)
          if [ "$missing" -gt 0 ]; then
            echo "❌ Found $missing files without catalog headers"
            exit 1
          fi

      - name: Check version consistency
        run: |
          # All files should be v1.0.0+ for release branches
          if [[ "$GITHUB_REF" == refs/heads/release/* ]]; then
            pre_release=$(ghost-catalog list --format json | jq '[.[] | select(.version | startswith("0."))] | length')
            if [ "$pre_release" -gt 0 ]; then
              echo "❌ Found $pre_release files with version < 1.0.0"
              ghost-catalog list --format json | jq '.[] | select(.version | startswith("0.")) | .file_id'
              exit 1
            fi
          fi

      - name: Generate catalog report
        if: always()
        run: |
          ghost-catalog stats > catalog_report.txt
          cat catalog_report.txt

      - name: Upload catalog report
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: catalog-report
          path: catalog_report.txt

Editor Integration

VS Code Extension (proposed):

Features:

  • Syntax highlighting for catalog headers
  • Autocomplete for tags
  • Quick actions: "Add catalog header", "Update version", "Validate header"
  • Hover tooltips showing file metadata

Installation:

# Once published
code --install-extension somacosf.ghost-catalog

Vim Plugin (proposed):

Features:

  • Snippets for catalog headers (:CatalogHeader)
  • Commands: :CatalogValidate, :CatalogInfo
  • Statusline integration showing file ID

Installation:

" Add to .vimrc
Plug 'somacosf/ghost-catalog.vim'

Part V: Advanced Topics

Dependency Tracking

Why Track Dependencies?

Scenario: You want to refactor db_handler.py. Which files will break?

Without dependency tracking:

# Grep for imports (unreliable)
grep -r "from.*db_handler import" .
grep -r "import db_handler" .

With dependency tracking:

SELECT fc.file_id, fc.name
FROM file_dependencies fd
JOIN file_catalog fc ON fd.file_id = fc.file_id
WHERE fd.depends_on_file_id = 'SOM-SCR-XXXX-v1.0.0';

Result:

SOM-SCR-0012-v1.1.0  core.py
SOM-SCR-0013-v1.0.0  collector.py
SOM-SCR-0014-v1.0.0  cli.py

Building the Dependency Graph

Automated detection (Python example):

import ast
from pathlib import Path

def extract_imports(filepath):
    """Extract all import statements from Python file."""
    with open(filepath) as f:
        try:
            tree = ast.parse(f.read())
        except SyntaxError:
            return []

    imports = []

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports.append({
                    'type': 'import',
                    'module': alias.name,
                    'alias': alias.asname
                })
        elif isinstance(node, ast.ImportFrom):
            imports.append({
                'type': 'from',
                'module': node.module,
                'names': [alias.name for alias in node.names],
                'level': node.level  # Relative imports
            })

    return imports

# Usage
imports = extract_imports('ghost_shell/cli.py')
print(imports)
# [
#   {'type': 'from', 'module': 'ghost_shell.data.db_handler', 'names': ['DatabaseHandler']},
#   {'type': 'import', 'module': 'rich', 'alias': None},
#   ...
# ]

Map to file IDs:

def resolve_module_to_file_id(module_name, catalog):
    """Map Python module to file ID."""
    # Convert module to file path
    # ghost_shell.data.db_handler → ghost_shell/data/db_handler.py

    path = module_name.replace('.', '/') + '.py'

    # Look up in catalog
    for entry in catalog:
        if entry['path'] == path:
            return entry['file_id']

    return None

# Build the dependency graph
# (assumes `catalog` is a list of file-entry dicts synced from the catalog DB,
#  and `db` below is an open sqlite3 connection to catalog.db)
dependencies = []
for file_entry in catalog:
    if file_entry['category'] != 'script':
        continue

    imports = extract_imports(file_entry['path'])

    for imp in imports:
        dep_file_id = resolve_module_to_file_id(imp['module'], catalog)
        if dep_file_id:
            dependencies.append({
                'file_id': file_entry['file_id'],
                'depends_on': dep_file_id,
                'type': 'import'
            })

# Insert into database
for dep in dependencies:
    db.execute("""
        INSERT INTO file_dependencies (file_id, depends_on_file_id, dependency_type)
        VALUES (?, ?, ?)
    """, (dep['file_id'], dep['depends_on'], dep['type']))

Dependency Queries

Direct dependencies (what does file X import?):

SELECT fc.file_id, fc.name
FROM file_dependencies fd
JOIN file_catalog fc ON fd.depends_on_file_id = fc.file_id
WHERE fd.file_id = 'SOM-SCR-0014-v1.0.0';

Reverse dependencies (what imports file X?):

SELECT fc.file_id, fc.name
FROM file_dependencies fd
JOIN file_catalog fc ON fd.file_id = fc.file_id
WHERE fd.depends_on_file_id = 'SOM-SCR-XXXX-v1.0.0';

Transitive dependencies (all files X depends on, recursively):

WITH RECURSIVE deps AS (
    -- Base case
    SELECT depends_on_file_id, 1 as depth
    FROM file_dependencies
    WHERE file_id = 'SOM-SCR-0014-v1.0.0'

    UNION ALL

    -- Recursive case
    SELECT fd.depends_on_file_id, deps.depth + 1
    FROM file_dependencies fd
    JOIN deps ON fd.file_id = deps.depends_on_file_id
    WHERE deps.depth < 10
)
SELECT DISTINCT fc.file_id, fc.name, MAX(deps.depth) as depth
FROM deps
JOIN file_catalog fc ON deps.depends_on_file_id = fc.file_id
GROUP BY fc.file_id
ORDER BY depth;

Visualizing Dependencies

Mermaid diagram:
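
from pathlib import Path

# Assumed helper, not defined in the original sketch: map a file ID to a
# short node label for the diagram, e.g. 'SOM-SCR-0014-v1.0.0' -> 'cli'.
def get_short_name(file_id, catalog):
    for entry in catalog:
        if entry['file_id'] == file_id:
            return Path(entry['path']).stem
    return file_id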

def generate_dependency_graph(root_file_id, catalog, dependencies):
    """Generate Mermaid dependency graph."""
    lines = ["graph LR"]

    # Get all related files
    visited = set()
    queue = [root_file_id]

    while queue:
        current = queue.pop(0)
        if current in visited:
            continue
        visited.add(current)

        # Find dependencies
        for dep in dependencies:
            if dep['file_id'] == current:
                # Add edge
                from_name = get_short_name(current, catalog)
                to_name = get_short_name(dep['depends_on'], catalog)
                lines.append(f"    {from_name} --> {to_name}")

                queue.append(dep['depends_on'])

    return "\n".join(lines)

# Usage
mermaid = generate_dependency_graph('SOM-SCR-0014-v1.0.0', catalog, dependencies)
print(mermaid)

Output:

graph LR
    cli --> db_handler
    cli --> rich
    db_handler --> sqlite3
    db_handler --> duckdb

Analytics & Insights

Agent Activity Reports

Files created per agent:

SELECT
    ar.name as agent,
    COUNT(*) as files_created
FROM file_catalog fc
JOIN agent_registry ar ON fc.agent_id = ar.id
GROUP BY ar.name
ORDER BY files_created DESC;

Result:

agent             | files_created
AGENT-CLAUDE-002  | 18
AGENT-GPT-001     | 7
AGENT-HUMAN-001   | 5

Agent activity timeline:

SELECT
    date(modified) as day,
    agent_id,
    COUNT(*) as edits
FROM file_catalog
WHERE modified >= date('now', '-30 days')
GROUP BY day, agent_id
ORDER BY day DESC;

Code Churn Analysis

Most frequently modified files:

SELECT
    file_id,
    name,
    julianday(modified) - julianday(created) as age_days,
    -- position 5 is the patch digit in X.Y.Z (assumes single-digit components)
    CAST(substr(version, 5, 1) AS INTEGER) as patch_count
FROM file_catalog
ORDER BY patch_count DESC
LIMIT 10;

Tag Analytics

Most popular tags:

SELECT tag, COUNT(*) as usage_count
FROM file_tags
GROUP BY tag
ORDER BY usage_count DESC
LIMIT 20;

Tag co-occurrence matrix:

SELECT
    t1.tag,
    t2.tag,
    COUNT(*) as co_occurrences
FROM file_tags t1
JOIN file_tags t2 ON t1.file_id = t2.file_id AND t1.tag < t2.tag
GROUP BY t1.tag, t2.tag
HAVING co_occurrences > 1
ORDER BY co_occurrences DESC
LIMIT 20;

Result:

tag1          | tag2          | co_occurrences
proxy         | security      | 5
cli           | database      | 3
opentelemetry | monitoring    | 4

Project Health Metrics

Version distribution:

SELECT
    CASE
        WHEN version LIKE '0.%' THEN 'Pre-release'
        WHEN version LIKE '1.%' THEN 'v1.x'
        WHEN version LIKE '2.%' THEN 'v2.x'
        ELSE 'Other'
    END as version_range,
    COUNT(*) as file_count
FROM file_catalog
GROUP BY version_range;

Stale files (not updated in 90 days):

SELECT
    file_id,
    name,
    modified,
    julianday('now') - julianday(modified) as days_stale
FROM file_catalog
WHERE days_stale > 90
ORDER BY days_stale DESC;

Integration with Development Tools

Repomix Integration

Generate catalog-enriched output:

# Standard repomix
repomix --output code.txt

# Then parse and annotate with catalog data
python enrich_repomix.py code.txt catalog.db > code_enriched.txt

enrich_repomix.py:

import re
import sqlite3

def enrich_repomix(repomix_path, catalog_db):
    """Add catalog metadata annotations to repomix output."""
    conn = sqlite3.connect(catalog_db)

    with open(repomix_path) as f:
        lines = f.readlines()

    output = []

    for line in lines:
        output.append(line)

        # Detect file boundaries
        if line.startswith('<file path='):
            # Extract path
            match = re.search(r'path="([^"]+)"', line)
            if match:
                path = match.group(1)

                # Look up in catalog
                cursor = conn.execute("""
                    SELECT file_id, description, tags
                    FROM file_catalog
                    WHERE path = ?
                """, (path,))

                row = cursor.fetchone()
                if row:
                    file_id, desc, tags = row
                    output.append(f"<!-- CATALOG: {file_id} -->\n")
                    output.append(f"<!-- DESC: {desc} -->\n")
                    output.append(f"<!-- TAGS: {tags} -->\n")

    return ''.join(output)

# Usage
enriched = enrich_repomix('code.txt', 'data/catalog.db')
print(enriched)

Markdown Documentation Generation

Auto-generate project structure doc:

import sqlite3

def generate_project_structure_md(catalog_db):
    """Generate markdown documentation from catalog."""
    conn = sqlite3.connect(catalog_db)

    md = ["# Project Structure\n\n"]

    # Group by category
    categories = conn.execute("""
        SELECT DISTINCT category FROM file_catalog ORDER BY category
    """).fetchall()

    for (category,) in categories:
        md.append(f"## {category.title()}s\n\n")

        files = conn.execute("""
            SELECT file_id, name, description, tags
            FROM file_catalog
            WHERE category = ?
            ORDER BY file_id
        """, (category,)).fetchall()

        for file_id, name, desc, tags in files:
            md.append(f"### {name} (`{file_id}`)\n\n")
            md.append(f"**Description**: {desc}\n\n")
            md.append(f"**Tags**: {tags}\n\n")

    return ''.join(md)

# Usage
doc = generate_project_structure_md('data/catalog.db')
with open('PROJECT_STRUCTURE.md', 'w') as f:
    f.write(doc)

Troubleshooting Common Issues

Issue 1: Version Mismatch

Symptom:

ghost-catalog validate
✗ SOM-SCR-0014-v1.0.0  cli.py
  └─ Version mismatch: file_id has v1.0.0 but version field has 1.0.1

Cause: The file_id line and version field don't match.

Fix:

# Auto-fix
ghost-catalog validate --fix

# Or manually edit the file to match

Issue 2: Missing Required Fields

Symptom:

✗ SOM-SCR-0008-v1.0.0  utils.py
  └─ Missing required field: description

Cause: Header is incomplete.

Fix: Open the file, add the missing field:

# description: Utility functions for data processing

Issue 3: Invalid File ID Format

Symptom:

✗ utils.py
  └─ Invalid file_id format: SOM-SCR-1-v1.0.0

Cause: Sequence number must be 4 digits.

Fix: Change SOM-SCR-1-v1.0.0 to SOM-SCR-0001-v1.0.0
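
Zero-padding can be automated. A sketch of a hypothetical normalizer (not part of the ghost-catalog CLI):

import re

def normalize_file_id(file_id):
    """Re-pad the sequence number to the required 4 digits."""
    m = re.match(r'^SOM-([A-Z]+)-(\d+)-v(\d+\.\d+\.\d+)$', file_id)
    if not m:
        raise ValueError(f'Unparseable file_id: {file_id}')
    cat, seq, ver = m.groups()
    return f'SOM-{cat}-{int(seq):04d}-v{ver}'

normalize_file_id('SOM-SCR-1-v1.0.0')  # 'SOM-SCR-0001-v1.0.0'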


Issue 4: Filename Mismatch

Symptom:

✗ SOM-SCR-0014-v1.0.0  cli.py
  └─ Filename mismatch: file is 'cli.py' but header says 'old_cli.py'

Cause: File was renamed but header not updated.

Fix: Update the name field in the header to match actual filename.
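
The check itself is a one-liner; a sketch:

from pathlib import Path

def filename_matches(path, header_name):
    """True when the on-disk filename agrees with the header's name field."""
    return Path(path).name == header_name

filename_matches('ghost_shell/cli.py', 'old_cli.py')  # False -> mismatch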


Issue 5: Database Out of Sync

Symptom:

ghost-catalog list
# Shows old data, missing recent files

Cause: Catalog database hasn't been synced.

Fix:

ghost-catalog sync
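
To prevent drift in the first place, the sync can be hooked into Git (a minimal sketch; assumes ghost-catalog is on PATH and the hook file is executable):

#!/bin/sh
# .git/hooks/post-commit -- re-sync the catalog after every commit
ghost-catalog sync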

Part VI: Real-World Use Cases

Use Case 1: Onboarding New Team Members

The Problem

Scenario: Sarah joins the Ghost_Shell team. She needs to:

  • Understand project architecture
  • Find relevant files for her first task (add CLI command)
  • Not waste time reading every file

Traditional approach: 2-3 days of code reading

With catalog: 30 minutes


The Workflow

sequenceDiagram
    participant Sarah
    participant TUI as Catalog TUI
    participant Docs as Documentation
    participant Code as Codebase

    Sarah->>TUI: Launch ghost-catalog-tui
    TUI-->>Sarah: Show project overview

    Sarah->>TUI: Filter: category=documentation
    TUI-->>Sarah: Show 4 docs

    Sarah->>Docs: Read SOM-DOC-0001 (Quickstart)
    Docs-->>Sarah: 10-minute overview

    Sarah->>TUI: Search: "CLI"
    TUI-->>Sarah: Found SOM-SCR-0014 (cli.py)

    Sarah->>TUI: Press 'o' to open
    TUI->>Code: Open ghost_shell/cli.py in editor

    Sarah->>Code: Read cli.py, understand pattern
    Sarah->>Code: Add new command (following pattern)

    Note over Sarah: First contribution in 30 minutes!

What Sarah Learned Without Reading Code

From the catalog, Sarah instantly knew:

  • SOM-DOC-0001: Quickstart guide (read this first)
  • SOM-DOC-0003: Architecture overview (technical deep dive)
  • SOM-SCR-0014: CLI implementation (what she needs to modify)
  • SOM-SCR-XXXX: Database handler (CLI dependency)

From tags:

  • cli + admin = She needs admin tools
  • opentelemetry = Project uses OTel for instrumentation
  • rich = CLI uses Rich library for formatting

Use Case 2: Managing Large Codebases

The Problem

Scenario: Ghost_Shell has grown to 200+ files. Mike the maintainer needs to:

  • Find all files related to a feature (e.g., "proxy")
  • Identify unmaintained files
  • Audit version consistency before release

Traditional approach: Manual file-by-file review

With catalog: Automated queries


Finding All Proxy Files

ghost-catalog search --tag proxy

# Results:
# SOM-SCR-0011-v1.0.0  blocker.py       (Traffic blocking)
# SOM-SCR-0012-v1.1.0  core.py          (Main proxy addon)
# SOM-SCR-0016-v1.0.0  fingerprint.py   (Fingerprint randomization)
# SOM-SCR-0017-v1.0.0  cookies.py       (Cookie interception)
# SOM-DOC-0005-v1.0.0  PROXY_GUIDE.md   (Proxy documentation)

Impact: Found 5 files instantly (vs hours of grep/search)


Identifying Unmaintained Files

-- Files not touched in 6 months
SELECT file_id, name, modified,
       julianday('now') - julianday(modified) as days_stale
FROM file_catalog
WHERE julianday('now') - julianday(modified) > 180
ORDER BY days_stale DESC;

Result:

file_id              | name        | modified   | days_stale
SOM-SCR-0003-v1.0.0  | old_util.py | 2024-03-15 | 245
SOM-DOC-0002-v1.0.0  | OLD_API.md  | 2024-05-10 | 189

Action: Archive or update these files.


Version Audit Before Release

-- Find all pre-1.0 files
SELECT file_id, name, version
FROM file_catalog
WHERE version LIKE '0.%';

Release policy: All files must be v1.0.0+ for release.

Action: Bump versions of pre-release files or exclude from release.


Use Case 3: AI Agent Coordination

The Problem

Scenario: Three AI agents work on Ghost_Shell simultaneously:

  • Agent A (CLAUDE): Adds OpenTelemetry instrumentation
  • Agent B (GPT): Refactors database layer
  • Agent C (HUMAN): Fixes bugs in CLI

They need to:

  • Know what each other is working on
  • Avoid editing the same files
  • Hand off work when blocked

Traditional approach: Manual coordination (slow, error-prone)

With catalog: Agents query catalog for activity


Agent A: Find Files Needing OTel

# Agent A task: Add OTel to all files without it
ghost-catalog search --tag opentelemetry --format json > has_otel.json

ghost-catalog list --category script --format json > all_scripts.json

# Subtract: scripts WITHOUT otel
python find_missing_otel.py has_otel.json all_scripts.json > need_otel.txt

Result: Agent A has a work list.
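
find_missing_otel.py can be a few lines of set arithmetic. A sketch, assuming both JSON exports are arrays of records with file_id and name fields (the actual --format json schema may differ):

import json
import sys

def main(has_otel_path, all_scripts_path):
    with open(has_otel_path) as f:
        instrumented = {rec['file_id'] for rec in json.load(f)}
    with open(all_scripts_path) as f:
        all_scripts = json.load(f)

    # Scripts whose file_id is absent from the instrumented set
    for rec in all_scripts:
        if rec['file_id'] not in instrumented:
            print(f"{rec['file_id']}  {rec['name']}")

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])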


Agent B: Check Dependencies Before Refactoring

-- Agent B wants to refactor db_handler.py
-- First, check what depends on it
SELECT fc.file_id, fc.name, fc.agent_id
FROM file_dependencies fd
JOIN file_catalog fc ON fd.file_id = fc.file_id
WHERE fd.depends_on_file_id = 'SOM-SCR-XXXX-v1.0.0';

Result:

file_id              | name         | agent_id
SOM-SCR-0012-v1.1.0  | core.py      | AGENT-CLAUDE-001
SOM-SCR-0014-v1.0.0  | cli.py       | AGENT-HUMAN-001

Action: Agent B notifies Agent A and Agent C about upcoming breaking changes.


Agent C: Check Recent Activity

-- Agent C wants to know what changed in last 24 hours
SELECT file_id, name, modified, agent_id
FROM file_catalog
WHERE modified >= date('now', '-1 day')
ORDER BY modified DESC;

Result: Agent C sees Agent A added OTel to 3 files yesterday.

Action: Agent C reviews changes before starting work.


Handoff: Agent A → Agent B

Agent A completes OTel work, updates diary:

## 2025-01-24: Agent A (CLAUDE)
- Added OTel to SOM-SCR-0012, 0013, 0014
- Updated tags: Added 'opentelemetry' to all
- Next: Agent B should verify OTel integration in database layer

Agent B reads diary, queries catalog:

ghost-catalog search --tag opentelemetry --tag database

Result: Agent B picks up where Agent A left off.


Use Case 4: Code Archaeology

The Problem

Scenario: Production bug in collector.py. Questions:

  • When was this file created?
  • Who created it?
  • Has it been modified recently?
  • What does it depend on?

Traditional approach: Git log + manual inspection

With catalog: Single query


File History

ghost-catalog info SOM-SCR-0013-v1.0.0

Output:

╭──────────────────────────────────────────────────╮
│ File: SOM-SCR-0013-v1.0.0                        │
├──────────────────────────────────────────────────┤
│ Name:        collector.py                        │
│ Path:        ghost_shell/intel/collector.py      │
│ Description: Intelligence collector              │
│ Category:    script                              │
│ Tags:        [intel, asn, teamcymru]             │
│ Version:     1.0.0                               │
│ Created:     2025-11-23                          │
│ Modified:    2025-11-23 (today!)                 │
│ Agent:       AGENT-CLAUDE-002                    │
│ Execution:   python -m ghost_shell.intel.collect │
╰──────────────────────────────────────────────────╯

Insights:

  • Created on 2025-11-23
  • Created by AGENT-CLAUDE-002
  • Modified TODAY (bug likely introduced today)

Blame Analysis

-- Find all modifications to collector.py
SELECT
    date(modified) as change_date,
    version,
    agent_id
FROM file_catalog
WHERE file_id LIKE 'SOM-SCR-0013-%'
ORDER BY change_date DESC;

(Note: This requires version history tracking, which can be added by archiving old versions in catalog)
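
One way to add that history is to archive a snapshot row whenever a header changes. A hypothetical schema sketch (file_history is not part of the current catalog schema):

-- Hypothetical history table: one row per (file_id, version) snapshot
CREATE TABLE IF NOT EXISTS file_history (
    file_id  TEXT NOT NULL,
    version  TEXT NOT NULL,
    modified TEXT NOT NULL,
    agent_id TEXT,
    PRIMARY KEY (file_id, version)
);

-- On every sync, copy current rows in before overwriting the catalog
INSERT OR IGNORE INTO file_history (file_id, version, modified, agent_id)
SELECT file_id, version, modified, agent_id FROM file_catalog;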


Dependency Check

-- What does collector.py depend on?
SELECT fc.file_id, fc.name
FROM file_dependencies fd
JOIN file_catalog fc ON fd.depends_on_file_id = fc.file_id
WHERE fd.file_id = 'SOM-SCR-0013-v1.0.0';

Result:

SOM-SCR-XXXX-v1.0.0  db_handler.py

Action: Check if db_handler.py changed recently (possible cause of bug).


Use Case 5: Compliance & Auditing

The Problem

Scenario: Company audit requires:

  • List all files containing security-sensitive code
  • Prove all files have been reviewed in last 6 months
  • Show version control and change tracking

Traditional approach: Manual audit (weeks of work)

With catalog: Automated compliance reports


Security Audit

# Find all security-related files
ghost-catalog search --tag security --format json > security_files.json

# Count them
jq 'length' security_files.json
# Output: 12 files

Report:

# Security Audit Report
- Total security-related files: 12
- All files cataloged: Yes
- All files have agent IDs: Yes (traceability)

Review Compliance

-- Files not reviewed (modified) in 6 months
SELECT
    file_id,
    name,
    modified,
    julianday('now') - julianday(modified) as days_since_review
FROM file_catalog
WHERE category IN ('script', 'configuration')
  AND julianday('now') - julianday(modified) > 180;

Result:

2 files require review

Action: Flag these for review, update modified dates after review.
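
The query and the report can be wrapped into one script for the auditors. A sketch (table and column names as used above):

import sqlite3

def review_compliance_report(catalog_db, max_days=180):
    """Print script/config files whose last modification exceeds max_days."""
    conn = sqlite3.connect(catalog_db)
    rows = conn.execute("""
        SELECT file_id, name, modified,
               CAST(julianday('now') - julianday(modified) AS INTEGER) AS days
        FROM file_catalog
        WHERE category IN ('script', 'configuration')
          AND julianday('now') - julianday(modified) > ?
        ORDER BY days DESC
    """, (max_days,)).fetchall()
    conn.close()

    for file_id, name, modified, days in rows:
        print(f"{file_id}  {name}  last reviewed {modified} ({days} days ago)")
    print(f"{len(rows)} files require review")

review_compliance_report('data/catalog.db')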


Change Tracking

-- Show all changes in last quarter
SELECT
    date(modified) as change_date,
    file_id,
    name,
    agent_id
FROM file_catalog
WHERE modified >= date('now', '-90 days')
ORDER BY change_date DESC;

Export for audit:

ghost-catalog export audit_report.csv --format csv --from 2024-10-01

Part VII: Decision Framework

When to Use This System

✅ Use SOM File Catalog When:

1. Multi-Agent Development

  • Multiple AI agents (or AI + humans) work on codebase
  • Need coordination and handoffs
  • Example: Claude, GPT-4, and human developers all contributing

2. Rapid Onboarding

  • New contributors join frequently
  • Need to understand codebase quickly
  • Example: Open-source project with rotating contributors

3. Complex Codebases

  • 50+ files
  • Multiple modules/features
  • Hard to remember what each file does

4. Versioning Matters

  • Track file versions independently
  • Need to know which files are stable (v1.x) vs experimental (v0.x)

5. Semantic Organization

  • Want to group files by purpose (scripts, docs, tests)
  • Tag-based discovery important

6. Compliance/Audit Requirements

  • Need to track who created/modified files
  • Prove review cycles
  • Generate compliance reports

7. Documentation-Heavy Projects

  • Lots of docs that need to stay organized
  • Docs need to reference code files consistently

When NOT to Use This System

❌ Avoid SOM File Catalog When:

1. Tiny Projects

  • 1-10 files
  • Single developer
  • No need for coordination
  • Alternative: Just use good filenames and a README

2. Volatile Early-Stage Projects

  • Files created/deleted constantly
  • Architecture not settled
  • Issue: Maintaining headers is overhead
  • Wait until: Architecture stabilizes

3. No AI Agent Involvement

  • Pure human development
  • Team uses existing tools (IDEs, Git, Jira)
  • Alternative: Stick with what works

4. External Codebase

  • You don't control the files
  • Can't add headers (third-party libraries)
  • Alternative: External documentation

5. Real-Time Collaboration

  • Google Docs-style simultaneous editing
  • Issue: Headers can't track live changes
  • Alternative: Use version control + branch names

6. Binary Files

  • Can't embed headers in executables, images, videos
  • Alternative: External metadata database only

Comparison: SOM IDs vs UUIDs vs Git Hashes

Feature             | SOM File IDs                      | UUIDs                        | Git Commit Hashes
--------------------|-----------------------------------|------------------------------|-------------------------------
Uniqueness          | Sequential (human-assigned)       | Cryptographically guaranteed | Cryptographically guaranteed
Human-Readable      | ✅ Yes (semantic)                 | ❌ No (random)               | ❌ No (random)
Self-Documenting    | ✅ Category + sequence            | ❌ Opaque                    | ❌ Opaque
Collision Risk      | ⚠️ Manual (low with coordination) | ✅ None (negligible)         | ✅ None (SHA-1)
Version Tracking    | ✅ Built-in (semver)              | ❌ No                        | ✅ Yes (commit history)
Searchability       | ✅ By category, tag, agent        | ❌ Exact match only          | ✅ By branch, author, date
Discoverability     | ✅ Browse by category             | ❌ Need registry lookup      | ⚠️ Browse by git log
Metadata            | ✅ 12+ fields embedded            | ❌ None                      | ✅ Commit message, author
Git-Friendly        | ✅ Readable diffs                 | ❌ Random strings            | ✅ Built-in
Distributed Systems | ⚠️ Coordination needed            | ✅ Works offline             | ✅ Decentralized
File Granularity    | ✅ Per-file tracking              | ✅ Can be per-file           | ❌ Per-commit (multiple files)
Tooling             | ⚠️ Custom (ghost-catalog)         | ✅ Language built-ins        | ✅ Git (universal)

When to Use Each

Use SOM File IDs for:

  • AI agent coordination
  • Semantic file organization
  • Rapid onboarding
  • Tag-based discovery

Use UUIDs for:

  • Database primary keys
  • Distributed systems (no coordination)
  • API tokens
  • Anonymous identifiers

Use Git Hashes for:

  • Version control (already using Git)
  • Tracking commit history
  • Branching/merging
  • Reproducible builds

Use All Three:

  • SOM IDs: File catalog (what/who/when)
  • Git: Version history (changes over time)
  • UUIDs: Database records (data objects)

Summary: The Complete Picture

┌─────────────────────────────────────────────────────────┐
│              Ghost_Shell File Catalog System            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  What:  Semantic file IDs (SOM-XXX-NNNN-vX.X.X)        │
│  Why:   AI agent coordination + rapid onboarding       │
│  How:   Embedded headers + optional SQLite catalog     │
│  When:  Multi-agent, 50+ files, frequent onboarding    │
│  Tools: CLI, TUI, SQL queries, Git hooks               │
│                                                         │
│  ✅ Human-readable, self-documenting                   │
│  ✅ Category-based, tag-searchable                     │
│  ✅ Agent tracking, version control                    │
│  ✅ Git-friendly, works offline                        │
│                                                         │
│  ⚠️  Manual coordination needed (no UUID uniqueness)   │
│  ⚠️  Custom tooling required (not built-in)            │
│  ⚠️  Header maintenance overhead                       │
│                                                         │
└─────────────────────────────────────────────────────────┘

Where to Go From Here

Next Steps:

  1. ✅ Read this guide (you're here!)
  2. 📥 Install ghost-catalog CLI
  3. 🧪 Try it on a small project (5-10 files)
  4. 📚 Add headers to existing project
  5. 🔍 Build catalog database
  6. 🚀 Launch TUI browser
  7. 🔗 Add Git hooks for automation
  8. 📊 Query catalog for insights


END OF GUIDE

Document Version: 1.0.0
Created: 2025-01-24
Author: Claude (Sonnet 4.5)
Project: Ghost_Shell / Somacosf
License: MIT


"Every file tells a story. The catalog makes sure you can find it."
