A Practical Guide to Understanding, Using, and Implementing Semantic File IDs
"Every file tells a story. The catalog system makes sure you can find it, understand it, and trust it."
- What Is This System?
- Why Not Just Use Traditional UUIDs?
- The Human Story: A Day in the Life
- Core Concepts Explained
- Quick Start: Your First 5 Minutes
- Reading File IDs Like a Pro
- Finding Files: The Four Methods
- Working with Tags
- Version Management
- System Architecture
- The File ID Format Deep Dive
- Header Formats for Every File Type
- How IDs Are Generated
- The Registry: Current vs Future
- Setting Up Your First Catalog
- Migrating an Existing Project
- CLI Tool Usage Guide
- Bubble Tea TUI Browser
- Automation & Workflows
- Dependency Tracking
- Analytics & Insights
- Integration with Development Tools
- Troubleshooting Common Issues
- Use Case 1: Onboarding New Team Members
- Use Case 2: Managing Large Codebases
- Use Case 3: AI Agent Coordination
- Use Case 4: Code Archaeology
- Use Case 5: Compliance & Auditing
Imagine if every file in your codebase had a name tag that told you:
- What it is
- What it does
- Who created it
- When it was made
- What version it is
- How to run it
- What topics it relates to
That's the Ghost_Shell File Catalog System. Instead of random UUIDs like 550e8400-e29b-41d4-a716-446655440000, you get semantic IDs like SOM-SCR-0014-v1.0.0 that actually mean something.
Traditional File System Ghost_Shell Catalog System
───────────────────── ────────────────────────────
📁 src/ 📁 src/
📄 cli.py 📄 cli.py
📄 core.py ├─ ID: SOM-SCR-0014-v1.0.0
📄 utils.py ├─ "Management CLI"
├─ Tags: [cli, admin]
├─ Agent: AGENT-CLAUDE-002
└─ Run: python -m ghost_shell.cli
📄 core.py
├─ ID: SOM-SCR-0012-v1.1.0
├─ "Main proxy addon"
└─ Tags: [proxy, core]
Problem: In a multi-agent development environment (humans + AI), it's hard to:
- Track who changed what
- Understand file purposes without reading code
- Find related files quickly
- Maintain version consistency
- Coordinate between different agents
Solution: Embed rich metadata directly into every file using a semantic naming system.
# File: 550e8400-e29b-41d4-a716-446655440000.py
# (What does this do? No idea without reading it.)
def process_data():
    pass
UUIDs are great for:
- Distributed systems needing uniqueness guarantees
- Database primary keys
- API tokens
- Anything where collision risk is critical
UUIDs are terrible for:
- Human comprehension
- Semantic grouping (finding related files)
- Version tracking
- Understanding file purpose at a glance
# ==============================================================================
# file_id: SOM-SCR-0014-v1.0.0
# name: cli.py
# description: Ghost_Shell unified management CLI
# project_id: GHOST-SHELL
# category: script
# tags: [cli, management, admin, opentelemetry]
# created: 2025-11-23
# modified: 2025-11-23
# version: 1.0.0
# agent_id: AGENT-CLAUDE-002
# execution: python -m ghost_shell.cli [command] [args]
# ==============================================================================
Instantly you know:
- It's a script (SCR)
- It's the 14th script in the project
- It's version 1.0.0
- It's a CLI tool (from tags)
- Claude created it
- How to run it
| Feature | UUID | SOM File ID |
|---|---|---|
| Uniqueness | ✅ Statistically guaranteed | ⚠️ Sequential within category (needs coordinated allocation) |
| Human Readable | ❌ Completely opaque | ✅ Self-documenting |
| Semantic Grouping | ❌ No meaning | ✅ Category-based |
| Version Tracking | ❌ Needs external system | ✅ Built-in semver |
| Searchability | ❌ Only exact match | ✅ By category, tag, agent |
| Git Friendly | ❌ Random strings in diffs | ✅ Readable diffs |
| Discoverability | ❌ Need registry lookup | ✅ Self-contained metadata |
| Agent Coordination | ❌ No agent info | ✅ Agent ID embedded |
For a single-workspace, multi-agent development environment where:
- Agents need to understand existing code quickly
- Files have clear categories (scripts, docs, tests)
- Version tracking matters
- Human oversight is common
SOM IDs win because they prioritize comprehension over collision resistance.
8:00 AM - Sarah joins the Ghost_Shell project. She's never seen the codebase before.
# Without catalog system:
$ ls ghost_shell/
cli.py core.py main.py utils.py blocker.py collector.py
# She has to open each file and read it to understand what it does

# With catalog system:
$ ghost-catalog list
╭────────────────────────┬──────────────────────────────────╮
│ File ID │ Description │
├────────────────────────┼──────────────────────────────────┤
│ SOM-SCR-0010-v1.0.0 │ OpenTelemetry setup │
│ SOM-SCR-0011-v1.0.0 │ Traffic blocking │
│ SOM-SCR-0012-v1.1.0 │ Main proxy addon │
│ SOM-SCR-0013-v1.0.0 │ Intelligence collector │
│ SOM-SCR-0014-v1.0.0 │ Management CLI │
╰────────────────────────┴──────────────────────────────────╯
# She immediately understands the architecture without reading code
Result: Sarah is productive in 5 minutes instead of 5 hours.
12:00 PM - Alex (an AI agent) needs to find all files related to OpenTelemetry to add new metrics.
# Without catalog system:
$ grep -r "opentelemetry" . --include="*.py"
# Returns 200+ lines of code matches, unclear which FILES to edit

# With catalog system:
$ ghost-catalog search --tag opentelemetry
Found 4 files:
- SOM-SCR-0010-v1.0.0 telemetry.py (OpenTelemetry setup)
- SOM-SCR-0012-v1.1.0 core.py (Main proxy - uses OTel)
- SOM-SCR-0013-v1.0.0 collector.py (Intel collector - OTel metrics)
- SOM-SCR-0014-v1.0.0 cli.py (CLI - OTel integration)
# Alex knows exactly which 4 files to modify
Result: Precise targeting instead of shotgun edits.
3:00 PM - Mike needs to audit which files haven't been updated in 6 months for the compliance report.
# Without catalog system:
$ find . -name "*.py" -mtime +180
# Returns raw file paths, no context about what they do

# With catalog system:
$ ghost-catalog list --sort modified | head -5
Oldest Modified Files:
╭────────────────────────┬──────────────────┬────────────╮
│ File ID │ Description │ Last Edit │
├────────────────────────┼──────────────────┼────────────┤
│ SOM-SCR-0003-v1.0.0 │ Legacy parser │ 2024-06-15 │
│ SOM-DOC-0001-v1.0.0 │ Old setup guide │ 2024-07-20 │
╰────────────────────────┴──────────────────┴────────────╯
# Mike has actionable info with business context
Result: Compliance report done in 10 minutes, not 2 hours.
6:00 PM - Emma (an AI onboarding agent) detects a new contributor and generates a personalized learning path.
# Emma queries the catalog database:
SELECT file_id, name, description, tags
FROM file_catalog
WHERE category = 'documentation'
ORDER BY created;
# Results:
# SOM-DOC-0001-v1.0.0 QUICKSTART.md (Quick start guide)
# SOM-DOC-0003-v1.0.0 CODEBASE_OVERVIEW.md (Complete architecture)
# SOM-DOC-0004-v1.0.0 API_REFERENCE.md (API documentation)
Emma generates:
Welcome! Here's your learning path:
1. Start: SOM-DOC-0001 (Quickstart - 10 min read)
2. Then: SOM-DOC-0003 (Architecture overview - 30 min read)
3. Deep dive: SOM-SCR-0012 (Core proxy code)
Result: New contributors have a clear path instead of drowning in files.
What: IDs that convey meaning through their structure
Example:
SOM-SCR-0014-v1.0.0
│ │ │ │
│ │ │ └─ Version 1.0.0 (semantic versioning)
│ │ └────── Sequence 0014 (14th script)
│ └────────── Category: SCR (script)
└────────────── Namespace: SOM (Somacosf workspace)
Why it matters: You can tell the file's purpose without opening it.
What: Every file contains a structured header with metadata
Why not a separate database?
- Metadata travels with the file (Git-trackable)
- No database corruption risk
- Works offline immediately
- Self-documenting code
The trade-off:
- Slower searches (scan files vs query DB)
- No relationships between files
- Manual updates needed
The solution: Hybrid approach
- Metadata in files (source of truth)
- Database catalog (for fast queries)
- Sync tool keeps them aligned
What: File types organized by purpose
The 9 Categories:
| Code | Category | Purpose | Examples |
|---|---|---|---|
| CMD | Slash commands | Claude Code commands | Custom workflows |
| SCR | Scripts | Executable code | cli.py, launcher.py |
| DOC | Documentation | Human-readable docs | README, guides |
| CFG | Configuration | Settings files | config.yaml |
| REG | Registry | Catalog metadata | agent_registry.md |
| TST | Tests | Test suites | test_*.py |
| TMP | Templates | Boilerplates | Project scaffolds |
| DTA | Data/schemas | Data definitions | schema.sql |
| LOG | Logs/diaries | Development logs | development_diary.md |
Why categories matter:
# Find all test files instantly
$ ghost-catalog list --category test
# Find all config files
$ ghost-catalog list --category configuration
What: Free-form keywords that describe file characteristics
Examples:
# For a CLI tool:
tags: [cli, management, admin, sqlite]
# For a proxy module:
tags: [proxy, mitmproxy, blocking, privacy]
# For documentation:
tags: [onboarding, tutorial, quickstart]
Tag Strategy:
- Functional: What it does (cli, api, database)
- Technical: What it uses (opentelemetry, sqlite, mitmproxy)
- Topical: What domain (security, networking, analytics)
- Status: Implementation state (wip, deprecated, experimental)
Why tags matter:
# Find everything related to proxies
$ ghost-catalog search --tag proxy
# Find all OpenTelemetry-instrumented files
$ ghost-catalog search --tag opentelemetry
# Find work-in-progress files
$ ghost-catalog search --tag wip
What: Every file records which AI agent created/modified it
Format: AGENT-<TYPE>-<NUMBER>
- AGENT-CLAUDE-002: Claude Sonnet 4.5
- AGENT-GPT-001: GPT-4
- AGENT-HUMAN-001: Human developer Sarah
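Agent IDs parse with a single regular expression; here is a minimal sketch (the `parse_agent_id` helper is illustrative, not part of the tool):

```python
import re

def parse_agent_id(agent_id):
    """Split an AGENT-<TYPE>-<NUMBER> identifier into its parts."""
    match = re.match(r'^AGENT-([A-Z]+)-(\d{3})$', agent_id)
    if not match:
        raise ValueError(f"Invalid agent ID: {agent_id}")
    return {'type': match.group(1), 'number': int(match.group(2))}

print(parse_agent_id('AGENT-CLAUDE-002'))  # {'type': 'CLAUDE', 'number': 2}
```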
Why it matters:
- Debugging: "Which agent introduced this bug?"
- Accountability: Track agent activity
- Context switching: "What has agent X been working on?"
- Handoffs: "Agent Y needs to continue agent X's work"
Example:
# Find all files Claude has worked on
$ ghost-catalog list --agent AGENT-CLAUDE-002
# Track agent activity over time
$ sqlite3 catalog.db "
SELECT agent_id, COUNT(*) as files_created
FROM file_catalog
GROUP BY agent_id
ORDER BY files_created DESC;
"What: Version numbers follow semver (MAJOR.MINOR.PATCH)
Rules:
- PATCH (1.0.0 → 1.0.1): Bug fixes, docs updates
- MINOR (1.0.1 → 1.1.0): New features, backward-compatible
- MAJOR (1.1.0 → 2.0.0): Breaking changes
Why it matters:
# Find all v1.x files (stable)
$ ghost-catalog list | grep "v1\."
# Find pre-1.0 files (not release-ready)
$ sqlite3 catalog.db "
SELECT file_id, name
FROM file_catalog
WHERE version LIKE '0.%';
"
# Audit: All files should be v1.0.0+ before release
$ ghost-catalog validate --strict-version 1.0.0
In the File ID:
SOM-SCR-0014-v1.0.0 ← Initial release
SOM-SCR-0014-v1.0.1 ← Bug fix
SOM-SCR-0014-v1.1.0 ← New feature added
SOM-SCR-0014-v2.0.0 ← Breaking refactor
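Bumping a version inside a file ID is simple string surgery on the `-vX.Y.Z` suffix; a minimal sketch (the `bump_file_id` helper is illustrative, not a shipped API):

```python
def bump_file_id(file_id, part):
    """Return file_id with its semver component bumped ('major', 'minor', 'patch')."""
    base, version = file_id.rsplit('-v', 1)
    major, minor, patch = (int(x) for x in version.split('.'))
    if part == 'major':
        major, minor, patch = major + 1, 0, 0
    elif part == 'minor':
        minor, patch = minor + 1, 0
    elif part == 'patch':
        patch += 1
    else:
        raise ValueError(f"Unknown bump type: {part}")
    return f"{base}-v{major}.{minor}.{patch}"

print(bump_file_id('SOM-SCR-0014-v1.0.0', 'minor'))  # SOM-SCR-0014-v1.1.0
```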
Open any Python file in the project:
# ==============================================================================
# file_id: SOM-SCR-0014-v1.0.0
# name: cli.py
# description: Ghost_Shell unified management CLI
# ...
# ==============================================================================
✅ Has header = Cataloged file
❌ No header = Needs catalog entry
Method 1: Grep (works everywhere)
grep -r "file_id:" . --include="*.py" --include="*.md"
Method 2: CLI Tool (if installed)
ghost-catalog list
Method 3: TUI Browser (if installed)
ghost-catalog-tui
Find all scripts:
ghost-catalog list --category script
Find files with "proxy" in description:
ghost-catalog search "proxy"
Find files by tag:
ghost-catalog search --tag opentelemetry
Get details on a specific file:
ghost-catalog info SOM-SCR-0014-v1.0.0
Output:
╭─────────────────────────────────────────────╮
│ File: SOM-SCR-0014-v1.0.0 │
├─────────────────────────────────────────────┤
│ Name: cli.py │
│ Path: ghost_shell/cli.py │
│ Description: Management CLI │
│ Category: script │
│ Tags: [cli, management, admin] │
│ Version: 1.0.0 │
│ Created: 2025-11-23 │
│ Modified: 2025-11-23 │
│ Agent: AGENT-CLAUDE-002 │
│ Execution: python -m ghost_shell.cli │
╰─────────────────────────────────────────────╯
# Option 1: Manual
code ghost_shell/cli.py
# Option 2: From TUI (press 'o' on selected file)
# Option 3: CLI tool
ghost-catalog info SOM-SCR-0014-v1.0.0 --open
SOM-SCR-0014-v1.0.0
│ │ │ │
│ │ │ └── Version (semantic versioning)
│ │ └─────── Sequence number (unique within category)
│ └─────────── Category code (3 letters)
└─────────────── Namespace (SOM = Somacosf)
CMD = Commands CFG = Configuration
SCR = Scripts REG = Registry
DOC = Documentation TST = Tests
TMP = Templates DTA = Data/Schemas
LOG = Logs/Diaries
SOM-SCR-0012-v1.1.0
- "This is the 12th script in the project"
- "It's at version 1.1.0 (has received 1 feature update)"
- "It's a SCR (runnable script)"
SOM-DOC-0003-v1.0.0
- "This is the 3rd documentation file"
- "It's at version 1.0.0 (stable release)"
- "It's a DOC (documentation)"
SOM-TST-0001-v2.0.0
- "This is the 1st test file"
- "It's at version 2.0.0 (has had breaking changes)"
- "It's a TST (test suite)"
Low numbers (0001-0010): Core/foundational files
- SOM-SCR-0001 - Usually the main entry point
- SOM-DOC-0001 - Usually the README or quickstart
Mid numbers (0011-0050): Feature modules
- SOM-SCR-0014 - CLI tool (added mid-project)
High numbers (0051+): Recent additions or utilities
- SOM-SCR-0087 - Probably a recently added helper
v0.x.x: Pre-release, experimental
- v0.1.0 - Initial draft
- v0.9.5 - Almost ready, but not stable
v1.x.x: Stable, production-ready
- v1.0.0 - First stable release
- v1.5.0 - Mature, with several features
v2.x.x+: Major revisions
- v2.0.0 - Significant refactor
- v3.0.0 - Another major rewrite
List all file IDs:
grep -r "file_id:" . --include="*.py" --include="*.md" | awk '{print $NF}'
Find by category:
grep -r "file_id: SOM-SCR" . --include="*.py"
Find by tag:
grep -r "tags:.*opentelemetry" . --include="*.py"
Find recently modified:
grep -r "modified:" . --include="*.py" | sort -t: -k2 -r | head -10
Generate:
repomix --output catalog.txt --style xml
Parse with Python:
import re

with open('catalog.txt') as f:
    content = f.read()

# Extract all file IDs
file_ids = re.findall(r'file_id:\s+(SOM-[A-Z]{3}-\d{4}-v[\d.]+)', content)
print(f"Found {len(file_ids)} cataloged files:")
for fid in sorted(set(file_ids)):
    print(f"  {fid}")
Installation:
# From source (Go)
git clone https://github.com/somacosf/ghost-catalog
cd ghost-catalog
go build -o ghost-catalog
sudo mv ghost-catalog /usr/local/bin/
# Or from releases
curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog
chmod +x ghost-catalog
Usage:
# List all files
ghost-catalog list
# Search
ghost-catalog search "proxy"
# Filter by category
ghost-catalog list --category script
# Filter by tag
ghost-catalog search --tag opentelemetry
# Get file info
ghost-catalog info SOM-SCR-0014-v1.0.0
# Validate headers
ghost-catalog validate
# Show statistics
ghost-catalog stats
Launch:
ghost-catalog-tui
Interface:
╔══════════════════╦═════════════════════════════════════╗
║ FILTERS ║ FILE LIST ║
║ ║ [Selected file highlighted] ║
║ Categories ║ ║
║ [x] Scripts ║ FILE DETAILS ║
║ [ ] Docs ║ [Metadata display] ║
║ ║ ║
║ Tags ║ ║
║ [x] proxy ║ ║
╚══════════════════╩═════════════════════════════════════╝
Navigation:
- ↑/↓: Move selection
- Enter: View details
- o: Open in editor
- Tab: Switch panes
- /: Search
- q: Quit
| Method | Best For |
|---|---|
| Bash/PowerShell | Quick checks, no tools installed |
| Repomix | Analysis, sharing with AI agents |
| CLI Tool | Daily development, scripting |
| TUI Browser | Exploring unfamiliar codebases |
Tags are your semantic index. Think of them as:
- Hashtags for code
- Labels for organization
- Keywords for search
# CLI tool
tags: [cli, management, admin, interactive, rich]
# Database module
tags: [database, sqlite, duckdb, orm, persistence]
# Security feature
tags: [security, encryption, authentication, jwt]
# Experimental feature
tags: [experimental, wip, prototype, needs-review]
# Too generic
tags: [code, file, python] ❌
# Too specific (use description instead)
tags: [this-uses-requests-library-version-2-31] ❌
# Redundant with category
tags: [script] ❌ (category already says "script")
- Use lowercase with hyphens: multi-word-tag
- 3-7 tags per file (sweet spot)
- Mix functional + technical tags
- Include status if relevant: wip, deprecated, stable
- Think search: "What would I search for to find this?"
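These conventions are easy to lint. Below is a minimal sketch, assuming the rules above (lowercase-with-hyphens, 3-7 tags, no tag duplicating the category); `check_tags` is illustrative, not a shipped command:

```python
import re

def check_tags(tags, category=None):
    """Return a list of warnings for tags that break the conventions above."""
    warnings = []
    if not 3 <= len(tags) <= 7:
        warnings.append(f"expected 3-7 tags, got {len(tags)}")
    for tag in tags:
        # lowercase words joined by single hyphens, e.g. multi-word-tag
        if not re.fullmatch(r'[a-z0-9]+(-[a-z0-9]+)*', tag):
            warnings.append(f"'{tag}' is not lowercase-with-hyphens")
        if category and tag == category:
            warnings.append(f"'{tag}' is redundant with category '{category}'")
    return warnings

print(check_tags(['cli', 'Admin', 'script'], category='script'))
```

A check like this could run in a pre-commit hook alongside header validation.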
Single tag:
ghost-catalog search --tag opentelemetry
Multiple tags (AND):
ghost-catalog search --tag proxy --tag security
Multiple tags (OR) via SQL:
SELECT DISTINCT fc.file_id, fc.name
FROM file_catalog fc
JOIN file_tags ft ON fc.file_id = ft.file_id
WHERE ft.tag IN ('proxy', 'security');
Most used tags:
SELECT tag, COUNT(*) as usage_count
FROM file_tags
GROUP BY tag
ORDER BY usage_count DESC
LIMIT 10;
Tag co-occurrence (find related tags):
SELECT t1.tag, t2.tag, COUNT(*) as co_occurrences
FROM file_tags t1
JOIN file_tags t2 ON t1.file_id = t2.file_id AND t1.tag < t2.tag
GROUP BY t1.tag, t2.tag
ORDER BY co_occurrences DESC
LIMIT 20;
Result:
proxy + security = 5 files
cli + database = 3 files
opentelemetry + monitoring = 4 files
MAJOR.MINOR.PATCH
| Change Type | Example | Bump |
|---|---|---|
| Bug fix | Fixed null pointer | PATCH (1.0.0 → 1.0.1) |
| Documentation update | Added docstrings | PATCH |
| New function (compatible) | Added get_stats() | MINOR (1.0.1 → 1.1.0) |
| New optional parameter | def foo(x, y=None) | MINOR |
| Breaking change | Removed function | MAJOR (1.1.0 → 2.0.0) |
| Changed function signature | foo(x) → foo(x, y) | MAJOR |
Both places must match:
# file_id: SOM-SCR-0014-v1.0.0 ← Version here
# ...
# version: 1.0.0            ← Must match here
Validate:
ghost-catalog validate
Output:
✗ SOM-SCR-0008-v1.0.0 cli.py
└─ Version mismatch: file_id has v1.0.0 but version field has 1.0.1
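A consistency check of this kind can also be scripted directly against the Python header format shown earlier; a minimal sketch (`check_version_consistency` is illustrative, not how ghost-catalog implements it):

```python
import re

def check_version_consistency(header_text):
    """Compare the version embedded in file_id with the version: field."""
    id_match = re.search(r'file_id:\s+SOM-[A-Z]{3}-\d{4}-v([\d.]+)', header_text)
    ver_match = re.search(r'^#\s*version:\s+([\d.]+)', header_text, re.MULTILINE)
    if not id_match or not ver_match:
        return "missing file_id or version field"
    if id_match.group(1) != ver_match.group(1):
        return (f"mismatch: file_id has v{id_match.group(1)}, "
                f"version field has {ver_match.group(1)}")
    return None  # consistent

header = "# file_id: SOM-SCR-0008-v1.0.0\n# version: 1.0.1\n"
print(check_version_consistency(header))
```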
Auto-fix:
ghost-catalog validate --fix
Pre-release files (v0.x.x):
SELECT file_id, name, version
FROM file_catalog
WHERE version LIKE '0.%'
ORDER BY version DESC;
Outdated files (not at latest major version):
-- Assuming v2.x is the latest
SELECT file_id, name, version
FROM file_catalog
WHERE CAST(substr(version, 1, 1) AS INTEGER) < 2;
Version distribution:
SELECT
CAST(substr(version, 1, 1) AS INTEGER) as major_version,
COUNT(*) as file_count
FROM file_catalog
GROUP BY major_version;
Result:
major_version | file_count
0 | 3 (pre-release)
1 | 18 (stable)
2 | 4 (latest)
graph TB
subgraph "Layer 1: File System"
A[Physical Files<br/>with Headers]
end
subgraph "Layer 2: Metadata Schema"
B[Embedded Headers<br/>12+ Fields]
end
subgraph "Layer 3: Registry"
C[SQLite Catalog DB]
D[File System Headers]
end
subgraph "Layer 4: Access APIs"
E[CLI Tool]
F[TUI Browser]
G[SQL Queries]
end
subgraph "Layer 5: Integration"
H[Git Hooks]
I[CI/CD Validation]
J[Editor Plugins]
end
A --> B
B --> C
B --> D
C --> E
C --> F
C --> G
D --> E
E --> H
F --> I
G --> J
Layer 1: File System
- Physical .py, .md, .yaml files
- Each has a header (first 10-20 lines)
Layer 2: Metadata Schema
- Structured data in headers
- 12+ fields per file
- Format varies by file type (Python vs Markdown)
Layer 3: Registry
- Current: Distributed (metadata in files)
- Future: Centralized SQLite database
- Sync tool keeps them aligned
Layer 4: Access APIs
- CLI: Command-line queries
- TUI: Interactive browser
- SQL: Direct database queries
Layer 5: Integration
- Git hooks: Validate on commit
- CI/CD: Check in pipelines
- Editors: Syntax highlighting, autocomplete
SOM-<CATEGORY>-<SEQUENCE>-v<VERSION>
Constraints:
- SOM: Fixed namespace (3 chars)
- CATEGORY: One of 9 valid codes (3 chars, uppercase)
- SEQUENCE: 0001-9999 (4 digits, zero-padded)
- VERSION: MAJOR.MINOR.PATCH (semver)
Regex: ^SOM-[A-Z]{3}-\d{4}-v\d+\.\d+\.\d+$
SOM-SCR-0001-v1.0.0 ✅
SOM-DOC-0042-v2.5.3 ✅
SOM-CFG-0007-v1.0.0 ✅
SOM-XYZ-0001-v1.0.0 ❌ (XYZ is not a valid category)
SOM-SCR-1-v1.0.0 ❌ (sequence must be 4 digits)
SOM-SCR-0001-1.0.0 ❌ (missing 'v' prefix)
SOM-SCR-0001-v1 ❌ (version must be full semver)
som-scr-0001-v1.0.0 ❌ (must be uppercase)
Python:
import re
def parse_file_id(file_id):
    pattern = r'^SOM-([A-Z]{3})-(\d{4})-v(\d+)\.(\d+)\.(\d+)$'
    match = re.match(pattern, file_id)
    if not match:
        raise ValueError(f"Invalid file ID: {file_id}")
    return {
        'namespace': 'SOM',
        'category': match.group(1),
        'sequence': int(match.group(2)),
        'version_major': int(match.group(3)),
        'version_minor': int(match.group(4)),
        'version_patch': int(match.group(5)),
        'version': f"{match.group(3)}.{match.group(4)}.{match.group(5)}"
    }

# Usage
info = parse_file_id('SOM-SCR-0014-v1.0.0')
print(info)
# {'namespace': 'SOM', 'category': 'SCR', 'sequence': 14,
#  'version_major': 1, 'version_minor': 0, 'version_patch': 0,
#  'version': '1.0.0'}

def get_next_file_id(category, existing_ids):
"""Generate next file ID for a category."""
# Find highest sequence in category
max_seq = 0
for file_id in existing_ids:
info = parse_file_id(file_id)
if info['category'] == category:
max_seq = max(max_seq, info['sequence'])
# Increment
next_seq = max_seq + 1
# Format
return f"SOM-{category}-{next_seq:04d}-v1.0.0"
# Usage
existing = ['SOM-SCR-0012-v1.1.0', 'SOM-SCR-0013-v1.0.0', 'SOM-SCR-0014-v1.0.0']
next_id = get_next_file_id('SCR', existing)
print(next_id) # SOM-SCR-0015-v1.0.0Template:
# ==============================================================================
# file_id: SOM-SCR-NNNN-vX.X.X
# name: filename.py
# description: One-line description
# project_id: PROJECT-NAME
# category: script
# tags: [tag1, tag2, tag3]
# created: YYYY-MM-DD
# modified: YYYY-MM-DD
# version: X.X.X
# agent_id: AGENT-XXX-NNN
# execution: python filename.py [args]
# ==============================================================================
"""Module docstring."""
# Your code here
Line count: 13 lines (including delimiters)
Format: Hash comments
Template:
<!--
===============================================================================
file_id: SOM-DOC-NNNN-vX.X.X
name: filename.md
description: One-line description
project_id: PROJECT-NAME
category: documentation
tags: [tag1, tag2, tag3]
created: YYYY-MM-DD
modified: YYYY-MM-DD
version: X.X.X
agent:
id: AGENT-XXX-NNN
name: agent_name
model: model_id
execution:
type: documentation
invocation: Read for [purpose]
===============================================================================
-->
# Document Title
Content here...
Line count: 19-21 lines
Format: HTML comment with YAML-like structure
Special: Nested agent and execution fields
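Pulling fields out of this header takes only a few lines; below is a minimal sketch that reads top-level `key: value` pairs and ignores the nested agent/execution fields (`parse_markdown_header` is illustrative, not a shipped API):

```python
import re

def parse_markdown_header(text):
    """Pull top-level key: value pairs from a leading HTML-comment header."""
    comment = re.search(r'<!--(.*?)-->', text, re.DOTALL)
    if not comment:
        return {}
    fields = {}
    for line in comment.group(1).splitlines():
        match = re.match(r'^(\w+):\s*(.+)$', line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields

doc = """<!--
file_id: SOM-DOC-0001-v1.0.0
name: QUICKSTART.md
version: 1.0.0
-->
# Quickstart
"""
print(parse_markdown_header(doc)['file_id'])  # SOM-DOC-0001-v1.0.0
```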
Template:
# ==============================================================================
# file_id: SOM-CFG-NNNN-vX.X.X
# name: config.yaml
# description: One-line description
# project_id: PROJECT-NAME
# category: configuration
# tags: [tag1, tag2]
# created: YYYY-MM-DD
# modified: YYYY-MM-DD
# version: X.X.X
# ==============================================================================
# Your config here
proxy:
host: 127.0.0.1
port: 8080
Line count: 11 lines
Format: YAML comments
Note: Simplified (no nested agent/execution fields)
Template:
# ==============================================================================
# file_id: SOM-SCR-NNNN-vX.X.X
# name: script.ps1
# description: One-line description
# project_id: PROJECT-NAME
# category: script
# tags: [tag1, tag2]
# created: YYYY-MM-DD
# modified: YYYY-MM-DD
# version: X.X.X
# agent_id: AGENT-XXX-NNN
# execution: .\script.ps1 [-Param value]
# ==============================================================================
param(
[string]$Param1
)
# Your script here
Same structure as Python, but with PowerShell syntax.
sequenceDiagram
participant Agent
participant Files as File System
participant Dev as Development Diary
Agent->>Agent: 1. Decide to create new file
Agent->>Agent: 2. Determine category (SCR/DOC/CFG/etc)
Agent->>Files: 3. grep "file_id: SOM-SCR" *.py
Files-->>Agent: 4. Returns existing SCR file IDs
Agent->>Agent: 5. Find max sequence (e.g., 0014)
Agent->>Agent: 6. Increment (0014 → 0015)
Agent->>Agent: 7. Format: SOM-SCR-0015-v1.0.0
Agent->>Agent: 8. Populate header template
Agent->>Files: 9. Write file with header
Agent->>Dev: 10. Log in development diary
Steps:
- Decide file category (script/doc/config/etc)
- Search existing files: grep -r "file_id: SOM-SCR"
- Extract sequences, find max
- Increment: max + 1
- Zero-pad: 15 → 0015
- Construct: SOM-SCR-0015-v1.0.0
- Fill header template
- Write file
Pros:
- Simple, no tools needed
- Full control
Cons:
- Manual, error-prone
- Collision risk if multiple agents create simultaneously
graph LR
A[Agent requests<br/>new file] --> B[ghost-catalog generate]
B --> C[Query catalog DB<br/>for max sequence]
C --> D[Allocate next ID]
D --> E[Lock sequence<br/>in DB]
E --> F[Generate header]
F --> G[Write file]
G --> H[Register in catalog]
Command:
ghost-catalog generate \
--file new_module.py \
--category script \
--description "New feature module" \
  --tags cli,admin,feature-x
Process:
- Query catalog DB for highest sequence in SCR category
- Lock next sequence (prevent collisions)
- Generate file ID: SOM-SCR-0015-v1.0.0
- Create header from template
- Write file with header + boilerplate code
- Insert into catalog DB
- Return file ID to agent
Pros:
- Atomic (no collisions)
- Consistent formatting
- Automatic catalog registration
Cons:
- Requires tool installation
- Depends on catalog DB
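The lock step can be sketched with SQLite, whose transactions serialize concurrent writers. The `allocations` table and `allocate_file_id` helper below are illustrative, not the tool's actual schema:

```python
import sqlite3

def allocate_file_id(conn, category):
    """Atomically reserve the next sequence for a category and return the new ID."""
    with conn:  # one transaction: the read and insert happen together
        row = conn.execute(
            "SELECT COALESCE(MAX(sequence), 0) FROM allocations WHERE category = ?",
            (category,),
        ).fetchone()
        next_seq = row[0] + 1
        conn.execute(
            "INSERT INTO allocations (category, sequence) VALUES (?, ?)",
            (category, next_seq),
        )
    return f"SOM-{category}-{next_seq:04d}-v1.0.0"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE allocations "
    "(category TEXT, sequence INTEGER, PRIMARY KEY (category, sequence))"
)
print(allocate_file_id(conn, "SCR"))  # SOM-SCR-0001-v1.0.0
print(allocate_file_id(conn, "SCR"))  # SOM-SCR-0002-v1.0.0
```

The composite primary key means two writers racing for the same sequence would fail loudly instead of silently colliding.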
ghost_shell/
├── cli.py
│ └─ Header: SOM-SCR-0014-v1.0.0
├── core.py
│ └─ Header: SOM-SCR-0012-v1.1.0
├── collector.py
│ └─ Header: SOM-SCR-0013-v1.0.0
└── README.md
└─ Header: SOM-DOC-0001-v1.0.0
How it works:
- Metadata lives in each file's header
- No central database
- Query via grep/awk/sed
Advantages: ✅ Git-trackable ✅ No database corruption ✅ Works offline ✅ Self-documenting
Disadvantages: ❌ Slow searches (scan all files) ❌ No relationships (can't link files) ❌ No indexing ❌ Hard to aggregate statistics
ghost_shell/
├── data/
│ └── catalog.db (SQLite)
│ ├── file_catalog table
│ ├── file_tags table
│ ├── agent_registry table
│ └── file_dependencies table
└── files with headers (source of truth)
How it works:
- Metadata still in file headers (source of truth)
- Synced to SQLite database (fast queries)
- ghost-catalog sync keeps them aligned
Database Schema:
CREATE TABLE file_catalog (
file_id TEXT PRIMARY KEY,
name TEXT,
path TEXT,
description TEXT,
category TEXT,
version TEXT,
created DATE,
modified DATE,
agent_id TEXT,
execution TEXT,
checksum TEXT
);
CREATE TABLE file_tags (
file_id TEXT,
tag TEXT,
PRIMARY KEY (file_id, tag)
);
CREATE TABLE agent_registry (
id TEXT PRIMARY KEY,
name TEXT,
model TEXT,
first_seen TIMESTAMP
);
CREATE TABLE file_dependencies (
file_id TEXT,
depends_on_file_id TEXT,
dependency_type TEXT
);
Sync Process:
# Scan file system, update database
ghost-catalog sync
# Output:
# Scanned: 50 files
# Updated: 3 files
# New: 1 file
# Unchanged: 46 files
Advantages: ✅ Fast queries (indexed) ✅ Complex filters (tags AND category) ✅ Relationships (dependencies) ✅ Statistics (COUNT, GROUP BY) ✅ Full-text search
Disadvantages: ❌ Database can desync ❌ Requires sync tool ❌ Extra maintenance
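A minimal sync pass follows the same shape: scan headers, upsert rows. The sketch below covers only `file_id` and `path` (the real tool syncs every field); the regex and table names follow the examples in this guide and are assumptions:

```python
import re
import sqlite3
import tempfile
from pathlib import Path

HEADER_RE = re.compile(r'file_id:\s+(SOM-[A-Z]{3}-\d{4}-v[\d.]+)')

def sync_catalog(root, conn):
    """Scan Python files under root and upsert their file IDs into the catalog."""
    seen = 0
    for path in Path(root).rglob('*.py'):
        head = path.read_text(errors='ignore')[:500]  # headers sit at the top
        match = HEADER_RE.search(head)
        if match:
            conn.execute(
                "INSERT OR REPLACE INTO file_catalog (file_id, path) VALUES (?, ?)",
                (match.group(1), str(path)),
            )
            seen += 1
    conn.commit()
    return seen

# Quick demo against an in-memory database and a scratch directory
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_catalog (file_id TEXT PRIMARY KEY, path TEXT)")
scratch = Path(tempfile.mkdtemp())
(scratch / "demo.py").write_text("# file_id: SOM-SCR-0001-v1.0.0\n")
print(sync_catalog(scratch, conn))  # 1
```

Because file headers remain the source of truth, re-running the pass is idempotent: unchanged files simply overwrite their own rows.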
CLI Tool:
# From GitHub releases
curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog-linux-amd64
chmod +x ghost-catalog-linux-amd64
sudo mv ghost-catalog-linux-amd64 /usr/local/bin/ghost-catalog
# Verify
ghost-catalog --version
TUI Browser:
# Same process
curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog-tui-linux-amd64
chmod +x ghost-catalog-tui-linux-amd64
sudo mv ghost-catalog-tui-linux-amd64 /usr/local/bin/ghost-catalog-tui
For a Python script:
- Open the file
- Add header at the top (before any code):
# ==============================================================================
# file_id: SOM-SCR-0001-v1.0.0
# name: my_script.py
# description: My first cataloged script
# project_id: MY-PROJECT
# category: script
# tags: [example, first, tutorial]
# created: 2025-01-24
# modified: 2025-01-24
# version: 1.0.0
# agent_id: AGENT-HUMAN-001
# execution: python my_script.py
# ==============================================================================
"""My script docstring."""
def main():
    print("Hello from cataloged script!")

if __name__ == '__main__':
    main()
- Save the file
# Check that it parses correctly
grep "file_id:" my_script.py
# Output: # file_id: SOM-SCR-0001-v1.0.0
# Validate format
ghost-catalog validate my_script.py
# Output: ✓ SOM-SCR-0001-v1.0.0 - my_script.py

# Initialize catalog
ghost-catalog init
# This creates:
# - data/catalog.db (SQLite database)
# - .ghost-catalog.config (settings)
# Scan and populate
ghost-catalog sync
# Output:
# Scanning...
# Found 1 cataloged file
# Inserted 1 file into catalog

# List all files
ghost-catalog list
# Get file info
ghost-catalog info SOM-SCR-0001-v1.0.0
# Search
ghost-catalog search "first"Goal: Add catalog headers to all files
Strategy: Automated bulk migration
# List all Python and Markdown files
find . -name "*.py" -o -name "*.md" > files_to_migrate.txt
# Count them
wc -l files_to_migrate.txt
# Output: 50 files
Create migrate_catalog.py:
#!/usr/bin/env python3
"""Bulk add catalog headers to files."""
from pathlib import Path
from datetime import date
import re
# Configuration
PROJECT_ID = "MY-PROJECT"
AGENT_ID = "AGENT-HUMAN-001"
TODAY = date.today().isoformat()
# Category mapping (file path → category)
CATEGORY_MAP = {
    'test_': 'test',
    'tests/': 'test',
    'docs/': 'documentation',
    'config/': 'configuration',
    'README': 'documentation',
}
def determine_category(filepath):
    """Guess category from file path."""
    name = filepath.name.lower()
    path_str = str(filepath).lower()
    for keyword, category in CATEGORY_MAP.items():
        # Lowercase the keyword too, so entries like 'README' match
        if keyword.lower() in name or keyword.lower() in path_str:
            return category
    # Default: script
    return 'script' if filepath.suffix == '.py' else 'documentation'
def get_category_code(category):
    """Map category name to 3-letter code."""
    codes = {
        'script': 'SCR',
        'documentation': 'DOC',
        'configuration': 'CFG',
        'test': 'TST',
    }
    return codes.get(category, 'SCR')
def get_next_sequence(category_code, existing_ids):
    """Find next sequence for category."""
    max_seq = 0
    pattern = rf'SOM-{category_code}-(\d{{4}})-v'
    for file_id in existing_ids:
        match = re.search(pattern, file_id)
        if match:
            seq = int(match.group(1))
            max_seq = max(max_seq, seq)
    return max_seq + 1
def generate_python_header(file_id, filename, description, category):
    """Generate Python file header."""
    return f"""# ==============================================================================
# file_id: {file_id}
# name: {filename}
# description: {description}
# project_id: {PROJECT_ID}
# category: {category}
# tags: []
# created: {TODAY}
# modified: {TODAY}
# version: 1.0.0
# agent_id: {AGENT_ID}
# execution: python {filename}
# ==============================================================================
"""

def generate_markdown_header(file_id, filename, description, category):
    """Generate Markdown file header (HTML-comment wrapper)."""
    return f"""<!--
===============================================================================
file_id: {file_id}
name: {filename}
description: {description}
project_id: {PROJECT_ID}
category: {category}
tags: []
created: {TODAY}
modified: {TODAY}
version: 1.0.0
===============================================================================
-->
"""
def add_header_to_file(filepath, header):
    """Prepend header to file."""
    # Read existing content
    with open(filepath, 'r') as f:
        existing = f.read()
    # Write header + content
    with open(filepath, 'w') as f:
        f.write(header + existing)
def main():
    # Find all files
    py_files = list(Path('.').rglob('*.py'))
    md_files = list(Path('.').rglob('*.md'))
    all_files = py_files + md_files

    # Filter out already cataloged
    to_migrate = []
    existing_ids = []
    for filepath in all_files:
        with open(filepath, 'r') as f:
            content = f.read(500)  # Read first 500 chars
        if 'file_id:' in content:
            # Extract file ID
            match = re.search(r'file_id:\s+(SOM-[A-Z]{3}-\d{4}-v[\d.]+)', content)
            if match:
                existing_ids.append(match.group(1))
        else:
            to_migrate.append(filepath)

    print(f"Found {len(all_files)} total files")
    print(f"Already cataloged: {len(existing_ids)}")
    print(f"To migrate: {len(to_migrate)}")

    # Migrate each file
    for filepath in to_migrate:
        category = determine_category(filepath)
        category_code = get_category_code(category)
        sequence = get_next_sequence(category_code, existing_ids)
        file_id = f"SOM-{category_code}-{sequence:04d}-v1.0.0"

        # Generate description (TODO: use LLM for better descriptions)
        description = f"TODO: Add description for {filepath.name}"

        if filepath.suffix == '.py':
            header = generate_python_header(file_id, filepath.name, description, category)
        else:
            # Markdown header (similar structure)
            header = generate_markdown_header(file_id, filepath.name, description, category)

        # Add header
        add_header_to_file(filepath, header)
        existing_ids.append(file_id)
        print(f"✓ {file_id} → {filepath}")

    print(f"\n✅ Migrated {len(to_migrate)} files")
    print("⚠️ Remember to update descriptions (search for 'TODO')")

if __name__ == '__main__':
    main()
Run it:
python migrate_catalog.py

The script adds placeholder descriptions. You need to fill them in:
# Find all TODOs
grep -r "description: TODO" . --include="*.py" --include="*.md"
# For each file, update the description to something meaningful

ghost-catalog sync
# Output:
# Scanned: 50 files
# New: 50 files
# Sync complete

ghost-catalog validate
# Fix any issues
ghost-catalog validate --fix

From releases (recommended):
# Linux
curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog-linux-amd64
chmod +x ghost-catalog-linux-amd64
sudo mv ghost-catalog-linux-amd64 /usr/local/bin/ghost-catalog
# macOS
curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog-darwin-amd64
chmod +x ghost-catalog-darwin-amd64
sudo mv ghost-catalog-darwin-amd64 /usr/local/bin/ghost-catalog
# Windows
# Download ghost-catalog-windows-amd64.exe from releases
# Add to PATH

From source (if you have Go installed):
git clone https://github.com/somacosf/ghost-catalog
cd ghost-catalog
go build -o ghost-catalog
sudo mv ghost-catalog /usr/local/bin/

Initialize catalog:

ghost-catalog init

Sync catalog with file system:

ghost-catalog sync

List all files:

ghost-catalog list

Search files:

ghost-catalog search "proxy"

Get file info:

ghost-catalog info SOM-SCR-0014-v1.0.0

Validate headers:

ghost-catalog validate

Filter by category:
ghost-catalog list --category script
ghost-catalog list --category documentation

Filter by tag:
ghost-catalog search --tag opentelemetry
ghost-catalog search --tag proxy --tag security

Filter by agent:

ghost-catalog list --agent AGENT-CLAUDE-002

Sort options:
ghost-catalog list --sort modified # Recently modified first
ghost-catalog list --sort created # Recently created first
ghost-catalog list --sort file_id    # Alphabetical by ID

Output formats:
ghost-catalog list --format table # Default
ghost-catalog list --format json # For scripting
ghost-catalog list --format csv     # For Excel

Generate new file:
ghost-catalog generate \
  --file new_module.py \
  --category script \
  --description "New feature module" \
  --tags cli,admin,feature-x \
  --project MY-PROJECT

Export catalog:
ghost-catalog export catalog.json
ghost-catalog export catalog.csv --format csv

Statistics:

ghost-catalog stats

Get all file paths:
ghost-catalog list --format json | jq -r '.[].path'

Count files by category:
ghost-catalog list --format json | jq 'group_by(.category) | map({category: .[0].category, count: length})'

Find stale files (not modified in 90 days):
ghost-catalog list --format json | jq --arg cutoff "$(date -d '90 days ago' +%Y-%m-%d)" '.[] | select(.modified < $cutoff) | .file_id'

Launch the TUI:

ghost-catalog-tui

┌────────────────────────────────────────────────────────────┐
│ Ghost_Shell File Catalog Browser │
├──────────────┬─────────────────────────────────────────────┤
│ FILTERS │ FILE LIST │
│ │ ┌────────────┬──────────────────────────┐ │
│ Categories │ │ File ID │ Description │ │
│ [x] Scripts │ ├────────────┼──────────────────────────┤ │
│ [ ] Docs │ │►SOM-SCR-014│ Management CLI │ │
│ [ ] Config │ └────────────┴──────────────────────────┘ │
│ │ │
│ Tags │ FILE DETAILS │
│ [x] cli │ ┌───────────────────────────────────────┐ │
│ [ ] proxy │ │ Name: cli.py │ │
│ │ │ Path: ghost_shell/cli.py │ │
│ Search │ │ Tags: [cli, management, admin] │ │
│ [_______] │ │ Version: 1.0.0 │ │
│ │ │ Agent: AGENT-CLAUDE-002 │ │
│ Actions │ └───────────────────────────────────────┘ │
│ [o] Open │ │
│ [c] Copy │ │
└──────────────┴─────────────────────────────────────────────┘
│ [?] Help │ [/] Search │ [Tab] Switch │ [q] Quit │
└────────────────────────────────────────────────────────────┘
| Key | Action | Description |
|---|---|---|
| ↑/k | Move up | Navigate file list |
| ↓/j | Move down | Navigate file list |
| Enter | Select | Show file details |
| Tab | Switch pane | Toggle sidebar/file list |
| Space | Toggle filter | Check/uncheck filter |
| / | Search | Focus search box |
| Esc | Clear | Clear filters/search |
| o | Open | Open file in $EDITOR |
| c | Copy path | Copy file path to clipboard |
| d | Dependencies | Show dependency graph |
| g | Git log | Show git history |
| v | Validate | Re-validate header |
| r | Refresh | Re-scan catalog |
| ? | Help | Show full help screen |
| q | Quit | Exit application |
Find all proxy-related files:
- Launch TUI: ghost-catalog-tui
- Navigate to Tags section (Tab)
- Find "proxy" tag, press Space to select
- File list filters to proxy files
Open a file:
- Navigate to file in list (↑/↓)
- Press Enter to view details
- Press 'o' to open in editor
Search:
- Press /
- Type search query
- Results update live
- Press Enter to select first result
Pre-commit: Validate headers
Create .git/hooks/pre-commit:
#!/bin/bash
# Pre-commit hook: Validate catalog headers

echo "Validating catalog headers..."
ghost-catalog validate --strict

if [ $? -ne 0 ]; then
    echo ""
    echo "❌ Catalog validation failed!"
    echo "Fix errors above or run: ghost-catalog validate --fix"
    exit 1
fi

echo "✓ Catalog validation passed"
exit 0

Make executable:
chmod +x .git/hooks/pre-commit

Post-merge: Sync catalog
Create .git/hooks/post-merge:
#!/bin/bash
# Post-merge hook: Re-sync catalog after merges
echo "Syncing catalog after merge..."
ghost-catalog sync --quiet
echo "✓ Catalog synced"

GitHub Actions:
Create .github/workflows/catalog-check.yml:
name: Catalog Validation

on:
  pull_request:
    paths:
      - '**/*.py'
      - '**/*.md'
      - '**/*.yaml'

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Install ghost-catalog
        run: |
          curl -LO https://github.com/somacosf/ghost-catalog/releases/latest/download/ghost-catalog-linux-amd64
          chmod +x ghost-catalog-linux-amd64
          sudo mv ghost-catalog-linux-amd64 /usr/local/bin/ghost-catalog

      - name: Initialize catalog
        run: ghost-catalog init

      - name: Sync catalog
        run: ghost-catalog sync

      - name: Validate headers
        run: ghost-catalog validate --strict

      - name: Check for missing headers
        run: |
          missing=$(ghost-catalog sync --dry-run | grep -c "Missing header" || true)
          if [ "$missing" -gt 0 ]; then
            echo "❌ Found $missing files without catalog headers"
            exit 1
          fi

      - name: Check version consistency
        run: |
          # All files should be v1.0.0+ for release branches
          if [[ "$GITHUB_REF" == refs/heads/release/* ]]; then
            pre_release=$(ghost-catalog list --format json | jq '[.[] | select(.version | startswith("0."))] | length')
            if [ "$pre_release" -gt 0 ]; then
              echo "❌ Found $pre_release files with version < 1.0.0"
              ghost-catalog list --format json | jq '.[] | select(.version | startswith("0.")) | .file_id'
              exit 1
            fi
          fi

      - name: Generate catalog report
        if: always()
        run: |
          ghost-catalog stats > catalog_report.txt
          cat catalog_report.txt

      - name: Upload catalog report
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: catalog-report
          path: catalog_report.txt

VS Code Extension (proposed):
Features:
- Syntax highlighting for catalog headers
- Autocomplete for tags
- Quick actions: "Add catalog header", "Update version", "Validate header"
- Hover tooltips showing file metadata
Installation:
# Once published
code --install-extension somacosf.ghost-catalog

Vim Plugin (proposed):
Features:
- Snippets for catalog headers (:CatalogHeader)
- Commands: :CatalogValidate, :CatalogInfo
- Statusline integration showing file ID
Installation:
" Add to .vimrc
Plug 'somacosf/ghost-catalog.vim'

Scenario: You want to refactor db_handler.py. Which files will break?
Without dependency tracking:
# Grep for imports (unreliable)
grep -r "from.*db_handler import" .
grep -r "import db_handler" .

With dependency tracking:
SELECT fc.file_id, fc.name
FROM file_dependencies fd
JOIN file_catalog fc ON fd.file_id = fc.file_id
WHERE fd.depends_on_file_id = 'SOM-SCR-XXXX-v1.0.0';

Result:
SOM-SCR-0012-v1.1.0 core.py
SOM-SCR-0013-v1.0.0 collector.py
SOM-SCR-0014-v1.0.0 cli.py
Automated detection (Python example):
import ast
from pathlib import Path
def extract_imports(filepath):
    """Extract all import statements from Python file."""
    with open(filepath) as f:
        try:
            tree = ast.parse(f.read())
        except SyntaxError:
            return []

    imports = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports.append({
                    'type': 'import',
                    'module': alias.name,
                    'alias': alias.asname
                })
        elif isinstance(node, ast.ImportFrom):
            imports.append({
                'type': 'from',
                'module': node.module,
                'names': [alias.name for alias in node.names],
                'level': node.level  # Relative imports
            })
    return imports
# Usage
imports = extract_imports('ghost_shell/cli.py')
print(imports)
# [
#   {'type': 'from', 'module': 'ghost_shell.data.db_handler', 'names': ['DatabaseHandler']},
#   {'type': 'import', 'module': 'rich', 'alias': None},
#   ...
# ]

Map to file IDs:
import sqlite3

def resolve_module_to_file_id(module_name, catalog):
    """Map Python module to file ID."""
    # Convert module to file path
    # ghost_shell.data.db_handler → ghost_shell/data/db_handler.py
    path = module_name.replace('.', '/') + '.py'

    # Look up in catalog
    for entry in catalog:
        if entry['path'] == path:
            return entry['file_id']
    return None

# Build dependency graph
dependencies = []
for file_entry in catalog:
    if file_entry['category'] != 'script':
        continue

    imports = extract_imports(file_entry['path'])
    for imp in imports:
        dep_file_id = resolve_module_to_file_id(imp['module'], catalog)
        if dep_file_id:
            dependencies.append({
                'file_id': file_entry['file_id'],
                'depends_on': dep_file_id,
                'type': 'import'
            })

# Insert into database
db = sqlite3.connect('data/catalog.db')
for dep in dependencies:
    db.execute("""
        INSERT INTO file_dependencies (file_id, depends_on_file_id, dependency_type)
        VALUES (?, ?, ?)
    """, (dep['file_id'], dep['depends_on'], dep['type']))
db.commit()

Direct dependencies (what does file X import?):
SELECT fc.file_id, fc.name
FROM file_dependencies fd
JOIN file_catalog fc ON fd.depends_on_file_id = fc.file_id
WHERE fd.file_id = 'SOM-SCR-0014-v1.0.0';

Reverse dependencies (what imports file X?):
SELECT fc.file_id, fc.name
FROM file_dependencies fd
JOIN file_catalog fc ON fd.file_id = fc.file_id
WHERE fd.depends_on_file_id = 'SOM-SCR-XXXX-v1.0.0';

Transitive dependencies (all files X depends on, recursively):
WITH RECURSIVE deps AS (
    -- Base case
    SELECT depends_on_file_id, 1 AS depth
    FROM file_dependencies
    WHERE file_id = 'SOM-SCR-0014-v1.0.0'

    UNION ALL

    -- Recursive case
    SELECT fd.depends_on_file_id, deps.depth + 1
    FROM file_dependencies fd
    JOIN deps ON fd.file_id = deps.depends_on_file_id
    WHERE deps.depth < 10
)
SELECT fc.file_id, fc.name, MAX(deps.depth) AS depth
FROM deps
JOIN file_catalog fc ON deps.depends_on_file_id = fc.file_id
GROUP BY fc.file_id, fc.name
ORDER BY depth;

Mermaid diagram:
def generate_dependency_graph(root_file_id, catalog, dependencies):
    """Generate Mermaid dependency graph."""
    lines = ["graph LR"]

    # Get all related files
    visited = set()
    queue = [root_file_id]

    while queue:
        current = queue.pop(0)
        if current in visited:
            continue
        visited.add(current)

        # Find dependencies
        for dep in dependencies:
            if dep['file_id'] == current:
                # Add edge
                from_name = get_short_name(current, catalog)
                to_name = get_short_name(dep['depends_on'], catalog)
                lines.append(f"    {from_name} --> {to_name}")
                queue.append(dep['depends_on'])

    return "\n".join(lines)

# Usage
mermaid = generate_dependency_graph('SOM-SCR-0014-v1.0.0', catalog, dependencies)
print(mermaid)

Output:
graph LR
    cli --> db_handler
    cli --> rich
    db_handler --> sqlite3
    db_handler --> duckdb
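generate_dependency_graph relies on a get_short_name helper that the snippet does not define; a minimal sketch, assuming catalog entries expose file_id and name fields as in the earlier examples:

```python
from pathlib import Path

def get_short_name(file_id, catalog):
    """Return the file's stem (e.g. 'cli' for cli.py) for use as a Mermaid node label."""
    for entry in catalog:
        if entry['file_id'] == file_id:
            return Path(entry['name']).stem
    # Fall back to the raw ID for files outside the catalog
    return file_id

# Example
catalog = [{'file_id': 'SOM-SCR-0014-v1.0.0', 'name': 'cli.py'}]
print(get_short_name('SOM-SCR-0014-v1.0.0', catalog))  # cli
```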
Files created per agent:
SELECT
ar.name as agent,
COUNT(*) as files_created
FROM file_catalog fc
JOIN agent_registry ar ON fc.agent_id = ar.id
GROUP BY ar.name
ORDER BY files_created DESC;

Result:
agent | files_created
AGENT-CLAUDE-002 | 18
AGENT-GPT-001 | 7
AGENT-HUMAN-001 | 5
Agent activity timeline:
SELECT
date(modified) as day,
agent_id,
COUNT(*) as edits
FROM file_catalog
WHERE modified >= date('now', '-30 days')
GROUP BY day, agent_id
ORDER BY day DESC;

Most frequently modified files:
SELECT
    file_id,
    name,
    julianday(modified) - julianday(created) AS age_days,
    CAST(substr(version, 5) AS INTEGER) AS patch_count  -- patch digit(s) of MAJOR.MINOR.PATCH
FROM file_catalog
ORDER BY patch_count DESC
LIMIT 10;

Most popular tags:
SELECT tag, COUNT(*) as usage_count
FROM file_tags
GROUP BY tag
ORDER BY usage_count DESC
LIMIT 20;

Tag co-occurrence matrix:
SELECT
t1.tag,
t2.tag,
COUNT(*) as co_occurrences
FROM file_tags t1
JOIN file_tags t2 ON t1.file_id = t2.file_id AND t1.tag < t2.tag
GROUP BY t1.tag, t2.tag
HAVING co_occurrences > 1
ORDER BY co_occurrences DESC
LIMIT 20;

Result:
tag1 | tag2 | co_occurrences
proxy | security | 5
cli | database | 3
opentelemetry | monitoring | 4
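The patch-count query above slices semver strings with substr, which silently breaks once any component reaches two digits (e.g. 1.0.12). If you post-process query results in Python instead, a tiny parser is more robust; a sketch:

```python
def parse_semver(version):
    """Split a 'MAJOR.MINOR.PATCH' string into an integer tuple for safe comparison."""
    major, minor, patch = (int(part) for part in version.split('.'))
    return major, minor, patch

# String comparison gets this wrong ('1.0.9' > '1.0.12'); integer tuples do not
versions = ['1.0.9', '1.0.12', '0.3.1']
print(max(versions, key=parse_semver))  # 1.0.12
```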
Version distribution:
SELECT
CASE
WHEN version LIKE '0.%' THEN 'Pre-release'
WHEN version LIKE '1.%' THEN 'v1.x'
WHEN version LIKE '2.%' THEN 'v2.x'
ELSE 'Other'
END as version_range,
COUNT(*) as file_count
FROM file_catalog
GROUP BY version_range;

Stale files (not updated in 90 days):
SELECT
    file_id,
    name,
    modified,
    julianday('now') - julianday(modified) AS days_stale
FROM file_catalog
WHERE julianday('now') - julianday(modified) > 90
ORDER BY days_stale DESC;

Generate catalog-enriched output:
# Standard repomix
repomix --output code.txt
# Then parse and annotate with catalog data
python enrich_repomix.py code.txt catalog.db > code_enriched.txt

enrich_repomix.py:
import re
import sqlite3
def enrich_repomix(repomix_path, catalog_db):
    """Add catalog metadata annotations to repomix output."""
    conn = sqlite3.connect(catalog_db)

    with open(repomix_path) as f:
        lines = f.readlines()

    output = []
    for line in lines:
        output.append(line)

        # Detect file boundaries
        if line.startswith('<file path='):
            # Extract path
            match = re.search(r'path="([^"]+)"', line)
            if match:
                path = match.group(1)

                # Look up in catalog
                cursor = conn.execute("""
                    SELECT file_id, description, tags
                    FROM file_catalog
                    WHERE path = ?
                """, (path,))
                row = cursor.fetchone()

                if row:
                    file_id, desc, tags = row
                    output.append(f"<!-- CATALOG: {file_id} -->\n")
                    output.append(f"<!-- DESC: {desc} -->\n")
                    output.append(f"<!-- TAGS: {tags} -->\n")

    return ''.join(output)

# Usage
enriched = enrich_repomix('code.txt', 'data/catalog.db')
print(enriched)

Auto-generate project structure doc:
import sqlite3

def generate_project_structure_md(catalog_db):
    """Generate markdown documentation from catalog."""
    conn = sqlite3.connect(catalog_db)

    md = ["# Project Structure\n\n"]

    # Group by category
    categories = conn.execute("""
        SELECT DISTINCT category FROM file_catalog ORDER BY category
    """).fetchall()

    for (category,) in categories:
        md.append(f"## {category.title()}s\n\n")

        files = conn.execute("""
            SELECT file_id, name, description, tags
            FROM file_catalog
            WHERE category = ?
            ORDER BY file_id
        """, (category,)).fetchall()

        for file_id, name, desc, tags in files:
            md.append(f"### {name} (`{file_id}`)\n\n")
            md.append(f"**Description**: {desc}\n\n")
            md.append(f"**Tags**: {tags}\n\n")

    return ''.join(md)

# Usage
doc = generate_project_structure_md('data/catalog.db')
with open('PROJECT_STRUCTURE.md', 'w') as f:
    f.write(doc)

Symptom:
ghost-catalog validate
✗ SOM-SCR-0014-v1.0.0 cli.py
└─ Version mismatch: file_id has v1.0.0 but version field has 1.0.1
Cause: The file_id line and version field don't match.
Fix:
# Auto-fix
ghost-catalog validate --fix
# Or manually edit the file to match

Symptom:
✗ SOM-SCR-0008-v1.0.0 utils.py
└─ Missing required field: description
Cause: Header is incomplete.
Fix: Open the file, add the missing field:
# description: Utility functions for data processing

Symptom:
✗ utils.py
└─ Invalid file_id format: SOM-SCR-1-v1.0.0
Cause: Sequence number must be 4 digits.
Fix: Change SOM-SCR-1-v1.0.0 to SOM-SCR-0001-v1.0.0
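Padding can also be fixed mechanically; a hedged sketch (the pad_sequence helper is an illustration, not part of the shipped tooling) that zero-pads short sequence numbers to the required 4 digits:

```python
import re

def pad_sequence(file_id):
    """Zero-pad the sequence number of a SOM file ID to 4 digits."""
    return re.sub(
        r'^(SOM-[A-Z]{3})-(\d{1,4})-(v[\d.]+)$',
        lambda m: f"{m.group(1)}-{int(m.group(2)):04d}-{m.group(3)}",
        file_id,
    )

print(pad_sequence('SOM-SCR-1-v1.0.0'))     # SOM-SCR-0001-v1.0.0
print(pad_sequence('SOM-SCR-0014-v1.0.0'))  # SOM-SCR-0014-v1.0.0 (unchanged)
```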
Symptom:
✗ SOM-SCR-0014-v1.0.0 cli.py
└─ Filename mismatch: file is 'cli.py' but header says 'old_cli.py'
Cause: File was renamed but header not updated.
Fix: Update the name field in the header to match actual filename.
Symptom:
ghost-catalog list
# Shows old data, missing recent files
Cause: Catalog database hasn't been synced.
Fix:
ghost-catalog sync

Scenario: Sarah joins the Ghost_Shell team. She needs to:
- Understand project architecture
- Find relevant files for her first task (add CLI command)
- Not waste time reading every file
Traditional approach: 2-3 days of code reading
With catalog: 30 minutes
sequenceDiagram
    participant Sarah
    participant TUI as Catalog TUI
    participant Docs as Documentation
    participant Code as Codebase

    Sarah->>TUI: Launch ghost-catalog-tui
    TUI-->>Sarah: Show project overview
    Sarah->>TUI: Filter: category=documentation
    TUI-->>Sarah: Show 4 docs
    Sarah->>Docs: Read SOM-DOC-0001 (Quickstart)
    Docs-->>Sarah: 10-minute overview
    Sarah->>TUI: Search: "CLI"
    TUI-->>Sarah: Found SOM-SCR-0014 (cli.py)
    Sarah->>TUI: Press 'o' to open
    TUI->>Code: Open ghost_shell/cli.py in editor
    Sarah->>Code: Read cli.py, understand pattern
    Sarah->>Code: Add new command (following pattern)
    Note over Sarah: First contribution in 30 minutes!
From the catalog, Sarah instantly knew:
- SOM-DOC-0001: Quickstart guide (read this first)
- SOM-DOC-0003: Architecture overview (technical deep dive)
- SOM-SCR-0014: CLI implementation (what she needs to modify)
- SOM-SCR-XXXX: Database handler (CLI dependency)
From tags:
- cli + admin = She needs admin tools
- opentelemetry = Project uses OTel for instrumentation
- rich = CLI uses Rich library for formatting
Scenario: Ghost_Shell has grown to 200+ files. Mike the maintainer needs to:
- Find all files related to a feature (e.g., "proxy")
- Identify unmaintained files
- Audit version consistency before release
Traditional approach: Manual file-by-file review
With catalog: Automated queries
ghost-catalog search --tag proxy
# Results:
# SOM-SCR-0011-v1.0.0 blocker.py (Traffic blocking)
# SOM-SCR-0012-v1.1.0 core.py (Main proxy addon)
# SOM-SCR-0016-v1.0.0 fingerprint.py (Fingerprint randomization)
# SOM-SCR-0017-v1.0.0 cookies.py (Cookie interception)
# SOM-DOC-0005-v1.0.0 PROXY_GUIDE.md (Proxy documentation)

Impact: Found 5 files instantly (vs hours of grep/search)
-- Files not touched in 6 months
SELECT file_id, name, modified,
       julianday('now') - julianday(modified) AS days_stale
FROM file_catalog
WHERE julianday('now') - julianday(modified) > 180
ORDER BY days_stale DESC;

Result:
file_id | name | modified | days_stale
SOM-SCR-0003-v1.0.0 | old_util.py | 2024-03-15 | 245
SOM-DOC-0002-v1.0.0 | OLD_API.md | 2024-05-10 | 189
Action: Archive or update these files.
-- Find all pre-1.0 files
SELECT file_id, name, version
FROM file_catalog
WHERE version LIKE '0.%';

Release policy: All files must be v1.0.0+ for release.
Action: Bump versions of pre-release files or exclude from release.
Scenario: Three AI agents work on Ghost_Shell simultaneously:
- Agent A (CLAUDE): Adds OpenTelemetry instrumentation
- Agent B (GPT): Refactors database layer
- Agent C (HUMAN): Fixes bugs in CLI
They need to:
- Know what each other is working on
- Avoid editing the same files
- Hand off work when blocked
Traditional approach: Manual coordination (slow, error-prone)
With catalog: Agents query catalog for activity
# Agent A task: Add OTel to all files without it
ghost-catalog search --tag opentelemetry --format json > has_otel.json
ghost-catalog list --category script --format json > all_scripts.json
# Subtract: scripts WITHOUT otel
python find_missing_otel.py has_otel.json all_scripts.json > need_otel.txt

Result: Agent A has a work list.
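find_missing_otel.py itself isn't shown in this guide; a minimal sketch of the set subtraction it performs, assuming both inputs are JSON arrays of catalog entries with a file_id field (as produced by ghost-catalog's --format json):

```python
import json

def find_missing(has_otel_path, all_scripts_path):
    """List file IDs present in the full script list but absent from the OTel-tagged list."""
    with open(has_otel_path) as f:
        has_otel = {entry['file_id'] for entry in json.load(f)}
    with open(all_scripts_path) as f:
        all_ids = [entry['file_id'] for entry in json.load(f)]
    return [fid for fid in all_ids if fid not in has_otel]

# Usage: write one ID per line to the work list
# for fid in find_missing('has_otel.json', 'all_scripts.json'):
#     print(fid)
```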
-- Agent B wants to refactor db_handler.py
-- First, check what depends on it
SELECT fc.file_id, fc.name, fc.agent_id
FROM file_dependencies fd
JOIN file_catalog fc ON fd.file_id = fc.file_id
WHERE fd.depends_on_file_id = 'SOM-SCR-XXXX-v1.0.0';

Result:
file_id | name | agent_id
SOM-SCR-0012-v1.1.0 | core.py | AGENT-CLAUDE-001
SOM-SCR-0014-v1.0.0 | cli.py | AGENT-HUMAN-001
Action: Agent B notifies Agent A and Agent C about upcoming breaking changes.
-- Agent C wants to know what changed in last 24 hours
SELECT file_id, name, modified, agent_id
FROM file_catalog
WHERE modified >= date('now', '-1 day')
ORDER BY modified DESC;

Result: Agent C sees Agent A added OTel to 3 files yesterday.
Action: Agent C reviews changes before starting work.
Agent A completes OTel work, updates diary:
## 2025-01-24: Agent A (CLAUDE)
- Added OTel to SOM-SCR-0012, 0013, 0014
- Updated tags: Added 'opentelemetry' to all
- Next: Agent B should verify OTel integration in database layer

Agent B reads diary, queries catalog:

ghost-catalog search --tag opentelemetry --tag database

Result: Agent B picks up where Agent A left off.
Scenario: Production bug in collector.py. Questions:
- When was this file created?
- Who created it?
- Has it been modified recently?
- What does it depend on?
Traditional approach: Git log + manual inspection
With catalog: Single query
ghost-catalog info SOM-SCR-0013-v1.0.0

Output:
╭──────────────────────────────────────────╮
│ File: SOM-SCR-0013-v1.0.0 │
├──────────────────────────────────────────┤
│ Name: collector.py │
│ Path: ghost_shell/intel/collector.py │
│ Description: Intelligence collector │
│ Category: script │
│ Tags: [intel, asn, teamcymru] │
│ Version: 1.0.0 │
│ Created: 2025-11-23 │
│ Modified: 2025-11-23 (today!) │
│ Agent: AGENT-CLAUDE-002 │
│ Execution: python -m ghost_shell.intel.collect │
╰──────────────────────────────────────────╯
Insights:
- Created 2 months ago (2025-11-23)
- Created by AGENT-CLAUDE-002
- Modified TODAY (bug likely introduced today)
-- Find all modifications to collector.py
SELECT
date(modified) as change_date,
version,
agent_id
FROM file_catalog
WHERE file_id LIKE 'SOM-SCR-0013-%'
ORDER BY change_date DESC;(Note: This requires version history tracking, which can be added by archiving old versions in catalog)
-- What does collector.py depend on?
SELECT fc.file_id, fc.name
FROM file_dependencies fd
JOIN file_catalog fc ON fd.depends_on_file_id = fc.file_id
WHERE fd.file_id = 'SOM-SCR-0013-v1.0.0';

Result:
SOM-SCR-XXXX-v1.0.0 db_handler.py
Action: Check if db_handler.py changed recently (possible cause of bug).
Scenario: Company audit requires:
- List all files containing security-sensitive code
- Prove all files have been reviewed in last 6 months
- Show version control and change tracking
Traditional approach: Manual audit (weeks of work)
With catalog: Automated compliance reports
# Find all security-related files
ghost-catalog search --tag security --format json > security_files.json
# Count them
jq 'length' security_files.json
# Output: 12 files

Report:
# Security Audit Report
- Total security-related files: 12
- All files cataloged: Yes
- All files have agent IDs: Yes (traceability)

-- Files not reviewed (modified) in 6 months
SELECT
    file_id,
    name,
    modified,
    julianday('now') - julianday(modified) AS days_since_review
FROM file_catalog
WHERE category IN ('script', 'configuration')
  AND julianday('now') - julianday(modified) > 180;

Result:
2 files require review
Action: Flag these for review, update modified dates after review.
-- Show all changes in last quarter
SELECT
date(modified) as change_date,
file_id,
name,
agent_id
FROM file_catalog
WHERE modified >= date('now', '-90 days')
ORDER BY change_date DESC;

Export for audit:

ghost-catalog export audit_report.csv --format csv --from 2024-10-01

1. Multi-Agent Development
- Multiple AI agents (or AI + humans) work on codebase
- Need coordination and handoffs
- Example: Claude, GPT-4, and human developers all contributing
2. Rapid Onboarding
- New contributors join frequently
- Need to understand codebase quickly
- Example: Open-source project with rotating contributors
3. Complex Codebases
- 50+ files
- Multiple modules/features
- Hard to remember what each file does
4. Versioning Matters
- Track file versions independently
- Need to know which files are stable (v1.x) vs experimental (v0.x)
5. Semantic Organization
- Want to group files by purpose (scripts, docs, tests)
- Tag-based discovery important
6. Compliance/Audit Requirements
- Need to track who created/modified files
- Prove review cycles
- Generate compliance reports
7. Documentation-Heavy Projects
- Lots of docs that need to stay organized
- Docs need to reference code files consistently
1. Tiny Projects
- 1-10 files
- Single developer
- No need for coordination
- Alternative: Just use good filenames and a README
2. Volatile Early-Stage Projects
- Files created/deleted constantly
- Architecture not settled
- Issue: Maintaining headers is overhead
- Wait until: Architecture stabilizes
3. No AI Agent Involvement
- Pure human development
- Team uses existing tools (IDEs, Git, Jira)
- Alternative: Stick with what works
4. External Codebase
- You don't control the files
- Can't add headers (third-party libraries)
- Alternative: External documentation
5. Real-Time Collaboration
- Google Docs-style simultaneous editing
- Issue: Headers can't track live changes
- Alternative: Use version control + branch names
6. Binary Files
- Can't embed headers in executables, images, videos
- Alternative: External metadata database only
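For binaries, where no header can be embedded, the metadata can live entirely in the database, keyed by a content hash; a minimal sketch (binary_catalog and its columns are assumptions following the schema style used elsewhere in this guide):

```python
import hashlib
import sqlite3
from pathlib import Path

def register_binary(conn, path, file_id, description):
    """Catalog a binary asset externally, using a SHA-256 content hash in place of an embedded header."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    conn.execute("""
        CREATE TABLE IF NOT EXISTS binary_catalog (
            file_id     TEXT PRIMARY KEY,
            path        TEXT,
            sha256      TEXT,
            description TEXT
        )
    """)
    conn.execute(
        "INSERT OR REPLACE INTO binary_catalog VALUES (?, ?, ?, ?)",
        (file_id, str(path), digest, description),
    )
    conn.commit()
```

The hash lets validation detect when a binary changed without its catalog entry being updated, mirroring what header validation does for text files.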
| Feature | SOM File IDs | UUIDs | Git Commit Hashes |
|---|---|---|---|
| Uniqueness | Sequential (human-assigned) | Cryptographically guaranteed | Cryptographically guaranteed |
| Human-Readable | ✅ Yes (semantic) | ❌ No (random) | ❌ No (random) |
| Self-Documenting | ✅ Category + sequence | ❌ Opaque | ❌ Opaque |
| Collision Risk | ⚠️ Possible without coordination | ✅ None (negligible) | ✅ None (SHA-1) |
| Version Tracking | ✅ Built-in (semver) | ❌ No | ✅ Yes (commit history) |
| Searchability | ✅ By category, tag, agent | ❌ Exact match only | ✅ By branch, author, date |
| Discoverability | ✅ Browse by category | ❌ Need registry lookup | ❌ Need registry lookup |
| Metadata | ✅ 12+ fields embedded | ❌ None | ✅ Commit message, author |
| Git-Friendly | ✅ Readable diffs | ❌ Random strings | ✅ Built-in |
| Distributed Systems | ✅ Works offline | ✅ Decentralized | ✅ Decentralized |
| File Granularity | ✅ Per-file tracking | ✅ Can be per-file | ❌ Per-commit (multiple files) |
| Tooling | ⚠️ Custom tooling required | ✅ Language built-ins | ✅ Git (universal) |
Use SOM File IDs for:
- AI agent coordination
- Semantic file organization
- Rapid onboarding
- Tag-based discovery
Use UUIDs for:
- Database primary keys
- Distributed systems (no coordination)
- API tokens
- Anonymous identifiers
Use Git Hashes for:
- Version control (already using Git)
- Tracking commit history
- Branching/merging
- Reproducible builds
Use All Three:
- SOM IDs: File catalog (what/who/when)
- Git: Version history (changes over time)
- UUIDs: Database records (data objects)
┌─────────────────────────────────────────────────────────┐
│ Ghost_Shell File Catalog System │
├─────────────────────────────────────────────────────────┤
│ │
│ What: Semantic file IDs (SOM-XXX-NNNN-vX.X.X) │
│ Why: AI agent coordination + rapid onboarding │
│ How: Embedded headers + optional SQLite catalog │
│ When: Multi-agent, 50+ files, frequent onboarding │
│ Tools: CLI, TUI, SQL queries, Git hooks │
│ │
│ ✅ Human-readable, self-documenting │
│ ✅ Category-based, tag-searchable │
│ ✅ Agent tracking, version control │
│ ✅ Git-friendly, works offline │
│ │
│ ⚠️ Manual coordination needed (no UUID uniqueness) │
│ ⚠️ Custom tooling required (not built-in) │
│ ⚠️ Header maintenance overhead │
│ │
└─────────────────────────────────────────────────────────┘
Next Steps:
- ✅ Read this guide (you're here!)
- 📥 Install ghost-catalog CLI
- 🧪 Try it on a small project (5-10 files)
- 📚 Add headers to existing project
- 🔍 Build catalog database
- 🚀 Launch TUI browser
- 🔗 Add Git hooks for automation
- 📊 Query catalog for insights
Resources:
- GitHub: https://github.com/somacosf/ghost-catalog
- CLI Documentation: https://docs.ghost-catalog.io/cli
- TUI Guide: https://docs.ghost-catalog.io/tui
- Gist (this doc): [Your gist URL]
Community:
- Discord: [Link to Discord server]
- Issues: https://github.com/somacosf/ghost-catalog/issues
- Discussions: https://github.com/somacosf/ghost-catalog/discussions
END OF GUIDE
Document Version: 1.0.0 Created: 2025-01-24 Author: Claude (Sonnet 4.5) Project: Ghost_Shell / Somacosf License: MIT
"Every file tells a story. The catalog makes sure you can find it."