@zeroasterisk
Last active June 28, 2026 05:10
Scion Hub Friction Log & Architectural Review

Author: Alan Blount (@zeroasterisk)
Date: June 2026
Context: Setting up Scion Hub on a personal NAS, with assistance from Antigravity (Gemini agent)

Executive Summary

Over several days, Alan and Antigravity spent multiple hours trying to get a Scion Hub running on a personal NAS. This was despite:

  • Having a fully functional GCP account with active service accounts
  • Having prior experience with Scion on an already-configured hub (aiopm)
  • Having an AI coding agent (Antigravity) actively helping debug and configure

The setup was not successful. The friction documented below is not theoretical — it blocked a motivated, technically experienced user with working infrastructure from completing a deployment.

These findings are platform-wide architectural issues, not NAS-specific problems. The NAS deployment simply stripped away the enterprise scaffolding that masks the friction on managed infrastructure. Every issue here affects GCE, Kubernetes, and workstation deployments equally.


1. Configuration & Orchestration Friction

#511 — Unified configuration schema with hierarchical global defaults

What Went Poorly

Settings are fragmented across four layers: host-side .env files, git-committed config/settings.yaml, container-level hub.env variables, and stateful database rows in hub.db.

  • The DB Trap: Configuring a "global" service account key across all projects originally required direct SQL database manipulation (hub.db) because there was no clean file-based config or CLI flag to set it globally.
  • Redundancy: Credentials and GCP settings have to be defined repeatedly on a per-project basis, violating DRY principles.
  • Fragility: Settings can be silently overwritten on deploy, lost on restart, or ignored when defaults aren't respected.

Proposals

  • (a) Unified Configuration Schema: Consolidate all configuration into a single, comprehensive schema-validated YAML file (e.g., settings.yaml). Provide a fully documented master template (settings.yaml.example) detailing every key so users never have to guess.
  • (b) Hierarchical Global Defaults: Introduce a global_defaults block so projects/agents inherit credentials and settings without redundant configuration:
server:
  global_defaults:
    auth:
      selected_type: vertex-ai
      service_account_key_path: /secrets/credentials/credentials.json
    gcp:
      project_id: my-project
      region: us-east5
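
As a rough illustration of how inheritance from global_defaults could resolve, the sketch below deep-merges a per-project override onto the defaults. The function name merge_settings and the override shown are hypothetical; nothing here reflects how Scion currently loads configuration.

```python
from copy import deepcopy


def merge_settings(defaults: dict, overrides: dict) -> dict:
    """Return defaults with overrides applied; nested dicts merge key by key."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Mirrors the global_defaults block proposed above.
GLOBAL_DEFAULTS = {
    "auth": {
        "selected_type": "vertex-ai",
        "service_account_key_path": "/secrets/credentials/credentials.json",
    },
    "gcp": {"project_id": "my-project", "region": "us-east5"},
}

# Hypothetical per-project override: only the region differs.
project_overrides = {"gcp": {"region": "europe-west4"}}

effective = merge_settings(GLOBAL_DEFAULTS, project_overrides)
print(effective["gcp"])
```

With this shape, a project that sets nothing inherits every default, and a project that sets one key overrides only that key.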

Related upstream issues

  • #475 — gce-start-hub.sh --full overwrites settings.yaml, destroying plugin config
  • #147 — Admin settings saved via web UI are lost on pod restart in Kubernetes
  • #212 — default_harness_config not respected after init
  • #473 — Read/write project config as agent-accessible tool
  • #163 — Support file:// URI for task, matching system_prompt pattern
  • #160 — Allow local host-path projection into file secrets without a Hub
  • #201 — Hub Template importing for local grove has wrong default

2. Onboarding & Bootstrap Friction

#512 — Streamlined onboarding and first-run bootstrap experience

What Went Poorly

Getting Scion running for the first time requires high-friction CLI orchestration (manually cloning repos, building/pulling images, editing environment files, and seeding databases). The onboarding flow immediately exposes the user to abstract concepts of "projects" and "brokers," when they just want to spin up their first coding agent on their local files.

Even with a working GCP account, valid service account keys, and experience on an existing hub, the setup process took hours of debugging and was ultimately unsuccessful. The gap between "I have credentials" and "I have a running agent" is far too wide.

Proposal: Interactive CLI Onboarding (like claude CLI)

The best reference for what this should feel like is the Claude Code CLI onboarding flow — when you first run claude, it walks you through authentication, preferences, and workspace trust in an interactive TUI that feels lightweight and guided. Scion needs the same.

scion init — a guided CLI/TUI onboarding flow that:

  1. Asks where your hub will run — local Docker, remote GCE VM, Kubernetes, NAS, etc. Tailors subsequent steps to the deployment target.
  2. Walks through authentication — asks how you'll authenticate agents (Vertex AI, API key, OAuth, auth file). Accepts a service account JSON key path and validates it in real time.
  3. Sets global defaults along the way — as the user answers questions, the wizard writes a complete settings.yaml with sensible defaults. No separate "edit this YAML" step.
  4. Explains key concepts interactively — brief, contextual explanations of projects, brokers, agents, and the configuration hierarchy as they become relevant:

    "A project groups related agents and their configuration. Most users start with one. You can add more later."
    "A broker is the runtime that runs your agent containers. Your local Docker daemon is the default broker."

  5. Validates the full stack before finishing — Docker accessible? Broker reachable? Auth working? Can we pull the agent image? Shows a green/red checklist with actionable next steps for failures.
  6. Includes best practices — GCP access patterns, broker placement for local vs remote hardware.
  7. Ends with a running agent — "Start your first agent? [Y/n]" → spins up a coordinator, proving setup works end-to-end.

Why CLI/TUI, not a web UI: it runs before the hub is up, meets users where they are (terminal), and can be re-run (scion init --reconfigure). The Claude CLI and gh auth login prove this pattern works.
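
The validation steps in the flow above (checking the service account key in step 2 and the stack checklist in step 5) might look roughly like the following. This is a minimal sketch: the function names are hypothetical, the key check is structural only (it does not contact GCP), and the field list is the usual shape of a service account JSON key rather than anything Scion defines.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

CheckResult = Tuple[bool, str]  # (passed, detail or next step)

REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_docker_cli() -> CheckResult:
    """Pass if a docker binary is discoverable on PATH."""
    path = shutil.which("docker")
    if path:
        return True, f"found at {path}"
    return False, "docker not found on PATH; install it or configure its location"


def check_service_account_key(key_path: str) -> CheckResult:
    """Structural check only: file exists, parses as JSON, has the usual fields."""
    file = Path(key_path)
    if not file.is_file():
        return False, f"{key_path} does not exist"
    try:
        data = json.loads(file.read_text())
    except ValueError as exc:  # covers invalid JSON and undecodable bytes
        return False, f"{key_path} is not valid JSON ({exc})"
    if not isinstance(data, dict):
        return False, f"{key_path} is not a JSON object"
    missing = [name for name in REQUIRED_KEY_FIELDS if name not in data]
    if missing:
        return False, "missing fields: " + ", ".join(missing)
    if data["type"] != "service_account":
        return False, f"unexpected key type {data['type']!r}"
    return True, f"key for {data['client_email']}"


def run_checks(checks: List[Tuple[str, Callable[[], CheckResult]]]) -> bool:
    """Run every check, print a checklist, and return True only if all passed."""
    all_passed = True
    for name, check in checks:
        passed, detail = check()
        print(f"[{'ok' if passed else 'FAIL'}] {name}: {detail}")
        all_passed = all_passed and passed
    return all_passed


# Demo with a throwaway key file; real use would pass the path the user typed.
with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "credentials.json"
    key_file.write_text(json.dumps({
        "type": "service_account",
        "project_id": "my-project",
        "private_key": "placeholder",
        "client_email": "agent@my-project.iam.gserviceaccount.com",
    }))
    good_key = check_service_account_key(str(key_file))
    missing_key = check_service_account_key(str(Path(tmp) / "absent.json"))
    run_checks([
        ("Docker CLI", check_docker_cli),
        ("Service account key", lambda: check_service_account_key(str(key_file))),
    ])
```

Each check returns a detail string alongside its pass/fail result, so a failing line can carry the actionable next step the checklist calls for.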

Related upstream issues

  • #511 — Unified configuration schema (the wizard should write this config)
  • #245 — Not able to build images (setup tutorial)
  • #137 — build-images.sh quickstart commands fail with 403 on first run
  • #224 — Pull access denied on first run
  • #254 — gemini-cli → antigravity-cli (naming confusion during setup)
  • #182 — Docs for localhost registry

3. Agent Harness & Runtime Isolation Friction

#513 — Agent harnesses hang on interactive prompts in headless mode

What Went Poorly

When an agent starts up inside a Docker container, it uses an interactive harness (like Google Antigravity or Claude Code).

  • The Headless Hang: These CLIs are designed to run in a user-facing terminal and will prompt for confirmation (e.g., "Do you trust the files in this folder?", "Press Enter to authenticate"). Because the harness runs headlessly in the background, nothing ever answers the prompt and the agent hangs indefinitely.
  • The Workspace Trust Catch-22: To bypass the trust prompt, we had to manually pre-seed config files inside the container:
    // /home/scion/.gemini/antigravity-cli/settings.json
    { "trustedWorkspaces": ["/workspace", "/"] }
    
    // cache/onboarding.json
    { "onboardingCompleted": true }
    The broker has already explicitly authorized the project's workspace path — the harness should not ask the user to confirm what the system already trusts.
  • Non-Standard Path Resolution: Scion assumes a standard Linux environment (like GCE/Debian with docker in /usr/bin/docker), which breaks on appliance environments (Synology NAS, QNAP, TrueNAS) where binaries live in non-standard paths (e.g., /Volume1/@apps/DockerEngine/dockerd/bin/docker). The hub shouldn't require the docker CLI binary inside its container just for socket calls, or it should auto-detect the path / allow a config override like broker.docker_cli_path.
  • Out-of-Sync Images: The Hub requires the docker client to monitor containers. When image layers get out of sync, the hub silently fails to execute docker ps with "executable not found in $PATH" errors, leaving the user with a blank status line.
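
One way the path-resolution fix could work is an ordered lookup: an explicit override (such as the broker.docker_cli_path setting suggested above), then PATH, then a list of known appliance locations. The sketch below assumes that ordering; the function name and the fallback list are illustrative, and the only appliance path included is the example quoted above.

```python
import os
import shutil
import tempfile
from typing import Iterable, Optional

# Example appliance location quoted above; a real list would need per-vendor research.
KNOWN_APPLIANCE_PATHS = ("/Volume1/@apps/DockerEngine/dockerd/bin/docker",)


def _is_executable(path: str) -> bool:
    return os.path.isfile(path) and os.access(path, os.X_OK)


def resolve_docker_cli(
    override: Optional[str] = None,
    fallbacks: Iterable[str] = KNOWN_APPLIANCE_PATHS,
) -> Optional[str]:
    """Resolve the docker binary: explicit override, then PATH, then known fallbacks."""
    if override:
        # An explicit setting that is wrong should fail loudly, not fall through.
        if not _is_executable(override):
            raise FileNotFoundError(f"configured docker CLI is not executable: {override}")
        return override
    on_path = shutil.which("docker")
    if on_path:
        return on_path
    for candidate in fallbacks:
        if _is_executable(candidate):
            return candidate
    return None


# Demo: a fake executable stands in for a docker binary in a non-standard location.
with tempfile.TemporaryDirectory() as tmp:
    fake_docker = os.path.join(tmp, "docker")
    with open(fake_docker, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(fake_docker, 0o755)
    resolved = resolve_docker_cli(override=fake_docker)
    try:
        resolve_docker_cli(override=os.path.join(tmp, "missing"))
        bad_override_rejected = False
    except FileNotFoundError:
        bad_override_rejected = True
```

Returning None when nothing is found (rather than raising) lets the caller surface a clear "docker CLI not found" status instead of a blank status line.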

Proposals

  • (e) Out-of-the-Box Headless Seeding: The template harnesses (gemini_cli.go, claude_code.go) must enforce a strict, guaranteed headless mode when run as a background agent. They should automatically pre-seed config directories (e.g., settings.json with trustedWorkspaces: ["/workspace"] and onboarding.json cache files) prior to starting the container, preventing interactive prompts from blocking startup. Upstream harnesses must automatically accept workspace trust when running inside a broker-controlled container, since the broker has already explicitly authorized the project's workspace path.
  • (g) Error-Resilient Web Terminals: If an agent container fails or waits on interactive approval, do not fail silently in a black box. The Hub UI should provide an option to attach an interactive web terminal (via docker exec -it or a terminal session attached to the inner tmux session) so the user can easily see errors, click through prompts, or type interactive commands directly.
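
The pre-seeding in proposal (e) could be sketched as below. It is written in Python for brevity, although the harness templates named above are Go files. The function name is hypothetical, and the sketch assumes the cache/ directory lives under the same harness config directory as settings.json, which the manual workaround does not confirm.

```python
import json
import tempfile
from pathlib import Path


def seed_headless_config(config_dir: Path, workspace: str = "/workspace") -> None:
    """Write trust and onboarding files so the harness shows no first-run prompts."""
    settings_path = config_dir / "settings.json"
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())  # keep any existing keys
    trusted = list(settings.get("trustedWorkspaces", []))
    if workspace not in trusted:
        trusted.append(workspace)
    settings["trustedWorkspaces"] = trusted

    config_dir.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2))

    # Assumption: cache/ sits under the same config directory.
    onboarding_path = config_dir / "cache" / "onboarding.json"
    onboarding_path.parent.mkdir(parents=True, exist_ok=True)
    onboarding_path.write_text(json.dumps({"onboardingCompleted": True}))


# Demo: a temp directory stands in for /home/scion inside the agent container.
with tempfile.TemporaryDirectory() as home:
    config_dir = Path(home) / ".gemini" / "antigravity-cli"
    seed_headless_config(config_dir)
    seed_headless_config(config_dir)  # second call must not duplicate entries
    seeded = json.loads((config_dir / "settings.json").read_text())
    onboarding = json.loads((config_dir / "cache" / "onboarding.json").read_text())
```

Merging into an existing settings.json instead of overwriting it keeps the seeding step safe to run on every container start.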

Related upstream issues

  • #125 — Claude harness: trust dialog not pre-accepted (closed — specific fix, general pattern remains)
  • #212 — default_harness_config not respected after init
  • #215 — Support non-interactive auth-file Kubernetes rounds
  • #87 — Harness home directories should exist before sync into non-root homes
  • #108 — Hub marks running agents as stalled/idle, loses phase on PTY disconnect
  • #165 — Control channel: agent create fails with websocket close 1009 (message too big)

4. Diagnostics & Debugging Friction

#514 — Integrated diagnostics dashboard and unified log aggregation

What Went Poorly

There is no unified logging aggregation. When an agent fails to start, the Hub UI doesn't display the stdout/stderr. To figure out why an agent is stuck, the operator must SSH into the host, run docker logs scion-hub, and then find a way to attach to the inner agent's tmux logs. During hours of debugging, this was the single biggest time sink — knowing something was wrong but having no visibility into what.

  • Telemetry Floods Mask Real Errors: If the Google Service Account is missing specific IAM permissions (like cloudtrace.traces.patch), the telemetry loop inside sciontool floods stdout with gRPC PermissionDenied errors. While it doesn't crash the broker, it makes standard logs unreadable, drowning out actual agent start errors. The bootstrapper should have a --no-telemetry flag or gracefully back off (disable trace exporting) if IAM permissions are denied, rather than retrying every second.
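
The "gracefully back off" behavior could take roughly this shape: a wrapper that disables exporting after a permission failure and backs off exponentially on transient ones. This is a sketch under stated assumptions: GuardedExporter is a hypothetical name, and Python's built-in PermissionError and ConnectionError stand in for the gRPC status codes the real telemetry loop would see.

```python
from typing import Callable, List


class GuardedExporter:
    """Wrap an export callable so permission failures disable it instead of looping."""

    def __init__(self, export: Callable[[dict], None], max_delay: float = 300.0):
        self._export = export
        self._max_delay = max_delay
        self.enabled = True
        self.next_delay = 1.0  # seconds the caller should wait before the next attempt

    def send(self, span: dict) -> bool:
        """Try one export; return True on success."""
        if not self.enabled:
            return False
        try:
            self._export(span)
        except PermissionError as exc:
            # Retrying cannot fix a missing IAM permission: log once, then stop.
            print(f"telemetry disabled: {exc}")
            self.enabled = False
            return False
        except ConnectionError:
            # Transient failure: double the wait, up to the cap.
            self.next_delay = min(self.next_delay * 2, self._max_delay)
            return False
        self.next_delay = 1.0
        return True


# Demo: an export function that always fails the way a missing IAM role would.
calls: List[dict] = []


def denied_export(span: dict) -> None:
    calls.append(span)
    raise PermissionError("cloudtrace.traces.patch denied")


exporter = GuardedExporter(denied_export)
results = [exporter.send({"name": f"span-{i}"}) for i in range(5)]
print(len(calls))
```

The permission error is logged exactly once, so the actual agent start errors stay visible in the same log.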

Proposals

  • (f) Integrated Web Log Viewer: Build a centralized "Diagnostics & Logs" dashboard inside the Hub UI. This panel should stream live, aggregated logs from the Hub server, the runtime broker, and active agent container outputs in a single, color-coded stream with built-in log level filters (INFO, WARN, ERROR).
  • Auto-Healing and Verification Checks: The make doctor utility or the Hub startup script should verify that the active container image is fully in-sync with the current Dockerfile dependencies. If a binary (like docker) is missing from the container, it should raise a clear warning in the Hub UI: "Hub image out-of-sync. Rebuild suggested: run 'make rebuild'."
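
The aggregation behind proposal (f) can be sketched as a timestamp-ordered merge of per-component streams with a minimum-level filter. The LogLine shape, the numeric level values, and the sample messages are all assumptions made for illustration; Scion's actual log format may differ.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple

LEVELS = {"INFO": 20, "WARN": 30, "ERROR": 40}


class LogLine(NamedTuple):
    timestamp: float
    source: str
    level: str
    message: str


def merged_logs(
    streams: Iterable[Iterable[LogLine]], min_level: str = "INFO"
) -> Iterator[LogLine]:
    """Merge already time-ordered streams into one, dropping lines below min_level."""
    threshold = LEVELS[min_level]
    for line in heapq.merge(*streams, key=lambda entry: entry.timestamp):
        # Unknown levels are shown rather than hidden.
        if LEVELS.get(line.level, LEVELS["ERROR"]) >= threshold:
            yield line


# Hypothetical sample lines from the three components named in the proposal.
hub = [
    LogLine(1.0, "hub", "INFO", "listening"),
    LogLine(4.0, "hub", "ERROR", "docker ps: executable not found in $PATH"),
]
broker = [LogLine(2.0, "broker", "WARN", "telemetry export PermissionDenied")]
agent = [LogLine(3.0, "agent", "INFO", "waiting on workspace trust prompt")]

visible = list(merged_logs([hub, broker, agent], min_level="WARN"))
for line in visible:
    print(f"{line.timestamp:>5.1f} [{line.source}] {line.level}: {line.message}")
```

Because heapq.merge is lazy, the same approach works on live streams as long as each source emits lines in timestamp order.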

Related upstream issues

  • #174 — Blank Web UI in workstation mode
  • #108 — Hub marks running agents as stalled/idle, loses phase on PTY disconnect
  • #90 — Hosted grove discovery should recover grove IDs from workspace markers
  • #161 — Telemetry: clarify authoritative work-completion signal for external subscribers

5. Stateful Identity Drift on Restart

#515 — Colocated broker generates new ID on hub restart, orphaning all agents

What Went Poorly

When the scion-hub container restarted, the colocated runtime broker generated a fresh, random UUID and registered itself with the hub. But the SQLite database still had all projects and agents assigned to the old broker ID. Every subsequent API call failed with invalid signature / 401 broker_auth_failed.

The hub appeared running. The broker appeared running. But every agent operation silently failed with a cryptic auth error that gave no indication the root cause was a stale broker ID in the database.

Diagnosing and fixing this required direct SQL against hub.db:

-- Fix the colocated broker identity drift when the container restarts:
UPDATE projects SET default_runtime_broker_id = 'c65b1619-a9a3-49d6-8eb5-a0fbc5ea1e2d';
UPDATE agents SET runtime_broker_id = 'c65b1619-a9a3-49d6-8eb5-a0fbc5ea1e2d';

-- Force the agent container's environment to respect persistent mounting paths:
INSERT OR REPLACE INTO env_vars (project_id, name, value)
VALUES ('d65c34ba-033c-4e18-8e96-76e81384e3e9', 'HOME', '/home/scion');

The SQL Debugging Anti-Pattern

This is symptomatic of a broader pattern encountered throughout the setup: debugging and fixing Scion's operational state repeatedly required direct SQLite manipulation. An AI agent (Antigravity) was used to run diagnostic SQL, inspect database state, and apply fixes — work that should be handled by the platform's own operational tooling.

Every time an operator (or their AI agent) has to open the database to understand or fix Scion's state, it's a sign that the platform needs:

  • A scion doctor command that detects and surfaces these mismatches
  • Self-healing on startup (detect stale broker IDs, auto-update references)
  • Persistent broker identity (write ID to a file on first run, reuse on restart)
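
A scion doctor check for this mismatch might start with a query like the one below. It reuses the table and column names from the manual SQL fix above and assumes a single colocated broker whose current ID is known. The function name and the demo schema are hypothetical, and a multi-broker hub would need to compare against every registered broker instead.

```python
import sqlite3
from typing import Dict


def count_stale_broker_refs(conn: sqlite3.Connection, live_broker_id: str) -> Dict[str, int]:
    """Count rows that still point at a broker other than the one currently registered."""
    projects = conn.execute(
        "SELECT COUNT(*) FROM projects "
        "WHERE default_runtime_broker_id IS NOT NULL AND default_runtime_broker_id != ?",
        (live_broker_id,),
    ).fetchone()[0]
    agents = conn.execute(
        "SELECT COUNT(*) FROM agents "
        "WHERE runtime_broker_id IS NOT NULL AND runtime_broker_id != ?",
        (live_broker_id,),
    ).fetchone()[0]
    return {"projects": projects, "agents": agents}


# Demo against an in-memory database with a minimal, assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (default_runtime_broker_id TEXT)")
conn.execute("CREATE TABLE agents (runtime_broker_id TEXT)")
conn.executemany(
    "INSERT INTO projects VALUES (?)", [("old-broker",), ("new-broker",)]
)
conn.executemany(
    "INSERT INTO agents VALUES (?)", [("old-broker",), ("old-broker",), ("new-broker",)]
)

stale = count_stale_broker_refs(conn, "new-broker")
if any(stale.values()):
    print(f"stale broker references found: {stale}")
conn.close()
```

A non-zero count is the signal to surface to the operator; the same comparison could also drive the self-healing update on startup.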

Proposals

Any one of the following would prevent the drift:

  1. The colocated broker persists its broker ID across restarts (write to file, reuse on subsequent runs)
  2. The hub auto-heals on startup: it detects stale broker references and updates them when the colocated broker re-registers
  3. The colocated broker uses a well-known sentinel ID (e.g., colocated) instead of a random UUID

The 401 broker_auth_failed error should include diagnostic context: "Broker ID on this agent does not match any registered broker. Did the hub restart? Run scion doctor to diagnose."
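
Option 1 (persisting the broker ID) could be as small as the following sketch. The function name and file location are hypothetical, and the approach only helps if the ID file sits on a volume that survives container restarts.

```python
import os
import tempfile
import uuid
from pathlib import Path


def load_or_create_broker_id(id_file: Path) -> str:
    """Return the broker ID stored in id_file, creating it on first run."""
    if id_file.exists():
        stored = id_file.read_text().strip()
        uuid.UUID(stored)  # raises ValueError if the file is corrupt
        return stored
    broker_id = str(uuid.uuid4())
    id_file.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file and rename so a crash never leaves a half-written ID.
    tmp_path = id_file.with_name(id_file.name + ".tmp")
    tmp_path.write_text(broker_id + "\n")
    os.replace(tmp_path, id_file)
    return broker_id


# Demo: two calls simulate two starts of the same container.
with tempfile.TemporaryDirectory() as state_dir:
    id_file = Path(state_dir) / "state" / "broker_id"
    first_start = load_or_create_broker_id(id_file)
    second_start = load_or_create_broker_id(id_file)
```

Both calls return the same ID, so projects and agents recorded against it remain valid after a restart.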

Related upstream issues

  • #511 — Unified configuration schema (broker ID persistence is a config concern)
  • #514 — Integrated diagnostics (this mismatch should be surfaced automatically)
  • #108 — Hub marks running agents as stalled/idle
  • #212 — default_harness_config not respected after init

All Referenced Upstream Issues

| # | Title | Status | Friction Area |
|---|---|---|---|
| #511 | Unified configuration schema with hierarchical global defaults | New | Config |
| #512 | Streamlined onboarding and first-run bootstrap experience | New | Onboarding |
| #513 | Agent harnesses hang on interactive prompts in headless mode | New | Harness |
| #514 | Integrated diagnostics dashboard and unified log aggregation | New | Diagnostics |
| #515 | Colocated broker generates new ID on restart, orphaning agents | New | Identity / Config |
| #475 | gce-start-hub.sh --full overwrites settings.yaml | Open | Config |
| #473 | Read/write project config as agent-accessible tool | Open | Config |
| #254 | gemini-cli → antigravity-cli | Open | Onboarding |
| #245 | Not able to build images (setup tutorial) | Open | Onboarding |
| #224 | Pull access denied on first run | Open | Onboarding |
| #215 | Support non-interactive auth-file Kubernetes rounds | Open | Harness |
| #212 | default_harness_config not respected after init | Open | Config / Harness |
| #201 | Hub template importing has wrong default | Open | Config |
| #182 | Docs for localhost registry | Open | Onboarding |
| #174 | Blank Web UI in workstation mode | Open | Diagnostics |
| #165 | WebSocket close 1009 (message too big) on agent create | Open | Harness |
| #163 | Support file:// URI for task config | Open | Config |
| #161 | Clarify authoritative work-completion signal | Open | Diagnostics |
| #160 | Allow local host-path projection without Hub | Open | Config |
| #147 | Admin settings lost on pod restart | Open | Config |
| #137 | Quickstart commands fail with 403 | Open | Onboarding |
| #125 | Claude harness trust dialog not pre-accepted | Closed | Harness |
| #108 | Hub marks running agents as stalled | Open | Diagnostics |
| #90 | Hosted grove discovery should recover IDs | Open | Diagnostics |
| #87 | Harness home dirs should exist before sync | Open | Harness |