Scion Hub Friction Log & Architectural Review

Author: Alan Blount (@zeroasterisk)
Date: June 2026
Context: Setting up Scion Hub on a personal NAS, with assistance from Antigravity (Gemini agent)

Executive Summary

Over the past several days, Alan and Antigravity spent multiple hours attempting to get a Scion Hub running on a personal NAS. This is despite:

Having a fully functional GCP account with active service accounts
Having prior experience with Scion on an already-configured hub (aiopm)
Having an AI coding agent (Antigravity) actively helping debug and configure

The setup was not successful. The friction documented below is not theoretical — it blocked a motivated, technically experienced user with working infrastructure from completing a deployment.

These findings are platform-wide architectural issues, not NAS-specific problems. The NAS deployment simply stripped away the enterprise scaffolding that masks the friction on managed infrastructure. Every issue here affects GCE, Kubernetes, and workstation deployments equally.

1. Configuration & Orchestration Friction

→ #511 — Unified configuration schema with hierarchical global defaults

What Went Poorly

Settings are fragmented across too many layers: host-side .env files, git-committed config/settings.yaml, container-level hub.env variables, and stateful database rows in hub.db.

The DB Trap: Configuring a "global" service account key across all projects originally required direct SQL database manipulation (hub.db) because there was no clean file-based config or CLI flag to set it globally.
Redundancy: Credentials and GCP settings have to be defined repeatedly on a per-project basis, violating DRY principles.
Fragility: Settings can be silently overwritten on deploy, lost on restart, or ignored when defaults aren't respected.

Proposals

(a) Unified Configuration Schema: Consolidate all configuration into a single, comprehensive schema-validated YAML file (e.g., settings.yaml). Provide a fully documented master template (settings.yaml.example) detailing every key so users never have to guess.
(b) Hierarchical Global Defaults: Introduce a global_defaults block so projects/agents inherit credentials and settings without redundant configuration:

server:
  global_defaults:
    auth:
      selected_type: vertex-ai
      service_account_key_path: /secrets/credentials/credentials.json
    gcp:
      project_id: my-project
      region: us-east5
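
To make the inheritance concrete, here is a minimal Python sketch of how a project's effective settings could be resolved from global_defaults. The function name resolve_settings and the example project override are hypothetical and not part of Scion; only the keys shown in the YAML above come from this proposal.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any, Mapping


def resolve_settings(global_defaults: Mapping[str, Any],
                     project_overrides: Mapping[str, Any]) -> dict[str, Any]:
    """Layer project-level keys on top of global defaults (hypothetical helper)."""
    merged = deepcopy(dict(global_defaults))
    for key, value in project_overrides.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Nested blocks are merged key by key, not replaced wholesale.
            merged[key] = resolve_settings(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


GLOBAL_DEFAULTS = {
    "auth": {
        "selected_type": "vertex-ai",
        "service_account_key_path": "/secrets/credentials/credentials.json",
    },
    "gcp": {"project_id": "my-project", "region": "us-east5"},
}

# A hypothetical project that overrides only the region and inherits the rest.
project = resolve_settings(GLOBAL_DEFAULTS, {"gcp": {"region": "europe-west4"}})
print(project["gcp"])
```

With this shape, a project that sets nothing inherits credentials and GCP settings unchanged, which is the redundancy the proposal aims to remove.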

Related upstream issues

#475 — gce-start-hub.sh --full overwrites settings.yaml, destroying plugin config
#147 — Admin settings saved via web UI are lost on pod restart in Kubernetes
#212 — default_harness_config not respected after init
#473 — Read/write project config as agent-accessible tool
#163 — Support file:// URI for task, matching system_prompt pattern
#160 — Allow local host-path projection into file secrets without a Hub
#201 — Hub Template importing for local grove has wrong default

2. Onboarding & Bootstrap Friction

→ #512 — Streamlined onboarding and first-run bootstrap experience

What Went Poorly

Getting Scion running for the first time requires high-friction CLI orchestration (manually cloning repos, building/pulling images, editing environment files, and seeding databases). The onboarding flow immediately exposes the user to abstract concepts of "projects" and "brokers," when they just want to spin up their first coding agent on their local files.

Even with a working GCP account, valid service account keys, and experience on an existing hub, the setup process took hours of debugging and was ultimately unsuccessful. The gap between "I have credentials" and "I have a running agent" is far too wide.

Proposal: Interactive CLI Onboarding (like `claude` CLI)

The best reference for what this should feel like is the Claude Code CLI onboarding flow — when you first run claude, it walks you through authentication, preferences, and workspace trust in an interactive TUI that feels lightweight and guided. Scion needs the same.

scion init — a guided CLI/TUI onboarding flow that:

Asks where your hub will run — local Docker, remote GCE VM, Kubernetes, NAS, etc. Tailors subsequent steps to the deployment target.
Walks through authentication — asks how you'll authenticate agents (Vertex AI, API key, OAuth, auth file). Accepts a service account JSON key path and validates it in real time.
Sets global defaults along the way — as the user answers questions, the wizard writes a complete settings.yaml with sensible defaults. No separate "edit this YAML" step.
Explains key concepts interactively — brief, contextual explanations of projects, brokers, agents, and the configuration hierarchy as they become relevant:

"A project groups related agents and their configuration. Most users start with one. You can add more later."
"A broker is the runtime that runs your agent containers. Your local Docker daemon is the default broker."
Validates the full stack before finishing — Docker accessible? Broker reachable? Auth working? Can we pull the agent image? Shows a green/red checklist with actionable next steps for failures.
Includes best practices — GCP access patterns, broker placement for local vs remote hardware.
Ends with a running agent — "Start your first agent? [Y/n]" → spins up a coordinator, proving setup works end-to-end.
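
The validation step in that flow could look roughly like the Python sketch below. It is an illustration under assumptions, not Scion code: CheckResult, check_service_account_key, check_docker_cli, and render_checklist are hypothetical names, and the key check is purely structural (it confirms the fields of a Google service account JSON key are present but does not contact GCP).

```python
import json
import shutil
from pathlib import Path
from typing import Iterable, NamedTuple

# Fields present in a Google service account JSON key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


class CheckResult(NamedTuple):
    name: str
    ok: bool
    hint: str  # actionable next step when the check fails


def check_service_account_key(path: str) -> CheckResult:
    """Structural check only; it does not authenticate against GCP."""
    name = "Service account key"
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError) as exc:
        return CheckResult(name, False, f"cannot read {path}: {exc}")
    if not isinstance(data, dict):
        return CheckResult(name, False, "key file is not a JSON object")
    missing = [field for field in REQUIRED_KEY_FIELDS if field not in data]
    if missing:
        return CheckResult(name, False, "missing fields: " + ", ".join(missing))
    if data["type"] != "service_account":
        return CheckResult(name, False, "'type' must be 'service_account'")
    return CheckResult(name, True, "")


def check_docker_cli() -> CheckResult:
    """Report whether a docker binary is discoverable on PATH."""
    found = shutil.which("docker")
    hint = "" if found else "docker not found on PATH; install it or configure its path"
    return CheckResult("Docker CLI", found is not None, hint)


def render_checklist(results: Iterable[CheckResult]) -> str:
    """Format results as a plain-text pass/fail checklist."""
    lines = []
    for result in results:
        mark = "OK  " if result.ok else "FAIL"
        suffix = f" -> {result.hint}" if result.hint else ""
        lines.append(f"[{mark}] {result.name}{suffix}")
    return "\n".join(lines)
```

A wizard would run each check, print the rendered checklist, and refuse to continue to "Start your first agent?" until every line is OK.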

Why CLI/TUI, not a web UI: it runs before the hub is up, meets users where they are (terminal), and can be re-run (scion init --reconfigure). The Claude CLI and gh auth login prove this pattern works.

Related upstream issues

#511 — Unified configuration schema (the wizard should write this config)
#245 — Not able to build images (setup tutorial)
#137 — build-images.sh quickstart commands fail with 403 on first run
#224 — Pull access denied on first run
#254 — gemini-cli → antigravity-cli (naming confusion during setup)
#182 — Docs for localhost registry

3. Agent Harness & Runtime Isolation Friction

→ #513 — Agent harnesses hang on interactive prompts in headless mode

What Went Poorly

When an agent starts up inside a Docker container, it uses an interactive harness (like Google Antigravity or Claude Code).

The Headless Hang: These CLIs are designed to run in a user-facing terminal and will prompt for confirmation (e.g., "Do you trust the files in this folder?", "Press Enter to authenticate"). Because they run headlessly in the background, they hang forever.
The Workspace Trust Catch-22: To bypass the trust prompt, we had to manually pre-seed config files inside the container:
```
// /home/scion/.gemini/antigravity-cli/settings.json
{ "trustedWorkspaces": ["/workspace", "/"] }

// cache/onboarding.json
{ "onboardingCompleted": true }
```
The broker has already explicitly authorized the project's workspace path — the harness should not ask the user to confirm what the system already trusts.
Non-Standard Path Resolution: Scion assumes a standard Linux environment (like GCE/Debian, with docker at /usr/bin/docker). That assumption breaks on appliance environments (Synology NAS, QNAP, TrueNAS), where binaries live in non-standard paths (e.g., /Volume1/@apps/DockerEngine/dockerd/bin/docker). Either the hub should not need the docker CLI binary inside its container just to make socket calls, or it should auto-detect the binary's path or accept a config override such as broker.docker_cli_path.
Out-of-Sync Images: The Hub requires the docker client to monitor containers. When image layers get out of sync, the hub silently fails to execute docker ps with "executable not found in $PATH" errors, leaving the user with a blank status line.
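
One way the path lookup could behave is sketched below in Python. This is an assumption about desirable behavior, not how Scion resolves the binary today: resolve_docker_cli is a hypothetical helper, override stands in for the suggested broker.docker_cli_path setting, and extra_dirs is a hypothetical list of appliance-specific directories to probe after PATH.

```python
import os
import shutil
from typing import Optional, Sequence


def resolve_docker_cli(override: Optional[str] = None,
                       extra_dirs: Sequence[str] = (),
                       binary: str = "docker") -> str:
    """Return an executable docker CLI path or raise with a clear message."""
    if override:
        # An explicit override wins, and a bad override fails loudly
        # instead of silently falling back to something else.
        if os.path.isfile(override) and os.access(override, os.X_OK):
            return override
        raise FileNotFoundError(f"configured docker CLI is not executable: {override}")

    found = shutil.which(binary)  # normal PATH lookup
    if found:
        return found

    if extra_dirs:
        # Probe appliance-specific directories that are not on PATH.
        found = shutil.which(binary, path=os.pathsep.join(extra_dirs))
        if found:
            return found

    raise FileNotFoundError(
        f"'{binary}' not found on PATH or in {list(extra_dirs)}; "
        "set an explicit override path"
    )
```

Raising a descriptive error at startup, rather than at the first docker ps call, would replace the blank status line with a message the operator can act on.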

Proposals

(e) Out-of-the-Box Headless Seeding: The template harnesses (gemini_cli.go, claude_code.go) must enforce a strict, guaranteed headless mode when run as a background agent. They should automatically pre-seed config directories (e.g., settings.json with trustedWorkspaces: ["/workspace"] and onboarding.json cache files) prior to starting the container, preventing interactive prompts from blocking startup. Upstream harnesses must automatically accept workspace trust when running inside a broker-controlled container, since the broker has already explicitly authorized the project's workspace path.
(g) Error-Resilient Web Terminals: If an agent container fails or waits on interactive approval, do not fail silently in a black box. The Hub UI should provide an option to attach an interactive web terminal (via docker exec -it or a terminal session attached to the inner tmux session) so the user can easily see errors, click through prompts, or type interactive commands directly.
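
For proposal (e), the pre-seeding step could be as small as the Python sketch below. It mirrors the two files that had to be created by hand. The helper name seed_headless_config is hypothetical, and placing cache/onboarding.json under the same antigravity-cli directory is an assumption, since the log records only the relative path.

```python
import json
from pathlib import Path


def seed_headless_config(home: str, workspace: str = "/workspace") -> Path:
    """Pre-seed harness config so no trust or onboarding prompt appears."""
    config_dir = Path(home) / ".gemini" / "antigravity-cli"
    settings_path = config_dir / "settings.json"
    # Assumption: the onboarding cache lives under the same config directory.
    onboarding_path = config_dir / "cache" / "onboarding.json"
    onboarding_path.parent.mkdir(parents=True, exist_ok=True)

    # Merge into existing settings instead of overwriting user choices.
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())
    trusted = settings.setdefault("trustedWorkspaces", [])
    if workspace not in trusted:
        trusted.append(workspace)

    settings_path.write_text(json.dumps(settings, indent=2))
    onboarding_path.write_text(json.dumps({"onboardingCompleted": True}))
    return settings_path
```

The function is idempotent, so a harness could call it on every container start without duplicating entries or discarding settings written by an earlier run.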

Related upstream issues

#125 — Claude harness: trust dialog not pre-accepted (closed — specific fix, general pattern remains)
#212 — default_harness_config not respected after init
#215 — Support non-interactive auth-file Kubernetes rounds
#87 — Harness home directories should exist before sync into non-root homes
#108 — Hub marks running agents as stalled/idle, loses phase on PTY disconnect
#165 — Control channel: agent create fails with websocket close 1009 (message too big)

4. Diagnostics & Debugging Friction

→ #514 — Integrated diagnostics dashboard and unified log aggregation

What Went Poorly

There is no unified logging aggregation. When an agent fails to start, the Hub UI doesn't display the stdout/stderr. To figure out why an agent is stuck, the operator must SSH into the host, run docker logs scion-hub, and then find a way to attach to the inner agent's tmux logs. During hours of debugging, this was the single biggest time sink — knowing something was wrong but having no visibility into what.

Telemetry Floods Mask Real Errors: If the Google Service Account is missing specific IAM permissions (like cloudtrace.traces.patch), the telemetry loop inside sciontool floods stdout with gRPC PermissionDenied errors. While it doesn't crash the broker, it makes standard logs unreadable, drowning out actual agent start errors. The bootstrapper should have a --no-telemetry flag or gracefully back off (disable trace exporting) if IAM permissions are denied, rather than retrying every second.
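
The "gracefully back off" behavior can be sketched as a thin wrapper around whatever function ships traces. This Python sketch is illustrative only: GuardedExporter is a hypothetical name, and Python's built-in PermissionError stands in for the gRPC PermissionDenied status that sciontool actually receives.

```python
from typing import Any, Callable


class GuardedExporter:
    """Disable trace export after the first permission failure."""

    def __init__(self, export: Callable[[Any], None],
                 log: Callable[[str], None] = print) -> None:
        self._export = export
        self._log = log
        self.enabled = True

    def export(self, batch: Any) -> bool:
        """Return True if the batch was exported, False if it was skipped."""
        if not self.enabled:
            return False  # stay quiet: the operator was already told once
        try:
            self._export(batch)
            return True
        except PermissionError as exc:
            # A missing IAM permission will not fix itself between retries,
            # so log a single actionable line and stop exporting.
            self.enabled = False
            self._log(
                f"telemetry disabled: {exc} "
                "(grant the missing permission and restart to re-enable)"
            )
            return False
```

The design choice is to treat a permission denial as a configuration problem rather than a transient error: one log line replaces a retry every second, and agent start errors stay visible.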

Proposals

(f) Integrated Web Log Viewer: Build a centralized "Diagnostics & Logs" dashboard inside the Hub UI. This panel should stream live, aggregated logs from the Hub server, the runtime broker, and active agent container outputs in a single, color-coded stream with built-in log level filters (INFO, WARN, ERROR).
Auto-Healing and Verification Checks: The make doctor utility or the Hub startup script should verify that the active container image is fully in-sync with the current Dockerfile dependencies. If a binary (like docker) is missing from the container, it should raise a clear warning in the Hub UI: "Hub image out-of-sync. Rebuild suggested: run 'make rebuild'."
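
The aggregation behind proposal (f) amounts to a timestamp-ordered merge with a level filter. The Python sketch below shows that core under assumptions: LogRecord and aggregate are hypothetical names, the sample messages are invented for illustration, and each source is assumed to deliver its own records already in time order.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple

LEVELS = {"INFO": 20, "WARN": 30, "ERROR": 40}


class LogRecord(NamedTuple):
    ts: float      # seconds since some common epoch
    source: str    # e.g. "hub", "broker", "agent"
    level: str
    message: str


def aggregate(streams: Iterable[Iterable[LogRecord]],
              min_level: str = "INFO") -> Iterator[LogRecord]:
    """Merge per-source streams (each sorted by ts) into one filtered stream."""
    threshold = LEVELS[min_level]
    merged = heapq.merge(*streams, key=lambda record: record.ts)
    for record in merged:
        # Records with an unknown level are kept so nothing is silently hidden.
        if LEVELS.get(record.level, threshold) >= threshold:
            yield record


# Invented sample data for illustration only.
hub = [LogRecord(1.0, "hub", "INFO", "hub started"),
       LogRecord(4.0, "hub", "ERROR", "agent start failed")]
broker = [LogRecord(2.0, "broker", "WARN", "docker CLI not found")]
agent = [LogRecord(3.0, "agent", "INFO", "waiting for workspace trust")]

for record in aggregate([hub, broker, agent], min_level="WARN"):
    print(f"{record.ts:>5.1f} {record.source:<6} {record.level:<5} {record.message}")
```

Because heapq.merge is lazy, the same function works on live iterators as well as lists, which is what a streaming dashboard would need.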

Related upstream issues

#174 — Blank Web UI in workstation mode
#108 — Hub marks running agents as stalled/idle, loses phase on PTY disconnect
#90 — Hosted grove discovery should recover grove IDs from workspace markers
#161 — Telemetry: clarify authoritative work-completion signal for external subscribers

5. Stateful Identity Drift on Restart

→ #515 — Colocated broker generates new ID on hub restart, orphaning all agents

What Went Poorly

When the scion-hub container restarted, the colocated runtime broker generated a fresh, random UUID and registered itself with the hub. But the SQLite database still had all projects and agents assigned to the old broker ID. Every subsequent API call failed with invalid signature / 401 broker_auth_failed.

The hub appeared running. The broker appeared running. But every agent operation silently failed with a cryptic auth error that gave no indication the root cause was a stale broker ID in the database.

Diagnosing and fixing this required direct SQL against hub.db:

-- Fix the colocated broker identity drift when the container restarts:
UPDATE projects SET default_runtime_broker_id = 'c65b1619-a9a3-49d6-8eb5-a0fbc5ea1e2d';
UPDATE agents SET runtime_broker_id = 'c65b1619-a9a3-49d6-8eb5-a0fbc5ea1e2d';

-- Force the agent container's environment to respect persistent mounting paths:
INSERT OR REPLACE INTO env_vars (project_id, name, value)
VALUES ('d65c34ba-033c-4e18-8e96-76e81384e3e9', 'HOME', '/home/scion');

The SQL Debugging Anti-Pattern

This is symptomatic of a broader pattern encountered throughout the setup: debugging and fixing Scion's operational state repeatedly required direct SQLite manipulation. An AI agent (Antigravity) was used to run diagnostic SQL, inspect database state, and apply fixes — work that should be handled by the platform's own operational tooling.

Every time an operator (or their AI agent) has to open the database to understand or fix Scion's state, it's a sign that the platform needs:

A scion doctor command that detects and surfaces these mismatches
Self-healing on startup (detect stale broker IDs, auto-update references)
Persistent broker identity (write ID to a file on first run, reuse on restart)
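
The self-healing item can be sketched against the same two columns the manual fix touched. The Python below is an illustration, not Scion's implementation: heal_stale_broker_refs is a hypothetical name, the way the hub learns which broker IDs are registered is left as a parameter, and the demo schema is a deliberately minimal stand-in for hub.db.

```python
import sqlite3
from typing import Iterable

# Table and column names are taken from the manual SQL fix above.
BROKER_REFERENCES = (
    ("projects", "default_runtime_broker_id"),
    ("agents", "runtime_broker_id"),
)


def heal_stale_broker_refs(conn: sqlite3.Connection,
                           colocated_id: str,
                           registered_ids: Iterable[str]) -> int:
    """Point references to unregistered brokers at the colocated broker."""
    known = sorted(set(registered_ids) | {colocated_id})
    placeholders = ", ".join("?" for _ in known)
    healed = 0
    with conn:  # one transaction: commit on success, roll back on error
        for table, column in BROKER_REFERENCES:
            cursor = conn.execute(
                f"UPDATE {table} SET {column} = ? "
                f"WHERE {column} IS NOT NULL AND {column} NOT IN ({placeholders})",
                [colocated_id, *known],
            )
            healed += cursor.rowcount
    return healed


# Demo against an in-memory database with a minimal, hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (id TEXT, default_runtime_broker_id TEXT)")
conn.execute("CREATE TABLE agents (id TEXT, runtime_broker_id TEXT)")
conn.execute("INSERT INTO projects VALUES ('p1', 'old-broker')")
conn.executemany("INSERT INTO agents VALUES (?, ?)",
                 [("a1", "old-broker"), ("a2", "remote-broker")])

healed = heal_stale_broker_refs(conn, "new-broker", registered_ids=["remote-broker"])
print(f"healed {healed} stale reference(s)")
```

Unlike the manual UPDATE statements, this leaves rows that point at a still-registered remote broker untouched, which matters once a hub has more than one broker.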

Proposals

Any one of the following would prevent the drift:

The colocated broker persists its broker ID across restarts (write to file, reuse on subsequent runs)
The hub auto-heals on startup: it detects stale broker references and updates them when the colocated broker re-registers
The colocated broker uses a well-known sentinel ID (e.g., colocated) instead of a random UUID
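
The first option, persisting the ID, takes very little code. The Python sketch below is a hedged illustration: load_or_create_broker_id is a hypothetical helper, and it helps only if the file lives on a volume that survives container restarts.

```python
import os
import tempfile
import uuid
from pathlib import Path


def load_or_create_broker_id(path: str) -> str:
    """Reuse the broker ID stored at path, creating it on first run."""
    id_file = Path(path)
    try:
        existing = id_file.read_text().strip()
        return str(uuid.UUID(existing))  # ValueError if the file is corrupt
    except (FileNotFoundError, ValueError):
        pass  # first run, or unreadable content: generate a fresh ID

    broker_id = str(uuid.uuid4())
    id_file.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file and rename so a crash never leaves a partial ID.
    fd, tmp_name = tempfile.mkstemp(dir=id_file.parent)
    with os.fdopen(fd, "w") as handle:
        handle.write(broker_id + "\n")
    os.replace(tmp_name, id_file)
    return broker_id
```

Calling this at broker startup instead of generating a UUID unconditionally would keep the database references valid across restarts, with no SQL required.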

The 401 broker_auth_failed error should include diagnostic context: "Broker ID on this agent does not match any registered broker. Did the hub restart? Run scion doctor to diagnose."

Related upstream issues

#511 — Unified configuration schema (broker ID persistence is a config concern)
#514 — Integrated diagnostics (this mismatch should be surfaced automatically)
#108 — Hub marks running agents as stalled/idle
#212 — default_harness_config not respected after init

All Referenced Upstream Issues

| # | Title | Status | Friction Area |
| --- | --- | --- | --- |
| #511 | Unified configuration schema with hierarchical global defaults | New | Config |
| #512 | Streamlined onboarding and first-run bootstrap experience | New | Onboarding |
| #513 | Agent harnesses hang on interactive prompts in headless mode | New | Harness |
| #514 | Integrated diagnostics dashboard and unified log aggregation | New | Diagnostics |
| #515 | Colocated broker generates new ID on restart, orphaning agents | New | Identity / Config |
| #475 | `gce-start-hub.sh --full` overwrites settings.yaml | Open | Config |
| #473 | Read/write project config as agent-accessible tool | Open | Config |
| #254 | gemini-cli → antigravity-cli | Open | Onboarding |
| #245 | Not able to build images (setup tutorial) | Open | Onboarding |
| #224 | Pull access denied on first run | Open | Onboarding |
| #215 | Support non-interactive auth-file Kubernetes rounds | Open | Harness |
| #212 | `default_harness_config` not respected after init | Open | Config / Harness |
| #201 | Hub template importing has wrong default | Open | Config |
| #182 | Docs for localhost registry | Open | Onboarding |
| #174 | Blank Web UI in workstation mode | Open | Diagnostics |
| #165 | WebSocket close 1009 (message too big) on agent create | Open | Harness |
| #163 | Support `file://` URI for task config | Open | Config |
| #161 | Clarify authoritative work-completion signal | Open | Diagnostics |
| #160 | Allow local host-path projection without Hub | Open | Config |
| #147 | Admin settings lost on pod restart | Open | Config |
| #137 | Quickstart commands fail with 403 | Open | Onboarding |
| #125 | Claude harness trust dialog not pre-accepted | Closed | Harness |
| #108 | Hub marks running agents as stalled | Open | Diagnostics |
| #90 | Hosted grove discovery should recover IDs | Open | Diagnostics |
| #87 | Harness home dirs should exist before sync | Open | Harness |
