Scion Hub Friction Log & Architectural Review
Author: Alan Blount (@zeroasterisk)
Date: June 2026
Context: Setting up Scion Hub on a personal NAS, with assistance from Antigravity (Gemini agent)
Over the past several days, Alan and Antigravity spent multiple hours attempting to get a Scion Hub running on a personal NAS. This is despite:
- Having a fully functional GCP account with active service accounts
- Having prior experience with Scion on an already-configured hub (aiopm)
- Having an AI coding agent (Antigravity) actively helping debug and configure
The setup was not successful. The friction documented below is not theoretical — it blocked a motivated, technically experienced user with working infrastructure from completing a deployment.
These findings are platform-wide architectural issues, not NAS-specific problems. The NAS deployment simply stripped away the enterprise scaffolding that masks the friction on managed infrastructure. Every issue here affects GCE, Kubernetes, and workstation deployments equally.
→ #511 — Unified configuration schema with hierarchical global defaults
Settings are fragmented across too many layers: host-side .env files, git-committed config/settings.yaml, container-level hub.env variables, and stateful database rows in hub.db.
- The DB Trap: Configuring a "global" service account key across all projects originally required direct SQL database manipulation (hub.db) because there was no clean file-based config or CLI flag to set it globally.
- Redundancy: Credentials and GCP settings have to be defined repeatedly on a per-project basis, violating DRY principles.
- Fragility: Settings can be silently overwritten on deploy, lost on restart, or ignored when defaults aren't respected.
- (a) Unified Configuration Schema: Consolidate all configuration into a single, comprehensive schema-validated YAML file (e.g., settings.yaml). Provide a fully documented master template (settings.yaml.example) detailing every key so users never have to guess.
- (b) Hierarchical Global Defaults: Introduce a global_defaults block so projects/agents inherit credentials and settings without redundant configuration:

    server:
      global_defaults:
        auth:
          selected_type: vertex-ai
          service_account_key_path: /secrets/credentials/credentials.json
        gcp:
          project_id: my-project
          region: us-east5

- #475 — gce-start-hub.sh --full overwrites settings.yaml, destroying plugin config
- #147 — Admin settings saved via web UI are lost on pod restart in Kubernetes
- #212 — default_harness_config not respected after init
- #473 — Read/write project config as agent-accessible tool
- #163 — Support file:// URI for task, matching system_prompt pattern
- #160 — Allow local host-path projection into file secrets without a Hub
- #201 — Hub Template importing for local grove has wrong default
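
The inheritance proposed in (b) can be sketched as a recursive merge in which project-level values win over global_defaults. This is a minimal illustration in Python, not Scion's implementation: the function name resolve_settings and the project override shown are hypothetical, and only the default values come from the example above.

```python
from copy import deepcopy


def resolve_settings(global_defaults: dict, project_overrides: dict) -> dict:
    """Return project settings layered on top of global defaults.

    Nested mappings are merged key by key; any other value in
    project_overrides replaces the default outright.
    """
    merged = deepcopy(global_defaults)
    for key, value in project_overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = resolve_settings(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Defaults mirror the global_defaults example; the override is hypothetical.
GLOBAL_DEFAULTS = {
    "auth": {
        "selected_type": "vertex-ai",
        "service_account_key_path": "/secrets/credentials/credentials.json",
    },
    "gcp": {"project_id": "my-project", "region": "us-east5"},
}

project = resolve_settings(GLOBAL_DEFAULTS, {"gcp": {"region": "us-central1"}})
```

Here the project changes only its region and inherits the credentials and project ID, which is the redundancy the global_defaults block is meant to remove.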
→ #512 — Streamlined onboarding and first-run bootstrap experience
Getting Scion running for the first time requires high-friction CLI orchestration (manually cloning repos, building/pulling images, editing environment files, and seeding databases). The onboarding flow immediately exposes the user to abstract concepts of "projects" and "brokers," when they just want to spin up their first coding agent on their local files.
Even with a working GCP account, valid service account keys, and experience on an existing hub, the setup process took hours of debugging and was ultimately unsuccessful. The gap between "I have credentials" and "I have a running agent" is far too wide.
The best reference for what this should feel like is the Claude Code CLI onboarding flow — when you first run claude, it walks you through authentication, preferences, and workspace trust in an interactive TUI that feels lightweight and guided. Scion needs the same.
scion init — a guided CLI/TUI onboarding flow that:
- Asks where your hub will run — local Docker, remote GCE VM, Kubernetes, NAS, etc. Tailors subsequent steps to the deployment target.
- Walks through authentication — asks how you'll authenticate agents (Vertex AI, API key, OAuth, auth file). Accepts a service account JSON key path and validates it in real time.
- Sets global defaults along the way — as the user answers questions, the wizard writes a complete settings.yaml with sensible defaults. No separate "edit this YAML" step.
- Explains key concepts interactively — brief, contextual explanations of projects, brokers, agents, and the configuration hierarchy as they become relevant:
  "A project groups related agents and their configuration. Most users start with one. You can add more later."
  "A broker is the runtime that runs your agent containers. Your local Docker daemon is the default broker."
- Validates the full stack before finishing — Docker accessible? Broker reachable? Auth working? Can we pull the agent image? Shows a green/red checklist with actionable next steps for failures.
- Includes best practices — GCP access patterns, broker placement for local vs remote hardware.
- Ends with a running agent — "Start your first agent? [Y/n]" → spins up a coordinator, proving setup works end-to-end.
Why CLI/TUI, not a web UI: it runs before the hub is up, meets users where they are (terminal), and can be re-run (scion init --reconfigure). The Claude CLI and gh auth login prove this pattern works.
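
The "validate the full stack" step could be structured as a list of independent probes, each paired with an actionable hint. The sketch below is an assumption about shape only: Check, run_checks, and docker_on_path are hypothetical names, and a real wizard would add probes for broker reachability, auth, and image pulls.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]  # returns True when the check passes
    hint: str                  # shown only when the check fails


def run_checks(checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check and return (all_passed, printable report lines)."""
    lines = []
    all_passed = True
    for check in checks:
        try:
            ok = bool(check.probe())
        except Exception:
            ok = False  # a crashing probe counts as a failure, not an abort
        all_passed = all_passed and ok
        lines.append(f"[{'OK' if ok else 'FAIL'}] {check.name}")
        if not ok:
            lines.append(f"  next step: {check.hint}")
    return all_passed, lines


def docker_on_path() -> bool:
    # Only confirms a docker binary is discoverable, not that the daemon responds.
    return shutil.which("docker") is not None
```

Running every probe even after a failure matters here: the user sees the whole checklist at once instead of fixing one problem per run.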
- #511 — Unified configuration schema (the wizard should write this config)
- #245 — Not able to build images (setup tutorial)
- #137 — build-images.sh quickstart commands fail with 403 on first run
- #224 — Pull access denied on first run
- #254 — gemini-cli → antigravity-cli (naming confusion during setup)
- #182 — Docs for localhost registry
→ #513 — Agent harnesses hang on interactive prompts in headless mode
When an agent starts up inside a Docker container, it uses an interactive harness (like Google Antigravity or Claude Code).
- The Headless Hang: These CLIs are designed to run in a user-facing terminal and will prompt for confirmation (e.g., "Do you trust the files in this folder?", "Press Enter to authenticate"). Because they run headlessly in the background, they hang forever.
- The Workspace Trust Catch-22: To bypass the trust prompt, we had to manually pre-seed config files inside the container:

      // /home/scion/.gemini/antigravity-cli/settings.json
      { "trustedWorkspaces": ["/workspace", "/"] }

      // cache/onboarding.json
      { "onboardingCompleted": true }

  The broker has already explicitly authorized the project's workspace path, so the harness should not ask the user to confirm what the system already trusts.
- Non-Standard Path Resolution: Scion assumes a standard Linux environment (like GCE/Debian with docker in /usr/bin/docker), which breaks on appliance environments (Synology NAS, QNAP, TrueNAS) where binaries live in non-standard paths (e.g., /Volume1/@apps/DockerEngine/dockerd/bin/docker). The hub shouldn't require the docker CLI binary inside its container just for socket calls, or it should auto-detect the path / allow a config override like broker.docker_cli_path.
- Out-of-Sync Images: The Hub requires the docker client to monitor containers. When image layers get out of sync, the hub silently fails to execute docker ps with "executable not found in $PATH" errors, leaving the user with a blank status line.
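
The auto-detect and override behavior suggested for non-standard paths might look like the following. It is a sketch under assumptions: resolve_docker_cli is a hypothetical helper, the override argument stands in for a setting such as broker.docker_cli_path, and the lookup order is one reasonable choice rather than Scion's actual behavior.

```python
import os
import shutil
from typing import Iterable, Optional


def resolve_docker_cli(
    override: Optional[str] = None,
    extra_candidates: Iterable[str] = (),
) -> Optional[str]:
    """Return a path to an executable docker CLI, or None if none is found.

    Order: explicit override, then $PATH, then known appliance locations.
    """
    def usable(path: str) -> bool:
        return os.path.isfile(path) and os.access(path, os.X_OK)

    if override:
        # An explicit override is authoritative: never silently fall back.
        return override if usable(override) else None
    found = shutil.which("docker")
    if found:
        return found
    for candidate in extra_candidates:
        if usable(candidate):
            return candidate
    return None
```

A caller could pass appliance locations such as /Volume1/@apps/DockerEngine/dockerd/bin/docker as extra_candidates and report a clear error when the result is None, instead of the blank status line described above.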
- (e) Out-of-the-Box Headless Seeding: The template harnesses (gemini_cli.go, claude_code.go) must enforce a strict, guaranteed headless mode when run as a background agent. They should automatically pre-seed config directories (e.g., settings.json with trustedWorkspaces: ["/workspace"] and onboarding.json cache files) prior to starting the container, preventing interactive prompts from blocking startup. Upstream harnesses must automatically accept workspace trust when running inside a broker-controlled container, since the broker has already explicitly authorized the project's workspace path.
- (g) Error-Resilient Web Terminals: If an agent container fails or waits on interactive approval, do not fail silently in a black box. The Hub UI should provide an option to attach an interactive web terminal (via docker exec -it or a terminal session attached to the inner tmux session) so the user can easily see errors, click through prompts, or type interactive commands directly.
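
The pre-seeding in (e) amounts to writing two small JSON files before the harness starts. The sketch below illustrates this in Python rather than in the Go harness code; seed_headless_config is a hypothetical name, and it assumes cache/onboarding.json lives under the same harness directory as settings.json, which the manual workaround does not state explicitly.

```python
import json
from pathlib import Path


def seed_headless_config(harness_dir: Path, workspace: str = "/workspace") -> None:
    """Write trust and onboarding files so the harness starts without prompting."""
    settings_path = harness_dir / "settings.json"
    onboarding_path = harness_dir / "cache" / "onboarding.json"  # assumed location
    onboarding_path.parent.mkdir(parents=True, exist_ok=True)

    # Preserve existing settings; only add the workspace to the trusted list.
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())
    trusted = settings.setdefault("trustedWorkspaces", [])
    if workspace not in trusted:
        trusted.append(workspace)
    settings_path.write_text(json.dumps(settings, indent=2))

    onboarding_path.write_text(json.dumps({"onboardingCompleted": True}))
```

The function is idempotent, so it can run on every agent start. It trusts only the workspace path rather than "/", which is narrower than the manual workaround.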
- #125 — Claude harness: trust dialog not pre-accepted (closed — specific fix, general pattern remains)
- #212 — default_harness_config not respected after init
- #215 — Support non-interactive auth-file Kubernetes rounds
- #87 — Harness home directories should exist before sync into non-root homes
- #108 — Hub marks running agents as stalled/idle, loses phase on PTY disconnect
- #165 — Control channel: agent create fails with websocket close 1009 (message too big)
→ #514 — Integrated diagnostics dashboard and unified log aggregation
There is no unified logging aggregation. When an agent fails to start, the Hub UI doesn't display the stdout/stderr. To figure out why an agent is stuck, the operator must SSH into the host, run docker logs scion-hub, and then find a way to attach to the inner agent's tmux logs. During hours of debugging, this was the single biggest time sink — knowing something was wrong but having no visibility into what.
- Telemetry Floods Mask Real Errors: If the Google Service Account is missing specific IAM permissions (like cloudtrace.traces.patch), the telemetry loop inside sciontool floods stdout with gRPC PermissionDenied errors. While it doesn't crash the broker, it makes standard logs unreadable, drowning out actual agent start errors. The bootstrapper should have a --no-telemetry flag or gracefully back off (disable trace exporting) if IAM permissions are denied, rather than retrying every second.
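
The graceful back-off could be as simple as a wrapper that stops exporting after a few consecutive denials. This is a sketch, not sciontool's code: GuardedExporter is a hypothetical class, and Python's built-in PermissionError stands in for the gRPC PermissionDenied status.

```python
from typing import Callable


class GuardedExporter:
    """Wrap a trace export call and stop exporting after repeated permission errors."""

    def __init__(self, export: Callable[[object], None], max_denials: int = 3):
        self._export = export
        self._max_denials = max_denials
        self._denials = 0
        self.enabled = True

    def send(self, batch: object) -> bool:
        """Return True if the batch was exported, False if skipped or denied."""
        if not self.enabled:
            return False
        try:
            self._export(batch)
        except PermissionError:
            self._denials += 1
            if self._denials >= self._max_denials:
                # Log once and go quiet instead of flooding stdout.
                self.enabled = False
                print("telemetry disabled: exporter lacks IAM permission")
            return False
        self._denials = 0  # a success resets the consecutive-denial count
        return True
```

With this shape the operator sees one line explaining that telemetry is off, and agent start errors remain visible in the log.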
- (f) Integrated Web Log Viewer: Build a centralized "Diagnostics & Logs" dashboard inside the Hub UI. This panel should stream live, aggregated logs from the Hub server, the runtime broker, and active agent container outputs in a single, color-coded stream with built-in log level filters (INFO, WARN, ERROR).
- Auto-Healing and Verification Checks: The make doctor utility or the Hub startup script should verify that the active container image is fully in-sync with the current Dockerfile dependencies. If a binary (like docker) is missing from the container, it should raise a clear warning in the Hub UI: "Hub image out-of-sync. Rebuild suggested: run 'make rebuild'."
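
The core of the log viewer in (f) is merging several time-ordered streams and filtering by level. A minimal sketch of that step follows, assuming each source already yields lines in timestamp order; LogLine, LEVELS, and aggregate are hypothetical names, and transport to the browser is out of scope.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple

LEVELS = {"INFO": 20, "WARN": 30, "ERROR": 40}


class LogLine(NamedTuple):
    timestamp: float  # seconds since epoch
    source: str       # e.g. "hub", "broker", or an agent name
    level: str        # one of the keys in LEVELS
    message: str


def aggregate(
    streams: Iterable[Iterable[LogLine]], min_level: str = "INFO"
) -> Iterator[LogLine]:
    """Merge per-source streams (each already time-ordered) into one filtered stream."""
    threshold = LEVELS[min_level]
    merged = heapq.merge(*streams, key=lambda line: line.timestamp)
    return (line for line in merged if LEVELS[line.level] >= threshold)
```

Because heapq.merge is lazy, the same function works for live generators as well as for lists, and the source field gives the UI what it needs for color coding.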
- #174 — Blank Web UI in workstation mode
- #108 — Hub marks running agents as stalled/idle, loses phase on PTY disconnect
- #90 — Hosted grove discovery should recover grove IDs from workspace markers
- #161 — Telemetry: clarify authoritative work-completion signal for external subscribers
→ #515 — Colocated broker generates new ID on hub restart, orphaning all agents
When the scion-hub container restarted, the colocated runtime broker generated a fresh, random UUID and registered itself with the hub. But the SQLite database still had all projects and agents assigned to the old broker ID. Every subsequent API call failed with invalid signature / 401 broker_auth_failed.
The hub appeared running. The broker appeared running. But every agent operation silently failed with a cryptic auth error that gave no indication the root cause was a stale broker ID in the database.
Diagnosing and fixing this required direct SQL against hub.db:

    -- Fix the colocated broker identity drift when the container restarts:
    UPDATE projects SET default_runtime_broker_id = 'c65b1619-a9a3-49d6-8eb5-a0fbc5ea1e2d';
    UPDATE agents SET runtime_broker_id = 'c65b1619-a9a3-49d6-8eb5-a0fbc5ea1e2d';

    -- Force the agent container's environment to respect persistent mounting paths:
    INSERT OR REPLACE INTO env_vars (project_id, name, value)
    VALUES ('d65c34ba-033c-4e18-8e96-76e81384e3e9', 'HOME', '/home/scion');

This is symptomatic of a broader pattern encountered throughout the setup: debugging and fixing Scion's operational state repeatedly required direct SQLite manipulation. An AI agent (Antigravity) was used to run diagnostic SQL, inspect database state, and apply fixes, work that should be handled by the platform's own operational tooling.
Every time an operator (or their AI agent) has to open the database to understand or fix Scion's state, it's a sign that the platform needs:
- A scion doctor command that detects and surfaces these mismatches
- Self-healing on startup (detect stale broker IDs, auto-update references)
- Persistent broker identity (write ID to a file on first run, reuse on restart)
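
The mismatch detection a scion doctor command would need can be expressed as a read-only query over the same columns the manual fix touched. In this sketch the table and column names come from that SQL, while find_stale_broker_refs and the way registered broker IDs are supplied are assumptions.

```python
import sqlite3
from typing import Dict, List, Set


def find_stale_broker_refs(
    conn: sqlite3.Connection, registered: Set[str]
) -> Dict[str, List[str]]:
    """Return broker IDs referenced by agents/projects that no live broker owns."""
    queries = {
        "agents": "SELECT DISTINCT runtime_broker_id FROM agents",
        "projects": "SELECT DISTINCT default_runtime_broker_id FROM projects",
    }
    stale = {}
    for table, sql in queries.items():
        ids = [row[0] for row in conn.execute(sql) if row[0] is not None]
        missing = sorted(set(ids) - registered)
        if missing:
            stale[table] = missing
    return stale
```

An empty result means every reference points at a registered broker. A non-empty one names the orphaned IDs per table, which is the information the 401 error currently hides.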
Any one of the following would close the gap:
- The colocated broker persists its broker ID across restarts (write to file, reuse on subsequent runs)
- The hub auto-heals on startup: it detects stale broker references and updates them when the colocated broker re-registers
- The colocated broker uses a well-known sentinel ID (e.g., colocated) instead of a random UUID
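
The first option, persisting the ID, needs very little code. A minimal sketch, assuming the ID file sits on a volume that survives container restarts; load_or_create_broker_id and the file location are hypothetical.

```python
import uuid
from pathlib import Path


def load_or_create_broker_id(id_file: Path) -> str:
    """Return the broker ID stored in id_file, creating it on first run."""
    if id_file.exists():
        stored = id_file.read_text().strip()
        if stored:
            return stored
    broker_id = str(uuid.uuid4())
    id_file.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file and rename so a crash cannot leave a half-written ID.
    tmp = id_file.with_name(id_file.name + ".tmp")
    tmp.write_text(broker_id + "\n")
    tmp.replace(id_file)
    return broker_id
```

The broker would call this once at startup and register with the returned value, so the rows in hub.db keep pointing at a live broker after a restart.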
The 401 broker_auth_failed error should include diagnostic context: "Broker ID on this agent does not match any registered broker. Did the hub restart? Run scion doctor to diagnose."
- #511 — Unified configuration schema (broker ID persistence is a config concern)
- #514 — Integrated diagnostics (this mismatch should be surfaced automatically)
- #108 — Hub marks running agents as stalled/idle
- #212 — default_harness_config not respected after init
| # | Title | Status | Friction Area |
|---|---|---|---|
| #511 | Unified configuration schema with hierarchical global defaults | New | Config |
| #512 | Streamlined onboarding and first-run bootstrap experience | New | Onboarding |
| #513 | Agent harnesses hang on interactive prompts in headless mode | New | Harness |
| #514 | Integrated diagnostics dashboard and unified log aggregation | New | Diagnostics |
| #515 | Colocated broker generates new ID on restart, orphaning agents | New | Identity / Config |
| #475 | gce-start-hub.sh --full overwrites settings.yaml | Open | Config |
| #473 | Read/write project config as agent-accessible tool | Open | Config |
| #254 | gemini-cli → antigravity-cli | Open | Onboarding |
| #245 | Not able to build images (setup tutorial) | Open | Onboarding |
| #224 | Pull access denied on first run | Open | Onboarding |
| #215 | Support non-interactive auth-file Kubernetes rounds | Open | Harness |
| #212 | default_harness_config not respected after init | Open | Config / Harness |
| #201 | Hub template importing has wrong default | Open | Config |
| #182 | Docs for localhost registry | Open | Onboarding |
| #174 | Blank Web UI in workstation mode | Open | Diagnostics |
| #165 | WebSocket close 1009 (message too big) on agent create | Open | Harness |
| #163 | Support file:// URI for task config | Open | Config |
| #161 | Clarify authoritative work-completion signal | Open | Diagnostics |
| #160 | Allow local host-path projection without Hub | Open | Config |
| #147 | Admin settings lost on pod restart | Open | Config |
| #137 | Quickstart commands fail with 403 | Open | Onboarding |
| #125 | Claude harness trust dialog not pre-accepted | Closed | Harness |
| #108 | Hub marks running agents as stalled | Open | Diagnostics |
| #90 | Hosted grove discovery should recover IDs | Open | Diagnostics |
| #87 | Harness home dirs should exist before sync | Open | Harness |