Agents can already run code locally. What they still can’t do, safely and ergonomically, is show their work and let users interact with it.
In practice, today’s approaches force a tradeoff (see “A field guide to sandboxes for AI”):
- Secure execution, weak UX: containers, MicroVMs, and Wasm sandboxes are great for isolating compute, but they’re often headless in practice: the user gets logs, diffs, and text.
- Great UX, weak sandbox boundary: browsers/Electron/WebViews can render rich UI, but the moment you need “agent things” (filesystem writes, repo access, shell commands), you either punch holes in the runtime or build large, bespoke bridges.
We want a local runtime that treats interactive UI as a first-class (sandboxable) output channel for agent execution.
We’re building a new kind of runtime sandbox: a Glass Box.
Not “code in, text out”, but “code in, UI out” — where the agent can generate a small web app to visualize results and offer safe, reviewable actions (buttons, forms, previews).
This is not “a browser tab.” It’s a display server for agentic workflows—effectively a Programmable Viewport for AI agents—giving agents a standard, lightweight way to request a window and draw.
This design makes an explicit choice about:
- Boundary: a separate Servo process (and eventually OS-native sandboxing around script execution) and a host process running the agent scaffold.
- Policy: lock down the Web environment and allow only a narrow host-mediated request channel.
- Lifecycle: a shared CRDT document (Automerge) carries all communication between Servo and the host agent scaffold (see the sketch below).
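To make this concrete, the shared document might be shaped roughly like the following TypeScript types. The field names are illustrative, not a finalized schema:

```typescript
// Hypothetical shape of the shared Automerge document.
// A small, fixed "protocol" area for host<->Servo communication,
// plus a free-form area the agent can shape per workflow.
interface GlassBoxDoc {
  protocol: {
    // Host -> Servo: the web app to execute (HTML/CSS/JS as text).
    app?: { html: string };
    // Servo -> Host: structured requests needing privileged handling.
    requests: HostRequest[];
    // Host -> Servo: responses matched to requests by id.
    responses: Record<string, unknown>;
  };
  // Workflow-defined state: the agent invents its own schema here.
  appState: Record<string, unknown>;
}

interface HostRequest {
  id: string;
  kind: "inference" | "read-file" | "apply-patch"; // illustrative kinds
  payload: unknown;
}
```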
A simple demo that maps directly to Servo’s strengths and weaknesses:
- Load a real web page in Servo (via a host-mediated capability).
- Extract the page’s DOM/CSS (or a targeted slice) and send it to the agent.
- Have the agent write a transformation script to update styling toward CSS features that Servo supports well.
- Run the script against the live page and render the updated result in the Glass Box for visual review (the agent could add a feedback form to the page).
This is the end-to-end loop: inspect → propose changes → apply → visually verify, without turning the agent experience into “diffs only.”
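The transformation script in the third step is ordinary DOM code. A minimal sketch of what the agent might emit, assuming (purely as an illustration) a float-based page using a `.row` class, migrated to flexbox, which Servo’s layout engine supports well:

```typescript
// Illustrative agent-generated script: rewrite float-based layout
// to flexbox. Runs inside the Glass Box against the live document.
function migrateFloatsToFlex(root: ParentNode = document): void {
  root.querySelectorAll<HTMLElement>(".row").forEach((row) => {
    row.style.display = "flex";
    row.style.gap = "1rem";
    row.querySelectorAll<HTMLElement>(":scope > *").forEach((col) => {
      col.style.cssFloat = "none"; // drop the legacy float
      col.style.flex = "1";        // share the row's width evenly
    });
  });
}

migrateFloatsToFlex();
```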
A more complex demo that stress-tests the full architecture (UI generation + interaction + iterative inference):
- The Brain (main agent) defines a small standard widget set (e.g., multiple-choice, code snippet, fill-in-the-blank, “explain why”) and generates the initial app shell.
- The app runs in Servo and renders the quiz UI.
- For each question, the app requests a sub-agent inference call to generate:
  - the content of the next question, and
  - which widget to use (selected from the standard widget set).
- The user answers; the app writes the answer to shared state.
- The answer history is added to the sub-agent context and used to adapt difficulty and topic selection for the next question.
This demo is intentionally “agentic”: the UI is not a static form. It evolves turn by turn while keeping the trust boundary crisp: Servo renders and handles interaction; inference and tooling remain host-mediated.
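To make the flow concrete, here is a minimal sketch of the per-question request from inside the quiz app. The `shared` handle, its `change`/`subscribe` methods, and the document fields are hypothetical stand-ins for whatever binding the host injects over the Automerge document:

```typescript
// Hypothetical in-sandbox binding over the shared Automerge document.
declare const shared: {
  change(fn: (doc: QuizDoc) => void): void;
  subscribe(fn: (doc: QuizDoc) => void): void;
};

interface QuizDoc {
  protocol: {
    requests: { id: string; kind: string; payload: unknown }[];
    responses: Record<string, { question: string; widget: string }>;
  };
  appState: { answers: { questionId: string; answer: string }[] };
}

function requestNextQuestion(): void {
  const id = crypto.randomUUID();
  // Write an inference request; the host picks it up, runs the sub-agent,
  // and passes the answer history along as context.
  shared.change((doc) => {
    doc.protocol.requests.push({
      id,
      kind: "inference",
      payload: { task: "next-question", history: doc.appState.answers },
    });
  });
  // Render whenever the host writes the response back into shared state.
  shared.subscribe((doc) => {
    const res = doc.protocol.responses[id];
    if (res) renderWidget(res.widget, res.question); // app-defined renderer
  });
}

declare function renderWidget(widget: string, question: string): void;
```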
The key design choice is separation of concerns:
- Brain (agent loop): runs in the host environment (initially VS Code + GenAIScript + Copilot), with tool access and user permissions. Note that actual LLM inference could be either local or in the cloud; the POC will use GitHub Copilot, so inference will be in the cloud.
- Hands/Face (sandbox UI): runs in Servo, executing agent-generated HTML/CSS/JS and rendering the interactive view.
Instead of inventing a huge proprietary “Agent API” inside JavaScript, or wiring up complex RPC calls and bespoke bindings, we synchronize a shared Automerge document. The shared state is the API.
In particular, this state does not have to live entirely in the agent’s context window.
- High throughput, low token pressure: Automerge handles the “messy” synchronization with the UI process. The shared document carries commands across the process boundary—agent → Servo (“execute this web app”, “read app state”) and Servo → agent (“run inference”, “here are the results”). The Brain only needs to read/write small, structured deltas.
- Flexible API: One portion of the document is reserved for a small, fixed set of communication fields between Servo and the host. Outside that, the workflow can define its own schema. In other words, the agent can invent a task-specific “API” for the app it generated.
- Reducing prompt-injection exposure: Untrusted interaction output can stay in the shared document and be transformed/validated in code, so the high-privilege Brain doesn’t have to ingest raw content to operate on it.
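On the host side, this can be plain Automerge. A minimal sketch using the `@automerge/automerge` API, with the document shape from earlier and transport details (handled by Automerge’s sync protocol) omitted:

```typescript
import * as Automerge from "@automerge/automerge";

type Doc = {
  protocol: {
    app?: { html: string };
    requests: { id: string; kind: string; payload: unknown }[];
    responses: Record<string, unknown>;
  };
  appState: Record<string, unknown>;
};

// Host creates the document and issues a command by mutating shared state.
let doc = Automerge.change(Automerge.init<Doc>(), (d) => {
  d.protocol = { requests: [], responses: {} };
  d.appState = {};
  d.protocol.app = { html: "<main>…agent-generated app…</main>" };
});

// Later: merge whatever the Servo side wrote (answers, requests, results).
// `fromServo` would arrive via the sync transport; merging is conflict-free.
declare const fromServo: Automerge.Doc<Doc>;
doc = Automerge.merge(doc, fromServo);

// The Brain only reads small structured deltas, not the whole UI.
for (const req of doc.protocol.requests) {
  // dispatch to policy-checked handlers...
}
```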
In conclusion: The agent writes standard HTML/CSS/JS for 90% of the work (the UI). For side effects, it uses the shared state as a communication device. This keeps the trust boundary crisp: Servo renders and computes; the host performs privileged actions. This solves the “black box” problem by providing a dedicated channel for visual feedback without polluting the primary inference stream.
We use Servo, a modern web engine with a Rust codebase, to hit the sweet spot: lighter than Electron, richer than Wasm.
- Memory safety where it matters: Rust reduces broad classes of memory-unsafe bugs in the engine implementation. (Servo’s JS engine is SpiderMonkey, which is C++; this is why we still design for defense-in-depth.)
- Embeddable + adaptable: Servo can run as a separate native window for strong process isolation today, and can be evolved toward tighter integrations later (e.g., rendering to a texture/pixel buffer).
- Fast startup + good density: more “spawn a UI per task” than “boot a whole container per task”.
- Easy-to-adapt codebase: we can adapt Servo for required features (like metering of memory and execution time) and also slim it down (removing unnecessary Web features like iframes).
Automerge isn’t just IPC; it’s a collaboration primitive.
- Multiple local components can join the same document.
- A remote (cloud) agent can join too, enabling hybrid systems: cloud agent loop + local sandboxed UI + local file operations.
The initial POC is intentionally developer-native:
- Host: VS Code extension + GenAIScript
- Inference: GitHub Copilot
- Sandbox UI: Servo in a separate OS window with Servo’s default capabilities (no strict OS sandboxing in the initial POC).
This lets us prototype quickly in the environment where coding agents already live, without forcing new infra on day one.
This is a “fat” runtime compared to Wasm-only sandboxes, so we rely on a defense-in-depth strategy:
- Blast Radius Containment: Rendering untrusted inputs (PDFs, logs, odd formats) happens in the Glass Box, isolating the parsing complexity from the main agent. If the renderer is compromised, the high-privilege Brain remains untouched. For inference over untrusted content, we can spin up a tool-less sub-agent that only sees the data and writes results back to UI state—so even if the content contains prompt injection, it has no keys, files, or terminal access to exploit.
- OS-Native Isolation (Roadmap): The POC runs Servo in a separate process. A later production hardened version will wrap this process in OS-native sandboxing (Linux Landlock / macOS Seatbelt) to enforce strict syscall policies.
- Capability minimization (design goal): The sandbox UI should not directly perform privileged effects. Instead it emits structured requests to the host, which enforces user consent and policy. This means that network requests will have to be either mediated or pre-approved by the host.
- DoS-aware: Treat untrusted UI/code as potentially abusive (CPU/memory/time limits, crash containment). Servo already includes a background hang monitor that can suspend long-running JS.
- Slim down the Web: We can remove legacy features that are not useful in an agent context, for example: iframes.
Concrete defaults:
- Default-deny data access: the Servo side should not have ambient access to the repo, home directory, credentials, or internal networks.
- Narrow, typed requests: the UI can request actions (e.g., “apply this patch”, “write this file”), but the host is the only component that can touch the real machine (see the sketch after this list).
- Explicit lifecycle: the shared document persists UI intent and results; the sandbox UI stays disposable.
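As an illustration of those defaults, a narrow, typed request channel on the host side might look like this; the request kinds and the helper functions (`confirmWithUser`, `performFileEffect`, `isAllowlisted`) are hypothetical:

```typescript
// Illustrative typed request channel: the sandbox can only *ask*;
// the host validates, gets consent, and performs the effect.
type UIRequest =
  | { kind: "apply-patch"; path: string; diff: string }
  | { kind: "write-file"; path: string; contents: string }
  | { kind: "fetch-url"; url: string }; // must be pre-approved or mediated

async function handleRequest(req: UIRequest): Promise<unknown> {
  switch (req.kind) {
    case "apply-patch":
    case "write-file":
      if (!(await confirmWithUser(req))) throw new Error("denied by user");
      return performFileEffect(req); // host-side, policy-checked
    case "fetch-url":
      if (!isAllowlisted(req.url)) throw new Error("URL not pre-approved");
      return fetch(req.url).then((r) => r.text());
  }
}

// Hypothetical helpers enforcing consent and policy.
declare function confirmWithUser(req: UIRequest): Promise<boolean>;
declare function performFileEffect(req: UIRequest): Promise<void>;
declare function isAllowlisted(url: string): boolean;
```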
Agents shouldn’t be limited to text just because sandboxes are headless.
The Glass Box is a missing layer: a local, securable runtime where agents can execute code and render interactive artifacts—turning “trust me” into “see it, click it, approve it.”